
**Understanding Statistics in Corpus Linguistics [#cf852098]

-Variables
--categorical
---binary (2 categories)
---multiple (n > 2): (a) nominal vs. (b) ordinal

--quantitative
---interval
---discrete

**Another aspect of variables: [#x6bfd7ee]

-explanatory/predictor/independent variables
-response/outcome/dependent variables

**Univariate vs. Bivariate analyses [#c85829ac]

-univariate
--an examination of a single variable
-bivariate
--an examination of the relationship between two variables

**Concept of "significance" [#u10790cf]
-Difference in proportions
-significance = a difference large enough to be trusted, i.e. unlikely to be due to chance alone

***Null Hypothesis [#lcc1139e]
-There is no real difference; any difference observed is due to chance

***Expected frequencies [#a23f67b4]
-the frequencies we WOULD get if the two proportions are identical. Both probabilities equal 0.5
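As a minimal sketch of the idea above, the expected frequencies can be derived from the row and column totals of a table of observed counts. The counts below are invented for illustration (a hypothetical word counted in two equally sized corpora), and the function name is an assumption of this sketch.

```python
def expected_frequencies(table):
    """Expected cell counts if rows and columns were independent (the null hypothesis)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Each expected count is (row total * column total) / grand total.
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical counts (invented for illustration):
# rows = corpus A, corpus B; columns = target word, all other words
observed = [[30, 970],
            [10, 990]]
expected = expected_frequencies(observed)
print(expected)
```

Because the two hypothetical corpora are the same size, each is expected to contain half (probability 0.5) of the 40 occurrences of the word, i.e. 20 each.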

**Chi-square [#t18709d3]
-sum of the squared differences between observed and expected frequencies, divided by the expected frequency, across all cells
-The probability distribution of the chi-square statistic is known for each number of degrees of freedom [number of groups - 1]
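The calculation can be sketched as follows for the simplest case of two groups (1 degree of freedom). The counts are invented for illustration. The p-value helper is an assumption of this sketch: it uses the fact that, with 1 degree of freedom, the chi-square tail probability equals erfc(sqrt(x/2)), and it does not cover other degrees of freedom.

```python
import math

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(chi2):
    """Tail probability of the chi-square distribution, valid for 1 degree of freedom only."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts: 40 occurrences split 30/10 between two groups,
# where the null hypothesis expects 20/20.
observed = [30, 10]
expected = [20, 20]

chi2 = chi_square(observed, expected)
p = p_value_df1(chi2)
print(chi2, p)
```

With these numbers the statistic is well above 3.84, the usual critical value for p < 0.05 at 1 degree of freedom.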

-Advantages
--easy to understand
--used widely

-Disadvantages
--For small O in 2x2, apply Yates's correction or use Fisher's exact test
--Dunning shows chi-square is not a good test when O are small and N is large.
--Log-likelihood test does basically the same job without these limitations
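To illustrate the first point, here is a rough sketch of Yates's continuity correction on a small 2x2 table: 0.5 is subtracted from each absolute difference |O - E| before squaring. The table is invented for illustration and the function names are assumptions of this sketch. Fisher's exact test is not shown.

```python
def expected_frequencies(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_2x2(table, yates=False):
    """Chi-square for a 2x2 table, optionally with Yates's continuity correction."""
    expected = expected_frequencies(table)
    total = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                # Shrink each difference by 0.5 (never below zero).
                diff = max(diff - 0.5, 0.0)
            total += diff ** 2 / e
    return total

# Hypothetical small-sample table (invented for illustration)
observed = [[3, 7],
            [8, 2]]
plain = chi_square_2x2(observed)
corrected = chi_square_2x2(observed, yates=True)
print(plain, corrected)
```

In this example the uncorrected statistic exceeds 3.84 (p < 0.05 at 1 degree of freedom) while the corrected one does not, so the correction changes the conclusion.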


*Collocation statistics [#n7a538d1]

-Notation borrowed from Stefan Evert
--[[association measures>http://www.collocations.de/]]

-Effect size
--Observed/Expected
---Could be very high effect size with only a single instance

-Evidence floor
--f(node, collocate)

-Mutual Information
--Formal definition: log2(p(n,c)/(p(n)p(c)))
--MI = log2(Observed/Expected)
--MI of 3 = oft-recommended cut-off = observed 8 times greater than expected
--frequency floors = recommended value is 10 (Andrew Hardie)
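A minimal sketch of the MI calculation, assuming the expected co-occurrence frequency is simply f(node) x f(collocate) / N (this ignores window size, which actual tools may factor in). All frequencies are invented for illustration, and the function and variable names are hypothetical.

```python
import math

def mutual_information(f_node, f_collocate, f_pair, corpus_size):
    """MI = log2(Observed / Expected), with Expected = f(node) * f(collocate) / N (assumed)."""
    expected = f_node * f_collocate / corpus_size
    return math.log2(f_pair / expected)

N = 1_000_000  # hypothetical corpus size

# Observed 4 times, expected 0.5 times: observed is 8 times greater than expected.
mi_cutoff_case = mutual_information(1000, 500, 4, N)

# A single co-occurrence of two very rare words yields a huge effect size.
mi_single_instance = mutual_information(2, 2, 1, N)

FREQUENCY_FLOOR = 10  # the recommended value noted above
passes_floor = 1 >= FREQUENCY_FLOOR  # the single instance fails the floor

print(mi_cutoff_case, mi_single_instance, passes_floor)
```

The second case shows why a frequency floor is paired with MI: one co-occurrence of two rare words scores far above the cut-off of 3.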

-Significance testing
--problems
---random samples from two populations

-Chi-square test
--sum of (Obs-Exp)^2 / Exp
-Log-likelihood test
--2 x sum of (Obs * ln(Obs/Exp))
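Both formulas can be sketched on a 2x2 contingency table for a node and a collocate. The table layout (pair / node without collocate / collocate without node / neither) is an assumption of this sketch, the frequencies are invented, and the function names are hypothetical.

```python
import math

def contingency_table(f_node, f_collocate, f_pair, corpus_size):
    """2x2 table of observed frequencies for a node-collocate pair."""
    return [
        [f_pair, f_node - f_pair],
        [f_collocate - f_pair, corpus_size - f_node - f_collocate + f_pair],
    ]

def cells(table):
    """Yield (observed, expected) for every cell of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            yield o, row_totals[i] * col_totals[j] / n

def chi_square(table):
    return sum((o - e) ** 2 / e for o, e in cells(table))

def log_likelihood(table):
    # Cells with O = 0 contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in cells(table) if o > 0)

# Hypothetical frequencies: node 1000, collocate 500, pair 20, corpus 1,000,000
table = contingency_table(1000, 500, 20, 1_000_000)
chi2 = chi_square(table)
ll = log_likelihood(table)
print(chi2, ll)
```

With a small expected frequency for the pair (0.5 here) and a large N, chi-square comes out much larger than the log-likelihood value for the same table.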

-MI and LL
--grammatical patterns & function words --> LL
--lexical words and semantics --> MI

-LL
--biased towards words where there is lots of evidence due to high overall frequency

-MI
--biased towards words where effect size is huge due to low overall frequency

-MI3
--log2(Obs^3/Exp)
--it over-corrects MI: its high-frequency focus is too great.
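The opposite biases of MI and MI3 can be sketched with two invented pairs, one rare and one frequent. Base-2 logarithms are assumed, and the observed/expected values are made up for illustration.

```python
import math

def mi(observed, expected):
    return math.log2(observed / expected)

def mi3(observed, expected):
    # Equivalent to MI + 2 * log2(observed), so frequent pairs get a large boost.
    return math.log2(observed ** 3 / expected)

# Hypothetical pairs: (observed, expected)
rare_pair = (2, 0.01)      # low frequency, huge observed/expected ratio
frequent_pair = (400, 50)  # high frequency, modest observed/expected ratio

print("MI :", mi(*rare_pair), mi(*frequent_pair))
print("MI3:", mi3(*rare_pair), mi3(*frequent_pair))
```

MI ranks the rare pair above the frequent one, and MI3 reverses that order.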

**Problems [#ua00c807]

-Windows and sentence boundaries

-No hope of seeing how the different measures are related to each other
--The formulae often come in multiple versions...

-Overlapping windows
--Martin Amis problem (Hardie)

-Possibilities (speculative proposal by Hardie)
--apply a significance test such as LL
--rank the list by MI or another effect-size measure
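One possible reading of this proposal, offered only as a sketch: keep the candidates that pass an LL significance threshold, then rank the survivors by MI. The threshold of 3.84 (p < 0.05 at 1 degree of freedom), the candidate words, and all frequencies are assumptions made for illustration, not part of the proposal itself.

```python
import math

def log_likelihood(f_node, f_collocate, f_pair, corpus_size):
    observed = [
        [f_pair, f_node - f_pair],
        [f_collocate - f_pair, corpus_size - f_node - f_collocate + f_pair],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / corpus_size
            if o > 0:  # empty cells contribute nothing
                total += o * math.log(o / e)
    return 2 * total

def mutual_information(f_node, f_collocate, f_pair, corpus_size):
    return math.log2(f_pair * corpus_size / (f_node * f_collocate))

N = 1_000_000    # hypothetical corpus size
F_NODE = 1000    # hypothetical node frequency
LL_CRITICAL = 3.84  # assumed threshold: p < 0.05, 1 degree of freedom

# Hypothetical candidates: word -> (f(collocate), f(node, collocate))
candidates = {
    "the": (60000, 90),
    "strong": (500, 20),
    "tea": (2000, 16),
    "rare": (100, 1),
}

# Step 1: keep only candidates whose LL passes the significance threshold.
significant = [
    word for word, (f_c, f_nc) in candidates.items()
    if log_likelihood(F_NODE, f_c, f_nc, N) >= LL_CRITICAL
]

# Step 2: rank the survivors by effect size (MI), highest first.
ranked = sorted(
    significant,
    key=lambda w: mutual_information(F_NODE, candidates[w][0], candidates[w][1], N),
    reverse=True,
)
print(ranked)
```

Here the single-instance candidate has a higher MI than "tea" but is removed by the significance filter before ranking.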


**Multivariate Analysis [#t7720dba]

-Control of variables
-Interaction of multiple predictor variables in their effect on response variables
--Log-linear analysis
--Generalized linear model

