Understanding Statistics in Corpus Linguistics †
 Variables
 categorical
 binary (2 categories)
 multiple (n > 2): (a) nominal vs. (b) ordinal
Another aspect of variables: †
 explanatory/predictor/independent variables
 response/outcome/dependent variables
Univariate vs. Bivariate analyses †
 univariate
 an examination of a single variable
Concept of "significance" †
 Difference in proportions
 significance = a difference large enough to trust, i.e. unlikely to have arisen by chance
Null Hypothesis †
 There is no real difference between the proportions being compared
Expected frequencies †
 the frequencies we WOULD get if the two proportions are identical. Both probabilities equal 0.5
Chi-square †
 sum, across all cells, of the squared difference between observed and expected frequency, divided by the expected frequency
 The probability of the chi-square statistic is known for each number of degrees of freedom [number of groups − 1]
 Advantages
 easy to understand
 used widely
 Disadvantages
 For small observed frequencies (O) in a 2x2 table, apply Yates's correction or use Fisher's exact test
 Dunning shows chi-square is not a good test when O are small and N is large.
 The log-likelihood test does basically the same job without these limitations
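The calculation described above can be sketched in a few lines of Python. This is only an illustration: the 2x2 table is made-up data, the function names are hypothetical, and the expected frequencies are derived from the row and column totals, which is the usual contingency-table reading of "the frequencies we would get if the proportions were identical".

```python
def expected_frequencies(table):
    """Expected cell counts under the null hypothesis: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_frequencies(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical counts: rows = two corpora, columns = feature present / absent.
observed = [[30, 70],
            [50, 50]]

statistic = chi_square(observed)
# A 2x2 table has 1 degree of freedom; 3.84 is the critical value for p < 0.05.
print(round(statistic, 2), statistic > 3.84)
```

No continuity correction is applied here, so for small cell counts the caveats listed under Disadvantages still hold.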
Collocation statistics †
 Notation borrowed from Stefan Evert
 Effect size
 Observed/Expected
 Could be very high effect size with only a single instance
 Mutual Information
 Formal definition: log2(p(n,c)/(p(n)p(c)))
 MI = log2(Observed/Expected)
 MI of 3 = oft-recommended cutoff = observed 8 times greater than expected
 frequency floors = recommended value is 10 (Andrew Hardie)
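A minimal sketch of the MI calculation with the two thresholds mentioned above. Several things here are assumptions rather than part of the notes: the expected frequency is simplified to f(node) × f(collocate) / N with no adjustment for window size, the frequency floor is applied to the observed co-occurrence count, and all function names and counts are invented for illustration.

```python
import math


def expected_cooccurrence(f_node, f_collocate, corpus_size):
    """Simplified expected co-occurrence frequency (assumption: no window-size factor)."""
    return f_node * f_collocate / corpus_size


def mutual_information(observed, expected):
    """MI = log2(Observed / Expected)."""
    return math.log2(observed / expected)


def passes_thresholds(observed, expected, mi_cutoff=3.0, freq_floor=10):
    """Keep a collocate only if it clears both the frequency floor and the MI cutoff."""
    return observed >= freq_floor and mutual_information(observed, expected) >= mi_cutoff


# Hypothetical counts: node occurs 1,000 times in a 1,000,000-token corpus.
e_frequent = expected_cooccurrence(1000, 500, 1_000_000)  # 0.5
e_rare = expected_cooccurrence(1000, 1, 1_000_000)        # 0.001

print(mutual_information(16, e_frequent), passes_thresholds(16, e_frequent))
# A single co-occurrence gives a very high MI but is rejected by the floor.
print(mutual_information(1, e_rare), passes_thresholds(1, e_rare))
```

The second case shows why a floor is needed: the effect size is very high although it rests on a single instance.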
 Significance testing
 problems
 random samples from two populations
 Chi-square test
 Log-likelihood test
 2 × sum of (Obs × log(Obs/Exp)) over all cells
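The log-likelihood formula above can be sketched as follows. The 2x2 contingency table is built from the co-occurrence count, the two word frequencies and the corpus size; the parameter names are hypothetical, the counts are invented, and cells with an observed frequency of zero are taken to contribute nothing to the sum.

```python
import math


def log_likelihood(o11, f_node, f_collocate, corpus_size):
    """LL = 2 * sum over the four cells of Obs * ln(Obs / Exp)."""
    observed = [
        [o11, f_node - o11],
        [f_collocate - o11, corpus_size - f_node - f_collocate + o11],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / corpus_size
            if obs > 0:  # a zero cell contributes 0 to the sum
                total += obs * math.log(obs / exp)
    return 2 * total


# Hypothetical counts in a 1,000-token corpus; both words occur 100 times.
print(log_likelihood(10, 100, 100, 1000))  # co-occurrence exactly as expected
print(log_likelihood(50, 100, 100, 1000))  # far more co-occurrence than expected
```

When observed and expected frequencies coincide in every cell the statistic is zero, and it grows as the observed table departs from the expected one.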
 MI and LL
 grammatical patterns & function words -> LL
 lexical words and semantics -> MI
 LL
 biased towards words where there is lots of evidence due to high overall frequency
 MI
 biased towards words where effect size is huge due to low overall frequency
 MI3
 log(Obs^3/Exp)
 it overcorrects MI: its high-frequency focus is too great.
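To make the opposite biases of MI and MI3 concrete, here is a small comparison on two invented collocates, one rare and one frequent. The counts and names are hypothetical and the logarithm is taken to base 2, consistent with "MI of 3 = 8 times greater than expected".

```python
import math


def mi(observed, expected):
    """MI = log2(Obs / Exp)."""
    return math.log2(observed / expected)


def mi3(observed, expected):
    """MI3 = log2(Obs^3 / Exp): cubing Obs rewards high raw frequency."""
    return math.log2(observed ** 3 / expected)


# Hypothetical (observed, expected) pairs.
rare = (2, 0.01)        # seen twice, 200 times more often than expected
frequent = (1000, 100)  # seen 1,000 times, 10 times more often than expected

print("MI :", round(mi(*rare), 2), round(mi(*frequent), 2))
print("MI3:", round(mi3(*rare), 2), round(mi3(*frequent), 2))
```

MI ranks the rare pair first and MI3 ranks the frequent pair first, which is the shift towards high-frequency items noted above.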
Problems †
 Windows and sentence boundaries
 No hope of seeing how they are related to each other
 The formulae often come in multiple versions...
 Overlapping windows
 Martin Amis problem (Hardie)
 Possibilities (speculative proposal by Hardie)
 apply a significance test such as LL
 rank the list by MI or effect size
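One possible reading of the two points above is a two-step procedure: first discard candidates that fail a significance test, then rank the survivors by effect size. The sketch below follows that reading, which is an assumption about how the steps combine and not a description of any published implementation. The cutoff of 3.84 (p < 0.05 at 1 degree of freedom), the candidate words and all counts are illustrative.

```python
import math


def log_likelihood(o11, f_node, f_collocate, corpus_size):
    """LL over the 2x2 contingency table: 2 * sum of Obs * ln(Obs / Exp)."""
    observed = [
        [o11, f_node - o11],
        [f_collocate - o11, corpus_size - f_node - f_collocate + o11],
    ]
    rows = [sum(row) for row in observed]
    cols = [sum(col) for col in zip(*observed)]
    total = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            if obs > 0:
                total += obs * math.log(obs / (rows[i] * cols[j] / corpus_size))
    return 2 * total


def filter_then_rank(candidates, f_node, corpus_size, ll_cutoff=3.84):
    """Step 1: keep candidates with LL >= cutoff. Step 2: sort them by MI, highest first."""
    kept = []
    for word, (o11, f_collocate) in candidates.items():
        ll = log_likelihood(o11, f_node, f_collocate, corpus_size)
        if ll >= ll_cutoff:
            mi = math.log2(o11 / (f_node * f_collocate / corpus_size))
            kept.append((word, mi, ll))
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: word -> (co-occurrences with node, overall frequency).
candidates = {"alpha": (50, 5000), "beta": (1, 80), "gamma": (30, 500)}
for word, mi, ll in filter_then_rank(candidates, f_node=1000, corpus_size=1_000_000):
    print(word, round(mi, 2), round(ll, 1))
```

In this made-up data "beta" has a higher MI than "alpha" but rests on a single co-occurrence, so the significance filter removes it before ranking.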
Multivariate Analysis †
 Control of variables
 Interaction of multiple predictor variables over response variables
 Log-linear analysis
 Generalized linear model
