Understanding Statistics in Corpus Linguistics †
- Variables
- categorical
- binary (2 categories)
- multiple (n > 2): (a) nominal vs. (b) ordinal
Another aspect of variables: †
- explanatory/predictor/independent variables
- response/outcome/dependent variables
Univariate vs. Bivariate analyses †
- univariate
- an examination of a single variable
Concept of "significance" †
- Difference in proportions
- significance = a difference that is large enough to trust, i.e. unlikely to have arisen by chance
Null Hypothesis †
- There is no particular difference
Expected frequencies †
- the frequencies we WOULD get if the two proportions were identical, i.e. if both probabilities equalled 0.5
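As a minimal sketch of the idea above, the following computes expected frequencies for a single categorical variable under the null hypothesis. The function name `expected_frequencies` and the counts (60 vs. 40) are made up for illustration; equal probabilities are assumed unless others are supplied.

```python
def expected_frequencies(observed, probabilities=None):
    """Expected counts under the null hypothesis for one categorical variable.

    If no probabilities are given, every category is assumed equally likely
    (0.5 each in the binary case).
    """
    total = sum(observed)
    if probabilities is None:
        probabilities = [1 / len(observed)] * len(observed)
    return [total * p for p in probabilities]


# Hypothetical counts: one variant occurs 60 times, the other 40 times.
observed = [60, 40]
expected = expected_frequencies(observed)
print(expected)  # [50.0, 50.0]
```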
Chi-square †
- sum, across all cells, of the squared difference between observed and expected frequency, divided by the expected frequency: sum of ((O - E)^2 / E)
- The probability distribution of the chi-square statistic is known for each number of degrees of freedom [number of groups - 1]
- Advantages
- easy to understand
- used widely
- Disadvantages
- For small observed frequencies (O) in a 2x2 table, apply Yates's correction or use Fisher's exact test
- Dunning shows chi-square is not a good test when the O values are small and N is large.
- Log-likelihood test does basically the same job without these limitations
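The chi-square calculation described above can be sketched for a 2x2 table as follows. This is an illustration under stated assumptions, not a reference implementation: the table counts are invented, the function names are hypothetical, and the p-value shortcut `erfc(sqrt(x / 2))` is valid only for 1 degree of freedom (which is what a 2x2 table has).

```python
import math


def expected_table(table):
    """Expected cell counts from the row and column totals of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table, yates=False):
    """Sum of (O - E)^2 / E over all cells, optionally with Yates's correction."""
    stat = 0.0
    for obs_row, exp_row in zip(table, expected_table(table)):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                # Yates's continuity correction: shrink each |O - E| by 0.5.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / e
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: rows are two corpora, columns are two variants.
table = [[30, 70], [50, 50]]
stat = chi_square(table)               # approximately 8.33
stat_yates = chi_square(table, yates=True)
print(stat, p_value_df1(stat))
print(stat_yates, p_value_df1(stat_yates))
```

The corrected statistic is always somewhat smaller than the uncorrected one, which makes the test more conservative when counts are small.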
Collocation statistics †
- Notation borrowed from Stefan Evert
- Effect size
- Observed/Expected
- Could be very high effect size with only a single instance
- Mutual Information
- Formal definition: log2(p(n,c) / (p(n) p(c)))
- MI = log2(Observed/Expected)
- MI of 3 = oft-recommended cut-off = observed frequency 8 times the expected frequency
- frequency floors = recommended value is 10 (Andrew Hardie)
- Significance testing
- problems
- random samples from two populations
- Chi-square test
- Log-likelihood test
- LL = 2 x sum of (Obs * log(Obs/Exp))
- MI and LL
- grammatical patterns & function words --> LL
- lexical words and semantics --> MI
- LL
- biased towards words where there is lots of evidence due to high overall frequency
- MI
- biased towards words where effect size is huge due to low overall frequency
- MI3
- log2(Obs^3/Exp)
- it over-corrects MI's low-frequency bias: its focus on high-frequency words is too great.
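The measures listed above can be compared side by side in a short sketch. Several things here are assumptions rather than part of the notes: the function names, the invented frequencies, the use of the corpus size as N, base-2 logarithms for MI and MI3 (implied by "MI of 3 = 8 times expected"), and natural logarithms for LL summed over all four cells of the 2x2 table.

```python
import math


def contingency(o11, f_node, f_collocate, n):
    """2x2 table of observed counts built from co-occurrence and marginal frequencies."""
    o12 = f_node - o11                        # node without collocate
    o21 = f_collocate - o11                   # collocate without node
    o22 = n - f_node - f_collocate + o11      # neither
    return [[o11, o12], [o21, o22]]


def expected_table(table):
    """Expected cell counts from the row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def mutual_information(o, e):
    """MI = log2(Observed / Expected)."""
    return math.log2(o / e)


def mi3(o, e):
    """MI3 = log2(Observed^3 / Expected)."""
    return math.log2(o ** 3 / e)


def log_likelihood(table):
    """LL = 2 * sum of Obs * ln(Obs / Exp) over all cells (empty cells contribute 0)."""
    total = 0.0
    for obs_row, exp_row in zip(table, expected_table(table)):
        for o, e in zip(obs_row, exp_row):
            if o > 0:
                total += o * math.log(o / e)
    return 2 * total


# Hypothetical figures: node and collocate each occur 100 times in a
# 10,000-token corpus and co-occur 8 times.
table = contingency(o11=8, f_node=100, f_collocate=100, n=10000)
o11 = table[0][0]
e11 = expected_table(table)[0][0]             # 100 * 100 / 10000 = 1.0

print(o11 / e11)                     # 8.0 (observed/expected effect size)
print(mutual_information(o11, e11))  # 3.0
print(mi3(o11, e11))                 # 9.0
print(log_likelihood(table))         # roughly 20.3

# A single co-occurrence of two words that each occur once: huge effect size
# from almost no evidence, which is why a frequency floor is recommended.
single = mutual_information(1, 1 * 1 / 10000)
print(single)                        # about 13.3
```

With these numbers the pair sits exactly at the MI cut-off of 3, while the single-instance pair scores far higher on MI despite resting on one occurrence.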
Problems †
- Windows and sentence boundaries
- No hope of seeing how they are related to each other
- The formulae often come in multiple versions...
- Overlapping windows
- Martin Amis problem (Hardie)
- Possibilities (speculative proposal by Hardie)
- apply a significance test such as LL
- rank the list by MI or effect size
Multivariate Analysis †
- Control of variables
- Interaction of multiple predictor variables over response variables
- Log-linear analysis
- Generalized linear model