### Understanding Statistics in Corpus Linguistics

• Variables
• categorical
• binary (2 categories)
• multiple (n > 2): (a) nominal vs. (b) ordinal
• quantitative
• interval
• discrete

### Another aspect of variables

• explanatory/predictor/independent variables
• response/outcome/dependent variables

### Univariate vs. Bivariate analyses

• univariate
• an examination of a single variable
• bivariate
• an examination of the relationship between two variables

### Concept of "significance"

• Difference in proportions
• significance = a difference large enough to trust, i.e. unlikely to have arisen by chance

#### Null Hypothesis

• There is no real difference between the two proportions

#### Expected frequencies

• the frequencies we WOULD get if the two proportions were identical: both probabilities equal 0.5
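
The idea of expected frequencies can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name `expected_frequencies` and the counts are hypothetical, and the default of equal probabilities assumes the two samples being compared are the same size.

```python
def expected_frequencies(observed, probabilities=None):
    """Return the counts we would expect under the null hypothesis.

    observed: list of observed counts, one per category.
    probabilities: null-hypothesis probability of each category.
    Defaults to equal probabilities (0.5 each when there are two
    categories), which is an assumption of this sketch.
    """
    total = sum(observed)
    if probabilities is None:
        probabilities = [1 / len(observed)] * len(observed)
    return [total * p for p in probabilities]


# Hypothetical counts: a form occurs 60 times in one sample, 40 in another.
observed = [60, 40]
print(expected_frequencies(observed))  # [50.0, 50.0]
```

With 100 occurrences in total and no real difference between the samples, each would be expected to contain 50.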

### Chi-square

• sum, across all cells, of the squared difference between observed and expected frequency, divided by the expected frequency
• The probability distribution of the chi-square statistic is known for each number of degrees of freedom (number of groups - 1)
• easy to understand
• used widely
• For small O in a 2x2 table, apply Yates's correction or use Fisher's exact test
• Dunning shows that chi-square is not a good test when O is small and N is large.
• Log-likelihood test does basically the same job without these limitations
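
The two statistics above can be compared on the same table with a short sketch. The function names and the example table are hypothetical. Expected frequencies are derived here from the row and column totals, which is the usual contingency-table convention but is an assumption as far as these notes are concerned.

```python
import math


def expected_table(table):
    """Expected cell counts from row and column totals (assumed convention)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table, yates=False):
    """Sum over all cells of (Obs - Exp)^2 / Exp.

    With yates=True, 0.5 is subtracted from each absolute difference
    before squaring (Yates's continuity correction, for 2x2 tables).
    """
    total = 0.0
    for obs_row, exp_row in zip(table, expected_table(table)):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)
            total += diff ** 2 / e
    return total


def log_likelihood(table):
    """2 * sum over all cells of Obs * ln(Obs / Exp); empty cells add 0."""
    total = 0.0
    for obs_row, exp_row in zip(table, expected_table(table)):
        for o, e in zip(obs_row, exp_row):
            if o > 0:
                total += o * math.log(o / e)
    return 2 * total


# Hypothetical 2x2 table: rows are two corpora, columns are two variants.
table = [[60, 40], [40, 60]]
print(chi_square(table))              # 8.0
print(chi_square(table, yates=True))  # slightly smaller after correction
print(log_likelihood(table))          # close to the chi-square value here
```

For a 2x2 table there is 1 degree of freedom, and the conventional critical value at p < 0.05 is 3.84, so all three results for this made-up table would count as significant.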

## Collocation statistics

• Effect size
• Observed/Expected
• Could be very high effect size with only a single instance
• Evidence floor
• f(node, collocate)
• Mutual Information
• Formal definition: log2(p(n,c)/(p(n)p(c)))
• MI = log2(Observed/Expected)
• MI of 3 = oft-recommended cut-off = observed 8 times greater than expected (2^3 = 8)
• frequency floor: recommended value is 10 (Andrew Hardie)
• Significance testing
• problems
• random samples from two populations
• Chi-square test
• sum of (Obs-Exp)^2 / Exp
• Log-likelihood test
• 2 x sum of (Obs * log(Obs/Exp))
• MI and LL
• grammatical patterns & function words --> LL
• lexical words and semantics --> MI
• LL
• biased towards words where there is lots of evidence due to high overall frequency
• MI
• biased towards words where effect size is huge due to low overall frequency
• MI3
• log(Obs^3/Exp)
• it over-corrects MI: its high-frequency focus is too great.
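
The effect-size measures in this list can be sketched as follows. All function names and frequencies are hypothetical. The logarithm is taken to base 2, which is what "MI of 3 = observed 8 times greater than expected" implies. The expected frequency is computed as f(node) * f(collocate) / N, following the formal definition above; this ignores window span, and other versions of the formula exist.

```python
import math


def expected_cooccurrence(f_node, f_collocate, corpus_size):
    """Expected co-occurrence count if node and collocate were independent.

    Assumption of this sketch: Exp = f(node) * f(collocate) / N,
    with no adjustment for the size of the collocation window.
    """
    return f_node * f_collocate / corpus_size


def mutual_information(observed, expected):
    """MI = log2(Observed / Expected)."""
    return math.log2(observed / expected)


def mi3(observed, expected):
    """MI3 = log2(Observed^3 / Expected), which favours frequent pairs."""
    return math.log2(observed ** 3 / expected)


def passes_floor(observed, floor=10):
    """Evidence floor on f(node, collocate); 10 is the value cited above."""
    return observed >= floor


# Hypothetical frequencies in a 1,000,000-token corpus.
exp = expected_cooccurrence(f_node=1000, f_collocate=2000, corpus_size=1_000_000)
print(exp)                          # 2.0
print(mutual_information(16, exp))  # 3.0 (observed is 8 times expected)
print(mi3(16, exp))                 # 11.0

# A single co-occurrence of two rare words still gives a very high MI,
# which is why an evidence floor is applied.
rare_exp = expected_cooccurrence(f_node=10, f_collocate=100, corpus_size=1_000_000)
print(mutual_information(1, rare_exp) > 9, passes_floor(1))  # True False
```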

### Problems

• Windows and sentence boundaries
• No hope of seeing how they are related to each other
• The formulae often come in multiple versions...
• Overlapping windows
• Martin Amis problem (Hardie)
• Possibilities (speculative proposal by Hardie)
• sig test like LL
• rank the list by MI or effect size
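
One way to read the speculative proposal above is as a two-step procedure: keep only candidates that pass a significance test, then rank the survivors by effect size. The sketch below is an interpretation under stated assumptions, not Hardie's own implementation. The function names, the candidate data, the 2x2 contingency formulation of log-likelihood, and the default threshold of 3.84 (p < 0.05 at 1 degree of freedom) are all choices made for illustration.

```python
import math


def log_likelihood_2x2(o11, f_node, f_coll, n):
    """Log-likelihood for one node/collocate pair.

    Assumed 2x2 table: co-occurrences, node without collocate,
    collocate without node, and everything else in a corpus of n tokens.
    """
    table = [[o11, f_node - o11],
             [f_coll - o11, n - f_node - f_coll + o11]]
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = table[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:
                total += o * math.log(o / e)
    return 2 * total


def rank_collocates(candidates, f_node, n, ll_threshold=3.84):
    """Step 1: drop pairs below the LL threshold. Step 2: sort by MI.

    candidates maps each collocate to (co-occurrence count, collocate frequency).
    Returns (collocate, MI, LL) tuples, highest MI first.
    """
    kept = []
    for word, (o11, f_coll) in candidates.items():
        if o11 == 0:
            continue  # MI is undefined without any co-occurrence
        ll = log_likelihood_2x2(o11, f_node, f_coll, n)
        if ll < ll_threshold:
            continue
        mi = math.log2(o11 / (f_node * f_coll / n))
        kept.append((word, mi, ll))
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical data: node frequency 1000 in a 1,000,000-token corpus.
candidates = {"the": (80, 60000), "of": (31, 30000), "strong": (40, 2000)}
for word, mi, ll in rank_collocates(candidates, f_node=1000, n=1_000_000):
    print(f"{word}\tMI={mi:.2f}\tLL={ll:.2f}")
```

In this made-up example "of" co-occurs almost exactly as often as expected and is filtered out, while "strong" outranks "the" because its effect size is larger.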

### Multivariate Analysis

• Control of variables
• Interaction of multiple predictor variables in their effect on response variables
• Log-linear analysis
• Generalized linear model

Last-modified: 2013-07-22 (Mon) 01:34:31