[[FrontPage]]
**Understanding Statistics in Corpus Linguistics [#cf852098]
-Variables
--categorical
---binary (2 categories)
---multiple (n > 2): (a) nominal vs. (b) ordinal
--quantitative
---interval
---discrete
**Another aspect of variables: [#x6bfd7ee]
-explanatory/predictor/independent variables
-response/outcome/dependent variables
**Univariate vs. Bivariate analyses [#c85829ac]
-univariate
--an examination of a single variable
**Concept of "significance" [#u10790cf]
-Difference in proportions
-significance = a difference large enough to be trusted
***Null Hypothesis [#lcc1139e]
-There is no particular difference
***Expected frequencies [#a23f67b4]
-the frequencies we WOULD get if the two proportions were identical. Both probabilities equal 0.5
**Chi-square [#t18709d3]
-sum, across all cells, of the squared difference between observed and expected frequencies divided by the expected frequency
-The probability of the chi-square statistic is known for each number of degrees of freedom [number of groups - 1]
-Advantages
--easy to understand
--used widely
-Disadvantages
--For small O in 2x2 tables, apply Yates's correction or use Fisher's exact test
--Dunning shows chi-square is not a good test when O are small and N is large.
--Log-likelihood test does basically the same job without these limitations
*Collocation statistics [#n7a538d1]
-Notation borrowed from Stefan Evert
--[[association measures>http://collocates.de/AM/]]
--[[association measures>http://www.collocations.de/]]
-Effect size
--Observed/Expected
---Could be very high effect size with only a single instance
-Evidence floor
--f(node, collocate)
-Mutual Information
--Formal definition: log2(p(n,c)/(p(n)p(c)))
--MI = log2(Observed/Expected)
--MI of 3 = oft-recommended cut-off = observed 8 times greater than expected (2^3 = 8)
--frequency floors = recommended value is 10 (Andrew Hardie)
-Significance testing
--problems
---random samples from two populations
-Chi-square test
--sum of (Obs-Exp)^2 / Exp
-Log-likelihood test
--2 x sum of (Obs x log(Obs/Exp))
-MI and LL
--grammatical patterns & function words --> LL
--lexical words and semantics --> MI
-LL
--biased towards words where there is lots of evidence due to high overall frequency
-MI
--biased towards words where effect size is huge due to low overall frequency
-MI3
--log2(Obs^3/Exp)
--it over-corrects MI: its high-frequency focus is too great.
**Problems [#ua00c807]
-Windows and sentence boundaries
-No hope of seeing how they are related to each other
--The formulae often come in multiple versions...
-Overlapping windows
--Martin Amis problem (Hardie)
-Possibilities (speculative proposal by Hardie)
--sig test like LL
--rank the list by MI or effect size
**Multivariate Analysis [#t7720dba]
-Control of variables
-Interaction of multiple predictor variables over response variables
--Log-linear analysis
--Generalized linear model
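The collocation measures listed under "Collocation statistics" (Observed/Expected, MI, MI3, chi-square, log-likelihood) can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation. It assumes a plain 2x2 contingency table built from f(node), f(collocate), their co-occurrence frequency and the corpus size N. It also assumes base-2 logarithms for MI and MI3, natural logarithms for log-likelihood, and no adjustment for window size. All function names and the example counts are hypothetical.

```python
import math


def contingency(o11, f_node, f_coll, n):
    """Return (observed, expected) cell lists in the order O11, O12, O21, O22.

    o11    : co-occurrence frequency f(node, collocate)
    f_node : overall frequency of the node
    f_coll : overall frequency of the collocate
    n      : corpus size (assumed: number of tokens)
    """
    observed = [o11, f_node - o11, f_coll - o11, n - f_node - f_coll + o11]
    rows = (f_node, n - f_node)
    cols = (f_coll, n - f_coll)
    # Expected frequency of each cell under the null hypothesis of independence.
    expected = [r * c / n for r in rows for c in cols]
    return observed, expected


def chi_square(observed, expected):
    """Sum of (Obs - Exp)^2 / Exp across all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def log_likelihood(observed, expected):
    """2 x sum of Obs * ln(Obs / Exp); cells with Obs = 0 contribute nothing."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def mutual_information(o11, e11):
    """MI = log2(Observed / Expected) for the co-occurrence cell."""
    return math.log2(o11 / e11)


def mi3(o11, e11):
    """MI3 = log2(Observed^3 / Expected)."""
    return math.log2(o11 ** 3 / e11)


# Hypothetical counts: node occurs 100 times, collocate 50 times,
# they co-occur 40 times, in a corpus of 1000 tokens.
obs, exp = contingency(40, 100, 50, 1000)
print("O/E :", obs[0] / exp[0])
print("MI  :", mutual_information(obs[0], exp[0]))
print("MI3 :", mi3(obs[0], exp[0]))
print("chi2:", chi_square(obs, exp))
print("LL  :", log_likelihood(obs, exp))
```

With these made-up counts the expected co-occurrence frequency is 5, so the observed frequency is 8 times the expected one and MI comes out at exactly 3, the cut-off mentioned above. A frequency floor (e.g. 10) would be applied to `o11` before any of these scores are trusted.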