Understanding Statistics in Corpus Linguistics

  • Variables
    • categorical
      • binary (2 categories)
      • multiple (n > 2): (a) nominal vs. (b) ordinal
    • quantitative
      • interval
      • discrete

Another aspect of variables:

  • explanatory/predictor/independent variables
  • response/outcome/dependent variables

Univariate vs. Bivariate analyses

  • univariate
    • an examination of a single variable
  • bivariate
    • an examination of the relationship between two variables

Concept of "significance"

  • Difference in proportions
  • significance = a difference large enough to be trusted, i.e., unlikely to have arisen by chance alone

Null Hypothesis

  • There is no real difference between the groups being compared; any observed difference is due to chance

Expected frequencies

  • the frequencies we WOULD get if the two proportions were identical (in the example, both probabilities equal 0.5)
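
As a minimal sketch of the idea above, the following computes expected frequencies for one word in two corpora. The counts, the corpus sizes, and the function name `expected_frequencies` are hypothetical and chosen only for illustration; the two corpora are assumed to be the same size, so each expects a share of 0.5.

```python
def expected_frequencies(observed, corpus_sizes):
    """Expected count per corpus if the word were equally likely in each.

    Each corpus receives a share of the total count proportional to its size.
    """
    total_observed = sum(observed)
    total_size = sum(corpus_sizes)
    return [total_observed * size / total_size for size in corpus_sizes]


# Hypothetical data: 60 hits in corpus A, 40 hits in corpus B,
# both corpora containing one million tokens (so each share is 0.5).
observed = [60, 40]
sizes = [1_000_000, 1_000_000]
print(expected_frequencies(observed, sizes))  # [50.0, 50.0]
```

With corpora of unequal size the shares are no longer 0.5, but the same function applies.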


Chi-square test

  • sum of the squared differences between observed and expected frequencies, each divided by the expected frequency, across all cells
  • The probability distribution of the chi-square statistic is known for each number of degrees of freedom (number of groups - 1)
  • Advantages
    • easy to understand
    • used widely
  • Disadvantages
    • For small O (observed frequencies) in a 2x2 table, apply Yates's correction or use Fisher's exact test
    • Dunning shows chi-square is not a good test when O are small and N is large.
    • Log-likelihood test does basically the same job without these limitations
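
The two statistics can be sketched side by side on a 2x2 table. This is an illustration under stated assumptions, not a reference implementation: the table values and all function names are hypothetical, expected frequencies are derived from the row and column totals, and the p-value helper is valid for one degree of freedom only (the case of a 2x2 table).

```python
import math


def expected_table(table):
    """Expected cell frequencies under the null hypothesis, from the marginals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Sum of (Obs - Exp)^2 / Exp across all cells."""
    exp = expected_table(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, exp)
               for o, e in zip(obs_row, exp_row))


def log_likelihood(table):
    """2 * sum of Obs * ln(Obs / Exp) across all cells (empty cells add 0)."""
    exp = expected_table(table)
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, exp)
                   for o, e in zip(obs_row, exp_row)
                   if o > 0)


def p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical table: rows are corpora A and B,
# columns are "tokens of the word" and "all other tokens".
table = [[60, 999_940],
         [40, 999_960]]

for name, stat in (("chi-square", chi_square(table)),
                   ("log-likelihood", log_likelihood(table))):
    print(f"{name}: {stat:.3f}, p = {p_value_df1(stat):.4f}")
```

Both statistics are compared against the same chi-square distribution, which is why a single p-value helper serves for both here.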

Collocation statistics

  • Effect size
    • Observed/Expected
      • Could be very high effect size with only a single instance
  • Evidence floor
    • f(node, collocate)
  • Mutual Information
    • Formal definition: log2(p(n,c)/(p(n)p(c)))
    • MI = log2(Observed/Expected)
    • MI of 3 = oft-recommended cut-off = observed is 8 times the expected frequency (2^3 = 8)
    • frequency floor: recommended value is 10 (Andrew Hardie)
  • Significance testing
    • problems
      • random samples from two populations
  • Chi-square test
    • sum of (Obs-Exp)^2 / Exp
  • Log-likelihood test
    • 2 x sum of (Obs * ln(Obs/Exp)), using the natural logarithm
  • MI and LL
    • grammatical patterns & function words --> LL
    • lexical words and semantics --> MI
  • LL
    • biased towards words where there is lots of evidence due to high overall frequency
  • MI
    • biased towards words where effect size is huge due to low overall frequency
  • MI3
    • log(Obs^3/Exp)
    • it over-corrects MI: its high-frequency focus is too great.
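
The effect-size measures in this list can be sketched as follows. Everything here is an assumption made for illustration: the function names, the example counts, the base-2 logarithm (which matches "MI of 3 = 8 times expected"), and the expected-frequency formula, which is only one of several versions in use.

```python
import math


def expected_cooccurrence(f_node, f_collocate, corpus_size, window_size=1):
    """One common version of the expected co-occurrence frequency (assumed here)."""
    return f_node * f_collocate * window_size / corpus_size


def mutual_information(observed, expected):
    """MI = log2(Observed / Expected)."""
    return math.log2(observed / expected)


def mi3(observed, expected):
    """MI3 = log2(Observed^3 / Expected); cubing favours frequent pairs."""
    return math.log2(observed ** 3 / expected)


def passes_floor(observed, floor=10):
    """Evidence floor on f(node, collocate); 10 is the value given in these notes."""
    return observed >= floor


# Observed 8 times where 1 co-occurrence is expected: MI reaches the cut-off of 3.
print(mutual_information(8, 1))  # 3.0
print(mi3(8, 1))                 # 9.0

# A single instance can still yield a very high effect size,
# which is why a frequency floor is applied alongside MI.
single = mutual_information(1, 0.001)
print(round(single, 2), passes_floor(1))
```

The last two lines illustrate the low-frequency bias of MI: one co-occurrence of a very rare word scores well above the cut-off but fails the floor.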


  • Windows and sentence boundaries
  • No hope of seeing how the different statistics are related to each other
    • The formulae often come in multiple versions...
  • Overlapping windows
    • Martin Amis problem (Hardie)
  • Possibilities (speculative proposal by Hardie)
    • apply a significance test such as LL
    • rank the list by MI or another effect-size measure
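
One possible reading of the two bullets above is a filter-then-rank procedure: keep only collocates that pass a log-likelihood threshold, then order the survivors by MI. The sketch below follows that reading, which is an interpretation of these notes and not a documented method. The function names, the candidate counts, and the 2x2 table layout are hypothetical; 10.83 is the chi-square critical value for p < 0.001 at one degree of freedom.

```python
import math


def contingency_table(cooccurrences, f_node, f_collocate, corpus_size):
    """2x2 table: node present/absent by collocate present/absent."""
    return [
        [cooccurrences, f_node - cooccurrences],
        [f_collocate - cooccurrences,
         corpus_size - f_node - f_collocate + cooccurrences],
    ]


def log_likelihood(table):
    """2 * sum of Obs * ln(Obs / Exp), with Exp taken from the marginals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # an empty cell contributes nothing
                total += observed * math.log(observed / expected)
    return 2 * total


def rank_collocates(candidates, f_node, corpus_size, ll_threshold=10.83):
    """Drop candidates below the LL threshold, then sort the rest by MI."""
    ranked = []
    for word, (f_collocate, cooccurrences) in candidates.items():
        if cooccurrences == 0:
            continue  # MI is undefined without any co-occurrence
        table = contingency_table(cooccurrences, f_node, f_collocate, corpus_size)
        ll = log_likelihood(table)
        if ll < ll_threshold:
            continue
        expected = f_node * f_collocate / corpus_size
        mi = math.log2(cooccurrences / expected)
        ranked.append((word, mi, ll))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: word -> (corpus frequency, co-occurrences with node)
candidates = {"the": (60_000, 90), "strong": (200, 30), "rare": (5, 1)}
for word, mi, ll in rank_collocates(candidates, f_node=1000, corpus_size=1_000_000):
    print(f"{word}: MI = {mi:.2f}, LL = {ll:.2f}")
```

In this made-up data the single-instance candidate has a high MI but is removed by the significance filter, while the high-frequency function word survives the filter and sinks to the bottom of the MI ranking.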

Multivariate Analysis

  • Control of variables
  • Interaction of multiple predictor variables over response variables
    • Log-linear analysis
    • Generalized linear model

Last-modified: 2013-07-22 (Mon) 01:34:31