**Understanding Statistics in Corpus Linguistics [#cf852098]
-Variables
--categorical
---binary (2 categories)
---multiple (n > 2): (a) nominal vs. (b) ordinal
--quantitative
---interval
---discrete
**Another aspect of variables: [#x6bfd7ee]
-explanatory/predictor/independent variables
-response/outcome/dependent variables
**Univariate vs. Bivariate analyses [#c85829ac]
-univariate
--an examination of a single variable
**Concept of "significance" [#u10790cf]
-Difference in proportions
-significance = a difference large enough to trust, i.e. unlikely to have arisen by chance alone
***Null Hypothesis [#lcc1139e]
-There is no real difference between the proportions being compared; any observed difference is due to chance
***Expected frequencies [#a23f67b4]
-the frequencies we WOULD get if the two proportions were identical, i.e. both probabilities equal 0.5
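A minimal sketch of how expected frequencies follow from the null hypothesis. The function name `expected_frequencies` and the counts are illustrative assumptions, not part of the original notes; by default every category is assumed equally likely (0.5 each for a binary variable).

```python
def expected_frequencies(observed, probabilities=None):
    """Return the counts we would expect if the null hypothesis were true.

    observed: list of observed counts, one per category.
    probabilities: null-hypothesis probability of each category;
    defaults to equal probabilities (0.5 each for two categories).
    """
    total = sum(observed)
    k = len(observed)
    if probabilities is None:
        probabilities = [1.0 / k] * k
    return [total * p for p in probabilities]


# Made-up counts for a binary variable observed 100 times.
observed = [60, 40]
expected = expected_frequencies(observed)
print(expected)
```

The observed counts only contribute their total here; the split between categories comes entirely from the null-hypothesis probabilities.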
**Chi-square [#t18709d3]
-sum, across all cells, of the squared difference between observed and expected frequency, each divided by the expected frequency
-The probability distribution of the chi-square statistic is known for each number of degrees of freedom (number of groups - 1)
-Advantages
--easy to understand
--used widely
-Disadvantages
--For small O in a 2x2 table, apply Yates's correction or use Fisher's exact test
--Dunning shows chi-square is not a good test when the observed frequencies (O) are small and N is large.
--Log-likelihood test does basically the same job without these limitations
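The chi-square computation described above can be sketched as follows. This is a simplified illustration with made-up counts: the function names are hypothetical, and the p-value helper is only valid for 1 degree of freedom (where the chi-square tail probability reduces to the complementary error function). More degrees of freedom would need a statistics library.

```python
import math


def chi_square(observed, expected):
    """Sum over all cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df1(statistic):
    """Tail probability of a chi-square statistic, valid ONLY for df = 1."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Made-up counts: two groups, equal expected frequencies under the null.
observed = [60, 40]
expected = [50.0, 50.0]

stat = chi_square(observed, expected)
df = len(observed) - 1  # number of groups - 1
p = p_value_df1(stat)
print(stat, df, p)
```

With these counts the statistic is 4.0, slightly above 3.84, the usual critical value for p < 0.05 at 1 degree of freedom.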
*Collocation statistics [#n7a538d1]
-Notation borrowed from Stefan Evert
--[[association measures>http://www.collocations.de/]]
-Effect size
--Observed/Expected
---Could be very high effect size with only a single instance
-Evidence floor
--f(node, collocate)
-Mutual Information
--Formal definition: log2(p(n,c) / (p(n) p(c)))
--MI = log2(Observed/Expected)
--MI of 3 = oft-recommended cut-off = observed frequency 8 times the expected frequency (2^3 = 8)
--frequency floors = recommended value is 10 (Andrew Hardie)
-Significance testing
--problems
---random samples from two populations
-Chi-square test
--sum of (Obs-Exp)^2 / Exp
-Log-likelihood test
--2 x sum of (Obs x log(Obs/Exp))
-MI and LL
--grammatical patterns & function words --> LL
--lexical words and semantics --> MI
-LL
--biased towards words where there is lots of evidence due to high overall frequency
-MI
--biased towards words where effect size is huge due to low overall frequency
-MI3
--log(Obs^3/Exp)
--it over-corrects MI's low-frequency bias: its focus on high-frequency words is too strong.
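The measures listed above can be compared side by side in a short sketch. Several things here are assumptions rather than content of the notes: the function names, the made-up frequencies, the use of a full 2x2 contingency table for log-likelihood, base-2 logarithms for MI and MI3, and an expected frequency computed as f(node) x f(collocate) / N with the window size ignored.

```python
import math


def contingency(o11, f_node, f_coll, n):
    """Observed and expected 2x2 tables, flattened to four cells.

    o11: co-occurrences of node and collocate; f_node, f_coll: their
    overall frequencies; n: corpus size. Window size is ignored
    (a simplifying assumption).
    """
    observed = [o11, f_node - o11, f_coll - o11, n - f_node - f_coll + o11]
    rows = [f_node, n - f_node]
    cols = [f_coll, n - f_coll]
    expected = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]
    return observed, expected


def mutual_information(o, e):
    """MI = log2(Obs / Exp) for the node+collocate cell."""
    return math.log2(o / e)


def mi3(o, e):
    """MI3 = log2(Obs^3 / Exp) for the node+collocate cell."""
    return math.log2(o ** 3 / e)


def log_likelihood(observed, expected):
    """LL = 2 x sum of Obs x ln(Obs / Exp); empty cells contribute 0."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


# Made-up frequencies: the pair co-occurs 8 times in a 10,000-word corpus.
observed, expected = contingency(o11=8, f_node=100, f_coll=100, n=10000)
effect_size = observed[0] / expected[0]  # Observed / Expected
mi = mutual_information(observed[0], expected[0])
mi_cubed = mi3(observed[0], expected[0])
ll = log_likelihood(observed, expected)
print(effect_size, mi, mi_cubed, ll)
```

Here the pair is observed 8 times against an expected frequency of 1, so MI comes out at exactly the cut-off of 3.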
**Problems [#ua00c807]
-Windows and sentence boundaries
-Little hope of seeing how the different measures relate to each other
--The formulae often come in multiple versions...
-Overlapping windows
--Martin Amis problem (Hardie)
-Possibilities (speculative proposal by Hardie)
--sig test like LL
--rank the list by MI or effect size
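One possible reading of the two-step proposal above, as a hedged sketch: first keep only collocates that pass a log-likelihood test, then rank the survivors by MI. The function names, the candidate data, the 3.84 threshold (the p < 0.05 critical value at 1 degree of freedom), and the optional frequency floor are all illustrative assumptions, not details given in the notes.

```python
import math


def tables(o11, f_node, f_coll, n):
    """Observed and expected 2x2 tables (window size ignored)."""
    observed = [o11, f_node - o11, f_coll - o11, n - f_node - f_coll + o11]
    rows = [f_node, n - f_node]
    cols = [f_coll, n - f_coll]
    expected = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]
    return observed, expected


def log_likelihood(observed, expected):
    """LL = 2 x sum of Obs x ln(Obs / Exp); empty cells contribute 0."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


def rank_collocates(candidates, f_node, n, ll_threshold=3.84, min_freq=1):
    """Filter by frequency floor and LL, then rank by MI (highest first).

    candidates: list of (word, co-occurrence count, overall frequency).
    Returns a list of (word, MI, LL) tuples.
    """
    kept = []
    for word, o11, f_coll in candidates:
        if o11 < max(min_freq, 1):
            continue  # below the evidence floor
        observed, expected = tables(o11, f_node, f_coll, n)
        ll = log_likelihood(observed, expected)
        if ll < ll_threshold:
            continue  # not significant
        mi = math.log2(observed[0] / expected[0])
        kept.append((word, mi, ll))
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up candidates for a node word with frequency 100 in 10,000 tokens.
candidates = [("tea", 8, 100), ("the", 10, 900), ("rare", 1, 1)]
ranked = rank_collocates(candidates, f_node=100, n=10000, min_freq=5)
for word, mi, ll in ranked:
    print(word, round(mi, 2), round(ll, 2))
```

The frequency floor is kept as a separate parameter because a single co-occurrence with a very rare word can still pass the significance filter and would otherwise top the MI ranking.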
**Multivariate Analysis [#t7720dba]
-Control of variables
-Interaction of multiple predictor variables in their effect on response variables
--Log-linear analysis
--Generalized linear model