UnderstandingStatistics - Yukio Tono's Course Homepage

Understanding Statistics in Corpus Linguistics

Variables
- categorical
  - binary (2 categories)
  - multiple (n > 2): (a) nominal vs. (b) ordinal
- quantitative
  - interval
  - discrete

Another aspect of variables:

explanatory/predictor/independent variables
response/outcome/dependent variables

Univariate vs. Bivariate analyses

univariate
- an examination of a single variable

bivariate
- an examination of the relationship between two variables

Concept of "significance"

Difference in proportions
significance = a difference that is large enough to be trusted, i.e. unlikely to have arisen by chance

Null Hypothesis

There is no real difference between the proportions being compared

Expected frequencies

the frequencies we WOULD get if the two proportions were identical, e.g. in a two-way split where both probabilities equal 0.5
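
As a minimal sketch of this idea, the Python below derives expected frequencies from the row and column totals of a table. The function name `expected_frequencies` and the counts are made up for illustration; they are not taken from the lecture. Because the two hypothetical corpora are the same size, each is expected to receive half (0.5) of the 40 occurrences.

```python
def expected_frequencies(table):
    """Expected cell counts if there were no difference between the rows.

    table: list of rows, each a list of observed counts.
    E[i][j] = row_total[i] * column_total[j] / N
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical counts: [occurrences of a word, all other words] in two corpora
observed = [[30, 970],
            [10, 990]]
expected = expected_frequencies(observed)
print(expected)
```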

Chi-square

sum, across all cells, of the squared difference between observed and expected frequency divided by the expected frequency
The probability distribution of the chi-square statistic is known for each number of degrees of freedom (number of groups - 1; for a contingency table, (rows - 1) x (columns - 1))
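
The definition above can be sketched in a few lines of Python. The counts are hypothetical, and the expected values are hard-coded here (they are what the marginal totals of this table would give); `chi_square` is an illustrative name, not a library function.

```python
def chi_square(observed, expected):
    """Pearson chi-square: sum over all cells of (O - E)^2 / E."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2x2 table and its expected frequencies
observed = [[30, 970], [10, 990]]
expected = [[20.0, 980.0], [20.0, 980.0]]

stat = chi_square(observed, expected)
# Degrees of freedom for a contingency table: (rows - 1) x (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(stat, 2), df)
```

With 1 degree of freedom the critical value at p < 0.05 is 3.84, so a statistic above that would lead us to reject the null hypothesis for this made-up table.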

Advantages
- easy to understand
- used widely

Disadvantages
- For small O in a 2x2 table, apply Yates's correction or use Fisher's exact test
- Dunning showed that chi-square is not a good test when O is small and N is large
- The log-likelihood test does basically the same job without these limitations
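
For the small-count case, the two remedies named above might be sketched as follows, using only the standard library. This is an illustration under assumptions: the function names are hypothetical, the table is invented, and the two-sided Fisher p-value uses one common convention (summing the probabilities of all tables no more likely than the observed one).

```python
from math import comb


def yates_chi_square(a, b, c, d):
    """Chi-square for the 2x2 table [[a, b], [c, d]] with Yates's correction."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    # Shrink each |O - E| by 0.5 before squaring
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical small table
corrected = yates_chi_square(3, 1, 1, 3)
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(corrected, round(p_value, 4))
```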

Collocation statistics

Notation borrowed from Stefan Evert
- association measures

Effect size
- Observed/Expected
  - Could be very high effect size with only a single instance

Evidence floor
- f(node, collocate)

Mutual Information
- Formal definition: log2(p(n,c)/(p(n)p(c)))
- MI = log2(Observed/Expected)
- MI of 3 = oft-recommended cut-off = observed is 8 times the expected frequency (2^3 = 8)
- frequency floor: recommended value is 10 (Andrew Hardie)
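
A rough sketch of MI with a frequency floor, under simplifying assumptions: the expected co-occurrence frequency is taken as f(node) x f(collocate) / N with no adjustment for window size, the floor is applied to the co-occurrence frequency f(node, collocate), and all counts and names are invented for illustration.

```python
from math import log2

FREQUENCY_FLOOR = 10  # recommended value mentioned above


def mutual_information(f_node, f_collocate, f_pair, corpus_size):
    """MI = log2(Observed / Expected), ignoring window size (assumption)."""
    expected = f_node * f_collocate / corpus_size
    return log2(f_pair / expected)


def passes_floor(f_pair, floor=FREQUENCY_FLOOR):
    """True if the pair co-occurs often enough to count as evidence."""
    return f_pair >= floor


# Observed 4 vs. expected 0.5: eight times the expectation, so MI is 3
mi_cutoff_case = mutual_information(1000, 500, 4, 1_000_000)

# A single co-occurrence with a very rare collocate: large MI, little evidence
mi_rare_case = mutual_information(1000, 1, 1, 1_000_000)

print(mi_cutoff_case, passes_floor(4))
print(round(mi_rare_case, 2), passes_floor(1))
```

The second case shows why a floor is needed: one instance is enough to produce a very high effect size.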

Significance testing
- problems
  - random samples from two populations

Chi-square test
- sum of (Obs-Exp)^2 / Exp
Log-likelihood test
- 2 x sum of (Obs * log(Obs/Exp)), using the natural log
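
The log-likelihood formula can be sketched in the same style as chi-square. The table is hypothetical and `log_likelihood` is an illustrative name; cells with an observed count of 0 are treated as contributing nothing, which is the usual convention.

```python
from math import log


def log_likelihood(observed, expected):
    """LL (G2) = 2 x sum over all cells of O * ln(O / E)."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            if o > 0:  # a cell with O = 0 contributes 0
                total += o * log(o / e)
    return 2 * total


# Hypothetical 2x2 table and its expected frequencies
observed = [[30, 970], [10, 990]]
expected = [[20.0, 980.0], [20.0, 980.0]]

g2 = log_likelihood(observed, expected)
print(round(g2, 2))
```

Like chi-square, the result is compared against the chi-square distribution with the appropriate degrees of freedom (3.84 for p < 0.05 at 1 degree of freedom).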

MI and LL
- grammatical patterns & function words --> LL
- lexical words and semantics --> MI

LL
- biased towards words where there is lots of evidence due to high overall frequency

MI
- biased towards words where effect size is huge due to low overall frequency

MI3
- log2(Obs^3/Exp)
- it over-corrects MI's low-frequency bias: its focus on high-frequency words is too strong.
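
To illustrate the opposite biases, the sketch below scores two invented pairs with both measures: a rare pair seen once and a frequent pair seen 1,000 times. The observed and expected values are hypothetical. Since Obs^3/Exp = (Obs/Exp) x Obs^2, MI3 equals MI plus 2 x log2(Obs), which is where the extra weight on frequency comes from.

```python
from math import log2


def mi(obs, exp):
    """Mutual information: log2(Obs / Exp)."""
    return log2(obs / exp)


def mi3(obs, exp):
    """MI3: log2(Obs^3 / Exp)."""
    return log2(obs ** 3 / exp)


# Hypothetical (observed, expected) pairs
rare = (1, 0.001)       # one co-occurrence, tiny expectation
frequent = (1000, 100)  # many co-occurrences, ten times the expectation

for name, (obs, exp) in (("rare", rare), ("frequent", frequent)):
    print(name, round(mi(obs, exp), 2), round(mi3(obs, exp), 2))
```

MI ranks the rare pair far above the frequent one, while MI3 reverses the order by a wide margin.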

Problems

Windows and sentence boundaries

No hope of seeing how the association measures are related to each other
- The formulae often come in multiple versions...

Overlapping windows
- Martin Amis problem (Hardie)

Possibilities (speculative proposal by Hardie)
- significance test such as LL
- rank the list by MI or another effect-size measure
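
One possible reading of these two points is a two-step procedure: use the significance test as a filter, then rank what survives by effect size. The sketch below assumes that reading; the candidate records, their scores, and the function name are all hypothetical, and the threshold 3.84 is the chi-square critical value for p < 0.05 at 1 degree of freedom.

```python
LL_CRITICAL_P05_DF1 = 3.84


def filter_then_rank(candidates, ll_threshold=LL_CRITICAL_P05_DF1):
    """Keep candidates whose LL reaches the threshold, then sort by MI."""
    significant = [c for c in candidates if c["ll"] >= ll_threshold]
    return sorted(significant, key=lambda c: c["mi"], reverse=True)


# Hypothetical collocate candidates with precomputed scores
candidates = [
    {"collocate": "alpha", "ll": 120.5, "mi": 2.1},  # frequent, modest effect
    {"collocate": "beta", "ll": 1.2, "mi": 9.8},     # huge effect, no evidence
    {"collocate": "gamma", "ll": 15.0, "mi": 6.4},   # significant, strong effect
]

ranked = [c["collocate"] for c in filter_then_rank(candidates)]
print(ranked)
```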

Multivariate Analysis

Control of variables
Interaction of multiple predictor variables in their effect on response variables
- Log-linear analysis
- Generalized linear model

Last-modified: 2013-07-22 (Mon) 01:34:31