Elex2015Summary - Yukio Tono's Course Homepage

SkE workshop †

Herstmonceux Castle, East Sussex
- 10 August, 2015

Sketch engine data †

August 2015
- 400 corpora for 82 languages
- 100+ corpora (more than 100 million tokens)
- 30+ corpora (more than 1 billion tokens)
- 60+ languages with POS-tagged corpora
- 42 languages with word sketches
- 26 languages with integrated tagger
- parallel corpora

Sketch engine marketing †

ambassadors and advocates in the SkE community
workshop

SkE future †

President: Adam's wife
MJ buys the company: Czech company will own SkE
no immediate changes to any customers
- at least a couple of years
  - in 5 years, no employees in UK now; all the members in Brno

Research agenda †

parallel and distributed processing of very large text corpora
building very large text corpora from the web
corpus heterogeneity and homogeneity
- what's in the corpus?
corpus evaluation
- lexicographic tasks to compare different corpora
terminology extraction
- more and more users; cheap opportunities; supporting tools for translators;
corpora and language teaching
- a book by James Thomas
language change over time

Bilingual terminology extraction by Vit Baisa †

monolingual terminology extraction is supported for 14 languages
what is a term?

unithood: grammatically defined (e.g. noun phrases) †

formalism
- 2:NN/JJ/VVG + 1: NN(head) -- English
  - Depending on the language, more complex rules are needed, taking into account case, gender, and number agreement
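
The English pattern above (a modifier tagged NN, JJ, or VVG followed by a head noun tagged NN) can be illustrated with a minimal sketch. It assumes the input is already a list of (word, tag) pairs; the function name and the sample sentence are made up for illustration, and this is not how Sketch Engine itself implements term grammars.

```python
import re

# Assumed tagset: the three modifier tags and the head tag named in the notes.
MODIFIER = re.compile(r"^(NN|JJ|VVG)$")
HEAD = re.compile(r"^NN$")

def term_candidates(tagged):
    """Return (modifier, head) word pairs from a list of (word, tag) tuples."""
    pairs = []
    # Walk over adjacent token pairs: token 2 is the modifier, token 1 the head.
    for (w2, t2), (w1, t1) in zip(tagged, tagged[1:]):
        if MODIFIER.match(t2) and HEAD.match(t1):
            pairs.append((w2, w1))
    return pairs

sentence = [("the", "DT"), ("parallel", "JJ"), ("corpus", "NN"),
            ("supports", "VVZ"), ("terminology", "NN"), ("extraction", "NN")]
print(term_candidates(sentence))
```

For languages with agreement, the same loop would also have to compare case, gender, and number attributes of the two tokens.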

Statistics
- "simple maths" with parameter N: score = (f(focus) + N) / (f(ref) + N), where f is relative frequency
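
A small worked sketch of the score, assuming relative frequency is expressed per million tokens (the notes only say "relative frequency"). The function names and the example figures are illustrative, not taken from the talk.

```python
def relative_freq(count, corpus_size, per=1_000_000):
    """Frequency normalised to `per` tokens (per million by default; an assumption)."""
    return count * per / corpus_size

def simple_maths_score(f_focus, f_ref, n=1.0):
    """(f_focus + N) / (f_ref + N); N also keeps the denominator non-zero."""
    return (f_focus + n) / (f_ref + n)

# A rare candidate (absent from the reference corpus) and a common one.
rare = (2.0, 0.0)
common = (500.0, 200.0)

for n in (1.0, 100.0):
    print(n, simple_maths_score(*rare, n), simple_maths_score(*common, n))
```

With N = 1 the rare item scores higher; with N = 100 the common item does, which is why N is useful as a knob for shifting the term list toward rarer or more frequent candidates.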

Output:
- Download TBX (exchange format for terminology)/CSV

Fine tuning
- stoplists
- minimum freq
- minimum score
- minimum character length
- alphanumeric character only
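
The fine-tuning options listed above amount to a chain of filters over scored candidates. A minimal sketch follows, assuming candidates arrive as (term, frequency, score) tuples; the parameter names are hypothetical and do not correspond to Sketch Engine's actual settings or API.

```python
def filter_terms(candidates, stoplist=frozenset(), min_freq=1,
                 min_score=0.0, min_length=1, alnum_only=False):
    """Keep terms that pass every enabled filter, in their original order."""
    kept = []
    for term, freq, score in candidates:
        if term.lower() in stoplist:
            continue
        if freq < min_freq or score < min_score:
            continue
        if len(term) < min_length:
            continue
        # Spaces between the words of a multi-word term are allowed.
        if alnum_only and not term.replace(" ", "").isalnum():
            continue
        kept.append(term)
    return kept

candidates = [("word sketch", 40, 12.5), ("the", 900, 1.1),
              ("n-gram", 15, 8.0), ("ab", 30, 9.0), ("term base", 2, 20.0)]
print(filter_terms(candidates, stoplist={"the"}, min_freq=5,
                   min_score=2.0, min_length=3, alnum_only=True))
```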

Multilingual terminology extraction

Future work:
- API: sketch engine available
- plugins for SDL, Kilgray products

Parallel corpora by Jan Michelfeit (Lexical Computing Ltd) †

Features
- sentence alignment only
- mapping file showing the alignment of sentences

Preloaded corpora
- Europarl7
- DGT (4M English sentences, translation memory)
- OPUS2 (130M English sentences)
  - various documents (EU, UN, movie, Tatoeba, etc.)

TMX
- a good input format for 1:1 bilingual data
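
To show why TMX suits 1:1 bilingual data, here is a minimal sketch that reads aligned segment pairs from a TMX string with the Python standard library. The sample document is a stripped-down illustration (a real TMX header carries more attributes), and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The castle is old.</seg></tuv>
      <tuv xml:lang="de"><seg>Das Schloss ist alt.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def read_pairs(tmx_text, src="en", tgt="de"):
    """Return (source, target) segment pairs, one per translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Map each language code in this unit to its segment text.
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs

print(read_pairs(SAMPLE_TMX))
```

Each translation unit already holds one segment per language, so no separate alignment mapping file is needed for this kind of input.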

Underused functionality †

Global attributes in CQL †

1.number = 2.number: tokens #1 and #2 have the same number (singular or plural)
1.id = 2.ref ??

m4 macros
- A noun --> define(`noun', `[tag="N.*"]')
- An adjective --> define(`adj', `[tag="JJ.?"]') etc.
  - make the agreement rules simpler once definition is made

wordlist functions
- using word attributes

WebBootCaT
- get seed words from Wikipedia

New functions in CQL †

adopted by Manatee
Query language
- CQL: Oliver Christ
Useful options:
- general NOT operator; NOT within;
  - !within
  - <doc year="2010"> ! containing [word="castle"]
  - general ! <query>

Global conditions
- 1:[lemma="alt"] 2:[lemma="Schloss"] & 1.case = 2.case # agreement in case
- 1:[lemma="alt"] 2:[tag="NOUN"] & 1.head = 2.id # dependency relation

New operators
- [ws("test-n", "modifier", ".*")]
  - modifiers of test(noun)

[swap(1, ws("test-n", "modifier", ".*")) & tag=".*DAT.*"]
- with collocates being in dative
- a command for changing the position of the node

[lempos~10"test-n"]
- search for the top 10 synonyms of "test-n" (synonym search)

REGEX optimization
- simple OR optimization
- prefix optimization
- n-gram prefetching
  - pre-index all character unigrams, bigrams, and trigrams
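
The idea behind n-gram prefetching can be sketched generically: index every character trigram of the lexicon once, use the trigrams of a literal part of the query to narrow the candidates, and run the full regular expression only on those. This is an illustration of the general technique, not Manatee's implementation; the function names, and passing the literal by hand instead of extracting it from the pattern, are simplifying assumptions.

```python
import re
from collections import defaultdict

def build_trigram_index(lexicon):
    """Map each character trigram to the set of words containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for i in range(len(word) - 2):
            index[word[i:i + 3]].add(word)
    return index

def search(pattern, literal, lexicon, index):
    """Match `pattern` against the lexicon; `literal` is a substring every match must contain."""
    grams = [literal[i:i + 3] for i in range(len(literal) - 2)]
    if grams:
        # Only words containing every trigram of the literal can match.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        candidates = set(lexicon)  # literal too short to help: full scan
    rx = re.compile(pattern)
    return sorted(w for w in candidates if rx.fullmatch(w))

lexicon = ["castle", "castles", "cast", "hassle", "newcastle", "test"]
index = build_trigram_index(lexicon)
print(search(r".*castle.*", "castle", lexicon, index))
```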

Ted Briscoe Keynote †

Applications †

CLC †

40 M words with 80 error types coded
24 different exam types
CEFR levels benchmarked to exam grades

Error-coded script †

Transcribe
Error coding
- <M|our having> M:missing

Parse it with RASP †

Parsing errors: use these as input for machine learning
dependency grammar: representation of texts

Overall assessment of proficiency †

textual features which are proxies for writing competence
predict a grade

Machine learning †

Training : CLC
- Classification: number of labels (pass vs. fail)
- Regression: try to predict real number
- Ranking: A>B>C (e.g. rank the web pages)

Features
- script length
- errors
- n-grams
- parse trees

Rank Preference --> prediction

Pairwise Ranking SVM: †

Joachims 2002
Learn an optimal ranking function that explicitly models the grade relationships
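
The pairwise idea can be sketched as follows: every pair of scripts with different grades becomes one difference vector, and a linear weight vector is learned so that the higher-graded script always scores higher. For brevity this sketch trains with simple perceptron updates rather than an SVM, and the two features (script length, error rate) and all figures are invented toy data.

```python
from itertools import combinations

def pairwise_differences(features, grades):
    """One difference vector (higher-graded minus lower-graded) per ordered pair."""
    diffs = []
    for i, j in combinations(range(len(features)), 2):
        if grades[i] == grades[j]:
            continue  # ties carry no ordering information
        hi, lo = (i, j) if grades[i] > grades[j] else (j, i)
        diffs.append([a - b for a, b in zip(features[hi], features[lo])])
    return diffs

def train_ranker(diffs, epochs=50, lr=0.1):
    """Perceptron on difference vectors: aim for w . d > 0 for every d."""
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        for d in diffs:
            if sum(wi * di for wi, di in zip(w, d)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, d)]
    return w

def score(w, x):
    """Linear ranking score of one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy scripts: [length in hundreds of words, error rate], with human grades.
features = [[3.0, 0.05], [2.5, 0.10], [1.0, 0.30], [0.8, 0.40]]
grades = [5, 4, 2, 1]

w = train_ranker(pairwise_differences(features, grades))
ranked = sorted(range(len(features)), key=lambda i: score(w, features[i]), reverse=True)
print(ranked)
```

Training on differences means the model only has to get the relative order of scripts right, which is what "explicitly models the grade relationships" refers to.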

Features †

word sequences
POS sequences
Grammatical constructions
Other features
- error rate estimate (against n-gram corpora)
- readability score

Correlation between human grades and the system-predicted grades
- Upper bound = 0.796 (word+POS+readability+constructions+error rate)
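
The notes do not say which correlation coefficient was reported. As an illustration only, the sketch below computes Pearson's r between human and system grades; the choice of Pearson and the grade lists are assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [1, 2, 3, 4, 5]    # made-up human grades
system = [2, 1, 4, 3, 5]   # made-up system predictions
print(round(pearson(human, system), 3))
```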

Highly ranked features †

100,000 features
- see images

Error detection and correction suggestions †

ensure high accuracy and reasonable coverage

corpus-derived rules
- error rules from CLC
- detect incorrect word sequences
- At least 90% incorrect occurrences
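
One way to read the 90% criterion is sketched below: count how often each word sequence occurs in the error-coded corpus and how often it is marked as an error, then keep only sequences that are wrong in at least 90% of their occurrences. The minimum-count cutoff and the data layout are assumptions added for the example.

```python
from collections import Counter

def derive_error_rules(observations, min_error_ratio=0.9, min_count=5):
    """observations: iterable of (ngram, is_error) pairs from an error-coded corpus."""
    total, wrong = Counter(), Counter()
    for ngram, is_error in observations:
        total[ngram] += 1
        if is_error:
            wrong[ngram] += 1
    # Keep sequences seen often enough and marked wrong almost every time.
    return sorted(
        ng for ng, n in total.items()
        if n >= min_count and wrong[ng] / n >= min_error_ratio
    )

observations = (
    [(("discuss", "about"), True)] * 9
    + [(("discuss", "about"), False)]
    + [(("look", "at"), False)] * 10
    + [(("many", "informations"), True)] * 2
)
print(derive_error_rules(observations))
```

A high threshold like this keeps precision high at the cost of recall, which matches the precision and recall figures recorded further down.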

online dictionary-derived rules

Precision:90% Recall:10% (effect on learning rate)

Sentence evaluation
- Limited linguistic evidence that can be extracted automatically
- difficulty in acquiring

Incremental Semantic Analysis (ISA) †

Distributional semantics
- fully incremental variation of Random Indexing
- Similarity between words measured by context vectors
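
As background for the bullets above, here is a minimal sketch of plain Random Indexing: each word gets a fixed sparse random index vector, and a word's context vector is the sum of the index vectors of its neighbours. The incremental variation mentioned in the notes is not reproduced here; the dimensions, window size, and toy sentences are assumptions.

```python
import math
import random
from collections import defaultdict

def make_index_vector(rng, dim, nonzero):
    """Sparse ternary vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec

def random_indexing(sentences, dim=64, nonzero=4, window=1, seed=0):
    """Return {word: context vector} built in one pass over the sentences."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0] * dim)
    for sent in sentences:
        for w in sent:
            if w not in index:
                index[w] = make_index_vector(rng, dim, nonzero)
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Add the neighbour's index vector to this word's context vector.
                    iv, cv = index[sent[j]], context[w]
                    for k in range(dim):
                        cv[k] += iv[k]
    return dict(context)

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

sentences = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"],
             ["a", "bank", "lends", "money"]]
vectors = random_indexing(sentences)
print(cosine(vectors["cat"], vectors["dog"]), cosine(vectors["cat"], vectors["bank"]))
```

In the toy data "cat" and "dog" occur in identical contexts, so their context vectors coincide, while "bank" shares no neighbours with either.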

Future work †

prompt relevance
task achievement
discourse organization feedback
L1-specific feedback
content word error detection and correction suggestions
link to online dictionary/thesaurus
link to courseware multiple choice exercises

Last-modified: 2015-08-12 (Wed) 05:17:55