Elex2015Summary
[[FrontPage]]
*SkE workshop [#sa570497]
-Herstmonceux Castle, East Sussex
--10 August, 2015
**Sketch engine data [#z4e3d6e3]
-August 2015
--400 corpora for 82 languages
--100+ corpora (more than 100 million tokens)
--30+ corpora (more than 1 billion tokens)
--60+ languages with POS tagged corpus
--42 languages with word sketches
--26 languages with integrated tagger
--parallel corpora
**Sketch engine marketing [#y21f123b]
-ambassadors and advocates in the SkE community
-workshop
**SkE future [#i3c2152d]
-President: Adam's wife
-MJ buys the company: Czech company will own SkE
-no immediate changes to any customers
--at least a couple of years
---in 5 years, no employees in UK now; all the members in...
**Research agenda [#qb0bd1fe]
-parallel and distributed processing of very large text c...
-building very large text corpora from the web
-corpus heterogeneity and homogeneity
--what's in the corpus?
-corpus evaluation
--lexicographic tasks to compare different corpora
-terminology extraction
--more and more users; cheap opportunities; supporting to...
-corpora and language teaching
--a book by James Thomas
-language change over time
**Bilingual terminology extraction by Vit Baisa [#y2e4d5a1]
-monolingual terminology extraction is supported for 14 l...
-what is a term?
***unithood: grammatically defined (e.g. noun phrases) [#...
-formalism
--2:NN/JJ/VVG + 1: NN(head) -- English
---Depending on languages, more complex rules need to be ...
-Statistics
--simple math parameter N: (f(focus) + N) / (f(ref) + N) f:...
-Output:
--Download TBX (exchange format of terminology)/CSV
-Fine tuning
--stoplists
--minimum freq
--minimum score
--minimum character length
--alphanumeric character only
-Multilingual terminology extraction
-Future work:
--API: sketch engine available
--plugins for SDL, Kilgray products
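The "simple math parameter N" statistic noted above can be sketched as follows. This is only a minimal illustration, assuming that f(focus) and f(ref) are frequencies normalised per million tokens and that N is a smoothing constant added to both; the function names and the toy counts are hypothetical and do not come from the talk.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / corpus_size


def simple_math_score(f_focus, f_ref, n=1.0):
    """Score (f_focus + N) / (f_ref + N) on normalised frequencies."""
    return (f_focus + n) / (f_ref + n)


def rank_candidates(focus_counts, ref_counts, focus_size, ref_size, n=1.0, min_freq=1):
    """Score every candidate in the focus corpus and sort by descending score."""
    scored = []
    for term, count in focus_counts.items():
        if count < min_freq:  # "minimum freq" fine-tuning option
            continue
        f_focus = per_million(count, focus_size)
        f_ref = per_million(ref_counts.get(term, 0), ref_size)
        scored.append((term, simple_math_score(f_focus, f_ref, n)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy counts, invented for illustration only.
focus = {"word sketch": 50, "the": 60000, "corpus": 400}
ref = {"word sketch": 1, "the": 6000000, "corpus": 2000}
ranking = rank_candidates(focus, ref, focus_size=1_000_000, ref_size=100_000_000)
for term, value in ranking:
    print(term, round(value, 2))
```

A larger N favours more frequent candidates, a smaller N favours rare ones, which is why it is exposed as a tuning parameter.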
**Parallel corpora by Jan Michelfeit (Lexical Computing L...
-Features
--sentence alignment only
--mapping file showing the alignment of sentences
-Preloaded corpora
--Europarl7
--DGT (4M English sentences, translation memory)
--OPUS2 (130M English sentences)
---various documents (EU, UN, movie, Tatoeba, etc.)
-TMX
--a good input format for 1:1 bilingual data
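To show why TMX suits 1:1 bilingual data, here is a minimal sketch that reads aligned segment pairs from a TMX document with the Python standard library. The sample document, the language codes and the function name are assumptions made for illustration; this is not the Sketch Engine import code.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: each <tu> (translation unit) holds one <tuv> per language.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="example"
          creationtoolversion="0"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The castle is old.</seg></tuv>
      <tuv xml:lang="de"><seg>Das Schloss ist alt.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, src="en", tgt="de"):
    """Return (source, target) segment pairs, one per translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Keep only units that contain both languages.
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


pairs = read_pairs(SAMPLE_TMX)
for source, target in pairs:
    print(source, "|", target)
```

Because every translation unit already pairs the segments, no separate sentence-alignment mapping file is needed for this kind of input.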
*Underused functionality [#x27f9b37]
**Global attributes in CQL [#ofbdc06f]
-1.number=2.number tokens #1 and token #2 have the sa...
-1.id = 2.ref ??
-m4 macros
--A noun --> define(`noun', `[tag="N.*"]')
--An adjective --> define(`adj', `[tag="JJ.?"]') etc.
---make the agreement rules simpler once definition is made
-wordlist functions
--using word attributes
-WebBootCaT
--get seed words from Wikipedia
**New functions in CQL [#ubc5d8c1]
-adopted by Manatee
-Query language
--CQL: Oliver Christ
-Useful options:
--general NOT operator; NOT within;
---!within
---<doc year="2010"> ! containing [word="castle"]
---general ! <query>
-Global conditions
--1:[lemma="alt"] 2:[lemma="Schloss"] & 1.case = 2.case ...
--1:[lemma="alt"] 2:[tag="NOUN"] & 1.head = 2.id #depend...
-New operators
--[ws("test-n", "modifier", ".*")]
---modifiers of test(noun)
--[swap(1, ws("test-n", "modifier", ".*")) & tag=".*DAT.*"]
---with collocates being in dative
---a command for changing the position of the node
--[lempos~10"test-n"]
---search for the top 10 synonyms of "test-n" (synonym se...
-REGEX optimization
--simple OR optimization
--prefix optimization
--n-gram prefetching
---pre-index all character unigrams, bigrams and trigrams
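One plausible reading of the n-gram prefetching item above: index every lexicon entry by its character n-grams, so that a regular expression is only evaluated against entries sharing the n-grams of a literal substring it requires. The sketch below is a guess at the idea, not Manatee's actual implementation; all names are hypothetical, and the required literal is supplied by the caller rather than extracted from the pattern.

```python
import re
from collections import defaultdict


def char_ngrams(text, max_n=3):
    """All character 1-, 2- and 3-grams of a string."""
    return {text[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)}


def build_index(lexicon, max_n=3):
    """Map each character n-gram to the ids of lexicon entries containing it."""
    index = defaultdict(set)
    for word_id, word in enumerate(lexicon):
        for gram in char_ngrams(word, max_n):
            index[gram].add(word_id)
    return index


def search(lexicon, index, pattern, literal, max_n=3):
    """Match `pattern` against the lexicon, pre-filtered by a required literal."""
    if not literal:
        candidates = set(range(len(lexicon)))  # nothing to prefetch on
    else:
        # Use the literal's longest indexed n-grams as the filter.
        grams = [literal[i:i + max_n]
                 for i in range(len(literal) - max_n + 1)] or [literal]
        candidates = None
        for gram in grams:
            ids = index.get(gram, set())
            candidates = set(ids) if candidates is None else candidates & ids
    regex = re.compile(pattern)
    # The regex still runs, but only on the prefetched candidates.
    return sorted(lexicon[i] for i in candidates if regex.fullmatch(lexicon[i]))


lexicon = ["castle", "castles", "cast", "test", "testing", "contest"]
index = build_index(lexicon)
print(search(lexicon, index, r".*test.*", "test"))
print(search(lexicon, index, r"cast.*", "cast"))
```

The index only narrows the candidate set; correctness still rests on the final regex match, so an over-generous filter costs time but never produces wrong results.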
*Ted Briscoe Keynote [#i1af835f]
**Applications [#y68b0bc8]
***CLC [#g0e3112d]
-40 M words with 80 error types coded
-24 different exam types
-CEFR levels benchmarked to exam grades
***Error-coded script [#sbcf57de]
-Transcribe
-Error coding
--<M|our having> M:missing
***Parse it with RASP [#za839cfe]
-Parsing errors: used as input to machine learning
-dependency grammar: representation of texts
***Overall assessment of proficiency [#nf961427]
-textual features which are proxies for writing competence
-predict a grade
***Machine learning [#r8e01a31]
-Training : CLC
--Classification: number of labels (pass vs. fail)
--Regression: try to predict real number
--Ranking: A>B>C (e.g. rank the web pages)
-Features
--script length
--errors
--n-grams
--parse trees
-Rank Preference --> prediction
***Pairwise Ranking SVM: [#cb86f93b]
-Joachims 2002
-Learn an optimal ranking function that explicitly models...
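The pairwise idea can be sketched as follows: every pair of scripts with different grades becomes one training example (the difference of their feature vectors), and a linear model is trained so that the better script scores higher. This is a minimal pure-Python sketch using sub-gradient descent on a hinge loss, in the spirit of a ranking SVM; it is not Joachims' solver and not the system described in the keynote. The two features (normalised script length, error rate) and all values are invented.

```python
def pairwise_differences(features, grades):
    """One difference vector x_i - x_j for every pair with grade_i > grade_j."""
    diffs = []
    for i in range(len(features)):
        for j in range(len(features)):
            if grades[i] > grades[j]:
                diffs.append([a - b for a, b in zip(features[i], features[j])])
    return diffs


def train_ranker(features, grades, epochs=200, lr=0.1, reg=0.01):
    """Learn weights w such that w.(x_i - x_j) >= 1 for correctly ordered pairs."""
    diffs = pairwise_differences(features, grades)
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            # Sub-gradient step on  reg/2 * |w|^2 + max(0, 1 - w.d)
            w = [wi * (1 - lr * reg) for wi in w]
            if margin < 1:
                w = [wi + lr * di for wi, di in zip(w, d)]
    return w


def score(w, x):
    """Ranking score of one script; only the relative order is meaningful."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Invented toy data: [normalised script length, error rate] and exam grades.
scripts = [[0.2, 0.9], [0.4, 0.7], [0.6, 0.4], [0.9, 0.1]]
grades = [1, 2, 3, 4]
weights = train_ranker(scripts, grades)
order = sorted(range(len(scripts)),
               key=lambda i: score(weights, scripts[i]), reverse=True)
print(order)
```

Unlike classification or regression, the model never sees an absolute grade; it only learns which of two scripts should be ranked higher.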
***Features [#g359fa70]
-word sequences
-POS sequences
-Grammatical constructions
-Other features
--error rate estimate (against n-gram corpora)
--readability score
-Correlation between human grades and the system-predicte...
--Upper bound = 0.796 (word+POS+readability+constructions...
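A correlation of this kind can be computed as below. The notes do not say which coefficient was reported, so Pearson's r is an assumption here, and the grades are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented example: human grades vs. system-predicted scores for five scripts.
human = [3, 4, 2, 5, 1]
system = [2.8, 3.9, 2.5, 4.6, 1.4]
r = pearson(human, system)
print(round(r, 3))
```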
***Highly ranked features [#fca1ff02]
-100,000 features
--see images
***Error detection and correction suggestions [#j61fcb6f]
-ensure high accuracy and reasonable coverage
--corpus-derived rules
---error rules from CLC
---detect incorrect word sequences
---At least 90% incorrect occurrences
--online dictionary-derived rules
--Precision:90% Recall:10% (effect on learning rate)
-Sentence evaluation
--Limited linguistic evidence that can be extracted autom...
--difficulty in acquiring
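The precision/recall trade-off noted above (90% vs. 10%) can be made concrete with a small sketch. The token positions and counts are invented so that the result matches those two figures; the function name is hypothetical.

```python
def precision_recall(flagged, gold):
    """Precision and recall of flagged error positions against gold annotations."""
    flagged, gold = set(flagged), set(gold)
    true_positives = len(flagged & gold)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: 90 annotated errors; the system flags only 10 positions,
# 9 of which are real errors and 1 of which is a false alarm.
gold_errors = set(range(90))
flagged_positions = set(range(9)) | {500}
precision, recall = precision_recall(flagged_positions, gold_errors)
print(precision, recall)
```

A system tuned this way rarely misleads the learner with a wrong correction, at the cost of leaving most errors unflagged.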
***Incremental Semantic Analysis (ISA) [#e6dd0815]
-Distributional semantics
--fully incremental variation of Random Indexing
--Similarity between words measured by context vectors
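For orientation, here is a minimal sketch of plain Random Indexing, the technique that ISA is described as a variation of. It is not ISA itself: each word gets a fixed sparse random index vector, and a word's context vector is the running sum of the index vectors of its neighbours. The dimensionality, window size, seed and toy sentences are all assumptions.

```python
import math
import random
from collections import defaultdict


def index_vector(rng, dim, nonzero):
    """Sparse ternary vector: a few +1/-1 entries at random positions."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((1, -1))
    return vec


def random_indexing(sentences, dim=512, nonzero=8, window=2, seed=0):
    """Build context vectors one sentence at a time, with no global matrix."""
    rng = random.Random(seed)
    index = {}
    context = defaultdict(lambda: [0] * dim)
    for sentence in sentences:
        for word in sentence:
            if word not in index:
                index[word] = index_vector(rng, dim, nonzero)
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    neighbour = index[sentence[j]]
                    ctx = context[word]
                    for k in range(dim):
                        ctx[k] += neighbour[k]
    return dict(context)


def cosine(u, v):
    """Cosine similarity between two context vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sentences = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"], ["a", "red", "car"]]
vectors = random_indexing(sentences)
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["car"]))
```

Words that occur in the same contexts ("cat", "dog") end up with near-identical context vectors, while words with unrelated contexts stay far apart.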
***Future work [#n850479c]
-prompt relevance
-task achievement
-discourse organization feedback
-L1-specific feedback
-content word error detection and correction suggestions
-link to online dictionary/thesaurus
-link to courseware multiple choice exercises