
*SkE workshop [#sa570497]
-Herstmonceux Castle, East Sussex
--10 August, 2015

**Sketch Engine data [#z4e3d6e3]
-August 2015
--400 corpora for 82 languages
--100+ corpora with more than 100 million tokens
--30+ corpora with more than 1 billion tokens
--60+ languages with POS-tagged corpora
--42 languages with word sketches
--26 languages with integrated tagger
--parallel corpora

**Sketch Engine marketing [#y21f123b]
-ambassadors and advocates in the SkE community
-workshop

**SkE future [#i3c2152d]
-President: Adam's wife
-MJ buys the company: Czech company will own SkE
-no immediate changes to any customers
--at least a couple of years
---in 5 years; no employees in the UK now; all the members are in Brno

**Research agenda [#qb0bd1fe]
-parallel and distributed processing of very large text corpora
-building very large text corpora from the web
-corpus heterogeneity and homogeneity
--what's in the corpus?
-corpus evaluation
--lexicographic tasks to compare different corpora
-terminology extraction
--more and more users; cheap opportunities; supporting tools for translators
-corpora and language teaching
--a book by James Thomas
-language change over time

**Bilingual terminology extraction by Vit Baisa [#y2e4d5a1]
-monolingual terminology extraction is supported for 14 languages
-what is a term?
***unithood: grammatically defined (e.g. noun phrases) [#nd32179b]
-formalism
--2:NN/JJ/VVG + 1: NN(head) -- English
---Depending on the language, more complex rules are needed, taking into account case, gender, and number agreement
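
The English unithood rule above (a modifier tagged NN, JJ, or VVG followed by a head noun) can be sketched roughly as follows. This is only an illustration: the function name `term_candidates`, the tagged-sentence input format, and the prefix matching on tags are assumptions, not the actual Sketch Engine term grammar.

```python
# Hypothetical input: a list of (word, tag) pairs with TreeTagger-style tags.
MODIFIER_TAGS = ("NN", "JJ", "VVG")  # position 2 in the rule: modifier
HEAD_TAGS = ("NN",)                  # position 1 in the rule: head noun


def term_candidates(tagged):
    """Return two-word candidates: a modifier followed by a head noun."""
    candidates = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        # str.startswith accepts a tuple, so NNS also counts as a noun here
        if t1.startswith(MODIFIER_TAGS) and t2.startswith(HEAD_TAGS):
            candidates.append(f"{w1} {w2}")
    return candidates


sentence = [("the", "DT"), ("parallel", "JJ"), ("corpus", "NN"),
            ("supports", "VVZ"), ("terminology", "NN"), ("extraction", "NN")]
print(term_candidates(sentence))
```

For languages with richer morphology, the same loop would also need to compare case, gender, and number attributes of the two tokens.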

-Statistics
--"simple maths" score with parameter N: (f(focus) + N) / (f(ref) + N), where f is the relative frequency in the focus or reference corpus
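
A minimal sketch of the score above, assuming relative frequencies are expressed per million tokens (the notes only say "relative frequency"). The names `per_million` and `simple_maths` are illustrative, and the counts in the example are made up.

```python
def per_million(count, corpus_size):
    """Relative frequency per million tokens (assumed normalisation)."""
    return count * 1_000_000 / corpus_size


def simple_maths(f_focus, f_ref, n=1.0):
    """Keyness score: (f(focus) + N) / (f(ref) + N)."""
    return (f_focus + n) / (f_ref + n)


# Hypothetical counts: 500 hits in a 10M-token focus corpus,
# 400 hits in a 100M-token reference corpus.
f_focus = per_million(500, 10_000_000)    # 50 per million
f_ref = per_million(400, 100_000_000)     # 4 per million
print(simple_maths(f_focus, f_ref, n=1.0))
```

A larger N damps the score of rare items, so raising N shifts the ranking towards more frequent candidates.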

-Output:
--Download as TBX (terminology exchange format) or CSV

-Fine tuning
--stoplists
--minimum freq
--minimum score
--minimum character length
--alphanumeric characters only
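
The fine-tuning options above amount to a chain of filters over the candidate list. The sketch below shows one way they could combine; the `Candidate` class, the default thresholds, and the decision to allow spaces inside multi-word terms are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    term: str
    freq: int
    score: float


def keep(c, stoplist, min_freq=2, min_score=2.0, min_chars=3, alnum_only=True):
    """Apply the stoplist, frequency, score, length, and character filters."""
    if c.term.lower() in stoplist:
        return False
    if c.freq < min_freq or c.score < min_score:
        return False
    if len(c.term) < min_chars:
        return False
    # Spaces are allowed so that multi-word terms survive (assumption).
    if alnum_only and not all(ch.isalnum() or ch == " " for ch in c.term):
        return False
    return True


# Hypothetical candidates
cands = [Candidate("parallel corpus", 12, 8.4), Candidate("the", 900, 1.1),
         Candidate("e.g.", 30, 5.0), Candidate("ws", 40, 6.0),
         Candidate("word sketch", 1, 9.0)]
print([c.term for c in cands if keep(c, stoplist={"the"})])
```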

-Multilingual terminology extraction

-Future work:
--API: sketch engine available
--plugins for SDL, Kilgray products

**Parallel corpora by Jan Michelfeit (Lexical Computing Ltd) [#sb12f7a7]

-Features
--sentence alignment only
--mapping file showing the alignment of sentences

-Preloaded corpora
--Europarl7
--DGT (4M English sentences, translation memory)
--OPUS2 (130M English sentences)
---various documents (EU, UN, movies, Tatoeba, etc.)

-TMX
--a good input format for 1:1 bilingual data
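
To illustrate why TMX suits 1:1 bilingual data, here is a small sketch that reads aligned segment pairs from a TMX string using only the Python standard library. The sample content and the function name `read_pairs` are hypothetical; it relies only on the standard TMX layout of `tu` elements containing one `tuv` per language, each with a `seg`.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample with a single translation unit.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The old castle.</seg></tuv>
      <tuv xml:lang="de"><seg>Das alte Schloss.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, src="en", tgt="de"):
    """Return (source, target) segment pairs for units that have both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg", default="")
                for tuv in tu.iter("tuv")}
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


print(read_pairs(SAMPLE_TMX))
```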

*Underused functionality [#x27f9b37]

**Global attributes in CQL [#ofbdc06f]

-1.number=2.number    	token #1 and token #2 have the same number (singular or plural)
-1.id = 2.ref			??

-m4 macros
--A noun --> define(`noun', `[tag="N.*"]')
--An adjective --> define(`adj', `[tag="JJ.?"]')  etc.
---agreement rules become simpler to write once the macros are defined
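
The effect of such macros can be imitated in a few lines of Python. This is a stand-in for illustration only, not m4 itself: it does plain whole-word substitution and ignores m4 quoting and macro arguments. The `MACROS` table and `expand` function are hypothetical names.

```python
import re

# Macro table mirroring the two definitions in the notes.
MACROS = {"noun": '[tag="N.*"]', "adj": '[tag="JJ.?"]'}


def expand(rule, macros=MACROS):
    """Replace each whole-word macro name with its CQL definition."""
    return re.sub(r"\b\w+\b",
                  lambda m: macros.get(m.group(0), m.group(0)),
                  rule)


# An agreement rule written with macro names and a global condition.
print(expand("1:adj 2:noun & 1.number = 2.number"))
```

The rule stays short and readable, while the expanded query carries the full tag patterns.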

-wordlist functions
--using word attributes

-WebBootCaT
--get seed words from Wikipedia

**New functions in CQL [#ubc5d8c1]

-adopted by Manatee
-Query language
--CQL: Oliver Christ
-Useful options:
--general NOT operator; NOT within;
---!within
---<doc year="2010"> ! containing [word="castle"]
---general ! <query>

-Global conditions
--1:[lemma="alt"] 2:[lemma="Schloss"] & 1.case = 2.case # agreement in case
--1:[lemma="alt"] 2:[tag="NOUN"] & 1.head = 2.id  # dependency relation

-New operators
--[ws("test-n", "modifier", ".*")]
---modifiers of test(noun)

--[swap(1, ws("test-n", "modifier", ".*")) & tag=".*DAT.*"]
---with collocates being in dative
---swap is the command for changing the position of the node

--[lempos~10"test-n"]
---search for the top 10 synonyms of "test-n" (synonym search)

-REGEX optimization
--simple OR optimization
--prefix optimization
--n-gram prefetching
---pre-index all character unigrams, bigrams, and trigrams
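
The idea behind n-gram prefetching can be sketched as follows: index every lexicon entry by its character 1-, 2-, and 3-grams, use the index to shortlist entries that contain a literal fragment of the pattern, and run the regular expression only on that shortlist. This is an assumption-laden illustration, not Manatee's implementation; in particular the caller supplies the literal fragment here, whereas a real engine would derive it from the regex.

```python
import re
from collections import defaultdict


def ngrams(s, n_max=3):
    """All character n-grams of s for n = 1..n_max."""
    return {s[i:i + n] for n in range(1, n_max + 1)
            for i in range(len(s) - n + 1)}


def build_index(lexicon):
    """Map each character n-gram to the set of words containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in ngrams(word):
            index[gram].add(word)
    return index


def search(pattern, literal, index, lexicon):
    """Shortlist words containing `literal` via the index, then apply the regex."""
    candidates = set(lexicon)
    if literal:
        # Use trigrams of the literal, or the literal itself if it is shorter.
        grams = [literal[i:i + 3] for i in range(len(literal) - 2)] or [literal]
        for gram in grams:
            candidates &= index.get(gram, set())
    rx = re.compile(pattern)
    return sorted(w for w in candidates if rx.fullmatch(w))


lexicon = ["castle", "cast", "casting", "caste", "hostel", "test"]
index = build_index(lexicon)
print(search(r"cast.*e", "cast", index, lexicon))
```

In this toy lexicon the regex runs on four shortlisted words instead of all six; on a lexicon of millions of word forms the saving is what matters.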

*Ted Briscoe Keynote [#i1af835f]

**Applications [#y68b0bc8]

***CLC [#g0e3112d]
-40 M words with 80 error types coded
-24 different exam types
-CEFR levels benchmarked to exam grades

***Error-coded script [#sbcf57de]
-Transcribe
-Error coding
--<M|our having>	M:missing

***Parse it with RASP [#za839cfe]
-Parsing errors: used as input to machine learning
-dependency grammar: representation of texts

***Overall assessment of proficiency [#nf961427]
-textual features which are proxies for writing competence
-predict a grade

***Machine learning [#r8e01a31]
-Training : CLC
--Classification: predict one of a fixed number of labels (e.g. pass vs. fail)
--Regression: predict a real number
--Ranking: A>B>C (e.g. rank the web pages)

-Features
--script length
--errors
--n-grams
--parse trees

-Rank Preference --> prediction


***Pairwise Ranking SVM: [#cb86f93b]
-Joachims 2002
-Learn an optimal ranking function that explicitly models the grade relationships
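
The pairwise idea can be sketched in plain Python: every pair of scripts with different grades becomes a difference vector (higher grade minus lower grade), and a linear model is trained so that each difference scores above a margin. The trainer below is a simple hinge-loss subgradient loop used as a stand-in for a real SVM solver; the feature vectors, grades, function names, and hyperparameters are all hypothetical.

```python
from itertools import combinations


def pairwise_differences(scripts):
    """scripts: list of (features, grade). Return x_higher - x_lower per pair."""
    diffs = []
    for (xa, ga), (xb, gb) in combinations(scripts, 2):
        if ga == gb:
            continue  # equal grades give no ordering constraint
        hi, lo = (xa, xb) if ga > gb else (xb, xa)
        diffs.append([h - l for h, l in zip(hi, lo)])
    return diffs


def train_ranker(scripts, epochs=200, lr=0.1, lam=0.01):
    """Hinge-loss subgradient descent on difference vectors (not a full SVM solver)."""
    diffs = pairwise_differences(scripts)
    w = [0.0] * len(scripts[0][0])
    for _ in range(epochs):
        for d in diffs:
            w = [wi * (1 - lr * lam) for wi in w]           # L2 shrinkage
            if sum(wi * di for wi, di in zip(w, d)) < 1.0:  # margin violated
                w = [wi + lr * di for wi, di in zip(w, d)]
    return w


def score(w, x):
    """Linear ranking score of one script."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical two-feature scripts with grades 3 > 2 > 1.
SCRIPTS = [([0.9, 0.2], 3), ([0.5, 0.4], 2), ([0.1, 0.3], 1)]
w = train_ranker(SCRIPTS)
ranked = sorted(SCRIPTS, key=lambda s: score(w, s[0]), reverse=True)
print([grade for _, grade in ranked])
```

Because only score differences are constrained, the model learns an ordering of scripts; mapping scores back to grades needs a separate step.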

***Features [#g359fa70]
-word sequences
-POS sequences
-Grammatical constructions
-Other features
--error rate estimate (against n-gram corpora)
--readability score

-Correlation between human grades and the system-predicted grades
--Upper bound = 0.796 (word+POS+readability+constructions+error rate)
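
The notes do not record which correlation coefficient was reported. Assuming Pearson's r for illustration, agreement between human and system grades could be computed as below; the grade lists are invented and the function name `pearson` is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical grades for five scripts
human = [12, 25, 31, 18, 36]
system = [14, 22, 33, 20, 30]
print(round(pearson(human, system), 3))
```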

***Highly ranked features [#fca1ff02]
-100,000 features
--see images

***Error detection and correction suggestions [#j61fcb6f]
-ensure high accuracy and reasonable coverage

--corpus-derived rules
---error rules from CLC
---detect incorrect word sequences
---at least 90% of occurrences are incorrect

--online dictionary-derived rules

--Precision: 90%, Recall: 10% (effect on learning rate)
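
For reference, a short sketch of how precision and recall follow from detection counts. The counts are invented so that they reproduce the 90% / 10% trade-off in the notes: the system flags few errors, but is nearly always right when it does.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts: 90 real errors in the text; the system flags 10 spans,
# 9 of which are real errors, and misses the other 81.
p, r = precision_recall(true_pos=9, false_pos=1, false_neg=81)
print(p, r)
```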

-Sentence evaluation
--Limited linguistic evidence that can be extracted automatically
--difficulty in acquiring 

***Incremental Semantic Analysis (ISA) [#e6dd0815]
-Distributional semantics
--fully incremental variation of Random Indexing
--Similarity between words measured by context vectors
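
A rough sketch of measuring word similarity with context vectors. Note the substitution: this uses plain co-occurrence counts within a window, not Random Indexing or ISA, and only illustrates the final similarity step. The window size, the toy sentence, and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def context_vectors(tokens, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vecs = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vecs[word][tokens[j]] += 1
    return vecs


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


tokens = "the old castle stands . the old palace stands .".split()
vecs = context_vectors(tokens)
print(cosine(vecs["castle"], vecs["palace"]), cosine(vecs["castle"], vecs["the"]))
```

Words that occur in the same contexts ("castle", "palace") get a higher similarity than words that merely occur near each other.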

***Future work [#n850479c]
-prompt relevance
-task achievement
-discourse organization feedback
-L1-specific feedback
-content word error detection and correction suggestions
-link to online dictionary/thesaurus
-link to courseware multiple choice exercises

