[[FrontPage]]

*SkE workshop [#sa570497]
-Herstmonceux Castle, East Sussex
--10 August, 2015
**Sketch engine data [#z4e3d6e3]
-August 2015
--400 corpora for 82 languages
--100+ corpora (more than 100 million tokens)
--30+ corpora (more than 1 billion tokens)
--60+ languages with POS tagged corpus
--42 languages with word sketches
--26 languages with integrated tagger
--parallel corpora
**Sketch engine marketing [#y21f123b]
-ambassadors and advocates in the SkE community
-workshop
**SkE future [#i3c2152d]
-President: Adam's wife
-MJ buys the company: Czech company will own SkE
-no immediate changes to any customers
--at least a couple of years
---in 5 years, no employees in the UK now; all the members in Brno
**Research agenda [#qb0bd1fe]
-parallel and distributed processing of very large text corpora
-building very large text corpora from the web
-corpus heterogeneity and homogeneity
--what's in the corpus?
-corpus evaluation
--lexicographic tasks to compare different corpora
-terminology extraction
--more and more users; cheap opportunities; supporting tools for translators
-corpora and language teaching
--a book by James Thomas
-language change over time
**Bilingual terminology extraction by Vit Baisa [#y2e4d5a1]
-monolingual terminology extraction is supported for 14 languages
-what is a term?
***unithood: grammatically defined (e.g. noun phrases) [#nd32179b]
-formalism
--2:NN/JJ/VVG + 1:NN (head), for English
---Depending on the language, more complex rules are needed, taking case, gender, and number agreement into account
-Statistics
--"simple math" score with parameter N: (f(focus) + N) / (f(ref) + N), where f is relative frequency
-Output:
--Download TBX (terminology exchange format)/CSV
-Fine tuning
--stoplists
--minimum freq
--minimum score
--minimum character length
--alphanumeric characters only
-Multilingual terminology extraction
-Future work:
--API: sketch engine available
--plugins for SDL, Kilgray products
**Parallel corpora by Jan Michelfeit (Lexical Computing Ltd) [#sb12f7a7]
-Features
--sentence alignment only
--mapping file showing the alignment of sentences
-Preloaded corpora
--Europarl7
--DGT (4M English sentences, translation memory)
--OPUS2 (130M English sentences)
---various documents (EU, UN, movie, Tatoeba, etc.)
-TMX
--a good input format for 1:1 bilingual data
*Underused functionality [#x27f9b37]
**Global attributes in CQL [#ofbdc06f]
-1.number=2.number: token #1 and token #2 have the same number (plural or singular)
-1.id = 2.ref ??
-m4 macros
--A noun --> define(`noun', `[tag="N.*"]')
--An adjective --> define(`adj', `[tag="JJ.?"]') etc.
---agreement rules become simpler once the definitions are made
-wordlist functions
--using word attributes
-WebBootCaT
--get seed words from Wikipedia
**New functions in CQL [#ubc5d8c1]
-adopted by Manatee
-Query language
--CQL: Oliver Christ
-Useful options:
--general NOT operator; NOT within
---!within
---<doc year="2010"/> !containing [word="castle"]
---general ! <query>
-Global conditions
--1:[lemma="alt"] 2:[lemma="Schloss"] & 1.case = 2.case # agreement in case
--1:[lemma="alt"] 2:[tag="NOUN"] & 1.head = 2.id # dependency relation
-New operators
--[ws("test-n", "modifier", ".*")]
---modifiers of test (noun)
--[swap(1, ws("test-n", "modifier", ".*")) & tag=".*DAT.*"]
---with collocates being in dative
---a command for changing the position of the node
--[lempos~10"test-n"]
---search for the top 10 synonyms of "test-n" (synonym search)
-REGEX optimization
--simple OR optimization
--prefix optimization
--n-gram prefetching
---pre-index all characters as uni-, bi-, and trigrams
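
The n-gram prefetching item above can be illustrated with a minimal sketch. This is not Manatee's implementation: the function names (`char_ngrams`, `build_ngram_index`, `regex_search`), the toy lexicon, and the requirement that the caller supplies a literal substring are all assumptions made for illustration. The sketch indexes every character uni-, bi-, and trigram of each lexicon entry, then uses that index to narrow the candidate set before running the full regular expression.

```python
import re
from collections import defaultdict


def char_ngrams(word, max_n=3):
    """Return all character n-grams of length 1..max_n found in word."""
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}


def build_ngram_index(lexicon, max_n=3):
    """Map each character n-gram to the set of lexicon entries containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in char_ngrams(word, max_n):
            index[gram].add(word)
    return index


def literal_grams(literal, max_n=3):
    """Longest indexed n-grams covering the literal (trigrams when possible)."""
    n = min(max_n, len(literal))
    if n == 0:
        return set()
    return {literal[i:i + n] for i in range(len(literal) - n + 1)}


def regex_search(pattern, literal, lexicon, index, max_n=3):
    """Return lexicon entries that fully match pattern, sorted.

    `literal` is a plain substring every match must contain. It is passed
    in by the caller here (an assumption of this sketch); a real engine
    would have to derive it from the pattern itself.
    """
    candidates = set(lexicon)
    # Prefetch step: keep only entries that contain every n-gram of the literal.
    for gram in literal_grams(literal, max_n):
        candidates &= index.get(gram, set())
    # Verification step: run the full regex on the reduced candidate set only.
    compiled = re.compile(pattern)
    return sorted(w for w in candidates if compiled.fullmatch(w))


lexicon = ["castle", "castles", "cast", "cattle", "forecast", "test"]
index = build_ngram_index(lexicon)
print(regex_search(r".*cast.*", "cast", lexicon, index))
```

With this toy lexicon the call prints `['cast', 'castle', 'castles', 'forecast']`. The regex is only evaluated on the four entries that share the trigrams "cas" and "ast", which is the point of prefetching: the expensive match runs on a small candidate set instead of the whole lexicon.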