[[FrontPage]]

*SkE workshop [#cf4e1329]
-Herstmonceux Castle, East Sussex
--10 August, 2015

**Sketch Engine data [#q51526ce]
-August 2015
--400 corpora for 82 languages
--100+ corpora (more than 100 million tokens)
--30+ corpora (more than 1 billion tokens)
--60+ languages with a POS-tagged corpus
--42 languages with word sketches
--26 languages with an integrated tagger
--parallel corpora

**Sketch Engine marketing [#h8a9ce82]
-ambassadors and advocates in the SkE community
-workshop

**SkE future [#zc019c4c]
-President: Adam's wife
-MJ buys the company: a Czech company will own SkE
-no immediate changes for any customers
--at least for a couple of years
---in 5 years, no employees in the UK now; all the members are in Brno

**Research agenda [#t00c08cd]
-parallel and distributed processing of very large text corpora
-building very large text corpora from the web
-corpus heterogeneity and homogeneity
--what's in the corpus?
-corpus evaluation
--lexicographic tasks to compare different corpora
-terminology extraction
--more and more users; cheap opportunities; supporting tools for translators
-corpora and language teaching
--a book by James Thomas
-language change over time

**Bilingual terminology extraction by Vit Baisa [#tbcfc06f]
-monolingual terminology extraction is supported for 14 languages
-what is a term?

***unithood: grammatically defined (e.g. noun phrases) [#a99234ad]
-formalism
--English: 2:NN/JJ/VVG + 1:NN (head)
---Depending on the language, more complex rules are needed that take case, gender, and number agreement into account
-Statistics
--simple maths with parameter N: score = (f(focus) + N) / (f(ref) + N)
--f: relative frequency
-Output:
--Download as TBX (an exchange format for terminology) or CSV
-Fine tuning
--stoplists
--minimum frequency
--minimum score
--minimum character length
--alphanumeric characters only
-Multilingual terminology extraction
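
The "Statistics" and "Fine tuning" items above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sketch Engine's implementation: the function names (`simple_maths`, `per_million`, `extract_terms`) are hypothetical, relative frequency is assumed to mean frequency per million tokens, the order in which the filters are applied is a guess, and the counts in the usage note are invented toy data.

```python
def simple_maths(f_focus, f_ref, n=1.0):
    """Score = (f(focus) + N) / (f(ref) + N), with f as relative frequency."""
    return (f_focus + n) / (f_ref + n)


def per_million(count, corpus_size):
    """Relative frequency, assumed here to be occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def extract_terms(focus_counts, focus_size, ref_counts, ref_size, n=1.0,
                  stoplist=frozenset(), min_freq=1, min_score=0.0,
                  min_chars=1, alnum_only=False):
    """Score candidate terms and apply the fine-tuning filters from the notes.

    focus_counts / ref_counts map a candidate term to its raw frequency in
    the focus and reference corpus. Returns (term, score) pairs sorted by
    descending score, ties broken alphabetically.
    """
    results = []
    for term, count in focus_counts.items():
        if count < min_freq:            # minimum frequency
            continue
        if term in stoplist:            # stoplist
            continue
        if len(term) < min_chars:       # minimum character length
            continue
        # alphanumeric characters only (spaces allowed in multi-word terms)
        if alnum_only and not all(ch.isalnum() or ch.isspace() for ch in term):
            continue
        score = simple_maths(
            per_million(count, focus_size),
            per_million(ref_counts.get(term, 0), ref_size),
            n,
        )
        if score >= min_score:          # minimum score
            results.append((term, score))
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results
```

With toy counts such as `extract_terms({"word sketch": 50, "the": 1000, "corpus": 200}, 1_000_000, {"the": 60000, "corpus": 10}, 1_000_000, min_score=1.0)`, a candidate that is absent from the reference corpus ("word sketch") ranks first, while "the" scores below 1 and is dropped. A larger N damps the advantage of rare candidates, so it shifts the ranking towards more frequent terms.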