SkE workshop
- Herstmonceux Castle, East Sussex
Sketch Engine data
- August 2015
- 400 corpora for 82 languages
- 100+ corpora (more than 100 million tokens)
- 30+ corpora (more than 1 billion tokens)
- 60+ languages with POS-tagged corpora
- 42 languages with word sketches
- 26 languages with integrated tagger
- parallel corpora
Sketch Engine marketing
- ambassadors and advocates in the SkE community
- workshop
SkE future
- President: Adam's wife
- MJ buys the company: Czech company will own SkE
- no immediate changes to any customers
- at least a couple of years
- in 5 years, no employees in UK now; all the members in Brno
Research agenda
- parallel and distributed processing of very large text corpora
- building very large text corpora from the web
- corpus heterogeneity and homogeneity
- corpus evaluation
- lexicographic tasks to compare different corpora
- terminology extraction
- more and more users; cheap opportunities; supporting tools for translators
- corpora and language teaching
- language change over time
Bilingual terminology extraction by Vit Baisa
- monolingual terminology extraction is supported for 14 languages
- what is a term?
unithood: grammatically defined (e.g. noun phrases)
- formalism
- English example: 2:NN/JJ/VVG + 1:NN (head)
- Depending on the language, more complex rules are needed, taking into account case, gender, and number agreement
- Statistics
- simple maths with parameter N: score = (f(focus) + N) / (f(ref) + N), where f is the relative frequency
- Output:
- Download as TBX (a terminology exchange format) or CSV
- Fine tuning
- stoplists
- minimum freq
- minimum score
- minimum character length
- alphanumeric characters only
- Multilingual terminology extraction
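
The monolingual steps noted above (unithood pattern, simple-maths score, fine-tuning filters) could be sketched roughly as follows. This is only an illustration, not Sketch Engine's implementation: all function and parameter names are hypothetical, the pattern covers only the two-word English rule from the notes, relative frequency is assumed to be per million tokens, and the stoplist is assumed to match whole terms.

```python
# Hypothetical English unithood rule from the notes: a modifier tagged
# NN, JJ or VVG, followed by a head noun tagged NN.
MODIFIER_TAGS = {"NN", "JJ", "VVG"}
HEAD_TAGS = {"NN"}


def candidate_terms(tagged_tokens):
    """Yield two-word candidates (modifier + head noun) from (word, tag) pairs."""
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 in MODIFIER_TAGS and t2 in HEAD_TAGS:
            yield f"{w1} {w2}"


def relative_freq(count, corpus_size, per=1_000_000):
    """Relative frequency, assumed here to be normalised per million tokens."""
    return count * per / corpus_size


def simple_maths(f_focus, f_ref, n=1.0):
    """Score = (f(focus) + N) / (f(ref) + N), with f the relative frequency."""
    return (f_focus + n) / (f_ref + n)


def extract_terms(focus_counts, focus_size, ref_counts, ref_size, n=1.0,
                  stoplist=frozenset(), min_freq=1, min_score=1.0,
                  min_length=1, alnum_only=False):
    """Score candidates against a reference corpus and apply the filters."""
    results = []
    for term, count in focus_counts.items():
        # Fine-tuning filters: stoplist, minimum frequency, minimum length.
        if term in stoplist or count < min_freq or len(term) < min_length:
            continue
        # Optional filter: letters and digits only (spaces between words allowed).
        if alnum_only and not term.replace(" ", "").isalnum():
            continue
        score = simple_maths(
            relative_freq(count, focus_size),
            relative_freq(ref_counts.get(term, 0), ref_size),
            n,
        )
        if score >= min_score:
            results.append((term, score))
    # Highest-scoring candidates first.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

In this sketch a larger N damps the scores of rare candidates, so raising it shifts the ranking toward more frequent terms, while a small N favours rare ones.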