SkE workshop †
- Herstmonceux Castle, East Sussex
Sketch Engine data †
- August 2015
- 400 corpora for 82 languages
- 100+ corpora with more than 100 million tokens
- 30+ corpora with more than 1 billion tokens
- 60+ languages with a POS-tagged corpus
- 42 languages with word sketches
- 26 languages with integrated tagger
- parallel corpora
Sketch Engine marketing †
- ambassadors and advocates in the SkE community
- workshop
SkE future †
- President: Adam's wife
- MJ buys the company: Czech company will own SkE
- no immediate changes to any customers
- at least a couple of years
- in 5 years; no employees in the UK now; all the members are in Brno
Research agenda †
- parallel and distributed processing of very large text corpora
- building very large text corpora from the web
- corpus heterogeneity and homogeneity
- corpus evaluation
- lexicographic tasks to compare different corpora
- terminology extraction
- more and more users; cheap opportunities; supporting tools for translators;
- corpora and language teaching
- language change over time
Bilingual terminology extraction by Vit Baisa †
- monolingual terminology extraction is supported for 14 languages
- what is a term?
unithood: grammatically defined (e.g. noun phrases) †
- formalism
- 2:NN/JJ/VVG + 1: NN(head) -- English
- Depending on the language, more complex rules are needed, taking into account case, gender, and number agreement
- Statistics
- "simple maths" score with parameter N: (f(focus) + N) / (f(ref) + N), where f is relative frequency
- Output:
- Download as TBX (a terminology exchange format) or CSV
- Fine tuning
- stoplists
- minimum freq
- minimum score
- minimum character length
- alphanumeric character only
- Multilingual terminology extraction
- Future work:
- API: sketch engine available
- plugins for SDL, Kilgray products
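The scoring formula and the fine-tuning filters noted above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Sketch Engine's implementation: the `Candidate` class, the function names, and the per-million normalisation of relative frequency are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical term candidate with raw counts in both corpora."""
    term: str
    focus_count: int  # occurrences in the focus (domain) corpus
    ref_count: int    # occurrences in the reference corpus


def per_million(count, corpus_size):
    """Relative frequency, normalised per million tokens (an assumption)."""
    return count * 1_000_000 / corpus_size


def simple_maths(f_focus, f_ref, n=1.0):
    """Score = (f(focus) + N) / (f(ref) + N)."""
    return (f_focus + n) / (f_ref + n)


def extract_terms(candidates, focus_size, ref_size, n=1.0,
                  stoplist=frozenset(), min_freq=1, min_score=0.0,
                  min_length=1, alnum_only=False):
    """Apply the fine-tuning filters, then rank candidates by score."""
    results = []
    for c in candidates:
        if c.term in stoplist:
            continue
        if c.focus_count < min_freq:
            continue
        if len(c.term) < min_length:
            continue
        # Spaces are allowed so that multi-word terms survive this filter.
        if alnum_only and not c.term.replace(" ", "").isalnum():
            continue
        score = simple_maths(per_million(c.focus_count, focus_size),
                             per_million(c.ref_count, ref_size), n)
        if score < min_score:
            continue
        results.append((c.term, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    cands = [Candidate("word sketch", 50, 2), Candidate("the", 60000, 6000000)]
    for term, score in extract_terms(cands, 1_000_000, 100_000_000,
                                     stoplist={"the"}, min_length=2):
        print(term, round(score, 2))
```

A larger N pushes the ranking towards more frequent candidates, while a smaller N favours rare ones that are almost absent from the reference corpus.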
Parallel corpora by Jan Michelfeit (Lexical Computing Ltd) †
- Features
- sentence alignment only
- mapping file showing the alignment of sentences
- Preloaded corpora
- Europarl7
- DGT (4M English sentences, translation memory)
- OPUS2 (130M English sentences)
- various documents (EU, UN, movie, Tatoeba, etc.)
- TMX
- a good input format for 1:1 bilingual data
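To make the 1:1 structure of TMX concrete, here is a minimal sketch that reads aligned sentence pairs from a TMX string using only the Python standard library. The sample document and the function name `read_tmx_pairs` are made up for illustration; real TMX files may carry extra inline markup and metadata that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written sample, not taken from any real corpus.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The castle is old.</seg></tuv>
      <tuv xml:lang="de"><seg>Das Schloss ist alt.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs, one per translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


if __name__ == "__main__":
    for en, de in read_tmx_pairs(SAMPLE_TMX, "en", "de"):
        print(en, "|", de)
```

Each `<tu>` element already holds one source segment and its translation, which is why no separate alignment step is needed for this kind of input.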
Underused functionality †
Global attributes in CQL †
- 1.number = 2.number: token #1 and token #2 have the same number (plural or singular)
- 1.id = 2.ref ??
- m4 macros
- A noun --> define(`noun', `[tag="N.*"]')
- An adjective --> define(`adj', `[tag="JJ.?"]') etc.
- make the agreement rules simpler once definition is made
- WebBootCaT
- get seed words from Wikipedia
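As a rough illustration of what a global agreement condition such as `1.number = 2.number` expresses, the sketch below applies the same idea to a list of tokens in plain Python. It is not how a corpus query engine evaluates CQL; the token dictionaries, the `number` attribute values, and the function names are assumptions chosen for the example.

```python
import re


def matches(token, constraints):
    """True if every listed attribute fully matches its regular expression."""
    return all(re.fullmatch(rx, token.get(attr, ""))
               for attr, rx in constraints.items())


def find_agreeing_pairs(tokens, first, second, agree_on):
    """Find adjacent token pairs that satisfy both local constraints
    and share the same value for the attribute named in agree_on."""
    hits = []
    for t1, t2 in zip(tokens, tokens[1:]):
        if not (matches(t1, first) and matches(t2, second)):
            continue
        value = t1.get(agree_on)
        # The global condition: both labelled tokens carry the same value.
        if value is not None and value == t2.get(agree_on):
            hits.append((t1["word"], t2["word"]))
    return hits


# Hand-made tokens; a real corpus would supply these attributes.
TOKENS = [
    {"word": "this", "tag": "DT", "number": "sg"},
    {"word": "castle", "tag": "NN", "number": "sg"},
    {"word": "these", "tag": "DT", "number": "pl"},
    {"word": "castles", "tag": "NNS", "number": "pl"},
    {"word": "this", "tag": "DT", "number": "sg"},
    {"word": "castles", "tag": "NNS", "number": "pl"},
]

if __name__ == "__main__":
    # Roughly: 1:[tag="DT"] 2:[tag="N.*"] & 1.number = 2.number
    print(find_agreeing_pairs(TOKENS, {"tag": "DT"}, {"tag": "N.*"}, "number"))
```

The local constraints play the role of the bracketed token patterns, and the final comparison plays the role of the condition after `&`.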
New functions in CQL †
- adopted by Manatee
- Query language
- Useful options:
- general NOT operator; NOT within;
- !within
- <doc year="2010"> ! containing [word="castle"]
- general ! <query>
- Global conditions
- 1:[lemma="alt"] 2:[lemma="Schloss"] & 1.case = 2.case # agreement in case
- 1:[lemma="alt"] 2:[tag="NOUN"] & 1.head = 2.id # dependency relation
- New operators
- [ws("test-n", "modifier", ".*")]
- [swap(1, ws("test-n", "modifier", ".*")) & tag=".*DAT.*"]
- with collocates being in dative
- swap is a command for changing the position of the node
- [lempos~10"test-n"]
- search for the top 10 synonyms of "test-n" (synonym search)
- REGEX optimization
- simple OR optimization
- prefix optimization
- n-gram prefetching
- pre-index all character unigrams, bigrams, and trigrams
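One way to picture how a character n-gram index can speed up regular-expression lookups in a lexicon is sketched below. This is an illustrative assumption, not a description of Manatee's internals: the index layout, the conservative literal-prefix extraction, and all function names are invented for the example.

```python
import re
from collections import defaultdict


def char_ngrams(word, max_n=3):
    """All character n-grams of the word for n = 1..max_n."""
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}


def build_index(lexicon, max_n=3):
    """Map each n-gram to the set of lexicon positions containing it."""
    index = defaultdict(set)
    for word_id, word in enumerate(lexicon):
        for gram in char_ngrams(word, max_n):
            index[gram].add(word_id)
    return index


def literal_prefix(pattern):
    """Leading literal characters that every match must start with.
    Deliberately conservative: gives up on any alternation."""
    if "|" in pattern:
        return ""
    prefix = re.match(r"[A-Za-z0-9]*", pattern).group(0)
    rest = pattern[len(prefix):]
    # A following ?, * or { can make the last literal optional.
    if prefix and rest[:1] in ("?", "*", "{"):
        prefix = prefix[:-1]
    return prefix


def candidate_ids(pattern, lexicon, index, max_n=3):
    """Narrow the lexicon using the n-grams of the literal prefix."""
    prefix = literal_prefix(pattern)
    candidates = set(range(len(lexicon)))
    k = min(max_n, len(prefix))
    for i in range(len(prefix) - k + 1 if k else 0):
        candidates &= index.get(prefix[i:i + k], set())
    return candidates


def regex_lookup(pattern, lexicon, index, max_n=3):
    """Run the full regex only on the pre-filtered candidates."""
    compiled = re.compile(pattern)
    return sorted(lexicon[i]
                  for i in candidate_ids(pattern, lexicon, index, max_n)
                  if compiled.fullmatch(lexicon[i]))


LEXICON = ["castle", "castles", "cast", "cat", "test", "tester", "schloss"]
INDEX = build_index(LEXICON)

if __name__ == "__main__":
    print(regex_lookup("cast.*", LEXICON, INDEX))
    print(regex_lookup("tests?", LEXICON, INDEX))
```

The n-gram filter only removes words that cannot match, so the final `fullmatch` check is still needed; the gain is that the expensive regex runs on a much smaller candidate set.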