Backup of Elex2015Summary (No. 2) - Yukio Tono's Course Homepage

Backup revisions:
- 1 (2015-08-10 (Mon) 22:51:54)
- 2 (2015-08-11 (Tue) 00:28:54)

SkE workshop †

Herstmonceux Castle, East Sussex
- 10 August, 2015

Sketch engine data †

August 2015
- 400 corpora for 82 languages
- 100+ corpora (more than 100 million tokens)
- 30+ corpora (more than 1 billion tokens)
- 60+ languages with POS tagged corpus
- 42 languages with word sketches
- 26 languages with integrated tagger
- parallel corpora

Sketch engine marketing †

ambassadors and advocates in the SkE community
workshop

SkE future †

President: Adam's wife
MJ buys the company: Czech company will own SkE
no immediate changes for any customers
- at least a couple of years
  - in 5 years, no employees in UK now; all the members in Brno

Research agenda †

parallel and distributed processing of very large text corpora
building very large text corpora from the web
corpus heterogeneity and homogeneity
- what's in the corpus?
corpus evaluation
- lexicographic tasks to compare different corpora
terminology extraction
- more and more users; cheap opportunities; supporting tools for translators;
corpora and language teaching
- a book by James Thomas
language change over time

Bilingual terminology extraction by Vit Baisa †

monolingual terminology extraction is supported for 14 languages
what is a term?

unithood: grammatically defined (e.g. noun phrases) †

formalism
- 2:NN/JJ/VVG + 1: NN(head) -- English
  - Depending on the language, more complex rules are needed that take case, gender, and number agreement into account
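
The English rule above (a modifier tagged NN, JJ, or VVG followed by a head noun NN) can be illustrated with a small Python sketch. This is not the Sketch Engine term grammar itself: the function name, the (word, tag) token representation, and the sample sentence are assumptions made for illustration.

```python
# Minimal sketch of the "modifier + head noun" pattern from the notes.
# Assumption: tokens are (word, tag) pairs with the tags used in the notes.
MODIFIER_TAGS = {"NN", "JJ", "VVG"}
HEAD_TAG = "NN"


def candidate_terms(tagged_tokens):
    """Return (modifier, head) pairs where an NN/JJ/VVG token directly precedes an NN head."""
    pairs = []
    for (mod_word, mod_tag), (head_word, head_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if mod_tag in MODIFIER_TAGS and head_tag == HEAD_TAG:
            pairs.append((mod_word, head_word))
    return pairs


# Hypothetical tagged sentence, invented for this example.
sentence = [
    ("the", "DT"), ("parallel", "JJ"), ("corpus", "NN"),
    ("supports", "VVZ"), ("terminology", "NN"), ("extraction", "NN"),
]
candidates = candidate_terms(sentence)
print(candidates)
```

For languages with richer morphology, the same loop would also have to compare case, gender, and number between modifier and head.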

Statistics
- "simple maths" score with parameter N: (f(focus) + N) / (f(ref) + N), where f is the relative frequency
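
A worked example of the score may help. The sketch below assumes that relative frequency means frequency per million tokens; the notes only say "relative frequency", so that scaling, the function name, and the counts are illustrative assumptions.

```python
def simple_maths_score(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Compute (f_focus + N) / (f_ref + N).

    Assumption: f is the frequency per million tokens in each corpus.
    """
    f_focus = focus_count * 1_000_000 / focus_size
    f_ref = ref_count * 1_000_000 / ref_size
    return (f_focus + n) / (f_ref + n)


# Hypothetical counts: 50 hits in a 1M-token focus corpus,
# 100 hits in a 100M-token reference corpus, N = 1.
score = simple_maths_score(50, 1_000_000, 100, 100_000_000, n=1.0)
print(score)  # (50 + 1) / (1 + 1) = 25.5
```

A larger N damps the effect of rare items: with N = 100 the same counts give (50 + 100) / (1 + 100), roughly 1.49.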

Output:
- Download as TBX (a terminology exchange format) or CSV

Fine tuning
- stoplists
- minimum freq
- minimum score
- minimum character length
- alphanumeric character only
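
The fine-tuning options listed above amount to a chain of filters over the candidate list. The following Python sketch shows one way they could combine. The parameter names are hypothetical, and allowing spaces inside multi-word terms under the alphanumeric-only option is an assumption, not documented behaviour.

```python
def keep_candidate(term, freq, score, *, stoplist=frozenset(), min_freq=1,
                   min_score=0.0, min_length=1, alnum_only=False):
    """Return True if a term candidate passes every fine-tuning filter."""
    if term.lower() in stoplist:
        return False
    if freq < min_freq:
        return False
    if score < min_score:
        return False
    if len(term) < min_length:
        return False
    # Assumption: spaces between the words of a multi-word term are allowed.
    if alnum_only and not term.replace(" ", "").isalnum():
        return False
    return True


# Hypothetical candidates: (term, frequency, score).
raw = [
    ("parallel corpus", 12, 8.4),
    ("etc", 40, 9.0),
    ("x", 30, 7.0),
    ("co-occurrence", 9, 6.1),
    ("word sketch", 2, 5.0),
]
kept = [
    term for term, freq, score in raw
    if keep_candidate(term, freq, score, stoplist={"etc"}, min_freq=5,
                      min_score=2.0, min_length=3, alnum_only=True)
]
print(kept)
```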

Multilingual terminology extraction

Future work:
- API: sketch engine available
- plugins for SDL, Kilgray products

Parallel corpora by Jan Michelfeit (Lexical Computing Ltd) †

Features
- sentence alignment only
- mapping file showing the alignment of sentences

Preloaded corpora
- Europarl7
- DGT (4M English sentences, translation memory)
- OPUS2 (130M English sentences)
  - various documents (EU, UN, movie, Tatoeba, etc.)

TMX
- a good input format for 1:1 bilingual data
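
To show why TMX suits 1:1 bilingual data, here is a minimal Python sketch that reads sentence pairs from a TMX string using only the standard library. The sample document and the function name are invented for illustration. It relies on the standard TMX layout of tu (translation unit), tuv (one language variant), and seg elements, and says nothing about how Sketch Engine itself imports TMX.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample; real TMX headers carry more attributes.
TMX_SAMPLE = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The castle is old.</seg></tuv>
      <tuv xml:lang="de"><seg>Das Schloss ist alt.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="cs"><seg>Dobré ráno.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, src, tgt):
    """Return (source, target) segment pairs for units that contain both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


pairs = read_pairs(TMX_SAMPLE, "en", "de")
print(pairs)
```

Each translation unit already groups the aligned segments, so no separate sentence-alignment step is needed for this kind of input.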

Underused functionality †

Global attributes in CQL †

1.number = 2.number: token #1 and token #2 have the same number (plural or singular)
1.id = 2.ref ??

m4 macros
- A noun --> define(`noun', `[tag="N.*"]')
- An adjective --> define(`adj', `[tag="JJ.?"]') etc.
  - agreement rules become simpler to write once the definitions are in place
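
The effect of such macro definitions can be imitated in a few lines of Python. This is only a sketch of the idea, not a replacement for m4: it does plain whole-word substitution with no quoting rules, and the dictionary and function name are assumptions.

```python
import re

# Macro table mirroring the two definitions in the notes.
MACROS = {
    "noun": '[tag="N.*"]',
    "adj": '[tag="JJ.?"]',
}


def expand(query, macros=MACROS):
    """Replace each macro name that appears as a whole word with its CQL definition."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, macros)) + r")\b")
    # A function replacement avoids backslash interpretation in the definitions.
    return pattern.sub(lambda m: macros[m.group(1)], query)


expanded = expand("adj noun")
print(expanded)  # [tag="JJ.?"] [tag="N.*"]
```

Unlike m4, this sketch would also expand a macro name that occurs inside a quoted value, so real grammars should keep using m4 and its quoting.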

wordlist functions
- using word attributes

WebBootCaT
- get seed words from Wikipedia

New functions in CQL †

adopted by Manatee
Query language
- CQL: Oliver Christ
Useful options:
- general NOT operator; NOT within;
  - !within
  - <doc year="2010"> ! containing [word="castle"]
  - general ! <query>

Global conditions
- 1:[lemma="alt"] 2:[lemma="Schloss"] & 1.case = 2.case # agreement in case
- 1:[lemma="alt"] 2:[tag="NOUN"] & 1.head = 2.id # dependency relation
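
The agreement condition (1.case = 2.case) can be pictured as an extra equality test applied after the two labelled tokens have matched. The Python sketch below does this over a toy token list. The token dictionaries, their case values, and the function name are invented for illustration and do not reflect how Manatee evaluates queries.

```python
def find_agreeing_pairs(tokens, lemma1, lemma2, attribute):
    """Return start positions where two adjacent tokens match the lemmas
    and share the same value for the given attribute."""
    hits = []
    for i in range(len(tokens) - 1):
        first, second = tokens[i], tokens[i + 1]
        if (first["lemma"] == lemma1 and second["lemma"] == lemma2
                and first[attribute] == second[attribute]):
            hits.append(i)
    return hits


# Hypothetical annotation; the second pair deliberately disagrees in case.
tokens = [
    {"word": "alten", "lemma": "alt", "case": "Dat"},
    {"word": "Schloss", "lemma": "Schloss", "case": "Dat"},
    {"word": "alte", "lemma": "alt", "case": "Nom"},
    {"word": "Schloss", "lemma": "Schloss", "case": "Acc"},
]
hits = find_agreeing_pairs(tokens, "alt", "Schloss", "case")
print(hits)
```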

New operators
- [ws("test-n", "modifier", ".*")]
  - modifiers of test(noun)

[swap(1, ws("test-n", "modifier", ".*")) & tag=".*DAT.*"]
- with collocates being in dative
- swap is a command for changing the position of the node

[lempos~10"test-n"]
- search for the top 10 synonyms of "test-n" (synonym search)

REGEX optimization
- simple OR optimization
- prefix optimization
- n-gram prefetching
  - pre-index all character unigrams, bigrams, and trigrams
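
The idea behind n-gram prefetching can be sketched in Python: index every lexicon entry by its character 1- to 3-grams, use the n-grams of a literal substring required by the regex to narrow the candidate set, and run the regex only on what remains. This is a simplified illustration under assumptions. The caller supplies the required literal by hand, and all names are hypothetical. Manatee's actual implementation is not described in the notes.

```python
import re
from collections import defaultdict


def ngrams(text, max_n=3):
    """All character n-grams of length 1..max_n in text."""
    return {text[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)}


def build_index(lexicon, max_n=3):
    """Map each character n-gram to the set of lexicon entries containing it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in ngrams(word, max_n):
            index[gram].add(word)
    return index


def search(regex, literal, lexicon, index, max_n=3):
    """Match regex against the lexicon, prefiltering by a literal
    substring that every match must contain (supplied by the caller)."""
    if len(literal) >= max_n:
        grams = [literal[i:i + max_n] for i in range(len(literal) - max_n + 1)]
    elif literal:
        grams = [literal]
    else:
        grams = []  # no literal known: fall back to scanning everything
    candidates = set(lexicon)
    for gram in grams:
        candidates &= index.get(gram, set())
    compiled = re.compile(regex)
    return sorted(w for w in candidates if compiled.fullmatch(w))


lexicon = ["castle", "casting", "cast", "forecast", "test", "testing"]
index = build_index(lexicon)
print(search(r".*cast.*", "cast", lexicon, index))
print(search(r"test.+", "test", lexicon, index))
```

The prefilter only shrinks the candidate set. The final regex match still decides, so a loose literal costs speed but never correctness.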