Backup source of Elex2015Summary (No. 1) - Yukio Tono's course homepage


[[FrontPage]]

*SkE workshop [#cf4e1329]
-Herstmonceux Castle, East Sussex
--10 August, 2015

**Sketch engine data [#q51526ce]
-August 2015
--400 corpora for 82 languages
--100+ corpora with more than 100 million tokens each
--30+ corpora with more than 1 billion tokens each
--60+ languages with POS tagged corpus
--42 languages with word sketches
--26 languages with integrated tagger
--parallel corpora

**Sketch engine marketing [#h8a9ce82]
-ambassadors and advocates in the SkE community
-workshop

**SkE future [#zc019c4c]
-President: Adam's wife
-MJ buys the company: Czech company will own SkE
-no immediate changes to any customers
--at least a couple of years
---in 5 years: no employees in the UK; all members in Brno

**Research agenda [#t00c08cd]
-parallel and distributed processing of very large text corpora
-building very large text corpora from the web
-corpus heterogeneity and homogeneity
--what's in the corpus?
-corpus evaluation
--lexicographic tasks to compare different corpora
-terminology extraction
--more and more users; cheap opportunities; supporting tools for translators
-corpora and language teaching
--a book by James Thomas
-language change over time

**Bilingual terminology extraction by Vit Baisa [#tbcfc06f]
-monolingual terminology extraction is supported for 14 languages
-what is a term?
***unithood: grammatically defined (e.g. noun phrases) [#a99234ad]
-formalism
--2:NN/JJ/VVG + 1: NN(head) -- English
---Depending on the language, more complex rules are needed, taking into account case, gender, and number agreement
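
The English rule above can be illustrated with a minimal sketch. It assumes the notation means "a modifier tagged NN, JJ, or VVG, followed by a head noun tagged NN", and that the input is already POS-tagged as (word, tag) pairs. The names MODIFIER_TAGS, HEAD_TAGS, and extract_candidates are hypothetical and are not part of Sketch Engine.

```python
from typing import List, Tuple

# Assumed reading of the rule "2:NN/JJ/VVG + 1:NN(head)":
# position 2 is the modifier, position 1 is the head noun.
MODIFIER_TAGS = {"NN", "JJ", "VVG"}
HEAD_TAGS = {"NN"}


def extract_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return two-word term candidates whose tags match modifier + head."""
    candidates = []
    # Walk over adjacent token pairs and test both tags against the rule.
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 in MODIFIER_TAGS and t2 in HEAD_TAGS:
            candidates.append(f"{w1} {w2}")
    return candidates


# Hand-tagged toy sentence (illustrative tags, not tagger output).
sentence = [
    ("the", "DT"), ("parallel", "JJ"), ("corpus", "NN"),
    ("supports", "VVZ"), ("terminology", "NN"), ("extraction", "NN"),
]
print(extract_candidates(sentence))
# → ['parallel corpus', 'terminology extraction']
```

For languages with agreement, the pair test would also have to compare case, gender, and number features, which this sketch leaves out.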

-Statistics
--"simple maths" score with parameter N: (f(focus) + N) / (f(ref) + N), where f is relative frequency
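
A small worked sketch of the score, reading the formula as (f(focus) + N) / (f(ref) + N). The per-million normalisation and the function names relative_freq and simple_maths_score are assumptions for illustration; the notes only say that f is a relative frequency.

```python
def relative_freq(count: int, corpus_size: int, per: int = 1_000_000) -> float:
    """Frequency normalised to occurrences per `per` tokens (assumed: per million)."""
    return count * per / corpus_size


def simple_maths_score(f_focus: float, f_ref: float, n: float = 1.0) -> float:
    """Ratio of relative frequencies, smoothed by adding N to both sides."""
    return (f_focus + n) / (f_ref + n)


# A word seen 50 times in a 1M-token focus corpus
# and 20 times in a 10M-token reference corpus.
f_focus = relative_freq(50, 1_000_000)    # 50.0 per million
f_ref = relative_freq(20, 10_000_000)     # 2.0 per million
print(simple_maths_score(f_focus, f_ref, n=1.0))
# → 17.0
```

Adding N to both numerator and denominator avoids division by zero when a word is absent from the reference corpus, and a larger N reduces the advantage of rare words.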

-Output:
--Download as TBX (an exchange format for terminology) or CSV

-Fine tuning
--stoplists
--minimum freq
--minimum score
--minimum character length
--alphanumeric characters only
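
The fine-tuning options above can be sketched as a chain of filters over scored candidates. This is a hypothetical illustration: the Candidate class, the filter_candidates function, and the exact semantics of each option (for example, that a stoplist hit on any word rejects the whole term) are assumptions, not Sketch Engine's behaviour.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Candidate:
    term: str
    freq: int
    score: float


def filter_candidates(
    candidates: List[Candidate],
    stoplist: FrozenSet[str] = frozenset(),
    min_freq: int = 1,
    min_score: float = 0.0,
    min_chars: int = 1,
    alnum_only: bool = False,
) -> List[Candidate]:
    """Keep only candidates that pass every fine-tuning option."""
    kept = []
    for c in candidates:
        words = c.term.lower().split()
        if any(w in stoplist for w in words):
            continue  # assumed: any stoplisted word rejects the term
        if c.freq < min_freq or c.score < min_score:
            continue  # below the frequency or score threshold
        if len(c.term) < min_chars:
            continue  # term is too short
        if alnum_only and not all(w.isalnum() for w in words):
            continue  # contains punctuation such as hyphens
        kept.append(c)
    return kept


candidates = [
    Candidate("parallel corpus", 12, 8.5),
    Candidate("other thing", 40, 9.0),
    Candidate("x-ray", 3, 7.0),
    Candidate("web", 30, 1.2),
]
kept = filter_candidates(
    candidates,
    stoplist=frozenset({"other", "thing"}),
    min_freq=5,
    min_score=2.0,
    min_chars=4,
    alnum_only=True,
)
print([c.term for c in kept])
# → ['parallel corpus']
```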

-Multilingual terminology extraction