*SkE workshop [#cf4e1329]
-Herstmonceux Castle, East Sussex
--10 August, 2015

**Sketch engine data [#q51526ce]
-August 2015
--400 corpora for 82 languages
--100+ corpora (more than 100 million tokens)
--30+ corpora (more than 1 billion tokens)
--60+ languages with POS tagged corpus
--42 languages with word sketches
--26 languages with integrated tagger
--parallel corpora

**Sketch engine marketing [#h8a9ce82]
-ambassadors and advocates in the SkE community
-workshop

**SkE future [#zc019c4c]
-President: Adam's wife
-MJ buys the company: a Czech company will own SkE
-no immediate changes for any customers
--for at least a couple of years
---in 5 years, no employees in the UK now; all the members are in Brno

**Research agenda [#t00c08cd]
-parallel and distributed processing of very large text corpora
-building very large text corpora from the web
-corpus heterogeneity and homogeneity
--what's in the corpus?
-corpus evaluation
--lexicographic tasks to compare different corpora
-terminology extraction
--more and more users; cheap opportunities; supporting tools for translators
-corpora and language teaching
--a book by James Thomas
-language change over time

**Bilingual terminology extraction by Vit Baisa [#tbcfc06f]
-monolingual terminology extraction is supported for 14 languages
-what is a term?
***unithood: grammatically defined (e.g. noun phrases) [#a99234ad]
-formalism
--2:NN/JJ/VVG + 1: NN(head) -- English
---Depending on the language, more complex rules are needed that take case, gender, and number agreement into account
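
The English rule above (a modifier tagged NN, JJ, or VVG followed by a head noun) can be sketched in Python as a scan over tagged tokens. This is only an illustration: the function name, the exact-match tag sets, and the sample sentence are assumptions, not Sketch Engine's term grammar.

```python
# Hypothetical helper: find two-word term candidates matching
# (NN | JJ | VVG) followed by a head noun (NN), as in the English rule.
MODIFIER_TAGS = {"NN", "JJ", "VVG"}  # assumed exact tags, not regex patterns
HEAD_TAG = "NN"

def find_candidates(tagged_tokens):
    """Return (modifier, head) word pairs from a list of (word, tag) tuples."""
    candidates = []
    # Walk over adjacent token pairs and keep those that fit the pattern.
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 in MODIFIER_TAGS and t2 == HEAD_TAG:
            candidates.append((w1, w2))
    return candidates

sentence = [("the", "DT"), ("parallel", "JJ"), ("corpus", "NN"),
            ("supports", "VVZ"), ("word", "NN"), ("sketches", "NNS")]
print(find_candidates(sentence))
```

Languages with richer morphology would additionally need an agreement check between the two tokens, which this sketch leaves out.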

-Statistics
--"simple maths" score with parameter N: (f(focus) + N) / (f(ref) + N), where f is the relative frequency
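
A minimal sketch of the score above, assuming that the relative frequency f is normalized per million tokens (the notes only say "relative frequency"). The function and parameter names are hypothetical.

```python
def simple_maths_score(focus_count, focus_size, ref_count, ref_size, n=1.0):
    """Compute (f(focus) + N) / (f(ref) + N).

    f is taken as frequency per million tokens (an assumption);
    n is the smoothing parameter N from the notes.
    """
    f_focus = focus_count * 1_000_000 / focus_size
    f_ref = ref_count * 1_000_000 / ref_size
    return (f_focus + n) / (f_ref + n)

# A word seen 50 times in a 1M-token focus corpus and
# 100 times in a 100M-token reference corpus.
print(simple_maths_score(50, 1_000_000, 100, 100_000_000))
```

Adding N to both sides keeps the ratio defined when a candidate never occurs in the reference corpus, and a larger N shifts the ranking toward more frequent words.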

-Output:
--Download as TBX (exchange format for terminology) or CSV

-Fine tuning
--stoplists
--minimum freq
--minimum score
--minimum character length
--alphanumeric characters only
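
The fine-tuning options listed above amount to a chain of filters over term candidates. The sketch below shows one possible reading; the function name, the default thresholds, and the treatment of spaces in multi-word terms are assumptions, not Sketch Engine settings.

```python
def keep_candidate(term, freq, score, stoplist,
                   min_freq=5, min_score=2.0, min_length=3,
                   alnum_only=True):
    """Return True if a term candidate passes all fine-tuning filters."""
    if term.lower() in stoplist:          # stoplist
        return False
    if freq < min_freq:                   # minimum frequency
        return False
    if score < min_score:                 # minimum score
        return False
    if len(term) < min_length:            # minimum character length
        return False
    # Alphanumeric only; spaces inside multi-word terms are allowed here.
    if alnum_only and not term.replace(" ", "").isalnum():
        return False
    return True

stoplist = {"the", "and"}
print(keep_candidate("word sketch", freq=12, score=4.1, stoplist=stoplist))
print(keep_candidate("e-mail", freq=12, score=4.1, stoplist=stoplist))
```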

-Multilingual terminology extraction

-Future work:
--API: sketch engine available
--plugins for SDL, Kilgray products

**Parallel corpora by Jan Michelfeit (Lexical Computing Ltd) [#sb12f7a7]

-Features
--sentence alignment only
--mapping file showing the alignment of sentences

-Preloaded corpora
--Europarl7
--DGT (4M English sentences, translation memory)
--OPUS2 (130M English sentences)
---various documents (EU, UN, movies, Tatoeba, etc.)

-TMX
--a good input format for 1:1 bilingual data
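
To illustrate why TMX suits 1:1 bilingual data, here is a small sketch that reads sentence pairs from a TMX string using only the Python standard library. The sample document and the function name are made up for this example; real TMX files carry more header metadata and may contain inline markup inside segments.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The castle is old.</seg></tuv>
      <tuv xml:lang="de"><seg>Das Schloss ist alt.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def read_pairs(tmx_text, src="en", tgt="de"):
    """Return (source, target) segment pairs from each translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline elements.
                segs[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs

print(read_pairs(SAMPLE_TMX))
```

Each translation unit already pairs one source segment with one target segment, so no separate alignment step is needed.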

*Underused functionality [#x27f9b37]

**Global attributes in CQL [#ofbdc06f]

-1.number = 2.number: tokens #1 and #2 have the same number (singular or plural)
-1.id = 2.ref			??

-m4 macros
--A noun --> define(`noun', `[tag="N.*"]')
--An adjective --> define(`adj', `[tag="JJ.?"]')  etc.
---agreement rules become simpler once these macros are defined

-wordlist functions
--using word attributes

-WebBootCaT
--get seed words from Wikipedia

**New functions in CQL [#ubc5d8c1]

-adopted by Manatee
-Query language
--CQL: Oliver Christ
-Useful options:
--general NOT operator; NOT within;
---!within
---<doc year="2010"> ! containing [word="castle"]
---general ! <query>

-Global conditions
--1:[lemma="alt"] 2:[lemma="Schloss"] & 1.case = 2.case # agreement in case
--1:[lemma="alt"] 2:[tag="NOUN"] & 1.head = 2.id # dependency relation

-New operators
--[ws("test-n", "modifier", ".*")]
---modifiers of test(noun)

--[swap(1, ws("test-n", "modifier", ".*")) & tag=".*DAT.*"]
---with collocates being in dative
---swap is a command for changing the position of the node

--[lempos~10"test-n"]
---search for the top 10 synonyms of "test-n" (synonym search)

-REGEX optimization
--simple OR optimization
--prefix optimization
--n-gram prefetching
---pre-index all character unigrams, bigrams, and trigrams
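
One way to read "n-gram prefetching" is as a character n-gram index over the lexicon that narrows the candidates before the regular expression runs. The sketch below follows that reading. It is an assumption about the idea, not Manatee's actual implementation, and every name in it is hypothetical. The caller supplies a literal of up to three characters that every match must contain.

```python
import re
from collections import defaultdict

def char_ngrams(word, max_n=3):
    """All character unigrams, bigrams, and trigrams of a word."""
    return {word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)}

def build_index(lexicon, max_n=3):
    """Map each character n-gram to the set of words that contain it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in char_ngrams(word, max_n):
            index[gram].add(word)
    return index

def search(index, lexicon, literal, pattern):
    """Prefilter by a required literal (at most 3 chars), then run the regex."""
    candidates = index.get(literal, set()) if literal else set(lexicon)
    regex = re.compile(pattern)
    return sorted(w for w in candidates if regex.fullmatch(w))

lexicon = ["castle", "cast", "test", "testing", "tester", "last"]
index = build_index(lexicon)
print(search(index, lexicon, "est", r"test.*"))
```

Only the words that contain the literal are checked against the regex, which is where the saving comes from on a large lexicon.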


