SkE workshop

  • Herstmonceux Castle, East Sussex
    • 10 August, 2015

Sketch engine data

  • August 2015
    • 400 corpora for 82 languages
    • 100+ corpora (more than 100 million tokens)
    • 30+ corpora (more than 1 billion tokens)
    • 60+ languages with POS-tagged corpora
    • 42 languages with word sketches
    • 26 languages with integrated tagger
    • parallel corpora

Sketch engine marketing

  • ambassadors and advocates in the SkE community
  • workshop

SkE future

  • President: Adam's wife
  • MJ buys the company: Czech company will own SkE
  • no immediate changes to any customers
    • at least a couple of years
      • within 5 years: no employees in the UK; all staff in Brno

Research agenda

  • parallel and distributed processing of very large text corpora
  • building very large text corpora from the web
  • corpus heterogeneity and homogeneity
    • what's in the corpus?
  • corpus evaluation
    • lexicographic tasks to compare different corpora
  • terminology extraction
    • more and more users; cheap opportunities; supporting tools for translators;
  • corpora and language teaching
    • a book by James Thomas
  • language change over time

Bilingual terminology extraction by Vit Baisa

  • monolingual terminology extraction is supported for 14 languages
  • what is a term?

unithood: grammatically defined (e.g. noun phrases)

  • formalism
    • 2:NN/JJ/VVG + 1: NN(head) -- English
      • Depending on the language, more complex rules are needed that take case, gender, and number agreement into account
  • Statistics
    • simple maths with parameter N: score = (f(focus) + N) / (f(ref) + N), where f is relative frequency
  • Output:
    • Download as TBX (terminology exchange format) or CSV
  • Fine tuning
    • stoplists
    • minimum freq
    • minimum score
    • minimum character length
    • alphanumeric characters only
  • Multilingual terminology extraction
  • Future work:
    • API: sketch engine available
    • plugins for SDL, Kilgray products
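The scoring formula and the fine-tuning options listed above can be sketched in a few lines. This is a minimal illustration, assuming per-million relative frequencies and N = 1; the function names, the sample counts, and the interpretation of "alphanumeric only" are hypothetical and are not Sketch Engine's API.

```python
import re


def per_million(count, corpus_size):
    """Relative frequency: occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def simple_maths(f_focus, f_ref, n=1.0):
    """Score = (f_focus + N) / (f_ref + N), with f_* as relative frequencies."""
    return (f_focus + n) / (f_ref + n)


def extract_terms(focus_counts, ref_counts, focus_size, ref_size, *, n=1.0,
                  stoplist=frozenset(), min_freq=1, min_score=1.0,
                  min_length=1, alnum_only=False):
    """Score candidates against the reference corpus, then apply the filters."""
    results = []
    for candidate, count in focus_counts.items():
        if candidate in stoplist or count < min_freq or len(candidate) < min_length:
            continue
        # Assumed reading of "alphanumeric only": ASCII letters, digits, spaces.
        if alnum_only and not re.fullmatch(r"[A-Za-z0-9 ]+", candidate):
            continue
        score = simple_maths(per_million(count, focus_size),
                             per_million(ref_counts.get(candidate, 0), ref_size), n)
        if score >= min_score:
            results.append((candidate, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Made-up counts, for illustration only.
focus = {"word sketch": 50, "last year": 40, "the": 5000}
reference = {"word sketch": 1, "last year": 400, "the": 60000}
terms = extract_terms(focus, reference, 1_000_000, 10_000_000,
                      stoplist={"the"}, min_freq=5, min_score=1.5, alnum_only=True)
print(terms)
```

A larger N shifts the ranking towards more frequent candidates, while a smaller N favours rare ones, which is why it is exposed as a parameter here.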

Parallel corpora by Jan Michelfeit (Lexical Computing Ltd)

  • Features
    • sentence alignment only
    • mapping file showing the alignment of sentences
  • Preloaded corpora
    • Europarl7
    • DGT (4M English sentences, translation memory)
    • OPUS2 (130M English sentences)
      • various documents (EU, UN, movie, Tatoeba, etc.)
  • TMX
    • a good input format for 1:1 bilingual data
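To make the 1:1 structure of TMX concrete, the sketch below reads translation units into sentence pairs using only the Python standard library. The sample document is a deliberately minimal fragment (a real TMX header carries more attributes), and `read_tmx_pairs` is a hypothetical helper, not part of Sketch Engine.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep only units that contain both requested languages.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The castle is old.</seg></tuv>
      <tuv xml:lang="de"><seg>Das Schloss ist alt.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="de"><seg>Guten Morgen.</seg></tuv>
    </tu>
  </body>
</tmx>"""

for source, target in read_tmx_pairs(SAMPLE_TMX, "en", "de"):
    print(source, "|", target)
```

Each translation unit already pairs one source segment with one target segment, so no separate alignment step is needed for this kind of input.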

Underused functionality

Global attributes in CQL

  • 1.number = 2.number: token #1 and token #2 have the same number (plural or singular)
  • 1.id = 2.ref ??
  • m4 macros
    • A noun --> define(`noun', `[tag="N.*"]')
    • An adjective --> define(`adj', `[tag="JJ.?"]') etc.
      • agreement rules become simpler once the macros are defined
  • wordlist functions
    • using word attributes
  • WebBootCaT
    • get seed words from Wikipedia

New functions in CQL

  • adopted by Manatee
  • Query language
    • CQL: Oliver Christ
  • Useful options:
    • general NOT operator; NOT within
      • !within
      • <doc year="2010"> ! containing [word="castle"]
      • general ! <query>
  • Global conditions
    • 1:[lemma="alt"] 2:[lemma="Schloss"] & 1.case = 2.case # agreement in case
    • 1:[lemma="alt"] 2:[tag="NOUN"] & 1.head = 2.id # dependency relation
  • New operators
    • [ws("test-n", "modifier", ".*")]
      • modifiers of test(noun)
    • [swap(1, ws("test-n", "modifier", ".*")) & tag=".*DAT.*"]
      • with collocates in the dative
      • a command for changing the position of the node
    • [lempos~10"test-n"]
      • search for the top 10 synonyms of "test-n" (synonym search)
  • REGEX optimization
    • simple OR optimization
    • prefix optimization
    • n-gram prefetching
      • pre-index all character unigrams, bigrams, and trigrams
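The idea behind n-gram prefetching can be sketched as follows: index the lexicon by character n-grams, use a literal substring of the pattern to narrow down the candidates, and run the regular expression only on those. This is an illustration of the general technique under simplifying assumptions (trigrams only, and the required literal is supplied by the caller rather than extracted from the pattern); it is not Manatee's implementation.

```python
import re
from collections import defaultdict


def build_trigram_index(lexicon):
    """Map each character trigram to the ids of lexicon entries containing it."""
    index = defaultdict(set)
    for word_id, word in enumerate(lexicon):
        for i in range(len(word) - 2):
            index[word[i:i + 3]].add(word_id)
    return index


def regex_lookup(pattern, required_literal, lexicon, index):
    """Evaluate `pattern` only on entries containing every trigram of
    `required_literal`, a substring that every match must contain."""
    candidates = None
    for i in range(len(required_literal) - 2):
        ids = index.get(required_literal[i:i + 3], set())
        candidates = ids if candidates is None else candidates & ids
    if candidates is None:
        # Literal shorter than three characters: fall back to a full scan.
        candidates = range(len(lexicon))
    compiled = re.compile(pattern)
    return sorted(lexicon[i] for i in candidates if compiled.fullmatch(lexicon[i]))


lexicon = ["castle", "castles", "cast", "newcastle", "hassle", "test"]
index = build_trigram_index(lexicon)
print(regex_lookup(r".*castle", "castle", lexicon, index))
```

Here only three of the six entries survive the trigram filter, so the regular expression is evaluated three times instead of six; on a lexicon with millions of entries the saving is correspondingly larger.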

Ted Briscoe Keynote

Applications

CLC

  • 40 M words with 80 error types coded
  • 24 different exam types
  • CEFR levels benchmarked to exam grades

Error-coded script

  • Transcribe
  • Error coding
    • <M|our having> M:missing

Parse it with RASP

  • Parsing errors: use these for machine learning
  • dependency grammar: representation of texts

Overall assessment of proficiency

  • textual features which are proxies for writing competence
  • predict a grade

Machine learning

  • Training : CLC
    • Classification: predict one of a fixed number of labels (e.g. pass vs. fail)
    • Regression: try to predict a real number
    • Ranking: A>B>C (e.g. rank the web pages)
  • Features
    • script length
    • errors
    • n-grams
    • parse trees
  • Rank preference --> prediction

Pairwise Ranking SVM:

  • Joachims 2002
  • Learn an optimal ranking function that explicitly models the grade relationships
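The pairwise idea can be sketched without an SVM library: every pair of scripts with different grades becomes a difference vector that the model should score positively. The code below is a rough approximation that trains a linear model with hinge loss and plain subgradient descent rather than a real SVM solver; the two features (scaled script length and error rate), the grades, and the hyperparameters are made up for illustration.

```python
def pairwise_differences(features, grades):
    """For every pair where script i has a higher grade than script j,
    emit the feature difference x_i - x_j."""
    diffs = []
    for i in range(len(features)):
        for j in range(len(features)):
            if grades[i] > grades[j]:
                diffs.append([a - b for a, b in zip(features[i], features[j])])
    return diffs


def train_ranker(features, grades, epochs=200, lr=0.1, reg=0.01):
    """Learn weights w such that w . (x_i - x_j) >= 1 for ordered pairs,
    using hinge loss with L2 regularisation."""
    diffs = pairwise_differences(features, grades)
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            for k in range(len(w)):
                # Subgradient: regulariser, plus the hinge term if violated.
                grad = reg * w[k] - (d[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w


def score(w, x):
    """Ranking score of one script; higher means better predicted grade."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical features: [scaled script length, error rate].
scripts = [[0.2, 0.9], [0.5, 0.5], [0.9, 0.1]]
grades = [1, 2, 3]
weights = train_ranker(scripts, grades)
ranked = sorted(range(len(scripts)), key=lambda i: score(weights, scripts[i]), reverse=True)
print(ranked)
```

Only the relative order of the grades enters training, which is what distinguishes this setup from regression on the grade values themselves.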

Features

  • word sequences
  • POS sequences
  • Grammatical constructions
  • Other features
    • error rate estimate (against n-gram corpora)
    • readability score
  • Correlation between human grades and the system-predicted grades
    • Upper bound = 0.796 (word+POS+readability+constructions+error rate)
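Agreement between human and predicted grades can be measured with a correlation coefficient. The notes do not say which coefficient was reported, so the sketch below assumes Pearson's r; the grade lists are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up grades: human examiner vs. system prediction.
human = [1, 2, 3, 4, 5]
predicted = [1.5, 1.8, 3.2, 3.9, 5.1]
print(round(pearson(human, predicted), 3))
```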

Highly ranked features

  • 100,000 features
    • see images

Error detection and correction suggestions

  • ensure high accuracy and reasonable coverage
  • corpus-derived rules
    • error rules from CLC
    • detect incorrect word sequences
    • keep sequences that are incorrect in at least 90% of their occurrences
  • online dictionary-derived rules
  • Precision: 90%, Recall: 10% (effect on learning rate)
  • Sentence evaluation
    • Limited linguistic evidence that can be extracted automatically
    • difficulty in acquiring
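The corpus-derived rule selection and the precision/recall trade-off above can be sketched together. This is a minimal illustration, assuming each word sequence in an error-coded corpus is observed together with a flag saying whether that occurrence was coded as an error; the function names and all counts are hypothetical.

```python
from collections import Counter


def select_error_rules(occurrences, min_error_ratio=0.9, min_count=1):
    """Keep word sequences coded as errors in at least
    `min_error_ratio` of their occurrences."""
    total, errors = Counter(), Counter()
    for ngram, is_error in occurrences:
        total[ngram] += 1
        if is_error:
            errors[ngram] += 1
    return {ngram for ngram, count in total.items()
            if count >= min_count and errors[ngram] / count >= min_error_ratio}


def precision_recall(flagged, gold):
    """Precision and recall of flagged error positions against gold errors."""
    true_positives = len(flagged & gold)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up observations: (word sequence, was this occurrence coded as an error?)
observations = (
    [("discuss about", True)] * 9 + [("discuss about", False)]
    + [("many informations", True)] * 5 + [("many informations", False)] * 5
    + [("in the morning", False)] * 10
)
rules = select_error_rules(observations)
print(rules)

# Positions flagged by the rules vs. positions of all real errors.
print(precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6, 7}))
```

A strict threshold keeps only rules that are almost always right, which raises precision at the cost of recall, consistent with the high-precision, low-recall figures noted above.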

Incremental Semantic Analysis (ISA)

  • Distributional semantics
    • fully incremental variation of Random Indexing
    • Similarity between words measured by context vectors
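For orientation, here is a sketch of basic Random Indexing, the method that ISA is described as a variation of. It is not ISA itself: each word gets a fixed sparse random index vector, and a word's context vector is the running sum of the index vectors of its neighbours, so the model can be updated one sentence at a time. The class name, dimensionality, window size, and toy sentences are all assumptions.

```python
import math
import random
from collections import defaultdict


class RandomIndexing:
    def __init__(self, dim=64, nonzeros=4, window=1, seed=0):
        self.dim, self.nonzeros, self.window = dim, nonzeros, window
        self.rng = random.Random(seed)
        self.index = {}                                 # fixed random index vectors
        self.context = defaultdict(lambda: [0] * dim)   # accumulated context vectors

    def _index_vector(self, word):
        """Sparse ternary vector: a few random positions set to +1 or -1."""
        if word not in self.index:
            vec = [0] * self.dim
            for pos in self.rng.sample(range(self.dim), self.nonzeros):
                vec[pos] = self.rng.choice((-1, 1))
            self.index[word] = vec
        return self.index[word]

    def update(self, sentence):
        """Incrementally add each neighbour's index vector to a word's context vector."""
        for i, word in enumerate(sentence):
            lo = max(0, i - self.window)
            hi = min(len(sentence), i + self.window + 1)
            for j in range(lo, hi):
                if j != i:
                    neighbour = self._index_vector(sentence[j])
                    vec = self.context[word]
                    for k in range(self.dim):
                        vec[k] += neighbour[k]

    def similarity(self, a, b):
        """Cosine similarity between two context vectors (0.0 if undefined)."""
        va, vb = self.context.get(a), self.context.get(b)
        if va is None or vb is None:
            return 0.0
        dot = sum(x * y for x, y in zip(va, vb))
        norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
        return dot / norm if norm else 0.0


model = RandomIndexing()
for sentence in (["the", "cat", "sleeps"], ["the", "dog", "sleeps"]):
    model.update(sentence)
print(model.similarity("cat", "dog"))
```

In this toy corpus "cat" and "dog" occur in identical contexts, so their context vectors coincide; with real data the similarity reflects how much the contexts of two words overlap.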

Future work

  • prompt relevance
  • task achievement
  • discourse organization feedback
  • L1-specific feedback
  • content word error detection and correction suggestions
  • link to online dictionary/thesaurus
  • link to courseware multiple choice exercises

Last-modified: 2015-08-12 (Wed) 05:17:55