
CEFR-J Reference Level Descriptions (RLDs)

What is RLD?

Corpora used for our RLD work

Textbook Corpus as INPUT

  • CEFR Course Book Corpus:
    • internal resource
    • 96 CEFR-based course books published in the UK, with CEFR level classifications
    • 1,801,549 running words
  • English Textbook Corpus in Japan
    • internal resource
    • 7 junior high textbooks and 21 senior high school textbooks (Grades 1-3 each)
    • 1,158,525 running words

Learner Corpus as OUTPUT

  • Spoken:
    • NICT JLE Corpus (CEFR-aligned version) [ website ]
      • A corpus of transcripts from the Standard Speaking Test (SST), an English oral interview test by ALC Press
      • 1,281 examinees' interview transcripts with SST levels
      • We created a CEFR-aligned version of the NICT JLE Corpus.
      • 763,289 words (only interviewees' utterances)
  • Written:
    • JEFLL Corpus (CEFR-level classified version) [ Information ]
      • A corpus of free compositions by Japanese junior and senior high school students.
      • 10,038 samples
      • 669,281 running words

CEFR-J Wordlist


  • CEFR-J Wordlist Version 1.6
    • A list of 7,801 items classified by the CEFR (A1 to B2) levels.
    • Each item has the following information:
      • headword (lemma form)
      • part of speech
      • CEFR level
      • thematic categories defined by the British Council/EAQUALS Core Inventory for General English and Threshold Levels 1990 (Council of Europe)


How to cite:

  • Tono, Y. (2017). The CEFR-J and its Impact on English Language Teaching in Japan. JACET International Convention Selected Papers, Volume 4, pp. 31-52. JACET.

CEFR-J Collocation Dataset


  • First release (September, 2022)
  • Collocation list based on the CEFR-J Wordlist Ver. 1.6
  • Syntactic frame-based collocation pairs extracted from the BNC (dependency-parsed with Stanza)

Dataset information:

  • Each collocation pair has the following information:
    • w1: collocate
    • w2: node
    • w1_CEFR: CEFR level of w1
    • w2_CEFR: CEFR level of w2
    • relation: dependency relation
    • cooccurrence: collocation frequency
    • freq_w1: independent frequency of w1 in the entire BNC
    • freq_w2: independent frequency of w2 in the entire BNC
    • w1_in_rel: frequency of w1 in the given dependency relation
    • w2_in_rel: frequency of w2 in the given dependency relation
    • DP: dispersion measure DP (Gries)
    • expected_freq: expected frequencies
    • Association measures for this given collocation pair:
      • MI/ MI2/ MI3/ t_score/ z_score/ logDice/ log_likelihood/ chi_squared
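The fields above can be related to each other with the standard textbook formulas for these measures. The following is a minimal sketch, not the script used to build the dataset: it assumes the expected frequency is computed from two marginal frequencies and a total count n, and the page does not say whether the dataset uses whole-corpus frequencies (freq_w1, freq_w2) or in-relation frequencies (w1_in_rel, w2_in_rel) for this. The function names, the argument names and the numbers in the usage lines are all hypothetical.

```python
import math


def association_measures(o, f1, f2, n):
    """Score one pair from co-occurrence o, marginals f1 and f2, and total n."""
    e = f1 * f2 / n  # expected co-occurrence if w1 and w2 were independent
    # 2x2 contingency table: observed cells and their expected counts
    observed = [o, f1 - o, f2 - o, n - f1 - f2 + o]
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    log_likelihood = 2 * sum(
        ob * math.log(ob / ex) for ob, ex in zip(observed, expected) if ob > 0
    )
    chi_squared = sum((ob - ex) ** 2 / ex for ob, ex in zip(observed, expected))
    return {
        "expected_freq": e,
        "MI": math.log2(o / e),
        "MI2": math.log2(o ** 2 / e),
        "MI3": math.log2(o ** 3 / e),
        "t_score": (o - e) / math.sqrt(o),
        "z_score": (o - e) / math.sqrt(e),
        "logDice": 14 + math.log2(2 * o / (f1 + f2)),
        "log_likelihood": log_likelihood,
        "chi_squared": chi_squared,
    }


def dispersion_dp(part_freqs, part_sizes):
    """Gries's DP: 0 means perfectly even spread across corpus parts."""
    total_freq = sum(part_freqs)
    total_size = sum(part_sizes)
    return 0.5 * sum(
        abs(f / total_freq - s / total_size)
        for f, s in zip(part_freqs, part_sizes)
    )


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
scores = association_measures(o=8, f1=64, f2=128, n=8192)
even_spread = dispersion_dp([5, 5], [100, 100])
```

To score against in-relation frequencies instead, pass w1_in_rel and w2_in_rel as f1 and f2 and use the total number of pairs in that relation as n. Which of the two the dataset itself uses is an open question here.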


  • ADJ+NOUN (amod): 135,939 pairs [ download ]
  • VERB+NOUN (obj): 114,582 pairs [ download ]
  • NOUN+NOUN (nounmod): 72,340 pairs [ download ]
  • ADVERB+VERB (advmod verb): 43,992 pairs [ download ]
  • ADVERB+ADJ (advmod adj): 16,180 pairs [ download ]


  • This dataset was created by Kohei Fukuda, a postgraduate student in my lab.

How to cite:

  • Fukuda, K. & Tono, Y. (2022). The CEFR-J Collocation Dataset Version 1.0. Tono Lab, TUFS. (this URL)

CEFR-J Grammar Profile


  • An inventory of grammar items classified by CEFR levels
  • Profiling was based on INPUT (ELT Course Book Corpus) as well as OUTPUT (Spoken and Written Learner Corpus)

A list of grammar items and their REGEX queries

  • The following Excel file describes the 263 grammar items investigated and their REGEX queries
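
To illustrate how a REGEX query can turn a grammar item into a frequency, here is a minimal sketch in Python. The two patterns, the item names and the word_TAG input format (Penn Treebank-style tags) are assumptions made for this example. They are not taken from the Excel file, whose actual queries and tag format may differ.

```python
import re

# Hypothetical queries over text tagged as word_TAG (illustration only).
GRAMMAR_QUERIES = {
    # have/has + optional adverb + past participle
    "present_perfect": r"\b(?:have|has)_VB[PZ]\s+(?:\w+_RB\s+)?\w+_VBN\b",
    # am/is/are + going to + base verb
    "be_going_to": r"\b(?:am|is|are)_VB[PZ]\s+going_VBG\s+to_TO\s+\w+_VB\b",
}


def count_grammar_items(tagged_text, queries=GRAMMAR_QUERIES):
    """Return the number of matches of each query in one tagged text."""
    return {
        item: len(re.findall(pattern, tagged_text, flags=re.IGNORECASE))
        for item, pattern in queries.items()
    }


sample = (
    "I_PRP have_VBP never_RB seen_VBN it_PRP ._. "
    "She_PRP has_VBZ gone_VBN ._. "
    "We_PRP are_VBP going_VBG to_TO leave_VB ._."
)
counts = count_grammar_items(sample)
```

Running one such query per grammar item over every text gives per-text counts like those described under "Original dataset".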

Grammar Profile for Teachers and Learners

  • A user-friendly version of the Grammar Profile
  • A visual display of the CEFR levels at which particular grammar items are introduced, based on the distribution of grammar items across position-based CEFR course books.

Original dataset

  • Frequencies of the 263 grammar items were obtained from the following corpora. The corpora themselves cannot be redistributed due to copyright restrictions, but the frequency data from each text will be made publicly available.
  • CEFR-based ELT Course Books: Frequency of 263 grammar items in CEFR-classified course books
  • CEFR-based ELT Course Books (Position-based): Frequency of 263 grammar items in the course books divided into two or three parts in order to examine the detailed occurrences in CEFR sub-levels.
  • CEFR-based ELT Course Books (Skill-based): Frequency of 263 grammar items in the course books divided into sections focusing on the four skills (listening/reading/speaking/writing).
  • Written Learner Corpus: Frequency of 263 grammar items in the JEFLL Corpus, a corpus of 10,000 Japanese EFL learners' 20-minute in-class free compositions. Frequencies were obtained in both the original student writings and the versions corrected by native speakers.
  • Spoken Learner Corpus: Frequency of 263 grammar items in NICT JLE Corpus, a corpus of oral interviews by Japanese EFL learners (1,281 samples)

English Level Checker

  • About
    • A tool developed by the Okumura Lab at Tokyo Institute of Technology.
    • The site lists the grammar items found in the text you input, along with other lexical measures and a final CEFR level judgement.
    • It has been trained on both textbook and essay data. For essay input, errors in the text can be detected automatically and corrections suggested.

How to cite:

  • Ishii, Y. & Tono, Y. (2018). Investigating Japanese EFL learners' overuse/underuse of English grammar categories and their relevance to CEFR levels. Proceedings of the 4th Asia Pacific Corpus Linguistics Conference, (Edited by Y. Tono and H. Isahara), pp. 160-165.

CEFR-J Text Profile


  • Text Profile is a list of textual characteristics and their values obtained from the analysis of CEFR-classified texts
  • The CEFR-J Text Profile was constructed mainly by our project member Dr Satoru Uchida (Kyushu University), together with the Yuki Arase Lab at Osaka University and the Sachio Hirokawa Lab at Kyushu University.

Text profile measures

  • Common measures:
    • word length (1 to 3 letters)
    • word length (4 to 6 letters)
    • word length (7 letters +)
    • average word length
    • types
    • TTR
    • mean length of sentences
  • Lexical profile measures:
    • Average difficulty
    • A1_per
    • A2_per
    • B1_per
    • B2_per
    • C1_per
    • C2_per
  • Complexity measures:
    • sum_D_score
    • avg_D_score
    • sum_L_score
    • avg_L_score
    • avg_MaxDepth?
    • D_score: depth x difficulty level
    • L_score: depth x word length
  • Grammatical measures:
    • avg_[G-item]
    • [G-item]_per


  • Text profile metrics and their values for CEFR Course Book Corpus (All / Skill-based / Position-based)


  • About
    • CEFR-based Vocabulary Level Analyzer by Satoru Uchida at Kyushu University
    • A tool to report the CEFR levels of vocabulary used in the input text along with other text profile measures and the estimated CEFR level.

How to cite:

  • Uchida, S. & Negishi, M. (2018). Assigning CEFR-J levels to English texts based on textual features. Proceedings of the 4th Asia Pacific Corpus Linguistics Conference (APCLC 2018), (Edited by Y. Tono and H. Isahara), pp. 463-467.
  • Uchida, S. (2015). A CEFR-based Textbook Corpus: An attempt to reveal linguistic features of CEFR levels (original in Japanese). English Corpus Studies, 22, 87-99.

Major RLD projects for English

British Council/EAQUALS Core Inventory for General English

English Profile:

Global Scale of English by Pearson

Before CEFR

  • Threshold Level Series ("T-series")
    • You can access the original T-series books from here

Last-modified: 2022-10-16 (Sun) 12:04:19