RandomForest

Notes on processing the LCR2011 data

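From about 13,000 learner essays, I extracted the per-essay frequencies of the 50 most frequent omission and addition errors
I want to find out whether these errors help in estimating the CEFR level of an essay
This time I use random forest
The variable importance plot in particular looks promising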

library(gdata)                       # provides read.xls (needs Perl)
lc <- read.xls("lcr2011data.xls")
install.packages("randomForest")     # only needed once
library(randomForest)

# Data format: a factor variable must not have more than 32 levels.
# The subject ID column must be left out, otherwise an error occurs

> lc.rf <-randomForest(Grade~.,data=lc)
 Error in randomForest.default(m, y, ...) : 
 Can not handle categorical predictors with more than 32 categories.
> str(lc)
'data.frame':	100 obs. of  58 variables:
$ Subject    : Factor w/ 12278 levels "A1090723176071L-O.xml",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Score      : int  95 97 101 82 97 111 94 111 113 95 ...
$ Grade      : Factor w/ 5 levels "A1","A2","B1",..: 2 1 2 1 2 2 2 3 2 3 ...
$ a.add      : int  0 0 0 0 1 0 0 0 1 0 ...
$ a.om       : int  0 0 1 1 0 0 1 0 0 1 ...
$ am.add     : int  0 0 0 0 0 0 0 0 0 0 ...
$ and.add    : int  0 0 0 0 0 0 0 0 0 0 ...
$ and.om     : int  0 0 0 0 0 0 0 0 0 0 ...
$ are.add    : int  0 0 0 0 0 0 0 0 0 0 ...
$ are.om     : int  0 0 0 0 0 0 0 0 0 0 ...

# Drop the subject ID column and rebuild the data set
lc2 <-lc[,2:58]
> str(lc2)
'data.frame':	100 obs. of  57 variables:
$ Score      : int  95 97 101 82 97 111 94 111 113 95 ...
$ Grade      : Factor w/ 5 levels "A1","A2","B1",..: 2 1 2 1 2 2 2 3 2 3 ...
$ a.add      : int  0 0 0 0 1 0 0 0 1 0 ...
$ a.om       : int  0 0 1 1 0 0 1 0 0 1 ...
$ am.add     : int  0 0 0 0 0 0 0 0 0 0 ...

This time it works
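Instead of dropping the ID column by position, the offending columns can also be found programmatically. The following is a minimal sketch, not part of the original workflow: `too_many_levels` is a hypothetical helper, `toy` is made-up data standing in for `lc`, and the limit of 32 is taken from the error message above.

```r
# Hypothetical helper: names of factor columns with more than max_levels levels
too_many_levels <- function(df, max_levels = 32) {
  bad <- vapply(df,
                function(col) is.factor(col) && nlevels(col) > max_levels,
                logical(1))
  names(df)[bad]
}

# Toy data standing in for lc: an ID-like factor plus ordinary columns
toy <- data.frame(
  Subject = factor(sprintf("S%03d", 1:100)),  # 100 distinct levels, like an ID
  Grade   = factor(rep(c("A1", "A2", "B1", "B2", "C1"), 20)),
  a.add   = rep(0:1, 50)
)

bad_cols <- too_many_levels(toy)                          # columns to drop
toy2 <- toy[, setdiff(names(toy), bad_cols), drop = FALSE]
str(toy2)
```

Selecting by name rather than by `2:58` keeps the code working if the column order of the spreadsheet changes.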

Unsupervised classification
No response variable is given

> lc2.rf = randomForest(lc2,ntree=200)
> lc2.rf

Call:
randomForest(x = lc2, ntree = 200) 
               Type of random forest: unsupervised
                    Number of trees: 200
No. of variables tried at each split: 7

MDSplot(lc2.rf, fac=lc2[,2], cex=2, pch=as.numeric(lc2[,2]))

It ran on the 100-case test, but on the full data of about 13,000 cases it failed with an error saying that a 1.1 GB vector could not be allocated

> lc2.rf = randomForest(lc2,ntree=200)
Error: cannot allocate vector of size 1.1 Gb
R(1016,0xa055b540) malloc: *** mmap(size=1205997568) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
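The size in the message fits an n x n matrix of doubles for n = 12,278, the number of subject levels shown by str() above. The short calculation below assumes that the failing allocation is the proximity matrix, which randomForest computes by default in unsupervised mode; the notes themselves do not say which object failed.

```r
n <- 12278                      # number of essays (levels of Subject)
bytes_per_double <- 8           # R stores numeric values as 8-byte doubles

# An n x n proximity matrix needs n^2 doubles
prox_bytes <- n^2 * bytes_per_double   # 1,205,994,272 bytes
prox_gib   <- prox_bytes / 1024^3      # roughly 1.12, reported by R as 1.1 Gb

# The same estimate for a 2,000-case sample
small_mib <- 2000^2 * bytes_per_double / 1024^2   # roughly 30.5 MiB
```

The mmap size in the log (1205997568) is this figure rounded up to a whole number of 4,096-byte pages. Because the cost grows with the square of the number of cases, cutting the data to 2,000 cases reduces the matrix to about 30 MB.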

Draw samples of 2,000 cases

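set.seed(1)
sam<-sample(1:12278,2000)  # 2,000 cases drawn at random from 12,278
sam2<-sample(setdiff(1:12278,sam),2000)  # 2,000 more, excluding the training cases so the two sets do not overlap
train<-lc2[sam,]  # training set
test<-lc2[sam2,]  # test set
train<-na.omit(train)  # sampling can yield rows with NA, so drop rows with missing values
test<-na.omit(test)  # same for the test set; randomForest stops on NA by default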
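A split can also be taken from a single permutation of the valid row numbers, so that the two sets cannot share rows and no index points past the end of the data. This is only a sketch on made-up data: `toy`, `train_toy` and `test_toy` are hypothetical names, and whether out-of-range indices caused the NA rows in the real data is an assumption.

```r
# Toy data frame standing in for lc2 (hypothetical; 10 rows only)
toy <- data.frame(Grade = factor(rep(c("A1", "A2"), 5)), a.add = 0:9)

# Indexing past the last row does not fail; it returns an all-NA row
out_of_range <- toy[11, ]

set.seed(1)
perm <- sample(nrow(toy))          # one permutation of the valid row numbers
train_toy <- toy[perm[1:6], ]      # first part of the permutation
test_toy  <- toy[perm[7:10], ]     # the rest, disjoint by construction

# Row names that appear in both sets (there should be none)
shared <- intersect(rownames(train_toy), rownames(test_toy))
```

Using `nrow()` instead of a hard-coded count keeps the indices within the data frame even if some rows were lost when the file was read.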

Supervised classification

train.rf = randomForest(Grade ~ ., data=train)
> train.rf

Call:
randomForest(formula = Grade ~ ., data = train) 
              Type of random forest: classification
                    Number of trees: 500
No. of variables tried at each split: 7

       OOB estimate of  error rate: 35.59%
Confusion matrix:
    A1  A2  B1 B2 C1 class.error
A1 173  72  30  0  0   0.3709091
A2  37 225 197  0  0   0.5098039
B1   6 104 800 16  1   0.1370011
B2   0   2 160 28  5   0.8564103
C1   0   0  39 14 10   0.8412698
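The error figures can be recomputed from the counts in the confusion matrix. The sketch below types the matrix in by hand (rows are the actual grades, columns the predicted ones); the object names are made up for illustration.

```r
lev <- c("A1", "A2", "B1", "B2", "C1")

# Counts copied from the confusion matrix above
conf <- matrix(c(173,  72,  30,  0,  0,
                  37, 225, 197,  0,  0,
                   6, 104, 800, 16,  1,
                   0,   2, 160, 28,  5,
                   0,   0,  39, 14, 10),
               nrow = 5, byrow = TRUE,
               dimnames = list(actual = lev, predicted = lev))

# Per-class error: share of each row that falls off the diagonal
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all cases that fall off the diagonal
oob_error <- 1 - sum(diag(conf)) / sum(conf)

round(class_error, 4)
round(100 * oob_error, 2)
```

The rows sum to 1,919 cases, so na.omit removed 81 of the 2,000 sampled rows. Most B2 and C1 essays are assigned to B1 (160 of 195 and 39 of 63), which is where the class errors above 0.84 come from.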

Raising the number of trees to 10,000 gave almost the same result.

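To plot the supervised result, proximity has to be specified as below, but it uses a lot of memory. With 10,000 trees this run needed a 250 MB vector, which could not be allocated, and it stopped with an error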

train.rf = randomForest(Grade ~ ., data=train, ntree=10000, proximity=TRUE)
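For reference, here is a small supervised run with proximity that fits easily in memory. It is a sketch on R's built-in iris data, not on the LCR2011 data, and the number of trees and the colours are arbitrary choices.

```r
library(randomForest)

set.seed(1)
# Supervised forest; proximity = TRUE stores an n x n proximity matrix
iris.rf <- randomForest(Species ~ ., data = iris,
                        ntree = 200, proximity = TRUE)

dim(iris.rf$proximity)   # one row and one column per case

# Multidimensional scaling plot of 1 - proximity, coloured by class
MDSplot(iris.rf, fac = iris$Species,
        palette = c("red", "green3", "blue"))
```

With 150 cases the proximity matrix is tiny; the same call on a larger sample needs memory in proportion to the square of the number of cases.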

Setting importance=TRUE makes the forest return the variable importance

test.rf = randomForest(Grade ~ ., data=test, importance=TRUE)
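The importance values can then be read with importance() and plotted with varImpPlot(), and a forest fitted on one sample can be checked against another with predict(). The sketch below uses the built-in iris data as a stand-in; `train_toy`, `test_toy` and `toy.rf` are hypothetical names, and fitting on one part and predicting on the other is one possible use of a train/test pair, not necessarily what was intended here.

```r
library(randomForest)

set.seed(1)
idx <- sample(nrow(iris), 100)
train_toy <- iris[idx, ]     # 100 cases for fitting
test_toy  <- iris[-idx, ]    # remaining 50 cases held out

toy.rf <- randomForest(Species ~ ., data = train_toy, importance = TRUE)

# One row per predictor; columns include MeanDecreaseAccuracy and MeanDecreaseGini
imp <- importance(toy.rf)
imp

# Dot charts of both importance measures
varImpPlot(toy.rf)

# Held-out predictions, cross-tabulated against the true classes
pred <- predict(toy.rf, newdata = test_toy)
table(actual = test_toy$Species, predicted = pred)
```

In the LCR2011 data the rows of the importance matrix would be Score and the error variables such as a.add and a.om, which is what the question about the contribution of each error type needs.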
