機械学習リファレンスマニュアル

最終変更者: 怡土順一, 最終変更リビジョン: 467, 最終変更日時: 2009-06-23 14:23:34 +0900 (火, 23 6月 2009)

ランダムツリー（Random Trees）

ランダムツリーは，Leo Breiman と Adele Cutlerによって提唱された． http://www.stat.berkeley.edu/users/breiman/RandomForests/. このアルゴリズムは，分類と回帰問題の両方を扱うことができる．ランダムツリーは決定木の集合（集合体）であり，このセクション以降では forest（この言葉も Breiman によって提唱された）と呼ばれる．分類は次のように行われる．ランダムツリー分類器は入力である特徴ベクトルを取り込み，forestの中の全てのツリーでそれらを分類し，多くの投票を得たクラスのラベルを出力する．回帰の場合，分類器の応答はforestの全てのツリーの応答の平均である．

すべてのツリーは同じパラメータで学習されるが，そこで用いられるデータ集合はブートストラップ法によって元データから生成され，各ツリーで異なる．それぞれの学習用データのために，ランダムに同じ数のベクトルをオリジナルデータの集合（=N）から選択する．このベクトルは，置換を伴って選択される．つまり，複数回使用されるベクトルもあれば，まったく使用されないベクトルもある．学習済みの各ツリーの各ノードにおいて，最適な分岐を見つけるのためには，すべての変数ではなくランダムに選択されたその部分集合のみを用いる．それぞれのノードで新しい部分集合が生成されるが，そのサイズは全てのノードおよび全てのツリーについて固定である．これは学習パラメータであり，デフォルトではsqrt(<number_of_variables>)である．学習後のツリーでは，枝の刈り込みは行われない．

ランダムツリーにおいて，交差検証法やブートストラップ法などの正確な推定手法，あるいは，学習誤差を評価するためのセパレートテストセットは必要ではない．誤差は，学習中に内部で評価される．現在のツリーのための学習集合が置換を伴うサンプリングによって抽出される際に，いくらかのベクトルは除外される（いわゆるoob (out-of-bag) データである）．oob データのサイズは約 N/3である. 分類誤差は以下のようにoob-dataを使用することによって評価される．

i番目のツリーに関係するobbである各ベクトルに対する予測を，そのi番目のツリーを使用して得る．
すべてのツリーの学習後，oobであった各ベクトルに対する class-"winner" （つまり，oobのベクトルを用いたときに，最も多くの投票を得ることができたクラス）を見つける．そして，これを真値と比較する．
その後，分類誤差の評価値は，元データのすべてのベクトルに対する，誤って分類したoobベクトルの割合として計算される. 回帰の場合，oob誤差は，oobベクトルの予測の和をベクトルの総数で割った値と真値との平方誤差として計算される．

参考文献

Machine Learning, Wald I, July 2002

Looking Inside the Black Box, Wald II, July 2002

Software for the Masses, Wald III, July 2002

And other articles from the web-site http://www.stat.berkeley.edu/users/breiman/RandomForests/cc_home.htm.

CvRTParams

ランダムツリーの学習パラメータ

struct CvRTParams : public CvDTreeParams
{
    bool calc_var_importance;
    int nactive_vars;
    CvTermCriteria term_crit;

    CvRTParams() : CvDTreeParams( 5, 10, 0, false, 10, 0, false, false, 0 ),
        calc_var_importance(false), nactive_vars(0)
    {
        term_crit = cvTermCriteria( CV_TERMCRIT_ITER+CV_TERMCRIT_EPS, 50, 0.1 );
    }

    CvRTParams( int _max_depth, int _min_sample_count,
                float _regression_accuracy, bool _use_surrogates,
                int _max_categories, const float* _priors,
                bool _calc_var_importance,
                int _nactive_vars, int max_tree_count,
                float forest_accuracy, int termcrit_type );
};

calc_var_importance: セットされている場合，変数の重要度が学習の際に計算される．計算した変数の重要度の配列を取り出すためには，CvRTrees::get_var_importance()を呼びだす.
nactive_vars: 変数の数．それぞれのツリーでランダムに選択され，最適な分割を求めるために使用される．
term_crit: forestの成長に対する終了条件． term_crit.max_iterは，forestの中のツリーの最大数（コンストラクタのパラメータであるmax_tree_countも参照する，デフォルトでは50）．
term_crit.epsilonは，満足される精度を表す（OOB error）.

forestのための学習パラメータの集合は，単一のツリーのための学習パラメータの上位集合である．しかし，ランダムツリーは決定木の機能や特徴のすべてを必要とせず，木の枝刈は行われない．そのため，交差検証法のパラメータは使用されない．

CvRTrees

ランダムツリークラス

class CvRTrees : public CvStatModel
{
public:
    CvRTrees();
    virtual ~CvRTrees();
    virtual bool train( const CvMat* _train_data, int _tflag,
                        const CvMat* _responses, const CvMat* _var_idx=0,
                        const CvMat* _sample_idx=0, const CvMat* _var_type=0,
                        const CvMat* _missing_mask=0,
                        CvRTParams params=CvRTParams() );
    virtual float predict( const CvMat* sample, const CvMat* missing = 0 ) const;
    virtual void clear();

    virtual const CvMat* get_var_importance();
    virtual float get_proximity( const CvMat* sample_1, const CvMat* sample_2 ) const;

    virtual void read( CvFileStorage* fs, CvFileNode* node );
    virtual void write( CvFileStorage* fs, const char* name );

    CvMat* get_active_var_mask();
    CvRNG* get_rng();

    int get_tree_count() const;
    CvForestTree* get_tree(int i) const;

protected:

    bool grow_forest( const CvTermCriteria term_crit );

    // forestのツリーの配列
    CvForestTree** trees;
    CvDTreeTrainData* data;
    int ntrees;
    int nclasses;
    ...
};

CvRTrees::train

ランダムツリーモデルの学習

bool CvRTrees::train(const CvMat* train_data, int tflag,
                    const CvMat* responses, const CvMat* comp_idx=0,
                    const CvMat* sample_idx=0, const CvMat* var_type=0,
                    const CvMat* missing_mask=0,
                    CvRTParams params=CvRTParams() );

メソッドCvRTrees::trainは，CvDTree::train()の第一形式に良く似ており，一般的なメソッドCvStatModel::trainの仕様に従う．パラメータ学習アルゴリズムへの指定は，すべて CvRTParamsインスタンスとして渡される. 学習誤差の評価（oob-error）は，プロテクトクラスのメンバoob_errorに保存される.

CvRTrees::predict

入力サンプルに対する出力を予測する

double CvRTrees::predict(const CvMat* sample, const CvMat* missing=0 ) const;

予測手法の入力パラメータは，CvDTree::predictと同じである．しかし，戻り値のタイプが異なる．この手法はforestの中のすべてのツリーの累積結果（多数が支持するクラス，あるいは，回帰関数推定値の平均）を返す．

CvRTrees::get_var_importance

変数の重要度を表す配列を取得する

const CvMat* CvRTrees::get_var_importance() const;

CvRTParams::calc_var_importance がセットされた場合，この手法は学習時に計算した変数の重要度を表すベクトルを返す．学習フラグがセットされていない場合は，NULLポインタを返す．学習後にいつでも変数の重要度が計算できる点が，決定木とは異なる.

CvRTrees::get_proximity

二つの学習サンプル間の近さを取り出す

float CvRTrees::get_proximity(const CvMat* sample_1, const CvMat* sample_2 ) const;

このメソッドは二つのサンプル間の近さを返す（二つのサンプルが同じ葉ノードに属するようなツリーの数の，ツリーの総和に対する割合）．

（例）ランダムツリー分類器を使用したキノコの食用可否の予測

#include <float.h>
#include <stdio.h>
#include <ctype.h>
#include "ml.h"

int main( void )
{
    CvStatModel*    cls = NULL;
    CvFileStorage*  storage = cvOpenFileStorage( "Mushroom.xml", NULL,CV_STORAGE_READ );
    CvMat*          data = (CvMat*)cvReadByName(storage, NULL, "sample", 0 );
    CvMat           train_data, test_data;
    CvMat           response;
    CvMat*          missed = NULL;
    CvMat*          comp_idx = NULL;
    CvMat*          sample_idx = NULL;
    CvMat*          type_mask = NULL;
    int             resp_col = 0;
    int             i,j;
    CvRTreesParams  params;
    CvTreeClassifierTrainParams cart_params;
    const int       ntrain_samples = 1000;
    const int       ntest_samples  = 1000;
    const int       nvars = 23;

    if(data == NULL || data->cols != nvars)
    {
        puts("Error in source data");
        return -1;
    }

    cvGetSubRect( data, &train_data, cvRect(0, 0, nvars, ntrain_samples) );
    cvGetSubRect( data, &test_data, cvRect(0, ntrain_samples, nvars,
        ntrain_samples + ntest_samples) );

    resp_col = 0;
    cvGetCol( &train_data, &response, resp_col);

    /* missed variable matrixの生成 */
    missed = cvCreateMat(train_data.rows, train_data.cols, CV_8UC1);
    for( i = 0; i < train_data.rows; i++ )
        for( j = 0; j < train_data.cols; j++ )
            CV_MAT_ELEM(*missed,uchar,i,j) = (uchar)(CV_MAT_ELEM(train_data,float,i,j) < 0);

    /* comp_idxベクトルの生成 */
    comp_idx = cvCreateMat(1, train_data.cols-1, CV_32SC1);
    for( i = 0; i < train_data.cols; i++ )
    {
        if(i<resp_col)CV_MAT_ELEM(*comp_idx,int,0,i) = i;
        if(i>resp_col)CV_MAT_ELEM(*comp_idx,int,0,i-1) = i;
    }

    /* sample_idxベクトルの生成 */
    sample_idx = cvCreateMat(1, train_data.rows, CV_32SC1);
    for( j = i = 0; i < train_data.rows; i++ )
    {
        if(CV_MAT_ELEM(response,float,i,0) < 0) continue;
        CV_MAT_ELEM(*sample_idx,int,0,j) = i;
        j++;
    }
    sample_idx->cols = j;

    /* タイプマスクの生成 */
    type_mask = cvCreateMat(1, train_data.cols+1, CV_8UC1);
    cvSet( type_mask, cvRealScalar(CV_VAR_CATEGORICAL), 0);

    // initialize training parameters
    cvSetDefaultParamTreeClassifier((CvStatModelParams*)&cart_params);
    cart_params.wrong_feature_as_unknown = 1;
    params.tree_params = &cart_params;
    params.term_crit.max_iter = 50;
    params.term_crit.epsilon = 0.1;
    params.term_crit.type = CV_TERMCRIT_ITER|CV_TERMCRIT_EPS;

    puts("Random forest results");
    cls = cvCreateRTreesClassifier( &train_data, CV_ROW_SAMPLE, &response,
        (CvStatModelParams*)& params, comp_idx, sample_idx, type_mask, missed );
    if( cls )
    {
        CvMat sample = cvMat( 1, nvars, CV_32FC1, test_data.data.fl );
        CvMat test_resp;
        int wrong = 0, total = 0;
        cvGetCol( &test_data, &test_resp, resp_col);
        for( i = 0; i < ntest_samples; i++, sample.data.fl += nvars )
        {
            if( CV_MAT_ELEM(test_resp,float,i,0) >= 0 )
            {
                float resp = cls->predict( cls, &sample, NULL );
                wrong += (fabs(resp-response.data.fl[i]) > 1e-3 ) ? 1 : 0;
                total++;
            }
        }
        printf( "Test set error = %.2f\n", wrong*100.f/(float)total );
    }
    else
       puts("Error forest creation");


    cvReleaseMat(&missed);
    cvReleaseMat(&sample_idx);
    cvReleaseMat(&comp_idx);
    cvReleaseMat(&type_mask);
    cvReleaseMat(&data);
    cvReleaseStatModel(&cls);
    cvReleaseFileStorage(&storage);
    return 0;
}

機械学習 リファレンス マニュアル