Random Forest Classification:
> Consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
> Uses the bagging (bootstrap aggregating) method.
> Algorithm:
Assumptions:
N = number of training cases; M = total number of features; k = number of features used at each split (k << M)
Steps:
1. Randomly select 'k' features from the total 'M' features, where k << M.
2. Among the 'k' features, calculate the node 'd' using the best split point.
3. Split the node into daughter nodes using that best split.
4. Repeat steps 1 to 3 until 'l' nodes have been reached.
5. Build the forest by repeating steps 1 to 4 'n' times to create 'n' trees.
Prediction:
Take the test features and use the rules of each randomly created decision tree to predict the outcome, and store each predicted outcome (target). Calculate the votes for each predicted target. Take the most-voted predicted target as the final prediction from the random forest algorithm. A minimal sketch of the whole train-and-vote procedure is shown below.
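The sketch below is a rough, from-scratch illustration of these steps, assuming scikit-learn's DecisionTreeClassifier as the base tree and the bundled iris toy dataset; names like n_trees and k_features are illustrative only. Note that it picks the 'k' features once per tree, following the wording of the steps above, whereas library implementations such as RandomForestClassifier re-sample features at every split.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
n_trees, k_features = 25, 2            # 'n' trees, 'k' features per tree

trees = []
for _ in range(n_trees):
    rows = rng.choice(len(X), size=len(X), replace=True)           # bagging: bootstrap sample of the rows
    cols = rng.choice(X.shape[1], size=k_features, replace=False)  # random 'k' of the 'M' features
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append((tree, cols))

def predict(x):
    # Each tree votes; the mode of the votes is the forest's prediction
    votes = [int(tree.predict(x[cols].reshape(1, -1))[0]) for tree, cols in trees]
    return np.bincount(votes).argmax()

print(predict(X[0]), y[0])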
Random Forest Regression:
The target variable has no classes. A regression tree fits the target variable by splitting on each of the independent variables: for each independent variable, the data is split at several candidate split points. At each split point, the error between the predicted value and the actual values is squared and summed to get a Sum of Squared Errors (SSE). The split-point errors are compared across the variables, and the variable/point yielding the lowest SSE is chosen as the root node/split point. This process continues recursively.
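As a small illustration of that split-point search, the sketch below scans candidate split points of a single variable and keeps the one with the lowest SSE; the arrays x and y are made-up toy values, not data from the text.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 1.9, 3.2, 7.8, 8.1, 8.4])

def sse_for_split(x, y, split):
    # Each side is predicted by its mean; sum the squared errors of both sides
    sse = 0.0
    for side in (y[x <= split], y[x > split]):
        if side.size:
            sse += np.sum((side - side.mean()) ** 2)
    return sse

# Candidate splits are midpoints between consecutive sorted x values
candidates = (np.sort(x)[:-1] + np.sort(x)[1:]) / 2
best = min(candidates, key=lambda s: sse_for_split(x, y, s))
print(best)   # the variable/point with the lowest SSE becomes the split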
Tuning parameters of Random Forest:
1. Number of features (max_features)
2. Number of trees (n_estimators)
3. Minimum leaf size (min_samples_leaf)
4. Number of jobs (n_jobs)
5. Random state (random_state)
6. Out-of-bag score (oob_score)
1. Number of features (max_features):
Auto/None:
This simply takes all the features that make sense in every tree; here we put no restriction on the individual trees.
sqrt:
This option takes the square root of the total number of features for an individual tree. For instance, if the total number of variables is 100, each tree considers only 10 of them. "log2" is a similar option for max_features.
0.6:
This option allows the random forest to take 60% of the variables in an individual tree. We can assign any value in the format "0.x" when we want x% of the features to be considered.
Increasing max_features generally improves the performance of each tree, since at each node there are more candidate features to consider. However, this does not necessarily help the forest, because it decreases the diversity of the individual trees, which is the USP of random forest. It also definitely slows the algorithm down.
Hence, you need to strike the right balance and choose the optimal max_features; one way to compare settings is sketched below.
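A rough way to compare a few max_features settings is to look at their out-of-bag scores; a minimal sketch, assuming scikit-learn and the bundled diabetes toy dataset:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
for max_features in ["sqrt", "log2", 0.6, 1.0]:   # 1.0 means all features
    rf = RandomForestRegressor(n_estimators=200, max_features=max_features,
                               oob_score=True, random_state=50, n_jobs=-1)
    rf.fit(X, y)
    print(max_features, round(rf.oob_score_, 3))  # out-of-bag R^2 for each setting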
2. Number of trees (n_estimators) :
This is the number of trees you want to build before taking the maximum voting or the average of the predictions. A higher number of trees gives you better performance but makes your code slower. Choose as high a value as your processor can handle, because this makes your predictions stronger and more stable.
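The trade-off can be seen by timing a few values of n_estimators and tracking the out-of-bag score, which usually flattens out while the fit time keeps growing; a sketch, again assuming the diabetes toy dataset:

import time
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
for n in (25, 100, 400):
    start = time.perf_counter()
    rf = RandomForestRegressor(n_estimators=n, oob_score=True,
                               random_state=50).fit(X, y)
    # OOB score vs. wall-clock fit time for each forest size
    print(n, round(rf.oob_score_, 3), f"{time.perf_counter() - start:.2f}s")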
3. Minimum leaf size (min_samples_leaf):
A leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data, as sketched below.
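A small sketch of that effect, assuming a train/test split of the diabetes toy data: a leaf size of 1 typically scores much better on the training set than on held-out data, while a larger leaf narrows the gap.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=50)
for leaf in (1, 25):
    rf = RandomForestRegressor(min_samples_leaf=leaf, n_estimators=200,
                               random_state=50).fit(X_tr, y_tr)
    # Train vs. test R^2: a large gap suggests the trees are fitting noise
    print(leaf, round(rf.score(X_tr, y_tr), 2), round(rf.score(X_te, y_te), 2))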
4. Number of jobs (n_jobs) :
This parameter tells the engine how many processors it is allowed to use. A value of "-1" means there is no restriction, whereas a value of "1" means it can use only one processor.
5. Random state (random_state):
This parameter makes a solution easy to replicate. A fixed value of random_state will always produce the same results, given the same hyperparameters and the same training data.
6. Out-of-bag score (oob_score):
This is a random forest cross-validation method. It is very similar to the leave-one-out validation technique, but much faster. This method simply tags every observation used in the different trees. It then finds the maximum vote score for every observation, based only on the trees that did not use that particular observation for training.
Here is an example of using all these parameters in a single model:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, oob_score=True, n_jobs=-1,
                              random_state=50, min_samples_leaf=50,
                              max_features=1.0)   # 1.0 = use all features (what "auto" used to mean)
model.fit(X, y)
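After fitting, the out-of-bag estimate and the forest's predictions can be read straight off the fitted model (assuming X and y are your training features and target, as above):

print(model.oob_score_)          # OOB estimate of R^2, as described under oob_score
predictions = model.predict(X)   # each tree predicts; the forest returns the average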
Advantages:
Less prone to overfitting than a single decision tree, because predictions are averaged over many trees
Can be used for both regression and classification
Can be used for feature selection, via the feature importances it provides
Can handle missing data