Post

Hands-on 05. Random Forest

Hands-on 05. Random Forest

Random Forest


Prerequisites

1
Random Forest

1. How to implement the real

Random Forest is ensembling model with various random decision tree. Each trees has different features from being choosed randomly and structure. So when it comes to inference with each trees, combined the each decision and build the result like average, majority vote

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

RANDOM_STATE = 42

def train_random_forest(df):

    # Drop rows involved NA
    df = df.copy()

    rf_features = [
        "bedrooms",
        "bathrooms",
        "garage",
        "land_area",
        "floor_area",
        "latitude",
        "longitude",
        "cbd_dist",
        "nearest_stn_dist",
        "nearest_sch_dist",
    ]

    available_features = [c for c in rf_features if c in df.columns]
    model_df = df.dropna(subset=available_features + ["price"]).copy()

    if len(model_df) < 100:
        df["predicted_price"] = np.nan
        df["prediction_gap"] = np.nan
        df["prediction_gap_pct"] = np.nan
        return df, {"mae": None, "r2": None, "features": []}

    X = model_df[available_features]
    y = model_df["price"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_STATE
    )

    # set model
    model = RandomForestRegressor(
        n_estimators=120,
        max_depth=14,
        min_samples_leaf=4,
        random_state=RANDOM_STATE,
        n_jobs=-1,
    )

    # train
    model.fit(X_train, y_train)

    # inference
    y_pred = model.predict(X_test)
    all_pred = model.predict(X)

    # result
    model_df["predicted_price"] = all_pred
    model_df["prediction_gap"] = model_df["predicted_price"] - model_df["price"]
    model_df["prediction_gap_pct"] = model_df["prediction_gap"] / model_df["predicted_price"]

    df["predicted_price"] = np.nan
    df["prediction_gap"] = np.nan
    df["prediction_gap_pct"] = np.nan

    df.loc[model_df.index, ["predicted_price", "prediction_gap", "prediction_gap_pct"]] = model_df[["predicted_price", "prediction_gap", "prediction_gap_pct"]]

    importance = sorted(
        zip(available_features, model.feature_importances_),
        key=lambda x: x[1], reverse=True,
    )

    metrics = {
        "mae": float(mean_absolute_error(y_test, y_pred)),
        "r2": float(r2_score(y_test, y_pred)),
        "features": importance[:8],
    }

    return df, metrics


df = pd.read_csv("data.csv")
df, metrics = train_random_forest(df)

1-1. Data

  1. Select features
  2. Split train and test data

1-2. Random Forest

  1. Set model parameters

    • Number of estimators (tree count)
    • Maximum depth
    • Minimum samples per leaf
    • Number of features to consider at each split
  2. Process as follows:

    1. Create a Decision Tree

      1. Generate a bootstrap sample (random sampling with replacement)
      2. At each node:

        • Randomly select a subset of features
        • For each selected feature, test all possible split thresholds
        • Choose the split that minimizes error (e.g., MSE)
      3. Split the data based on the best condition
      4. Repeat recursively until stopping criteria are met (e.g., max depth, minimum samples)
    2. Repeat tree creation

      • Build multiple trees independently using different random samples
      • Continue until the total number of estimators is reached
    3. Make predictions

      • Pass input data through all trees
      • Aggregate results:

        • Regression → average
        • Classification → majority vote

1.3 In real

"ml-ho-rf01"

2. How to work Random Forest

1
2
3
4
5
Random Forest =
Data randomness +
Feature randomness +
Optimal split selection +
Aggregation (average / vote)

Random Forest is an ensemble learning method used to improve prediction accuracy by combining multiple decision trees. Instead of relying on a single tree, it builds many trees using randomness and aggregates their results.

Complexity Comparison:

1
2
Single Decision Tree:   High variance (overfitting risk)
Random Forest:          Reduced variance (stable prediction)

2-1. Build Process (Tree Construction)

Random Forest builds multiple decision trees using:

1
2
3
1. Bootstrap Sampling (random data)
2. Random Feature Selection
3. Best Split Selection (not random split)

Example Data

1
2
3
4
5
6
7
House : (bedrooms, area, cbd_dist) → price

A : (2, 300, 15) → 500k
B : (3, 500, 10) → 800k
C : (4, 700, 5)  → 1200k
D : (3, 400, 12) → 700k
E : (5, 900, 3)  → 1500k
> Tree 1
Step 1: Bootstrap Sampling
1
[A, B, C, D]
Step 2: Random Feature Selection
1
2
Selected features:
area, cbd_dist
Step 3: Try multiple split candidates
1
2
3
area < 400
area < 600
cbd_dist < 10
Step 4: Evaluate each split (MSE)
1
2
3
area < 400 → MSE = 120k
area < 600 → MSE = 80k   ✅ best
cbd_dist < 10 → MSE = 150k
Step 5: Choose best split
1
if area < 600
> Tree 2
Step 1: Bootstrap Sampling
1
[B, C, D, E]
Step 2: Random Feature Selection
1
2
Selected features:
bedrooms, cbd_dist
Step 3: Split candidates
1
2
3
bedrooms < 4
cbd_dist < 8
cbd_dist < 15
Step 4: Evaluation
1
2
3
bedrooms < 4 → MSE = 90k
cbd_dist < 8 → MSE = 70k   ✅ best
cbd_dist < 15 → MSE = 100k
Step 5: Best split
1
if cbd_dist < 8
> Tree 3
Step 1: Bootstrap Sampling
1
[A, C, E]
Step 2: Random Feature Selection
1
2
Selected features:
area, bedrooms
Step 3: Split candidates
1
2
area < 500
bedrooms < 4
Step 4: Evaluation
1
2
area < 500 → MSE = 60k   ✅ best
bedrooms < 4 → MSE = 110k
Step 5: Best split
1
if area < 500

> Final Tree Concept

Each tree is:

1
2
3
- trained on different data
- uses different features
- finds different decision rules

2-2. Prediction

1
2
Predict price for:
(3 bedrooms, 450 area, cbd_dist = 10)
Step 1: Run all trees
1
2
3
Tree 1 → 1000k
Tree 2 → 750k
Tree 3 → 650k
Step 2: Aggregate results
For regression:
1
2
3
Final = Average

(1000 + 750 + 650) / 3 = 800k
For classification:
1
2
3
4
5
Tree1 → Premium
Tree2 → Mid
Tree3 → Mid

Final = Majority Vote → Mid
This post is licensed under CC BY 4.0 by the author.