Post

Hands-on 07. SHAP

Hands-on 07. SHAP

SHAP


Prerequisites

1
SHAP

1. How to implement the real

SHAP is a model interpretation algorithm.

The main idea is:

1
2
Measure how much each feature contributed
to the final prediction.

In other words:

1
2
3
Base prediction
+ feature contributions
= final prediction
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
from sklearn.ensemble import RandomForestRegressor
import shap
import numpy as np
import pandas as pd

model_df = pd.DataFrame({
    "bedrooms": [3, 4, 3, 5, 4],
    "bathrooms": [2, 2, 1, 3, 2],
    "land_area": [500, 650, 420, 800, 700],
    "floor_area": [180, 240, 130, 320, 260],
    "cbd_dist": [12, 8, 20, 6, 10],
    "nearest_stn_dist": [1.2, 0.8, 2.5, 0.6, 1.0],
    "price": [720000, 950000, 560000, 1250000, 1050000],
})

RANDOM_STATE = 42
MAX_SHAP_EXPLANATION_ROWS = 1000

def money_effect(value):
    sign = "+" if value >= 0 else "-"
    return f"{sign}${abs(value):,.0f}"

available_features = [
    "bedrooms",
    "bathrooms",
    "land_area",
    "floor_area",
    "cbd_dist",
    "nearest_stn_dist",
]

X = model_df[available_features]
y = model_df["price"]

# Random Forest Training
model = RandomForestRegressor(
    n_estimators=100,
    max_depth=5,
    random_state=RANDOM_STATE,
)

model.fit(X, y)
model_df["predicted_price"] = model.predict(X)

#-----------------------------------------------------#

if len(model_df) > MAX_SHAP_EXPLANATION_ROWS:
    shap_index = model_df.sample(
        n=MAX_SHAP_EXPLANATION_ROWS,
        random_state=RANDOM_STATE,
    ).index
else:
    shap_index = model_df.index

X_shap = X.loc[shap_index]

explainer = shap.TreeExplainer(model) # SHAP for tree structure
shap_values = explainer.shap_values(X_shap) # features effect using SHAP
{
    if isinstance(shap_values, list):
        shap_values = shap_values[0]

    shap_values = np.asarray(shap_values)
}

mean_abs = np.abs(shap_values).mean(axis=0)

base_value = explainer.expected_value # base value of SHAP 
{
    if isinstance(base_value, (list, np.ndarray)):
        base_value = float(np.asarray(base_value).ravel()[0])
    else:
        base_value = float(base_value)
}

shap_summary_features = sorted(
    zip(available_features, mean_abs),
    key=lambda x: x[1],
    reverse=True,
)[:8]

shap_summary_features = [
    (name, float(value))
    for name, value in shap_summary_features
]

shap_result = pd.DataFrame(index=shap_index)

shap_result["shap_base_value"] = base_value
shap_result["shap_total_effect"] = shap_values.sum(axis=1)
shap_result["shap_top_positive"] = "N/A"
shap_result["shap_top_negative"] = "N/A"
shap_result["shap_explanation"] = "N/A"
shap_result["shap_dominant_feature"] = "N/A"
shap_result["shap_dominant_value"] = np.nan

for row_pos, idx in enumerate(shap_index):

    vals = shap_values[row_pos]

    pairs = list(zip(available_features, vals))

    pos = sorted(
        [p for p in pairs if p[1] > 0],
        key=lambda x: x[1],
        reverse=True,
    )[:3]

    neg = sorted(
        [p for p in pairs if p[1] < 0],
        key=lambda x: x[1],
    )[:3]

    dominant = max(
        pairs,
        key=lambda x: abs(x[1]),
    )

    pos_text = (
        "; ".join(
            [f"{name}: {money_effect(value)}" for name, value in pos]
        )
        if pos
        else "None"
    )

    neg_text = (
        "; ".join(
            [f"{name}: {money_effect(value)}" for name, value in neg]
        )
        if neg
        else "None"
    )

    explanation = (
        f"Base price: ${base_value:,.0f}<br>"
        f"Positive factors: {pos_text}<br>"
        f"Negative factors: {neg_text}"
    )

    shap_result.loc[idx, "shap_top_positive"] = pos_text
    shap_result.loc[idx, "shap_top_negative"] = neg_text
    shap_result.loc[idx, "shap_explanation"] = explanation
    shap_result.loc[idx, "shap_dominant_feature"] = dominant[0]
    shap_result.loc[idx, "shap_dominant_value"] = float(dominant[1])

shap_cols = [
    "shap_base_value",
    "shap_total_effect",
    "shap_top_positive",
    "shap_top_negative",
    "shap_explanation",
    "shap_dominant_feature",
    "shap_dominant_value",
]

model_df.loc[shap_result.index, shap_cols] = shap_result[shap_cols]

1-1. Data

  1. Set parameters

    • Trained tree model (Random Forest / XGBoost / LightGBM)
    • Input data to explain
    • Feature list
    • Background / base prediction
    • Maximum rows for SHAP calculation

1-2. SHAP Model

  1. Process as follows:

    1. Create a TreeExplainer

      1. Pass the trained tree model to SHAP

        1
        
        explainer = shap.TreeExplainer(model)
        
      2. SHAP reads the tree structure:

        • Split features
        • Split thresholds
        • Leaf prediction values
        • Tree paths
    2. Select data to explain

      • If the dataset is large, sample only part of it
      • This reduces computation time
      1
      
      X_shap = X.loc[shap_index]
      
    3. Compute base value

      • Base value means the model’s average expected prediction
      • It is the starting point before feature effects are added
      1
      
      base_value = explainer.expected_value
      

      Example:

      1
      
      Base value = 650k
      
    4. Compute SHAP values

      • For each row, SHAP calculates how much each feature moves the prediction
      • Positive value → pushes prediction higher
      • Negative value → pushes prediction lower
      1
      
      shap_values = explainer.shap_values(X_shap)
      
    5. Explain one prediction

      Example:

      1
      2
      3
      4
      5
      6
      
      Base value = 650k
      
      floor_area = +150k
      cbd_dist   = +100k
      
      Final prediction = 900k
      

      Calculation:

      1
      
      650k + 150k + 100k = 900k
      
    6. Repeat for all rows

      • For every house, SHAP calculates feature contribution values

      Example:

      1
      2
      3
      4
      5
      6
      7
      
      House A:
      floor_area = +120k
      cbd_dist   = +80k
      
      House B:
      floor_area = -60k
      cbd_dist   = -100k
      
    7. Aggregate feature importance

      • Compute average absolute SHAP value for each feature
      • Larger average absolute value means stronger overall model influence
      1
      
      mean_abs = np.abs(shap_values).mean(axis=0)
      

      Example:

      1
      2
      3
      
      cbd_dist   = 140k average impact
      floor_area = 120k average impact
      bathrooms  = 20k average impact
      
    8. Determine dominant feature

      • For each row, find the feature with the largest absolute SHAP value
      1
      
      dominant = max(pairs, key=lambda x: abs(x[1]))
      

      Example:

      1
      2
      3
      4
      
      floor_area = +150k
      cbd_dist   = +100k
      
      Dominant feature = floor_area
      
    9. Final output

      • Base value
      • SHAP values for each feature
      • Positive factors
      • Negative factors
      • Dominant feature
      • Feature importance summary

      Example:

      1
      2
      3
      4
      5
      6
      7
      8
      9
      10
      11
      
      Base price: 650k
      
      Positive factors:
      floor_area: +150k
      cbd_dist: +100k
      
      Negative factors:
      None
      
      Final prediction:
      900k
      
  2. Core rule

    1
    2
    3
    4
    5
    
    Final prediction
    =
    Base value
    +
    Sum of SHAP values
    
  3. Interpretation

    • Positive SHAP value → feature increased the prediction
    • Negative SHAP value → feature decreased the prediction
    • Large absolute SHAP value → feature had strong influence
    • Small absolute SHAP value → feature had weak influence

1.2 In real

"ml-ho-shap01"

2. How to work Random Forest

Example Data

Assume we trained a Random Forest model using:

1
2
3
4
price
floor_area
cbd_dist
bathrooms

We want to predict:

1
2
3
4
House E:
floor_area = 250
cbd_dist = 8
bathrooms = 2

Suppose the Random Forest predicts:

1
Predicted price = 900k

Now SHAP asks:

1
Why did the model predict 900k?

2-1. Base Value

SHAP first calculates the average prediction of the model.

Example:

1
Average house prediction = 650k

This becomes:

1
base_value = 650k

Meaning:

1
2
Without knowing any feature,
the model predicts around 650k.

2-2. Tree Contribution Calculation

Suppose one decision tree looks like:

1
2
3
4
5
6
7
8
9
                [floor_area < 200]

                 /             \
              YES               NO
           predict 500k      [cbd_dist < 10]

                              /          \
                           YES            NO
                        predict 900k   predict 700k

Our house:

1
2
floor_area = 250
cbd_dist = 8
Step 1
1
floor_area < 200 ?

Result:

1
2
250 > 200
→ NO

Move right.

Step 2
1
cbd_dist < 10 ?

Result:

1
2
8 < 10
→ YES

Final prediction:

1
900k

2-3. SHAP Contribution Idea

SHAP measures:

1
How much each feature changed the prediction.
Step 1: Start from base value
1
650k
Step 2: Add floor_area effect

Because:

1
floor_area = 250

the model moves prediction upward:

1
650k → 800k

Contribution:

1
floor_area contribution = +150k
Step 3: Add cbd_dist effect

Because:

1
cbd_dist = 8

the house is relatively close to CBD.

Prediction changes:

1
800k → 900k

Contribution:

1
cbd_dist contribution = +100k

2-4. Final SHAP Values

Final contributions:

FeatureSHAP Value
floor_area+150k
cbd_dist+100k

Now verify:

1
2
3
base_value
+ all SHAP values
= final prediction

Calculation:

1
2
3
4
5
650k
+150k
+100k
=
900k

Correct.

2-5. Multiple Trees in Random Forest

Random Forest contains many trees.

Example:

1
120 trees

SHAP calculates contributions for:

1
2
3
4
5
Tree 1
Tree 2
Tree 3
...
Tree 120

Then averages them.

Treefloor_area contribution
Tree 1+150k
Tree 2+120k
Tree 3+180k

Average:

1
2
3
(150 + 120 + 180) / 3
=
150k

Final SHAP value:

1
floor_area = +150k

2-6. Dominant Feature

The dominant feature is:

1
Feature with the largest absolute SHAP value.

Example:

FeatureSHAP
floor_area+150k
cbd_dist+100k
bathrooms+20k

Dominant feature:

1
floor_area

because:

1
|150k| is the largest effect
This post is licensed under CC BY 4.0 by the author.