Post

Hands-on 04. BallTree

Hands-on 04. BallTree

BallTree


Prerequisites

1
BallTree

1. How to implement the real

DBSCAN do not have manual number of clusters unlike KMeans. DBSCAN find another point from original point. It’s extending continuously and sometimes there’s no group and sometimes there’s so many groups because of automatically being determined.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree

MAX_GAP_PAIRS = 200

def add_local_price_gap_zones(df, radius_m=500, min_price_gap=250000, max_pairs=MAX_GAP_PAIRS):

    # Drop rows involved NA
    work_df = df.dropna(subset=["latitude", "longitude", "price"]).copy()

    if work_df.empty:
        df["price_gap_zone"] = False
        return df, []

    coords_rad = np.radians(work_df[["latitude", "longitude"]].to_numpy())
    prices = work_df["price"].to_numpy()

    tree = BallTree(coords_rad, metric="haversine")

    earth_radius_m = 6371000
    radius_rad = radius_m / earth_radius_m

    neighbors = tree.query_radius(coords_rad, r=radius_rad)

    gap_pairs = []
    seen = set()

    for i, neighbor_indices in enumerate(neighbors):
        for j in neighbor_indices:
            if i == j:
                continue

            pair_key = tuple(sorted((i, j)))
            if pair_key in seen:
                continue

            seen.add(pair_key)

            price_gap = abs(float(prices[i]) - float(prices[j]))

            if price_gap >= min_price_gap:
                row_i = work_df.iloc[i]
                row_j = work_df.iloc[j]

                gap_pairs.append({
                    "i_index": row_i.name,
                    "j_index": row_j.name,
                    "lat1": row_i["latitude"],
                    "lon1": row_i["longitude"],
                    "lat2": row_j["latitude"],
                    "lon2": row_j["longitude"],
                    "price1": float(row_i["price"]),
                    "price2": float(row_j["price"]),
                    "gap": price_gap,
                })

    gap_pairs = sorted(gap_pairs, key=lambda x: x["gap"], reverse=True)[:max_pairs]

    df = df.copy()
    df["price_gap_zone"] = False

    for pair in gap_pairs:
        df.loc[pair["i_index"], "price_gap_zone"] = True
        df.loc[pair["j_index"], "price_gap_zone"] = True

    return df, gap_pairs


df = pd.read_csv("data.csv")
df_map, gap_pairs = add_local_price_gap_zones(df)

1-1. Data

  1. Select independent and dependent features you wanna check
  2. Check whether the data is sufficient to calculate BallTree.

1-2. BallTree

  1. Set the tolerance between points
  2. Process as follow:
    1. Build a spatial index (tree)
      1. Convert coordinates into a suitable mertric space
      2. Recursively split the dataset into smaller groups
      3. Each node represents a “ball” (center + radius) that contains a subset of points
      4. Continue splitting until leaf nodes conatin a small number of points
    2. Find
      1. Start froom the root node
      2. Check distance betwen query point and noe region
      3. Prune irrelevant nodes
      4. Traverse relevant child nodes
      5. Reach leaf nodes
      6. REturn nighbors

1.3 In real

"ml-ho-balltree01"

2. How to work Balltree

BallTree is a spatial data structure used to efficiently find nearby points. Instead of comparing every point with every other point, it groups points into regions (balls) and skips irrelevant areas during search.

Complexity Comparison:

1
2
Naive search:   O(N²)
BallTree:       O(N log N)

2-1. Build Process (Tree Construction)

BallTree splits the dataset recursively using two distant points (pivots).

Example Data

1
2
3
4
5
6
7
8
9
10
11
12
13
Index : Position
0 : 1
1 : 2
2 : 3
3 : 4
4 : 5

5 : 20
6 : 21
7 : 22
8 : 23

9 : 50

We can visually see: (ideal)

1
2
[1~5]       [20~23]       [50]
Cluster A   Cluster B     Outlier
Step 1: Select initial pivot

Pick any point randomly.

1
A = 3
Step 2: Find a far point

Find the point farthest from A:

1
B = 50
Step 3: Find another far point

Find the point farthest from B:

1
C = 1

Now we use (B, C) as pivots: (Choice the furthest distance between A,B and C)

1
2
Pivot1 = 50
Pivot2 = 1
Step 4: Split data into two groups

Assign each point to the closer pivot:

1
2
Left  (closer to 1):  [1,2,3,4,5,20]
Right (closer to 50): [21,22,23,50]
Step 5: Recursively split

Repeat the same process:

1
2
3
4
5
6
7
Left:
  Pivot → (1,20)
  → [1,2,3,4,5] | [20]

Right:
  Pivot → (21,50)
  → [21,22,23] | [50]
Final Tree Structure
1
2
3
4
5
             Root
           /      \
   [1,2,3,4,5,20]   [21,22,23,50]
      /      \           /      \
 [1~5]      [20]   [21~23]     [50]

Each node represents a “ball” (center + radius).

2-2. Query Process (Finding Neighbors)

1
Find all points within distance = 3 from x = 2
Step 1: Start from root

Check both child nodes.

Step 2: Check left node
1
2
3
Center ≈ 3
distance(2, 3) = 1 → inside range
→ explore this node
Step 3: Check right node
1
2
3
Center ≈ 30
distance(2, 30) = 28 → too far
→ prune (skip entire node)

Important:

1
Points [20,21,22,23,50] are never checked individually
Step 4: Traverse leaf nodes

Check only relevant points:

1
2
3
4
5
distance(2,1) = 1 → include
distance(2,2) = 0 → include
distance(2,3) = 1 → include
distance(2,4) = 2 → include
distance(2,5) = 3 → include
Final Result
1
Neighbors = [1,2,3,4,5]
This post is licensed under CC BY 4.0 by the author.