
Understanding the ID3 Algorithm


Machine learning has played a major role in solving complex problems. Decision trees in particular have gained popularity because of their simplicity. Here, we will work through one such algorithm, implement a decision tree, and use it to make predictions.

ID3 Algorithm Implementation using Python


1. Quick overview / what you'll build

We implement ID3 (Iterative Dichotomiser 3), a classic decision-tree algorithm that uses entropy and information gain to choose splits. We will:

  • Load a CSV dataset (student_dataset_200.csv).

  • Preprocess features (drop irrelevant ID columns; discretize numeric features simply).

  • Implement ID3 (entropy, info gain, recursive tree builder).

  • Train/test split, evaluate accuracy and save the learned tree to JSON.

  • Show the learned tree structure and suggest improvements.


2. The dataset (what we used)

  • File: /mnt/data/student_dataset_200.csv (the dataset used throughout this walkthrough).

  • Detected columns: Student_ID, Study_Hours, Attendance, Result.

  • We treat Result as the target (class label).

  • Student_ID is an identifier and should not be used as a predictor, so we drop it (a quick check is sketched just after this list).
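
A quick sanity check (a sketch; adjust the path to wherever your copy of the CSV lives): an identifier column has a distinct value for every row, so it carries no generalizable signal and would let ID3 memorize the training data.

import pandas as pd

df = pd.read_csv("student_dataset_200.csv")  # adjust the path as needed
print(df.nunique())  # if Student_ID shows one distinct value per row, it is a pure identifier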


3. Short theory refresher

Entropy

Entropy measures the impurity of a set of labels S:

Entropy(S) = − Σ_i p_i · log2(p_i)

where p_i is the proportion of label i in S.

Information Gain

Information gain for an attribute A is the reduction in entropy achieved by splitting S on A:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where S_v is the subset of S in which A takes the value v.

ID3 chooses the attribute with the highest information gain at each node, and the process recurses on each branch until all examples in a subset share the same class or no attributes remain.
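
A tiny numeric sanity check of the entropy formula (toy labels, not drawn from the student dataset):

# entropy_example.py -- quick check of the entropy formula above
from math import log2
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

print(round(entropy(["Pass"] * 9 + ["Fail"] * 5), 3))  # ≈ 0.94: a mixed set is impure
print(entropy(["Pass"] * 7 + ["Fail"] * 7))             # 1.0: a 50/50 split is maximally impure

A pure subset (all one class) has entropy 0; information gain rewards splits that push subsets toward purity.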


4. Full Python implementation (copy-paste and run)

Below is the complete code. It is self-contained, requiring only pandas and scikit-learn (for train_test_split). It automatically detects the target column if it has a common label name such as Result; otherwise it falls back to the last column. Numeric columns are discretized with a simple median split. The learned tree is saved to disk as JSON.


# id3_from_scratch.py
import pandas as pd
import numpy as np
from math import log2
from collections import Counter
from sklearn.model_selection import train_test_split
import json

# --- CONFIG: path to your CSV file
CSV_PATH = "/mnt/data/student_dataset_200.csv"  # change path if needed
TARGET_CANDIDATES = ['target','class','label','result','Result','Class','Label','Passed','passed','Pass','pass','Outcome','outcome','status','Status']

# --- 1) Load dataset
df = pd.read_csv(CSV_PATH)
print("Loaded dataset shape:", df.shape)
print("Columns:", list(df.columns))

# --- 2) Detect target column (try common names, fallback to last)
target_col = None
for c in df.columns:
    if c in TARGET_CANDIDATES:
        target_col = c
        break
if target_col is None:
    target_col = df.columns[-1]
print("Assumed target column:", target_col)

# --- 3) Drop ID-like columns (columns containing 'id', except target)
drop_cols = [c for c in df.columns if ('id' in c.lower() and c != target_col)]
if drop_cols:
    print("Dropping ID columns:", drop_cols)
    df = df.drop(columns=drop_cols)

# --- 4) Separate features and target
X = df.drop(columns=[target_col])
y = df[target_col].astype(str)

# --- 5) Simple discretization for numeric features (median split)
def discretize_numeric(df, numeric_cols):
    df2 = df.copy()
    for col in numeric_cols:
        median = df2[col].median()
        df2[col] = df2[col].apply(lambda x: f"<= {median}" if x <= median else f"> {median}")
    return df2

numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
if numeric_cols:
    print("Numeric columns detected, discretizing:", numeric_cols)
    X = discretize_numeric(X, numeric_cols)

# Convert features to string (treat as categorical)
X = X.astype(str)
data = pd.concat([X, y.rename(target_col)], axis=1)

# --- 6) ID3 helpers: entropy, information gain, majority
def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    ent = 0.0
    for v in counts.values():
        p = v/total
        if p > 0:
            ent -= p * log2(p)
    return ent

def information_gain(data, attr, target):
    total_entropy = entropy(data[target])
    values = data[attr].unique()
    remainder = 0.0
    total = len(data)
    for v in values:
        subset = data[data[attr]==v]
        remainder += (len(subset)/total) * entropy(subset[target])
    return total_entropy - remainder

def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]

# --- 7) ID3 recursive builder
def id3(data, attributes, target, depth=0, max_depth=None):
    labels = data[target].tolist()
    # If all labels equal -> leaf
    if len(set(labels)) == 1:
        return {'type':'leaf', 'label': labels[0]}
    # If no attributes left or max depth reached
    if not attributes or len(data) == 0 or (max_depth is not None and depth >= max_depth):
        return {'type':'leaf', 'label': majority_class(labels)}
    # Select best attribute by information gain
    gains = [(attr, information_gain(data, attr, target)) for attr in attributes]
    best_attr, best_gain = max(gains, key=lambda x: x[1])
    # If no gain, return leaf
    if best_gain <= 0:
        return {'type':'leaf', 'label': majority_class(labels)}
    tree = {'type':'node', 'attribute': best_attr, 'branches': {}}
    for val in data[best_attr].unique():
        subset = data[data[best_attr] == val]
        if subset.empty:
            tree['branches'][val] = {'type':'leaf', 'label': majority_class(labels)}
        else:
            new_attrs = [a for a in attributes if a != best_attr]
            tree['branches'][val] = id3(subset, new_attrs, target, depth+1, max_depth)
    return tree

# --- 8) Prediction helper
def predict_instance(tree, instance):
    if tree['type'] == 'leaf':
        return tree['label']
    attr = tree['attribute']
    val = str(instance.get(attr, ''))
    if val in tree['branches']:
        return predict_instance(tree['branches'][val], instance)
    else:
        # Unseen attribute value: fallback to majority class among leaves
        labels = []
        def collect_labels(t):
            if t['type'] == 'leaf':
                labels.append(t['label'])
            else:
                for b in t['branches'].values():
                    collect_labels(b)
        collect_labels(tree)
        return majority_class(labels) if labels else None

def predict(tree, X_df):
    return [predict_instance(tree, row.to_dict()) for _, row in X_df.iterrows()]

# --- 9) Train/test split and build
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42, stratify=data[target_col])
attributes = list(X.columns)
print("Train size:", train_df.shape, "Test size:", test_df.shape)

tree = id3(train_df, attributes, target_col, max_depth=10)

# --- 10) Evaluate
X_test = test_df.drop(columns=[target_col]).astype(str)
y_test = test_df[target_col].astype(str).tolist()
preds = predict(tree, X_test)
accuracy = sum(1 for a,b in zip(preds, y_test) if a == b) / len(y_test)
print("Test accuracy:", accuracy)

# --- 11) Print & save tree (pretty)
def print_tree(tree, indent=""):
    if tree['type'] == 'leaf':
        print(indent + "Leaf:", tree['label'])
    else:
        print(indent + f"[{tree['attribute']}]")
        for val, subtree in tree['branches'].items():
            print(indent + " ->", val)
            print_tree(subtree, indent + "    ")

print_tree(tree)

with open('/mnt/data/id3_tree_no_id.json', 'w') as f:
    json.dump(tree, f, indent=2)

print("Saved tree to /mnt/data/id3_tree_no_id.json")

5. Results produced on the dataset

Running the code on student_dataset_200.csv (with Student_ID dropped) produces:

Learned tree:


[Attendance]
 -> Poor
    [Study_Hours]
     -> Low      -> Leaf: Fail
     -> Medium   -> Leaf: Fail
     -> High     -> Leaf: Pass
 -> Good
    [Study_Hours]
     -> Low      -> Leaf: Pass
     -> High     -> Leaf: Pass
     -> Medium   -> Leaf: Pass

Test accuracy (80/20 stratified split): 0.75 (30/40)

Saved tree file: /mnt/data/id3_tree_no_id.json. (An earlier run, which still included the Student_ID column, saved its tree to /mnt/data/id3_tree.json.)
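
The saved JSON can be reused without retraining. Here is a minimal sketch: it follows the 'type'/'attribute'/'branches'/'label' keys produced by the script, and the example instance uses values that appear in the printed tree above.

# reuse_tree.py -- classify one instance from the saved tree, no retraining needed
import json

def walk(tree, instance):
    # descend through internal nodes until a leaf is reached; None for unseen values
    while tree["type"] == "node":
        tree = tree["branches"].get(str(instance.get(tree["attribute"], "")))
        if tree is None:
            return None
    return tree["label"]

with open("/mnt/data/id3_tree_no_id.json") as f:
    tree = json.load(f)

print(walk(tree, {"Attendance": "Good", "Study_Hours": "Low"}))  # expected: Pass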


6. Explanation of the learned tree (intuition)

  • The first split chosen is Attendance: it yields the highest information gain with respect to Result on the training data (verified in the snippet after this list).

  • If Attendance = Poor, we look at Study_Hours: low/medium → Fail, high → Pass.

  • If Attendance = Good, Study_Hours generally leads to Pass across low/medium/high.

  • This suggests Good attendance is strongly associated with Pass, while with Poor attendance only high study hours lead to a Pass.
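
To verify the root choice numerically, one can reuse the script's information_gain on the training split (a sketch, assuming train_df, target_col, and information_gain from id3_from_scratch.py are in scope, e.g. appended at the end of that script):

# gain_check.py -- compare information gain of the two predictors at the root
for attr in ["Attendance", "Study_Hours"]:
    print(attr, round(information_gain(train_df, attr, target_col), 4))
# Attendance should report the larger gain, matching the root of the learned tree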


7. How to run locally (step-by-step)

  1. Install dependencies:

    pip install pandas scikit-learn

  2. Save the code snippet above as id3_from_scratch.py.

  3. Put your student_dataset_200.csv in the path you set in CSV_PATH (or change the path in the script).

  4. Run:

    python id3_from_scratch.py

  5. Check the saved tree JSON at /mnt/data/id3_tree_no_id.json (adjust the path if you run locally).

