Information Gain (IG) is a metric used to select the best feature to split the data in a decision tree. It measures how much uncertainty (entropy) is reduced after the split.
$$\text{Information Gain} = \text{Entropy (before split)} - \text{Weighted Entropy (after split)}$$
Where:
$$\text{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
Here $p_i$ is the proportion of samples in $S$ belonging to class $i$, and $c$ is the number of classes.
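As a minimal sketch, assuming the labels and feature values are NumPy arrays of discrete values, the two quantities can be computed like this (the names `entropy` and `information_gain` are just illustrative, not from any particular library):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """Entropy before the split minus the weighted entropy of the
    child partitions created by each distinct feature value."""
    parent_entropy = entropy(labels)
    weighted_child_entropy = 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        weight = mask.sum() / len(labels)
        weighted_child_entropy += weight * entropy(labels[mask])
    return parent_entropy - weighted_child_entropy
```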
Problem: Information Gain is biased toward features with many distinct values, such as IDs or other near-unique attributes.
Such features create many small, pure partitions, which looks good because it reduces entropy, but doesn't generalize well.
If you split on a feature like "Customer ID", each value can go to its own branch: entropy becomes 0 and Information Gain is maximal, but the tree becomes overfitted and useless on new data.
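Continuing the sketch above with made-up toy data (the arrays `y`, `weather`, and `customer_id` are purely hypothetical), an ID-like feature wins the Information Gain comparison even though it carries no predictive signal:

```python
import numpy as np
# uses entropy / information_gain from the sketch above

# Toy data: 8 samples, binary target.
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# A normal two-valued feature vs. an ID-like feature that is unique per sample.
weather = np.array(["sun", "sun", "rain", "rain", "sun", "rain", "sun", "rain"])
customer_id = np.arange(len(y))

print(information_gain(y, weather))      # modest gain (~0.19 bits)
print(information_gain(y, customer_id))  # equals entropy(y): every leaf is pure
```

The ID feature achieves the maximum possible gain simply by memorizing each sample, which is exactly the overfitting described above.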