Feature Selection
Feature selection refers to the process of filtering out or removing features from a dataset. There are two main reasons to perform feature selection: removing redundant features (those whose information content is very similar to another feature's) and filtering out irrelevant features (those whose information content is not valuable with respect to the target) that may worsen model performance. Feature selection differs from feature extraction in that selection reduces the number of features, while extraction creates new features or modifies existing ones. A typical approach to feature selection consists of obtaining a measure of “usefulness” for each feature and then eliminating those that do not meet a threshold. Note that no matter which method is used, the best result will likely come from trial and error, since the optimal techniques and tools vary from dataset to dataset.
Information Gain
Information Gain, closely related to Kullback-Leibler divergence, can be defined as a measure of how much a certain feature tells us about a target class. Before discussing its use in feature selection, note that Information Gain can also act as another “metric” for finding the best split in a Decision Tree, alongside Gini Impurity and Entropy. One major drawback of using Information Gain in Decision Trees is that it tends to favor features with many unique values. For example, if the dataset contains an attribute like Date, it would usually not be useful for a Decision Tree to split on that feature, since its values are essentially independent of the target. However, Information Gain may score the Date feature higher than other, more useful features. More generally, when dealing with categorical features, Information Gain favors features with more categories, which is often not ideal. Although Information Gain can occasionally serve as a splitting “metric” in Decision Trees, most of the time it is not used for that purpose because of this disadvantage.
In technical terms, Information Gain is the difference in Entropy before and after a transformation. When applied to feature selection for classification, it measures the statistical dependence between two variables, or how much information the two share; in this context it is often referred to as Mutual Information. In statistics, the term information refers to how surprising an event is: the less probable an event, the more information its occurrence carries. A more balanced probability distribution is less predictable and therefore has higher Entropy. Entropy measures the “purity” of the dataset in terms of the probability distribution of samples across classes. For example, a dataset with perfectly balanced targets (a 50-50 split) has an Entropy of 1, while a dataset with imbalanced targets (a 90-10 split) has a much lower Entropy. Information Gain evaluates the impact on this purity of splitting the dataset by each unique value of a feature. Essentially, it calculates a feature’s usefulness in relation to the target based on how well the feature splits the target.
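As a minimal sketch, the Entropy and Information Gain described above can be computed directly with NumPy; the toy feature and target arrays below are hypothetical and serve only to illustrate the calculation:

```python
import numpy as np

def entropy(labels):
    """Entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A 50-50 split gives an Entropy of 1; a 90-10 split gives a much lower Entropy.
print(entropy(np.array([0] * 50 + [1] * 50)))   # 1.0
print(entropy(np.array([0] * 90 + [1] * 10)))   # ~0.469

def information_gain(feature, target):
    """Information Gain = Entropy(target) minus the weighted Entropy
    after splitting the dataset on each unique value of the feature."""
    total = entropy(target)
    weighted = 0.0
    for value in np.unique(feature):
        mask = feature == value
        weighted += mask.mean() * entropy(target[mask])
    return total - weighted

# Hypothetical feature that splits the target reasonably well.
feature = np.array(["a", "a", "a", "b", "b", "b"])
target = np.array([1, 1, 0, 0, 0, 0])
print(information_gain(feature, target))
```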
Information Gain is most practical as a feature selection technique for relatively small datasets, since its computational cost grows quickly for larger datasets and for features with many unique values.
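In practice, a library implementation is usually preferable to a hand-rolled one. The sketch below assumes scikit-learn is available and uses mutual_info_classif with SelectKBest on a bundled example dataset; the choice of k = 10 is arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Example dataset; replace with your own feature matrix X and target y.
X, y = load_breast_cancer(return_X_y=True)

# Keep the 10 features sharing the most information with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)   # (569, 30) -> (569, 10)
print(selector.scores_)                  # per-feature Mutual Information scores
```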
Variance Threshold
Compared with Information Gain, the Variance Threshold provides a significantly faster and simpler method of feature selection that can still yield decent improvements. It removes every feature whose variance falls below a chosen threshold, on the assumption that a feature that barely varies across samples carries little information about the target. The Variance Threshold is usually used as a baseline feature selector to filter out inadequate features without significant computational cost.
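A minimal sketch with scikit-learn's VarianceThreshold is shown below; the small feature matrix and the 0.05 threshold are hypothetical and chosen only to illustrate the idea:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical feature matrix: the second column is nearly constant.
X = np.array([[1.0, 0.0, 3.2],
              [2.4, 0.0, 1.1],
              [0.7, 0.0, 4.8],
              [3.1, 0.1, 2.9]])

# Drop every feature whose variance falls below the threshold.
selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)

print(selector.variances_)   # variance of each original feature
print(X_reduced.shape)       # (4, 2) -- the near-constant column is removed
```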
High-Correlation Method
One of the most straightforward ways to determine whether features will be adequate indicators of the target is correlation. In statistics, correlation describes the relationship between two variables, producing a measure of how strongly the two are related. The relationship between features and target is arguably the most important factor determining whether the trained model will predict the target well. Features with low correlation to the target act as noise and can reduce the performance of the trained model. Pearson’s Correlation Coefficient, which measures the linear correlation between two variables, is the measure most frequently used for this purpose.
Pearson’s Correlation Coefficient produces a value between –1 and 1; –1 indicates a perfect negative correlation between two variables, while 1 indicates a perfect positive correlation. A value of 0 indicates no linear correlation at all between the variables. Generally, when the value lies above 0.5 or below –0.5, the two variables are considered to have a strong positive or negative correlation, respectively.
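As an illustration, the sketch below computes the Pearson correlation of each feature with the target using pandas and keeps those that clear the 0.5 rule of thumb; the diabetes dataset and the exact threshold are only placeholders for your own data:

```python
from sklearn.datasets import load_diabetes

# Example regression dataset; replace with your own DataFrame and target column.
data = load_diabetes(as_frame=True)
df = data.frame                      # feature columns plus the "target" column

# Pearson correlation of every feature with the target.
correlations = df.corr()["target"].drop("target")
print(correlations.sort_values(ascending=False))

# Keep only features whose absolute correlation with the target exceeds 0.5.
selected = correlations[correlations.abs() > 0.5].index.tolist()
print(selected)
```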
Recursive Feature Elimination
The feature selection techniques introduced in previous sections all work by measuring some individual property of each feature and then deciding which features to remove based on that measure. These methods are universal and can be applied to any dataset using the same pipeline and process. But at the end of the day, feature selection aims to improve model performance, so it is crucial to observe how much each feature specifically contributes to the model. Recursive Feature Elimination (RFE) is a process in which features are removed (eliminated) based on how much they contribute to a trained model.
Due to its effectiveness and flexibility, RFE is one of the most widely used feature selection algorithms. RFE is not a single method or tool; it is a wrapper that can be applied to any model, depending on the use case. In the following example, a Random Forest is used as the model driving feature selection; however, it can be replaced with any other model better suited to the task.
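A minimal sketch of such a setup with scikit-learn's RFE wrapper around a Random Forest might look like the following; the dataset, the number of features to keep, and the other hyperparameters are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Example dataset; substitute your own X and y.
X, y = load_breast_cancer(return_X_y=True)

# Wrap a Random Forest in RFE: repeatedly fit the model, rank features by
# importance, and drop the weakest until only n_features_to_select remain.
estimator = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=estimator, n_features_to_select=10, step=1)
rfe.fit(X, y)

print(rfe.support_)    # boolean mask of the features that were kept
print(rfe.ranking_)    # 1 for selected features; larger ranks were eliminated earlier
X_selected = rfe.transform(X)
```

Because RFE refits the model once per elimination round, it can be expensive for large feature sets; increasing step removes several features per round and speeds things up at the cost of a coarser search.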
Permutation Importance
Permutation Importance can be seen as another way of calculating feature importance. Both Permutation Importance and a model’s built-in feature importance measure how much each feature contributes to the overall prediction. However, the calculation of Permutation Importance is model-agnostic: the algorithm remains the same no matter which machine learning model is used. Its speed depends on how quickly the model can make predictions, but it is still generally faster than other feature selection algorithms such as RFE.
Permutation Importance produces a measure of each feature’s relevance to the target. Logically, features with low Permutation Importance are potentially unnecessary to the model, while features with higher Permutation Importance may be deemed more useful.
The algorithm starts by shuffling the values of one feature across the rows of the validation dataset. After the shuffling, we predict with the trained model and observe the effect the shuffling has on performance. In theory, if a feature is crucial to the model, shuffling it will significantly decrease prediction accuracy. On the other hand, if the shuffled feature contributes little to the prediction, performance will barely change. By computing the loss against the ground-truth values before and after shuffling, we obtain a measure of each feature’s importance from the performance deterioration that shuffling it causes.
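A minimal sketch using scikit-learn's permutation_importance is shown below; the dataset, the model, and the number of repeats are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Example dataset and model; substitute your own trained model and validation split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Shuffle each feature in the validation set 10 times and record how much
# the model's score drops each time.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=42)

# Mean drop in score per feature; near-zero (or negative) values suggest
# the feature contributes little and is a candidate for removal.
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

Averaging over several shuffles (n_repeats) smooths out the randomness of any single permutation.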
LASSO Coefficient Selection
Recall that in Linear Regression, a coefficient is assigned to each feature, acting as a weight that decides how much that feature contributes to the final prediction. Ideally, a perfectly trained regression model would have perfect coefficients and thus perfect feature importances. As RFE and Permutation Importance demonstrate, we can select and remove features based on their importance. Features with low or zero weight are unimportant or do not contribute to the prediction, so we do not need them for training; they only increase training time and may even reduce the performance of our models. LASSO regression does exactly this: depending on the regularization strength (an adjustable hyperparameter), the weights of unimportant features shrink to zero.
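As a minimal sketch, scikit-learn's Lasso can be fit and its nonzero coefficients used as the selected feature set; the dataset and the alpha value below are illustrative:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Example regression dataset; substitute your own X and y.
X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # LASSO is sensitive to feature scale

# alpha controls the regularization strength: the larger it is,
# the more coefficients are driven exactly to zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)

# Keep only the features whose coefficients survived the shrinkage.
selected = np.where(lasso.coef_ != 0)[0]
print("selected feature indices:", selected)
```

Larger alpha values zero out more coefficients, so the boundary between kept and dropped features is effectively tuned through alpha.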