13  Feature Selection

The book’s Feature Selection chapter focuses on different methods to reduce the number of predictors used by the model.

13.1 Requirements

You’ll need 9 packages (bestNormalize, future.mirai, janitor, kernlab, partykit, QSARdata, ranger, rpart, and tidymodels) for this chapter. You can install them via:

req_pkg <- c("bestNormalize", "future.mirai", "janitor", "kernlab", "partykit", 
             "QSARdata", "ranger", "rpart", "tidymodels")

# Check to see if they are installed: 
pkg_installed <- vapply(req_pkg, rlang::is_installed, logical(1))

# Install missing packages: 
if ( any(!pkg_installed) ) {
  install_list <- names(pkg_installed)[!pkg_installed]
  pak::pak(install_list)
}

# For colino, install from GitHub:
pak::pak("stevenpawley/colino")

We’ll demonstrate these tools using the phospholipidosis (PLD) data set from the QSARdata package. These data are an example of a data set used in drug discovery to help predict when compounds have unwanted side effects (e.g., toxicity).

We’ll consider two predictor sets. One consists of a series of descriptors of the molecules (such as size or weight), and the other is a set of binary indicators for important sub-structures within the molecules. The outcome has two classes (“clean” and toxic outcomes).

Even before initial splitting, there are more predictors (1345) than data points (324), making feature selection an essential task for these data.

Let’s load the meta package, manage some between-package function conflicts, and initialize parallel processing:
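
A minimal sketch of what that might look like (the specific parallel backend is our assumption; future.mirai supplies a mirai-based multisession plan for the future framework):

library(tidymodels)
library(future.mirai)
library(janitor)

# Resolve function name conflicts in favor of tidymodels:
tidymodels_prefer()

# Start parallel workers via the mirai multisession backend:
plan(mirai_multisession)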

There are a few different data frames for these data. We’ll merge two and clean up and shorten the variable names, use a 4:1 split of the data, and then initialize multiple repeats of 10-fold cross-validation:

data(PLD, package = "QSARdata")

drug_data <- 
  # Merge the outcome data with a subset of possible predictors.
  PLD_Outcome %>% 
  full_join(PLD_VolSurfPlus, by = "Molecule") %>% 
  full_join(PLD_PipelinePilot_FP %>% select(1, contains("FCFP")), 
            by = "Molecule") %>% 
  select(-Molecule) %>% 
  clean_names() %>% 
  as_tibble() %>% 
  # Make shorter names:
  rename_with(~ gsub("vol_surf_plus_", "", .x), everything()) %>% 
  rename_with(~ gsub("ppfp_fcfp_", "fp_", .x), everything())

set.seed(106)
drug_split <- initial_split(drug_data, prop = 0.8, strata = class)
drug_train <- training(drug_split)
drug_test <- testing(drug_split)
drug_rs <- vfold_cv(drug_train, repeats = 10, strata = class)

There is a class imbalance for these data:

dim(drug_train)
#> [1]  259 1346
drug_train %>% count(class)
#> # A tibble: 2 × 2
#>   class          n
#>   <fct>      <int>
#> 1 inducer       99
#> 2 noninducer   160

The level “inducer” indicates that the molecule has been proven to cause phospholipidosis.

13.2 Unsupervised Selection

In tidymodels, most preprocessing methods for feature selection are recipe steps.

There are several unsupervised feature filters in recipes:

  • step_zv(): Removes predictors with a single value.
  • step_nzv(freq_cut = double(1), unique_cut = double(1)): Removes sparse and unbalanced predictors, i.e., those with very few unique values where the most frequent value is far more common than the second most frequent.
  • step_corr(threshold = double(1)): Reduces the pairwise correlations between predictors.
  • step_lincomb(): Eliminates strict linear dependencies between predictors.
  • step_filter_missing(threshold = double(1)): Removes predictors that have too many missing values.

The textrecipes package also has a few unsupervised methods for screening tokens (i.e., words):

  • step_tokenfilter(): filter tokens based on term frequency.
  • step_stopwords(): Removes stop words (e.g., “and”, “or”, etc.)
  • step_pos_filter(): Part of speech filtering of tokens.
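
To illustrate how those token screens fit together, here is a small sketch; it assumes the textrecipes package (not used elsewhere in this chapter) and a hypothetical data frame reviews with a text column and a score outcome:

library(textrecipes)

text_rec <- 
  recipe(score ~ text, data = reviews) %>% 
  step_tokenize(text) %>%                     # split the text into word tokens
  step_stopwords(text) %>%                    # drop stop words such as "and" or "or"
  step_tokenfilter(text, max_tokens = 100)    # keep only the most frequent tokens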

We suggest adding these filtering steps early in a recipe, perhaps after any imputation methods.
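
For example, a hypothetical recipe for a data frame dat (with outcome y and some missing numeric predictors) might order the steps like this; the thresholds shown are the step defaults, not recommendations:

filter_rec <- 
  recipe(y ~ ., data = dat) %>% 
  step_impute_median(all_numeric_predictors()) %>%                 # imputation first
  step_zv(all_predictors()) %>%                                    # drop constant columns
  step_nzv(all_predictors(), freq_cut = 95 / 5, unique_cut = 10) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.9)             # reduce extreme correlations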

For the drug toxicity data, we can visualize the amount of predictor correlations by first computing the correlation matrix:

cor_mat <- 
  drug_train %>% 
  select(-class) %>% 
  cor()
#> Warning in cor(.): the standard deviation is zero

Note the warning: some columns have a single unique value. Let’s look at this across the entire set of predictors via a function in the vctrs package:

num_unique <- map_int(drug_train %>% select(-class), vctrs::vec_unique_count)
names(num_unique)[num_unique == 1]
#>  [1] "fp_0167" "fp_0172" "fp_0174" "fp_0178" "fp_0181" "fp_0182" "fp_0183" "fp_0191"
#>  [9] "fp_0193" "fp_0195" "fp_0198" "fp_0200" "fp_0204" "fp_0205" "fp_0206" "fp_0212"

We’ll start a recipe with the step that will eliminate these:

drug_rec <- 
  recipe(class ~ ., data = drug_train) %>% 
  step_zv(all_predictors())

Returning to correlations, let’s plot the distribution of pairwise correlation between predictors:

pairwise_cor <- cor_mat[upper.tri(cor_mat)]

sum(abs(pairwise_cor) >= 3 / 4, na.rm = TRUE)
#> [1] 3922

tibble(correlation = pairwise_cor) %>% 
  filter(!is.na(correlation)) %>% 
  ggplot(aes(x = correlation)) + 
  geom_histogram(binwidth = 0.1, col = "white")

This isn’t too bad, but if we wanted to reduce the extreme pairwise correlations, we could use:

drug_rec %>% 
  step_corr(all_predictors(), threshold = 0.75)
#> 
#> ── Recipe ───────────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:      1
#> predictor: 1345
#> 
#> ── Operations
#> • Zero variance filter on: all_predictors()
#> • Correlation filter on: all_predictors()

or search for an optimal cutoff using threshold = tune() (as we will below).

13.3 Automatic Selection

The text mentioned that there are types of models that automatically select predictors. Tree-based models typically fall into this category.

To demonstrate, let’s fit a Classification and Regression Tree (CART) to the training set and see how many predictors are removed.

Before doing so, let’s turn off one feature of this model. During training, CART computes special alternate splits (“surrogate” and “competing” splits) to aid with things like missing value imputation. We’ll use the built-in feature importance measure to see how many predictors were used but, unfortunately, that measure also credits splits that are not actually used by the model. We prevent these from being computed using rpart.control().

We can pass that to the model fit when we set the engine:

cart_ctrl <- rpart::rpart.control(maxcompete = 0, maxsurrogate = 0)

cart_spec <- 
  decision_tree(mode = "classification") %>% 
  set_engine("rpart", control = !!cart_ctrl)

Note the use of !! (“bang-bang”) when adding cart_ctrl as an engine option. If we had just used control = cart_ctrl, the specification would store a reference to the object “cart_ctrl”, which resides in the global environment. Ordinarily, that works fine. However, if we use parallel processing, that reference is not available to the worker processes, and an error will occur.

Using the bang-bang operator, we replace the reference to “cart_ctrl” with the actual value of that object. It splices the actual data into the model specification so parallel workers can find it. Here’s the model fit:

cart_drug_fit <- cart_spec %>% fit(class ~ ., data = drug_train)
cart_drug_fit
#> parsnip model object
#> 
#> n= 259 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#>  1) root 259 99 noninducer (0.38224 0.61776)  
#>    2) l1lg_s< -2.105 103 30 inducer (0.70874 0.29126)  
#>      4) logp_c_hex>=1.023 70 10 inducer (0.85714 0.14286)  
#>        8) cw7< 0.01674 41  1 inducer (0.97561 0.02439) *
#>        9) cw7>=0.01674 29  9 inducer (0.68966 0.31034)  
#>         18) w6>=33.5 13  0 inducer (1.00000 0.00000) *
#>         19) w6< 33.5 16  7 noninducer (0.43750 0.56250) *
#>      5) logp_c_hex< 1.023 33 13 noninducer (0.39394 0.60606)  
#>       10) fp_0048>=0.5 9  0 inducer (1.00000 0.00000) *
#>       11) fp_0048< 0.5 24  4 noninducer (0.16667 0.83333)  
#>         22) d8>=4.5 7  3 inducer (0.57143 0.42857) *
#>         23) d8< 4.5 17  0 noninducer (0.00000 1.00000) *
#>    3) l1lg_s>=-2.105 156 26 noninducer (0.16667 0.83333)  
#>      6) cw7< 0.002635 9  3 inducer (0.66667 0.33333) *
#>      7) cw7>=0.002635 147 20 noninducer (0.13605 0.86395)  
#>       14) vd>=0.4691 8  3 inducer (0.62500 0.37500) *
#>       15) vd< 0.4691 139 15 noninducer (0.10791 0.89209) *

Of the 1345 predictors, only 7 were actually part of the prediction equations. The partykit package has a nice plot method to visualize the tree:

library(partykit)

cart_drug_party <- 
  cart_drug_fit %>% 
  extract_fit_engine() %>% 
  as.party()
plot(cart_drug_party)

As previously mentioned, trees produced by the rpart package have an internal importance score. To return this, let’s write a small function to pull the rpart object out, extract the importance scores, and then return a data frame with that data:

get_active_features <- function(x) {
  require(tidymodels)
  x %>% 
    extract_fit_engine() %>% 
    pluck("variable.importance") %>% 
    enframe() %>% 
    setNames(c("predictor", "importance"))
}

get_active_features(cart_drug_fit) 
#> # A tibble: 7 × 2
#>   predictor  importance
#>   <chr>           <dbl>
#> 1 l1lg_s         36.46 
#> 2 logp_c_hex      9.624
#> 3 fp_0048         9.091
#> 4 cw7             7.553
#> 5 w6              4.539
#> 6 vd              4.045
#> # ℹ 1 more row

This shows us the 7 predictors used, along with their relative effect on the model.

These results show what happens with the training set, but would a predictor like l1lg_s be consistently selected?

To determine this, we can resample the model and save the importance scores for each of the 100 analysis sets. Let’s take the get_active_features() function and add it to a different control function that will be executed during resampling:

ctrl <- control_resamples(extract = get_active_features)
cart_drug_res <- 
  cart_spec %>% 
  fit_resamples(
    class ~ ., 
    resamples = drug_rs, 
    control = ctrl
  )

Our results will have an extra column called .extracts that contains the results for each resample. Since we didn’t tune this model, .extracts contains a simple tibble with the results:

cart_drug_res$.extracts[[1]]
#> # A tibble: 1 × 2
#>   .extracts        .config             
#>   <list>           <chr>               
#> 1 <tibble [7 × 2]> Preprocessor1_Model1

cart_drug_res$.extracts[[1]]$.extracts
#> [[1]]
#> # A tibble: 7 × 2
#>   predictor  importance
#>   <chr>           <dbl>
#> 1 vd             35.40 
#> 2 logp_c_hex      9.001
#> 3 l1lg_s          7.934
#> 4 aus74           3.75 
#> 5 cw1             3.438
#> 6 fp_0027         3.048
#> # ℹ 1 more row

We can extract the results from all the resamples, unnest, and count the number of times each predictor was selected:

resampled_selection <- 
  cart_drug_res %>% 
  collect_extracts() %>% 
  unnest(.extracts) %>% 
  count(predictor) %>%
  arrange(desc(n))

resampled_selection %>% slice_head(n = 5)
#> # A tibble: 5 × 2
#>   predictor      n
#>   <chr>      <int>
#> 1 l1lg_s        86
#> 2 vd            58
#> 3 logp_c_hex    57
#> 4 cw7           44
#> 5 iw4           43

A visualization illustrates that a small number of predictors were reliably selected:

resampled_selection %>% 
  ggplot(aes(n)) +
  geom_histogram(binwidth = 2, col = "white") +
  labs(x = "# Times Selected (of 100)")

We can also see the model’s performance characteristics:

collect_metrics(cart_drug_res)
#> # A tibble: 3 × 6
#>   .metric     .estimator   mean     n  std_err .config             
#>   <chr>       <chr>       <dbl> <int>    <dbl> <chr>               
#> 1 accuracy    binary     0.7406   100 0.007049 Preprocessor1_Model1
#> 2 brier_class binary     0.2001   100 0.005307 Preprocessor1_Model1
#> 3 roc_auc     binary     0.7559   100 0.008105 Preprocessor1_Model1

One additional note about using tree-based models to automatically select predictors. Many tree ensembles create a collection of individual tree models. For ensembles to work well, this collection should have a diverse set of trees (rather than those with the same splits). To encourage diversity, many tree models have an mtry parameter. This parameter is an integer for the number of predictors in the data set that should be randomly selected when making a split. For example, if mtry = 3, a different random selection of three predictors would be the only ones considered for each split in the tree. This facilitates diversity but also forces irrelevant predictors to be included in the model.

However, this also means that many tree ensembles will have prediction functions that include predictors that have no effect. If we take the same strategy as above, we will vastly overestimate the number of predictors that affect the model.

For this reason, we might consider setting mtry to use the complete predictor set during splitting if we are trying to select predictors. While this might slightly decrease the model’s performance, the false positive rate of finding “important predictors” will be significantly reduced.
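
As a rough sketch of that idea (not from the original text), mtry can be set to the full number of predictors so that every split considers all of them; p below is just the predictor count for the training set used above:

# All columns except the outcome:
p <- ncol(drug_train) - 1

# With mtry = p, each split can choose from every predictor (essentially bagging):
full_mtry_spec <- 
  rand_forest(mtry = p, trees = 1000, mode = "classification") %>% 
  set_engine("ranger", importance = "permutation")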

13.4 Wrapper Methods

tidymodels does not contain any wrapper methods, primarily due to their computational costs.

Several other packages do, most notably caret. For more information on what that package can do, see the feature selection chapters of its documentation at https://topepo.github.io/caret/.

R code from the Feature Engineering and Selection book can also be found at https://github.com/topepo/FES.
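
To show what a wrapper method looks like outside of tidymodels, here is a rough sketch using caret’s recursive feature elimination (caret is not in this chapter’s package list, and the subset sizes are arbitrary choices):

library(caret)

# 10-fold cross-validation with random forest as the underlying model:
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

set.seed(501)
rfe_res <- rfe(
  x = drug_train %>% select(-class),
  y = drug_train$class,
  sizes = c(5, 10, 20, 50),    # candidate subset sizes to evaluate
  rfeControl = rfe_ctrl
)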

13.5 Filter Methods

Currently, the majority of supervised filters live in the colino package (although this will change in the Autumn of 2025). Those include steps such as step_select_vip(), which we will use below.

These steps contain tuning parameters that control how many predictors to retain:

  • top_p specifies the number of predictors to retain, while
  • threshold describes the cut-point for the metric being used to filter them.

Let’s add a supervised filter based on the popular random forest importance scores to demonstrate. The step requires a model declaration via a parsnip specification. We’ll choose a random forest model and optimize the number of top predictors that should be retained and then given to the model.

base_model <- 
  rand_forest(trees = 1000, mode = "classification") %>%
  set_engine("ranger", importance = "permutation")

drug_rec <- 
  recipe(class ~ ., data = drug_train) %>% 
  step_zv(all_predictors(), id = "zv") %>% 
  step_corr(all_numeric_predictors(), threshold = tune(), id = "cor")  %>%
  step_select_vip(
    all_numeric_predictors(),
    outcome = "class",
    model = base_model,
    top_p = tune(),
    id = "vip"
  ) %>%
  step_orderNorm(all_numeric_predictors())

Note that we also add a correlation filter and optimize the exclusion threshold. This helps the random forest model since the inclusion of highly correlated predictors can dilute the importance of the set of related predictors.

We’ll fit a support vector machine model to these data, so the recipe concludes with a step that will normalize the features to have the same distribution (even the binary values).

Now we can specify the supervised model, tag two parameters for optimization, and then add the model and recipe to a workflow:

svm_spec <- 
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>% 
  set_mode("classification")

vip_svm_wflow <- workflow(drug_rec, svm_spec)

Let’s add specific ranges for the supervised filter parameter since its upper range depends on the data dimensions. We’re not sure how many predictors will pass the unsupervised filter steps, but we’ll guess that we should include, at most, 100 predictors. If we overestimate this number, step_select_vip() will adjust the range to the upper limit.

We’ll also adjust the range of the correlation filter to make it more aggressively remove highly correlated predictors:

vip_svm_param <- 
  vip_svm_wflow %>% 
  extract_parameter_set_dials() %>% 
  update(
    top_p = top_p(c(1L, 100L)), 
    threshold = threshold(c(0.50, 0.99))
  )

Finally, let’s tune the model via Bayesian optimization and use the Brier score to guide the process to the best values of the tuning parameters:

ctrl <- control_bayes(no_improve = Inf, parallel_over = "everything")

vip_svm_res <- 
  vip_svm_wflow %>% 
  tune_bayes(
    resamples = drug_rs,
    metrics = metric_set(brier_class, roc_auc),
    initial = 10L,
    iter = 25L,
    control = ctrl,
    param_info = vip_svm_param
  )

A visualization of the process shows that the Brier score improves as the search proceeds:

autoplot(vip_svm_res, metric = "brier_class", type = "performance")

When we plot the parameter choices over iterations, we see that each tuning parameter converges to specific ranges. The number of predictors retained fluctuates, and a few choices could be used (say, between 5 and 15 predictors).

autoplot(vip_svm_res, type = "parameters")

A plot of the parameter values versus the Brier score tells a similar story:

autoplot(vip_svm_res, metric = "brier_class")

The numerically best results are:

show_best(vip_svm_res, metric = "brier_class")
#> # A tibble: 5 × 11
#>    cost rbf_sigma threshold top_p .metric   .estimator   mean     n  std_err .config
#>   <dbl>     <dbl>     <dbl> <int> <chr>     <chr>       <dbl> <int>    <dbl> <chr>  
#> 1 30.67 0.0005070    0.9665    35 brier_cl… binary     0.1331   100 0.003508 Iter10 
#> 2 29.89 0.001532     0.9811    75 brier_cl… binary     0.1349   100 0.003652 Iter11 
#> 3 27.73 0.0002266    0.8979    50 brier_cl… binary     0.1357   100 0.003505 Iter24 
#> 4 23.13 0.0003069    0.8237    38 brier_cl… binary     0.1359   100 0.003820 Iter15 
#> 5 23.82 0.0008604    0.9667    80 brier_cl… binary     0.1382   100 0.003489 Iter12 
#> # ℹ 1 more variable: .iter <int>

best_param <- select_best(vip_svm_res, metric = "brier_class")

Let’s update our workflow with the best parameters, then fit the final model on the entire training set:

set.seed(124)
final_model <- 
  vip_svm_wflow %>% 
  finalize_workflow(best_param) %>% 
  fit(drug_train)
final_model
#> ══ Workflow [trained] ═══════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: svm_rbf()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────────────
#> 4 Recipe Steps
#> 
#> • step_zv()
#> • step_corr()
#> • step_select_vip()
#> • step_orderNorm()
#> 
#> ── Model ────────────────────────────────────────────────────────────────────────────
#> Support Vector Machine object of class "ksvm" 
#> 
#> SV type: C-svc  (classification) 
#>  parameter : cost C = 30.6666439770923 
#> 
#> Gaussian Radial Basis kernel function. 
#>  Hyperparameter : sigma =  0.000507017083942551 
#> 
#> Number of Support Vectors : 126 
#> 
#> Objective Function Value : -3333 
#> Training error : 0.142857 
#> Probability model included.

How many predictors were removed, and how many made it to the final model? We can write a function to use the tidy() method on the recipe steps to assess what was eliminated. The “mold” for the workflow can also tell us how many predictors were passed to the SVM model:

get_filter_info <- function(x) {
  fit_rec <- extract_recipe(x)
  # The tidy methods show the predictors that were eliminated:
  corr_rm <- nrow(tidy(fit_rec, id = "cor"))
  zv_rm <- nrow(tidy(fit_rec, id = "zv"))
  vip_rm <- nrow(tidy(fit_rec, id = "vip"))
  
  # The mold has a 'predictors' element that describes the
  # columns which are given to the model: 
  kept <- 
    x %>% 
    extract_mold() %>% 
    pluck("predictors") %>% 
    ncol()
  # We'll save them as a tibble:
  tibble(corr_rm, zv_rm, vip_rm, kept)
}

get_filter_info(final_model)
#> # A tibble: 1 × 4
#>   corr_rm zv_rm vip_rm  kept
#>     <int> <int>  <int> <int>
#> 1     682    16    612    35

The correlation filter removes a large number of predictors, which is not surprising for this type of data set.

How does the model work on the test set of 65 molecules?

test_pred <- augment(final_model, drug_test)
test_pred %>% brier_class(class, .pred_inducer)
#> # A tibble: 1 × 3
#>   .metric     .estimator .estimate
#>   <chr>       <chr>          <dbl>
#> 1 brier_class binary        0.1178
test_pred %>% roc_auc(class, .pred_inducer)
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.899
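
Other test set summaries can be computed from the same augmented predictions. For example, a quick sketch using the hard class predictions that augment() also adds:

test_pred %>% conf_mat(class, .pred_class)
test_pred %>% accuracy(class, .pred_class)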