17 Complex Nonlinear Boundaries

This chapter describes how to use inherently nonlinear models (excluding trees), such as neural networks, K-nearest neighbors, naive Bayes, and others.

17.1 Requirements

You’ll need 14 packages (bestNormalize, brulee, discrim, earth, embed, important, kknn, klaR, mda, mirai, nnet, probably, spatialsample, tidymodels) for this chapter:

# skip: fmt
req_pkg <- 
  c("bestNormalize", "brulee", "discrim", "earth", "embed", "important", 
    "kknn", "klaR", "mda", "mirai", "nnet", "probably", "spatialsample", 
    "tidymodels")

# Check to see if they are installed: 
pkg_installed <- vapply(req_pkg, rlang::is_installed, logical(1))

# Install missing packages: 
if ( any(!pkg_installed) ) {
  install_list <- names(pkg_installed)[!pkg_installed]
  pak::pak(install_list)
}

Let’s load the meta package and manage some between-package function conflicts.

library(tidymodels)
library(spatialsample)
library(embed)
library(bestNormalize)
library(discrim)
library(important)
library(probably)

tidymodels_prefer()
theme_set(theme_bw() + theme(legend.position = "top"))
mirai::daemons(parallel::detectCores())

# Some general settings for tuning classification models:
cls_mtr <- metric_set(brier_class, roc_auc, pr_auc, mn_log_loss)
ctrl_grid <-
  control_grid(
    save_pred = TRUE,
    save_workflow = TRUE,
    parallel_over = "everything"
  )

As before, we’ll load the objects that were already computed for the forestation data:

"https://raw.githubusercontent.com/aml4td/website/main/RData/forested_data.RData" |> 
  url() |> 
  load()

17.2 Nonlinear Discriminants

The tidymodels functions required to use discriminant analysis models are contained in the discrim package. We’ll look at two models from that package.

17.2.1 Naive Bayes

There are 2 engines for the naive_Bayes() function in discrim: klaR and naivebayes. We’ll focus on the first (which is the default engine).

The main text mentioned that removing potentially irrelevant predictors can benefit naive Bayes models. To do this, we’ll use a random forest importance filter to do so. This is accomplished using a recipe with a step from the important package:

filter_rec <-
  recipe(class ~ ., data = forested_train) |>
  step_predictor_best(all_predictors(), score = "imp_rf", prop_terms = tune())

We tune over the proportion of predictors retained using candidate values of 20%, 40%, …, 100%. We’ll also tune the kernel width adjustment parameter, with a value of 1.0 corresponding to the default estimated width. Smaller values result in more distribution peaks, while larger values widen the width, allowing more data to be used to compute the density at each point. We’ll make a regular grid from these two:

nb_wflow <- workflow(filter_rec, naive_Bayes(smoothness = tune(), Laplace = 1))

nb_grid <- crossing(
  smoothness = seq(0.5, 1.5, by = 0.1),
  prop_terms = (1:5 / 5)
)

After this, we’ll resample our models and compute performance for each candidate:

set.seed(872)
forest_nb_res <-
  nb_wflow |>
  tune_grid(
    resamples = forested_rs,
    grid = nb_grid,
    metrics = cls_mtr,
    control = ctrl_grid
  )

We can use the autoplot() method to get a quick assessment of the patterns of performance for the two tuning parameters:

autoplot(forest_nb_res, metric = "brier_class")

There was some merit in tuning the kernel smoothness parameter; smaller values produce slightly better performance. That is a fairly unusual pattern since a narrow kernel might lead to overfitting the distributions. For the feature selection step, a larger predictor set appears to be favored. However, the price is small for using the entire training set, depending on the smoothness value. We’ll revisit this again in Section 17.3 where the same recipes are used with a different model.

Overall, these results are not as favorable as those from the logistic regression in the previous chapter. We’ll move on from here to the next discriminant model.

17.2.2 Flexible Discriminants

FDA models are a generalization of linear discriminant analysis that enables us to fit our models using techniques more complex than simple linear regression. Almost any regression function can be used with FDA. Our focus will be on using the multivariate adaptive regression spline (MARS) model with FDA; however, other models can also be considered. See the documentation for mda::fda() for more information.

The best implementation of the MARS model is available in the earth package, which is the default in tidymodels. The discrim package contains discrim_flexible() for training these models.

The main function parameters are:

num_terms: the number of MARS terms (including the intercept) to keep during training.
prod_degree: values of 1 or 2 to indicate if interaction pairs should be considered (prod_degree = 2) or if an additive model (prod_degree = 1) will be used.
prune_method: How should terms be removed? Possible values are: "backward", "none", "exhaustive", "forward", "seqrep", or "cv".

One helpful engine-specific parameter is nk. This is the number of initial MARS terms to generate before pruning. By default, this is limited to \(min(200, max(2p))+1\) where \(p\) is the number of predictors. For our data, this default is 34 terms. However, when the training set has “many” rows, we might want to increase this value to encourage the model to potentially produce more complex models. We could tune this parameter to experiment a little more. However, for demonstration, a static value of nk = 201 was set. Here is the code for that specification:

fda_spec <-
  discrim_flexible(
    num_terms = tune(),
    prod_degree = tune(),
    prune_method = tune()
  ) |>
  set_engine("earth", nk = 201)

fda_wflow <- workflow(class ~ ., fda_spec)

For the grid, another regular grid was used that crossed all three parameters. We limit the number of retained terms to 50 and consider forward and backward pruning. The tuning code is otherwise unremarkable:

shears <- c("backward", "forward")
fda_grid <- crossing(num_terms = 2:50, prod_degree = 1:2, prune_method = shears)

set.seed(872)
forest_fda_res <-
  fda_wflow |>
  tune_grid(
    resamples = forested_rs,
    grid = fda_grid,
    metrics = cls_mtr,
    control = ctrl_grid
  )

Another quick visualization tells us what happened:

autoplot(forest_fda_res, metric = "brier_class")

There was a slight benefit to using interaction models. The difference between panels is difficult to see. For interaction models, about 30 terms should be retained (and around 40 for additive models). We can see that the pruning methods yielded very similar Brier scores, but forward selection might be less noisy.

To demonstrate how to visualize predictive profiles for additive models, let’s fit that model to the training set:

fda_fit <-
  fda_wflow |>
  finalize_workflow(
    tibble(
      num_terms = 40,
      prod_degree = 1,
      prune_method = "forward"
    )
  ) |> 
  fit(forested_train)

For each predictor, we want to vary its value while holding the others constant. Since the models are additive, the constant could be anything, so we use the median for numeric predictors and the first county value.

We’ll create a vector of names for the numeric predictors and get their constant values (plus the country value):

numeric_pred_nms <- setdiff(names(forested_train), c("class", "county"))
numeric_pred <-
  forested_train |>
  select(all_of(numeric_pred_nms))

median_vals <-
  numeric_pred |>
  map_dbl(median) |>
  tibble::as_tibble_row() |>
  mutate(county = levels(forested_train$county)[1])

We also create a representative grid for each predictor using quantiles (with a maximum grid size of 101):

col_seqs <- map(numeric_pred, ~ unique(quantile(.x, probs = (0:100)/ 100)))

We create a placeholder object, profiles, to contain all of the results, then loop over the numeric predictor columns to make the prediction set, compute estimated probabilities, and then bind them to the placeholder:

profiles <- NULL

for (i in seq_along(numeric_pred_nms)) {
  pred_nm <- numeric_pred_nms[i]
  static <- median_vals |> select(-all_of(pred_nm))
  var_seq <- col_seqs[[i]]
  var_grid <- crossing(!!pred_nm := var_seq, static)

  grid <-
    augment(fda_fit, new_data = var_grid) |>
    select(.pred_Yes, all_of(pred_nm)) |>
    set_names("Probability", "Value") |>
    mutate(Predictor = pred_nm)

  profiles <- bind_rows(profiles, grid)
}

Now we can create a plot of profiles:

profiles |> 
  ggplot(aes(Value, Probability)) +
  geom_line() + 
  facet_wrap(~ Predictor, scale = "free") +
  # Since the y-axis values are meaningless for this visualization, we remove it:
  theme(axis.text.y = element_blank())

The flat profiles indicate that the predictor was never used in the model.

Keep in mind that the y-axis for these profiles is relative probability. Their actual values are linear combinations that can’t be displayed in the same visualization.

17.3 K-Nearest Neighbors

The KNN model is fit using the parsnip function nearest_neighbor(). It has a single engine (kknn") with main parameters:

neighbors(): an integer for the number of nearest neighbors.
weight_func(): a character value for the kernel functions to apply to distances (e.g., "rectangular"). Possible values are contained in the vector dials::values_weight_func.
dist_power(): a numeric value for the exponent in the Minkowski distance.

For preprocessing, the predictions should all be numeric and have the same scale/units. Removing irrelevant predictors is also a good idea, as evidenced by FIGURE.

We’ll recycle the recipes from Section 17.2.1 that used a random forest importance filter to screen the predictor set and add an orderNorm transformation to put the predictors on the same symmetric scale. From there, we’ll write a model specification and tag all three parameters for tuning:

knn_spec <-
  nearest_neighbor(
    neighbors = tune(),
    dist_power = tune(),
    weight_func = tune()
  ) |>
  set_mode("classification")

encode_and_filter_rec <-
  filter_rec |>
  step_lencode_mixed(county, outcome = "class") |>
  step_orderNorm(all_numeric_predictors())

knn_wflow <- workflow(encode_and_filter_rec, knn_spec)

For a grid, we’ll set the range of neighbors to 50 locations, and use five different kernel weighting functions:

kern_vals <- c("rectangular", "triangular", "cos", "gaussian", "inv")

knn_param <- 
  knn_wflow |> 
  extract_parameter_set_dials() |> 
  update(
    neighbors = neighbors(c(2, 50)),
    weight_func = weight_func(kern_vals)
  )

Now we tune the model with the familiar syntax:

set.seed(872)
forest_knn_res <-
  knn_wflow |>
  tune_grid(
    resamples = forested_rs,
    grid = 50,
    param_info = knn_param,
    metrics = cls_mtr,
    control = ctrl_grid
  )

Once again, autoplot() is used to discover how each parameter affects model performance:

autoplot(forest_knn_res, metric = "brier_class")

17.4 Neural Networks via Multilayer Perceptrons

There are several engines in tidymodels for fitting basic, feed-forward neural networks for classification: nnet, keras, brulee, and brulee_two_layer. The first three assume a single hidden layer. The focus here will be on using the brulee engine, as it offers the most options.

The parsnip package function for this model is mlp() and has main arguments:

hidden_units: the number of hidden units in the model.
activation: the activation function (e.g., "relu").
penalty: the amount of penalization.
dropout: the proportion of model coefficients (i.e., weights) to set to zero during optimization.
epochs: the number of full passes through the training set.
learn_rate: the learning rate.

Please note that not all engines support all of these parameters. For example, the "net" engine cannot use epoch, dropout, learn_rate, or activation.

For brulee, there are a few helpful engine-specific arguments:

momentum: the number used to specify historical gradient information during optimization.
batch_size: an integer for the number of training set points in each batch.
class_weights: numeric class weights. See brulee::brulee_mlp() for more details.
stop_iter: a non-negative integer for how many iterations with no improvement before stopping. (default: 5L).
rate_schedule: A function to change the learning rate over epochs. See brulee::schedule_decay_time() for details. Possible values include: "decay_time", "decay_expo", "none", "cyclic", and "step".
optimizer: the optimization method. Some example values are: "SGD", "ADAMw", "Adadelta", "Adagrad", and "RMSprop".

To demonstrate, we’ll train our model using the AdamW variant of stochastic gradient descent for a maximum of 25 epochs, but may stop early after five consecutive poor iterations.

Before fitting the model, some preprocessing is necessary. We will need to convert county into one or more numeric columns. We can use binary indicators; however, for demonstration purposes, a supervised effect encoding model is employed to generate a single column of values. After this, we use the orderNorm technique to transform our predictors to have the same standard normal distribution.

encode_rec <-
  recipe(class ~ ., data = forested_train) |>
  step_lencode_mixed(county, outcome = "class") |>
  step_orderNorm(all_numeric_predictors())

Our model specification will tune numrous parameters:

mlp_spec <-
  mlp(
    hidden_units = tune(),
    penalty = tune(),
    learn_rate = tune(),
    epochs = 25,
    activation = tune()
  ) |>
  set_engine(
    "brulee",
    stop_iter = 5,
    optimizer = "ADAMw",
    rate_schedule = tune(),
    batch_size = tune(),
    momentum = tune()
  ) |>
  set_mode("classification")

mlp_flow <- workflow(encode_rec, mlp_spec)

tidymodels is well-versed in each of these parameters and can default to sensible ranges for each. However, we will adjust a few parameter ranges and reduce the number of possible values for the learning rate scheduler and the activation function (for demonstration purposes).

mlp_param <-
  mlp_flow |>
  extract_parameter_set_dials() |>
  update(
    momentum = momentum(c(0.8, 0.99)),
    penalty = penalty(c(-10, -1)),
    learn_rate = learn_rate(c(-4, -1)),
    activation = activation(c("elu", "relu", "tanh", "tanhshrink")),
    rate_schedule = rate_schedule(c("cyclic", "decay_time", "none"))
  )

From here, we can use tune_grid() to run the computations. In the main text, several preprocessors were used with the same model specification. There, we used a workflow set similar to what was shown in the previous chapter for logistic regression.

A note about brulee: it uses the torch ecosystem for computations natively in R (no Python required). We can run in parallel across CPU cores and/or, if your hardware allows, increase computational efficiency by utilizing a GPU.

Note that it can be very difficult to write producible code with deep learning infrastructures like torch or tensorflow. When a GPU is used, it may be impossible to get exactly the same numbers from run to run.

A space-filling design is used to tune:

set.seed(872)
forest_mlp_res <-
  mlp_flow |>
  tune_grid(
    resamples = forested_rs,
    grid = 25,
    param_info = mlp_param,
    metrics = cls_mtr,
    control = ctrl_grid
  )

For previous models, we use the autoplot() methods to quickly assess how well the model performed, identify which parameters were most important, and determine whether our grid search was searching the correct space. There is a constraint on autoplot() that prevents it from running with more than one qualitative parameter. Here we have two (activation and rate_schedule).

We can write some custom code to approximate what we would get from autoplot():

forest_mlp_res |>
  collect_metrics() |>
  filter(.metric == "brier_class") |>
  select(-n, -std_err) |>
  mutate(
    log_penalty = log10(penalty),
    log_learn_rate = log10(learn_rate),
    log_batch_size = log2(batch_size),
    .keep = "unused"
  ) |>
  pivot_longer(
    cols = c(hidden_units, starts_with("log_"), momentum),
    names_to = "Parameter",
    values_to = "Value"
  ) |>
  ggplot(aes(x = Value, y = mean, col = activation, pch = rate_schedule)) +
  geom_point() +
  facet_wrap(~ Parameter, scales = "free_x", nrow = 1) +
  theme(legend.position = "top", legend.box = "vertical") +
  labs(y = "Brier Score")