8  Interactions and Nonlinear Features

This chapter is focused on how predictors can enter the model in a nonlinear fashion. The three approaches discussed are interaction effects, basis expansions, and discretization.

8.1 Requirements

You’ll need 5 packages (aorsf, embed, gt, hstats, and tidymodels) for this chapter. You can install them via:

req_pkg <- c("aorsf", "embed", "gt", "hstats", "tidymodels")

# Check to see if they are installed: 
pkg_installed <- vapply(req_pkg, rlang::is_installed, logical(1))

# Install missing packages: 
if ( any(!pkg_installed) ) {
  install_list <- names(pkg_installed)[!pkg_installed]
  pak::pak(install_list)
}

Let’s load the meta package and manage some between-package function conflicts.
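
The tidymodels meta-package attaches the core packages, and its tidymodels_prefer() helper resolves common function name conflicts in their favor:

library(tidymodels)

# Resolve naming conflicts in favor of the tidymodels packages:
tidymodels_prefer()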

8.2 Interactions

As the text mentions, interactions involve two or more predictors (of any data type). An interaction means that the relationship between the outcome and the predictors involved cannot be articulated by looking at one predictor at a time; they act in concert.

To get started, let’s once again load the food delivery data and use the same split:

data(deliveries, package = "modeldata")

set.seed(991)
delivery_split <- initial_validation_split(deliveries, prop = c(0.6, 0.2), strata = time_to_delivery)
delivery_train <- training(delivery_split)
delivery_test  <- testing(delivery_split)
delivery_val   <- validation(delivery_split)
delivery_rs    <- validation_set(delivery_split)

## Some functions to facilitate the content

dec_to_time <- function(x) {
  # Convert decimal minutes (e.g., 8.5) to text such as "8 minutes and 30 seconds"
  mins <- floor(x)
  secs <- floor((x - mins) * 60)
  res <- cli::pluralize("{mins} minute{?s} and {secs} second{?s}")
  res <- as.character(res)
  # Drop the seconds portion when it is zero:
  res <- gsub(" and 0 seconds", "", res)
  res
}

dec_to_time_rs <- function(x) {
  # Pull the resampled MAE estimate and format it as minutes and seconds
  collect_metrics(x) %>% 
    filter(.metric == "mae") %>% 
    pluck("mean") %>% 
    dec_to_time()
}
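
For example:

dec_to_time(8.5)
#> [1] "8 minutes and 30 seconds"
dec_to_time(1)
#> [1] "1 minute"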

We’ll consider two mechanisms to encode interaction columns: via the base R formula method and with recipes.

8.2.1 Interactions with Model Formulas

This was briefly discussed in Section 5.3.

The main operator that creates interactions is the colon. Using a:b within a formula will create the appropriate columns that encode the interactions. The specifics depend on what type of data are in columns a and b:

  • If both are numeric, an additional column, which is their product, is added to the model matrix. By default, the base R formula method gives it the name "a:b".

  • If one is numeric and the other is categorical, R first converts the categorical (i.e., factor) column into binary indicator variables, then makes columns that are the product of each indicator column and the original numeric column.

  • If both are categorical, indicators are made for both and then their corresponding product pairs are created (binary times binary columns).

Let’s make a small example to demonstrate:

library(gt)

interaction_example <- 
  delivery_test %>% 
  slice(1, .by = day) %>% 
  select(day, hour, distance) %>% 
  arrange(day)

interaction_example %>% gt()
day   hour  distance
Mon  19.38      2.59
Tue  12.14      2.40
Wed  18.03      2.96
Thu  16.22      3.25
Fri  12.88      2.88
Sat  12.94      2.47
Sun  12.96      3.80

For two numeric predictors:

model.matrix(~ hour + distance + hour:distance, interaction_example) %>% 
  as_tibble() %>% 
  select(-`(Intercept)`) %>% 
  gt()
 hour  distance  hour:distance
19.38      2.59          50.19
12.14      2.40          29.14
18.03      2.96          53.36
16.22      3.25          52.71
12.88      2.88          37.09
12.94      2.47          31.97
12.96      3.80          49.26

One numeric and one factor predictor:

model.matrix(~ day + distance + day:distance, interaction_example) %>% 
  as_tibble() %>% 
  select(-`(Intercept)`) %>% 
  gt()
dayTue dayWed dayThu dayFri daySat daySun distance dayTue:distance dayWed:distance dayThu:distance dayFri:distance daySat:distance daySun:distance
     0      0      0      0      0      0     2.59             0.0            0.00            0.00            0.00            0.00             0.0
     1      0      0      0      0      0     2.40             2.4            0.00            0.00            0.00            0.00             0.0
     0      1      0      0      0      0     2.96             0.0            2.96            0.00            0.00            0.00             0.0
     0      0      1      0      0      0     3.25             0.0            0.00            3.25            0.00            0.00             0.0
     0      0      0      1      0      0     2.88             0.0            0.00            0.00            2.88            0.00             0.0
     0      0      0      0      1      0     2.47             0.0            0.00            0.00            0.00            2.47             0.0
     0      0      0      0      0      1     3.80             0.0            0.00            0.00            0.00            0.00             3.8

If you want to make all possible interactions, you can use the dot and exponent operator:

# All possible two-way interactions:
model.matrix(~ (.)^2, interaction_example)

# All possible two- and three-way interactions, etc:
model.matrix(~ (.)^3, interaction_example)

8.2.2 Interactions from Recipes

The recipes package has step_interact(). This step is very atypical since it uses a formula to specify the inputs (rather than a set of dplyr selectors). It also requires that any qualitative columns already be converted to indicator variables (perhaps using step_dummy()).

The formula passed to step_interact() also uses colons to declare interactions, but it has two special differences from base R formulas:

  • you can use dplyr selectors to choose the columns to interact, and
  • the resulting interaction columns use _x_ as the default separator in their names.

For continuous/continuous interactions:

recipe(~ hour + distance, data = interaction_example) %>% 
  step_interact(~ hour:distance) %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  gt()
 hour  distance  hour_x_distance
19.38      2.59            50.19
12.14      2.40            29.14
18.03      2.96            53.36
16.22      3.25            52.71
12.88      2.88            37.09
12.94      2.47            31.97
12.96      3.80            49.26

For categorical/continuous combinations, we use step_dummy() first and then a selector to create the interactions:

recipe(~ day + hour, data = interaction_example) %>% 
  step_dummy(all_factor_predictors()) %>% 
  step_interact(~ starts_with("day_"):hour) %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  gt()
 hour day_Tue day_Wed day_Thu day_Fri day_Sat day_Sun day_Tue_x_hour day_Wed_x_hour day_Thu_x_hour day_Fri_x_hour day_Sat_x_hour day_Sun_x_hour
19.38       0       0       0       0       0       0           0.00           0.00           0.00           0.00           0.00           0.00
12.14       1       0       0       0       0       0          12.14           0.00           0.00           0.00           0.00           0.00
18.03       0       1       0       0       0       0           0.00          18.03           0.00           0.00           0.00           0.00
16.22       0       0       1       0       0       0           0.00           0.00          16.22           0.00           0.00           0.00
12.88       0       0       0       1       0       0           0.00           0.00           0.00          12.88           0.00           0.00
12.94       0       0       0       0       1       0           0.00           0.00           0.00           0.00          12.94           0.00
12.96       0       0       0       0       0       1           0.00           0.00           0.00           0.00           0.00          12.96

8.2.3 Detecting Interactions

A few different packages can compute \(H\)-statistics (and similar quantities). We’ll focus on the hstats package since the alternative functions are tied to specific types of models.

For this chapter, we fit an oblique random forest model on which to compute the \(H\)-statistics. We’ll defer the details of that model until a later chapter, but the code was:

library(aorsf)

p <- ncol(delivery_train) - 1

set.seed(1)
orf_fit <- orsf(time_to_delivery ~ ., data = delivery_train, mtry = p, n_tree = 50)

Any ML model can be used with the hstats package. We’ll use the validation set to compute the statistics; this lends them more validity than if we had computed them on the training set. There are a lot of computations here, so this may take a while to run:

library(hstats)

set.seed(218)
orf_hstats <-
  hstats(orf_fit,
         X = delivery_val %>% dplyr::select(-time_to_delivery),
         # Prioritize the top 10 individual predictors for computing potential
         # pairwise interactions.
         pairwise_m = 10,
         # We can run a little faster by using quantiles to approximate the 
         # predictor distributions. 
         approx = TRUE,
         # How many random data points are used for the computations.
         n_max = 1000,
         verbose = FALSE)

orf_two_way_int_obj <- h2_pairwise(orf_hstats, zero = TRUE)
orf_two_way_int_obj
#> Pairwise H^2 (normalized)
#>                       [,1]
#> hour:day         6.623e-02
#> item_01:item_10  2.606e-02
#> day:distance     1.554e-02
#> hour:distance    4.647e-03
#> item_10:item_24  4.074e-03
#> day:item_10      4.043e-03
#> day:item_24      1.897e-03
#> day:item_01      1.792e-03
#> item_07:item_24  1.187e-03
#> item_01:item_24  1.149e-03
#> distance:item_24 1.022e-03
#> item_01:item_07  1.004e-03
#> item_08:item_10  9.924e-04
#> item_02:item_03  9.040e-04
#> distance:item_10 7.654e-04
#> day:item_02      7.413e-04
#> item_03:item_07  6.627e-04
#> item_01:item_08  6.616e-04
#> day:item_03      6.051e-04
#> day:item_07      5.829e-04
#> item_02:item_24  5.552e-04
#> item_08:item_24  5.536e-04
#> distance:item_02 5.458e-04
#> item_07:item_10  5.246e-04
#> distance:item_03 5.144e-04
#> distance:item_01 4.824e-04
#> distance:item_07 4.761e-04
#> item_03:item_24  4.467e-04
#> item_03:item_10  3.985e-04
#> hour:item_10     3.515e-04
#> hour:item_01     2.623e-04
#> hour:item_24     2.140e-04
#> item_07:item_08  2.086e-04
#> day:item_08      1.894e-04
#> item_02:item_10  1.800e-04
#> distance:item_08 1.757e-04
#> hour:item_08     1.711e-04
#> item_02:item_07  1.532e-04
#> hour:item_02     1.489e-04
#> hour:item_03     1.368e-04
#> item_03:item_08  1.292e-04
#> item_01:item_03  9.561e-05
#> item_01:item_02  8.399e-05
#> item_02:item_08  5.081e-05
#> hour:item_07     3.284e-05

From these results, we could encode the most important interactions by adding another step_interact() layer to our recipe.
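
As a minimal sketch (a hypothetical recipe, not one fit in this chapter), the two largest \(H\)-statistics above could be encoded as:

interaction_rec <- 
  recipe(time_to_delivery ~ ., data = delivery_train) %>% 
  step_dummy(all_factor_predictors()) %>% 
  # hour:day needs the day indicator columns; item_01:item_10 is numeric/numeric:
  step_interact(~ starts_with("day_"):hour + item_01:item_10)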

8.3 Polynomial Basis Expansions

There are two main recipe steps that produce different types of polynomial expansions:

  • step_poly(), which creates orthogonal polynomial basis functions, and
  • step_poly_bernstein(), which creates Bernstein polynomial basis functions.

We can pass the columns of interest to each step:

recipe(time_to_delivery ~ ., data = delivery_train) %>% 
  step_poly(hour, degree = 4) %>% 
  prep() %>% 
  bake(new_data = NULL, starts_with("hour"))
#> # A tibble: 6,004 × 4
#>   hour_poly_1 hour_poly_2 hour_poly_3 hour_poly_4
#>         <dbl>       <dbl>       <dbl>       <dbl>
#> 1   -0.02318    0.02240     -0.01758     0.008000
#> 2    0.01558    0.01027     -0.003123   -0.01343 
#> 3    0.01105   -0.0003742   -0.01174    -0.01039 
#> 4   -0.002365  -0.01319      0.004819    0.01182 
#> 5   -0.01761    0.006275     0.009360   -0.01725 
#> 6   -0.02321    0.02249     -0.01777     0.008254
#> # ℹ 5,998 more rows

# or

recipe(time_to_delivery ~ ., data = delivery_train) %>% 
  step_poly_bernstein(hour, degree = 4) %>% 
  prep() %>% 
  bake(new_data = NULL, starts_with("hour"))
#> # A tibble: 6,004 × 4
#>    hour_1  hour_2   hour_3     hour_4
#>     <dbl>   <dbl>    <dbl>      <dbl>
#> 1 0.2593  0.03588 0.002207 0.00005090
#> 2 0.01665 0.1209  0.3899   0.4717    
#> 3 0.05106 0.2201  0.4216   0.3028    
#> 4 0.2658  0.3742  0.2342   0.05495   
#> 5 0.4047  0.1437  0.02267  0.001341  
#> 6 0.2582  0.03549 0.002169 0.00004969
#> # ℹ 5,998 more rows

If we would like different polynomial degrees for different columns, we would call the step multiple times.
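
For example, a sketch with hypothetical degree choices:

recipe(time_to_delivery ~ ., data = delivery_train) %>% 
  step_poly(hour, degree = 4) %>% 
  step_poly(distance, degree = 2)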

8.4 Spline Functions

recipes has several spline steps:

apropos("step_spline")
#> [1] "step_spline_b"           "step_spline_convex"      "step_spline_monotone"   
#> [4] "step_spline_natural"     "step_spline_nonnegative"

The syntax is almost identical to the polynomial steps. For example, for natural splines:

recipe(time_to_delivery ~ ., data = delivery_train) %>% 
  step_spline_natural(hour, deg_free = 4) %>% 
  prep() %>% 
  bake(new_data = NULL, starts_with("hour"))
#> # A tibble: 6,004 × 4
#>   hour_1   hour_2  hour_3  hour_4
#>    <dbl>    <dbl>   <dbl>   <dbl>
#> 1 0.1970 0.004302 0       0      
#> 2 0      0.06326  0.4307  0.3122 
#> 3 0      0.2167   0.4816  0.2611 
#> 4 0.2030 0.6655   0.03618 0.01668
#> 5 0.3939 0.05003  0       0      
#> 6 0.1959 0.004225 0       0      
#> # ℹ 5,998 more rows

Some steps also have a degree argument for the polynomial degree of the spline. Also, most basis expansion steps have an options argument for passing additional specifications, such as a vector of knots.
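
For example, a sketch that passes hypothetical interior knots for hour (the options list is forwarded to the underlying basis function):

recipe(time_to_delivery ~ ., data = delivery_train) %>% 
  # Hypothetical knot locations passed through to the basis function:
  step_spline_b(hour, degree = 2, options = list(knots = c(13, 15, 17)))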

8.5 Discretization

There is one step in recipes, step_discretize(), that conducts unsupervised binning, and the embed package has two steps for supervised methods: step_discretize_cart() and step_discretize_xgb().

For unsupervised binning:

basic_binning <- 
  recipe(time_to_delivery ~ ., data = delivery_train) %>% 
  step_discretize(hour, num_breaks = 4) %>% 
  prep()

bake(basic_binning, new_data = NULL, hour)
#> # A tibble: 6,004 × 1
#>   hour 
#>   <fct>
#> 1 bin1 
#> 2 bin4 
#> 3 bin4 
#> 4 bin2 
#> 5 bin1 
#> 6 bin1 
#> # ℹ 5,998 more rows

The hour column has been converted to a factor. There is also an argument called min_unique that defines the “line of dignity” for the binning: if the number of unique values per bin implied by num_breaks falls below min_unique, binning is not used.
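
For example, a sketch with a hypothetical threshold:

recipe(time_to_delivery ~ ., data = delivery_train) %>% 
  # Skip binning unless each bin would contain at least 100 unique values:
  step_discretize(hour, num_breaks = 4, min_unique = 100)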

To understand where the breakpoints reside, the tidy() method will show them:

tidy(basic_binning, number = 1)
#> # A tibble: 5 × 3
#>   terms   value id              
#>   <chr>   <dbl> <chr>           
#> 1 hour  -Inf    discretize_eoInz
#> 2 hour    14.48 discretize_eoInz
#> 3 hour    16.57 discretize_eoInz
#> 4 hour    18.21 discretize_eoInz
#> 5 hour   Inf    discretize_eoInz

# note: 
quantile(delivery_train$hour, probs = (0:4) / 4)
#>    0%   25%   50%   75%  100% 
#> 11.07 14.48 16.57 18.21 20.92

The use of infinite bounds allows data falling outside of the range of the training set to be processed.

If you want to use custom breakpoints, recipes::step_cut() is also available.
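
For example, a sketch with hypothetical breakpoints:

recipe(time_to_delivery ~ ., data = delivery_train) %>% 
  step_cut(hour, breaks = c(12, 15, 18))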

For supervised methods, the syntax is different in two ways. First, the outcome column is specified using the outcome argument. Second, additional parameters control the complexity of the results (i.e., the number of bins); these depend on the binning model. For CART binning, the arguments are cost_complexity, tree_depth, and min_n. They are different for the step that uses boosted trees.

The format of the results is very similar though:

library(embed)
cart_binning <-
  recipe(time_to_delivery ~ ., data = delivery_train) %>%
  step_discretize_cart(hour,
                       outcome = vars(time_to_delivery),
                       cost_complexity = 0.001) %>%
  prep()

bake(cart_binning, new_data = NULL, hour)
#> # A tibble: 6,004 × 1
#>   hour         
#>   <fct>        
#> 1 [-Inf,12.28) 
#> 2 [18.97,19.68)
#> 3 [16.63,18.97)
#> 4 [15.73,16.24)
#> 5 [12.28,13.23)
#> 6 [-Inf,12.28) 
#> # ℹ 5,998 more rows

tidy(cart_binning, number = 1)
#> # A tibble: 11 × 3
#>   terms value id                   
#>   <chr> <dbl> <chr>                
#> 1 hour  12.28 discretize_cart_ChWJu
#> 2 hour  13.23 discretize_cart_ChWJu
#> 3 hour  13.96 discretize_cart_ChWJu
#> 4 hour  14.77 discretize_cart_ChWJu
#> 5 hour  15.73 discretize_cart_ChWJu
#> 6 hour  16.24 discretize_cart_ChWJu
#> # ℹ 5 more rows

As you can see, the breakpoints are built into the factor levels.