8 Interactions and Nonlinear Features
This chapter is focused on how predictors can enter the model in a nonlinear fashion. The three approaches discussed are interaction effects, basis expansions, and discretization.
8.1 Requirements
You’ll need 5 packages (aorsf, embed, gt, hstats, and tidymodels) for this chapter. You can install them via:
Let’s load the meta package and manage some between-package function conflicts.
library(tidymodels)
library(embed)
tidymodels_prefer()
theme_set(theme_bw())
8.2 Interactions
As the text mentions, interactions involve two or more predictors (of any data type). An interaction means that the relationship between the outcome and the predictors involved cannot be articulated by looking at one predictor at a time; they act in concert.
To get started, let’s once again load the food delivery data and use the same split:
data(deliveries, package = "modeldata")
set.seed(991)
delivery_split <- initial_validation_split(deliveries, prop = c(0.6, 0.2), strata = time_to_delivery)
delivery_train <- training(delivery_split)
delivery_test <- testing(delivery_split)
delivery_val <- validation(delivery_split)
delivery_rs <- validation_set(delivery_split)
## Some functions to faciltate the content
dec_to_time <- function(x) {
mins <- floor(x)
dec <- floor((x - mins) * 60)
res <- cli::pluralize("{mins} minutes{?s} and {dec} second{?s}")
res <- as.character(res)
res <- gsub(" and 0 seconds", "", res)
res <- gsub("1 minutes", "1 minute", res)
res
}
dec_to_time_rs <- function(x) {
collect_metrics(x) %>%
filter(.metric == "mae") %>%
pluck("mean") %>%
dec_to_time
}
We’ll consider two mechanisms to encode interaction columns: via the base R formula method and with recipes.
8.2.1 Interactions with Model Formulas
This was briefly discussed in Section 5.3.
The main operator that creates interactions is the colon. Using a:b
within a formula will create the appropriate columns that encode the interactions. The specifics depend on what type of data are in columns a
and b
:
If both are numeric, an additional column, which is their product, is added to the model matrix. By default, the base R formula method gives it the name
"a:b"
.If one is numeric and the other is categorical, R first converts the categorical (i.e., factor) column into binary indicator variables, then makes columns that are the produce of each indicator column and the original numeric column.
If both are categorical, indicators are made for both and then their corresponding product pairs are created (binary times binary columns).
Let’s make a small example to demonstrate:
library(gt)
interaction_example <-
delivery_test %>%
slice(1, .by = day) %>%
select(day, hour, distance) %>%
arrange(day)
interaction_example %>% gt()
day | hour | distance |
---|---|---|
Mon | 19.38 | 2.59 |
Tue | 12.14 | 2.40 |
Wed | 18.03 | 2.96 |
Thu | 16.22 | 3.25 |
Fri | 12.88 | 2.88 |
Sat | 12.94 | 2.47 |
Sun | 12.96 | 3.80 |
For two numeric predictors:
model.matrix(~ hour + distance + hour:distance, interaction_example) %>%
as_tibble() %>%
select(-`(Intercept)`) %>%
gt()
hour | distance | hour:distance |
---|---|---|
19.38 | 2.59 | 50.19 |
12.14 | 2.40 | 29.14 |
18.03 | 2.96 | 53.36 |
16.22 | 3.25 | 52.71 |
12.88 | 2.88 | 37.09 |
12.94 | 2.47 | 31.97 |
12.96 | 3.80 | 49.26 |
One numeric and one factor predictor:
model.matrix(~ day + distance + day:distance, interaction_example) %>%
as_tibble() %>%
select(-`(Intercept)`) %>%
gt()
dayTue | dayWed | dayThu | dayFri | daySat | daySun | distance | dayTue:distance | dayWed:distance | dayThu:distance | dayFri:distance | daySat:distance | daySun:distance |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 2.59 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 |
1 | 0 | 0 | 0 | 0 | 0 | 2.40 | 2.4 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 |
0 | 1 | 0 | 0 | 0 | 0 | 2.96 | 0.0 | 2.96 | 0.00 | 0.00 | 0.00 | 0.0 |
0 | 0 | 1 | 0 | 0 | 0 | 3.25 | 0.0 | 0.00 | 3.25 | 0.00 | 0.00 | 0.0 |
0 | 0 | 0 | 1 | 0 | 0 | 2.88 | 0.0 | 0.00 | 0.00 | 2.88 | 0.00 | 0.0 |
0 | 0 | 0 | 0 | 1 | 0 | 2.47 | 0.0 | 0.00 | 0.00 | 0.00 | 2.47 | 0.0 |
0 | 0 | 0 | 0 | 0 | 1 | 3.80 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 3.8 |
If you want to make all possible interactions, you can use the dot and exponent operator:
# All possible two-way interactions:
model.matrix(~ (.)^2, interaction_example)
# All possible two- and three-way interactions, etc:
model.matrix(~ (.)^3, interaction_example)
8.2.2 Interactions from Recipes
The recipes has step_interact()
. This step is very atypical since it uses a formula to specify the inputs (rather than a set of dplyr selectors). It also requires all columns used to be already converted to indicators (perhaps using step_dummy()
).
The formula passed to step_interact()
also uses colons to declare interactions, but it has two special differences from base R formulas:
- you can use dplyr selectors to select the columns to interact with and
- the resulting interaction columns use
_x_
as the default seperator in the names.
For continuous/continuous interactions:
recipe(~ hour + distance, data = interaction_example) %>%
step_interact(~ hour:distance) %>%
prep() %>%
bake(new_data = NULL) %>%
gt()
hour | distance | hour_x_distance |
---|---|---|
19.38 | 2.59 | 50.19 |
12.14 | 2.40 | 29.14 |
18.03 | 2.96 | 53.36 |
16.22 | 3.25 | 52.71 |
12.88 | 2.88 | 37.09 |
12.94 | 2.47 | 31.97 |
12.96 | 3.80 | 49.26 |
For categorical/continuous combinations, we use step_dummy()
first and then a selector to make the interactions
recipe(~ day + hour, data = interaction_example) %>%
step_dummy(all_factor_predictors()) %>%
step_interact(~ starts_with("day_"):hour) %>%
prep() %>%
bake(new_data = NULL) %>%
gt()
hour | day_Tue | day_Wed | day_Thu | day_Fri | day_Sat | day_Sun | day_Tue_x_hour | day_Wed_x_hour | day_Thu_x_hour | day_Fri_x_hour | day_Sat_x_hour | day_Sun_x_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|
19.38 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
12.14 | 1 | 0 | 0 | 0 | 0 | 0 | 12.14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
18.03 | 0 | 1 | 0 | 0 | 0 | 0 | 0.00 | 18.03 | 0.00 | 0.00 | 0.00 | 0.00 |
16.22 | 0 | 0 | 1 | 0 | 0 | 0 | 0.00 | 0.00 | 16.22 | 0.00 | 0.00 | 0.00 |
12.88 | 0 | 0 | 0 | 1 | 0 | 0 | 0.00 | 0.00 | 0.00 | 12.88 | 0.00 | 0.00 |
12.94 | 0 | 0 | 0 | 0 | 1 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 12.94 | 0.00 |
12.96 | 0 | 0 | 0 | 0 | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 12.96 |
8.2.3 Detecting Interactions
A few different packages can compute \(H\)-statistics (and similar quantities):
- hstats
-
pre, specifically the
interact()
andbsnullinteract()
functions -
bartMachine has
interaction_investigator()
-
aorsf has
orsf_vint()
and so on. We’ll focus on the first since the other functions are tied to specific models.
For this chapter, we fit an oblique random forest model to compute the \(H\)-statistics. We’ll defer the details of that model fit until a later chapter but the code was:
Any ML model can be used with the hstats package. We’ll use the validation set to compute the values; this lends more validity to the values (although we could have used the training set). There are a lot of computations here; this may take a bit to compute:
library(hstats)
set.seed(218)
orf_hstats <-
hstats(orf_fit,
X = delivery_val %>% dplyr::select(-time_to_delivery),
# Prioritize the top 10 individual predictors for computing potential
# pairwise interactions.
pairwise_m = 10,
# We can run a little faster by using quantiles to approximate the
# predictor distributions.
approx = TRUE,
# How many random data points are used for the computations.
n_max = 1000,
verbose = FALSE)
orf_two_way_int_obj <- h2_pairwise(orf_hstats, zero = TRUE)
orf_two_way_int_obj
#> Pairwise H^2 (normalized)
#> [,1]
#> hour:day 6.623e-02
#> item_01:item_10 2.606e-02
#> day:distance 1.554e-02
#> hour:distance 4.647e-03
#> item_10:item_24 4.074e-03
#> day:item_10 4.043e-03
#> day:item_24 1.897e-03
#> day:item_01 1.792e-03
#> item_07:item_24 1.187e-03
#> item_01:item_24 1.149e-03
#> distance:item_24 1.022e-03
#> item_01:item_07 1.004e-03
#> item_08:item_10 9.924e-04
#> item_02:item_03 9.040e-04
#> distance:item_10 7.654e-04
#> day:item_02 7.413e-04
#> item_03:item_07 6.627e-04
#> item_01:item_08 6.616e-04
#> day:item_03 6.051e-04
#> day:item_07 5.829e-04
#> item_02:item_24 5.552e-04
#> item_08:item_24 5.536e-04
#> distance:item_02 5.458e-04
#> item_07:item_10 5.246e-04
#> distance:item_03 5.144e-04
#> distance:item_01 4.824e-04
#> distance:item_07 4.761e-04
#> item_03:item_24 4.467e-04
#> item_03:item_10 3.985e-04
#> hour:item_10 3.515e-04
#> hour:item_01 2.623e-04
#> hour:item_24 2.140e-04
#> item_07:item_08 2.086e-04
#> day:item_08 1.894e-04
#> item_02:item_10 1.800e-04
#> distance:item_08 1.757e-04
#> hour:item_08 1.711e-04
#> item_02:item_07 1.532e-04
#> hour:item_02 1.489e-04
#> hour:item_03 1.368e-04
#> item_03:item_08 1.292e-04
#> item_01:item_03 9.561e-05
#> item_01:item_02 8.399e-05
#> item_02:item_08 5.081e-05
#> hour:item_07 3.284e-05
From these results, we could add more terms by adding another layer of step_interact()
to our recipe.
8.3 Polynomial Basis Expansions
There are two main functions that produce different types of orthogonal polynomials:
-
step_poly()
wrappingstats::poly()
-
step_poly_bernstein()
wrappingsplines2::bernsteinPoly()
.
We can pass the columns of interest to each step:
recipe(time_to_delivery ~ ., data = delivery_train) %>%
step_poly(hour, degree = 4) %>%
prep() %>%
bake(new_data = NULL, starts_with("hour"))
#> # A tibble: 6,004 × 4
#> hour_poly_1 hour_poly_2 hour_poly_3 hour_poly_4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.02318 0.02240 -0.01758 0.008000
#> 2 0.01558 0.01027 -0.003123 -0.01343
#> 3 0.01105 -0.0003742 -0.01174 -0.01039
#> 4 -0.002365 -0.01319 0.004819 0.01182
#> 5 -0.01761 0.006275 0.009360 -0.01725
#> 6 -0.02321 0.02249 -0.01777 0.008254
#> # ℹ 5,998 more rows
# or
recipe(time_to_delivery ~ ., data = delivery_train) %>%
step_poly_bernstein(hour, degree = 4) %>%
prep() %>%
bake(new_data = NULL, starts_with("hour"))
#> # A tibble: 6,004 × 4
#> hour_1 hour_2 hour_3 hour_4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.2593 0.03588 0.002207 0.00005090
#> 2 0.01665 0.1209 0.3899 0.4717
#> 3 0.05106 0.2201 0.4216 0.3028
#> 4 0.2658 0.3742 0.2342 0.05495
#> 5 0.4047 0.1437 0.02267 0.001341
#> 6 0.2582 0.03549 0.002169 0.00004969
#> # ℹ 5,998 more rows
If we would like different polynomial degrees for different columns, we would call the step multiple times.
8.4 Spline Functions
recipes has several spline steps:
apropos("step_spline")
#> [1] "step_spline_b" "step_spline_convex" "step_spline_monotone"
#> [4] "step_spline_natural" "step_spline_nonnegative"
The syntax is almost identical to the polynomial steps. For example, for natural splines:
recipe(time_to_delivery ~ ., data = delivery_train) %>%
step_spline_natural(hour, deg_free = 4) %>%
prep() %>%
bake(new_data = NULL, starts_with("hour"))
#> # A tibble: 6,004 × 4
#> hour_1 hour_2 hour_3 hour_4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.1970 0.004302 0 0
#> 2 0 0.06326 0.4307 0.3122
#> 3 0 0.2167 0.4816 0.2611
#> 4 0.2030 0.6655 0.03618 0.01668
#> 5 0.3939 0.05003 0 0
#> 6 0.1959 0.004225 0 0
#> # ℹ 5,998 more rows
Some steps also have arguments for the spline degree called degree
. Also, most basis expansion function steps have an option called options
to pass additional specifications, such as a vector of knots.
8.5 Discretization
There is one step in recipes that conducts unsupervised binning called step_discretize()
and the embed package has two for supervised methods: step_discretize_cart()
and step_discretize_xgb()
.
For unsupervised binning:
The hour
column has been converted to a factor. There is also an argument called min_unique
that defines the “line of dignity” for the binning. If this value is less than the number of data points per bin that would occur for the value of num_breaks
, binning is not used.
To understand where the breakpoints reside, the tidy()
method will show them:
tidy(basic_binning, number = 1)
#> # A tibble: 5 × 3
#> terms value id
#> <chr> <dbl> <chr>
#> 1 hour -Inf discretize_eoInz
#> 2 hour 14.48 discretize_eoInz
#> 3 hour 16.57 discretize_eoInz
#> 4 hour 18.21 discretize_eoInz
#> 5 hour Inf discretize_eoInz
# note:
quantile(delivery_train$hour, probs = (0:4) / 4)
#> 0% 25% 50% 75% 100%
#> 11.07 14.48 16.57 18.21 20.92
The use of infinite bounds allows data falling outside of the range of the training set to be processed.
If you want to use custom breakpoints, recipes::step_cut()
is also available.
For supervised methods, the syntax is different in two ways. First, the outcome column is specified using the outcome
argument. Secondly, additional parameters control the complexity of the results (i.e., the number of bins). These depend on the binning models. for CART binning, the arguments are cost_complexity
, tree_depth
, and min_n
. They are different for the step that uses boosted trees.
The format of the results is very similar though:
library(embed)
cart_binning <-
recipe(time_to_delivery ~ ., data = delivery_train) %>%
step_discretize_cart(hour,
outcome = vars(time_to_delivery),
cost_complexity = 0.001) %>%
prep()
bake(cart_binning, new_data = NULL, hour)
#> # A tibble: 6,004 × 1
#> hour
#> <fct>
#> 1 [-Inf,12.28)
#> 2 [18.97,19.68)
#> 3 [16.63,18.97)
#> 4 [15.73,16.24)
#> 5 [12.28,13.23)
#> 6 [-Inf,12.28)
#> # ℹ 5,998 more rows
tidy(cart_binning, number = 1)
#> # A tibble: 11 × 3
#> terms value id
#> <chr> <dbl> <chr>
#> 1 hour 12.28 discretize_cart_ChWJu
#> 2 hour 13.23 discretize_cart_ChWJu
#> 3 hour 13.96 discretize_cart_ChWJu
#> 4 hour 14.77 discretize_cart_ChWJu
#> 5 hour 15.73 discretize_cart_ChWJu
#> 6 hour 16.24 discretize_cart_ChWJu
#> # ℹ 5 more rows
As you can see, the breakpoints are built into the factor levels.