req_pkg <- c("embed", "text2vec", "textrecipes", "tidymodels", "rpart")
# Check to see if they are installed:
pkg_installed <- vapply(req_pkg, rlang::is_installed, logical(1))
# Install missing packages:
if ( any(!pkg_installed) ) {
install_list <- names(pkg_installed)[!pkg_installed]
pak::pak(install_list)
}
5 Working with Categorical Predictors
The book’s Working with Categorical Predictors chapter focuses on various ways to convert qualitative predictors into better formats for modeling.
Like the previous chapter, we do most calculations with the recipes package.
5.1 Requirements
You’ll need 5 packages (embed, rpart, text2vec, textrecipes, and tidymodels) for this chapter. You can install them via:
Let’s load the meta package and manage some between-package function conflicts.
library(tidymodels)
tidymodels_prefer()
theme_set(theme_bw())
5.2 The Hotel Rate Data
The hotel rate data are used for most examples in the chapter. The original version is in the modeldata package. We’ll split the data in the following way:
initial_time_split()
will reserve that most recent 25% for the test set. It assumes that the data are arranged in time and takes that last part of the data for testing (assumed to be the most recent).
5.3 Simple Indicator Variables
We’ll look at two primary methods for creating indicator (a.k.a. “dummy”) variables.
5.3.1 Base R formulas
Base R’s formula method, discussed previously in Section 4.3, automatically creates indicators when the formula includes a factor predictor. For example:
customer_types <-
hotel_rate_train %>%
distinct(customer_type) %>%
arrange(customer_type)
customer_types %>%
model.matrix( ~ customer_type, data = .) %>%
as_tibble() %>%
select(-`(Intercept)`)
#> # A tibble: 4 × 3
#> customer_typegroup customer_typetransient customer_typetransient_party
#> <dbl> <dbl> <dbl>
#> 1 0 0 0
#> 2 1 0 0
#> 3 0 1 0
#> 4 0 0 1
Note that the column name and the factor levels are directly concatenated.
model.matrix()
is part of the larger base R preprocessing framework and always returns a matrix (by default, with an intercept column). There is also model.frame()
. This returns a data frame without creating indicator columns or interactions. It does execute any in-line functions and only returns the columns involved in the formula.
When the factor has missing values, the default behavior is to remove the offending row:
lvls <- levels(hotel_rate_train$customer_type)
with_missing <-
customer_types %>%
bind_rows(tibble(customer_type = factor(NA, levels = lvls)))
with_missing
#> # A tibble: 5 × 1
#> customer_type
#> <fct>
#> 1 contract
#> 2 group
#> 3 transient
#> 4 transient_party
#> 5 <NA>
model.matrix( ~ customer_type, data = with_missing) %>%
as_tibble() %>%
select(-`(Intercept)`)
#> # A tibble: 4 × 3
#> customer_typegroup customer_typetransient customer_typetransient_party
#> <dbl> <dbl> <dbl>
#> 1 0 0 0
#> 2 1 0 0
#> 3 0 1 0
#> 4 0 0 1
A family of functions can be used to dictate what should be done when missing values occur. The global R option is
options()$na.action
#> [1] "na.omit"
To keep the number of rows intact, you can set the global option to be na.pass
:
orig_options <- options()
options(na.action = 'na.pass')
model.matrix( ~ customer_type, data = with_missing) %>%
as_tibble() %>%
select(-`(Intercept)`)
#> # A tibble: 5 × 3
#> customer_typegroup customer_typetransient customer_typetransient_party
#> <dbl> <dbl> <dbl>
#> 1 0 0 0
#> 2 1 0 0
#> 3 0 1 0
#> 4 0 0 1
#> 5 NA NA NA
# Now reset to original settings:
options(orig_options)
options()$na.action
#> [1] "na.omit"
In R, the word “contrast” refers to the algorithm used to create different types of indicators 1. Global options control this (for the most part) and the defaults are:
Many packages also have additional contrast functions.
5.3.2 Recipes
We can also use a recipe to do this (with more functionality):
ind_rec <-
recipe( ~ customer_type, data = customer_types) %>%
step_dummy(all_factor_predictors()) %>%
prep()
bake(ind_rec, customer_types, starts_with("customer_type"))
#> # A tibble: 4 × 3
#> customer_type_group customer_type_transient customer_type_transient_party
#> <dbl> <dbl> <dbl>
#> 1 0 0 0
#> 2 1 0 0
#> 3 0 1 0
#> 4 0 0 1
bake(ind_rec, with_missing, starts_with("customer_type"))
#> Warning: ! There are new levels in a factor: `NA`.
#> # A tibble: 5 × 3
#> customer_type_group customer_type_transient customer_type_transient_party
#> <dbl> <dbl> <dbl>
#> 1 0 0 0
#> 2 1 0 0
#> 3 0 1 0
#> 4 0 0 1
#> 5 NA NA NA
There is no need to set the global option for na.action
.
Also, the naming of features is more rational, with names and levels separated by an underscore. There is also an argument to step_dummy()
that controls the naming of new features.
There is also an option to produce one-hot encodings called… one_hot
.
5.4 Novel Categories
When we think the recipe or model will encounter new values of a factor predictor, we can use step_novel()
to add a new factor level:
recipe(avg_price_per_room ~ customer_type, data = hotel_rate_train) %>%
step_novel(customer_type) %>%
prep() %>%
bake(new_data = NULL) %>%
pluck("customer_type") %>%
levels()
#> [1] "contract" "group" "transient" "transient_party"
#> [5] "new"
For the training set, this new level will never have any data associated with it.
5.5 “Othering”
We can also determine infrequently occurring categories (in the training set) and re-level the factor by converting them to an “other” category. If we chose a frequency of 0.01% as a cutoff, we have far fewer levels:
If you are interested in which levels are combined, the tidy()
method on the recipe can tell you.
5.6 Feature Hashing
Feature hashing converts qualitative predictors to a set of binary indicators based solely on the value of their category. It is most useful when many categories and/or novel levels might be encountered.
The recipe step is in the textrecipes package, so we must load it first2. The main arguments are num_terms
and signed
. The first sets the number of features to create, and when signed = TRUE
, the indicators will have an appropriate sign attached to them (i.e., their values could be -1/0/1).
library(textrecipes)
recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>%
step_dummy_hash(agent, num_terms = 4) %>%
prep() %>%
bake(new_data = NULL, contains("agent"))
#> # A tibble: 11,551 × 4
#> dummyhash_agent_1 dummyhash_agent_2 dummyhash_agent_3 dummyhash_agent_4
#> <int> <int> <int> <int>
#> 1 0 1 0 0
#> 2 0 0 -1 0
#> 3 0 1 0 0
#> 4 0 1 0 0
#> 5 0 1 0 0
#> 6 0 0 0 1
#> # ℹ 11,545 more rows
5.7 Effect Encodings
Effect embedding is a supervised method to convert qualitative predictors to a numeric column that contains the effect of the category on the outcome. The outcome is a numeric value (the ADR) for these data. Using effect endings here will produce a column with specialized estimates of each category’s mean ADR. Let’s look at the agent predictor again.
The embed package has a few recipe steps to do this. This method is sometimes called “likelihood encoding” and the recipe steps all start with step_lencode_*
:
-
step_lencode_glm()
produces basic, naive estimates of effect. This is a “no pooling” estimate. -
step_lencode_mixed()
uses a non-Bayesian hierarchical model to produce regularized effect estimates. -
step_lencode_bayes()
uses a Bayesian model that is more flexible than its non-Bayesian sibling but can take much longer to fit.
We’ll use the “mixed” function3, For agent:
library(embed)
encoded_agent_rec <-
recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>%
step_lencode_mixed(agent, outcome = vars(avg_price_per_room), id = "effect") %>%
prep()
encoded_agent_rec
#>
#> ── Recipe ───────────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 1
#>
#> ── Training information
#> Training data contained 11551 data points and no incomplete rows.
#>
#> ── Operations
#> • Linear embedding for factors via mixed effects for: agent | Trained
To see the actual effect estimates, use the tidy()
method:
tidy(encoded_agent_rec, id = "effect")
#> # A tibble: 123 × 4
#> level value terms id
#> <chr> <dbl> <chr> <chr>
#> 1 aaron_marquez 92.22 agent effect
#> 2 alexander_drake 120.1 agent effect
#> 3 allen_her 73.70 agent effect
#> 4 anas_el_bashir 80.41 agent effect
#> 5 araseli_billy 66.91 agent effect
#> 6 arhab_al_islam 51.15 agent effect
#> # ℹ 117 more rows
# The estimate for new agents:
tidy(encoded_agent_rec, id = "effect") %>%
slice_tail(n = 1)
#> # A tibble: 1 × 4
#> level value terms id
#> <chr> <dbl> <chr> <chr>
#> 1 ..new 75.36 agent effect
When the recipe is applied to new data, the agent
column is converted to a numeric column with these values:
bake(encoded_agent_rec, hotel_rate_test) %>%
bind_cols(hotel_rate_test %>% select(original_col = agent))
#> # A tibble: 3,851 × 3
#> agent avg_price_per_room original_col
#> <dbl> <dbl> <fct>
#> 1 106.9 100 devin_rivera_borrego
#> 2 106.9 60 devin_rivera_borrego
#> 3 76.38 45 not_applicable
#> 4 106.9 82 devin_rivera_borrego
#> 5 76.38 47 not_applicable
#> 6 76.38 60 not_applicable
#> # ℹ 3,845 more rows
For categorical outcomes, the effect estimate is the log-odds of an event (the first factor level).
5.8 Supervised Combining of Categories
To collapse a large number of factor levels to a smaller set using a supervised model, we can use step_collapse_cart()
in the embed package.
For example:
# Also needs the embed package loaded (and rpart installed)
library(embed)
collapse_agent_rec <-
recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>%
step_collapse_cart(agent, outcome = vars(avg_price_per_room), id = "collapse") %>%
prep()
collapse_agent_rec
#>
#> ── Recipe ───────────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 1
#>
#> ── Training information
#> Training data contained 11551 data points and no incomplete rows.
#>
#> ── Operations
#> • Collapsing factor levels using CART: agent | Trained
The step converts 122 unique values of agent
in the training set to a smaller set of 12 categories. To see the conversion key, use the tidy()
methods:
tidy(collapse_agent_rec, id = "collapse")
#> # A tibble: 122 × 4
#> terms old new id
#> <chr> <chr> <chr> <chr>
#> 1 agent araseli_billy agent_01 collapse
#> 2 agent arhab_al_islam agent_01 collapse
#> 3 agent brayan_guerrero agent_01 collapse
#> 4 agent daifallah_el_sami agent_01 collapse
#> 5 agent dante_merritt agent_01 collapse
#> 6 agent derrick_barger agent_01 collapse
#> # ℹ 116 more rows
tidy(collapse_agent_rec, id = "collapse") %>%
count(new)
#> # A tibble: 12 × 2
#> new n
#> <chr> <int>
#> 1 agent_01 17
#> 2 agent_02 14
#> 3 agent_03 14
#> 4 agent_04 10
#> 5 agent_05 7
#> 6 agent_06 13
#> # ℹ 6 more rows
There are two main tuning parameters (described later in section TODO):
- cost complexity (a.k.a. \(C_p\)): smaller values result in more groups. Values typically range between zero and 0.1.
- minimum n: the minimum number of rows in a group to enable it to keep splitting. Smaller values should result in more groupings.
These values can be tuned.
5.9 Working with Ordinal Predictors
As reported in the section on ordinal data, the default for R is to encode ordinal values with \(p\) values is to create a set of \(p - 1\) orthogonal polynomial features. That is what step_dummy()
does by default.
quality_vals <- c('excellent', 'fair', 'good', 'typical', 'poor')
quality <- tibble(quality = ordered(quality_vals, levels = quality_vals))
str(quality)
#> tibble [5 × 1] (S3: tbl_df/tbl/data.frame)
#> $ quality: Ord.factor w/ 5 levels "excellent"<"fair"<..: 1 2 3 4 5
recipe(~ quality, data = quality) %>%
step_dummy(quality) %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 5 × 4
#> quality_1 quality_2 quality_3 quality_4
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -6.325e- 1 0.5345 -3.162e- 1 0.1195
#> 2 -3.162e- 1 -0.2673 6.325e- 1 -0.4781
#> 3 -3.288e-17 -0.5345 9.637e-17 0.7171
#> 4 3.162e- 1 -0.2673 -6.325e- 1 -0.4781
#> 5 6.325e- 1 0.5345 3.162e- 1 0.1195
We can convert the ordered factor to an unordered factor:
recipe(~ quality, data = quality) %>%
step_unorder(quality) %>%
step_dummy(quality) %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 5 × 4
#> quality_fair quality_good quality_typical quality_poor
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0 0 0 0
#> 2 1 0 0 0
#> 3 0 1 0 0
#> 4 0 0 1 0
#> 5 0 0 0 1
Another strategy is mapping the ordinal factor levels to a set of numeric values that make sense for their modeling problem. step_ordinalscore()
can do that with a user-supplied conversion function:
convert_to_prime <- function(x) {
primes <- c(2, 3, 7, 11, 13)
primes[as.numeric(x)]
}
recipe(~ quality, data = quality) %>%
step_ordinalscore(quality, convert = convert_to_prime) %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 5 × 1
#> quality
#> <int>
#> 1 2
#> 2 3
#> 3 7
#> 4 11
#> 5 13
step_integer()
does the same for either type of factor but converts them to consecutive one-based integers.
5.10 Other relevant recipe steps
There are a variety of other steps that can be used with qualitative predictors (a list of relevant recipe steps in recipes)
In statistics, “contrasts” typically indicates a combination of parameters. For example, if you wanted to test that the average of two parameters was equal to a third, the contrast would be \(\beta_1 + \beta_2 - 2\beta_3\) and the contrast coefficients would be
c(1, 1, -2)
.↩︎It also requires another package (text2vec) to be installed but not loaded.↩︎
For this function, we all need the lme4 package to be installed.↩︎