5  Working with Categorical Predictors

The book’s Working with Categorical Predictors chapter focuses on various ways to convert qualitative predictors into better formats for modeling.

Like the previous chapter, we do most calculations with the recipes package.

5.1 Requirements

You’ll need 5 packages (embed, rpart, text2vec, textrecipes, and tidymodels) for this chapter. You can install them via:

req_pkg <- c("embed", "text2vec", "textrecipes", "tidymodels", "rpart")

# Check to see if they are installed: 
pkg_installed <- vapply(req_pkg, rlang::is_installed, logical(1))

# Install missing packages: 
if ( any(!pkg_installed) ) {
  install_list <- names(pkg_installed)[!pkg_installed]
  pak::pak(install_list)
}

Let’s load the meta package and manage some between-package function conflicts.

library(tidymodels)
tidymodels_prefer()
theme_set(theme_bw())

5.2 The Hotel Rate Data

The hotel rate data are used for most examples in the chapter. The original version is in the modeldata package. We’ll split the data in the following way:

data(hotel_rates, package = "modeldata")

# Make the initial split
hotel_rates <- hotel_rates %>% arrange(arrival_date)
hotel_rate_split <- initial_time_split(hotel_rates, prop = 0.75)
hotel_rate_train <- training(hotel_rate_split)
hotel_rate_test  <- testing(hotel_rate_split)

initial_time_split() assumes that the data are arranged in time and reserves the last portion for the test set; here, that is the most recent 25% of the data.
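
As a quick check (a sketch, assuming arrival_date is a date or numeric column), we can confirm that the split is chronological:

# The training dates should end where the testing dates begin
range(hotel_rate_train$arrival_date)
range(hotel_rate_test$arrival_date)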

5.3 Simple Indicator Variables

We’ll look at two primary methods for creating indicator (a.k.a. “dummy”) variables.

5.3.1 Base R formulas

Base R’s formula method, discussed previously in Section 4.3, automatically creates indicators when the formula includes a factor predictor. For example:

customer_types <- 
  hotel_rate_train %>% 
  distinct(customer_type) %>% 
  arrange(customer_type)

customer_types %>% 
  model.matrix( ~ customer_type, data = .) %>% 
  as_tibble() %>% 
  select(-`(Intercept)`)
#> # A tibble: 4 × 3
#>   customer_typegroup customer_typetransient customer_typetransient_party
#>                <dbl>                  <dbl>                        <dbl>
#> 1                  0                      0                            0
#> 2                  1                      0                            0
#> 3                  0                      1                            0
#> 4                  0                      0                            1

Note that the column name and the factor levels are directly concatenated.

model.matrix() is part of the larger base R preprocessing framework and always returns a matrix (by default, with an intercept column). A related function, model.frame(), returns a data frame without creating indicator columns or interactions; it does execute any in-line functions and returns only the columns involved in the formula.
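
For instance, here is a minimal sketch of model.frame() with an in-line log() transformation; the function is executed, but the factor predictor keeps its original format (no indicator columns):

model.frame(log(avg_price_per_room) ~ customer_type, data = head(hotel_rate_train)) %>% 
  as_tibble()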

When the factor has missing values, the default behavior is to remove the offending row:

lvls <- levels(hotel_rate_train$customer_type)

with_missing <- 
  customer_types %>% 
  bind_rows(tibble(customer_type = factor(NA, levels = lvls)))

with_missing
#> # A tibble: 5 × 1
#>   customer_type  
#>   <fct>          
#> 1 contract       
#> 2 group          
#> 3 transient      
#> 4 transient_party
#> 5 <NA>

model.matrix( ~ customer_type, data = with_missing) %>% 
  as_tibble() %>% 
  select(-`(Intercept)`)
#> # A tibble: 4 × 3
#>   customer_typegroup customer_typetransient customer_typetransient_party
#>                <dbl>                  <dbl>                        <dbl>
#> 1                  0                      0                            0
#> 2                  1                      0                            0
#> 3                  0                      1                            0
#> 4                  0                      0                            1

A family of na.*() functions (such as na.omit(), na.pass(), and na.fail()) dictates what should be done when missing values occur. The global R option that sets the default is

options()$na.action
#> [1] "na.omit"

To keep the number of rows intact, you can set the global option to be na.pass:

orig_options <- options()
options(na.action = 'na.pass')

model.matrix( ~ customer_type, data = with_missing) %>% 
  as_tibble() %>% 
  select(-`(Intercept)`)
#> # A tibble: 5 × 3
#>   customer_typegroup customer_typetransient customer_typetransient_party
#>                <dbl>                  <dbl>                        <dbl>
#> 1                  0                      0                            0
#> 2                  1                      0                            0
#> 3                  0                      1                            0
#> 4                  0                      0                            1
#> 5                 NA                     NA                           NA

# Now reset to original settings:
options(orig_options) 
options()$na.action
#> [1] "na.omit"

In R, the word “contrast” refers to the algorithm used to create different types of indicators1. Global options control this (for the most part) and the defaults are:

options()$contrasts
#>         unordered           ordered 
#> "contr.treatment"      "contr.poly"

# with possible options: 
apropos("contr\\.")
#> [1] "contr.helmert"   "contr.poly"      "contr.SAS"       "contr.sum"      
#> [5] "contr.treatment"

Many packages also have additional contrast functions.
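
To use a different contrast method for a single model matrix, one option is the contrasts.arg argument of model.matrix(). A small sketch using sum-to-zero contrasts:

# Override the default treatment contrasts for this one call
model.matrix(
  ~ customer_type,
  data = customer_types,
  contrasts.arg = list(customer_type = "contr.sum")
) %>% 
  as_tibble() %>% 
  select(-`(Intercept)`)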

5.3.2 Recipes

We can also use a recipe to do this (with more functionality):

ind_rec <- 
  recipe( ~ customer_type, data = customer_types) %>% 
  step_dummy(all_factor_predictors()) %>% 
  prep()

bake(ind_rec, customer_types, starts_with("customer_type"))
#> # A tibble: 4 × 3
#>   customer_type_group customer_type_transient customer_type_transient_party
#>                 <dbl>                   <dbl>                         <dbl>
#> 1                   0                       0                             0
#> 2                   1                       0                             0
#> 3                   0                       1                             0
#> 4                   0                       0                             1
bake(ind_rec, with_missing, starts_with("customer_type"))
#> Warning: ! There are new levels in a factor: `NA`.
#> # A tibble: 5 × 3
#>   customer_type_group customer_type_transient customer_type_transient_party
#>                 <dbl>                   <dbl>                         <dbl>
#> 1                   0                       0                             0
#> 2                   1                       0                             0
#> 3                   0                       1                             0
#> 4                   0                       0                             1
#> 5                  NA                      NA                            NA

There is no need to set the global option for na.action.

Also, the naming of features is more rational, with names and levels separated by an underscore. There is also an argument to step_dummy() that controls the naming of new features.

There is also an option to produce one-hot encodings called… one_hot.
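
For example, a sketch of a one-hot encoding, which creates an indicator column for every factor level (including the first):

recipe( ~ customer_type, data = customer_types) %>% 
  step_dummy(all_factor_predictors(), one_hot = TRUE) %>% 
  prep() %>% 
  bake(new_data = NULL)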

5.4 Novel Categories

When we think the recipe or model will encounter new values of a factor predictor, we can use step_novel() to add a new factor level:

recipe(avg_price_per_room ~ customer_type, data = hotel_rate_train) %>% 
  step_novel(customer_type) %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  pluck("customer_type") %>% 
  levels()
#> [1] "contract"        "group"           "transient"       "transient_party"
#> [5] "new"

For the training set, this new level will never have any data associated with it.
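
To illustrate, here is a sketch using a hypothetical customer type ("walk_in") that does not occur in the training set; when the recipe is applied to these data, the unseen value should be recoded to the "new" level:

# Hypothetical new data with an unseen customer type
unseen <- tibble(
  customer_type = factor("walk_in"),
  avg_price_per_room = 100
)

recipe(avg_price_per_room ~ customer_type, data = hotel_rate_train) %>% 
  step_novel(customer_type) %>% 
  prep() %>% 
  bake(new_data = unseen)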

5.5 “Othering”

We can also determine infrequently occurring categories (in the training set) and re-level the factor by converting them to an “other” category. If we choose a frequency of 0.01% as a cutoff, we have far fewer levels:

length(levels(hotel_rate_train$agent))
#> [1] 174

recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>% 
  step_other(agent, threshold = 0.0001) %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  pluck("agent") %>% 
  levels() %>% 
  length()
#> [1] 101

If you are interested in which levels are combined, the tidy() method on the recipe can tell you.
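
A sketch: give the step an id so it is easy to reference, then call tidy() on the prepped recipe. For step_other(), the tidy() method lists the levels that were retained (everything else is pooled into the other category):

other_rec <- 
  recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>% 
  step_other(agent, threshold = 0.0001, id = "other") %>% 
  prep()

tidy(other_rec, id = "other")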

5.6 Feature Hashing

Feature hashing converts qualitative predictors to a set of binary indicators based solely on the value of their category. It is most useful when many categories and/or novel levels might be encountered.

The recipe step is in the textrecipes package, so we must load it first2. The main arguments are num_terms and signed. The first sets the number of features to create, and when signed = TRUE, the indicators will have an appropriate sign attached to them (i.e., their values could be -1/0/1).

library(textrecipes)

recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>% 
  step_dummy_hash(agent, num_terms = 4) %>% 
  prep() %>% 
  bake(new_data = NULL, contains("agent")) 
#> # A tibble: 11,551 × 4
#>   dummyhash_agent_1 dummyhash_agent_2 dummyhash_agent_3 dummyhash_agent_4
#>               <int>             <int>             <int>             <int>
#> 1                 0                 1                 0                 0
#> 2                 0                 0                -1                 0
#> 3                 0                 1                 0                 0
#> 4                 0                 1                 0                 0
#> 5                 0                 1                 0                 0
#> 6                 0                 0                 0                 1
#> # ℹ 11,545 more rows
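
If plain 0/1 indicators are preferred, a sketch of the same step with signed = FALSE:

recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>% 
  step_dummy_hash(agent, num_terms = 4, signed = FALSE) %>% 
  prep() %>% 
  bake(new_data = NULL, contains("agent"))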

5.7 Effect Encodings

Effect encoding is a supervised method that converts a qualitative predictor to a numeric column containing the effect of each category on the outcome. For these data, the outcome is a numeric value (the ADR). Using effect encodings here will produce a column with model-based estimates of each category’s mean ADR. Let’s look at the agent predictor again.

The embed package has a few recipe steps to do this. This method is sometimes called “likelihood encoding” and the recipe steps all start with step_lencode_*:

  • step_lencode_glm() produces basic, naive estimates of effect. This is a “no pooling” estimate.
  • step_lencode_mixed() uses a non-Bayesian hierarchical model to produce regularized effect estimates.
  • step_lencode_bayes() uses a Bayesian model that is more flexible than its non-Bayesian sibling but can take much longer to fit.

We’ll use the “mixed” function3. For agent:

library(embed)

encoded_agent_rec <- 
  recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>% 
  step_lencode_mixed(agent, outcome = vars(avg_price_per_room), id = "effect") %>% 
  prep() 
encoded_agent_rec
#> 
#> ── Recipe ───────────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 1
#> 
#> ── Training information
#> Training data contained 11551 data points and no incomplete rows.
#> 
#> ── Operations
#> • Linear embedding for factors via mixed effects for: agent | Trained

To see the actual effect estimates, use the tidy() method:

tidy(encoded_agent_rec, id = "effect")
#> # A tibble: 123 × 4
#>   level            value terms id    
#>   <chr>            <dbl> <chr> <chr> 
#> 1 aaron_marquez    92.22 agent effect
#> 2 alexander_drake 120.1  agent effect
#> 3 allen_her        73.70 agent effect
#> 4 anas_el_bashir   80.41 agent effect
#> 5 araseli_billy    66.91 agent effect
#> 6 arhab_al_islam   51.15 agent effect
#> # ℹ 117 more rows

# The estimate for new agents:
tidy(encoded_agent_rec, id = "effect") %>% 
  slice_tail(n = 1)
#> # A tibble: 1 × 4
#>   level value terms id    
#>   <chr> <dbl> <chr> <chr> 
#> 1 ..new 75.36 agent effect

When the recipe is applied to new data, the agent column is converted to a numeric column with these values:

bake(encoded_agent_rec, hotel_rate_test) %>% 
  bind_cols(hotel_rate_test %>% select(original_col = agent))
#> # A tibble: 3,851 × 3
#>    agent avg_price_per_room original_col        
#>    <dbl>              <dbl> <fct>               
#> 1 106.9                 100 devin_rivera_borrego
#> 2 106.9                  60 devin_rivera_borrego
#> 3  76.38                 45 not_applicable      
#> 4 106.9                  82 devin_rivera_borrego
#> 5  76.38                 47 not_applicable      
#> 6  76.38                 60 not_applicable      
#> # ℹ 3,845 more rows

For categorical outcomes, the effect estimate is the log-odds of an event (the first factor level).

5.8 Supervised Combining of Categories

To collapse a large number of factor levels to a smaller set using a supervised model, we can use step_collapse_cart() in the embed package.

For example:

# Also needs the embed package loaded (and rpart installed)
library(embed)

collapse_agent_rec <- 
  recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>% 
  step_collapse_cart(agent, outcome = vars(avg_price_per_room), id = "collapse") %>% 
  prep() 
collapse_agent_rec
#> 
#> ── Recipe ───────────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 1
#> 
#> ── Training information
#> Training data contained 11551 data points and no incomplete rows.
#> 
#> ── Operations
#> • Collapsing factor levels using CART: agent | Trained

The step converts 122 unique values of agent in the training set to a smaller set of 12 categories. To see the conversion key, use the tidy() method:

tidy(collapse_agent_rec, id = "collapse")
#> # A tibble: 122 × 4
#>   terms old               new      id      
#>   <chr> <chr>             <chr>    <chr>   
#> 1 agent araseli_billy     agent_01 collapse
#> 2 agent arhab_al_islam    agent_01 collapse
#> 3 agent brayan_guerrero   agent_01 collapse
#> 4 agent daifallah_el_sami agent_01 collapse
#> 5 agent dante_merritt     agent_01 collapse
#> 6 agent derrick_barger    agent_01 collapse
#> # ℹ 116 more rows

tidy(collapse_agent_rec, id = "collapse") %>% 
  count(new)
#> # A tibble: 12 × 2
#>   new          n
#>   <chr>    <int>
#> 1 agent_01    17
#> 2 agent_02    14
#> 3 agent_03    14
#> 4 agent_04    10
#> 5 agent_05     7
#> 6 agent_06    13
#> # ℹ 6 more rows

There are two main tuning parameters (described later in section TODO):

  • cost complexity (a.k.a. \(C_p\)): smaller values result in more groups. Values typically range between zero and 0.1.
  • minimum n: the minimum number of rows in a group to enable it to keep splitting. Smaller values should result in more groupings.

These values can be tuned.
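
For instance, here is a sketch (not prepped) that marks both parameters for later optimization, assuming the step's cost_complexity and min_n arguments:

collapse_tune_rec <- 
  recipe(avg_price_per_room ~ agent, data = hotel_rate_train) %>% 
  step_collapse_cart(
    agent,
    outcome = vars(avg_price_per_room),
    cost_complexity = tune(),
    min_n = tune()
  )

This recipe could then be combined with a model in a workflow and optimized with a function such as tune_grid().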

5.9 Working with Ordinal Predictors

As reported in the section on ordinal data, R’s default for an ordinal predictor with \(p\) levels is to create a set of \(p - 1\) orthogonal polynomial features. That is what step_dummy() does by default.

quality_vals <- c('excellent', 'fair', 'good', 'typical', 'poor')
quality <- tibble(quality = ordered(quality_vals, levels = quality_vals))
str(quality)
#> tibble [5 × 1] (S3: tbl_df/tbl/data.frame)
#>  $ quality: Ord.factor w/ 5 levels "excellent"<"fair"<..: 1 2 3 4 5

recipe(~ quality, data = quality) %>% 
  step_dummy(quality) %>% 
  prep() %>% 
  bake(new_data = NULL)
#> # A tibble: 5 × 4
#>    quality_1 quality_2  quality_3 quality_4
#>        <dbl>     <dbl>      <dbl>     <dbl>
#> 1 -6.325e- 1    0.5345 -3.162e- 1    0.1195
#> 2 -3.162e- 1   -0.2673  6.325e- 1   -0.4781
#> 3 -3.288e-17   -0.5345  9.637e-17    0.7171
#> 4  3.162e- 1   -0.2673 -6.325e- 1   -0.4781
#> 5  6.325e- 1    0.5345  3.162e- 1    0.1195

We can convert the ordered factor to an unordered factor:

recipe(~ quality, data = quality) %>% 
  step_unorder(quality) %>% 
  step_dummy(quality) %>% 
  prep() %>% 
  bake(new_data = NULL)
#> # A tibble: 5 × 4
#>   quality_fair quality_good quality_typical quality_poor
#>          <dbl>        <dbl>           <dbl>        <dbl>
#> 1            0            0               0            0
#> 2            1            0               0            0
#> 3            0            1               0            0
#> 4            0            0               1            0
#> 5            0            0               0            1

Another strategy is mapping the ordinal factor levels to a set of numeric values that make sense for the modeling problem. step_ordinalscore() can do that with a user-supplied conversion function:

convert_to_prime <-  function(x) {
  primes <- c(2, 3, 7, 11, 13)
  primes[as.numeric(x)]
}

recipe(~ quality, data = quality) %>% 
  step_ordinalscore(quality, convert = convert_to_prime) %>% 
  prep() %>% 
  bake(new_data = NULL)
#> # A tibble: 5 × 1
#>   quality
#>     <int>
#> 1       2
#> 2       3
#> 3       7
#> 4      11
#> 5      13

step_integer() does the same for either type of factor but converts them to consecutive one-based integers.
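
A short sketch of step_integer() on the same data:

recipe(~ quality, data = quality) %>% 
  step_integer(quality) %>% 
  prep() %>% 
  bake(new_data = NULL)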

5.10 Other relevant recipe steps

There are a variety of other steps that can be used with qualitative predictors; see the list of relevant recipe steps in the recipes package documentation.


  1. In statistics, “contrasts” typically indicates a combination of parameters. For example, if you wanted to test that the average of two parameters was equal to a third, the contrast would be \(\beta_1 + \beta_2 - 2\beta_3\) and the contrast coefficients would be c(1, 1, -2).↩︎

  2. It also requires another package (text2vec) to be installed but not loaded.↩︎

  3. For this function, we also need the lme4 package to be installed.↩︎