3  Initial Data Splitting

We’ll illustrate how to conduct an initial split of your data into different partitions (used for different purposes).

3.1 Requirements

You’ll need 2 packages (caret and tidymodels) for this chapter. You can install them via:

req_pkg <- c("caret", "tidymodels")

# Check to see if they are installed: 
pkg_installed <- vapply(req_pkg, rlang::is_installed, logical(1))

# Install missing packages: 
if ( any(!pkg_installed) ) {
  install_list <- names(pkg_installed)[!pkg_installed]
  pak::pak(install_list)
}

Let’s load the meta package and manage some between-package function conflicts.
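A minimal way to do that (assuming the tidymodels meta-package is installed, as above) is:

library(tidymodels)

# Prefer tidyverse/tidymodels functions when names clash (e.g., filter()):
tidymodels_prefer()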

The data used here are both in R packages that are already installed. Let’s work with the primary data set: the Ames, Iowa housing data.

3.2 The Ames Housing Data

These data are in the modeldata package, which is part of tidymodels. Let’s load the data, subset a few columns, and modify the sale price units. We’ll also combine the two bathroom-related columns into a single column.

data(ames, package = "modeldata")

ames <-
  ames %>%
  select(Sale_Price, Bldg_Type, Neighborhood, Year_Built, Gr_Liv_Area, Full_Bath,
         Half_Bath, Year_Sold, Lot_Area, Central_Air, Longitude, Latitude) %>%
  mutate(
    Sale_Price = log10(Sale_Price),
    Baths = Full_Bath  + Half_Bath/2
  ) %>%
  select(-Half_Bath, -Full_Bath)

glimpse(ames)
#> Rows: 2,930
#> Columns: 11
#> $ Sale_Price   <dbl> 5.332, 5.021, 5.236, 5.387, 5.279, 5.291, 5.329, 5.282, 5.374…
#> $ Bldg_Type    <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, TwnhsE, Twnhs…
#> $ Neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gilbert, Gilb…
#> $ Year_Built   <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, 1999, 1…
#> $ Gr_Liv_Area  <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1804, 16…
#> $ Year_Sold    <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2…
#> $ Lot_Area     <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, 5389, 75…
#> $ Central_Air  <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
#> $ Longitude    <dbl> -93.62, -93.62, -93.62, -93.62, -93.64, -93.64, -93.63, -93.6…
#> $ Latitude     <dbl> 42.05, 42.05, 42.05, 42.05, 42.06, 42.06, 42.06, 42.06, 42.06…
#> $ Baths        <dbl> 1.0, 1.0, 1.5, 2.5, 2.5, 2.5, 2.0, 2.0, 2.0, 2.5, 2.5, 2.0, 2…

tidymodels requires that any basic transformations of the outcome (such as the log transform used here) occur before data splitting.

3.3 Simple Data Splitting

There are a few main functions in the rsample package for an initial split: initial_split(), initial_time_split(), group_initial_split(), and initial_validation_split().

Most of our applications will use the first of these, where the default is to use 75% of the rows for training and 25% for testing. The rows are chosen at random, so there is no need to shuffle the data before splitting; in other words, a simple random split is used by default.
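As a quick sketch of the default behavior (the seed value is arbitrary):

set.seed(101)
# 75% training / 25% testing, selected completely at random
initial_split(ames)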

For the Ames data, we know that the distribution of sale prices has some outlying points. To deal with this, we’ll use a stratified split (on the outcome) using 5 quantiles of the data in ames:

set.seed(3024)
ames_split <- initial_split(ames, strata = Sale_Price, breaks = 5)

ames_split
#> <Training/Testing/Total>
#> <2196/734/2930>

The output shows the size of the resulting data sets. To get the two data sets, there are simple accessor functions:

ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

Consistent with the printed output, there are 2,196 data points in the training set and 734 reserved for testing.

We won’t touch on initial_time_split() here but only mention that it takes the fraction of the data specified for testing from the bottom/tail of the data frame. Unlike the previous function, the order of the rows matters.
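Although we won’t use it, a minimal sketch of a call (assuming the rows were already in a meaningful order, which they are not here) would be:

# The first 75% of rows become the training set; the remaining rows are the test set
initial_time_split(ames, prop = 0.75)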

group_initial_split() and initial_validation_split() are discussed in more detail below.

3.4 Using the Predictors

Instead of using the outcome to partition the data, other columns can be used when applicable. The text mentions the twinning package; the same authors have a second approach in the SPlit package. Both are on CRAN and are straightforward to use.

Maximum dissimilarity sampling can be conducted using caret::maxDissim(). It starts with an initial set of one or more data points. Unless there is a specific set of points of interest, picking a point close to the center of the multivariate predictor distribution makes sense. Here is some code that uses the geographic coordinates as the splitting variables:

# Since we will be using distances in the calculations, create centered 
# and scaled versions of the coordinates then add a row index column. 
ames_scaled <-
  ames %>%
  select(Longitude, Latitude) %>%
  mutate(
    scaled_lon = scale(Longitude)[,1], 
    scaled_lat = scale(Latitude)[,1]
  ) %>%
  select(starts_with("scaled")) %>% 
  add_rowindex()

# Select an initial data point closest to the middle
seed_row <-
  ames_scaled %>%
  mutate(
    dist = (scaled_lon)^2 + (scaled_lat)^2
  ) %>%
  slice_min(dist, n = 1) %>%
  pluck(".row")

# Partition the data
ames_scaled_seed <- ames_scaled %>% slice( seed_row)
ames_scaled_pool <- ames_scaled %>% slice(-seed_row)

# Conduct the selection process
selection_path <- 
  caret::maxDissim(
    # Only give the function the predictor columns for each data set
    ames_scaled_seed %>% select(-.row), 
    ames_scaled_pool %>% select(-.row), 
    n = 24
  )

# Get the selected row numbers that correspond to the 'ames' data frame.
selected_rows <- c(seed_row, ames_scaled_pool$.row[selection_path])

selected_data <- ames %>% slice(selected_rows)

# A non-map plot of the values: 
selected_data %>%
  mutate(xend = lead(Longitude), yend = lead(Latitude)) %>%
  ggplot(aes(Longitude, Latitude)) +
  geom_point() +
  geom_segment(aes(xend = xend, yend = yend),
               arrow = arrow(length = unit(0.1, "inches"), type = "closed"),
               col = "blue", alpha = 1 / 5) +
  theme_bw()

3.5 Multi-Level Data

This section focuses on data with a rational grouping. For example, medical data might follow patients over time, so there are multiple rows per patient. The patient is the independent experimental unit (IEU), meaning that the data between patients are thought to be independent while the data within a patient are (statistically) related. We want to partition the data so that all of the rows for each IEU end up in either the training set or the test set, but not both. In other words, we sample the data by group, and the group in this example is the patient.

There are other applications of grouped data, but the example data that we’ll use fit the description above: 27 patients were followed, with data collected at four time points. The data are in the nlme package:

data(Orthodont, package = "nlme")
glimpse(Orthodont)
#> Rows: 108
#> Columns: 4
#> $ distance <dbl> 26.0, 25.0, 29.0, 31.0, 21.5, 22.5, 23.0, 26.5, 23.0, 22.5, 24.0,…
#> $ age      <dbl> 8, 10, 12, 14, 8, 10, 12, 14, 8, 10, 12, 14, 8, 10, 12, 14, 8, 10…
#> $ Subject  <ord> M01, M01, M01, M01, M02, M02, M02, M02, M03, M03, M03, M03, M04, …
#> $ Sex      <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, Male,…

To use rsample::group_initial_split(), we must supply a group argument that corresponds to one of the columns in the data. There is also a prop argument that specifies the proportion of the groups that should go into the training set.

set.seed(93)
orth_split <- group_initial_split(Orthodont, group = Subject, prop = 2 / 3)

# The numbers in this output are individual rows (not numbers of groups)
orth_split
#> <Training/Testing/Total>
#> <72/36/108>

From here, the code to get the resulting data sets is the same as previously shown. We’ll also verify that no subjects are in both data sets:

orth_train <- training(orth_split)
orth_test  <- testing(orth_split)

# Is there any overlap in the subjects? 
subjects_train <- unique(orth_train$Subject)
subjects_test  <- unique(orth_test$Subject)

intersect(subjects_train, subjects_test)
#> character(0)

3.6 Validation Sets

To add a validation set at the outset, initial_validation_split() works much the same as initial_split(). The prop argument now requires two values: the first is the training set proportion, and the second is the validation set proportion. In the example below, we allocate 80% of the rows to training, 10% to validation, and the remaining 10% to testing:

set.seed(4)
ames_val_split <- initial_validation_split(ames, strata = Sale_Price, prop = c(0.8, 0.1))

ames_val_split
#> <Training/Validation/Testing/Total>
#> <2342/293/295/2930>

Again, the data sets are obtained in the same way, with the additional use of the validation() function:

ames_train <- training(ames_val_split)
ames_val   <- validation(ames_val_split)
ames_test  <- testing(ames_val_split)

rsample::initial_validation_time_split() does the same thing but based on the ordering of the data (as opposed to random selection).

Suppose a data frame had 100 rows. Using prop = c(0.8, 0.1) would place the first 80 rows into training, the next 10 into validation, and the last 10 into testing. Since the allocation follows the row order, keeping the data appropriately ordered (e.g., chronologically) is important when using this function.
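As a small illustrative sketch (the ordered_df object and its columns are made up for illustration):

# 100 rows, assumed to be in chronological order
ordered_df <- tibble(day = 1:100, y = rnorm(100))

val_time_split <- initial_validation_time_split(ordered_df, prop = c(0.8, 0.1))

# Rows 1-80 go to training, rows 81-90 to validation, and rows 91-100 to testing
nrow(training(val_time_split))
nrow(validation(val_time_split))
nrow(testing(val_time_split))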