3  Initial Data Splitting

We’ll illustrate how to conduct an initial split of your data into different partitions (used for different purposes).

3.1 Requirements

You’ll need 2 packages (caret and tidymodels) for this chapter. You can install them via:

req_pkg <- c("caret", "tidymodels")

# Check to see if they are installed: 
pkg_installed <- vapply(req_pkg, rlang::is_installed, logical(1))

# Install missing packages: 
if ( any(!pkg_installed) ) {
  install_list <- names(pkg_installed)[!pkg_installed]
  pak::pak(install_list)
}

Let’s load the meta package and manage some between-package function conflicts.
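A minimal way to do that (assuming the tidymodels meta-package is installed, as above) is:

library(tidymodels)

# Prefer tidyverse/tidymodels functions when names clash (e.g., filter()):
tidymodels_prefer()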

The data used here are both in R packages that are already installed. Let’s work with the primary data set: the Ames, Iowa housing data.

3.2 The Ames Housing Data

These data are in the modeldata package, which is part of tidymodels. Let’s load the data, subset a few columns, and modify the sale price units. We’ll also combine the two bathroom-related columns into a single column.

data(ames, package = "modeldata")

ames <-
  ames %>%
  select(Sale_Price, Bldg_Type, Neighborhood, Year_Built, Gr_Liv_Area, Full_Bath,
         Half_Bath, Year_Sold, Lot_Area, Central_Air, Longitude, Latitude) %>%
  mutate(
    Sale_Price = log10(Sale_Price),
    Baths = Full_Bath  + Half_Bath/2
  ) %>%
  select(-Half_Bath, -Full_Bath)

glimpse(ames)
#> Rows: 2,930
#> Columns: 11
#> $ Sale_Price   <dbl> 5.332, 5.021, 5.236, 5.387, 5.279, 5.291, 5.329, 5.282, 5.374…
#> $ Bldg_Type    <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, TwnhsE, Twnhs…
#> $ Neighborhood <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gilbert, Gilb…
#> $ Year_Built   <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995, 1999, 1…
#> $ Gr_Liv_Area  <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, 1804, 16…
#> $ Year_Sold    <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2…
#> $ Lot_Area     <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005, 5389, 75…
#> $ Central_Air  <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
#> $ Longitude    <dbl> -93.62, -93.62, -93.62, -93.62, -93.64, -93.64, -93.63, -93.6…
#> $ Latitude     <dbl> 42.05, 42.05, 42.05, 42.05, 42.06, 42.06, 42.06, 42.06, 42.06…
#> $ Baths        <dbl> 1.0, 1.0, 1.5, 2.5, 2.5, 2.5, 2.0, 2.0, 2.0, 2.5, 2.5, 2.0, 2…

tidymodels requires that any basic transformations of the outcome (such as the log transform used here) occur before data splitting.

3.3 Simple Data Splitting

There are a few main functions in the rsample package for an initial split: initial_split(), initial_time_split(), group_initial_split(), and initial_validation_split().

Most of our applications will use the first of these, where the default is to use 75% of the rows for training and 25% for testing. The rows are chosen at random, so there is no need to shuffle the data before splitting; in other words, a simple random split is used by default.
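As a quick sketch of the default behavior (the seed value is arbitrary):

set.seed(101)
# 75% training / 25% testing, selected completely at random
initial_split(ames)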

For the Ames data, we know that the distribution of sale prices has some outlying points. To deal with this, we’ll use a stratified split (on the outcome) using 5 quantiles of the data in ames:

set.seed(3024)
ames_split <- initial_split(ames, strata = Sale_Price, breaks = 5)

ames_split
#> <Training/Testing/Total>
#> <2196/734/2930>

The output shows the size of the resulting data sets. To get the two data sets, there are simple accessor functions:

ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

Consistent with the printed output, there are 2,196 data points in the training set and 734 reserved for testing.

We won’t touch on initial_time_split() here but only mention that it takes the fraction of the data specified for testing from the bottom/tail of the data frame. Unlike the previous function, the order of the rows matters.
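Although we won’t use it, a minimal sketch of a call (assuming the rows were already in a meaningful order, which they are not here) would be:

# The first 75% of rows become the training set; the remaining rows are the test set
initial_time_split(ames, prop = 0.75)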

group_initial_split() and initial_validation_split() are discussed in more detail below.

3.4 Using the Predictors

Instead of using the outcome to partition the data, other columns can be used when applicable. The text mentions the twinning package; the same authors have a second approach in the SPlit package. Both are on CRAN and are straightforward to use.

Maximum dissimilarity sampling can be conducted using caret::maxDissim(). It starts with an initial set of one or more data points. Unless there is a specific set of points of interest, picking a point close to the center of the multivariate predictor distribution makes sense. Here is some code that uses the geographic coordinates as the splitting variables:

# Since we will be using distances in the calculations, create centered 
# and scaled versions of the coordinates then add a row index column. 
ames_scaled <-
  ames %>%
  select(Longitude, Latitude) %>%
  mutate(
    scaled_lon = scale(Longitude)[,1], 
    scaled_lat = scale(Latitude)[,1]
  ) %>%
  select(starts_with("scaled")) %>% 
  add_rowindex()

# Select an initial data point closest to the middle
seed_row <-
  ames_scaled %>%
  mutate(
    dist = (scaled_lon)^2 + (scaled_lat)^2
  ) %>%
  slice_min(dist, n = 1) %>%
  pluck(".row")

# Partition the data
ames_scaled_seed <- ames_scaled %>% slice( seed_row)
ames_scaled_pool <- ames_scaled %>% slice(-seed_row)

# Conduct the selection process
selection_path <- 
  caret::maxDissim(
    # Only give the function the predictor columns for each data set
    ames_scaled_seed %>% select(-.row), 
    ames_scaled_pool %>% select(-.row), 
    n = 24
  )

# Get the selected row numbers that correspond to the 'ames' data frame.
selected_rows <- c(seed_row, ames_scaled_pool$.row[selection_path])

selected_data <- ames %>% slice(selected_rows)

# A non-map plot of the values: 
selected_data %>%
  mutate(xend = lead(Longitude), yend = lead(Latitude)) %>%
  ggplot(aes(Longitude, Latitude)) +
  geom_point() +
  geom_segment(aes(xend = xend, yend = yend),
               arrow = arrow(length = unit(0.1, "inches"), type = "closed"),
               col = "blue", alpha = 1 / 5) +
  theme_bw()

3.5 Multi-Level Data

This section focuses on data with a rational grouping. For example, medical data might follow patients over time, so there are multiple rows per patient. The patient is the independent experimental unit (IEU), meaning that the data between patients are thought to be independent while the data within a patient are (statistically) related. We want to partition the data so that all of the rows for each IEU end up in either the training set or the test set, but not both. In other words, we sample the data by group, and the group in this example is the patient.

There are other applications of grouped data, but the example data that we’ll use fit the description above: 27 patients were followed, with data collected at four time points. The data are in the nlme package:

data(Orthodont, package = "nlme")
glimpse(Orthodont)
#> Rows: 108
#> Columns: 4
#> $ distance <dbl> 26.0, 25.0, 29.0, 31.0, 21.5, 22.5, 23.0, 26.5, 23.0, 22.5, 24.0,…
#> $ age      <dbl> 8, 10, 12, 14, 8, 10, 12, 14, 8, 10, 12, 14, 8, 10, 12, 14, 8, 10…
#> $ Subject  <ord> M01, M01, M01, M01, M02, M02, M02, M02, M03, M03, M03, M03, M04, …
#> $ Sex      <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, Male,…

To use rsample::group_initial_split(), we must supply a group argument that corresponds to one of the columns in the data. There is also a prop argument that specifies the proportion of the groups that should go into the training set.

set.seed(93)
orth_split <- group_initial_split(Orthodont, group = Subject, prop = 2 / 3)

# The numbers in this output are individual rows (not numbers of groups)
orth_split
#> <Training/Testing/Total>
#> <72/36/108>

From here, the code to get the resulting data sets is the same as previously shown. We’ll also verify that no subjects are in both data sets:

orth_train <- training(orth_split)
orth_test  <- testing(orth_split)

# Is there any overlap in the subjects? 
subjects_train <- unique(orth_train$Subject)
subjects_test  <- unique(orth_test$Subject)

intersect(subjects_train, subjects_test)
#> character(0)

3.6 Validation Sets

To add a validation set at the outset, initial_validation_split() works much the same as initial_split(). The prop argument now requires two values: the first is the training set proportion, and the second is the validation set proportion. In the example below, we allocate 80% of the rows to training, 10% to validation, and the remaining 10% to testing:

set.seed(4)
ames_val_split <- initial_validation_split(ames, strata = Sale_Price, prop = c(0.8, 0.1))

ames_val_split
#> <Training/Validation/Testing/Total>
#> <2342/293/295/2930>

Again, the data sets are obtained in the same way, with the additional use of the validation() function:

ames_train <- training(ames_val_split)
ames_val   <- validation(ames_val_split)
ames_test  <- testing(ames_val_split)

rsample::initial_validation_time_split() does the same thing but based on the ordering of the data (as opposed to random selection).

Suppose a data frame had 100 rows. Using prop = c(0.8, 0.1) would place the first 80 rows into training, the next 10 into validation, and the last 10 into testing. Since the allocation follows the row order, keeping the data appropriately ordered (e.g., chronologically) is important when using this function.
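As a small illustrative sketch (the ordered_df object and its columns are made up for illustration):

# 100 rows, assumed to be in chronological order
ordered_df <- tibble(day = 1:100, y = rnorm(100))

val_time_split <- initial_validation_time_split(ordered_df, prop = c(0.8, 0.1))

# Rows 1-80 go to training, rows 81-90 to validation, and rows 91-100 to testing
nrow(training(val_time_split))
nrow(validation(val_time_split))
nrow(testing(val_time_split))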