1  Introduction

tidymodels is a framework for creating statistical and machine learning models in R. The framework consists of a set of tightly coupled R packages that are designed in the same way. The project began in late 2016.

The main tidymodels resources are:

We’ll reference these and other resources as needed.

1.1 Installation

tidymodels is built in R so you’ll need to install that. We used R version 4.3.2 (2023-10-31) for these notes. To install R, you can go to CRAN (see footnote 1) to download it for your operating system. If you are comfortable at the command line, the rig application is an excellent way to install and manage R versions.

You probably want to use an integrated development environment (IDE); it will make your life much better. We use the RStudio IDE, which can be downloaded here. Other options include Visual Studio and Emacs.

To use tidymodels, you need to install multiple packages. The core packages are bundled into a “verse” package called tidymodels. When you install that, you get the primary packages as well as some tidyverse packages such as dplyr and ggplot2.

To install it, you can use

install.packages("tidymodels")

We suggest using the pak package for installation. To do this, first install that and then use it for further installations:

install.packages("pak")

# if pak was installed successfully, use it to install tidymodels
# (requireNamespace() checks for the package without attaching it)
if (requireNamespace("pak", quietly = TRUE)) {
  pak::pak("tidymodels")
}
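After either installation route, it can be useful to confirm that R can actually find the packages. The following is a minimal sketch using only base R. The object names pkgs and installed are ours, and the result depends on what is installed on your machine.

```r
# Packages we expect to be available after the installation steps
pkgs <- c("tidymodels", "pak")

# requireNamespace() returns TRUE when a package can be loaded
# (it does not attach the package) and FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed
```

A FALSE entry suggests that the corresponding install.packages() or pak::pak() call did not complete.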

1.2 Loading tidymodels

Once the packages are installed, load tidymodels:

library(tidymodels)
#> ── Attaching packages ─────────────────────────────────────────── tidymodels 1.1.1 ──
#> ✔ broom        1.0.5     ✔ recipes      1.0.8
#> ✔ dials        1.2.0     ✔ rsample      1.2.0
#> ✔ dplyr        1.1.4     ✔ tibble       3.2.1
#> ✔ ggplot2      3.4.4     ✔ tidyr        1.3.0
#> ✔ infer        1.0.5     ✔ tune         1.1.2
#> ✔ modeldata    1.2.0     ✔ workflows    1.1.3
#> ✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
#> ✔ purrr        1.0.2     ✔ yardstick    1.2.0
#> ── Conflicts ────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Use suppressPackageStartupMessages() to eliminate package startup messages

The default output shows the packages that are automatically attached. There are a lot of functions in tidymodels, but by loading this meta-package, you don’t have to remember which functions come from which packages.

Note the lines at the bottom that contain messages like:

dplyr::filter() masks stats::filter()

This means that two packages, dplyr and stats, have functions with the same name, filter() (see footnote 2). If you were to type filter at an R prompt, the function that you get corresponds to the one in the most recently loaded package. That’s not ideal.
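To make the masking concrete, the sketch below calls each filter() by its namespace so that there is no ambiguity about which one runs. The objects x, moving_avg, and big_rows are ours, and the dplyr part only runs if dplyr is installed.

```r
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter to a series;
# here, a centered three-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
if (requireNamespace("dplyr", quietly = TRUE)) {
  big_rows <- dplyr::filter(data.frame(x = x), x > 3)
}
```

The two functions do unrelated things, so silently getting the wrong one usually produces either an error or a confusing result.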

To handle this, we have a function called tidymodels_prefer(). When you use this, it prioritizes functions from the tidymodels and tidyverse groups so that you get those (see footnote 3). If you want to see the specific conflicts and how we resolve them, see this output:

tidymodels_prefer(quiet = FALSE)
#> [conflicted] Will prefer agua::refit over any other package.
#> [conflicted] Will prefer dials::Laplace over any other package.
#> [conflicted] Will prefer dials::max_rules over any other package.
#> [conflicted] Will prefer dials::neighbors over any other package.
#> [conflicted] Will prefer dials::prune over any other package.
#> [conflicted] Will prefer dials::smoothness over any other package.
#> [conflicted] Will prefer dplyr::collapse over any other package.
#> [conflicted] Will prefer dplyr::combine over any other package.
#> [conflicted] Will prefer dplyr::filter over any other package.
#> [conflicted] Will prefer dplyr::rename over any other package.
#> [conflicted] Will prefer dplyr::select over any other package.
#> [conflicted] Will prefer dplyr::slice over any other package.
#> [conflicted] Will prefer ggplot2::`%+%` over any other package.
#> [conflicted] Will prefer ggplot2::margin over any other package.
#> [conflicted] Will prefer parsnip::bart over any other package.
#> [conflicted] Will prefer parsnip::fit over any other package.
#> [conflicted] Will prefer parsnip::mars over any other package.
#> [conflicted] Will prefer parsnip::pls over any other package.
#> [conflicted] Will prefer purrr::cross over any other package.
#> [conflicted] Will prefer purrr::invoke over any other package.
#> [conflicted] Will prefer purrr::map over any other package.
#> [conflicted] Will prefer recipes::discretize over any other package.
#> [conflicted] Will prefer recipes::step over any other package.
#> [conflicted] Will prefer rsample::populate over any other package.
#> [conflicted] Will prefer scales::rescale over any other package.
#> [conflicted] Will prefer themis::step_downsample over any other package.
#> [conflicted] Will prefer themis::step_upsample over any other package.
#> [conflicted] Will prefer tidyr::expand over any other package.
#> [conflicted] Will prefer tidyr::extract over any other package.
#> [conflicted] Will prefer tidyr::pack over any other package.
#> [conflicted] Will prefer tidyr::unpack over any other package.
#> [conflicted] Will prefer tune::parameters over any other package.
#> [conflicted] Will prefer tune::tune over any other package.
#> [conflicted] Will prefer yardstick::get_weights over any other package.
#> [conflicted] Will prefer yardstick::precision over any other package.
#> [conflicted] Will prefer yardstick::recall over any other package.
#> [conflicted] Will prefer yardstick::spec over any other package.
#> [conflicted] Will prefer recipes::update over Matrix::update.
#> ── Conflicts ───────────────────────────────────────────────── tidymodels_prefer() ──
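In practice, you would typically call tidymodels_prefer() once, right after loading the package. A minimal sketch of that pattern follows, assuming tidymodels is installed. The object four_cyl is ours, and mtcars is a data set that ships with R.

```r
if (requireNamespace("tidymodels", quietly = TRUE)) {
  suppressPackageStartupMessages(library(tidymodels))

  # Resolve known name conflicts in favor of tidymodels and tidyverse functions
  tidymodels_prefer()

  # An unqualified filter() should now be the dplyr version,
  # which subsets the rows of a data frame
  four_cyl <- filter(mtcars, cyl == 4)
  nrow(four_cyl)
}
```

If the stats version were picked up instead, the call would fail rather than return a data frame, so a successful result is a quick sanity check.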

If you want to know more about why tidymodels exists, we’ve written a bit about this in the tidymodels book. The second chapter describes how tidyverse principles can be used for modeling.

1.3 Package Versions and Reproducibility

We will do our best to use versions of our packages corresponding to the CRAN versions. We can’t always do that, and, for many packages, a version number ending with a value in the 9000 range (e.g., version “1.1.4.9001”) means that it was a development version of the package and was most likely installed from a GitHub repository.
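The version number convention described above can be checked programmatically. The helper below is a hypothetical function of ours (it is not part of tidymodels) that uses base R to test whether the last component of a version string is 9000 or greater.

```r
# TRUE when the last component of a version string is 9000 or greater,
# which by convention marks a development version
is_dev_version <- function(version) {
  parts <- unclass(package_version(version))[[1]]
  parts[length(parts)] >= 9000
}

is_dev_version("1.1.4.9001")
#> [1] TRUE
is_dev_version("1.1.4")
#> [1] FALSE
```

For an installed package, something like is_dev_version(as.character(packageVersion("dplyr"))) would apply the same check.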

At the end of each session, we’ll show which packages were loaded and used:

sessioninfo::session_info()
#> ─ Session info ────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Monterey 12.7.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2023-12-11
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ────────────────────────────────────────────────────────────────────────
#>  ! package      * version    date (UTC) lib source
#>  P backports      1.4.1      2021-12-13 [?] CRAN (R 4.3.0)
#>  P broom        * 1.0.5      2023-06-09 [?] CRAN (R 4.3.0)
#>  P cachem         1.0.8      2023-05-01 [?] CRAN (R 4.3.0)
#>  P class          7.3-22     2023-05-03 [?] CRAN (R 4.3.2)
#>  P cli            3.6.2      2023-12-11 [?] CRAN (R 4.3.0)
#>  P codetools      0.2-19     2023-02-01 [?] CRAN (R 4.3.2)
#>  P colorspace     2.1-0      2023-01-23 [?] CRAN (R 4.3.0)
#>  P conflicted     1.2.0      2023-02-01 [?] CRAN (R 4.3.0)
#>  P data.table     1.14.8     2023-02-17 [?] CRAN (R 4.3.0)
#>  P dials        * 1.2.0      2023-04-03 [?] CRAN (R 4.3.0)
#>  P DiceDesign     1.9        2021-02-13 [?] CRAN (R 4.3.0)
#>  P digest         0.6.33     2023-07-07 [?] CRAN (R 4.3.0)
#>  P dplyr        * 1.1.4      2023-11-17 [?] CRAN (R 4.3.0)
#>  P evaluate       0.23       2023-11-01 [?] CRAN (R 4.3.0)
#>  P fansi          1.0.5      2023-10-08 [?] RSPM (R 4.3.0)
#>  P fastmap        1.1.1      2023-02-24 [?] CRAN (R 4.3.0)
#>  P foreach        1.5.2      2022-02-02 [?] CRAN (R 4.3.0)
#>  P furrr          0.3.1      2022-08-15 [?] CRAN (R 4.3.0)
#>  P future         1.33.0     2023-07-01 [?] CRAN (R 4.3.0)
#>  P future.apply   1.11.0     2023-05-21 [?] CRAN (R 4.3.0)
#>  P generics       0.1.3      2022-07-05 [?] CRAN (R 4.3.0)
#>  P ggplot2      * 3.4.4      2023-10-12 [?] CRAN (R 4.3.0)
#>  P globals        0.16.2     2022-11-21 [?] CRAN (R 4.3.0)
#>  P glue           1.6.2      2022-02-24 [?] CRAN (R 4.3.0)
#>  P gower          1.0.1      2022-12-22 [?] CRAN (R 4.3.0)
#>  P GPfit          1.0-8      2019-02-08 [?] CRAN (R 4.3.0)
#>  P gtable         0.3.4      2023-08-21 [?] CRAN (R 4.3.0)
#>  P hardhat        1.3.0      2023-03-30 [?] CRAN (R 4.3.0)
#>  P htmltools      0.5.7      2023-11-03 [?] CRAN (R 4.3.0)
#>  P htmlwidgets    1.6.2      2023-03-17 [?] CRAN (R 4.3.0)
#>  P infer        * 1.0.5      2023-09-06 [?] RSPM (R 4.3.0)
#>  P ipred          0.9-14     2023-03-09 [?] CRAN (R 4.3.0)
#>  P iterators      1.0.14     2022-02-05 [?] CRAN (R 4.3.0)
#>  P jsonlite       1.8.7      2023-06-29 [?] CRAN (R 4.3.0)
#>  P knitr          1.45       2023-10-30 [?] CRAN (R 4.3.0)
#>  P lattice        0.21-9     2023-10-01 [?] CRAN (R 4.3.1)
#>  P lava           1.7.3      2023-11-04 [?] CRAN (R 4.3.0)
#>  P lhs            1.1.6      2022-12-17 [?] CRAN (R 4.3.0)
#>  P lifecycle      1.0.4      2023-11-07 [?] CRAN (R 4.3.0)
#>  P listenv        0.9.0      2022-12-16 [?] CRAN (R 4.3.0)
#>  P lubridate      1.9.3      2023-09-27 [?] RSPM (R 4.3.0)
#>  P magrittr       2.0.3      2022-03-30 [?] CRAN (R 4.3.0)
#>  P MASS           7.3-60     2023-05-04 [?] CRAN (R 4.3.2)
#>  P Matrix         1.6-1.1    2023-09-18 [?] CRAN (R 4.3.1)
#>  P memoise        2.0.1      2021-11-26 [?] CRAN (R 4.3.0)
#>  P modeldata    * 1.2.0      2023-08-09 [?] CRAN (R 4.3.0)
#>  P munsell        0.5.0      2018-06-12 [?] CRAN (R 4.3.0)
#>  P nnet           7.3-19     2023-05-03 [?] CRAN (R 4.3.2)
#>  P parallelly     1.36.0     2023-05-26 [?] CRAN (R 4.3.0)
#>  P parsnip      * 1.1.1      2023-08-17 [?] CRAN (R 4.3.0)
#>  P pillar         1.9.0      2023-03-22 [?] CRAN (R 4.3.0)
#>  P pkgconfig      2.0.3      2019-09-22 [?] CRAN (R 4.3.0)
#>  P prodlim        2023.08.28 2023-08-28 [?] CRAN (R 4.3.0)
#>  P purrr        * 1.0.2      2023-08-10 [?] CRAN (R 4.3.0)
#>  P R6             2.5.1      2021-08-19 [?] CRAN (R 4.3.0)
#>  P Rcpp           1.0.11     2023-07-06 [?] CRAN (R 4.3.0)
#>  P recipes      * 1.0.8      2023-08-25 [?] CRAN (R 4.3.0)
#>    renv           1.0.3      2023-09-19 [1] CRAN (R 4.3.0)
#>  P rlang          1.1.2      2023-11-04 [?] CRAN (R 4.3.0)
#>  P rmarkdown      2.25       2023-09-18 [?] RSPM (R 4.3.0)
#>  P rpart          4.1.21     2023-10-09 [?] CRAN (R 4.3.0)
#>  P rsample      * 1.2.0      2023-08-23 [?] CRAN (R 4.3.0)
#>  P rstudioapi     0.15.0     2023-07-07 [?] CRAN (R 4.3.0)
#>  P scales       * 1.2.1      2022-08-20 [?] CRAN (R 4.3.0)
#>  P sessioninfo    1.2.2      2021-12-06 [?] CRAN (R 4.3.0)
#>  P survival       3.5-7      2023-08-14 [?] CRAN (R 4.3.0)
#>  P tibble       * 3.2.1      2023-03-20 [?] CRAN (R 4.3.0)
#>  P tidymodels   * 1.1.1      2023-08-24 [?] CRAN (R 4.3.0)
#>  P tidyr        * 1.3.0      2023-01-24 [?] CRAN (R 4.3.0)
#>  P tidyselect     1.2.0      2022-10-10 [?] CRAN (R 4.3.0)
#>  P timechange     0.2.0      2023-01-11 [?] CRAN (R 4.3.0)
#>  P timeDate       4022.108   2023-01-07 [?] CRAN (R 4.3.0)
#>  P tune         * 1.1.2      2023-08-23 [?] CRAN (R 4.3.0)
#>  P utf8           1.2.4      2023-10-22 [?] CRAN (R 4.3.0)
#>  P vctrs          0.6.4      2023-10-12 [?] CRAN (R 4.3.0)
#>  P withr          2.5.2      2023-10-30 [?] CRAN (R 4.3.0)
#>  P workflows    * 1.1.3      2023-02-22 [?] CRAN (R 4.3.0)
#>  P workflowsets * 1.0.1      2023-04-06 [?] CRAN (R 4.3.0)
#>  P xfun           0.41       2023-11-01 [?] CRAN (R 4.3.0)
#>  P yaml           2.3.7      2023-01-23 [?] CRAN (R 4.3.0)
#>  P yardstick    * 1.2.0      2023-04-21 [?] CRAN (R 4.3.0)
#> 
#>  [1] /Users/max/content/computing-tidymodels/renv/library/R-4.3/x86_64-apple-darwin20
#>  [2] /Users/max/Library/Caches/org.R-project.R/R/renv/sandbox/R-4.3/x86_64-apple-darwin20/b06620f4
#> 
#>  P ── Loaded and on-disk path mismatch.
#> 
#> ───────────────────────────────────────────────────────────────────────────────────
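If sessioninfo is not available, a rougher record of the attached packages can be put together with base R alone. This is only a sketch, and it reports far less detail than the output above. The objects attached and versions are ours.

```r
# Names of the packages currently attached to the search path
attached <- sub("^package:", "", grep("^package:", search(), value = TRUE))

# Look up the installed version of each attached package
versions <- vapply(
  attached,
  function(pkg) as.character(utils::packageVersion(pkg)),
  character(1)
)
versions
```

The result is a named character vector, so versions[["dplyr"]] would return the dplyr version if that package is attached.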

  1. The Comprehensive R Archive Network

  2. The syntax foo::bar() means that the function bar() is inside of the package foo. When used together, this is often referred to as “calling the function by its namespace.” You can do this in your code, and developers often do. However, it’s fairly ugly.

  3. Unfortunately, this is not a guarantee but it does work most of the time.