Reproducibility in R

Intro to efficient Data Pipelines

Michael Jones

2021-07-20

1 / 32

What is
Reproducibility?

2 / 32

Someone else can
re-run your process
and get the same results

3 / 32

Someone (else) can
re-run your process
and get the same results

4 / 32

What does a
reproducible process
look like?

7 / 32

Well documented

8 / 32

Non-interactive

9 / 32

Keyboard-based

10 / 32

Structured consistently

11 / 32

Friendly

12 / 32

Extendible

13 / 32

The scale
of Reproducibility

14 / 32

Stage 0

  • Doing it and not telling anyone
  • Hand Made Artisanal Analysis
15 / 32

Stage 1

  • Pen and Paper
  • Excel
  • Point and Click (Mouse work)
  • Doing it in R then not saving your work
16 / 32

Stage 2

  • Doing it in R and saving your work, but still not having any structure
17 / 32

Stage 3

  • Scripts like:
      01_load_data.R
      02_fit_linear_model.R
      03_fit_gam.R
      04_model_summaries.R
      05_plots.R
      06_paper.R
18 / 32

Stage 4

  • Process defined in code
  • Using a system that knows about dependencies
  • With a defined build process
  • That handles storage of results for you
  • End to end: from data to report
19 / 32

Stage 5

  • Virtual Environments
  • Containerisation (e.g. Docker)
  • Virtual Machines

Out of scope today

21 / 32

What does reproducibility
not look like?

22 / 32

In R

df %>% filter(col < value) %>%
...
  • No clear declaration of libraries
  • No evidence of how we got df in the first place.
23 / 32
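
A more reproducible version of the snippet above might look like the following sketch. It assumes the {dplyr} package is installed and uses R's built-in `mtcars` data set as a stand-in for the unknown `df`; the column name and the threshold are illustrative and not taken from the original analysis.

```r
# Declare every package the script needs, at the top of the file.
library(dplyr)

# Show where the data comes from. `mtcars` ships with R and stands in
# for whatever file or database the real `df` was read from.
df <- mtcars

# Name the threshold in the script instead of relying on a value
# left over in the interactive session (hypothetical value).
max_mpg <- 20

low_mpg_cars <- df %>%
  filter(mpg < max_mpg)
```

Anyone running this file in a fresh R session can see which packages are needed and how `df` was created.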

In R

setwd("path/to/directory/that/only/exists/on/my/machine/")
  • Use shared resources (e.g. databases)
  • Use paths relative to the project root
  • Use {here}
24 / 32
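
As a rough illustration of the {here} suggestion, the sketch below builds a path relative to the project root instead of calling `setwd()`. It assumes the {here} package is installed; the `data/raw/measurements.csv` file is a hypothetical example and not part of the talk.

```r
library(here)

# here() builds paths from the project root (detected via markers such as
# an .Rproj file or a .git directory), not from the current working directory.
raw_data_path <- here("data", "raw", "measurements.csv")  # hypothetical file

# Only read the file if it exists, so the sketch runs in any project.
if (file.exists(raw_data_path)) {
  measurements <- read.csv(raw_data_path)
}
```

The same script should then work for a collaborator who clones the project to a different location.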

In R

rm(list = ls())
  • NO
25 / 32
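
One likely reason for the "NO" (an explanation added here, not stated on the slide): `rm(list = ls())` only removes objects from the global environment. Attached packages and changed options survive, so the session is not actually fresh. A base R sketch, using the {stats4} package that ships with R and a made-up `leftover` object:

```r
# State that an earlier script might have left behind.
library(stats4)        # an attached package
options(digits = 3)    # a changed global option
leftover <- 1:10       # an object in the global environment

rm(list = ls())

exists("leftover")               # FALSE: the object is gone
"package:stats4" %in% search()   # TRUE: the package is still attached
getOption("digits")              # 3: the option is still changed
```

Restarting R and running the script from the top is a more reliable way to check that it works in a clean session.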

All Analysis
is a DAG

26 / 32
  • Graph: stages of your analysis are connected somehow
  • Acyclic: there is an order; earlier results feed into later results, with no loops
  • Directed: there's a flow from start to finish
27 / 32
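
To make the DAG idea concrete, here is a minimal base R sketch. The step names are hypothetical, and `run_order()` is a helper written for this illustration, not part of any package. Each step lists the steps it depends on, and the function works out an order in which every step runs after its inputs.

```r
# Each step lists the steps it depends on (hypothetical step names).
deps <- list(
  raw_data     = character(0),
  clean_data   = "raw_data",
  linear_model = "clean_data",
  gam          = "clean_data",
  summaries    = c("linear_model", "gam"),
  report       = c("summaries", "clean_data")
)

# Repeatedly pick the steps whose dependencies are all done.
# If no step can be picked, the dependencies do not form a DAG.
run_order <- function(deps) {
  done <- character(0)
  while (length(done) < length(deps)) {
    pending <- setdiff(names(deps), done)
    is_ready <- vapply(deps[pending], function(d) all(d %in% done), logical(1))
    ready <- pending[is_ready]
    if (length(ready) == 0) {
      stop("No runnable step left: there is a cycle or an unknown step")
    }
    done <- c(done, ready)
  }
  done
}

step_order <- run_order(deps)
```

Because the graph is directed and acyclic, such an order always exists; a loop between two steps would make `run_order()` stop with an error.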

What does Reproducibility
feel like?

28 / 32

"I have no idea how we did that..."

"This thing from 2 years ago makes perfect sense"

29 / 32

"... Oh no, that change in data is going to put another three weeks on the deadline"

"No problem, we'll have updated results with you by lunchtime"

30 / 32

Example

31 / 32

The {targets} package

  • Structures the analysis as a data object
  • On changes, only rebuild downstream from the change
  • Store all intermediate results out of the way
  • Plays nicely with R Markdown
32 / 32
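
As a rough sketch of what such a pipeline can look like, the `_targets.R` file below defines a small analysis. The file paths, column names (`x`, `y`), and the model are hypothetical and not from the talk. It assumes the {targets}, {readr}, and {dplyr} packages are installed; the last target uses `tar_render()` from the companion package {tarchetypes} and assumes a `report.Rmd` exists in the project.

```r
# _targets.R (a sketch; file names, columns and the model are hypothetical)
library(targets)

# Packages made available to every target when the pipeline runs.
tar_option_set(packages = c("readr", "dplyr"))

list(
  # Track the input file so that edits to it invalidate downstream targets.
  tar_target(raw_file, "data/raw.csv", format = "file"),
  tar_target(raw_data, read_csv(raw_file)),
  tar_target(clean_data, filter(raw_data, !is.na(y))),
  tar_target(model, lm(y ~ x, data = clean_data)),
  # Render an R Markdown report as the final step of the pipeline.
  tarchetypes::tar_render(report, "report.Rmd")
)
```

With this file in the project root, `targets::tar_make()` builds the pipeline and skips targets that are already up to date, `targets::tar_read(model)` retrieves a stored result, and `targets::tar_visnetwork()` draws the dependency graph.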
