# Reproducibility in R
## Intro to Efficient Data Pipelines
### Michael Jones
### 2021-07-20

---

# What is **Reproducibility**?

---
class: middle, center
# Someone else can re-run your process and get the same results
---
class: middle, center
# Someone **(else)** can re-run your process and get the same results
---
class: middle, center
# Someone **(else)** can **re-run** your process and get the same results
---
class: middle, center
# Someone **(else)** can **re-run** your process and get the **same results**
---

# What does a reproducible process **look like**?

---
class: center, middle
# Well documented

---
class: center, middle
# Non-interactive

---
class: center, middle
# Keyboard-based

---
class: center, middle
# Structured consistently

---
class: center, middle, inverse

# The **scale of Reproducibility**

---

# Stage 0

--
- Doing it and not telling anyone
--

- Hand Made Artisanal Analysis

---

# Stage 1
--

- Pen and Paper
--

- Excel
--

- Point and Click (Mouse work)

--
- Doing it in R then not saving your work

---

# Stage 2
- Doing it in R **and** saving your work, but still not having any structure

---
# Stage 3

- Numbered scripts, run in order, like

```
01_load_data.R
02_fit_linear_model.R
03_fit_gam.R
04_model_summaries.R
05_plots.R
06_paper.R
```
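
A minimal sketch of how numbered scripts like these could be run in one go, using base R only. The helper name `run_scripts` and the two-digit prefix convention are assumptions for illustration, not part of the original workflow:

```r
# Source every script named like "01_something.R" in numeric order.
# Scripts are evaluated in the global environment, so later scripts
# can see objects created by earlier ones.
run_scripts <- function(dir = ".") {
  scripts <- sort(list.files(dir, pattern = "^[0-9]{2}_.*\\.R$", full.names = TRUE))
  for (script in scripts) {
    message("Running ", basename(script))
    source(script)
  }
  # Return the names of the scripts that were run, invisibly.
  invisible(basename(scripts))
}
```

Calling `run_scripts()` from the project root re-runs everything, every time: nothing here knows which steps are out of date.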

---
# Stage 4

- Process defined **in code**
- Using a system that knows about **dependencies**
- With a defined **build process**
- That handles storage of results **for you**
- End to end: from **data to report**

---
# Stage 5

- Virtual Environments
- Containerisation (e.g. Docker)
- Virtual Machines

---
# ~~Stage 5~~

- ~~Virtual Environments~~
- ~~Containerisation (e.g. Docker)~~
- ~~Virtual Machines~~

# Out of scope today

---
class: inverse, middle, center

# What does reproducibility **not** look like?

---

# In R

```r
df %>% filter(col < value) %>%
...
```

- No clear declaration of libraries
- No evidence of how we got `df` in the first place.
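
A minimal sketch of a more reproducible version of the snippet above, assuming {dplyr} is installed. The built-in `mtcars` data stands in for `df`, and `max_mpg` is a hypothetical threshold:

```r
# Declare every package the script needs at the top.
library(dplyr)

# Make the origin of the data explicit (built-in dataset as a stand-in).
cars <- mtcars

# Name the threshold instead of relying on an object left in the session.
max_mpg <- 20

low_mpg_cars <- cars %>%
  filter(mpg < max_mpg)
```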

---

# In R

```r
setwd("path/to/directory/that/only/exists/on/my/machine/")
```

- Use shared resources (e.g. databases)
- Use paths relative to the project root
- Use {here}
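
A minimal sketch of the {here} approach, assuming the package is installed. The file `data/raw/measurements.csv` is hypothetical:

```r
library(here)

# here() builds paths from the project root, wherever the project lives.
data_path <- here("data", "raw", "measurements.csv")

# Guard the read so the sketch still runs when the file does not exist.
if (file.exists(data_path)) {
  measurements <- read.csv(data_path)
}
```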

---

# In R

```r
rm(list = ls())
```

- NO
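
One reason to avoid it, sketched in base R: `rm(list = ls())` only removes objects from the workspace, so attached packages and changed options survive. The object name `scratch_value` is hypothetical:

```r
library(tools)        # attach a package (ships with R)
options(digits = 4)   # change a global option
scratch_value <- 42   # create an object

rm(list = ls())       # the "clean slate" that is not one

# The object is gone, but the rest of the session state is untouched.
object_gone <- !exists("scratch_value", inherits = FALSE)
package_still_attached <- "package:tools" %in% search()
option_still_changed <- getOption("digits") == 4
```

Restarting the R session before a final run gives a genuinely clean state.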

---
class: center, middle, inverse

# All Analysis is a **DAG**

---

- **Graph**: stages of your analysis are *connected* somehow
- **Acyclic**: there is an order, earlier results feed into later results, *no loops*
- **Directed**: there's a *flow* from start to finish
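
To make the idea concrete, here is a small base R sketch that stores an analysis as a DAG and derives a valid build order from it. The step names and the helper `topo_order` are hypothetical:

```r
# Each step lists the steps it depends on.
deps <- list(
  raw_data   = character(0),
  clean_data = "raw_data",
  model      = "clean_data",
  plots      = c("clean_data", "model"),
  report     = c("model", "plots")
)

# Repeatedly pick the steps whose dependencies are all already ordered.
topo_order <- function(deps) {
  remaining <- deps
  ordered <- character(0)
  while (length(remaining) > 0) {
    is_ready <- vapply(remaining, function(d) all(d %in% ordered), logical(1))
    ready <- names(remaining)[is_ready]
    # If nothing is ready, some steps depend on each other in a loop.
    if (length(ready) == 0) stop("Cycle detected: not a DAG")
    ordered <- c(ordered, ready)
    remaining <- remaining[setdiff(names(remaining), ready)]
  }
  ordered
}

build_order <- topo_order(deps)
```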

---
class: center, middle, inverse

# What does Reproducibility **feel** like?

---

# "I have no idea how we did that..."

# **"This thing from 2 years ago makes perfect sense"**

---

# "... Oh no, that change in data is going to put another three weeks on the deadline"

# **"No problem, we'll have updated results with you by lunchtime"**

---
class: center, middle, inverse

# Example

---

# The {targets} package
- Structures the analysis as a **data object**
- On changes, only rebuild **downstream from the change**
- Store all intermediate results out of the way
- Plays nicely with R Markdown
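
A minimal sketch of what a pipeline definition might look like, assuming {targets} is installed. The file `data/raw.csv`, its columns `x` and `y`, and all target names are hypothetical:

```r
# _targets.R, placed at the project root
library(targets)

# The script ends with a list of targets; {targets} works out the
# dependencies from the names used inside each command.
list(
  # Track the input file itself, so edits to it invalidate downstream targets.
  tar_target(raw_file, "data/raw.csv", format = "file"),
  tar_target(raw_data, read.csv(raw_file)),
  tar_target(model, lm(y ~ x, data = raw_data)),
  tar_target(model_summary, summary(model))
)
```

Running `targets::tar_make()` from the project root builds the pipeline and skips targets that are already up to date. `targets::tar_read(model_summary)` retrieves a stored result, and `targets::tar_visnetwork()` draws the dependency graph.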