+ - 0:00:00
Notes for current slide
Notes for next slide

The first step is admitting you have a Problem

Kicking the Spreadsheet Habit

Michael Jones

6 May 2021

1

2

Times are changing

3

Clients and stakeholders' needs are changing. Questions need to be answered quicker and more accurately for cheaper so we have to be more responsive Many may already be using data warehouses behind the scenes, and so we need to be comfortable with the language and technology in order to meet them where they are, or perhaps even help them get to where they want to be.

You may get a huge number of files to process and you won't be able to do it no matter how many actuarial students you throw at it.

Are we becoming
replaceable?

4

The genie is out of the bottle. Aside from reserved roles, there's nothing technical that we do that couldn't be done by a similarly skilled data scientist:

  • cheaper (since they aren't operating in a potentially inflated salary environment driven by false scarcity?)
  • better? since they haven't got a huge mountain of legacy thought

Part 1

Spreadsheets:
A biased history

5

Public domain image (GNU General Public Licence)
6

Visicalc 1979 on the Apple II

Copyright image but fair use
7

Lotus 1-2-3 in 1983 on MS-DOS

Image copyright Microsoft, used with permission of Microsoft, or at least that's What Wikipedia told me to say
8

Microsoft Excel 1995

Screenshot under Mozillla Public Licence
9

LibreOffice

People seem to like
spreadsheets

10

There are some good things

  • Rough work
  • Quick view of your data
  • Toy models
  • Data interchange

Also I've seen people make art in spreadsheets and perform standup comedy about spreadsheets.

11

But...

12

News report from The Verge

We have the most successful, most popular spreadsheet program breaking peoples' data.

The tools you use dicate
what you can do

but they also dictate
what you think is possible

14

Spreadsheets make you squash problems into a format that can be solved by a spreadsheet or make you think your problem isn't solvable because it can't be done in a spreadsheet easily.

Part 2:

Reaching their limits

15

Excel is not a database

17
  • 2003 TransAlta lost $24 million due to Excel copy-and-paste error
  • 2005 Kodak suffered an $11 million severance error due to Excel typo
18
  • 2010 MI5 bugged the wrong phones due to a spreadsheet formatting error
  • 2011 AstraZeneca accidentally released confidential information due to an Excel templating error
19

When you don't control
your primary data system,
you are at risk

20

Present and future challenges

  • Life risk modeling
  • Solvency II
  • IFRS 17
  • Complex Models
21

Spreadsheets have
inherent problems
that limit their
ongoing usefulness

22

Problems

  • The cell as the fundamental unit
  • No data types & lack of abstraction
  • Slowness
  • Conflation of model, implementation and instance
23

Problems (continuted)

  • VBA
  • Fragmentation of model instances
  • Wider Office ecosystem
  • It crashes a lot
24

What would a solution look like?

  • quicker results
  • less risk
  • more value add work
  • more fulfilling work
  • more efficient
  • more reactive
25

Part 3:

Another way

26

SQL

  • A relational model of data
  • A centralised source
  • Data constraints
  • Interoperability
27

Programming languages

  • R
  • Python
  • Julia

Separate the model, the implementation and the use

28

Code
reuse

29

Abstractions and
Data Structures

30

Speed

  • Aviva, Julia and Solvency II
  • Me, a complex client and an adventuRe
31

New or better
methods

32

Version control

33

3rd party
libraries

34

Automated
testing

35

Computable
Documents

36

Better
Plotting

37
diamonds
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
37
diamonds %>%
ggplot(aes(x = carat,
y = price,
colour = clarity))

37
diamonds %>%
ggplot(aes(x = carat,
y = price,
colour = clarity)) +
geom_point()

37
diamonds %>%
ggplot(aes(x = carat,
y = price,
colour = clarity)) +
geom_point() +
facet_wrap(~cut)

37
diamonds %>%
ggplot(aes(x = carat,
y = price,
colour = clarity)) +
geom_point() +
facet_wrap(~cut) +
theme_bw()

37
diamonds %>%
ggplot(aes(x = carat,
y = price,
colour = clarity)) +
geom_point() +
facet_wrap(~cut) +
theme_bw() +
theme(legend.position = "bottom")

37
diamonds %>%
ggplot(aes(x = carat,
y = price,
colour = clarity)) +
geom_point() +
facet_wrap(~cut) +
theme_bw() +
theme(legend.position = "bottom") +
labs(x = "Carat",
y = "Price (USD)",
title = "Some facts about Diamonds")

37
ggplot(midwest, aes(x = poptotal))

37
ggplot(midwest, aes(x = poptotal)) +
geom_histogram(bins = 30)

37
ggplot(midwest, aes(x = poptotal)) +
geom_histogram(bins = 30) +
facet_wrap(~state)

37
ggplot(midwest, aes(x = poptotal)) +
geom_histogram(bins = 30) +
facet_wrap(~state) +
scale_x_log10(labels = scales::comma)

37
ggplot(midwest, aes(x = poptotal)) +
geom_histogram(bins = 30) +
facet_wrap(~state) +
scale_x_log10(labels = scales::comma) +
theme_bw()

37
ggplot(midwest, aes(x = poptotal)) +
geom_histogram(bins = 30) +
facet_wrap(~state) +
scale_x_log10(labels = scales::comma) +
theme_bw() +
labs(x = "Total Population",
y = "Count",
title = "Distribution of Total county population in Midwestern States")

37

Control & reproducibility

38

TAS 200 (Insurance):

Implementations and realisations of models shall be reproducible.

But what does that mean?

39

All Analysis is a DAG

40

Interlude:
A live demo

41

Part 4:

How to learn

42

Pick a Language

43

Probably don't bother
with the IFoA Certificate
in Data Science

44

Start
Small

45

www.xkcd.com/1205
46

The best things in life are free

  • R for Data Science, Hadley Wickham
  • Rstudio tutorials
  • Package documentation
  • Conference recordings
47

The best things in life are free

  • People at work
  • Online communities, e.g. Twitter
  • Data Science screencasts
  • Other books and blogs
48

If someone else is paying

  • Paid training
  • Dead-tree books
  • Physical conferences
49

Come to terms
with the cost

50

Epilogue

51

If you must use spreadsheets,
use them well

  • Data Organization in Spreadsheets, Broman & Woo, The American Statistician
  • Tidy Data, Hadley Wickham, The Journal of Statistical Software
52

Final Thoughts

  • Actuaries could probably be replaced
  • Code-based work flows offer massive advantages of speed, accuracy and quality
  • It's never too late to start learning
53

2
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
oTile View: Overview of Slides
Esc Back to slideshow