Much more than just fitting models

Perspective of an applied statistical scientist

Olivia Angelin-Bonnet

The New Zealand Institute for Plant and Food Research Limited

22 June 2023

Hi! I’m Olivia :)

Location of Le Pont-de-Beauvoisin on a map of France

My background

  • Masters (France): Bioinformatics and modelling (biology, statistics, algebra, programming)

  • PhD in Statistics: Reconstructing genotype-phenotype interactions in the tetraploid potato
  • Lecturer in Statistics

  • Statistical scientist: Omics analyses and multi-omics integration

Introduction

What is Systems Biology?

 

The genome dictates the phenotype (physical characteristics) and response to environment of biological systems …


… through the complex interactions between the different molecular layers (genes, transcripts, proteins, metabolites, etc)

Network image from (Barabási and Oltvai 2004)

These interaction networks can be deciphered from measurements of the molecular actors – the omics data – through data integration

The challenges of data integration

  • Omics datasets measured with different technologies:

    • values have different meanings: proportions, concentrations, quantities related to the amount of biological material…

    • different scales: counts, continuous values, etc

    • missing values are not all equal

  • High dimensionality:

    • can measure the expression of 10,000s of genes and 1,000s of metabolites, but often for \(\leq\) 100s of observations

    • dimensionality varies with the omics measured

  • Hard to validate findings

My role as a statistical scientist


Extract information from omics datasets to answer biological questions


Multivariate analyses

Network reconstruction

Visualisation

What uni doesn’t teach you


Uni taught me about:

  • statistics

  • data analysis

  • programming in R and Python

… but that’s not enough!

Illustration by Nathan W. Pyle

Software engineering principles I wish I had learned about

  • Version control

  • Computational environment management

  • Documentation

  • Workflow management

  • Unit testing

  • (and more)

 

Becoming core skills for statistical scientists!

… but why?

 

Sustainability

Mind Vectors by Vecteezy

Open science

Version control

The practice of tracking and managing changes to [pieces of information].

Atlassian

Code versioning with Git

 

Why?

  • Easy access and recovery of previous versions

  • Streamlined storage solution for code

  • Enables collaboration

  • Standardised way of publishing code

  • Everyone else is using it!

Data versioning


  • Data is not static
  • Link results to versions of the data
  • Versioning raw and processed data

How?

DVC, Git LFS, object storage solutions like MinIO
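The idea of linking results to a version of the data can be sketched in base R with a file checksum. This is only an illustration of the principle, not how DVC, Git LFS or MinIO work internally; the file and the `result` list are made up for the example.

```r
# Write a small example dataset to a temporary file (stand-in for a real data file)
data_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(sample = c("s1", "s2"), value = c(1.5, 2.5)),
  data_file,
  row.names = FALSE
)

# Fingerprint of the data at analysis time
checksum_v1 <- unname(tools::md5sum(data_file))

# Store the fingerprint next to the result it produced
result <- list(
  mean_value = mean(read.csv(data_file)$value),
  data_md5 = checksum_v1
)

# Later, the data changes, so the fingerprint no longer matches the stored one
write.csv(
  data.frame(sample = c("s1", "s2", "s3"), value = c(1.5, 2.5, 9)),
  data_file,
  row.names = FALSE
)
data_changed <- unname(tools::md5sum(data_file)) != result$data_md5
print(data_changed)
```

A mismatch signals that the stored result was computed on a different version of the data and may need to be regenerated.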

This illustration is created by Scriberia with The Turing Way community. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807

Managing computational environment

Features of a computer […], such as its operating system, what software it has installed, and what versions of software packages are installed.

The Turing Way

Computational environment matters

The nightmare of package versions

Illustration by Dmitriy Zub

Dependency management tools

 

Use dependency management tools like renv (R) or conda (Python) in your work

1. Record dependencies

2. Isolate project library

Illustration by Dmitriy Zub
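As a rough illustration of step 1, the sketch below records package versions using base R only. `record_dependencies()` is a hypothetical helper written for this example; renv does this far more thoroughly, and its usual calls are listed in the comments at the end.

```r
# Step 1 (record dependencies), sketched with base R only.
# record_dependencies() is a hypothetical helper, not part of renv.
record_dependencies <- function(packages) {
  versions <- vapply(
    packages,
    function(pkg) as.character(utils::packageVersion(pkg)),
    character(1)
  )
  data.frame(
    package = packages,
    version = unname(versions),
    stringsAsFactors = FALSE
  )
}

# Record the versions of two packages that ship with R
deps <- record_dependencies(c("stats", "utils"))
print(deps)

# With renv, the equivalent workflow is run from the project directory:
# renv::init()      # create an isolated project library
# renv::snapshot()  # record package versions in renv.lock
# renv::restore()   # reinstall the recorded versions on another machine
```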

Containers

Illustration from https://www.docker.com/

Documentation

Material that provides official information or evidence or that serves as a record.

Oxford Languages

Write it down!

 

  • Need accurate records for reports/presentations

  • Facilitates collaboration

  • Ensures the project outlives its funding period

  • Encourages good practices

README

 

  • Project scope and aims

  • Contributors and key contacts

  • Data source (where does it come from) and location (where is it stored)

  • Summary of analysis (scripts, intermediary results, etc)

  • List of output (figures, reports, cleaned data)

Literate programming

 

  • Concept: mix code and plain text

  • Goal: record reasoning, notes and interpretation alongside source code and results

  • Tools: Quarto, Rmarkdown, Jupyter notebooks…

Can be version-controlled!


(Unit) testing

A software testing method by which individual units of source code […] are tested to determine whether they are fit for use.

Wikipedia

Testing to catch issues

 

R package testthat:

library(testthat)

# A test groups one or more expectations under a descriptive name
test_that("multiplication works", {
  expect_equal(2 * 2, 4)
})
#> Test passed 🌈

  • Does the data wrangling return the expected number of rows/columns?

  • Does the distribution of the transformed data make sense?

  • Does the analysis return a warning if there are missing values?
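Two of these questions could be turned into testthat tests along the following lines. The functions `drop_unlabelled()` and `safe_mean()` and the `raw` data frame are hypothetical stand-ins for a real wrangling step, analysis step and dataset.

```r
library(testthat)

# Hypothetical wrangling step: drop rows with a missing sample ID
drop_unlabelled <- function(df) {
  df[!is.na(df$sample_id), , drop = FALSE]
}

# Hypothetical analysis step: warn when missing values are present
safe_mean <- function(x) {
  if (anyNA(x)) warning("missing values removed before computing the mean")
  mean(x, na.rm = TRUE)
}

# Small made-up dataset with one missing ID and one missing value
raw <- data.frame(
  sample_id = c("s1", NA, "s3"),
  value = c(1.2, 3.4, NA)
)

test_that("wrangling returns the expected dimensions", {
  cleaned <- drop_unlabelled(raw)
  expect_equal(nrow(cleaned), 2)
  expect_equal(ncol(cleaned), 2)
})

test_that("analysis warns about missing values", {
  expect_warning(safe_mean(raw$value), "missing values")
})
```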

Using simulations to test statistical methods

“A key strength of simulation studies is the ability to understand the behavior of statistical methods because some ‘truth’ (usually some parameter/s of interest) is known from the process of generating the data. This allows us to consider properties of methods, such as bias.”

(Morris, White, and Crowther 2019)
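A minimal sketch of such a simulation study, assuming a simple linear model whose true slope we choose ourselves. The model, sample size and number of replicates are arbitrary choices for illustration.

```r
set.seed(2023)

true_slope <- 2   # the known "truth" used to generate the data
n_obs <- 50       # observations per simulated dataset
n_sim <- 500      # number of simulated datasets

# Simulate one dataset and return the slope estimated by linear regression
estimate_slope <- function() {
  x <- rnorm(n_obs)
  y <- true_slope * x + rnorm(n_obs)
  unname(coef(lm(y ~ x))["x"])
}

estimates <- replicate(n_sim, estimate_slope())

# Bias: average estimate minus the true value (should be close to 0 here)
bias <- mean(estimates) - true_slope
print(bias)
```

Because the data-generating process is known, any systematic gap between the estimates and `true_slope` can be attributed to the method rather than to the data.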


Testing data

 

 

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

R package validate:

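library(validate)

# Rules that the built-in cars dataset is expected to satisfy
rules <- validator(
  speed >= 0,
  dist >= 0,
  speed / dist <= 1.5,
  cor(speed, dist) >= 0.2
)

# Check every rule against the data
confront(cars, rules)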

 

  • Are there any missing data?

  • Range of values as expected?

  • Dimensions correct?
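The three questions above can also be checked with base R alone, as in this sketch on the built-in cars dataset. `check_data()` is a hypothetical helper, and the expected dimensions and ranges are assumptions about what this dataset should look like.

```r
# check_data() is a hypothetical helper; the expected values passed below
# are assumptions about the built-in cars dataset.
check_data <- function(df, expected_dim, ranges) {
  # For each listed column, are all values inside the expected range?
  in_range <- vapply(
    names(ranges),
    function(col) {
      all(df[[col]] >= ranges[[col]][1] & df[[col]] <= ranges[[col]][2])
    },
    logical(1)
  )
  c(
    no_missing_values = !anyNA(df),
    dimensions_correct = identical(dim(df), as.integer(expected_dim)),
    values_in_range = all(in_range)
  )
}

checks <- check_data(
  cars,
  expected_dim = c(50, 2),
  ranges = list(speed = c(0, 30), dist = c(0, 150))
)
print(checks)
```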

Workflow management

The discipline of creating, documenting, monitoring and improving upon the series of steps, or workflow, that is required to complete a specific task.

TechTarget

A real-life example of a complex analysis

Schema of the analysis for one of my thesis chapters


Typical analysis folder:

thesis/chapter4_code
├── genomics_data_analysis
│   ├── 00_genomics_wrangling.R
│   ├── 01_genomics_filtering.R
│   └── 02_genomics_eda.R
└── transcriptomics_data_analysis
    ├── 00_transcriptomics_wrangling.R
    ├── 01_transcriptomics_normalisation.R
    ├── 02_transcriptomics_differential_expression.R
    └── 03_transcriptomics_wgcna.R
...

Typical issues:

  • Input data has changed; what do I need to re-run?

  • In which order do I need to run these scripts? (differential expression analysis needs results from genomics analysis)

Workflow management tools

Concept: turn your analysis into a pipeline, i.e. series of steps linked through input/output, which will be executed in the correct order
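The concept can be sketched as a toy example in base R: each step declares which other steps it needs, and the order of execution follows from those links. The step names are hypothetical, and this sketch has no cycle detection, caching or parallelism; dedicated tools such as the targets R package handle these properly.

```r
# Toy pipeline: each step names the steps whose output it needs
# (step names are hypothetical)
steps <- list(
  genomics_wrangling = list(needs = character(0)),
  genomics_filtering = list(needs = "genomics_wrangling"),
  transcriptomics_wrangling = list(needs = character(0)),
  differential_expression = list(
    needs = c("genomics_filtering", "transcriptomics_wrangling")
  )
)

# Return step names in an order where every step comes after its inputs
execution_order <- function(steps) {
  done <- character(0)
  visit <- function(name) {
    if (name %in% done) return(invisible(NULL))
    for (dep in steps[[name]]$needs) visit(dep)
    done <<- c(done, name)
  }
  for (name in names(steps)) visit(name)
  done
}

# Steps to re-run when one step changes: that step plus everything downstream
needs_rerun <- function(steps, changed) {
  stale <- changed
  for (name in execution_order(steps)) {
    if (any(steps[[name]]$needs %in% stale)) stale <- union(stale, name)
  }
  stale
}

print(execution_order(steps))
print(needs_rerun(steps, "transcriptomics_wrangling"))
```

Declaring the links between steps answers both typical issues at once: the run order is derived rather than remembered, and a change in one input only invalidates the steps that depend on it.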




Conclusion


 

  • Statisticians and data scientists need more than statistical skills

  • Software engineering practices are crucial for work sustainability and for collaboration

  • These skills should be taught in statistics degrees (but there are lots of resources out there!)

To go further

 

The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.

Thank you for your attention!

olivia.angelin-bonnet@plantandfood.co.nz

References

Angelin-Bonnet, O., Biggs, P. J., Baldwin, S., Thomson, S., & Vignes, M. (2020). sismonr: simulation of in silico multi-omic networks with adjustable ploidy and post-transcriptional regulation in R. Bioinformatics, 36(9), 2938–2940. https://doi.org/10.1093/bioinformatics/btaa002
Barabási, A.-L., & Oltvai, Z. N. (2004). Network biology: understanding the cell’s functional organization. Nature Reviews Genetics, 5(2), 101–113. https://doi.org/10.1038/nrg1272
Gallagher, R., Falster, D. S., Maitner, B., Salguero-Gomez, R., Vandvik, V., Pearse, W., et al. (2019). The open traits network: Using open science principles to accelerate trait-based science across the tree of life. http://dx.doi.org/10.32942/osf.io/kac45
Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102. https://doi.org/10.1002/sim.8086
Wang, M., Wang, L., Pu, L., Li, K., Feng, T., Zheng, P., et al. (2020). LncRNAs related key pathways and genes in ischemic stroke by weighted gene co-expression network analysis (WGCNA). Genomics, 112(3), 2302–2308. https://doi.org/10.1016/j.ygeno.2020.01.001