Much more than just fitting models

Perspective of an applied statistical scientist

Olivia Angelin-Bonnet

The New Zealand Institute for Plant and Food Research Limited

22 June 2023

Hi! I’m Olivia :)

Location of Le Pont-de-Beauvoisin on a map of France

My background

  • Masters (France): Bioinformatics and modelling (biology, statistics, algebra, programming)

  • PhD in Statistics: Reconstructing genotype-phenotype interactions in the tetraploid potato
  • Lecturer in Statistics

  • Statistical scientist: Omics analyses and multi-omics integration

Introduction

What is Systems Biology?

 

The genome dictates the phenotype (physical characteristics) and response to environment of biological systems …

What is Systems Biology?

 

… through the complex interactions between the different molecular layers (genes, transcripts, proteins, metabolites, etc)

What is Systems Biology?

Network image from (Barabási and Oltvai 2004)

These interaction networks can be deciphered from measurements of the molecular actors – the omics data – through data integration

The challenges of data integration

  • Omics datasets measured with different technologies:

    • values have different meanings: they may represent proportions, concentrations, or quantities related to the amount of biological material…

    • different scales: counts, continuous values, etc

    • missing values are not all equal

  • High dimensionality:

    • can measure the expression of tens of thousands of genes and thousands of metabolites, but often with at most a few hundred observations

    • dimensionality varies with the omics measured

  • Hard to validate findings

My role as a statistical scientist


Extract information from omics datasets to answer biological questions


Multivariate analyses

Network reconstruction

(Wang et al. 2020)

Visualisation

What uni doesn’t teach you


Uni taught me about:

  • statistics

  • data analysis

  • programming in R and Python

… but that’s not enough!

Illustration by Nathan W. Pyle

Software engineering principles I wish I learned about

  • Version control

  • Computational environment management

  • Documentation

  • Workflow management

  • Unit testing

  • (and more)

 

Becoming core skills for statistical scientists!

… but why?

 

Sustainability

Mind Vectors by Vecteezy

Open science

From (Gallagher et al. 2019)

If nothing else, I think these concepts are invaluable to any statistical scientist for making their work sustainable. As I will show in the rest of the talk, we now face challenges in terms of the complexity of the work that we are doing, and these principles can ensure that we are not wasting time whenever we have to write up or revisit a project. Also, in terms of work ethic, we want to do the best we can to ensure that the work we produce is robust.

More generally, there is a growing push towards open science, and with it comes the discussion of reproducibility. I don’t want to delve deeper into this discussion here (I think that reproducibility doesn’t have to mean open; at a minimum it means reproducible for ourselves). Nevertheless, we will need to change our mindset and adopt these approaches.

Version control

The practice of tracking and managing changes to [pieces of information].

Atlassian

Code versioning with Git

 

Why?

  • Easy access and recovery of previous versions

  • Streamlined storage solution for code

  • Enable collaboration

  • Standardised way of publishing code

  • Everyone else is using it!

At a minimum, tools like GitHub, GitLab or BitBucket provide a cloud-based, streamlined storage solution for any code. They make your code portable (you can easily access it and copy it to a new computer) and your work more sustainable: by having access to the history of your work, you can recover previous versions if you realise you made a mistake.

But the reality of modern data science work makes code versioning almost an obligation. First, because we have to collaborate a lot. That could be as a student who needs to develop code that their supervisors will review, or, as in my case, as part of a complex project where several data scientists contribute to different aspects. Tools like GitHub make it so much easier to work together on the code.

Another aspect of modern data science is the strong emphasis on open science and reproducibility. Whether you agree with this or not, there are often requirements around making your code available, be it internally (within your organisation) or publicly alongside a scientific paper.

Lastly, an important reason to learn Git, GitHub and the like is simply that they are common practice in the data science community, so interacting with that community will require you to get used to these tools.
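Version control is usually driven from the command line, but as a minimal sketch of the same workflow from within R, here is what it could look like with the gert package (the repository name and file are made up for illustration):

library(gert)

git_init("my_analysis")                        # create a new repository
setwd("my_analysis")

# git needs to know who you are, once per machine:
# git_config_global_set("user.name", "Your Name")
# git_config_global_set("user.email", "you@example.org")

writeLines("x <- rnorm(100)", "analysis.R")    # a first script to track

git_add("analysis.R")                          # stage the file
git_commit("Add first draft of the analysis")  # record a snapshot with a message

git_log()                                      # browse the full history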

Data versioning


  • Data is not static
  • Link results to versions of the data
  • Versioning raw and processed data

How?

DVC, Git LFS, object storage solutions like MinIO

This illustration is created by Scriberia with The Turing Way community. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807

But what is less often discussed is the idea of versioning your data. I am not only talking about the raw data that we use as the foundation of an analysis, but also about the intermediate datasets that we produce (cleaned data) and the results (predictions, estimates, etc.).

As statistical or data scientists, data and code are our primary materials. At an individual level, getting familiar with version control tools makes your work sustainable by facilitating recovery, collaboration and organisation. At the level of a department or an organisation, it ensures that our data and code are treated as the assets that they are.
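DVC, Git LFS and MinIO are not R-specific, but as an R-flavoured sketch of the same idea, here is what data versioning could look like with the pins package (one option among many; the board location and dataset are made up):

library(pins)

# a local, versioned data store (could equally be an S3 bucket or a shared drive)
board <- board_folder("data_store", versioned = TRUE)

cleaned <- subset(airquality, !is.na(Ozone))           # some processed data
pin_write(board, cleaned, name = "airquality_cleaned")

# every pin_write() creates a new version, so results can be tied back to
# the exact version of the data they were computed from
pin_versions(board, "airquality_cleaned")
pin_read(board, "airquality_cleaned")                  # latest version by default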

Managing computational environment

Features of a computer […], such as its operating system, what software it has installed, and what versions of software packages are installed.

The Turing Way

Computational environment matters

Used under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (link).

Images from Clément H on Unsplash and panumas nikhomkhai from Pexels (link)

Whenever we’re doing any type of analysis, we’re not working in isolation: we rely on software (Genstat, R, Python) and packages. These are not static entities; people work on them, modify them, and release new versions. These changes can affect the results that we get, as updates may introduce breaking changes.

This matters even if we’re working by ourselves, but especially when we are collaborating: being able to reproduce the computational environment in which an analysis was done is essential. That is true when working with a colleague, and again for open and reproducible science.

Another aspect of this problem is that nowadays we are encouraged to make use of clusters, HPC systems or cloud services. This is a big change from working on your own laptop, where you control which packages and versions are installed.

The nightmare of package versions

Illustration by Dmitriy Zub

Even when working by yourself, anyone who has worked on more than one project has probably experienced this issue. You’re working on a project, using some packages, then you move on to another project that needs a newer version of a package. You install the new version. Then you want to go back, but that broke the first project! True story from my internship six years ago. It also happened much more recently, when working on two packages whose latest versions were not playing nicely together.

Dependency management tools

 

Use dependency management tools like renv (R) or conda (Python) in your work

1. Record dependencies

2. Isolate project library

Illustration by Dmitriy Zub
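As a minimal sketch of the renv workflow (the package installed is just an example):

install.packages("renv")

renv::init()       # create an isolated, project-specific package library

# ...install and use packages as usual...
install.packages("dplyr")

renv::snapshot()   # record the exact package versions in renv.lock
renv::restore()    # later, or on another machine: reinstall those exact versions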

Containers

Illustration from https://www.docker.com/

Documentation

Material that provides official information or evidence or that serves as a record.

Oxford Languages

Write it down!

 

  • Need accurate records for reports/presentations

  • Facilitate collaboration

  • Ensures the project outlives its funding period

  • Encourages good practices

README

 

  • Project scope and aims

  • Contributors and key contacts

  • Data source (where does it come from) and location (where is it stored)

  • Summary of analysis (scripts, intermediary results, etc)

  • List of output (figures, reports, cleaned data)

https://github.com/laufergall/Subjective_Speaker_Characteristics
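As a minimal skeleton (the headings are only suggestions), a README covering these points might look like:

# Project name

## Aim and scope
One-paragraph summary of the biological question and the planned analyses.

## People
Contributors and key contacts.

## Data
Where the raw data comes from and where it is stored.

## Analysis
The scripts, the order in which to run them, and the intermediate results they produce.

## Outputs
Figures, reports and cleaned datasets, and where to find them.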

Literate programming

 

  • Concept: mix code and plain text

  • Goal: record reasoning, notes and interpretation alongside source code and results

  • Tools: Quarto, Rmarkdown, Jupyter notebooks…

Can be version-controlled!
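As a minimal R Markdown sketch of the idea (the title, text and chunk content are made up; a Quarto document looks almost identical):

---
title: "Transcriptomics exploratory analysis"
output: html_document
---

Counts are log-transformed before exploration; the histogram below is a
quick check that the transformed values look roughly symmetric.

```{r log-transform}
counts <- matrix(rpois(2000, lambda = 50), ncol = 10)  # fake count data
log_counts <- log2(counts + 1)
hist(log_counts)
```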

Literate programming

 

Illustration by Bruno Rodrigues; used under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

(Unit) testing

A software testing method by which individual units of source code […] are tested to determine whether they are fit for use.

Wikipedia

Testing to catch issues

 

R package testthat:

test_that("multiplication works", {
  expect_equal(2 * 2, 4)
})
#> Test passed 🌈
  • Does the data wrangling return the expected number of rows/columns?

  • Does the distribution of the transformed data make sense?

  • Does the analysis return a warning if there are missing values?

As our analyses become more complex, so does the code that we write to execute them. If we want to be confident in our results, we need to make sure that the code is doing what we expect it to do. For example, if we are transforming a dataset, we want to check that the result conforms to our expectations. This step is very often overlooked, or performed ad hoc (e.g. interactively in the R console), but I think it is important that we become better at formally testing our code, if only for our own peace of mind. I have several examples from my thesis where I realised, a few weeks into an analysis, that something I did at the very beginning didn’t work as expected. This is especially important when working with very big datasets, where it’s hard to check all of the results by eye.

In R, when writing a package, there is a strong emphasis on writing unit tests (there’s a trivial example on the slide). But we don’t need to be writing R packages to consider writing tests: it’s worth running some sanity checks whenever we start a new analysis, use a new package, etc. If you think back to the analysis pipeline I talked about earlier, can you write some quick tests to ensure that each step does what it is supposed to do? In R, the testthat package makes it very easy to write such tests: you compare the result of a computation to an expected value, and you can also run a very small subset of your data through a step to check that it gives the expected answer.
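As a slightly more realistic sketch than the multiplication example, here is what a test of a (made-up) wrangling step could look like with testthat:

library(testthat)

# hypothetical wrangling step: drop incomplete samples, add a log-transformed column
wrangle_counts <- function(df) {
  df <- df[complete.cases(df), ]
  df$log_count <- log2(df$count + 1)
  df
}

test_that("wrangling keeps the expected rows and columns", {
  raw <- data.frame(sample = c("a", "b", "c"), count = c(10, NA, 100))
  out <- wrangle_counts(raw)

  expect_equal(nrow(out), 2)                # the incomplete row is dropped
  expect_true("log_count" %in% names(out))  # the new column is added
  expect_true(all(out$log_count >= 0))      # log2(x + 1) is never negative for counts
})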

Using simulations to test statistical methods

“A key strength of simulation studies is the ability to understand the behavior of statistical methods because some ‘truth’ (usually some parameter/s of interest) is known from the process of generating the data. This allows us to consider properties of methods, such as bias.” (Morris et al. 2019)

One thing that goes hand in hand with the concept of testing is the idea of using simulations. This is something that is particularly important in the biological sciences, because it is extremely hard to validate findings. Therefore, if we want to have any confidence in the results of an analysis, it is typical to generate some simulated data and make sure that the analysis gives us the expected results.

That is why, during the first year of my PhD, I developed the R package sismonr (Angelin-Bonnet et al. 2020).
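sismonr simulates whole multi-omics regulatory networks; as a much simpler, generic sketch of the idea in the quote above (the ‘truth’ is known, so we can check the estimator for bias):

set.seed(42)

simulate_once <- function(n = 50, true_slope = 2) {
  x <- rnorm(n)
  y <- 1 + true_slope * x + rnorm(n, sd = 2)
  coef(lm(y ~ x))["x"]          # estimated slope for this simulated dataset
}

estimates <- replicate(1000, simulate_once())

mean(estimates) - 2             # empirical bias: should be close to 0
sd(estimates)                   # empirical standard error of the estimator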

Using simulations to test statistical methods

 

Testing data

 

 

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

R package validate:

rules <- validator(
  speed >= 0,
  dist >= 0,
  speed / dist <= 1.5,
  cor(speed, dist) >= 0.2
)
confront(cars, rules)

 

  • Are there any missing data?

  • Range of values as expected?

  • Dimensions correct?

Related to the concept of testing our code, I think it is important that we also test our data. I tend to work with messy datasets, datasets that have been modified by hand in Excel. It pays to start a new project by checking that the data conforms to our expectations (right format, right units, expected range of values…). This avoids spending time on an analysis only to realise that there was something wrong with the data from the beginning. It is an integral part of the exploratory analysis phase, but it’s easy to skip when we are in a hurry. Again taking the example of R, there is a package called validate that allows the user to write rules and check a dataset against them. For people working as consultants in organisations, where we regularly receive the same type of datasets, formalising these checks can really save time.
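To act on these checks, the validation object returned by confront() can be summarised, showing for each rule how many records pass, fail or could not be evaluated (a minimal sketch reusing the rules from the slide):

library(validate)

rules <- validator(
  speed >= 0,
  dist >= 0,
  speed / dist <= 1.5,
  cor(speed, dist) >= 0.2
)

out <- confront(cars, rules)
summary(out)   # per rule: number of items passing, failing or returning NA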

Workflow management

The discipline of creating, documenting, monitoring and improving upon the series of steps, or workflow, that is required to complete a specific task.

TechTarget

A real-life example of a complex analysis

Schema of the analysis for one of my thesis chapters

A real-life example of a complex analysis

 

   

Typical analysis folder:

thesis/chapter4_code
├── genomics_data_analysis
│   ├── 00_genomics_wrangling.R
│   ├── 01_genomics_filtering.R
│   └── 02_genomics_eda.R
├── transcriptomics_data_analysis
│   ├── 00_transcriptomics_wrangling.R
│   ├── 01_transcriptomics_normalisation.R
│   ├── 02_transcriptomics_differential_expression.R
│   └── 03_transcriptomics_wgcna.R
...

Typical issues:

  • Input data has changed; what do I need to re-run?

  • In which order do I need to run these scripts? (differential expression analysis needs results from genomics analysis)

Workflow management tools

Concept: turn your analysis into a pipeline, i.e. a series of steps linked through their inputs and outputs, which will be executed in the correct order
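The slides do not prescribe a specific tool, but in R the targets package is a popular option. As a minimal sketch, the pipeline definition (_targets.R) for part of the analysis above might look like this (the file path, normalise_counts() and run_differential_expression() are made up):

# _targets.R: each tar_target() is one step, linked to the others through its
# inputs, so tar_make() runs the steps in the correct order and re-runs only
# what is affected when the input data changes
library(targets)
tar_option_set(packages = c("readr", "dplyr"))

list(
  tar_target(raw_file, "data/transcriptomics_counts.csv", format = "file"),
  tar_target(raw_counts, readr::read_csv(raw_file)),
  tar_target(normalised, normalise_counts(raw_counts)),
  tar_target(de_results, run_differential_expression(normalised))
)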


Conclusion

Conclusion

 

  • Statisticians and data scientists need more than statistical skills

  • Software engineering practices are crucial for work sustainability and for collaboration

  • These skills should be taught in statistical degrees (but there are lots of resources out there!)

To go further

 

  • Richard McElreath’s talk about Science as Amateur Software Development

  • The Turing Way handbook to reproducible, ethical and collaborative data science

  • Bruno Rodrigues’ book on Building reproducible analytical pipelines with R

The Turing Way project illustration by Scriberia. Used under a CC-BY 4.0 licence. DOI: 10.5281/zenodo.3332807.

Thank you for your attention!

olivia.angelin-bonnet@plantandfood.co.nz

References

Angelin-Bonnet, O., Biggs, P. J., Baldwin, S., Thomson, S., & Vignes, M. (2020). sismonr: simulation of in silico multi-omic networks with adjustable ploidy and post-transcriptional regulation in R. Bioinformatics, 36(9), 2938–2940. https://doi.org/10.1093/bioinformatics/btaa002
Barabási, A.-L., & Oltvai, Z. N. (2004). Network biology: understanding the cell’s functional organization. Nature Reviews Genetics, 5(2), 101–113. https://doi.org/10.1038/nrg1272
Gallagher, R., Falster, D. S., Maitner, B., Salguero-Gomez, R., Vandvik, V., Pearse, W., et al. (2019). The open traits network: Using open science principles to accelerate trait-based science across the tree of life. http://dx.doi.org/10.32942/osf.io/kac45
Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102. https://doi.org/10.1002/sim.8086
Wang, M., Wang, L., Pu, L., Li, K., Feng, T., Zheng, P., et al. (2020). LncRNAs related key pathways and genes in ischemic stroke by weighted gene co-expression network analysis (WGCNA). Genomics, 112(3), 2302–2308. https://doi.org/10.1016/j.ygeno.2020.01.001