Perspective of an applied statistical scientist
The New Zealand Institute for Plant and Food Research Limited
22 June 2023
The genome dictates the phenotype (physical characteristics) of biological systems and their response to the environment …
… through complex interactions between the different molecular layers (genes, transcripts, proteins, metabolites, etc.)
These interaction networks can be deciphered from measurements of the molecular actors – the omics data – through data integration
Omics datasets measured with different technologies:
values have different meanings: they can represent proportions or concentrations, or relate to the amount of biological material…
different scales: counts, continuous values, etc.
missing values do not all have the same meaning (e.g. below the detection limit vs. not measured)
High dimensionality:
can measure the expression of 10,000s of genes or 1,000s of metabolites, but often on \(\leq\) 100s of observations
dimensionality varies with the omics measured
Extract information from omics datasets to answer biological questions
Multivariate analyses
Network reconstruction
Visualisation
Uni taught me about:
statistics
data analysis
programming in R and Python
… but that’s not enough!
Version control
Computational environment management
Documentation
Workflow management
Unit testing
(and more)
Becoming core skills for statistical scientists!
Sustainability
Open science
The practice of tracking and managing changes to [pieces of information].
Easy access and recovery of previous versions
Streamlined storage solution for code
Enable collaboration
Standardised way of publishing code
Everyone else is using it!
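A minimal sketch of this in practice with git (the file names and messages below are just examples):

```shell
# Start tracking an analysis project with git
cd "$(mktemp -d)"                      # throwaway directory for this sketch
git init -q
echo "results/" > .gitignore           # don't track large generated outputs
echo 'x <- read.csv("data.csv")' > analysis.R
git add .gitignore analysis.R
git -c user.name="Me" -c user.email="me@example.org" \
    commit -q -m "Add first analysis script"
git log --oneline                      # one line per recorded version
```

Each commit is a recoverable snapshot; pushing the repository to a shared host is what enables the collaboration and publishing points above.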
Features of a computer […], such as its operating system, what software it has installed, and what versions of software packages are installed.
Use dependency management tools like renv (R) or conda (Python) in your work
1. Record dependencies
2. Isolate project library
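For example, a minimal conda workflow covering both steps (the environment path and Python version are illustrative):

```shell
# 1. Record dependencies: export the project's packages to a file
conda env export > environment.yml

# 2. Isolate the project library: create an environment local to the project
conda create --prefix ./envs python=3.11
conda activate ./envs

# A collaborator can later recreate the environment from the record
conda env create --prefix ./envs --file environment.yml
```

renv follows the same two-step pattern in R: `renv::init()` isolates a project library and `renv::snapshot()` records it in a lockfile.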
Material that provides official information or evidence or that serves as a record.
Provides accurate records for reports/presentations
Facilitates collaboration
Ensures the project outlives its funding period
Encourages good practices
Project scope and aims
Contributors and key contacts
Data source (where does it come from) and location (where is it stored)
Summary of analyses (scripts, intermediate results, etc.)
List of outputs (figures, reports, cleaned data)
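A skeleton README covering the points above (all names, paths and contacts are placeholders):

```markdown
# Chapter 4: multi-omics analysis

## Scope and aims
Integrate genomics and transcriptomics data to ...

## Contributors and key contacts
Jane Doe (analysis) -- jane.doe@example.org

## Data
Source: sequencing facility X; stored at /path/to/project/data/

## Analyses
- genomics_data_analysis/: wrangling, filtering, EDA
- transcriptomics_data_analysis/: normalisation, differential expression, WGCNA

## Outputs
Figures in figures/, reports in reports/, cleaned data in data/processed/
```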
Concept: mix code and plain text
Goal: record reasoning, notes and interpretation alongside source code and results
Tools: Quarto, Rmarkdown, Jupyter notebooks…
Can be version-controlled!
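A hypothetical Quarto document mixing the two (file contents are illustrative):

````markdown
---
title: "Chapter 4: transcriptomics analysis"
---

The raw counts are heavily right-skewed, so we log-transform
them before the exploratory analysis.

```{r}
counts <- read.csv("data/transcriptomics_counts.csv")
log_counts <- log2(counts + 1)
hist(as.matrix(log_counts))
```

The distribution now looks roughly symmetric, which supports
using linear methods downstream.
````

Rendering the document re-runs the code, so the notes, the code and the results can never drift apart.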
A software testing method by which individual units of source code […] are tested to determine whether they are fit for use.
R package testthat:
Does the data wrangling return the expected number of rows/columns?
Does the distribution of the transformed data make sense?
Does the analysis return a warning if there are missing values?
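The same idea, sketched in Python with plain assertions (the `wrangle` function and its columns are made up for illustration; testthat provides richer equivalents such as `expect_equal()` and `expect_warning()` in R):

```python
# Sketch of unit tests for an analysis step, mirroring the questions above.
import warnings

def wrangle(rows):
    """Toy wrangling step: drop incomplete records, add a transformed column."""
    complete = [r for r in rows if r["value"] is not None]
    if len(complete) < len(rows):
        warnings.warn("missing values dropped")
    return [{**r, "transformed": r["value"] ** 0.5} for r in complete]

data = [{"value": 4.0}, {"value": 9.0}, {"value": None}]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = wrangle(data)

assert len(out) == 2                             # expected number of rows
assert all(r["transformed"] >= 0 for r in out)   # transformed values make sense
assert len(caught) == 1                          # warning raised for missing values
print("all checks passed")
```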
“A key strength of simulation studies is the ability to understand the behavior of statistical methods because some ‘truth’ (usually some parameter/s of interest) is known from the process of generating the data. This allows us to consider properties of methods, such as bias.”
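A minimal simulation-study sketch illustrating the quote: because the true mean is chosen by us, the bias of an estimator (here the sample mean) can be measured directly. All numbers are arbitrary.

```python
# Simulation study: generate data with a known truth, estimate, measure bias.
import random

random.seed(42)
TRUE_MEAN = 5.0
n_sims, n_obs = 2000, 30

estimates = []
for _ in range(n_sims):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(n_obs)]
    estimates.append(sum(sample) / n_obs)  # the estimator under study

bias = sum(estimates) / n_sims - TRUE_MEAN
assert abs(bias) < 0.1  # the sample mean is unbiased, so the bias is ~0
print(f"estimated bias: {bias:.4f}")
```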
The discipline of creating, documenting, monitoring and improving upon the series of steps, or workflow, that is required to complete a specific task.
Typical analysis folder:
thesis/chapter4_code
├── genomics_data_analysis
│   ├── 00_genomics_wrangling.R
│   ├── 01_genomics_filtering.R
│   └── 02_genomics_eda.R
├── transcriptomics_data_analysis
│   ├── 00_transcriptomics_wrangling.R
│   ├── 01_transcriptomics_normalisation.R
│   ├── 02_transcriptomics_differential_expression.R
│   └── 03_transcriptomics_wgcna.R
...
Typical issues:
Input data has changed; what do I need to re-run?
In which order do I need to run these scripts? (differential expression analysis needs results from genomics analysis)
Concept: turn your analysis into a pipeline, i.e. series of steps linked through input/output, which will be executed in the correct order
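A toy illustration of the concept, in Python: steps declare which other steps they need, and the runner resolves the order. Step names and outputs are made up; dedicated tools (e.g. Make, Snakemake, or the targets R package) implement this idea robustly, including re-running only what changed.

```python
# Toy pipeline: steps linked by their dependencies, run in the correct order.
steps = {
    "wrangle":  {"needs": [],                      "run": lambda inp: "clean data"},
    "genomics": {"needs": ["wrangle"],             "run": lambda inp: "genomics results"},
    "diffexpr": {"needs": ["wrangle", "genomics"], "run": lambda inp: "DE results"},
}

done, order = {}, []

def run(name):
    if name in done:                     # each step runs at most once
        return done[name]
    inputs = [run(dep) for dep in steps[name]["needs"]]  # dependencies first
    done[name] = steps[name]["run"](inputs)
    order.append(name)
    return done[name]

for name in steps:
    run(name)

print(order)  # → ['wrangle', 'genomics', 'diffexpr']
```

Dependencies always execute before the steps that need them, which answers the "in which order?" question automatically.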
Statisticians and data scientists need more than statistical skills
Software engineering practices are crucial for work sustainability and for collaboration
These skills should be taught in statistical degrees (but there are lots of resources out there!)
Richard McElreath’s talk about Science as Amateur Software Development
The Turing Way handbook to reproducible, ethical and collaborative data science
Bruno Rodrigues’ book on Building reproducible analytical pipelines with R