What can I do to my data with missing values or zeros before creating visualisations?

Some visualisations and analyses, such as PCA and heatmaps, may fail or produce inappropriate results if the dataset contains missing values or zero values.

A key consideration is that log-transforming a zero value produces -Inf. The impact of these invalid values depends on the visualisation being generated:

PCA, heatmaps, and other visualisations that require complete numerical matrices will fail because they cannot process -Inf values. For example, PCA calculations and hierarchical clustering used in heatmaps require finite values for all features and samples.
Some visualisations may still render, such as boxplots of RLE or log2-expression values. However, these visualisations typically drop invalid values without warning, meaning that a subset of the data is omitted from the analysis and visualisation.
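The two behaviours above can be illustrated with a minimal Python sketch. It is not how Mass Dynamics computes anything; the helper log2_zero_to_neg_inf is a hypothetical function that mimics tools which return -Inf for log2(0), and the intensities are made-up numbers.

```python
import math

def log2_zero_to_neg_inf(x):
    """Hypothetical helper: return -inf for log2(0), as many analysis tools do."""
    return float("-inf") if x == 0 else math.log2(x)

# One feature measured in three samples (made-up values); one value is zero.
intensities = [8.0, 0.0, 32.0]
logged = [log2_zero_to_neg_inf(v) for v in intensities]

# Mean-centring is an early step in PCA. A single -inf makes the mean -inf,
# and centring then yields inf or NaN, so no valid result can be computed.
mean = sum(logged) / len(logged)
centred = [v - mean for v in logged]

# A boxplot-style summary usually keeps only finite values,
# so the zero-derived value disappears without any warning.
finite = [v for v in logged if math.isfinite(v)]
dropped = len(logged) - len(finite)

print(logged, centred, dropped)
```

Here the centred values are no longer usable numbers, while the boxplot-style filter quietly keeps two of the three samples.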

The recommended preprocessing steps depend on your omics dataset type.

Mass spectrometry-style datasets

Examples:

Proteins
Peptides
PTMs
Metabolites

These datasets commonly contain missing values. Visualisations requiring complete data cannot be generated if missing values are still present.

Recommended pre-processing

Before generating the plot:

Impute missing values using the Normalization & Imputation workflow.
Successful imputation requires that missing values are correctly represented in the dataset when the data is uploaded via the MD Format.
For example, if missing values were uploaded as 0 intensities, the corresponding Imputed column must be set to 1. Otherwise, the system will treat these values as real measurements instead of missing values, and they will not be imputed correctly.
When missing values are correctly represented in the dataset, several visualisations allow the option to include or exclude imputed values from the final plot.
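As a rough pre-upload check, the sketch below flags zero intensities as missing by setting Imputed to 1. Only the Imputed column comes from the description above; the other column names (ProteinId, SampleName, Intensity), the example rows, and the helper mark_zeros_as_missing are assumptions for illustration and may not match your MD Format file.

```python
def mark_zeros_as_missing(rows):
    """Return copies of the rows with Imputed set to 1 wherever Intensity is 0.

    Column names other than 'Imputed' are hypothetical.
    """
    fixed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row["Intensity"] == 0:
            new_row["Imputed"] = 1
        fixed.append(new_row)
    return fixed

# Made-up example rows: the second row is a missing value stored as 0
# but still marked as a real measurement (Imputed = 0).
rows = [
    {"ProteinId": "P1", "SampleName": "S1", "Intensity": 1520.0, "Imputed": 0},
    {"ProteinId": "P1", "SampleName": "S2", "Intensity": 0.0, "Imputed": 0},
    {"ProteinId": "P2", "SampleName": "S1", "Intensity": 310.5, "Imputed": 0},
]

fixed_rows = mark_zeros_as_missing(rows)
n_flagged = sum(r["Imputed"] for r in fixed_rows)
print(n_flagged)
```

This only makes sense when a 0 in your data really does mean "not measured"; if zeros are genuine measurements, they should stay unflagged.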

Gene count datasets

Example:

RNA-Seq datasets

In RNA-Seq gene count datasets, zero values are valid and expected because some genes may not be detected in certain samples.

When working with RNA-Seq data, pre-processing should account for differences in sequencing depth. To learn more about suggested workflows for RNA-Seq data, refer to "How can I analyse my transcriptomics data in Mass Dynamics?".

Recommended pre-processing

To pre-process your RNA-Seq raw count data in Mass Dynamics and make it ready for visualisation, you can use the Normalization & Imputation workflow to:

Adjust for sequencing depth by applying a Counts Per Million (CPM) transformation with a prior count (default: 2), which accommodates zero values and enables safe downstream log2-transformation.
Filter lowly expressed genes using the filtering method "by minimum abundance" to remove features with insufficient expression across samples, reducing noise and improving the quality and interpretability of downstream analyses.
Log-transformation is then applied downstream, when you create analyses and visualisations.
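To show why a prior count makes the later log2-transformation safe, here is a simplified Python sketch. The formula used, (count + prior) / library size × 1,000,000, is an assumption chosen for clarity; the exact calculation in the Normalization & Imputation workflow may differ, for example in how the prior is scaled. The function cpm_with_prior, the gene names, and the counts are all hypothetical.

```python
import math

def cpm_with_prior(counts, prior_count=2.0):
    """Simplified CPM for one sample: add a prior count, scale by library size.

    counts: dict mapping gene name to raw count. The formula is an
    illustrative assumption, not the exact Mass Dynamics implementation.
    """
    library_size = sum(counts.values())
    return {
        gene: (count + prior_count) / library_size * 1_000_000
        for gene, count in counts.items()
    }

# Made-up raw counts for a single sample; geneA was not detected.
sample_counts = {"geneA": 0, "geneB": 150, "geneC": 849_850}

cpm = cpm_with_prior(sample_counts)

# Every CPM value is now strictly positive, so log2 stays finite.
log2_cpm = {gene: math.log2(value) for gene, value in cpm.items()}
print(log2_cpm)
```

With a prior count of 0, geneA would have a CPM of 0 and its log2 value would be undefined, which is the situation the prior count is there to avoid.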

These preprocessing steps help ensure that visualisations can be generated successfully and that downstream analyses are based on informative expression measurements.

Zeroes can have different meanings across omics data types. This page outlines the suggested approaches for handling zeroes in your omics data within Mass Dynamics.
