Scaling normalization of single-cell data.

Classes
struct	CenterSizeFactorsOptions
	Options for `center_size_factors()` and `center_size_factors_blocked()`.

struct	ChoosePseudoCountOptions
	Options for `choose_pseudo_count()`.

struct	NormalizeCountsOptions
	Options for `normalize_counts()`.

struct	SanitizeSizeFactorsOptions
	Options for `sanitize_size_factors()`.

struct	SizeFactorDiagnostics
	Diagnostics for the size factors.

Enumerations
enum class	CenterBlockMode : char { PER_BLOCK , LOWEST }

enum class	SanitizeAction : char { IGNORE , ERROR , SANITIZE }

Functions
template<typename SizeFactor_ >
SizeFactor_	center_size_factors_mean (std::size_t num, const SizeFactor_ *size_factors, SizeFactorDiagnostics *diagnostics, const CenterSizeFactorsOptions &options)

template<typename SizeFactor_ >
SizeFactor_	center_size_factors (std::size_t num, SizeFactor_ *size_factors, SizeFactorDiagnostics *diagnostics, const CenterSizeFactorsOptions &options)

template<typename SizeFactor_ , typename Block_ >
std::vector< SizeFactor_ >	center_size_factors_blocked_mean (std::size_t num, const SizeFactor_ *size_factors, const Block_ *block, SizeFactorDiagnostics *diagnostics, const CenterSizeFactorsOptions &options)

template<typename SizeFactor_ , typename Block_ >
std::vector< SizeFactor_ >	center_size_factors_blocked (std::size_t num, SizeFactor_ *size_factors, const Block_ *block, SizeFactorDiagnostics *diagnostics, const CenterSizeFactorsOptions &options)

template<typename Float_ >
Float_	choose_pseudo_count_raw (std::size_t num, Float_ *size_factors, const ChoosePseudoCountOptions &options)

template<typename Float_ >
Float_	choose_pseudo_count (std::size_t num, const Float_ *size_factors, const ChoosePseudoCountOptions &options)

template<typename OutputValue_ = double, typename InputValue_ , typename Index_ , class SizeFactors_ >
std::shared_ptr< tatami::Matrix< OutputValue_, Index_ > >	normalize_counts (std::shared_ptr< const tatami::Matrix< InputValue_, Index_ > > counts, SizeFactors_ size_factors, const NormalizeCountsOptions &options)

template<typename SizeFactor_ >
SizeFactorDiagnostics	check_size_factor_sanity (std::size_t num, const SizeFactor_ *size_factors)

template<typename SizeFactor_ >
void	sanitize_size_factors (std::size_t num, SizeFactor_ *size_factors, const SizeFactorDiagnostics &status, const SanitizeSizeFactorsOptions &options)

template<typename SizeFactor_ >
SizeFactorDiagnostics	sanitize_size_factors (std::size_t num, SizeFactor_ *size_factors, const SanitizeSizeFactorsOptions &options)

Detailed Description

Scaling normalization of single-cell data.

Enumeration Type Documentation

◆ CenterBlockMode

enum class scran_norm::CenterBlockMode : char

Strategy for handling blocks when centering size factors, see CenterSizeFactorsOptions::block_mode for details.

◆ SanitizeAction

enum class scran_norm::SanitizeAction : char

How invalid size factors should be handled:

IGNORE: ignore invalid size factors with no error or change.
ERROR: throw an error.
SANITIZE: fix each invalid size factor.
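
The three actions can be pictured as a dispatch over an enumeration. The following is a self-contained sketch rather than the library's implementation: `Action`, `is_valid()`, `handle_invalid()` and the caller-supplied `placeholder` are hypothetical names, and a size factor is assumed to be valid only if it is finite and positive.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the three behaviours listed above.
enum class Action : char { Ignore, Error, Sanitize };

// Assumption: a size factor is valid only if it is finite and positive.
inline bool is_valid(double sf) {
    return std::isfinite(sf) && sf > 0;
}

// Applies `action` to every invalid entry and returns how many were found.
// `placeholder` is an assumed replacement value chosen by the caller.
inline std::size_t handle_invalid(std::vector<double>& size_factors, Action action, double placeholder) {
    // A replacement only makes sense if it is itself a valid size factor.
    assert(action != Action::Sanitize || is_valid(placeholder));

    std::size_t num_invalid = 0;
    for (auto& sf : size_factors) {
        if (is_valid(sf)) {
            continue;
        }
        ++num_invalid;
        switch (action) {
            case Action::Ignore:
                break; // leave the value untouched
            case Action::Error:
                throw std::runtime_error("detected an invalid size factor");
            case Action::Sanitize:
                sf = placeholder; // overwrite with the replacement
                break;
        }
    }
    return num_invalid;
}
```

Calling `handle_invalid()` with `Action::Ignore` only counts the invalid entries, which can be handy for reporting before deciding whether to fail or repair.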

Function Documentation

◆ center_size_factors_mean()

template<typename SizeFactor_ >

SizeFactor_ scran_norm::center_size_factors_mean	(	std::size_t	num,
		const SizeFactor_ *	size_factors,
		SizeFactorDiagnostics *	diagnostics,
		const CenterSizeFactorsOptions &	options )

Compute the mean size factor but do not scale the size factors themselves.

Template Parameters

SizeFactor_ Floating-point type for the size factors.

Parameters

	num	Number of cells.
[in]	size_factors	Pointer to an array of length `num`, containing the size factor for each cell.
[out]	diagnostics	Diagnostics for invalid size factors. This is only used if `CenterSizeFactorsOptions::ignore_invalid = true`, in which case it is filled with diagnostics for the invalid values in `size_factors`. It can also be NULL, in which case it is ignored.
	options	Further options.

Returns: The mean size factor, to be used to divide each element of size_factors.

◆ center_size_factors()

template<typename SizeFactor_ >

SizeFactor_ scran_norm::center_size_factors	(	std::size_t	num,
		SizeFactor_ *	size_factors,
		SizeFactorDiagnostics *	diagnostics,
		const CenterSizeFactorsOptions &	options )

When centering, we scale all size factors so that their mean is equal to 1. The aim is to ensure that the normalized expression values are on roughly the same scale as the original counts. This simplifies interpretation and ensures that any added pseudo-count prior to log-transformation has a predictable shrinkage effect. In general, size factors should be centered before calling normalize_counts().

Template Parameters

SizeFactor_ Floating-point type for the size factors.

Parameters

	num	Number of cells.
[in,out]	size_factors	Pointer to an array of length `num`, containing the size factor for each cell. On output, this contains centered size factors.
[out]	diagnostics	Diagnostics for invalid size factors. This is only used if `CenterSizeFactorsOptions::ignore_invalid = true`, in which case it is filled with diagnostics for the invalid values in `size_factors`. It can also be NULL, in which case it is ignored.
	options	Further options.

Returns: The mean size factor.
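
The centering step can be sketched in a few lines of standalone C++. This is an illustration under stated assumptions, not the library's code: `mean_of_valid()` and `center()` are hypothetical helpers, and invalid entries (non-finite or non-positive) are assumed to be skipped when computing the mean, as they would be when invalid values are ignored.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean of the finite, positive entries; returns 0 if there are none.
inline double mean_of_valid(std::size_t num, const double* size_factors) {
    assert(size_factors != nullptr || num == 0);
    double sum = 0;
    std::size_t count = 0;
    for (std::size_t i = 0; i < num; ++i) {
        const double sf = size_factors[i];
        if (std::isfinite(sf) && sf > 0) {
            sum += sf;
            ++count;
        }
    }
    return count ? sum / count : 0.0;
}

// Divides every entry by the mean so that the valid entries average to 1.
// Returns the mean that was used for scaling (0 if nothing was scaled).
inline double center(std::size_t num, double* size_factors) {
    const double mean = mean_of_valid(num, size_factors);
    if (mean > 0) {
        for (std::size_t i = 0; i < num; ++i) {
            size_factors[i] /= mean;
        }
    }
    return mean;
}
```

For size factors of 0.5, 1.5 and 4, the mean is 2 and the centered values are 0.25, 0.75 and 2, which average to 1.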

◆ center_size_factors_blocked_mean()

template<typename SizeFactor_ , typename Block_ >

std::vector< SizeFactor_ > scran_norm::center_size_factors_blocked_mean	(	std::size_t	num,
		const SizeFactor_ *	size_factors,
		const Block_ *	block,
		SizeFactorDiagnostics *	diagnostics,
		const CenterSizeFactorsOptions &	options )

Compute the mean size factor for each block, but do not scale the size factors themselves.

Template Parameters

SizeFactor_	Floating-point type for the size factors.
Block_	Integer type for the block assignments.

Parameters

	num	Number of cells.
[in]	size_factors	Pointer to an array of length `num`, containing the size factor for each cell.
[in]	block	Pointer to an array of length `num`, containing the block assignment for each cell. Each assignment should be an integer in \([0, N)\) where \(N\) is the total number of blocks.
[out]	diagnostics	Diagnostics for invalid size factors. This is only used if `CenterSizeFactorsOptions::ignore_invalid = true`, in which case it is filled with diagnostics for the invalid values in `size_factors`. It can also be NULL, in which case it is ignored.
	options	Further options.

Returns: Vector of length \(N\) containing the mean size factor for each block, to be used to scale the size factors in each block.

◆ center_size_factors_blocked()

template<typename SizeFactor_ , typename Block_ >

std::vector< SizeFactor_ > scran_norm::center_size_factors_blocked	(	std::size_t	num,
		SizeFactor_ *	size_factors,
		const Block_ *	block,
		SizeFactorDiagnostics *	diagnostics,
		const CenterSizeFactorsOptions &	options )

Center size factors within each block, using the strategy specified in CenterSizeFactorsOptions::block_mode.

Template Parameters

SizeFactor_	Floating-point type for the size factors.
Block_	Integer type for the block assignments.

Parameters

	num	Number of cells.
[in,out]	size_factors	Pointer to an array of length `num`, containing the size factor for each cell. On output, this contains size factors that are centered according to `CenterSizeFactorsOptions::block_mode`.
[in]	block	Pointer to an array of length `num`, containing the block assignment for each cell. Each assignment should be an integer in \([0, N)\) where \(N\) is the total number of blocks.
[out]	diagnostics	Diagnostics for invalid size factors. This is only used if `CenterSizeFactorsOptions::ignore_invalid = true`, in which case it is filled with diagnostics for the invalid values in `size_factors`. It can also be NULL, in which case it is ignored.
	options	Further options.

Returns: Vector of length \(N\) containing the mean size factor for each block.
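
As a rough illustration of blocked centering, the sketch below computes one mean per block and then scales each cell by the mean of its own block. It is only one plausible reading of a per-block strategy; the exact behaviour of each `CenterBlockMode` is described under `CenterSizeFactorsOptions::block_mode`, not here. The helpers `block_means()` and `center_per_block()` are hypothetical, the number of blocks is passed explicitly, and all size factors are assumed to be valid.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean size factor in each block; blocks without any cells get a mean of 0.
inline std::vector<double> block_means(std::size_t num, const double* size_factors, const int* block, std::size_t num_blocks) {
    std::vector<double> sums(num_blocks, 0.0);
    std::vector<std::size_t> counts(num_blocks, 0);
    for (std::size_t i = 0; i < num; ++i) {
        // Block assignments must lie in [0, num_blocks).
        assert(block[i] >= 0 && static_cast<std::size_t>(block[i]) < num_blocks);
        sums[block[i]] += size_factors[i];
        ++counts[block[i]];
    }
    for (std::size_t b = 0; b < num_blocks; ++b) {
        if (counts[b]) {
            sums[b] /= counts[b];
        }
    }
    return sums;
}

// Scales each cell by the mean of its own block, so that every non-empty
// block averages to 1. Returns the per-block means used for scaling.
inline std::vector<double> center_per_block(std::size_t num, double* size_factors, const int* block, std::size_t num_blocks) {
    std::vector<double> means = block_means(num, size_factors, block, num_blocks);
    for (std::size_t i = 0; i < num; ++i) {
        const double mean = means[block[i]];
        if (mean > 0) {
            size_factors[i] /= mean;
        }
    }
    return means;
}
```

With size factors {1, 3, 10, 30} and block assignments {0, 0, 1, 1}, the block means are 2 and 20, and both blocks end up with centered values of 0.5 and 1.5.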

◆ choose_pseudo_count_raw()

template<typename Float_ >

Float_ scran_norm::choose_pseudo_count_raw	(	std::size_t	num,
		Float_ *	size_factors,
		const ChoosePseudoCountOptions &	options )

Choose a pseudo-count for log-transformation (see NormalizeCountsOptions::pseudo_count) that aims to control the transformation-induced bias. Specifically, the log-transform can introduce spurious differences in the expected log-normalized expression between cells with very different size factors (Lun, 2018). This bias can be mitigated by increasing the pseudo-count, which effectively shrinks all log-expression values towards the zero-expression baseline. The increased shrinkage is strongest at low counts where the transformation bias is most pronounced, while large counts are mostly unaffected.

In practice, the log-transformation bias is modest in datasets where there are stronger sources of variation. When observed, it manifests as a library size-dependent trend in the log-normalized expression values. This is difficult to regress out without also removing biology that is associated with, e.g., total RNA content; rather, a simpler solution is to increase the pseudo-count to suppress the bias.

No centering is performed by this function, so the size factors should be passed through center_size_factors() before calling this function. Invalid size factors (e.g., zero, negative, non-finite) are automatically ignored, so prior sanitization should not be performed; this ensures that the replacement values are not included in the quantile calculations.

See also: Lun ATL (2018). Overcoming systematic errors caused by log-transformation of normalized single-cell RNA sequencing data. bioRxiv doi:10.1101/404962

Parameters

	num	Number of size factors.
[in,out]	size_factors	Pointer to an array of size factors of length `num`. Values should be positive, and all non-positive values are ignored. On output, this array is arbitrarily permuted and should not be used.
	options	Further options.

Returns: The suggested pseudo-count to control the log-transformation-induced bias below the specified threshold.
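
The transformation-induced bias and the effect of a larger pseudo-count can be checked numerically. The sketch below is a worked illustration and not how the function chooses its value: it assumes Poisson-distributed counts and a base-2 logarithm, and `expected_log_value()` and `log_bias()` are hypothetical helpers. It computes the expected log-normalized value for two cells that share the same underlying expression but differ in size factor.

```cpp
#include <cassert>
#include <cmath>

// Expected value of log2(X / size_factor + pseudo_count), where X is assumed
// to be Poisson-distributed with mean `expression * size_factor`.
// The infinite sum over counts is truncated at `max_count`.
inline double expected_log_value(double expression, double size_factor, double pseudo_count, int max_count = 500) {
    assert(expression > 0 && size_factor > 0 && pseudo_count > 0);
    const double mean = expression * size_factor;
    double total = 0;
    for (int k = 0; k <= max_count; ++k) {
        // Poisson probability computed on the log scale to avoid overflow.
        const double log_prob = -mean + k * std::log(mean) - std::lgamma(k + 1.0);
        total += std::exp(log_prob) * std::log2(k / size_factor + pseudo_count);
    }
    return total;
}

// Spurious difference in expected log-expression between a cell with a large
// size factor and a cell with a small one, at identical underlying expression.
inline double log_bias(double expression, double small_sf, double large_sf, double pseudo_count) {
    return expected_log_value(expression, large_sf, pseudo_count)
         - expected_log_value(expression, small_sf, pseudo_count);
}
```

Comparing `log_bias(1.0, 0.1, 10.0, 1.0)` with `log_bias(1.0, 0.1, 10.0, 5.0)` shows the spurious difference shrinking as the pseudo-count increases from 1 to 5.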

◆ choose_pseudo_count()

template<typename Float_ >

Float_ scran_norm::choose_pseudo_count	(	std::size_t	num,
		const Float_ *	size_factors,
		const ChoosePseudoCountOptions &	options )

This function just wraps choose_pseudo_count_raw() with the automatic creation of a writeable buffer for the size factors.

Parameters

	num	Number of size factors.
[in]	size_factors	Pointer to an array of size factors of length `num`. Values should be positive, and all non-positive values are ignored.
	options	Further options.

Returns: The suggested pseudo-count to control the log-transformation-induced bias below the specified threshold.
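
The relationship between a "raw" function that permutes its input and a wrapper that protects the caller's data can be sketched as below. This is a generic illustration of the pattern, assuming a simple quantile computed without interpolation; `quantile_raw()` and `quantile_copy()` are hypothetical names and do not reproduce the library's pseudo-count calculation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical "raw" helper: returns a quantile of the valid entries,
// reordering `values` in place. Returns 0 if there are no valid entries.
inline double quantile_raw(std::size_t num, double* values, double quantile) {
    assert(quantile >= 0 && quantile <= 1);
    // Move finite, positive entries to the front; the rest are ignored.
    double* valid_end = std::partition(values, values + num, [](double x) {
        return std::isfinite(x) && x > 0;
    });
    const std::size_t num_valid = static_cast<std::size_t>(valid_end - values);
    if (num_valid == 0) {
        return 0;
    }
    // Position obtained by truncation; no interpolation, for brevity.
    const std::size_t pos = static_cast<std::size_t>(quantile * (num_valid - 1));
    std::nth_element(values, values + pos, valid_end);
    return values[pos];
}

// Wrapper that leaves the caller's array untouched by working on a copy.
inline double quantile_copy(std::size_t num, const double* values, double quantile) {
    std::vector<double> buffer(values, values + num);
    return quantile_raw(num, buffer.data(), quantile);
}
```

The wrapper costs one extra allocation per call, so the raw variant is preferable when the caller already owns a scratch buffer that can be overwritten.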

◆ normalize_counts()

template<typename OutputValue_ = double, typename InputValue_ , typename Index_ , class SizeFactors_ >

std::shared_ptr< tatami::Matrix< OutputValue_, Index_ > > scran_norm::normalize_counts	(	std::shared_ptr< const tatami::Matrix< InputValue_, Index_ > >	counts,
		SizeFactors_	size_factors,
		const NormalizeCountsOptions &	options )

Given a count matrix and a set of size factors, compute log-transformed normalized expression values. All operations are done in a delayed manner using the tatami::DelayedUnaryIsometricOperation class.

For normalization, each cell's counts are divided by the cell's size factor to remove uninteresting scaling differences. The simplest and most common method for defining size factors is to use the centered library sizes, see center_size_factors() for details. This removes scaling biases caused by sequencing depth, capture efficiency etc. between cells, while the centering preserves the scale of the counts in the normalized expression values. That said, users can define size factors from any method of their choice (e.g., median-based normalization, TMM) as long as they are positive for all cells.

Normalized values are then log-transformed so that differences in log-values represent log-fold changes in expression. This ensures that downstream analyses like t-tests and distance calculations focus on relative fold-changes rather than absolute differences. The log-transformation also provides some measure of variance stabilization so that the downstream analyses are not dominated by sampling noise at large counts.

Template Parameters

OutputValue_	Floating-point type for the output matrix.
InputValue_	Data type for the input matrix.
Index_	Integer type for the row/column indices of the input and output matrices.
SizeFactors_	Container of floats for the size factors. This should have the `size()`, `begin()`, `end()` and `operator[]` methods.

Parameters

counts	Pointer to a `tatami::Matrix` containing counts. Rows should correspond to genes while columns should correspond to cells.
size_factors	Vector of length equal to the number of columns in `counts`, containing the size factor for each cell. All values should be positive.
options	Further options.

Returns: Matrix of normalized expression values. These are log-transformed if NormalizeCountsOptions::log = true.
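
To make the arithmetic concrete, here is an eager, dense version of the same calculation. It is a sketch only: the real function operates lazily on a `tatami::Matrix`, whereas `normalize_dense()` is a hypothetical helper on a column-major `std::vector`. The base-2 logarithm and the default pseudo-count of 1 are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Eagerly normalizes a dense column-major matrix (genes in rows, cells in columns).
// Each count is divided by its cell's size factor; if `log_transform` is true,
// the result is then transformed as log2(value + pseudo_count).
inline std::vector<double> normalize_dense(
    std::size_t num_genes,
    std::size_t num_cells,
    const std::vector<double>& counts,
    const std::vector<double>& size_factors,
    bool log_transform = true,
    double pseudo_count = 1)
{
    assert(counts.size() == num_genes * num_cells);
    assert(size_factors.size() == num_cells);

    std::vector<double> output(counts.size());
    for (std::size_t c = 0; c < num_cells; ++c) {
        assert(size_factors[c] > 0); // size factors must be positive
        for (std::size_t g = 0; g < num_genes; ++g) {
            const std::size_t offset = c * num_genes + g;
            const double normalized = counts[offset] / size_factors[c];
            output[offset] = log_transform ? std::log2(normalized + pseudo_count) : normalized;
        }
    }
    return output;
}
```

For a cell with a size factor of 2 and counts of 6 and 14, the normalized values are 3 and 7, which become 2 and 3 after adding a pseudo-count of 1 and taking the base-2 logarithm.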

◆ check_size_factor_sanity()

template<typename SizeFactor_ >

SizeFactorDiagnostics scran_norm::check_size_factor_sanity	(	std::size_t	num,
		const SizeFactor_ *	size_factors )

Check whether there are any invalid size factors. Size factors are only technically valid if they are finite and positive.

Template Parameters

SizeFactor_ Floating-point type for the size factors.

Parameters

	num	Number of size factors.
[in]	size_factors	Pointer to an array of size factors of length `num`.

Returns: Validation results, indicating whether any non-positive or non-finite size factors exist.
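
A validity scan of this kind can be sketched as a single pass over the array. The `Diagnostics` struct and its field names below are hypothetical and only mirror the categories of invalid values mentioned on this page; they are not the members of `SizeFactorDiagnostics`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type; the field names are illustrative only.
struct Diagnostics {
    bool has_negative = false;
    bool has_zero = false;
    bool has_nan = false;
    bool has_infinite = false;
};

// Flags each category of invalid value that occurs at least once.
inline Diagnostics check_sanity(std::size_t num, const double* size_factors) {
    assert(size_factors != nullptr || num == 0);
    Diagnostics output;
    for (std::size_t i = 0; i < num; ++i) {
        const double sf = size_factors[i];
        if (std::isnan(sf)) {
            output.has_nan = true;
        } else if (std::isinf(sf)) {
            output.has_infinite = true; // infinities of either sign, in this sketch
        } else if (sf < 0) {
            output.has_negative = true;
        } else if (sf == 0) {
            output.has_zero = true;
        }
    }
    return output;
}
```

Keeping one flag per category lets a later step decide how to treat each kind of invalid value separately.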

◆ sanitize_size_factors() [1/2]

template<typename SizeFactor_ >

void scran_norm::sanitize_size_factors	(	std::size_t	num,
		SizeFactor_ *	size_factors,
		const SizeFactorDiagnostics &	status,
		const SanitizeSizeFactorsOptions &	options )

Replace zero, missing or infinite values in the size factor array so that it can be used to compute well-defined normalized values. Such size factors can occasionally arise if, e.g., insufficient quality control was performed upstream. Check out the documentation in SanitizeSizeFactorsOptions to see what placeholder value is used for each type of invalid size factor.

In general, sanitization should occur after calls to center_size_factors(), choose_pseudo_count(), or any function that computes a statistic based on the distribution of size factors. This ensures that the results of those functions are not affected by the placeholder values used to replace the invalid size factors. As a rule of thumb, sanitize_size_factors() should be called just before passing those size factors to normalize_counts().

Template Parameters

SizeFactor_ Floating-point type for the size factors.

Parameters

	num	Number of size factors.
[in,out]	size_factors	Pointer to an array of size factors of length `num`. On output, invalid size factors are replaced.
	status	A pre-computed object indicating whether invalid size factors are present in `size_factors`. This can be useful if this information is already provided by, e.g., `check_size_factor_sanity()` or `center_size_factors()`.
	options	Further options.
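
The recommendation to sanitize last can be illustrated with a small calculation. The sketch below is standalone and uses hypothetical helpers (`valid_size_factor()`, `valid_mean()`, `mean_after_replacement()`) and an arbitrary placeholder; it compares the mean over valid entries with the mean obtained when invalid entries are replaced first.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumption: valid means finite and positive.
inline bool valid_size_factor(double sf) {
    return std::isfinite(sf) && sf > 0;
}

// Mean over the valid entries only; returns 0 if there are none.
inline double valid_mean(const std::vector<double>& size_factors) {
    double sum = 0;
    std::size_t count = 0;
    for (double sf : size_factors) {
        if (valid_size_factor(sf)) {
            sum += sf;
            ++count;
        }
    }
    return count ? sum / count : 0.0;
}

// Mean over all entries after invalid ones are replaced by `placeholder`.
// The vector is taken by value so the caller's data is not modified.
inline double mean_after_replacement(std::vector<double> size_factors, double placeholder) {
    assert(valid_size_factor(placeholder));
    assert(!size_factors.empty());
    double sum = 0;
    for (double& sf : size_factors) {
        if (!valid_size_factor(sf)) {
            sf = placeholder;
        }
        sum += sf;
    }
    return sum / size_factors.size();
}
```

For size factors {0.5, 1.5, 0} and a placeholder of 0.5, the mean of the valid entries is 1, whereas the mean after replacement is 2.5/3. Centering on the second value would rescale every cell by a quantity that depends on the placeholder.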

◆ sanitize_size_factors() [2/2]

template<typename SizeFactor_ >

SizeFactorDiagnostics scran_norm::sanitize_size_factors	(	std::size_t	num,
		SizeFactor_ *	size_factors,
		const SanitizeSizeFactorsOptions &	options )

Overload of sanitize_size_factors() that calls check_size_factor_sanity() internally.

Template Parameters

SizeFactor_ Floating-point type for the size factors.

Parameters

	num	Number of size factors.
[in,out]	size_factors	Pointer to an array of size factors of length `num`. On output, invalid size factors are replaced.
	options	Further options.

Returns: An object indicating whether each type of invalid size factor is present in size_factors.
