scran_norm
Scaling normalization of single-cell data
Classes | |
struct | CenterSizeFactorsOptions |
Options for center_size_factors() and center_size_factors_blocked().
struct | ChoosePseudoCountOptions |
Options for choose_pseudo_count().
struct | NormalizeCountsOptions |
Options for normalize_counts().
struct | SanitizeSizeFactorsOptions |
Options for sanitize_size_factors().
struct | SizeFactorDiagnostics |
Diagnostics for the size factors.
Enumerations | |
enum class | CenterBlockMode : char { PER_BLOCK , LOWEST } |
enum class | SanitizeAction : char { IGNORE , ERROR , SANITIZE } |
Functions | |
template<typename SizeFactor_ > | |
SizeFactor_ | center_size_factors_mean (const std::size_t num, const SizeFactor_ *const size_factors, SizeFactorDiagnostics *const diagnostics, const CenterSizeFactorsOptions &options) |
template<typename SizeFactor_ > | |
SizeFactor_ | center_size_factors (const std::size_t num, SizeFactor_ *const size_factors, SizeFactorDiagnostics *const diagnostics, const CenterSizeFactorsOptions &options) |
template<typename SizeFactor_ , typename Block_ > | |
std::vector< SizeFactor_ > | center_size_factors_blocked_mean (const std::size_t num, const SizeFactor_ *const size_factors, const Block_ *const block, SizeFactorDiagnostics *const diagnostics, const CenterSizeFactorsOptions &options) |
template<typename SizeFactor_ , typename Block_ > | |
std::vector< SizeFactor_ > | center_size_factors_blocked (const std::size_t num, SizeFactor_ *const size_factors, const Block_ *const block, SizeFactorDiagnostics *const diagnostics, const CenterSizeFactorsOptions &options) |
template<typename Float_ > | |
Float_ | choose_pseudo_count_raw (std::size_t num, Float_ *const size_factors, const ChoosePseudoCountOptions &options) |
template<typename Float_ > | |
Float_ | choose_pseudo_count (const std::size_t num, const Float_ *const size_factors, const ChoosePseudoCountOptions &options) |
template<typename OutputValue_ = double, typename InputValue_ , typename Index_ , class SizeFactors_ > | |
std::shared_ptr< tatami::Matrix< OutputValue_, Index_ > > | normalize_counts (std::shared_ptr< const tatami::Matrix< InputValue_, Index_ > > counts, SizeFactors_ size_factors, const NormalizeCountsOptions &options) |
template<typename SizeFactor_ > | |
SizeFactorDiagnostics | check_size_factor_sanity (const std::size_t num, const SizeFactor_ *const size_factors) |
template<typename SizeFactor_ > | |
void | sanitize_size_factors (const std::size_t num, SizeFactor_ *const size_factors, const SizeFactorDiagnostics &status, const SanitizeSizeFactorsOptions &options) |
template<typename SizeFactor_ > | |
SizeFactorDiagnostics | sanitize_size_factors (const std::size_t num, SizeFactor_ *const size_factors, const SanitizeSizeFactorsOptions &options) |
Scaling normalization of single-cell data.
enum class CenterBlockMode : char
Strategy for handling blocks when centering size factors; see CenterSizeFactorsOptions::block_mode for details.
enum class SanitizeAction : char
How invalid size factors should be handled:
IGNORE: ignore invalid size factors with no error or change.
ERROR: throw an error.
SANITIZE: fix each invalid size factor.
SizeFactor_ scran_norm::center_size_factors_mean(const std::size_t num, const SizeFactor_ *const size_factors, SizeFactorDiagnostics *const diagnostics, const CenterSizeFactorsOptions &options)
Compute the mean size factor but do not scale the size factors themselves.
SizeFactor_ | Floating-point type of the size factors. |
num | Number of cells. | |
[in] | size_factors | Pointer to an array of length num , containing the size factor for each cell. |
[out] | diagnostics | Diagnostics for invalid size factors. This is only used if CenterSizeFactorsOptions::ignore_invalid = true , in which case it is filled with diagnostics for invalid values in size_factors . It can also be NULL , in which case it is ignored. |
options | Further options. |
Returns the mean of the size factors in size_factors.
SizeFactor_ scran_norm::center_size_factors(const std::size_t num, SizeFactor_ *const size_factors, SizeFactorDiagnostics *const diagnostics, const CenterSizeFactorsOptions &options)
Centering the size factors involves scaling all of the size factors so that the mean across cells is equal to 1. The aim is to ensure that the normalized expression values are on roughly the same scale as the original counts. This simplifies interpretation and ensures that any pseudo-count added prior to log-transformation has a predictable shrinkage effect. In general, size factors should be centered before calling normalize_counts().
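The centering step described above can be illustrated with a small standalone sketch. This is not the scran_norm implementation: `is_valid_size_factor()` and `center_sketch()` are hypothetical names, and the choice to compute the mean over valid (finite, positive) values only is an assumption made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper, not part of scran_norm: a size factor is treated as
// valid only if it is finite and positive.
template<typename SizeFactor_>
bool is_valid_size_factor(SizeFactor_ x) {
    return std::isfinite(x) && x > 0;
}

// Divide all size factors in place by the mean of the valid ones, so that
// the valid values have a mean of 1 afterwards. Invalid values stay invalid.
// Returns the mean before scaling, or 0 if there were no valid values.
template<typename SizeFactor_>
SizeFactor_ center_sketch(std::size_t num, SizeFactor_* size_factors) {
    assert(num == 0 || size_factors != nullptr);
    SizeFactor_ total = 0;
    std::size_t count = 0;
    for (std::size_t i = 0; i < num; ++i) {
        if (is_valid_size_factor(size_factors[i])) {
            total += size_factors[i];
            ++count;
        }
    }
    if (count == 0) {
        return 0; // nothing to center on
    }
    const SizeFactor_ mean = total / static_cast<SizeFactor_>(count);
    for (std::size_t i = 0; i < num; ++i) {
        size_factors[i] /= mean;
    }
    return mean;
}
```

Because every cell is divided by the same constant, the ratios between size factors are unchanged; only the overall scale moves.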
SizeFactor_ | Floating-point type of the size factors. |
num | Number of cells. | |
[in,out] | size_factors | Pointer to an array of length num , containing the size factor for each cell. On output, this contains the centered size factors. |
[out] | diagnostics | Diagnostics for invalid size factors. This is only used if CenterSizeFactorsOptions::ignore_invalid = true , in which case it is filled with diagnostics for invalid values in size_factors . It can also be NULL , in which case it is ignored. |
options | Further options. |
std::vector<SizeFactor_> scran_norm::center_size_factors_blocked_mean(const std::size_t num, const SizeFactor_ *const size_factors, const Block_ *const block, SizeFactorDiagnostics *const diagnostics, const CenterSizeFactorsOptions &options)
Compute the mean size factor for each block, but do not scale the size factors themselves. This is the blocked version of center_size_factors_mean().
SizeFactor_ | Floating-point type of the size factors. |
Block_ | Integer type for the block assignments. |
num | Number of cells. | |
[in] | size_factors | Pointer to an array of length num , containing the size factor for each cell. |
[in] | block | Pointer to an array of length num , containing the block assignment for each cell. Each assignment should be an integer in \([0, N)\) where \(N\) is the total number of blocks. |
[out] | diagnostics | Diagnostics for invalid size factors. This is only used if CenterSizeFactorsOptions::ignore_invalid = true , in which case it is filled with diagnostics for invalid values in size_factors . It can also be NULL , in which case it is ignored. |
options | Further options. |
std::vector<SizeFactor_> scran_norm::center_size_factors_blocked(const std::size_t num, SizeFactor_ *const size_factors, const Block_ *const block, SizeFactorDiagnostics *const diagnostics, const CenterSizeFactorsOptions &options)
Center size factors within each block to obtain interpretable values after normalization, as discussed in center_size_factors(). The exact strategy for handling blocks is controlled by CenterSizeFactorsOptions::block_mode.
SizeFactor_ | Floating-point type of the size factors. |
Block_ | Integer type for the block assignments. |
num | Number of cells. | |
[in,out] | size_factors | Pointer to an array of length num, containing the size factor for each cell. On output, this contains size factors that are centered according to CenterSizeFactorsOptions::block_mode. |
[in] | block | Pointer to an array of length num , containing the block assignment for each cell. Each assignment should be an integer in \([0, N)\) where \(N\) is the total number of blocks. |
[out] | diagnostics | Diagnostics for invalid size factors. This is only used if CenterSizeFactorsOptions::ignore_invalid = true, in which case it is filled with diagnostics for invalid values in size_factors. It can also be NULL, in which case it is ignored. |
options | Further options. |
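As a rough illustration of blocked centering, the sketch below computes per-block means and then applies one of two strategies. It is a standalone example rather than the library's code: the function names are hypothetical, all size factors are assumed to be valid, and the meaning given to the two modes (dividing by each block's own mean for PER_BLOCK, dividing everything by the lowest block mean for LOWEST) is an assumption inferred from the enumerator names, so check CenterSizeFactorsOptions::block_mode for the actual behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean size factor of each block; block IDs are integers in [0, num_blocks).
// Assumes every size factor is valid (finite and positive).
inline std::vector<double> block_means_sketch(std::size_t num, const double* size_factors,
                                              const int* block, std::size_t num_blocks) {
    std::vector<double> sums(num_blocks, 0.0);
    std::vector<std::size_t> counts(num_blocks, 0);
    for (std::size_t i = 0; i < num; ++i) {
        assert(block[i] >= 0 && static_cast<std::size_t>(block[i]) < num_blocks);
        sums[block[i]] += size_factors[i];
        ++counts[block[i]];
    }
    for (std::size_t b = 0; b < num_blocks; ++b) {
        if (counts[b] > 0) {
            sums[b] /= static_cast<double>(counts[b]);
        }
    }
    return sums; // empty blocks are left with a mean of 0
}

// per_block = true:  divide each cell by the mean of its own block.
// per_block = false: divide every cell by the lowest non-empty block mean
//                    (the assumed meaning of LOWEST).
inline void center_blocked_sketch(std::size_t num, double* size_factors, const int* block,
                                  std::size_t num_blocks, bool per_block) {
    const std::vector<double> means = block_means_sketch(num, size_factors, block, num_blocks);
    double lowest = 0;
    bool found = false;
    for (double m : means) {
        if (m > 0 && (!found || m < lowest)) {
            lowest = m;
            found = true;
        }
    }
    if (!found) {
        return; // no cells at all
    }
    for (std::size_t i = 0; i < num; ++i) {
        size_factors[i] /= (per_block ? means[block[i]] : lowest);
    }
}
```

Under these assumptions, the per-block strategy removes differences in average size factor between blocks, while the lowest-mean strategy preserves them and only rescales.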
Float_ scran_norm::choose_pseudo_count_raw(std::size_t num, Float_ *const size_factors, const ChoosePseudoCountOptions &options)
Choose a pseudo-count for log-transformation (see NormalizeCountsOptions::pseudo_count) that aims to control the transformation-induced bias.
Log-transformation is commonly applied to sequencing count data prior to further analyses (see NormalizeCountsOptions::log). However, this can introduce spurious differences in the expected log-normalized expression between cells with very different size factors (Lun, 2018). This bias is typically modest in datasets where there are stronger sources of variation, but when observed, it manifests as a library size-dependent trend in the log-normalized expression values. It is difficult to regress out without also removing biology that is associated with, e.g., total RNA content.
A simpler solution is to increase the pseudo-count to suppress the bias. This shrinks all log-expression values towards the zero-expression baseline, thus also shrinking log-differences between cells towards zero. The increased shrinkage is strongest at low counts where the data is least informative and the transformation bias is most pronounced. At large counts, the shrinkage has less effect as the log-differences are driven by the data. Our aim is to pick a pseudo-count that is large enough to mitigate the bias while being small enough to avoid shrinking the biological differences.
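The shrinkage effect described above can be checked numerically with a tiny sketch. The helper `log_difference()` is hypothetical, and the base-2 logarithm is an illustrative choice rather than something stated here.

```cpp
#include <cassert>
#include <cmath>

// Absolute difference in log2-expression between two normalized values
// after adding the same pseudo-count to both.
inline double log_difference(double a, double b, double pseudo_count) {
    assert(a >= 0 && b >= 0 && pseudo_count > 0);
    return std::abs(std::log2(a + pseudo_count) - std::log2(b + pseudo_count));
}
```

For normalized values of 1 and 3, raising the pseudo-count from 1 to 5 reduces the log-difference from log2(4/2) to log2(8/6). For values of 100 and 300, the same change moves the log-difference from log2(301/101) to log2(305/105), a far smaller relative reduction, which matches the claim that shrinkage is concentrated at low counts.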
No centering is performed by this function, so the size factors should be passed through center_size_factors() before calling this function. Invalid size factors (e.g., zero, negative, non-finite) are automatically ignored, so prior sanitization should not be performed. This ensures that the replacement values are not included in the various quantile calculations.
Float_ | Floating-point type of the size factors. |
num | Number of size factors. | |
[in] | size_factors | Pointer to an array of size factors of length num . Values should be positive, and all non-positive values are ignored. On output, this array is arbitrarily permuted and should not be used. |
options | Further options. |
Float_ scran_norm::choose_pseudo_count(const std::size_t num, const Float_ *const size_factors, const ChoosePseudoCountOptions &options)
This function wraps choose_pseudo_count_raw(), automatically creating a writeable buffer for the size factors.
Float_ | Floating-point type of the size factors. |
num | Number of size factors. | |
[in] | size_factors | Pointer to an array of size factors of length num. Values should be positive, and all non-positive values are ignored. |
options | Further options. |
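The relationship between a "raw" function that permutes its input and a const-correct wrapper can be sketched generically. The example below is not the pseudo-count calculation itself: `quantile_raw()` and `quantile()` are hypothetical names, the quantile is a simple order statistic without interpolation, and invalid values are not filtered out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-place routine: returns the order statistic at `prob`,
// arbitrarily permuting `values` in the process (as std::nth_element does).
inline double quantile_raw(std::size_t num, double* values, double prob) {
    assert(num > 0 && prob >= 0 && prob <= 1);
    const std::size_t k = static_cast<std::size_t>(prob * static_cast<double>(num - 1));
    std::nth_element(values, values + k, values + num);
    return values[k];
}

// Const-correct wrapper: copy the input into a writable buffer and
// delegate to the raw version, leaving the caller's array untouched.
inline double quantile(std::size_t num, const double* values, double prob) {
    std::vector<double> buffer(values, values + num);
    return quantile_raw(num, buffer.data(), prob);
}
```

Callers that already own a scratch copy can use the raw form to avoid the extra allocation; everyone else can use the wrapper and keep their input intact.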
std::shared_ptr<tatami::Matrix<OutputValue_, Index_> > scran_norm::normalize_counts(std::shared_ptr<const tatami::Matrix<InputValue_, Index_> > counts, SizeFactors_ size_factors, const NormalizeCountsOptions &options)
Given a count matrix and a set of size factors, compute log-transformed normalized expression values. All operations are done in a delayed manner using the tatami::DelayedUnaryIsometricOperation class.
For normalization, each cell's counts are divided by the cell's size factor to remove uninteresting scaling differences. The simplest and most common method for defining size factors is to use the centered library sizes (see center_size_factors() for details). This removes scaling biases caused by differences in sequencing depth, capture efficiency, etc., between cells. The centering preserves the scale of the counts in the normalized expression values. That said, users can define size factors from any method of their choice (e.g., median-based normalization, TMM) as long as they are positive for all cells.
Normalized values are then typically log-transformed so that differences in log-values represent log-fold changes in expression. This ensures that downstream analyses like t-tests and distance calculations focus on relative fold-changes rather than absolute differences. The log-transformation also provides some measure of variance stabilization so that the downstream analyses are not dominated by sampling noise at large counts.
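The arithmetic behind these two steps can be sketched on a small dense matrix. This is an eager, standalone illustration and not the delayed tatami-based implementation; `normalize_sketch()` is a hypothetical name, and the base-2 logarithm and default pseudo-count of 1 are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// counts[g][c] holds the count for gene g in cell c.
// Each count is divided by its cell's size factor, a pseudo-count is added,
// and the result is log2-transformed.
inline std::vector<std::vector<double>> normalize_sketch(
    const std::vector<std::vector<double>>& counts,
    const std::vector<double>& size_factors,
    double pseudo_count = 1.0)
{
    std::vector<std::vector<double>> output(counts.size());
    for (std::size_t g = 0; g < counts.size(); ++g) {
        assert(counts[g].size() == size_factors.size());
        output[g].reserve(size_factors.size());
        for (std::size_t c = 0; c < size_factors.size(); ++c) {
            output[g].push_back(std::log2(counts[g][c] / size_factors[c] + pseudo_count));
        }
    }
    return output;
}
```

With size factors of 1 and 2, counts of 3 and 6 for the same gene both normalize to 3, so the two cells receive identical log-values for that gene even though the raw counts differ twofold.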
OutputValue_ | Floating-point type of the output matrix. |
InputValue_ | Data type of the input matrix. |
Index_ | Integer type of the input matrix. |
SizeFactors_ | Container of floats of the size factors. This should have the size() , begin() , end() and operator[] methods. |
counts | Pointer to a matrix of non-negative counts. Rows should correspond to genes while columns should correspond to cells. |
size_factors | Vector of length equal to the number of columns in counts , containing the size factor for each cell. All values should be positive, and any invalid values should be replaced with sanitize_size_factors() . In most applications, the size factors should also be centered via, e.g., center_size_factors() . |
options | Further options. |
Returns a matrix of normalized expression values. These are log-transformed if NormalizeCountsOptions::log = true.
SizeFactorDiagnostics scran_norm::check_size_factor_sanity(const std::size_t num, const SizeFactor_ *const size_factors)
Check whether there are any invalid size factors. Size factors are only valid if they are finite and positive.
SizeFactor_ | Floating-point type of the size factors. |
num | Number of size factors. | |
[in] | size_factors | Pointer to an array of size factors of length num . |
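A validity check of this kind can be sketched as a single pass over the array. The struct and its field names below are hypothetical stand-ins chosen for the example and are not taken from SizeFactorDiagnostics.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical diagnostics record; the field names are illustrative only.
struct DiagnosticsSketch {
    bool has_negative = false;
    bool has_zero = false;
    bool has_nan = false;
    bool has_infinite = false;
};

// Flag each category of invalid size factor; a value is valid only if it
// is finite and positive. Both +inf and -inf are reported as infinite.
template<typename SizeFactor_>
DiagnosticsSketch check_sanity_sketch(std::size_t num, const SizeFactor_* size_factors) {
    assert(num == 0 || size_factors != nullptr);
    DiagnosticsSketch out;
    for (std::size_t i = 0; i < num; ++i) {
        const SizeFactor_ x = size_factors[i];
        if (std::isnan(x)) {
            out.has_nan = true;
        } else if (std::isinf(x)) {
            out.has_infinite = true;
        } else if (x < 0) {
            out.has_negative = true;
        } else if (x == 0) {
            out.has_zero = true;
        }
    }
    return out;
}
```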
void scran_norm::sanitize_size_factors(const std::size_t num, SizeFactor_ *const size_factors, const SizeFactorDiagnostics &status, const SanitizeSizeFactorsOptions &options)
Replace zero, missing or infinite values in the size factor array so that they can be used to compute well-defined normalized values in normalize_counts(). Such size factors can occasionally arise if, e.g., insufficient quality control was performed upstream. See SanitizeSizeFactorsOptions for the placeholder value used for each type of invalid size factor.
In general, sanitization should occur after calls to center_size_factors(), choose_pseudo_count(), or any function that computes a statistic based on the distribution of size factors. This ensures that the results of those functions are not affected by the placeholder values used to replace the invalid size factors. As a rule of thumb, sanitize_size_factors() should be called just before passing those size factors to normalize_counts().
SizeFactor_ | Floating-point type of the size factors. |
num | Number of size factors. | |
[in,out] | size_factors | Pointer to an array of positive size factors of length num. On output, invalid size factors may be replaced depending on the settings in options. |
status | A pre-computed object indicating whether invalid size factors are present in size_factors . This can be useful if this information is already provided by, e.g., check_size_factor_sanity() or center_size_factors() . | |
options | Further options. |
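To make the idea of placeholder replacement concrete, here is a standalone sketch. The replacement policy is an assumption chosen purely for illustration (zero or negative values become the smallest valid size factor, NaN becomes 1, positive infinity becomes the largest valid size factor); the placeholders actually used are those documented in SanitizeSizeFactorsOptions. `sanitize_sketch()` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Replace invalid size factors in place using an assumed, illustrative policy:
//   NaN                      -> 1
//   zero, negative, or -inf  -> smallest valid size factor
//   +inf                     -> largest valid size factor
// If no valid size factor exists, both placeholders fall back to 1.
inline void sanitize_sketch(std::size_t num, double* size_factors) {
    assert(num == 0 || size_factors != nullptr);
    double smallest = 1, largest = 1;
    bool found = false;
    for (std::size_t i = 0; i < num; ++i) {
        const double x = size_factors[i];
        if (std::isfinite(x) && x > 0) {
            smallest = found ? std::min(smallest, x) : x;
            largest = found ? std::max(largest, x) : x;
            found = true;
        }
    }
    for (std::size_t i = 0; i < num; ++i) {
        double& x = size_factors[i];
        if (std::isnan(x)) {
            x = 1;
        } else if (x <= 0) {
            x = smallest;
        } else if (std::isinf(x)) {
            x = largest;
        }
    }
}
```

Note that the placeholders are derived from the valid values only, which is also why statistics such as the mean should be computed before this step rather than after it.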
SizeFactorDiagnostics scran_norm::sanitize_size_factors(const std::size_t num, SizeFactor_ *const size_factors, const SanitizeSizeFactorsOptions &options)
Overload of sanitize_size_factors() that calls check_size_factor_sanity() internally.
SizeFactor_ | Floating-point type of the size factors. |
num | Number of size factors. | |
[in,out] | size_factors | Pointer to an array of positive size factors of length num. On output, invalid size factors are replaced. |
options | Further options. |
Returns an object indicating whether invalid size factors are present in size_factors.