Principal component analysis on single-cell data.

Classes
struct	BlockedPcaOptions
	Options for `blocked_pca()`.

struct	BlockedPcaResults
	Results of `blocked_pca()`.

struct	SimplePcaOptions
	Options for `simple_pca()`.

struct	SimplePcaResults
	Results of `simple_pca()`.

Functions
template<typename Value_ , typename Index_ , typename EigenMatrix_ , class EigenVector_ >
void	simple_pca (const tatami::Matrix< Value_, Index_ > &mat, const SimplePcaOptions &options, SimplePcaResults< EigenMatrix_, EigenVector_ > &output)

template<typename EigenMatrix_ = Eigen::MatrixXd, class EigenVector_ = Eigen::VectorXd, typename Value_ , typename Index_ >
SimplePcaResults< EigenMatrix_, EigenVector_ >	simple_pca (const tatami::Matrix< Value_, Index_ > &mat, const SimplePcaOptions &options)

template<typename Value_ , typename Index_ , typename Block_ , typename EigenMatrix_ , class EigenVector_ >
void	blocked_pca (const tatami::Matrix< Value_, Index_ > &mat, const Block_ *block, const BlockedPcaOptions &options, BlockedPcaResults< EigenMatrix_, EigenVector_ > &output)

template<typename EigenMatrix_ = Eigen::MatrixXd, class EigenVector_ = Eigen::VectorXd, typename Value_ , typename Index_ , typename Block_ >
BlockedPcaResults< EigenMatrix_, EigenVector_ >	blocked_pca (const tatami::Matrix< Value_, Index_ > &mat, const Block_ *block, const BlockedPcaOptions &options)

Detailed Description

Principal component analysis on single-cell data.

Function Documentation

◆ blocked_pca() [1/2]

template<typename EigenMatrix_ = Eigen::MatrixXd, class EigenVector_ = Eigen::VectorXd, typename Value_ , typename Index_ , typename Block_ >

BlockedPcaResults< EigenMatrix_, EigenVector_ > scran_pca::blocked_pca	(	const tatami::Matrix< Value_, Index_ > &	mat,
		const Block_ *	block,
		const BlockedPcaOptions &	options )

Overload of blocked_pca() that allocates memory for the output.

Template Parameters

EigenMatrix_	A floating-point `Eigen::Matrix` class.
EigenVector_	A floating-point `Eigen::Vector` class.
Value_	Type of the matrix data.
Index_	Integer type for the indices.
Block_	Integer type for the blocking factor.

Parameters

[in]	mat	Input matrix. Columns should contain cells while rows should contain genes. Matrix entries are typically log-expression values.
[in]	block	Pointer to an array of length equal to the number of cells, containing the block assignment for each cell. Each assignment should be an integer in \([0, N)\) where \(N\) is the number of blocks.
[in]	options	Further options.

Returns: Results of the PCA on the residuals.

◆ blocked_pca() [2/2]

template<typename Value_ , typename Index_ , typename Block_ , typename EigenMatrix_ , class EigenVector_ >

void scran_pca::blocked_pca	(	const tatami::Matrix< Value_, Index_ > &	mat,
		const Block_ *	block,
		const BlockedPcaOptions &	options,
		BlockedPcaResults< EigenMatrix_, EigenVector_ > &	output )

As described for simple_pca(), it is desirable to obtain the top PCs for downstream cell-based analyses. However, in the presence of a blocking factor (e.g., batches, samples), we want to ensure that the PCA is not driven by uninteresting differences between blocks. To achieve this, blocked_pca() centers the expression of each gene within each level of the blocking factor and uses the residuals for PCA. The gene-gene covariance matrix thus focuses on variation within each block, so that the top rotation vectors/principal components capture biological heterogeneity instead of inter-block differences. Internally, blocked_pca() defers the residual calculation until the matrix multiplication steps within IRLBA. This yields the same results as the naive calculation of residuals but is much faster for sparse input, as the matrix is never densified by centering and efficient sparse operations remain available.
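The equivalence between the naive and deferred residual calculations can be illustrated with a small sketch. This is not the library's implementation: the dense `Matrix` alias (cells in rows, genes in columns) and the helper names `block_means()`, `multiply_naive()` and `multiply_deferred()` are hypothetical, chosen only to show that multiplying the residuals by a vector gives the same answer as multiplying the original matrix and then subtracting one correction per block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense layout, for illustration only: x[c][g] is the
// expression of gene g in cell c.
using Matrix = std::vector<std::vector<double>>;

// Mean of each gene within each block, stored as means[b][g].
Matrix block_means(const Matrix& x, const std::vector<int>& block, std::size_t num_blocks) {
    assert(x.size() == block.size());
    const std::size_t num_genes = x.empty() ? 0 : x.front().size();
    Matrix means(num_blocks, std::vector<double>(num_genes, 0.0));
    std::vector<double> counts(num_blocks, 0.0);
    for (std::size_t c = 0; c < x.size(); ++c) {
        const std::size_t b = static_cast<std::size_t>(block[c]);
        counts[b] += 1.0;
        for (std::size_t g = 0; g < num_genes; ++g) {
            means[b][g] += x[c][g];
        }
    }
    for (std::size_t b = 0; b < num_blocks; ++b) {
        if (counts[b] > 0.0) {
            for (double& m : means[b]) {
                m /= counts[b];
            }
        }
    }
    return means;
}

// Naive route: form every residual explicitly, then multiply by v.
std::vector<double> multiply_naive(const Matrix& x, const std::vector<int>& block,
                                   const Matrix& means, const std::vector<double>& v) {
    std::vector<double> out(x.size(), 0.0);
    for (std::size_t c = 0; c < x.size(); ++c) {
        const std::size_t b = static_cast<std::size_t>(block[c]);
        for (std::size_t g = 0; g < v.size(); ++g) {
            out[c] += (x[c][g] - means[b][g]) * v[g];
        }
    }
    return out;
}

// Deferred route: multiply the untouched matrix by v, skipping zeros as a
// sparse representation would, then subtract a single correction per block.
std::vector<double> multiply_deferred(const Matrix& x, const std::vector<int>& block,
                                      const Matrix& means, const std::vector<double>& v) {
    std::vector<double> correction(means.size(), 0.0);
    for (std::size_t b = 0; b < means.size(); ++b) {
        for (std::size_t g = 0; g < v.size(); ++g) {
            correction[b] += means[b][g] * v[g];
        }
    }
    std::vector<double> out(x.size(), 0.0);
    for (std::size_t c = 0; c < x.size(); ++c) {
        for (std::size_t g = 0; g < v.size(); ++g) {
            if (x[c][g] != 0.0) {
                out[c] += x[c][g] * v[g];
            }
        }
        out[c] -= correction[static_cast<std::size_t>(block[c])];
    }
    return out;
}
```

In the deferred route, the per-cell loop only touches non-zero entries, which is where a sparse matrix format would save time; the block means enter through a correction vector whose length equals the number of blocks.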

By default, the principal components are computed from the (conceptual) matrix of residuals. This yields a low-dimensional space where all inter-block differences have been removed, assuming that all blocks have the same composition and the inter-block differences are consistent for all cell subpopulations. Under these assumptions, we could use these components for downstream analysis without any concern for block-wise effects. In practice, these assumptions rarely hold, and more sophisticated batch correction methods like MNN correction are required. Some of these methods accept a low-dimensional embedding of cells as input, which can be created by blocked_pca() with BlockedPcaOptions::components_from_residuals = false. In this mode, only the rotation vectors are computed from the residuals. The original expression values for each cell are then projected onto the associated subspace to obtain PC coordinates that can be used for further batch correction. This approach aims to avoid any strong assumptions about the nature of inter-block differences, while still leveraging the benefits of blocking to focus on intra-block biology.
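The difference between the two modes can be sketched with a single projection helper. The function `project()` and the `Matrix` layout are hypothetical, and the choice of a single shared center for the non-residual mode is an assumption made for illustration; the text above does not say how the library centers the original values before projection.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: rotation[k][g] is the loading of gene g on the
// k-th rotation vector.
using Matrix = std::vector<std::vector<double>>;

// Project one cell onto the rotation vectors after subtracting `center`.
// Passing the cell's block mean as `center` projects the residuals;
// passing a center shared by all cells (an assumption of this sketch)
// projects the original values and keeps inter-block offsets.
std::vector<double> project(const std::vector<double>& values,
                            const std::vector<double>& center,
                            const Matrix& rotation) {
    assert(values.size() == center.size());
    std::vector<double> coords(rotation.size(), 0.0);
    for (std::size_t k = 0; k < rotation.size(); ++k) {
        assert(rotation[k].size() == values.size());
        for (std::size_t g = 0; g < values.size(); ++g) {
            coords[k] += (values[g] - center[g]) * rotation[k][g];
        }
    }
    return coords;
}
```

For two cells that sit at the same position relative to their own block means, projecting the residuals gives identical coordinates, while projecting the original values leaves them separated by the inter-block offset. That remaining offset is what a downstream batch correction method would then be expected to handle.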

If one block has many more cells than the others, it will dominate the PCA by driving the axes of maximum variance. This may mask interesting aspects of variation in the smaller blocks. To mitigate this, we scale each block in inverse proportion to its size (see BlockedPcaOptions::block_weight_policy). This ensures that each block contributes equally to the (conceptual) gene-gene covariance matrix and thus the rotation vectors. The vector of residuals for each cell (or the original expression values, if BlockedPcaOptions::components_from_residuals = false) is then projected onto the subspace defined by these rotation vectors to obtain that cell's PC coordinates.
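A minimal sketch of inverse-size weighting follows. The helpers `equal_block_weights()` and `weighted_mean_square()` are hypothetical and cover only the scheme described above; they do not reproduce whatever choices BlockedPcaOptions::block_weight_policy exposes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-cell weight equal to 1 / (number of cells in that cell's block), so
// that the weights within every non-empty block sum to one.
std::vector<double> equal_block_weights(const std::vector<int>& block, std::size_t num_blocks) {
    std::vector<double> sizes(num_blocks, 0.0);
    for (int b : block) {
        assert(b >= 0 && static_cast<std::size_t>(b) < num_blocks);
        sizes[static_cast<std::size_t>(b)] += 1.0;
    }
    std::vector<double> weights(block.size());
    for (std::size_t c = 0; c < block.size(); ++c) {
        weights[c] = 1.0 / sizes[static_cast<std::size_t>(block[c])];
    }
    return weights;
}

// Weighted mean of squared residuals for one gene, i.e., one diagonal entry
// of a weighted gene-gene second-moment matrix.
double weighted_mean_square(const std::vector<double>& residuals,
                            const std::vector<double>& weights) {
    assert(residuals.size() == weights.size());
    double numerator = 0.0;
    double denominator = 0.0;
    for (std::size_t c = 0; c < residuals.size(); ++c) {
        numerator += weights[c] * residuals[c] * residuals[c];
        denominator += weights[c];
    }
    return denominator > 0.0 ? numerator / denominator : 0.0;
}
```

With a block of four cells that has no residual variation and a block of two cells with residuals of 3 and -3, unit weights give a mean square of 3, whereas the inverse-size weights give 4.5, the plain average of the two per-block values (0 and 9). The smaller block is no longer diluted by the larger one.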

Template Parameters

Value_	Type of the matrix data.
Index_	Integer type for the indices.
Block_	Integer type for the blocking factor.
EigenMatrix_	A floating-point `Eigen::Matrix` class.
EigenVector_	A floating-point `Eigen::Vector` class.

Parameters

[in]	mat	Input matrix. Columns should contain cells while rows should contain genes. Matrix entries are typically log-expression values.
[in]	block	Pointer to an array of length equal to the number of cells, containing the block assignment for each cell. Each assignment should be an integer in \([0, N)\) where \(N\) is the number of blocks.
[in]	options	Further options.
[out]	output	On output, the results of the PCA on the residuals. This can be re-used across multiple calls to `blocked_pca()`.
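Because `block` is a raw pointer, callers may wish to check the assignments before passing them in. The helper below is a hypothetical pre-flight check, not part of the library: it assumes that the number of blocks \(N\) can be taken as one more than the largest assignment, and it tabulates the number of cells in each block so that empty blocks can be spotted.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Count the cells in each block, rejecting negative assignments.
// The returned vector has length N = (largest assignment) + 1, which is an
// assumption of this sketch about how N is defined.
template <typename Block_>
std::vector<std::size_t> tabulate_blocks(const Block_* block, std::size_t num_cells) {
    assert(block != nullptr || num_cells == 0);
    std::vector<std::size_t> sizes;
    for (std::size_t c = 0; c < num_cells; ++c) {
        if (block[c] < 0) {
            throw std::invalid_argument("block assignments must be non-negative");
        }
        const std::size_t b = static_cast<std::size_t>(block[c]);
        if (b >= sizes.size()) {
            sizes.resize(b + 1, 0);
        }
        ++sizes[b];
    }
    return sizes;
}
```

A zero in the returned vector indicates a block in \([0, N)\) with no cells, which usually points to assignments that were not renumbered consecutively after filtering.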

◆ simple_pca() [1/2]

template<typename EigenMatrix_ = Eigen::MatrixXd, class EigenVector_ = Eigen::VectorXd, typename Value_ , typename Index_ >

SimplePcaResults< EigenMatrix_, EigenVector_ > scran_pca::simple_pca	(	const tatami::Matrix< Value_, Index_ > &	mat,
		const SimplePcaOptions &	options )

Overload of simple_pca() that allocates memory for the output.

Template Parameters

EigenMatrix_	A floating-point `Eigen::Matrix` class.
EigenVector_	A floating-point `Eigen::Vector` class.
Value_	Type of the matrix data.
Index_	Integer type for the indices.

Parameters

[in]	mat	The input matrix. Columns should contain cells while rows should contain genes. Matrix entries are typically log-expression values.
[in]	options	Further options.

Returns: Results of the PCA.

◆ simple_pca() [2/2]

template<typename Value_ , typename Index_ , typename EigenMatrix_ , class EigenVector_ >

void scran_pca::simple_pca	(	const tatami::Matrix< Value_, Index_ > &	mat,
		const SimplePcaOptions &	options,
		SimplePcaResults< EigenMatrix_, EigenVector_ > &	output )

Principal components analysis (PCA) for compression and denoising of single-cell expression data.

The premise is that most of the variation in the dataset is driven by biology, as changes in pathway activity drive coordinated changes across multiple genes. In contrast, technical noise is random and not synchronized across any one axis in the high-dimensional space. This suggests that the earlier principal components (PCs) should be enriched for biological heterogeneity while the later PCs capture random noise.

Our aim is to reduce the size of the data and reduce noise by only using the earlier PCs for downstream cell-based analyses (e.g., neighbor detection, clustering). Most practitioners will keep the first 10-50 PCs, though the exact choice is fairly arbitrary; use SimplePcaOptions::number to specify the number of PCs. As we are only interested in the top PCs, we can use approximate algorithms for faster computation, in particular IRLBA.
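To make the idea of extracting only the leading component concrete, the sketch below uses plain power iteration on a small dense matrix. This is a deliberately simpler stand-in and is not IRLBA, nor the library's code: the `Matrix` alias (cells in rows, genes in columns), the `TopComponent` struct and the `first_pc()` function are all hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical dense layout: x[c][g] is the expression of gene g in cell c.
using Matrix = std::vector<std::vector<double>>;

struct TopComponent {
    std::vector<double> rotation; // unit-length loadings, one per gene
    double variance;              // variance explained along the rotation vector
    std::vector<double> scores;   // coordinate of each cell on the component
};

// First principal component via power iteration on the gene-gene covariance
// matrix, applied implicitly as X^T (X v) / (n - 1) on the centered data.
// The all-equal starting vector would fail if it happened to be exactly
// orthogonal to the leading rotation vector; a random start avoids that.
TopComponent first_pc(const Matrix& x, int iterations = 200) {
    const std::size_t num_cells = x.size();
    assert(num_cells > 1);
    const std::size_t num_genes = x.front().size();

    // Center each gene across all cells.
    std::vector<double> mean(num_genes, 0.0);
    for (const auto& cell : x) {
        for (std::size_t g = 0; g < num_genes; ++g) {
            mean[g] += cell[g] / static_cast<double>(num_cells);
        }
    }
    Matrix centered = x;
    for (auto& cell : centered) {
        for (std::size_t g = 0; g < num_genes; ++g) {
            cell[g] -= mean[g];
        }
    }

    auto compute_scores = [&](const std::vector<double>& v) {
        std::vector<double> s(num_cells, 0.0);
        for (std::size_t c = 0; c < num_cells; ++c) {
            for (std::size_t g = 0; g < num_genes; ++g) {
                s[c] += centered[c][g] * v[g];
            }
        }
        return s;
    };

    std::vector<double> v(num_genes, 1.0 / std::sqrt(static_cast<double>(num_genes)));
    double variance = 0.0;
    for (int it = 0; it < iterations; ++it) {
        const std::vector<double> s = compute_scores(v);
        std::vector<double> next(num_genes, 0.0);
        double norm2 = 0.0;
        for (std::size_t g = 0; g < num_genes; ++g) {
            for (std::size_t c = 0; c < num_cells; ++c) {
                next[g] += centered[c][g] * s[c];
            }
            norm2 += next[g] * next[g];
        }
        const double norm = std::sqrt(norm2);
        if (norm == 0.0) {
            break; // no variation left to capture
        }
        for (std::size_t g = 0; g < num_genes; ++g) {
            v[g] = next[g] / norm;
        }
        variance = norm / static_cast<double>(num_cells - 1);
    }
    return TopComponent{v, variance, compute_scores(v)};
}
```

The covariance matrix is never formed; each iteration needs only two passes over the data. IRLBA follows the same general principle of working through matrix-vector products, but it recovers several components at once and converges far faster than this one-vector scheme.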

Template Parameters

Value_	Type of the matrix data.
Index_	Integer type for the indices.
EigenMatrix_	A floating-point `Eigen::Matrix` class.
EigenVector_	A floating-point `Eigen::Vector` class.

Parameters

[in]	mat	The input matrix. Columns should contain cells while rows should contain genes. Matrix entries are typically log-expression values.
[in]	options	Further options.
[out]	output	On output, the results of the PCA on `mat`. This can be re-used across multiple calls to `simple_pca()`.
