|
| template<typename Value_ , typename Index_ , typename EigenMatrix_ , class EigenVector_ > |
| void | simple_pca (const tatami::Matrix< Value_, Index_ > &mat, const SimplePcaOptions< EigenVector_ > &options, SimplePcaResults< EigenMatrix_, EigenVector_ > &output) |
| |
| template<typename EigenMatrix_ = Eigen::MatrixXd, class EigenVector_ = Eigen::VectorXd, typename Value_ , typename Index_ > |
| SimplePcaResults< EigenMatrix_, EigenVector_ > | simple_pca (const tatami::Matrix< Value_, Index_ > &mat, const SimplePcaOptions< EigenVector_ > &options) |
| |
| template<typename Value_ , typename Index_ , typename Block_ , typename EigenMatrix_ , class EigenVector_ > |
| void | blocked_pca (const tatami::Matrix< Value_, Index_ > &mat, const Block_ *block, const BlockedPcaOptions< EigenVector_ > &options, BlockedPcaResults< EigenMatrix_, EigenVector_ > &output) |
| |
| template<typename EigenMatrix_ = Eigen::MatrixXd, class EigenVector_ = Eigen::VectorXd, typename Value_ , typename Index_ , typename Block_ > |
| BlockedPcaResults< EigenMatrix_, EigenVector_ > | blocked_pca (const tatami::Matrix< Value_, Index_ > &mat, const Block_ *block, const BlockedPcaOptions< EigenVector_ > &options) |
| |
| template<typename Value_ , typename Index_ , typename SubsetVector_ , typename EigenMatrix_ , class EigenVector_ > |
| void | subset_pca (const tatami::Matrix< Value_, Index_ > &mat, const SubsetVector_ &subset, const SubsetPcaOptions< EigenVector_ > &options, SubsetPcaResults< EigenMatrix_, EigenVector_ > &output) |
| |
| template<typename EigenMatrix_ = Eigen::MatrixXd, class EigenVector_ = Eigen::VectorXd, typename Value_ , typename Index_ , class SubsetVector_ > |
| SubsetPcaResults< EigenMatrix_, EigenVector_ > | subset_pca (const tatami::Matrix< Value_, Index_ > &mat, const SubsetVector_ &subset, const SubsetPcaOptions< EigenVector_ > &options) |
| |
| template<typename Value_ , typename Index_ , class SubsetVector_ , typename Block_ , typename EigenMatrix_ , class EigenVector_ > |
| void | subset_pca_blocked (const tatami::Matrix< Value_, Index_ > &mat, const SubsetVector_ &subset, const Block_ *block, const SubsetPcaBlockedOptions< EigenVector_ > &options, SubsetPcaBlockedResults< EigenMatrix_, EigenVector_ > &output) |
| |
| template<typename EigenMatrix_ = Eigen::MatrixXd, class EigenVector_ = Eigen::VectorXd, typename Value_ , typename Index_ , class SubsetVector_ , typename Block_ > |
| SubsetPcaBlockedResults< EigenMatrix_, EigenVector_ > | subset_pca_blocked (const tatami::Matrix< Value_, Index_ > &mat, const SubsetVector_ &subset, const Block_ *block, const SubsetPcaBlockedOptions< EigenVector_ > &options) |
| |
Principal component analysis on single-cell data.
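template<typename Value_ , typename Index_ , typename Block_ , typename EigenMatrix_ , class EigenVector_ >
void blocked_pca (const tatami::Matrix< Value_, Index_ > &mat, const Block_ *block, const BlockedPcaOptions< EigenVector_ > &options, BlockedPcaResults< EigenMatrix_, EigenVector_ > &output)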
Principal components analysis on residuals, after regressing out a blocking factor across cells.
As discussed in simple_pca(), we extract the top PCs from a single-cell dataset for downstream cell-based procedures like clustering. In the presence of a blocking factor (e.g., batches, samples), we want to ensure that the PCA is not driven by uninteresting differences between blocks of cells. To achieve this, blocked_pca() centers the expression of each gene within each blocking level and uses the residuals for PCA. This ensures that the gene-gene covariance matrix will only contain variation within each block, such that the top rotation vectors/principal components capture biological heterogeneity instead of inter-block differences.
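The within-block centering described above can be sketched for a single gene as follows. This is a minimal illustration using only the standard library; `center_within_blocks` is a hypothetical helper and not part of the library's interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Center one gene's expression values within each block of cells.
// block[i] is the block assignment of cell i, an integer in [0, num_blocks).
// (Hypothetical helper for illustration only.)
std::vector<double> center_within_blocks(
    const std::vector<double>& expr,
    const std::vector<int>& block,
    int num_blocks)
{
    assert(expr.size() == block.size());
    std::vector<double> sum(num_blocks, 0.0);
    std::vector<std::size_t> count(num_blocks, 0);

    // First pass: accumulate the per-block sums and cell counts.
    for (std::size_t i = 0; i < expr.size(); ++i) {
        assert(block[i] >= 0 && block[i] < num_blocks);
        sum[block[i]] += expr[i];
        ++count[block[i]];
    }

    // Second pass: subtract each cell's block mean to obtain the residual.
    std::vector<double> residuals(expr.size());
    for (std::size_t i = 0; i < expr.size(); ++i) {
        residuals[i] = expr[i] - sum[block[i]] / static_cast<double>(count[block[i]]);
    }
    return residuals;
}
```

For example, expression values of 1 and 3 in one block and 10 and 14 in another yield residuals of -1, 1, -2 and 2, so the large difference between the two block means no longer contributes to the variance.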
In addition, blocked_pca() will weight each block of cells to control its relative contribution to the PCA. By default, blocked_pca() scales the expression values for each block so that each "sufficiently large" block contributes equally to the gene-gene covariance matrix and thus the rotation vectors. This ensures that the definition of the axes of maximum variance is not dominated by the largest block, which could otherwise mask interesting variation in the smaller blocks. (See BlockedPcaOptions::block_weight_policy for the choice of weighting scheme.)
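To make the effect of weighting concrete, the sketch below computes per-cell multipliers for the simplest case where every non-empty block receives the same total weight. It deliberately ignores the "sufficiently large" rule, whose details depend on the weighting policy and are not described here; `equal_weight_multipliers` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-cell multipliers such that the squared multipliers within each
// non-empty block sum to 1, i.e., every block has the same total weight.
// (Hypothetical helper; assumes a plain "equal" weighting scheme.)
std::vector<double> equal_weight_multipliers(const std::vector<int>& block, int num_blocks) {
    std::vector<std::size_t> count(num_blocks, 0);
    for (int b : block) {
        assert(b >= 0 && b < num_blocks);
        ++count[b];
    }

    // Scaling residuals by 1/sqrt(n_b) scales their cross-products by 1/n_b,
    // which cancels out the number of cells n_b in block b.
    std::vector<double> multiplier(block.size());
    for (std::size_t i = 0; i < block.size(); ++i) {
        multiplier[i] = 1.0 / std::sqrt(static_cast<double>(count[block[i]]));
    }
    return multiplier;
}
```

With four cells in one block and a single cell in another, the multipliers are 0.5 and 1 respectively, so both blocks contribute a total squared weight of 1 to the covariance matrix.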
The PC scores themselves are computed by projecting each cell's expression profile onto the subspace defined by the rotation vectors, and then centering them according to BlockedPcaOptions::center_scores_by_block. The interpretation of these scores depends on the choice of centering mode:
- If false (the default), the dataset is globally shifted so that the centroid across all cells lies at the origin. This does not explicitly remove differences between blocks. Any differences in expression that are not orthogonal to the rotation vectors will still manifest in the PC scores. In this mode, blocking only reduces the impact of inter-block differences on the identification of the rotation vectors.
- If true, the scores are centered within each block, i.e., each block of cells is centered at the origin. Without weighting, this is equivalent to the PC scores that would be obtained from PCA on the residuals. This represents a low-dimensional space where inter-block differences have been "corrected", assuming that all blocks have the same subpopulation composition and the inter-block differences are consistent for all cell subpopulations.
We default to false as the assumptions mentioned above for true are usually too strong. When these assumptions are violated, per-block centering can distort the relationships between blocks, even if there were no genuine inter-block differences to begin with. Global centering avoids any distortion while mitigating the impact of uninteresting inter-block differences on the scores. Any remaining differences can be corrected by processing the scores with more sophisticated batch correction methods like MNN correction.
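The two centering modes can be compared on the scores of a single PC with the sketch below. It is an unweighted illustration under our own assumptions; `center_scores` and its `by_block` flag are hypothetical stand-ins for the behavior controlled by BlockedPcaOptions::center_scores_by_block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Center the scores of one PC, either globally or within each block.
// (Hypothetical helper; block weights are ignored.)
std::vector<double> center_scores(
    std::vector<double> scores,
    const std::vector<int>& block,
    int num_blocks,
    bool by_block)
{
    assert(scores.size() == block.size());

    // With global centering, all cells are treated as a single group.
    int num_groups = by_block ? num_blocks : 1;
    std::vector<double> sum(num_groups, 0.0);
    std::vector<std::size_t> count(num_groups, 0);
    for (std::size_t i = 0; i < scores.size(); ++i) {
        assert(block[i] >= 0 && block[i] < num_blocks);
        int g = by_block ? block[i] : 0;
        sum[g] += scores[i];
        ++count[g];
    }

    // Subtract the mean of the group to which each cell belongs.
    for (std::size_t i = 0; i < scores.size(); ++i) {
        int g = by_block ? block[i] : 0;
        scores[i] -= sum[g] / static_cast<double>(count[g]);
    }
    return scores;
}
```

For scores of 1, 3, 10 and 14 with the first two cells in one block, global centering gives -6, -4, 3 and 7, so the offset between blocks is preserved. Per-block centering gives -1, 1, -2 and 2, which removes the offset whether or not it was biologically meaningful.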
Internally, blocked_pca() defers the residual calculation until the matrix multiplication steps within IRLBA. This yields the same results as the naive calculation of residuals but is much faster as it can take advantage of efficient sparse operations.
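The idea behind deferring the residual calculation can be sketched for one gene as follows. The product of the residuals with a vector equals the product of the original (sparse) values with that vector, minus a small correction involving the block means, so the dense residuals never need to be formed. The types and functions below are hypothetical and only illustrate the identity; they do not reproduce the library's internals or the exact multiplications performed by IRLBA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One gene (row) of a sparse matrix: only non-zero entries are stored.
// (Hypothetical type for illustration.)
struct SparseRow {
    std::vector<std::size_t> index; // cell index of each non-zero entry
    std::vector<double> value;      // corresponding expression value
};

// Sum the entries of 'v' within each block. This only needs to be done
// once per multiplication and can be shared across all genes.
std::vector<double> block_sums(const std::vector<double>& v, const std::vector<int>& block, int num_blocks) {
    assert(v.size() == block.size());
    std::vector<double> sums(num_blocks, 0.0);
    for (std::size_t i = 0; i < v.size(); ++i) {
        assert(block[i] >= 0 && block[i] < num_blocks);
        sums[block[i]] += v[i];
    }
    return sums;
}

// Dot product between the residuals of one gene and 'v', computed as
// sum_c x[c] * v[c] - sum_b mean[b] * vsum[b] without dense residuals.
double residual_dot(
    const SparseRow& row,
    const std::vector<double>& block_means,
    const std::vector<double>& v,
    const std::vector<double>& v_block_sums)
{
    assert(row.index.size() == row.value.size());
    assert(block_means.size() == v_block_sums.size());

    // Sparse part: only the non-zero entries are visited.
    double total = 0.0;
    for (std::size_t k = 0; k < row.index.size(); ++k) {
        assert(row.index[k] < v.size());
        total += row.value[k] * v[row.index[k]];
    }

    // Deferred centering: one term per block instead of one per cell.
    for (std::size_t b = 0; b < block_means.size(); ++b) {
        total -= block_means[b] * v_block_sums[b];
    }
    return total;
}
```

The cost per gene is proportional to the number of non-zero entries plus the number of blocks, rather than the number of cells.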
- Template Parameters
-
| Value_ | Type of the matrix data. |
| Index_ | Integer type of the indices. |
| Block_ | Integer type of the blocking factor. |
| EigenMatrix_ | A floating-point column-major Eigen::Matrix class. |
| EigenVector_ | A floating-point Eigen::Vector class. |
- Parameters
-
| [in] | mat | Input matrix. Columns should contain cells while rows should contain genes. Matrix entries are typically log-expression values. |
| [in] | block | Pointer to an array of length equal to the number of cells, containing the block assignment for each cell. Each assignment should be an integer in \([0, N)\) where \(N\) is the number of blocks. |
| [in] | options | Further options. |
| [out] | output | On output, the results of the PCA on the residuals. This can be re-used across multiple calls to blocked_pca(). |
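Block assignments are often available as arbitrary labels rather than integers in \([0, N)\). The sketch below converts string labels into the expected encoding; `BlockIds` and `factorize_blocks` are hypothetical helpers, not part of the library.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Integer block assignments plus the number of distinct blocks.
// (Hypothetical type for illustration.)
struct BlockIds {
    std::vector<int> ids; // one entry per cell, each in [0, num_blocks)
    int num_blocks;
};

// Assign consecutive integer codes to labels in order of first appearance.
BlockIds factorize_blocks(const std::vector<std::string>& labels) {
    BlockIds out;
    out.num_blocks = 0;
    out.ids.reserve(labels.size());

    std::unordered_map<std::string, int> seen;
    for (const auto& label : labels) {
        auto it = seen.find(label);
        if (it == seen.end()) {
            // New label: give it the next unused code.
            it = seen.emplace(label, out.num_blocks).first;
            ++out.num_blocks;
        }
        out.ids.push_back(it->second);
    }

    assert(out.ids.size() == labels.size());
    return out;
}
```

The resulting `ids.data()` pointer could then be supplied as the `block` argument, provided that the number of labels equals the number of cells (columns) in the input matrix.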
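template<typename Value_ , typename Index_ , typename EigenMatrix_ , class EigenVector_ >
void simple_pca (const tatami::Matrix< Value_, Index_ > &mat, const SimplePcaOptions< EigenVector_ > &options, SimplePcaResults< EigenMatrix_, EigenVector_ > &output)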
Principal components analysis (PCA) for compression and denoising of single-cell expression data.
We assume that most variation in the dataset is driven by biological differences between subpopulations, which produce coordinated changes across multiple genes in the same pathways. In contrast, technical noise is random and not synchronized across any one axis in the high-dimensional space. This suggests that the earlier principal components (PCs) should be enriched for biological heterogeneity while the later PCs capture random noise.
Our aim is to reduce the size of the data and eliminate noise by only using the earlier PCs for downstream cell-based analyses (e.g., neighbor detection, clustering). Most practitioners will keep the first 10-50 PCs, though the exact choice is fairly arbitrary; see SimplePcaOptions::number to specify the number of PCs. As we are only interested in the top PCs, we can use approximate algorithms for faster computation, in particular IRLBA.
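To illustrate why extracting only the top PCs can be cheaper than a full decomposition, the sketch below finds the leading rotation vector of a small gene-gene covariance matrix by power iteration. Note that this is plain power iteration, a much simpler technique than IRLBA, and is shown only to convey the general idea of iterating with matrix-vector products. The `TopComponent` type and `power_iteration` function are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double> >;

// Leading eigenvector and eigenvalue of a covariance matrix.
// (Hypothetical type for illustration.)
struct TopComponent {
    std::vector<double> rotation; // unit-length leading eigenvector
    double variance;              // corresponding eigenvalue
};

// Power iteration on a symmetric positive semi-definite matrix.
TopComponent power_iteration(const Matrix& cov, int iterations) {
    std::size_t n = cov.size();
    assert(n > 0);

    // Start from a unit-length vector with equal entries.
    std::vector<double> v(n, 1.0 / std::sqrt(static_cast<double>(n)));
    std::vector<double> next(n);
    double eigenvalue = 0.0;

    for (int it = 0; it < iterations; ++it) {
        // Multiply the current vector by the covariance matrix.
        double sumsq = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            assert(cov[i].size() == n);
            next[i] = 0.0;
            for (std::size_t j = 0; j < n; ++j) {
                next[i] += cov[i][j] * v[j];
            }
            sumsq += next[i] * next[i];
        }

        double norm = std::sqrt(sumsq);
        if (norm == 0.0) {
            break; // the vector lies in the null space, so we stop here.
        }

        // For a unit-length v, the norm of cov * v converges to the
        // largest eigenvalue as v converges to the leading eigenvector.
        eigenvalue = norm;
        for (std::size_t i = 0; i < n; ++i) {
            v[i] = next[i] / norm;
        }
    }

    return TopComponent{v, eigenvalue};
}
```

Each iteration only requires a matrix-vector product, which is why methods in this family scale well when a handful of PCs are needed from a large matrix.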
- Template Parameters
-
| Value_ | Type of the matrix data. |
| Index_ | Integer type of the indices. |
| EigenMatrix_ | A floating-point column-major Eigen::Matrix class. |
| EigenVector_ | A floating-point Eigen::Vector class. |
- Parameters
-
| [in] | mat | The input matrix. Columns should contain cells while rows should contain genes. Matrix entries are typically log-expression values. |
| [in] | options | Further options. |
| [out] | output | On output, the results of the PCA on mat. This can be re-used across multiple calls to simple_pca(). |
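template<typename Value_ , typename Index_ , typename SubsetVector_ , typename EigenMatrix_ , class EigenVector_ >
void subset_pca (const tatami::Matrix< Value_, Index_ > &mat, const SubsetVector_ &subset, const SubsetPcaOptions< EigenVector_ > &options, SubsetPcaResults< EigenMatrix_, EigenVector_ > &output)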
Principal components analysis on a subset of features.
This function performs PCA on a subset of features (e.g., from highly variable genes) in the input matrix. The results are almost equivalent to subsetting the input matrix and then running simple_pca(). However, subset_pca() will also populate the rotation matrix, centering vector and scaling vector for features outside of the subset. For the rotation matrix, this is done by projecting the unused features into the low-dimensional space defined by the top PCs. The goal is to allow callers to create a low-rank approximation of the entire input matrix, even if only a subset of the features is relevant to the PCA.
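One plausible way to project an unused feature into the PC space is to regress its centered expression values on the PC scores. The sketch below does this under the assumption that the score vectors of different PCs are mutually orthogonal, as is the case for scores obtained from a singular value decomposition. It is our own illustration with a hypothetical `project_feature` helper; the library's exact procedure, including its handling of scaling, is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compute the rotation-matrix row (loadings) for one feature that was not
// used in the PCA. components[k] holds the scores of PC k for every cell.
// (Hypothetical helper; assumes mutually orthogonal score vectors.)
std::vector<double> project_feature(
    const std::vector<double>& centered_expr,
    const std::vector<std::vector<double> >& components)
{
    std::vector<double> loadings(components.size(), 0.0);
    for (std::size_t k = 0; k < components.size(); ++k) {
        const std::vector<double>& pc = components[k];
        assert(pc.size() == centered_expr.size());

        // Least-squares coefficient of the feature on this PC's scores.
        double cross = 0.0, self = 0.0;
        for (std::size_t c = 0; c < pc.size(); ++c) {
            cross += centered_expr[c] * pc[c];
            self += pc[c] * pc[c];
        }
        if (self > 0.0) {
            loadings[k] = cross / self;
        }
    }
    return loadings;
}
```

If a feature is an exact linear combination of the PC scores, e.g., twice the first PC plus half the second, this recovers loadings of 2 and 0.5. Multiplying the loadings by the scores then gives the low-rank approximation of that feature's centered expression.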
- Template Parameters
-
| Value_ | Type of the matrix data. |
| Index_ | Integer type of the indices. |
| SubsetVector_ | Container of the row indices. Should support [], size() and copy construction. |
| EigenMatrix_ | A floating-point column-major Eigen::Matrix class. |
| EigenVector_ | A floating-point Eigen::Vector class. |
- Parameters
-
| [in] | mat | The input matrix. Columns should contain cells while rows should contain genes. Matrix entries are typically log-expression values. |
| [in] | subset | Vector of indices for rows to be used in the PCA. This should be sorted and unique. |
| [in] | options | Further options. |
| [out] | output | On output, the results of the PCA on mat. This can be re-used across multiple calls to subset_pca(). |
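As the subset is expected to be sorted and unique, callers may wish to sanitize their indices beforehand. A minimal sketch is shown below, where `sanitize_subset` is a hypothetical helper and `num_rows` is assumed to be the number of rows (genes) in the input matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sort the row indices and remove duplicates.
// (Hypothetical helper for illustration.)
std::vector<int> sanitize_subset(std::vector<int> subset, int num_rows) {
    std::sort(subset.begin(), subset.end());
    subset.erase(std::unique(subset.begin(), subset.end()), subset.end());

    // After sorting, checking both ends is enough to validate the range.
    assert(subset.empty() || (subset.front() >= 0 && subset.back() < num_rows));
    return subset;
}
```

A `std::vector<int>` supports `[]`, `size()` and copy construction, so the sanitized vector meets the stated requirements for `SubsetVector_`.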
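template<typename Value_ , typename Index_ , class SubsetVector_ , typename Block_ , typename EigenMatrix_ , class EigenVector_ >
void subset_pca_blocked (const tatami::Matrix< Value_, Index_ > &mat, const SubsetVector_ &subset, const Block_ *block, const SubsetPcaBlockedOptions< EigenVector_ > &options, SubsetPcaBlockedResults< EigenMatrix_, EigenVector_ > &output)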
Principal components analysis on a subset of features in the input matrix, with blocking.
This function performs PCA on a subset of interesting features (e.g., from highly variable genes) while accounting for a blocking factor. The results are almost equivalent to subsetting the input matrix and then running blocked_pca(). However, subset_pca_blocked() will also populate the rotation matrix, centering matrix and scaling vector for features outside of the subset. For the rotation matrix, this is done by projecting the unused features into the low-dimensional space defined by the top PCs. The goal is to allow callers to create a low-rank approximation of the entire input matrix, even if only a subset of the features is relevant to the PCA.
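The low-rank approximation mentioned above can be sketched for a single matrix entry as follows. This is a hedged illustration that makes several assumptions not confirmed here: the scores are centered within each block, block weights are ignored, the centering matrix is laid out with one row per block and one column per gene, and the rotation matrix has one row per gene. The `reconstruct_entry` function and all of its argument layouts are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double> >;

// Approximate the value of one gene in one cell from the PCA results.
// Assumed layouts (hypothetical): rotation[gene][pc], components[pc][cell],
// center[block][gene], scale[gene].
double reconstruct_entry(
    std::size_t gene,
    std::size_t cell,
    const Matrix& rotation,
    const Matrix& components,
    const Matrix& center,
    const std::vector<double>& scale,
    const std::vector<int>& block)
{
    assert(gene < rotation.size() && gene < scale.size());
    assert(cell < block.size());
    assert(rotation[gene].size() == components.size());

    // Low-rank part: loadings of this gene times the scores of this cell.
    double low_rank = 0.0;
    for (std::size_t k = 0; k < components.size(); ++k) {
        assert(cell < components[k].size());
        low_rank += rotation[gene][k] * components[k][cell];
    }

    // Undo the scaling and add back the mean of the cell's block.
    assert(block[cell] >= 0);
    std::size_t b = static_cast<std::size_t>(block[cell]);
    assert(b < center.size() && gene < center[b].size());
    return center[b][gene] + scale[gene] * low_rank;
}
```

If no scaling was applied during the PCA, the `scale` vector can be filled with ones so that only the centering is reversed.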
- Template Parameters
-
| Value_ | Type of the matrix data. |
| Index_ | Integer type of the indices. |
| SubsetVector_ | Container of the row indices. Should support [], size() and copy construction. |
| Block_ | Integer type of the blocking factor. |
| EigenMatrix_ | A floating-point column-major Eigen::Matrix class. |
| EigenVector_ | A floating-point Eigen::Vector class. |
- Parameters
-
| [in] | mat | The input matrix. Columns should contain cells while rows should contain genes. Matrix entries are typically log-expression values. |
| [in] | subset | Vector of indices for rows to be used in the PCA. This should be sorted and unique. |
| [in] | block | Pointer to an array of length equal to the number of cells, containing the block assignment for each cell. Each assignment should be an integer in \([0, N)\) where \(N\) is the number of blocks. |
| [in] | options | Further options. |
| [out] | output | On output, the results of the PCA on mat. This can be re-used across multiple calls to subset_pca_blocked(). |