scran_pca
Principal component analysis for single-cell data
Loading...
Searching...
No Matches
Principal components analysis, duh

Unit tests Documentation Codecov

Overview

As the name suggests, this repository implements functions to perform a PCA on the gene-by-cell expression matrix, returning low-dimensional coordinates for each cell that can be used for efficient downstream analyses, e.g., clustering, visualization. The code itself was originally derived from the scran and batchelor R packages, factored out into a separate C++ library for easier re-use.

Quick start

Given a tatami::Matrix, the scran_pca::simple_pca() function will compute the PCA to obtain a low-dimensional representation of the cells:

const tatami::Matrix<double, int>& mat = some_data_source();
// Take the top 20 PCs:
opt.number = 20;
auto res = scran_pca::simple_pca(mat, opt);
res.components; // rows are PCs, columns are cells.
res.rotation; // rows are genes, columns correspond to PCs.
res.variance_explained; // one per PC, in decreasing order.
res.total_variance; // total variance in the dataset.
void simple_pca(const tatami::Matrix< Value_, Index_ > &mat, const SimplePcaOptions &options, SimplePcaResults< EigenMatrix_, EigenVector_ > &output)
Definition simple_pca.hpp:430
Principal component analysis on single-cell data.
Options for simple_pca().
Definition simple_pca.hpp:28
int number
Definition simple_pca.hpp:45

Advanced users can fiddle with more of the options:

opt.scale = true;
opt.num_threads = 4;
opt.realize_matrix = false;
auto res2 = scran_pca::simple_pca(mat, opt);
bool realize_matrix
Definition simple_pca.hpp:64
int num_threads
Definition simple_pca.hpp:70
bool scale
Definition simple_pca.hpp:52

Check out the reference documentation for more details.

With blocking

In the presence of multiple blocks, we can perform the PCA on the residuals after regressing out the blocking factor. This ensures that the inter-block differences do not contribute to the first few PCs, instead favoring the representation of intra-block variation.

std::vector<int> blocks = some_blocks();
bopt.number = 10; // taking the top 10 PCs this time.
auto bres = scran_pca::blocked_pca(mat, blocks.data(), bopt);
bres.components; // rows are PCs, columns are cells.
bres.center; // rows are blocks, columns are genes.
void blocked_pca(const tatami::Matrix< Value_, Index_ > &mat, const Block_ *block, const BlockedPcaOptions &options, BlockedPcaResults< EigenMatrix_, EigenVector_ > &output)
Definition blocked_pca.hpp:1134
Options for blocked_pca().
Definition blocked_pca.hpp:30
int number
Definition blocked_pca.hpp:47

The components derived from the residuals will only be free of inter-block differences under certain conditions (equal population composition with a consistent shift between blocks). If this is not the case, more sophisticated batch correction methods are required such as MNN correction. If those methods accept a low-dimensional representation for the cells as input, we can use scran_pca::blocked_pca() to obtain an appropriate matrix that focuses on intra-block variation without making assumptions about the inter-block differences:

auto bres2 = scran_pca::blocked_pca(mat, blocks.data(), bopt);
bool components_from_residuals
Definition blocked_pca.hpp:85

Feature subsets

If we have only a subset of features of interest, the obvious approach is to subset the input matrix like so:

std::vector<int> subset{ 0, 1, 10, 1000};
auto sumat = tatami::make_DelayedSubset(mat, subset);
auto subres = scran_pca::simple_pca(*submat, opt);
subres.rotation; // has rows equal to subset.size()
std::shared_ptr< Matrix< Value_, Index_ > > make_DelayedSubset(std::shared_ptr< const Matrix< Value_, Index_ > > matrix, SubsetStorage_ subset, const bool by_row)

This is fine for the PC scores but will only report the rotation matrix and centering/scaling vectors for the subset of features. If we want to, say, create a low-rank approximation of the entire input matrix, we should instead do:

auto subres2 = scran_pca::subset_pca(mat, subset, opt);
subres2.rotation; // has rows equal to mat.nrow()
void subset_pca(const tatami::Matrix< Value_, Index_ > &mat, const SubsetVector_ &subset, const SubsetPcaOptions &options, SubsetPcaResults< EigenMatrix_, EigenVector_ > &output)
Definition subset_pca.hpp:147

This returns a rotation matrix that contains entries for all features, not just those in the subset of interest. We can then easily compute the low-rank approximation for any feature in our input matrix:

int feat_of_interest = 99;
Eigen::VectorXd approximated = subres2.rotation.row(feat_of_interest) * subres2.components;

Building projects

CMake with FetchContent

If you're using CMake, you just need to add something like this to your CMakeLists.txt:

include(FetchContent)
FetchContent_Declare(
scran_pca
GIT_REPOSITORY https://github.com/libscran/scran_pca
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(scran_pca)

Then you can link to scran_pca to make the headers available during compilation:

# For executables:
target_link_libraries(myexe libscran::scran_pca)
# For libaries
target_link_libraries(mylib INTERFACE libscran::scran_pca)

CMake with find_package()

find_package(libscran_scran_pca CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE libscran::scran_pca)

To install the library, use:

mkdir build && cd build
cmake .. -DSCRAN_PCA_TESTS=OFF
cmake --build . --target install

By default, this will use FetchContent to fetch all external dependencies. If you want to install them manually, use -DSCRAN_PCA_FETCH_EXTERN=OFF. See the tags in extern/CMakeLists.txt to find compatible versions of each dependency.

Manual

If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This requires the external dependencies listed in extern/CMakeLists.txt.