mumosa
Multi-modal analyses of single-cell data
Scale multi-modal embeddings to adjust for differences in variance.
Classes | |
| class | BlockedIndicesFactory |
| Factory for creating nearest-neighbor search indices for each block. | |
| struct | BlockedOptions |
Options for compute_distance_blocked(). | |
| struct | BlockedWorkspace |
Workspace for compute_distance_blocked(). | |
| struct | Options |
Options for compute_distance(). | |
Functions | |
| template<typename Distance_ , typename Index_ > | |
| BlockedWorkspace< Distance_ > | create_workspace (const std::vector< Index_ > &block_sizes, const BlockedOptions &options) |
| template<typename Index_ , typename Input_ , typename Distance_ > | |
| std::pair< Distance_, Distance_ > | compute_distance_blocked (const std::vector< std::shared_ptr< const knncolle::Prebuilt< Index_, Input_, Distance_ > > > &prebuilts, BlockedWorkspace< Distance_ > &workspace, const BlockedOptions &options) |
| template<typename Index_ , typename Input_ , typename Distance_ , class Matrix_ = knncolle::Matrix<Index_, Input_>> | |
| std::vector< std::shared_ptr< const knncolle::Prebuilt< Index_, Input_, Distance_ > > > | build_blocked_indices (const std::size_t num_dim, const std::vector< Index_ > block_sizes, const Input_ *const data, const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > &builder) |
| template<typename Index_ , typename Input_ , typename Distance_ , class Matrix_ = knncolle::Matrix<Index_, Input_>> | |
| std::pair< Distance_, Distance_ > | compute_distance_blocked (const std::size_t num_dim, const std::vector< Index_ > &block_sizes, const Input_ *const data, const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > &builder, const BlockedOptions &options) |
| template<typename Index_ , typename Input_ , typename Block_ , typename Distance_ , class Matrix_ = knncolle::Matrix<Index_, Input_>> | |
| std::pair< Distance_, Distance_ > | compute_distance_blocked (const std::size_t num_dim, const Index_ num_cells, const Input_ *const data, const Block_ *const block, const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > &builder, const BlockedOptions &options) |
| template<typename Index_ , typename Input_ , typename Scale_ , typename Output_ > | |
| void | combine_scaled_embeddings (const std::vector< std::size_t > &num_dims, const Index_ num_cells, const std::vector< Input_ * > &embeddings, const std::vector< Scale_ > &scaling, Output_ *const output) |
| template<typename Distance_ > | |
| Distance_ | compute_scale (const std::pair< Distance_, Distance_ > &ref, const std::pair< Distance_, Distance_ > &target) |
| template<typename Distance_ > | |
| std::vector< Distance_ > | compute_scale (const std::vector< std::pair< Distance_, Distance_ > > &distances) |
| template<typename Index_ , typename Distance_ > | |
| std::pair< Distance_, Distance_ > | compute_distance (const Index_ num_cells, Distance_ *const distances) |
| template<typename Index_ , typename Input_ , typename Distance_ > | |
| std::pair< Distance_, Distance_ > | compute_distance (const knncolle::Prebuilt< Index_, Input_, Distance_ > &prebuilt, Distance_ *const distances, const Options &options) |
| template<typename Index_ , typename Input_ , typename Distance_ , class Matrix_ = knncolle::Matrix<Index_, Input_>> | |
| std::pair< Distance_, Distance_ > | compute_distance (const std::size_t num_dim, const Index_ num_cells, const Input_ *const data, const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > &builder, const Options &options) |
Scale multi-modal embeddings to adjust for differences in variance.
| BlockedWorkspace< Distance_ > mumosa::create_workspace | ( | const std::vector< Index_ > & | block_sizes, |
| const BlockedOptions & | options ) |
| Index_ | Integer type of the number of cells. |
| Distance_ | Floating-point type of the distances. |
| block_sizes | Vector of length equal to the number of blocks, containing the number of observations in each block. |
| options | Further options. |
compute_distance_blocked() calls with the same block_sizes.
| std::pair< Distance_, Distance_ > mumosa::compute_distance_blocked | ( | const std::vector< std::shared_ptr< const knncolle::Prebuilt< Index_, Input_, Distance_ > > > & | prebuilts, |
| BlockedWorkspace< Distance_ > & | workspace, | ||
| const BlockedOptions & | options ) |
NOTES:
The local neighborhood variance can be considered as the variance within a particular region of the high-dimensional space. The expectation of this variance should not be affected by the number of cells, but the distance to the neighbors will be affected if the density of cells changes.
We do not apply block-specific scaling factors as we don't want to alter the relative values within the same modality. We shouldn't have to do it in the first place - as it's the same modality! - but more importantly, we could introduce spurious differences between blocks. In the simplest case, two blocks have the same subpopulation structure but the number of cells is different. We would get different distances in each block due to density, causing us to scale each block differently. More generally, we could expect differences in subpopulation structure between blocks, leading to different distances even in the absence of any batch effects. (Mind you, differences in subpopulation structure also interfere with accurate scaling between modalities, but any errors in scaling modalities are much less obvious than those from scaling blocks.)
Systematic differences between blocks can artificially inflate the distances to the nearest neighbors within a modality's embedding. Specifically, strong batch effects can reduce the density of the local neighborhood by shifting cells elsewhere. This increases the distance to the nearest neighbors compared to a modality without any batch effects, even if the variance in the local neighborhood is the same between modalities. If the magnitude of the batch effects differs between modalities, this may introduce spurious differences in the median distance-to-neighbor.
To improve accuracy in the presence of blocks, this function calls compute_distance() on each entry of prebuilts separately. It then computes a weighted average of the median distances and RMSDs across blocks (see scran_blocks::compute_weights() for details). This ensures that arbitrary shifts in location between blocks have no effect on the distances to the nearest neighbors for each modality.
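The per-block averaging described above can be illustrated with a minimal sketch. The helper name sketch_average_blocks() is hypothetical and not part of mumosa, and the weights are assumed to be supplied by the caller (for example, zero for empty blocks); how the library derives its weights is delegated to scran_blocks::compute_weights() and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of mumosa): collapse per-block
// (median, RMSD) pairs into a single pair via a weighted average.
// The weights are an assumption of this sketch and are passed in directly.
template<typename Distance_>
std::pair<Distance_, Distance_> sketch_average_blocks(
    const std::vector<std::pair<Distance_, Distance_> >& per_block,
    const std::vector<Distance_>& weights)
{
    assert(per_block.size() == weights.size());

    Distance_ total = 0, median_sum = 0, rmsd_sum = 0;
    for (std::size_t b = 0; b < per_block.size(); ++b) {
        median_sum += weights[b] * per_block[b].first;
        rmsd_sum += weights[b] * per_block[b].second;
        total += weights[b];
    }

    // With no usable blocks there is nothing to average; returning zeros
    // is a choice made for this sketch only.
    if (total == 0) {
        return std::pair<Distance_, Distance_>(0, 0);
    }
    return std::pair<Distance_, Distance_>(median_sum / total, rmsd_sum / total);
}
```

Because each block contributes only its own within-block distances, shifting one block's coordinates by a constant offset leaves every input to this average unchanged.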
| Index_ | Integer type of the number of cells. |
| Input_ | Numeric type of the input data used to build the search index. This is only required to define the knncolle::Prebuilt class and is otherwise ignored. |
| Distance_ | Floating-point type of the distances. |
| prebuilts | Vector of length equal to the number of blocks. Each entry contains a prebuilt neighbor search index for a single block. A block with no observations may be represented by a null pointer. |
| workspace | Workspace object, constructed with block sizes that match the number of observations in each entry of prebuilts. This can be re-used across multiple compute_distance_blocked() calls with the same block sizes. |
| options | Further options. |
compute_scale().
| std::vector< std::shared_ptr< const knncolle::Prebuilt< Index_, Input_, Distance_ > > > mumosa::build_blocked_indices | ( | const std::size_t | num_dim, |
| const std::vector< Index_ > | block_sizes, | ||
| const Input_ *const | data, | ||
| const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > & | builder ) |
Build nearest-neighbor search indices from an embedding where cells from the same block occupy contiguous columns.
| Index_ | Integer type of the number of cells. |
| Input_ | Numeric type of the input data. |
| Distance_ | Floating-point type of the distances. |
| Matrix_ | Class of the input data matrix for the neighbor search. This should satisfy the knncolle::Matrix interface. |
| num_dim | Number of dimensions in the embedding. | |
| block_sizes | Number of cells in each block. | |
| [in] | data | Pointer to an array containing the embedding matrix for a modality. This should be stored in column-major layout where each row is a dimension and each column is a cell. The number of rows should be equal to num_dim and the number of columns should be equal to the sum of block_sizes. Cells from the first block should be stored in the first block_sizes[0] columns, cells from the second block should be stored in the next block_sizes[1] columns, and so on. |
| builder | Algorithm to use for the neighbor search. |
compute_distance_blocked(). Empty blocks will be represented by null pointers.
| std::pair< Distance_, Distance_ > mumosa::compute_distance_blocked | ( | const std::size_t | num_dim, |
| const std::vector< Index_ > & | block_sizes, | ||
| const Input_ *const | data, | ||
| const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > & | builder, | ||
| const BlockedOptions & | options ) |
Overload of compute_distance_blocked() that accepts an embedding matrix with contiguous blocks.
| Index_ | Integer type of the number of cells. |
| Input_ | Numeric type of the input data. |
| Distance_ | Floating-point type of the distances. |
| Matrix_ | Class of the input data matrix for the neighbor search. This should satisfy the knncolle::Matrix interface. |
| num_dim | Number of dimensions in the embedding. | |
| block_sizes | Number of cells in each block. | |
| [in] | data | Pointer to an array containing the embedding matrix for a modality. This should be stored in column-major layout where each row is a dimension and each column is a cell, see build_blocked_indices() for details. |
| builder | Algorithm to use for the neighbor search. | |
| options | Further options. |
compute_scale().
| std::pair< Distance_, Distance_ > mumosa::compute_distance_blocked | ( | const std::size_t | num_dim, |
| const Index_ | num_cells, | ||
| const Input_ *const | data, | ||
| const Block_ *const | block, | ||
| const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > & | builder, | ||
| const BlockedOptions & | options ) |
Overload of compute_distance_blocked() that accepts an embedding matrix with non-contiguous block assignments.
| Index_ | Integer type of the number of cells. |
| Input_ | Numeric type of the input data. |
| Distance_ | Floating-point type of the distances. |
| Matrix_ | Class of the input data matrix for the neighbor search. This should satisfy the knncolle::Matrix interface. |
| num_dim | Number of dimensions in the embedding. | |
| num_cells | Number of cells. | |
| [in] | data | Pointer to an array containing the embedding matrix for a modality. This should be stored in column-major layout where each row is a dimension and each column is a cell. The number of rows and columns should be equal to num_dim and num_cells, respectively. |
| block | Pointer to an array of length equal to num_cells, containing the block assignment for each column of data. Each value should be a non-negative integer in \([0, B)\) where \(B\) is the number of blocks. | |
| builder | Algorithm to use for the neighbor search. | |
| options | Further options. |
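To show how non-contiguous block assignments relate to the contiguous layout expected by build_blocked_indices(), the following sketch counts the cells in each block and copies columns so that cells from the same block become adjacent. Both helper names are hypothetical, and whether the library performs this rearrangement internally is not stated here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: count the number of cells assigned to each block.
// Block IDs are assumed to be non-negative integers in [0, B).
template<typename Index_, typename Block_>
std::vector<Index_> sketch_block_sizes(const Index_ num_cells, const Block_* const block) {
    std::vector<Index_> sizes;
    for (Index_ c = 0; c < num_cells; ++c) {
        const std::size_t b = static_cast<std::size_t>(block[c]);
        if (b >= sizes.size()) {
            sizes.resize(b + 1);
        }
        ++sizes[b];
    }
    return sizes;
}

// Hypothetical helper: copy the columns of a column-major matrix so that
// all cells of block 0 come first, then all cells of block 1, and so on.
// The original order of cells is preserved within each block.
template<typename Index_, typename Input_, typename Block_>
std::vector<Input_> sketch_make_contiguous(
    const std::size_t num_dim,
    const Index_ num_cells,
    const Input_* const data,
    const Block_* const block,
    const std::vector<Index_>& sizes)
{
    // Column index at which the next cell of each block will be written.
    std::vector<std::size_t> next(sizes.size());
    std::size_t offset = 0;
    for (std::size_t b = 0; b < sizes.size(); ++b) {
        next[b] = offset;
        offset += static_cast<std::size_t>(sizes[b]);
    }

    std::vector<Input_> out(num_dim * static_cast<std::size_t>(num_cells));
    for (Index_ c = 0; c < num_cells; ++c) {
        const std::size_t b = static_cast<std::size_t>(block[c]);
        assert(b < sizes.size());
        const std::size_t dest = next[b]++;
        std::copy_n(data + static_cast<std::size_t>(c) * num_dim, num_dim, out.data() + dest * num_dim);
    }
    return out;
}
```

The resulting sizes and rearranged matrix have the shape described for block_sizes and data in build_blocked_indices().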
compute_scale().
| void mumosa::combine_scaled_embeddings | ( | const std::vector< std::size_t > & | num_dims, |
| const Index_ | num_cells, | ||
| const std::vector< Input_ * > & | embeddings, | ||
| const std::vector< Scale_ > & | scaling, | ||
| Output_ *const | output ) |
Scale the embedding for each modality and combine all embeddings from different modalities into a single matrix for further analyses. Each cell in the combined matrix will contain a concatenation of the scaled coordinates from all of the individual embeddings.
| Index_ | Integer type of the number of cells. |
| Input_ | Floating-point type of the input data. |
| Scale_ | Floating-point type of the scaling factor. |
| Output_ | Floating-point type of the output data. |
| num_dims | Vector containing the number of dimensions in each embedding. | |
| num_cells | Number of cells in each embedding. | |
| embeddings | Vector of pointers of length equal to that of num_dims. Each pointer refers to an array containing an embedding matrix for a single modality, which should be in column-major format with dimensions in rows and cells in columns. The number of rows of the i-th matrix should be equal to num_dims[i] and the number of columns should be equal to num_cells. | |
| scaling | Scaling to apply to each embedding, usually from compute_scale(). This should be of length equal to that of num_dims. | |
| [out] | output | Pointer to the output array. This should be of length equal to the product of num_cells and the sum of num_dims. On completion, output is filled with the combined embeddings in column-major format. Each row corresponds to a dimension while each column corresponds to a cell. |
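The layout of the combined matrix can be sketched as follows. This is an illustrative re-implementation under the column-major conventions described above, not the library's code; sketch_combine() is a hypothetical name, and any special handling of non-finite scaling factors is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: scale each column-major embedding and stack them
// row-wise, so that each output column (cell) holds the concatenation of
// that cell's scaled coordinates from every embedding.
template<typename Index_, typename Input_, typename Scale_, typename Output_>
void sketch_combine(
    const std::vector<std::size_t>& num_dims,
    const Index_ num_cells,
    const std::vector<Input_*>& embeddings,
    const std::vector<Scale_>& scaling,
    Output_* const output)
{
    assert(embeddings.size() == num_dims.size());
    assert(scaling.size() == num_dims.size());

    // Number of rows in the combined matrix.
    std::size_t total_dim = 0;
    for (const auto d : num_dims) {
        total_dim += d;
    }

    std::size_t row_offset = 0; // first output row used by the current embedding.
    for (std::size_t e = 0; e < num_dims.size(); ++e) {
        const std::size_t nd = num_dims[e];
        for (Index_ c = 0; c < num_cells; ++c) {
            const Input_* const in = embeddings[e] + static_cast<std::size_t>(c) * nd;
            Output_* const out = output + static_cast<std::size_t>(c) * total_dim + row_offset;
            for (std::size_t d = 0; d < nd; ++d) {
                out[d] = in[d] * scaling[e];
            }
        }
        row_offset += nd;
    }
}
```

For two embeddings with 2 and 1 dimensions, each output column has 3 rows: the first two come from the first embedding and the third from the second.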
| Distance_ mumosa::compute_scale | ( | const std::pair< Distance_, Distance_ > & | ref, |
| const std::pair< Distance_, Distance_ > & | target ) |
Compute the scaling factor to be applied to an embedding of a "target" modality relative to a "reference" modality. The aim is to scale the target so that the average variance in the local neighborhood is equal to that of the reference, to ensure that high noise in one modality does not drown out interesting biology in another modality in downstream analyses.
This approach assumes that the median distance to the Options::num_neighbors-th nearest neighbor is proportional to the neighborhood variance. The scaling factor is defined as the ratio of the median distance in the reference to the median distance in the target. If either of the median distances is zero, this function instead returns the ratio of the RMSDs as a fallback.
Advanced users may want to scale the target so that its variance is some \(S\)-fold of the reference, e.g., to give more weight to more important modalities. This can be achieved by multiplying the returned factor by \(\sqrt{S}\) prior to the actual scaling.
| Distance_ | Floating-point type of the distances. |
| ref | Results of compute_distance() for the embedding of the reference modality. The first value contains the median distance while the second value contains the root-mean-squared distance (RMSD). |
| target | Results of compute_distance() for the embedding of the target modality. |
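The rule described above (ratio of medians, with the ratio of RMSDs as a fallback) and the \(\sqrt{S}\) adjustment can be sketched as below. Both function names are hypothetical. The behavior when the target RMSD is also zero is not covered by the description, so this sketch simply performs the division in that case.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical sketch of the documented rule. Each pair holds
// (median distance, RMSD) for one modality.
template<typename Distance_>
Distance_ sketch_compute_scale(
    const std::pair<Distance_, Distance_>& ref,
    const std::pair<Distance_, Distance_>& target)
{
    if (ref.first == 0 || target.first == 0) {
        // Fallback: ratio of RMSDs when either median is zero.
        // A zero target RMSD is not handled specially in this sketch.
        return ref.second / target.second;
    }
    return ref.first / target.first;
}

// Hypothetical wrapper: scale the target so that its neighborhood variance
// is S-fold that of the reference. Coordinates scale with the square root
// of the variance, hence the multiplication by sqrt(S).
template<typename Distance_>
Distance_ sketch_weighted_scale(
    const std::pair<Distance_, Distance_>& ref,
    const std::pair<Distance_, Distance_>& target,
    const Distance_ S)
{
    assert(S > 0);
    return sketch_compute_scale(ref, target) * std::sqrt(S);
}
```

For example, a reference median of 2 and a target median of 4 give a factor of 0.5, so the target coordinates are halved; requesting S = 4 doubles that factor to 1.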
| std::vector< Distance_ > mumosa::compute_scale | ( | const std::vector< std::pair< Distance_, Distance_ > > & | distances | ) |
Compute the scaling factors for a group of embeddings, given the neighbor distances computed by compute_distance(). This aims to scale each embedding so that the neighborhood variances are equal across embeddings as described in compute_scale(). The "reference" modality is defined as the first embedding with a non-zero RMSD to ensure that the scaling is well-defined for every sample; other than this requirement, the exact choice of reference has no actual impact on the relative values of the scaling factors.
| Distance_ | Floating-point type of the distances. |
| distances | Vector of distances for embeddings, as computed by compute_distance() on each embedding. |
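The choice of reference can be sketched as follows. The names sketch_pair_scale() and sketch_compute_scales() are hypothetical. The description does not say what is returned when every embedding has a zero RMSD, so the all-zero result used here for that case is only a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical pairwise rule: ratio of medians, falling back to the ratio
// of RMSDs if either median is zero.
template<typename Distance_>
Distance_ sketch_pair_scale(
    const std::pair<Distance_, Distance_>& ref,
    const std::pair<Distance_, Distance_>& target)
{
    if (ref.first == 0 || target.first == 0) {
        return ref.second / target.second;
    }
    return ref.first / target.first;
}

// Hypothetical sketch: use the first embedding with a non-zero RMSD as the
// reference and scale every other embedding against it.
template<typename Distance_>
std::vector<Distance_> sketch_compute_scales(
    const std::vector<std::pair<Distance_, Distance_> >& distances)
{
    std::vector<Distance_> output(distances.size()); // zero-initialized.

    std::size_t ref = 0;
    bool found = false;
    for (std::size_t i = 0; i < distances.size(); ++i) {
        if (distances[i].second != 0) {
            ref = i;
            found = true;
            break;
        }
    }
    if (!found) {
        return output; // placeholder for the undocumented all-zero case.
    }

    for (std::size_t i = 0; i < distances.size(); ++i) {
        output[i] = (i == ref ? static_cast<Distance_>(1) : sketch_pair_scale(distances[ref], distances[i]));
    }
    return output;
}
```

Picking a different valid reference multiplies every factor by the same constant, which is why the relative values of the scaling factors are unaffected.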
distances, to be applied to each embedding. This is equivalent to running compute_scale() on each entry of distances against the chosen reference.
| std::pair< Distance_, Distance_ > mumosa::compute_distance | ( | const Index_ | num_cells, |
| Distance_ *const | distances ) |
| Index_ | Integer type of the number of cells. |
| Distance_ | Floating-point type of the distances. |
| num_cells | Number of cells. | |
| [in,out] | distances | Pointer to an array of length num_cells, containing the distance from each cell to its \(k\)-th nearest neighbor. It is expected that the same \(k\) was used for each cell. On output, the order of values may be arbitrarily altered during the median calculation; if this is undesirable, users should pass in a copy of the array. |
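The reason the input array may be reordered can be seen in a minimal sketch of the median and RMSD calculation. The name sketch_median_rmsd() is hypothetical, and averaging the two middle values for an even number of cells is an assumption of this sketch rather than documented behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>

// Hypothetical sketch: compute the median and root-mean-squared distance
// from an array of per-cell distances. std::nth_element() partially sorts
// the array in place, which is why the caller's values may be reordered.
template<typename Index_, typename Distance_>
std::pair<Distance_, Distance_> sketch_median_rmsd(const Index_ num_cells, Distance_* const distances) {
    assert(num_cells > 0);
    const std::size_t n = static_cast<std::size_t>(num_cells);

    // The RMSD does not depend on the order of the values.
    Distance_ sum_sq = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum_sq += distances[i] * distances[i];
    }
    const Distance_ rmsd = std::sqrt(sum_sq / static_cast<Distance_>(n));

    // Place the upper-middle value at position 'half'; everything before it
    // is no larger, everything after it is no smaller.
    const std::size_t half = n / 2;
    std::nth_element(distances, distances + half, distances + n);
    Distance_ median = distances[half];
    if (n % 2 == 0) {
        // Assumption: average the two middle values for even lengths.
        const Distance_ lower = *std::max_element(distances, distances + half);
        median = (median + lower) / 2;
    }

    return std::pair<Distance_, Distance_>(median, rmsd);
}
```

Callers that need the original ordering afterwards can pass a copy of the array, as noted in the parameter description.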
compute_scale().
| std::pair< Distance_, Distance_ > mumosa::compute_distance | ( | const knncolle::Prebuilt< Index_, Input_, Distance_ > & | prebuilt, |
| Distance_ *const | distances, | ||
| const Options & | options ) |
Overload of compute_distance() that accepts a pre-built neighbor search index.
| Index_ | Integer type of the number of cells. |
| Input_ | Numeric type of the input data used to build the search index. This is only required to define the knncolle::Prebuilt class and is otherwise ignored. |
| Distance_ | Floating-point type of the distances. |
| prebuilt | A prebuilt neighbor search index for a modality-specific embedding. | |
| [out] | distances | Pointer to an array of length prebuilt.num_observations(), containing the distance from each cell to its \(k\)-th nearest neighbor. This may not be ordered on output. |
| options | Further options. |
Options::num_neighbors-th nearest neighbor (first) and the root-mean-squared distance across all cells (second). These values can be used in compute_scale().
| std::pair< Distance_, Distance_ > mumosa::compute_distance | ( | const std::size_t | num_dim, |
| const Index_ | num_cells, | ||
| const Input_ *const | data, | ||
| const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > & | builder, | ||
| const Options & | options ) |
Overload of compute_distance() that accepts an embedding matrix.
| Index_ | Integer type of the number of cells. |
| Input_ | Numeric type of the input data. |
| Distance_ | Floating-point type of the distances. |
| Matrix_ | Class of the input data matrix for the neighbor search. This should satisfy the knncolle::Matrix interface. |
| num_dim | Number of dimensions in the embedding. | |
| num_cells | Number of cells in the embedding. | |
| [in] | data | Pointer to an array containing the embedding matrix for a modality. This should be stored in column-major layout where each row is a dimension and each column is a cell. |
| builder | Algorithm to use for the neighbor search. | |
| options | Further options. |
Options::num_neighbors-th nearest neighbor (first) and the root-mean-squared distance across all cells (second). These values can be used in compute_scale().