mnncorrect
Batch correction with mutual nearest neighbors
mnncorrect Namespace Reference

Batch correction with mutual nearest neighbors.

Classes

struct  Details
 Correction details from compute().
 
struct  Options
 Options for compute().
 

Enumerations

enum class  ReferencePolicy : char { INPUT , MAX_SIZE , MAX_VARIANCE , MAX_RSS }
 

Functions

template<typename Dim_ , typename Index_ , typename Float_ >
Details compute (size_t num_dim, const std::vector< size_t > &num_obs, const std::vector< const Float_ * > &batches, Float_ *output, const Options< Dim_, Index_, Float_ > &options)
 
template<typename Dim_ , typename Index_ , typename Float_ >
Details compute (size_t num_dim, const std::vector< size_t > &num_obs, const Float_ *input, Float_ *output, const Options< Dim_, Index_, Float_ > &options)
 
template<typename Dim_ , typename Index_ , typename Float_ , typename Batch_ >
Details compute (size_t num_dim, size_t num_obs, const Float_ *input, const Batch_ *batch, Float_ *output, const Options< Dim_, Index_, Float_ > &options)
 
template<typename Task_ , class Run_ >
void parallelize (int num_workers, Task_ num_tasks, Run_ run_task_range)
 

Detailed Description

Batch correction with mutual nearest neighbors.

Enumeration Type Documentation

◆ ReferencePolicy

enum class mnncorrect::ReferencePolicy : char

Policy for choosing the first reference batch with the automatic merging procedure.

  • INPUT will use the first supplied batch in the input order. This is useful in cases where one batch is known to contain most subpopulations and should be used as the reference, but there is no obvious ordering for the other batches.
  • MAX_SIZE will use the largest batch (i.e., with the most observations). This is simple to compute and was the previous default; it does, at least, ensure that the initial reference has enough cells for stable correction.
  • MAX_VARIANCE will use the batch with the greatest variance. This improves the likelihood of obtaining a reference that contains a diversity of subpopulations and thus is more likely to form sensible MNN pairs with subsequent batches.
  • MAX_RSS will use the batch with the greatest residual sum of squares (RSS). This is similar to MAX_VARIANCE but it puts more weight on batches with more cells, so as to avoid picking small batches with few cells and unstable population structure.
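
The difference between the policies can be illustrated with a small standalone sketch. It does not call the library: `SketchPolicy`, `sketch_rss()` and `sketch_choose_reference()` are hypothetical names, and the definitions used here (variance as the RSS divided by one less than the number of observations, with the RSS summed over dimensions) are assumptions that may differ from what the library computes internally.

```cpp
#include <cstddef>
#include <vector>

// Local stand-in that mirrors mnncorrect::ReferencePolicy, for illustration only.
enum class SketchPolicy : char { INPUT, MAX_SIZE, MAX_VARIANCE, MAX_RSS };

// Sum of squared deviations from the per-dimension mean for one batch,
// stored as a column-major matrix with num_dim rows and num_obs columns.
inline double sketch_rss(std::size_t num_dim, std::size_t num_obs, const double* data) {
    double rss = 0;
    for (std::size_t d = 0; d < num_dim; ++d) {
        double mean = 0;
        for (std::size_t o = 0; o < num_obs; ++o) {
            mean += data[o * num_dim + d];
        }
        mean /= num_obs;
        for (std::size_t o = 0; o < num_obs; ++o) {
            double delta = data[o * num_dim + d] - mean;
            rss += delta * delta;
        }
    }
    return rss;
}

// Returns the index of the batch that the given policy would pick (assumed scoring).
inline std::size_t sketch_choose_reference(
    std::size_t num_dim,
    const std::vector<std::size_t>& num_obs,
    const std::vector<const double*>& batches,
    SketchPolicy policy)
{
    if (policy == SketchPolicy::INPUT) {
        return 0; // first batch in the input order
    }
    std::size_t best = 0;
    double best_score = -1;
    for (std::size_t b = 0; b < num_obs.size(); ++b) {
        double score = 0;
        if (policy == SketchPolicy::MAX_SIZE) {
            score = static_cast<double>(num_obs[b]);
        } else if (num_obs[b] > 1) {
            double rss = sketch_rss(num_dim, num_obs[b], batches[b]);
            // MAX_RSS keeps the raw sum; MAX_VARIANCE normalizes by the batch size.
            score = (policy == SketchPolicy::MAX_RSS) ? rss : rss / (num_obs[b] - 1);
        }
        if (score > best_score) {
            best = b;
            best_score = score;
        }
    }
    return best;
}
```

With a small, spread-out batch and a larger, tighter batch, this sketch picks the small batch under MAX_VARIANCE but the larger one under MAX_RSS and MAX_SIZE, which is the trade-off described above.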

Function Documentation

◆ compute() [1/3]

template<typename Dim_ , typename Index_ , typename Float_ >
Details mnncorrect::compute ( size_t  num_dim,
const std::vector< size_t > &  num_obs,
const Float_ *  input,
Float_ *  output,
const Options< Dim_, Index_, Float_ > &  options 
)

A convenience overload to merge contiguous batches contained in the same array.

Template Parameters
Dim_: Integer type for the dimensions of the neighbor search.
Index_: Integer type for the observation index of the neighbor search.
Float_: Floating-point type for the distances in the neighbor search.
Parameters
num_dim: Number of dimensions.
num_obs: Vector of length equal to the number of batches. The i-th entry contains the number of observations in batch i.
[in] input: Pointer to an array containing a column-major matrix of uncorrected values from all batches. The number of rows is equal to num_dim and the number of columns is equal to the sum of num_obs. The first num_obs[0] columns contain the uncorrected data for the first batch, the next num_obs[1] columns contain the uncorrected data for the second batch, and so on.
[out] output: Pointer to an array containing a column-major matrix of the same dimensions as that in input, where the corrected values for all batches are stored. On output, the first num_obs[0] columns contain the corrected values of the first batch, the next num_obs[1] columns contain the corrected values of the second batch, and so on.
options: Further options.
Returns
Statistics about the merge process.
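
The contiguous layout can be related to the per-batch pointers expected by the vector-of-pointers overload with a short sketch. `sketch_split_batches()` is a hypothetical helper written for this illustration; whether the library derives the pointers in this way is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns a pointer to the first column of each batch
// inside a single column-major array with num_dim rows.
template<typename Float_>
std::vector<const Float_*> sketch_split_batches(
    std::size_t num_dim,
    const std::vector<std::size_t>& num_obs,
    const Float_* input)
{
    std::vector<const Float_*> batches;
    batches.reserve(num_obs.size());
    std::size_t offset = 0;
    for (auto n : num_obs) {
        batches.push_back(input + offset);
        offset += n * num_dim; // each observation occupies num_dim consecutive values
    }
    return batches;
}
```

For example, with num_dim = 2 and num_obs = {3, 1, 2}, the batches start at element offsets 0, 6 and 8 of the input array.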

◆ compute() [2/3]

template<typename Dim_ , typename Index_ , typename Float_ >
Details mnncorrect::compute ( size_t  num_dim,
const std::vector< size_t > &  num_obs,
const std::vector< const Float_ * > &  batches,
Float_ *  output,
const Options< Dim_, Index_, Float_ > &  options 
)

Batch correction using mutual nearest neighbors.

This function implements a variant of the MNN correction method described by Haghverdi et al. (2018). Two cells from different batches form an MNN pair if each belongs to the other's set of nearest neighbors. The MNN pairs are assumed to represent cells from corresponding subpopulations across the two batches. Any differences in location between the paired cells can be interpreted as the batch effect and targeted for removal.
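
The definition of an MNN pair can be made concrete with a brute-force sketch. This is only an illustration of the mutual-neighbor criterion under simple assumptions (Euclidean distance, exhaustive search); the function names are hypothetical and the library's own neighbor search is not shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Squared Euclidean distance between two observations of length num_dim.
inline double sketch_sqdist(std::size_t num_dim, const double* x, const double* y) {
    double total = 0;
    for (std::size_t d = 0; d < num_dim; ++d) {
        double delta = x[d] - y[d];
        total += delta * delta;
    }
    return total;
}

// For each column of `query`, the indices of its k nearest columns in `target`.
inline std::vector<std::vector<std::size_t>> sketch_knn(
    std::size_t num_dim,
    std::size_t nquery, const double* query,
    std::size_t ntarget, const double* target,
    std::size_t k)
{
    std::vector<std::vector<std::size_t>> out(nquery);
    std::vector<std::pair<double, std::size_t>> dist(ntarget);
    k = std::min(k, ntarget);
    for (std::size_t q = 0; q < nquery; ++q) {
        for (std::size_t t = 0; t < ntarget; ++t) {
            dist[t] = { sketch_sqdist(num_dim, query + q * num_dim, target + t * num_dim), t };
        }
        std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
        for (std::size_t i = 0; i < k; ++i) {
            out[q].push_back(dist[i].second);
        }
    }
    return out;
}

// Returns (reference index, target index) pairs that are mutual nearest neighbors.
inline std::vector<std::pair<std::size_t, std::size_t>> sketch_find_mnns(
    std::size_t num_dim,
    std::size_t nref, const double* ref,
    std::size_t ntarget, const double* target,
    std::size_t k)
{
    auto ref_to_target = sketch_knn(num_dim, nref, ref, ntarget, target, k);
    auto target_to_ref = sketch_knn(num_dim, ntarget, target, nref, ref, k);
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t r = 0; r < nref; ++r) {
        for (auto t : ref_to_target[r]) {
            const auto& back = target_to_ref[t];
            if (std::find(back.begin(), back.end(), r) != back.end()) {
                pairs.emplace_back(r, t); // each is in the other's neighbor set
            }
        }
    }
    return pairs;
}
```

An observation that is nobody's nearest neighbor in the other batch, such as an outlying subpopulation unique to one batch, does not appear in any pair.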

We consider one batch to be the "reference" and the other to be the "target", where the aim is to correct the latter to the (unchanged) former. For each observation in the target batch, we find the closest MNN pairs (based on the locations of the paired observations in the target batch) and we compute a robust average of the correction vectors involving those pairs. This average is used to obtain a single correction vector that is applied to the target observation to obtain corrected values.
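
The per-observation correction can be sketched as follows. This is a simplified illustration, not the library's code: `sketch_correct_one()` is a hypothetical name, the correction vectors are assumed to be precomputed as reference minus target, and a plain mean stands in for the robust average mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Corrects one target observation in place.
// pair_targets: column-major coordinates of the target-side member of each MNN pair.
// pair_vectors: the matching correction vectors (assumed to be reference minus target).
// k: number of closest MNN pairs to average over.
inline void sketch_correct_one(
    std::size_t num_dim, std::size_t num_pairs,
    const double* pair_targets, const double* pair_vectors,
    std::size_t k, double* obs)
{
    k = std::min(k, num_pairs);
    if (k == 0) {
        return; // nothing to average, leave the observation unchanged
    }

    // Distance from this observation to the target-side member of each pair.
    std::vector<std::pair<double, std::size_t>> dist(num_pairs);
    for (std::size_t p = 0; p < num_pairs; ++p) {
        double d2 = 0;
        for (std::size_t d = 0; d < num_dim; ++d) {
            double delta = obs[d] - pair_targets[p * num_dim + d];
            d2 += delta * delta;
        }
        dist[p] = { d2, p };
    }
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());

    // Plain mean of the k closest correction vectors (a stand-in for a robust average).
    std::vector<double> shift(num_dim, 0.0);
    for (std::size_t i = 0; i < k; ++i) {
        for (std::size_t d = 0; d < num_dim; ++d) {
            shift[d] += pair_vectors[dist[i].second * num_dim + d];
        }
    }
    for (std::size_t d = 0; d < num_dim; ++d) {
        obs[d] += shift[d] / k;
    }
}
```

Using only the closest pairs keeps the correction local, so that distant pairs with a different correction vector do not influence the observation.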

Each MNN pair's correction vector is computed between the "center of mass" locations for the paired observations. The center of mass for each observation is defined as a robust average of a subset of neighboring observations from the same batch. Robustness is achieved by iteratively trimming the observations that are furthest from the mean. In addition, we explicitly remove observations that are more than a certain distance from the observation in the MNN pair.
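
The iterative trimming can be sketched as below. `sketch_robust_mean()` and its `iterations` and `trim` parameters are hypothetical and chosen for this illustration; the exact trimming rule used by the library is not given in this page, and the additional distance-based removal is not shown.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Mean of the columns of a column-major matrix, recomputed after repeatedly
// discarding the fraction `trim` of observations that are furthest from the current mean.
inline std::vector<double> sketch_robust_mean(
    std::size_t num_dim, std::size_t num_obs, const double* data,
    int iterations, double trim)
{
    std::vector<double> center(num_dim, 0.0);
    if (num_obs == 0) {
        return center;
    }
    std::vector<std::size_t> keep(num_obs);
    std::iota(keep.begin(), keep.end(), static_cast<std::size_t>(0));

    auto recompute = [&]() {
        std::fill(center.begin(), center.end(), 0.0);
        for (auto idx : keep) {
            for (std::size_t d = 0; d < num_dim; ++d) {
                center[d] += data[idx * num_dim + d];
            }
        }
        for (auto& c : center) {
            c /= keep.size();
        }
    };

    recompute();
    for (int it = 0; it < iterations; ++it) {
        auto retain = static_cast<std::size_t>(std::ceil(keep.size() * (1.0 - trim)));
        if (retain == 0 || retain >= keep.size()) {
            break; // nothing left to trim
        }
        std::vector<std::pair<double, std::size_t>> dist;
        dist.reserve(keep.size());
        for (auto idx : keep) {
            double d2 = 0;
            for (std::size_t d = 0; d < num_dim; ++d) {
                double delta = data[idx * num_dim + d] - center[d];
                d2 += delta * delta;
            }
            dist.emplace_back(d2, idx);
        }
        std::sort(dist.begin(), dist.end()); // closest observations first
        keep.clear();
        for (std::size_t i = 0; i < retain; ++i) {
            keep.push_back(dist[i].second);
        }
        recompute();
    }
    return center;
}
```

For the one-dimensional values 0, 1, 2 and 100, a single iteration with trim = 0.25 discards the outlier at 100 and moves the center from 25.75 to 1.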

See also
Haghverdi L et al. (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotech. 36, 421-427
Template Parameters
Dim_: Integer type for the dimensions of the neighbor search.
Index_: Integer type for the observation index of the neighbor search.
Float_: Floating-point type for the distances in the neighbor search.
Parameters
num_dim: Number of dimensions.
num_obs: Vector of length equal to the number of batches. The i-th entry contains the number of observations in batch i.
[in] batches: Vector of length equal to the number of batches. The i-th entry points to a column-major dimension-by-observation array containing the uncorrected data for batch i, where the number of rows is equal to num_dim and the number of columns is equal to num_obs[i].
[out] output: Pointer to an array containing a column-major matrix with number of rows equal to num_dim and number of columns equal to the sum of num_obs. On output, the first num_obs[0] columns contain the corrected values of the first batch, the next num_obs[1] columns contain the corrected values of the second batch, and so on.
options: Further options.
Returns
Statistics about the merge process.

◆ compute() [3/3]

template<typename Dim_ , typename Index_ , typename Float_ , typename Batch_ >
Details mnncorrect::compute ( size_t  num_dim,
size_t  num_obs,
const Float_ *  input,
const Batch_ *  batch,
Float_ *  output,
const Options< Dim_, Index_, Float_ > &  options 
)

Merge batches where observations are arbitrarily ordered in the same array.

Template Parameters
Dim_: Integer type for the dimensions of the neighbor search.
Index_: Integer type for the observation index of the neighbor search.
Float_: Floating-point type for the distances in the neighbor search.
Batch_: Integer type for the batch IDs.
Parameters
num_dim: Number of dimensions.
num_obs: Number of observations across all batches.
[in] input: Pointer to an array containing a column-major matrix of uncorrected values from all batches. The number of rows is equal to num_dim and the number of columns is equal to num_obs. Observations from the same batch do not need to be stored in adjacent columns.
[in] batch: Pointer to an array of length num_obs containing the batch identity for each observation. IDs should be zero-indexed and lie within \([0, N)\) where \(N\) is the number of unique batches.
[out] output: Pointer to an array containing a column-major matrix of the same dimensions as that in input, where the corrected values for all batches are stored. The order of observations in output is the same as that in the input.
options: Further options.
Returns
Statistics about the merge process.
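
The role of the zero-indexed batch IDs can be illustrated by grouping column indices per batch. `sketch_group_by_batch()` is a hypothetical helper for this illustration only; how the library reorders observations internally is not described here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: for each batch ID in [0, N), collects the original
// column indices of the observations assigned to that batch.
template<typename Batch_>
std::vector<std::vector<std::size_t>> sketch_group_by_batch(std::size_t num_obs, const Batch_* batch) {
    std::size_t num_batches = 0;
    for (std::size_t i = 0; i < num_obs; ++i) {
        num_batches = std::max(num_batches, static_cast<std::size_t>(batch[i]) + 1);
    }
    std::vector<std::vector<std::size_t>> groups(num_batches);
    for (std::size_t i = 0; i < num_obs; ++i) {
        groups[batch[i]].push_back(i);
    }
    return groups;
}
```

Keeping the original column indices makes it possible to write corrected values back in the input order, which is the behavior documented for output.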

◆ parallelize()

template<typename Task_ , class Run_ >
void mnncorrect::parallelize ( int  num_workers,
Task_  num_tasks,
Run_  run_task_range 
)
Template Parameters
Task_: Integer type for the number of tasks.
Run_: Function to execute a range of tasks.
Parameters
num_workers: Number of workers.
num_tasks: Number of tasks.
run_task_range: Function to iterate over a range of tasks within a worker.

By default, this is an alias to subpar::parallelize_range(). However, if the MNNCORRECT_CUSTOM_PARALLEL function-like macro is defined, it is called instead. Any user-defined macro should accept the same arguments as subpar::parallelize_range().
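
A custom parallelization scheme with the same three arguments might look like the sketch below. `sketch_parallelize()` is a hypothetical function based on std::thread. It assumes that run_task_range is invoked as run_task_range(worker, start, length), which should be checked against the subpar documentation before relying on it.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical replacement taking the same arguments as parallelize().
// Assumption: run_task_range(worker, start, length) processes tasks [start, start + length).
template<typename Task_, class Run_>
void sketch_parallelize(int num_workers, Task_ num_tasks, Run_ run_task_range) {
    if (num_tasks == 0) {
        return;
    }
    if (num_workers <= 1) {
        run_task_range(0, static_cast<Task_>(0), num_tasks);
        return;
    }

    // Split the tasks as evenly as possible; the first few workers get one extra task.
    Task_ per_worker = num_tasks / static_cast<Task_>(num_workers);
    Task_ remainder = num_tasks % static_cast<Task_>(num_workers);

    std::vector<std::thread> workers;
    Task_ start = 0;
    for (int w = 0; w < num_workers; ++w) {
        Task_ length = per_worker + (static_cast<Task_>(w) < remainder ? 1 : 0);
        if (length == 0) {
            break; // more workers than tasks
        }
        workers.emplace_back([&run_task_range, w, start, length]() {
            run_task_range(w, start, length);
        });
        start += length;
    }
    for (auto& t : workers) {
        t.join();
    }
}
```

Presumably, such a function would be exposed by defining MNNCORRECT_CUSTOM_PARALLEL to its name before including the library headers; this sketch does not handle exceptions thrown inside a worker.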