Batch correction with mutual nearest neighbors.

Classes
struct	Options
	Options for `compute()`.

Typedefs
typedef std::size_t	BatchIndex

Enumerations
enum class	MergePolicy : char { INPUT , SIZE , VARIANCE , RSS }

Functions
template<typename Task_ , class Run_ >
void	parallelize (int num_workers, Task_ num_tasks, Run_ run_task_range)

template<typename Index_ , typename Float_ , class Matrix_ >
void	compute (std::size_t num_dim, const std::vector< Index_ > &num_obs, const std::vector< const Float_ * > &batches, Float_ *output, const Options< Index_, Float_, Matrix_ > &options)

template<typename Index_ , typename Float_ , class Matrix_ >
void	compute (std::size_t num_dim, const std::vector< Index_ > &num_obs, const Float_ *input, Float_ *output, const Options< Index_, Float_, Matrix_ > &options)

template<typename Index_ , typename Float_ , typename Batch_ , class Matrix_ >
void	compute (std::size_t num_dim, Index_ num_obs, const Float_ *input, const Batch_ *batch, Float_ *output, const Options< Index_, Float_, Matrix_ > &options)

Detailed Description

Batch correction with mutual nearest neighbors.

Typedef Documentation

◆ BatchIndex

typedef std::size_t mnncorrect::BatchIndex

Integer type of the batch indices.

Enumeration Type Documentation

◆ MergePolicy

enum class mnncorrect::MergePolicy : char

Policy for choosing the order of batches to merge.

INPUT will use the input order of the batches. Observations in the last batch are corrected first, and then the second-last batch, and so on. This allows users to control the merge order by simply changing the inputs.
SIZE will merge batches in order of increasing size (i.e., the number of observations). So, the smallest batch is corrected first while the largest batch is unchanged. The aim is to lower compute time by reducing the number of observations that need to be reprocessed in later merge steps.
VARIANCE will merge batches in order of increasing variance between observations. So, the batch with the lowest variance is corrected first while the batch with the highest variance is unchanged. The aim is to lower compute time by encouraging more observations to be corrected to the most variable batch, thereby avoiding reprocessing in later merge steps.
RSS will merge batches in order of increasing residual sum of squares (RSS). This is effectively a compromise between VARIANCE and SIZE.
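
As an illustration of the SIZE policy, the following sketch orders batch indices by increasing number of observations. The helper name `size_merge_order` is hypothetical and is not part of mnncorrect; how ties are broken is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of mnncorrect): returns batch indices in the
// order in which a SIZE-like policy would correct them, i.e., by increasing
// number of observations. The last entry is the largest batch, which would
// be left unchanged.
template<typename Index_>
std::vector<std::size_t> size_merge_order(const std::vector<Index_>& num_obs) {
    assert(!num_obs.empty()); // at least one batch is required.
    std::vector<std::size_t> order(num_obs.size());
    std::iota(order.begin(), order.end(), static_cast<std::size_t>(0));
    // stable_sort keeps the input order among equally sized batches
    // (an assumption about tie-breaking, not documented behavior).
    std::stable_sort(order.begin(), order.end(), [&](std::size_t l, std::size_t r) {
        return num_obs[l] < num_obs[r];
    });
    return order;
}
```

For batches of 500, 100 and 300 observations, the returned order is batch 1, then batch 2, then batch 0.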

Function Documentation

◆ compute() [1/3]

template<typename Index_ , typename Float_ , class Matrix_ >

void mnncorrect::compute	(	std::size_t	num_dim,
		const std::vector< Index_ > &	num_obs,
		const Float_ *	input,
		Float_ *	output,
		const Options< Index_, Float_, Matrix_ > &	options )

A convenience overload to merge contiguous batches contained in the same array.

Template Parameters

Index_	Integer type for the observation index.
Float_	Floating-point type for the input/output data.
Matrix_	Class of the input data matrix for the neighbor search. This should satisfy the `knncolle::Matrix` interface. Alternatively, it may be a `knncolle::SimpleMatrix`.

Parameters

	num_dim	Number of dimensions.
	num_obs	Vector of length equal to the number of batches. The `i`-th entry contains the number of observations in batch `i`.
[in]	input	Pointer to an array containing a column-major matrix of uncorrected values from all batches. The number of rows is equal to `num_dim` and the number of columns is equal to the sum of `num_obs`. The first `num_obs[0]` columns contain the uncorrected data for the first batch, the next `num_obs[1]` columns contain observations for the second batch, and so on.
[out]	output	Pointer to an array containing a column-major matrix of the same dimensions as that in `input`, where the corrected values for all batches are stored. On output, the first `num_obs[0]` columns contain the corrected values of the first batch, the next `num_obs[1]` columns contain the corrected values of the second batch, and so on.
	options	Further options.
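
The contiguous layout described for `input` can be sketched as follows: each batch starts at an offset equal to the number of preceding observations multiplied by `num_dim`. The helper `split_contiguous_batches` is hypothetical and only illustrates the layout; it is not claimed to be how this overload is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of mnncorrect): given a column-major array
// holding all batches back to back, return a pointer to the first column of
// each batch.
template<typename Index_, typename Float_>
std::vector<const Float_*> split_contiguous_batches(
    std::size_t num_dim,
    const std::vector<Index_>& num_obs,
    const Float_* input)
{
    assert(input != nullptr);
    std::vector<const Float_*> batches;
    batches.reserve(num_obs.size());
    std::size_t offset = 0; // number of values preceding the current batch.
    for (auto n : num_obs) {
        batches.push_back(input + offset);
        // Each observation is one column of num_dim values.
        offset += static_cast<std::size_t>(n) * num_dim;
    }
    return batches;
}
```

The resulting vector has the same shape as the `batches` argument of the vector-of-pointers overload.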

◆ compute() [2/3]

template<typename Index_ , typename Float_ , class Matrix_ >

void mnncorrect::compute	(	std::size_t	num_dim,
		const std::vector< Index_ > &	num_obs,
		const std::vector< const Float_ * > &	batches,
		Float_ *	output,
		const Options< Index_, Float_, Matrix_ > &	options )

This function implements a variant of the mutual nearest neighbors (MNN) method for batch correction (Haghverdi et al., 2018). Two cells from different batches can form an MNN pair if they each belong to the other's set of nearest neighbors. The MNN pairs are assumed to represent cells from corresponding subpopulations across the two batches. Any difference in location between the paired cells represents an estimate of the batch effect in that part of the high-dimensional space.
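
The definition of an MNN pair can be made concrete with a small brute-force sketch. This is not the library's implementation: an exhaustive distance scan is swapped in for the real neighbor search, and all function names here are hypothetical. Data are column-major with one observation per column.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Squared Euclidean distance between two observations of length num_dim.
inline double squared_distance(std::size_t num_dim, const double* a, const double* b) {
    double total = 0;
    for (std::size_t d = 0; d < num_dim; ++d) {
        const double delta = a[d] - b[d];
        total += delta * delta;
    }
    return total;
}

// For each column of 'query', find the indices of its k closest columns in 'target'.
inline std::vector<std::vector<std::size_t> > brute_force_knn(
    std::size_t num_dim, std::size_t k,
    std::size_t num_query, const double* query,
    std::size_t num_target, const double* target)
{
    assert(k <= num_target);
    std::vector<std::vector<std::size_t> > output(num_query);
    std::vector<std::pair<double, std::size_t> > dist(num_target);
    for (std::size_t q = 0; q < num_query; ++q) {
        for (std::size_t t = 0; t < num_target; ++t) {
            dist[t] = std::make_pair(
                squared_distance(num_dim, query + q * num_dim, target + t * num_dim), t);
        }
        // Only the k smallest distances need to be sorted.
        std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
        for (std::size_t i = 0; i < k; ++i) {
            output[q].push_back(dist[i].second);
        }
    }
    return output;
}

// Returns (reference index, target index) pairs where each observation is
// among the k nearest neighbors of the other.
inline std::vector<std::pair<std::size_t, std::size_t> > find_mnn_pairs(
    std::size_t num_dim, std::size_t k,
    std::size_t num_ref, const double* ref,
    std::size_t num_target, const double* target)
{
    auto ref_to_target = brute_force_knn(num_dim, k, num_ref, ref, num_target, target);
    auto target_to_ref = brute_force_knn(num_dim, k, num_target, target, num_ref, ref);
    std::vector<std::pair<std::size_t, std::size_t> > pairs;
    for (std::size_t r = 0; r < num_ref; ++r) {
        for (auto t : ref_to_target[r]) {
            const auto& back = target_to_ref[t];
            if (std::find(back.begin(), back.end(), r) != back.end()) {
                pairs.emplace_back(r, t); // the relationship is mutual.
            }
        }
    }
    return pairs;
}
```

Note that an observation may have a nearest neighbor in the other batch without being part of any MNN pair, because the relationship must hold in both directions.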

We consider one batch to be the "reference" and the other to be the "target", where the aim is to correct the latter to the (unchanged) former. Each MNN pair is used to define a correction vector. For each observation in the target batch, we find the closest MNN pairs (based on the locations of the paired observation in the same batch) and we compute a robust average of the correction vectors involving those pairs. This average is used to obtain a single correction vector that is applied to the target observation to obtain corrected values.
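
A simplified sketch of this per-observation correction is shown below. It makes several assumptions that differ from the method described here: a plain mean is swapped in for the robust average, raw coordinates are used for the paired observations, and neighbors are found by exhaustive search. The function name `correct_target` and its arguments are hypothetical; `pairs` holds (reference column, target column) indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch, not the library's implementation: corrects every
// target observation using the plain mean of the correction vectors of its
// k closest MNN pairs.
inline std::vector<double> correct_target(
    std::size_t num_dim, std::size_t k,
    const double* ref,
    std::size_t num_target, const double* target,
    const std::vector<std::pair<std::size_t, std::size_t> >& pairs)
{
    assert(k > 0 && k <= pairs.size());
    std::vector<double> corrected(target, target + num_target * num_dim);
    std::vector<std::pair<double, std::size_t> > dist(pairs.size());

    for (std::size_t t = 0; t < num_target; ++t) {
        const double* current = target + t * num_dim;

        // Distance from this observation to the target-side member of each pair.
        for (std::size_t p = 0; p < pairs.size(); ++p) {
            const double* partner = target + pairs[p].second * num_dim;
            double d2 = 0;
            for (std::size_t d = 0; d < num_dim; ++d) {
                const double delta = current[d] - partner[d];
                d2 += delta * delta;
            }
            dist[p] = std::make_pair(d2, p);
        }
        std::partial_sort(dist.begin(), dist.begin() + k, dist.end());

        // Plain mean of the k closest correction vectors. A robust average
        // would down-weight or drop outlying vectors; that step is omitted.
        for (std::size_t i = 0; i < k; ++i) {
            const auto& chosen = pairs[dist[i].second];
            const double* r = ref + chosen.first * num_dim;
            const double* m = target + chosen.second * num_dim;
            for (std::size_t d = 0; d < num_dim; ++d) {
                corrected[t * num_dim + d] += (r[d] - m[d]) / static_cast<double>(k);
            }
        }
    }
    return corrected;
}
```

Target observations that are not in any MNN pair are still corrected, because they borrow the correction vectors of nearby paired observations in their own batch.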

Each MNN pair's correction vector is computed between the "center of mass" locations for the paired observations. The center of mass for each observation is defined by recursively searching the neighbors of each MNN-involved observation (and then the neighbors of those neighbors, up to a recursion depth of Options::num_steps) and computing the mean of their coordinates. This improves the correction by mitigating the "kissing effect", i.e., where the correction vectors only form between the surfaces of the mass of points in each batch.

See also: Haghverdi L et al. (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotech. 36, 421-427

Template Parameters

Index_	Integer type for the observation index.
Float_	Floating-point type for the input/output data.
Matrix_	Class of the input data matrix for the neighbor search. This should satisfy the `knncolle::Matrix` interface. Alternatively, it may be a `knncolle::SimpleMatrix`.

Parameters

	num_dim	Number of dimensions.
	num_obs	Vector of length equal to the number of batches. The `i`-th entry contains the number of observations in batch `i`.
[in]	batches	Vector of length equal to the number of batches. The `i`-th entry points to a column-major dimension-by-observation array containing the uncorrected data for batch `i`, where the number of rows is equal to `num_dim` and the number of columns is equal to `num_obs[i]`.
[out]	output	Pointer to an array containing a column-major matrix with number of rows equal to `num_dim` and number of columns equal to the sum of `num_obs`. On output, the first `num_obs[0]` columns contain the corrected values of the first batch, the next `num_obs[1]` columns contain the corrected values of the second batch, and so on.
	options	Further options.

◆ compute() [3/3]

template<typename Index_ , typename Float_ , typename Batch_ , class Matrix_ >

void mnncorrect::compute	(	std::size_t	num_dim,
		Index_	num_obs,
		const Float_ *	input,
		const Batch_ *	batch,
		Float_ *	output,
		const Options< Index_, Float_, Matrix_ > &	options )

Merge batches where observations are arbitrarily ordered in the same array.

Template Parameters

Index_	Integer type for the observation index.
Float_	Floating-point type for the input/output data.
Matrix_	Class of the input data matrix for the neighbor search. This should satisfy the `knncolle::Matrix` interface. Alternatively, it may be a `knncolle::SimpleMatrix`.
Batch_	Integer type for the batch IDs.

Parameters

	num_dim	Number of dimensions.
	num_obs	Number of observations across all batches.
[in]	input	Pointer to an array containing a column-major matrix of uncorrected values from all batches. The number of rows is equal to `num_dim` and the number of columns is equal to `num_obs`. Observations from the same batch do not need to be stored in adjacent columns.
[in]	batch	Pointer to an array of length `num_obs` containing the batch identity for each observation. IDs should be zero-indexed and lie within \([0, N)\) where \(N\) is the number of unique batches.
[out]	output	Pointer to an array containing a column-major matrix of the same dimensions as that in `input`, where the corrected values for all batches are stored. The order of observations in `output` is the same as that in the `input`.
	options	Further options.
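
To illustrate how the `batch` array relates to the columns of `input`, the sketch below groups column indices by batch ID. The helper `columns_by_batch` is hypothetical and not part of mnncorrect; it assumes that every ID in \([0, N)\) is used, so that \(N\) is one more than the largest ID.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of mnncorrect): for each batch, list the
// columns of 'input' that belong to it, given zero-indexed batch IDs.
template<typename Batch_>
std::vector<std::vector<std::size_t> > columns_by_batch(std::size_t num_obs, const Batch_* batch) {
    // Determine the number of batches as one more than the largest ID.
    std::size_t num_batches = 0;
    for (std::size_t o = 0; o < num_obs; ++o) {
        assert(batch[o] >= 0); // IDs must be non-negative.
        const std::size_t id = static_cast<std::size_t>(batch[o]);
        if (id >= num_batches) {
            num_batches = id + 1;
        }
    }

    // Columns are recorded in their original order within each batch.
    std::vector<std::vector<std::size_t> > columns(num_batches);
    for (std::size_t o = 0; o < num_obs; ++o) {
        columns[static_cast<std::size_t>(batch[o])].push_back(o);
    }
    return columns;
}
```

With batch IDs `{1, 0, 1, 2, 0}`, batch 0 owns columns 1 and 4, batch 1 owns columns 0 and 2, and batch 2 owns column 3.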

◆ parallelize()

template<typename Task_ , class Run_ >

void mnncorrect::parallelize	(	int	num_workers,
		Task_	num_tasks,
		Run_	run_task_range )

Template Parameters

Task_	Integer type for the number of tasks.
Run_	Function to execute a range of tasks.

Parameters

num_workers	Number of workers.
num_tasks	Number of tasks.
run_task_range	Function to iterate over a range of tasks within a worker.

By default, this is an alias to subpar::parallelize_range(). However, if the MNNCORRECT_CUSTOM_PARALLEL function-like macro is defined, it is called instead. Any user-defined macro should accept the same arguments as subpar::parallelize_range().
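
A custom replacement might look like the sketch below, which splits the tasks into contiguous ranges and runs each range on its own `std::thread`. The name `my_parallelize_range` is hypothetical. The sketch also assumes that `run_task_range` is invoked as `run_task_range(worker, start, length)`; this calling convention is an assumption and should be checked against the subpar documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical replacement for the default parallelization scheme.
// Assumption: run_task_range is called as run_task_range(worker, start, length).
template<typename Task_, class Run_>
void my_parallelize_range(int num_workers, Task_ num_tasks, Run_ run_task_range) {
    assert(num_workers > 0);
    const Task_ workers = static_cast<Task_>(num_workers);
    const Task_ per_worker = num_tasks / workers;
    const Task_ remainder = num_tasks % workers;

    std::vector<std::thread> threads;
    Task_ start = 0;
    for (int w = 0; w < num_workers; ++w) {
        // The first 'remainder' workers take one extra task.
        const Task_ length = per_worker + (static_cast<Task_>(w) < remainder ? 1 : 0);
        if (length == 0) {
            break; // no tasks left for this or any later worker.
        }
        threads.emplace_back([&run_task_range, w, start, length]() {
            run_task_range(w, start, length);
        });
        start += length;
    }
    for (auto& t : threads) {
        t.join();
    }
}

// Possible usage (assumed form), placed before including the mnncorrect header:
// #define MNNCORRECT_CUSTOM_PARALLEL(nw, nt, fun) my_parallelize_range(nw, nt, fun)
```

Each worker receives a disjoint range, so the callback can write to per-task outputs without synchronization. Exceptions thrown inside a worker are not handled in this sketch.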
