mnncorrect
Batch correction with mutual nearest neighbors
Classes | |
struct | Options |
Options for compute(). | |
Typedefs | |
typedef std::size_t | BatchIndex |
Enumerations | |
enum class | MergePolicy : char { INPUT , SIZE , VARIANCE , RSS } |
Functions | |
template<typename Task_ , class Run_ > | |
void | parallelize (const int num_workers, const Task_ num_tasks, Run_ run_task_range) |
template<typename Index_ , typename Float_ , class Matrix_ > | |
void | compute (const std::size_t num_dim, const std::vector< Index_ > &num_obs, const std::vector< const Float_ * > &batches, Float_ *const output, const Options< Index_, Float_, Matrix_ > &options) |
template<typename Index_ , typename Float_ , class Matrix_ > | |
void | compute (const std::size_t num_dim, const std::vector< Index_ > &num_obs, const Float_ *const input, Float_ *const output, const Options< Index_, Float_, Matrix_ > &options) |
template<typename Index_ , typename Float_ , typename Batch_ , class Matrix_ > | |
void | compute (const std::size_t num_dim, const Index_ num_obs, const Float_ *const input, const Batch_ *const batch, Float_ *const output, const Options< Index_, Float_, Matrix_ > &options) |
typedef std::size_t mnncorrect::BatchIndex |
Integer type of the batch indices.
enum class mnncorrect::MergePolicy : char |
Policy for choosing the order of batches to merge.
INPUT | will use the input order of the batches. Observations in the last batch are corrected first, then the second-last batch, and so on. This allows users to control the merge order by simply changing the inputs. |
SIZE | will merge batches in order of increasing size (i.e., the number of observations). So, the smallest batch is corrected first while the largest batch is unchanged. The aim is to lower compute time by reducing the number of observations that need to be reprocessed in later merge steps. |
VARIANCE | will merge batches in order of increasing variance between observations. So, the batch with the lowest variance is corrected first while the batch with the highest variance is unchanged. The aim is to lower compute time by encouraging more observations to be corrected to the most variable batch, thus avoiding reprocessing in later merge steps. |
RSS | will merge batches in order of increasing residual sum of squares (RSS). This is effectively a compromise between VARIANCE and SIZE. |
void mnncorrect::compute | ( | const std::size_t | num_dim, |
const Index_ | num_obs, | ||
const Float_ *const | input, | ||
const Batch_ *const | batch, | ||
Float_ *const | output, | ||
const Options< Index_, Float_, Matrix_ > & | options ) |
Overload of compute() to merge batches where observations are arbitrarily ordered in the same array.
Index_ | Integer type of the observation index. |
Float_ | Floating-point type of the input/output data. |
Matrix_ | Class of the input data matrix for the neighbor search. This should satisfy the knncolle::Matrix interface. Alternatively, it may be a knncolle::SimpleMatrix. |
Batch_ | Integer type of the batch IDs. |
num_dim | Number of dimensions. | |
num_obs | Number of observations across all batches. | |
[in] | input | Pointer to an array containing a column-major matrix of uncorrected values from all batches. The number of rows is equal to num_dim and the number of columns is equal to num_obs. Observations from the same batch do not need to be stored in adjacent columns. |
[in] | batch | Pointer to an array of length num_obs containing the batch identity for each observation. IDs should be zero-indexed and lie within \([0, N)\) where \(N\) is the number of unique batches. |
[out] | output | Pointer to an array containing a column-major matrix of the same dimensions as that in input, where the corrected values for all batches are stored. The order of observations in output is the same as that in input. |
options | Further options. |
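The layout expected by this overload can be illustrated with a small sketch. The helpers below (columns_by_batch and value_at) are hypothetical and not part of mnncorrect; they only show how zero-indexed batch IDs map interleaved columns to batches, and how a value is addressed in a column-major matrix with num_dim rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of mnncorrect): for each batch, collect the
// column indices of its observations in the interleaved input matrix.
template<typename Index_, typename Batch_>
std::vector<std::vector<Index_> > columns_by_batch(const Index_ num_obs, const Batch_* const batch) {
    // IDs are zero-indexed, so the number of batches is the largest ID plus one.
    std::size_t num_batches = 0;
    for (Index_ i = 0; i < num_obs; ++i) {
        const std::size_t id = static_cast<std::size_t>(batch[i]);
        if (id + 1 > num_batches) {
            num_batches = id + 1;
        }
    }

    std::vector<std::vector<Index_> > groups(num_batches);
    for (Index_ i = 0; i < num_obs; ++i) {
        groups[static_cast<std::size_t>(batch[i])].push_back(i);
    }
    return groups;
}

// Hypothetical helper: read one value from a column-major matrix with
// num_dim rows, where each column is one observation.
template<typename Float_>
Float_ value_at(const Float_* const matrix, const std::size_t num_dim, const std::size_t dim, const std::size_t obs) {
    assert(dim < num_dim);
    return matrix[obs * num_dim + dim];
}
```

For example, batch IDs {1, 0, 1, 2, 0} assign columns 1 and 4 to batch 0, columns 0 and 2 to batch 1, and column 3 to batch 2.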
void mnncorrect::compute | ( | const std::size_t | num_dim, |
const std::vector< Index_ > & | num_obs, | ||
const Float_ *const | input, | ||
Float_ *const | output, | ||
const Options< Index_, Float_, Matrix_ > & | options ) |
Overload of compute() to merge contiguous batches contained in the same array.
Index_ | Integer type of the observation index. |
Float_ | Floating-point type of the input/output data. |
Matrix_ | Class of the input data matrix for the neighbor search. This should satisfy the knncolle::Matrix interface. Alternatively, it may be a knncolle::SimpleMatrix. |
num_dim | Number of dimensions. | |
num_obs | Vector of length equal to the number of batches. The i-th entry contains the number of observations in batch i. | |
[in] | input | Pointer to an array containing a column-major matrix of uncorrected values from all batches. The number of rows is equal to num_dim and the number of columns is equal to the sum of num_obs. The first num_obs[0] columns contain the uncorrected data for the first batch, the next num_obs[1] columns contain the uncorrected data for the second batch, and so on. |
[out] | output | Pointer to an array containing a column-major matrix of the same dimensions as that in input, where the corrected values for all batches are stored. On output, the first num_obs[0] columns contain the corrected values of the first batch, the next num_obs[1] columns contain the corrected values of the second batch, and so on. |
options | Further options. |
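As a sketch of the contiguous layout described above, the column offset of each batch is a running sum over num_obs. The helpers below (batch_column_offsets and batch_start) are hypothetical and not part of mnncorrect; they show how the start of each batch could be located inside a single column-major array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of mnncorrect): offsets[i] is the index of
// the first column of batch i, and offsets.back() is the total column count.
template<typename Index_>
std::vector<std::size_t> batch_column_offsets(const std::vector<Index_>& num_obs) {
    std::vector<std::size_t> offsets(num_obs.size() + 1, 0);
    for (std::size_t b = 0; b < num_obs.size(); ++b) {
        offsets[b + 1] = offsets[b] + static_cast<std::size_t>(num_obs[b]);
    }
    return offsets;
}

// Hypothetical helper: pointer to the first value of batch b in a
// column-major matrix with num_dim rows.
template<typename Float_>
const Float_* batch_start(
    const Float_* const input,
    const std::size_t num_dim,
    const std::vector<std::size_t>& offsets,
    const std::size_t b)
{
    assert(b + 1 < offsets.size());
    return input + offsets[b] * num_dim;
}
```

With num_obs = {3, 2, 4} and num_dim = 2, the second batch starts at column 3, i.e., at element 6 of the array.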
void mnncorrect::compute | ( | const std::size_t | num_dim, |
const std::vector< Index_ > & | num_obs, | ||
const std::vector< const Float_ * > & | batches, | ||
Float_ *const | output, | ||
const Options< Index_, Float_, Matrix_ > & | options ) |
This function implements a variant of the mutual nearest neighbors (MNN) method for batch correction (Haghverdi et al., 2018). Two observations from different batches form an MNN pair if each belongs to the other's set of nearest neighbors. The MNN pairs are assumed to represent observations from corresponding subpopulations across the two batches. Any difference in location between the paired observations represents an estimate of the batch effect in that part of the high-dimensional space.
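The MNN definition above can be sketched with a brute-force search. This is only an illustration under simple assumptions (Euclidean distance, exhaustive search, double-precision column-major data); the function names are hypothetical and this is not how mnncorrect performs its neighbor search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Squared Euclidean distance between two observations with num_dim coordinates.
inline double squared_distance(const double* a, const double* b, const std::size_t num_dim) {
    double total = 0;
    for (std::size_t d = 0; d < num_dim; ++d) {
        const double delta = a[d] - b[d];
        total += delta * delta;
    }
    return total;
}

// For every column of 'query', find the (up to) k closest columns of 'target'.
inline std::vector<std::vector<std::size_t> > nearest_columns(
    const std::size_t num_dim,
    const std::vector<double>& query,
    const std::vector<double>& target,
    const std::size_t k)
{
    const std::size_t num_query = query.size() / num_dim;
    const std::size_t num_target = target.size() / num_dim;
    const std::size_t keep = std::min(k, num_target);
    std::vector<std::vector<std::size_t> > output(num_query);
    std::vector<std::pair<double, std::size_t> > distances(num_target);
    for (std::size_t q = 0; q < num_query; ++q) {
        for (std::size_t t = 0; t < num_target; ++t) {
            distances[t].first = squared_distance(query.data() + q * num_dim, target.data() + t * num_dim, num_dim);
            distances[t].second = t;
        }
        std::partial_sort(distances.begin(), distances.begin() + keep, distances.end());
        for (std::size_t i = 0; i < keep; ++i) {
            output[q].push_back(distances[i].second);
        }
    }
    return output;
}

// Hypothetical helper: report (i, j) when column j of 'second' is among the
// k nearest neighbors of column i of 'first', and vice versa.
inline std::vector<std::pair<std::size_t, std::size_t> > find_mnn_pairs(
    const std::size_t num_dim,
    const std::vector<double>& first,
    const std::vector<double>& second,
    const std::size_t k)
{
    assert(num_dim > 0);
    const auto first_to_second = nearest_columns(num_dim, first, second, k);
    const auto second_to_first = nearest_columns(num_dim, second, first, k);
    std::vector<std::pair<std::size_t, std::size_t> > pairs;
    for (std::size_t i = 0; i < first_to_second.size(); ++i) {
        for (const auto j : first_to_second[i]) {
            const auto& back = second_to_first[j];
            if (std::find(back.begin(), back.end(), i) != back.end()) {
                pairs.emplace_back(i, j);
            }
        }
    }
    return pairs;
}
```

With one-dimensional batches {0, 10} and {0.5, 100} and k = 1, only the observations at 0 and 0.5 choose each other, so a single pair is reported.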
We consider one batch to be the "reference" and the other to be the "target", where the aim is to correct the latter to the former. Each MNN pair defines a correction vector that moves the target observation towards its paired reference observation. For each observation in the target batch, we identify the closest observation in the same batch that is part of an MNN pair (i.e., an "MNN-involved observation"). We apply that pair's correction vector to the observation to obtain its corrected coordinates.
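The correction step described above might be sketched as follows. This is a simplified illustration, not the library's implementation: the Correction struct and apply_closest_correction function are hypothetical, each MNN-involved target observation is assumed to carry exactly one precomputed correction vector, and the closest MNN-involved observation is found by exhaustive search on the uncorrected coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical record: an MNN-involved observation in the target batch and
// the correction vector (length num_dim) derived from its MNN pair.
struct Correction {
    std::size_t target_index;
    std::vector<double> shift;
};

// Shift every target observation by the correction vector of its closest
// MNN-involved observation. 'target' is column-major with num_dim rows.
inline void apply_closest_correction(
    const std::size_t num_dim,
    std::vector<double>& target,
    const std::vector<Correction>& corrections)
{
    assert(!corrections.empty());
    const std::size_t num_obs = target.size() / num_dim;
    const std::vector<double> original = target; // distances use uncorrected values

    for (std::size_t obs = 0; obs < num_obs; ++obs) {
        std::size_t best = 0;
        double best_distance = std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < corrections.size(); ++c) {
            const double* a = original.data() + obs * num_dim;
            const double* b = original.data() + corrections[c].target_index * num_dim;
            double distance = 0;
            for (std::size_t d = 0; d < num_dim; ++d) {
                distance += (a[d] - b[d]) * (a[d] - b[d]);
            }
            if (distance < best_distance) {
                best_distance = distance;
                best = c;
            }
        }
        for (std::size_t d = 0; d < num_dim; ++d) {
            target[obs * num_dim + d] += corrections[best].shift[d];
        }
    }
}
```

In a one-dimensional target batch {0, 1, 10, 11} where observation 0 carries a shift of +5 and observation 3 carries a shift of -2, the first two observations move by +5 and the last two by -2.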
Each MNN pair's correction vector is computed between the "center of mass" locations for the paired observations. The center of mass for each observation is defined by recursively searching the neighbors of each MNN-involved observation (and then the neighbors of those neighbors, up to a recursion depth of Options::num_steps) and computing the mean of their coordinates. This improves the correction by mitigating the "kissing effect", i.e., where the correction vectors only form between the surfaces of the mass of points in each batch.
With more than two batches, we define a merge order based on Options::merge_policy. For the first batch to be merged, we identify MNN pairs to all other batches at once. The subsequent correction effectively distributes the first batch's observations to all other batches. This process is repeated for all remaining batches until only one batch remains that contains all observations.
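As an illustration of how a merge order could be derived, the sketch below orders batches by increasing size, following the description of MergePolicy::SIZE. The function order_by_size is hypothetical and the tie-breaking by input order is an assumption; the library's own ordering logic may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of mnncorrect): batch indices sorted by
// increasing number of observations, so the smallest batch is corrected
// first and the largest batch comes last. Ties keep the input order.
template<typename Index_>
std::vector<std::size_t> order_by_size(const std::vector<Index_>& num_obs) {
    std::vector<std::size_t> order(num_obs.size());
    std::iota(order.begin(), order.end(), static_cast<std::size_t>(0));
    std::stable_sort(order.begin(), order.end(), [&](const std::size_t left, const std::size_t right) {
        return num_obs[left] < num_obs[right];
    });
    assert(order.size() == num_obs.size());
    return order;
}
```

For batch sizes {50, 10, 30}, this yields the order 1, 2, 0.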
Index_ | Integer type of the observation index. |
Float_ | Floating-point type of the input/output data. |
Matrix_ | Class of the input data matrix for the neighbor search. This should satisfy the knncolle::Matrix interface. Alternatively, it may be a knncolle::SimpleMatrix. |
num_dim | Number of dimensions. | |
num_obs | Vector of length equal to the number of batches. The i-th entry contains the number of observations in batch i. | |
[in] | batches | Vector of length equal to the number of batches. The i-th entry points to a column-major dimension-by-observation array containing the uncorrected data for batch i, where the number of rows is equal to num_dim and the number of columns is equal to num_obs[i]. |
[out] | output | Pointer to an array containing a column-major matrix with number of rows equal to num_dim and number of columns equal to the sum of num_obs. On output, the first num_obs[0] columns contain the corrected values of the first batch, the next num_obs[1] columns contain the corrected values of the second batch, and so on. |
options | Further options. |
void mnncorrect::parallelize | ( | const int | num_workers, |
const Task_ | num_tasks, | ||
Run_ | run_task_range ) |
Task_ | Integer type for the number of tasks. |
Run_ | Function to execute a range of tasks. |
num_workers | Number of workers. |
num_tasks | Number of tasks. |
run_task_range | Function to iterate over a range of tasks within a worker. |
By default, this is an alias to subpar::parallelize_range(). However, if the MNNCORRECT_CUSTOM_PARALLEL function-like macro is defined, it is called instead. Any user-defined macro should accept the same arguments as subpar::parallelize_range().
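A user-defined replacement might look like the sketch below. The function my_serial_parallelize is hypothetical, and two points are assumptions rather than statements from this page: that run_task_range is called with a worker ID, the first task index and the number of tasks, and that the macro must be defined before the mnncorrect headers are included.

```cpp
#include <cassert>
#include <vector>

// Hypothetical serial replacement that runs every task in a single worker.
template<typename Task_, class Run_>
void my_serial_parallelize(const int num_workers, const Task_ num_tasks, Run_ run_task_range) {
    assert(num_workers >= 1);
    // Assumed callback convention: (worker ID, first task, number of tasks).
    run_task_range(0, static_cast<Task_>(0), num_tasks);
}

// Function-like macro taking the same three arguments as parallelize().
// Assumption: this definition precedes any #include of the mnncorrect headers.
#define MNNCORRECT_CUSTOM_PARALLEL(num_workers, num_tasks, run_task_range) \
    my_serial_parallelize(num_workers, num_tasks, run_task_range)
```

Passing the callback as a named variable rather than an inline lambda avoids problems with commas inside macro arguments.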