|
template<typename Index_ , class GetNeighbors_ , class GetIndex_ , class GetMaxDistance_ > |
void | compute (Index_ num_obs, GetNeighbors_ get_neighbors, GetIndex_ get_index, GetMaxDistance_ get_max_distance, const Options &options, std::vector< Index_ > &selected) |
|
template<typename Index_ , typename Distance_ > |
std::vector< Index_ > | compute (const knncolle::NeighborList< Index_, Distance_ > &neighbors, const Options &options) |
|
template<typename Dim_ , typename Index_ , typename Float_ > |
std::vector< Index_ > | compute (const knncolle::Prebuilt< Dim_, Index_, Float_ > &prebuilt, const Options &options) |
|
template<typename Dim_ , typename Index_ , typename Value_ , typename Float_ > |
std::vector< Index_ > | compute (Dim_ num_dims, Index_ num_obs, const Value_ *data, const knncolle::Builder< knncolle::SimpleMatrix< Dim_, Index_, Value_ >, Float_ > &knn_method, const Options &options) |
|
Nearest-neighbors subsampling.
template<typename Index_ , class GetNeighbors_ , class GetIndex_ , class GetMaxDistance_ >
void nenesub::compute |
( |
Index_ |
num_obs, |
|
|
GetNeighbors_ |
get_neighbors, |
|
|
GetIndex_ |
get_index, |
|
|
GetMaxDistance_ |
get_max_distance, |
|
|
const Options & |
options, |
|
|
std::vector< Index_ > & |
selected |
|
) |
| |
This function generates a deterministic subsampling of a dataset based on nearest neighbors. We first identify the \(k\)-nearest neighbors of each observation and use that to define its local neighborhood. We select an observation for subsampling if it:
- Does not belong in the local neighborhood of any previously selected observation.
- Has the most neighbors that are not selected or in the local neighborhoods of previously selected observations. Ties are broken using the smallest distance to each observation's \(k\)-th neighbor (i.e., the densest region of space).
- Has at least \(m\) neighbors that are not selected or in the local neighborhoods of any other selected observation.
We repeat this process until there are no more observations that satisfy these requirements.
Each selected observation effectively serves as a representative for up to \(k\) of its nearest neighbors. As such, the rate of subsampling is roughly proportional to the choice of \(k\). A non-zero \(m\) ensures that there are enough neighbors to warrant the selection of an observation, to protect against overrepresentation of outlier points that are not in any observation's neighborhood. Some testing suggests that the dataset is subsampled by a factor of \(k\), though this can increase or decrease for smaller or larger \(m\), respectively.
The nenesub approach ensures that the subsampled points are well-distributed across the dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. In contrast, random sampling does not provide strong guarantees for capture of a rare subpopulation. We also preserve the relative density across the dataset as more representatives will be generated from high-density regions. This simplifies the interpretation of analysis results generated from the subsetted dataset.
- Template Parameters
-
Index_ | Integer type for the observation indices. |
GetNeighbors_ | Function that accepts an Index_ index and returns a (const reference to a) container-like object. The container should be support the [] operator and have a size() method. |
GetIndex_ | Function that accepts a (const reference to a) container of the type returned by GetNeighbors_ and an Index_ into that container, and returns Index_ . |
GetNeighbors_ | Function that accepts an Index_ index and returns a distance value, typically floating-point. |
- Parameters
-
| num_obs | Number of observations in the dataset. |
| get_neighbors | Function that accepts an integer observation index in [0, num_obs) and returns a container of that observation's neighbors. Each element of the container specifies the index of a neighboring observation. It is generally expected that the returned containers have the same size for all indices. |
| get_index | Function to return the index of each neighbor, given the container returned by get_neighbors and an index into that container. |
| get_max_distance | Function that accepts an integer observation index in [0, num_obs) and returns the distance from that observation to its furthest neighbor. |
| options | Further options. Note that Options::num_neighbors and Options::num_threads are ignored here. |
[out] | selected | On output, the indices of the observations that were subsampled. These are sorted in ascending order. |