template<typename Index_ , class GetNeighbors_ , class GetIndex_ , class GetMaxDistance_ >
void nenesub::compute |
( |
const Index_ | num_obs, |
|
|
const GetNeighbors_ | get_neighbors, |
|
|
const GetIndex_ | get_index, |
|
|
const GetMaxDistance_ | get_max_distance, |
|
|
const Options & | options, |
|
|
std::vector< Index_ > & | selected ) |
Deterministically subsample a dataset based on nearest neighbors.
We use the \(k\)-nearest neighbors of each observation to define its local neighborhood. We select an observation in the subsampled dataset if it:
- Does not belong in the local neighborhood (i.e., is not a neighbor) of any previously selected observation.
- Has at least \(m\) neighbors that are not selected or in the local neighborhoods of any other selected observation.
- Has the most neighbors that are not selected or in the local neighborhoods of previously selected observations, out of all observations that satisfy (1) and (2). Ties are broken using the smallest distance to each observation's \(k\)-th neighbor, i.e., it lies in the densest region of space.
We repeat this process until there are no more observations that satisfy these requirements.
Each selected observation effectively serves as a representative for up to \(k\) of its nearest neighbors. As such, the rate of subsampling is roughly proportional to the choice of \(k\). A non-zero \(m\) ensures that there are enough neighbors to warrant the selection of an observation, to protect against overrepresentation of outlier points that are not in any observation's neighborhood. Some testing suggests that the dataset is subsampled by a factor of \(k\), though this can increase or decrease for smaller or larger \(m\), respectively.
The nenesub approach ensures that the subsampled points are well-distributed across the dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. In contrast, random sampling does not provide strong guarantees for capture of a rare subpopulation. We also preserve the relative density across the dataset as more representatives will be generated from high-density regions. This simplifies the interpretation of analysis results generated from the subsampled dataset.
More aggressive subsampling can be achieved by recursively applying the nenesub method, i.e., subsample the subsampled dataset. This allows users to achieve higher subsampling rates without long compute times from a very large \(k\).
- Template Parameters
-
Index_ | Integer type of the observation indices. |
GetNeighbors_ | Function that accepts an Index_ index and returns a (const reference to a) container-like object. The container should implement the [] operator and have a size() method. |
GetIndex_ | Function that accepts a (const reference to a) container of the type returned by GetNeighbors_ and an Index_ into that container, and returns Index_ . |
GetMaxDistance_ | Function that accepts an Index_ index and returns a distance, typically floating-point. |
- Parameters
-
| num_obs | Number of observations in the dataset. |
| get_neighbors | Function that accepts an integer observation index in [0, num_obs) and returns a container of that observation's neighbors. Each element of the container specifies the index of a neighboring observation. |
| get_index | Function to return the index of each neighbor, given the container returned by get_neighbors and a position in that container. |
| get_max_distance | Function that accepts an integer observation index in [0, num_obs) and returns the distance from that observation to its furthest neighbor. |
| options | Further options. Note that Options::num_neighbors and Options::num_threads are ignored here. |
[out] | selected | On output, the indices of the observations that were subsampled. These are sorted in ascending order. |