nenesub
Nearest neighbors subsampling
Loading...
Searching...
No Matches
nenesub Namespace Reference

Nearest-neighbors subsampling. More...

Classes

struct  Options
 Options for compute(). More...
 

Functions

template<typename Index_ , class GetNeighbors_ , class GetIndex_ , class GetMaxDistance_ >
void compute (const Index_ num_obs, const GetNeighbors_ get_neighbors, const GetIndex_ get_index, const GetMaxDistance_ get_max_distance, const Options &options, std::vector< Index_ > &selected)
 
template<typename Index_ , typename Distance_ >
std::vector< Index_ > compute (const knncolle::NeighborList< Index_, Distance_ > &neighbors, const Options &options)
 
template<typename Index_ , typename Input_ , typename Distance_ >
std::vector< Index_ > compute (const knncolle::Prebuilt< Index_, Input_, Distance_ > &prebuilt, const Options &options)
 
template<typename Index_ , typename Input_ , typename Distance_ , class Matrix_ = knncolle::Matrix<Index_, Input_>>
std::vector< Index_ > compute (const std::size_t num_dims, const Index_ num_obs, const Input_ *data, const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > &knn_method, const Options &options)
 

Detailed Description

Nearest-neighbors subsampling.

Function Documentation

◆ compute() [1/4]

template<typename Index_ , class GetNeighbors_ , class GetIndex_ , class GetMaxDistance_ >
void nenesub::compute ( const Index_ num_obs,
const GetNeighbors_ get_neighbors,
const GetIndex_ get_index,
const GetMaxDistance_ get_max_distance,
const Options & options,
std::vector< Index_ > & selected )

Deterministically subsample a dataset based on nearest neighbors.

We use the \(k\)-nearest neighbors of each observation to define its local neighborhood. We select an observation in the subsampled dataset if it:

  1. Does not belong in the local neighborhood (i.e., is not a neighbor) of any previously selected observation.
  2. Has at least \(m\) neighbors that are not selected or in the local neighborhoods of any other selected observation.
  3. Has the most neighbors that are not selected or in the local neighborhoods of previously selected observations, out of all observations that satisfy (1) and (2). Ties are broken using the smallest distance to each observation's \(k\)-th neighbor, i.e., it lies in the densest region of space.

We repeat this process until there are no more observations that satisfy these requirements.

Each selected observation effectively serves as a representative for up to \(k\) of its nearest neighbors. As such, the rate of subsampling is roughly proportional to the choice of \(k\). A non-zero \(m\) ensures that there are enough neighbors to warrant the selection of an observation, to protect against overrepresentation of outlier points that are not in any observation's neighborhood. Some testing suggests that the dataset is subsampled by a factor of \(k\), though this can increase or decrease for smaller or larger \(m\), respectively.

The nenesub approach ensures that the subsampled points are well-distributed across the dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. In contrast, random sampling does not provide strong guarantees for capture of a rare subpopulation. We also preserve the relative density across the dataset as more representatives will be generated from high-density regions. This simplifies the interpretation of analysis results generated from the subsampled dataset.

More aggressive subsampling can be achieved by recursively applying the nenesub method, i.e., subsample the subsampled dataset. This allows users to achieve higher subsampling rates without long compute times from a very large \(k\).

Template Parameters
Index_Integer type of the observation indices.
GetNeighbors_Function that accepts an Index_ index and returns a (const reference to a) container-like object. The container should implement the [] operator and have a size() method.
GetIndex_Function that accepts a (const reference to a) container of the type returned by GetNeighbors_ and an Index_ into that container, and returns Index_.
GetMaxDistance_Function that accepts an Index_ index and returns a distance, typically floating-point.
Parameters
num_obsNumber of observations in the dataset.
get_neighborsFunction that accepts an integer observation index in [0, num_obs) and returns a container of that observation's neighbors. Each element of the container specifies the index of a neighboring observation.
get_indexFunction to return the index of each neighbor, given the container returned by get_neighbors and a position in that container.
get_max_distanceFunction that accepts an integer observation index in [0, num_obs) and returns the distance from that observation to its furthest neighbor.
optionsFurther options. Note that Options::num_neighbors and Options::num_threads are ignored here.
[out]selectedOn output, the indices of the observations that were subsampled. These are sorted in ascending order.

◆ compute() [2/4]

template<typename Index_ , typename Distance_ >
std::vector< Index_ > nenesub::compute ( const knncolle::NeighborList< Index_, Distance_ > & neighbors,
const Options & options )

Overload of compute() to enable convenient usage with pre-computed neighbors from knncolle.

Template Parameters
Index_Integer type of the neighbor indices.
Distance_Floating-point type of the distances.
Parameters
neighborsVector of nearest-neighbor search results for each observation. Each entry is a vector containing (index, distance) pairs for all neighbors, should by increasing distance. The same number of neighbors should be present for each observation.
optionsFurther options. Note that Options::num_neighbors and Options::num_threads are ignored here.
Returns
A sorted vector of the indices of the subsampled observations.

◆ compute() [3/4]

template<typename Index_ , typename Input_ , typename Distance_ >
std::vector< Index_ > nenesub::compute ( const knncolle::Prebuilt< Index_, Input_, Distance_ > & prebuilt,
const Options & options )

Overload of compute() for convenient usage with a prebuilt nearest-neighbor search index from knncolle.

Template Parameters
Dim_Integer type of the dimension index.
Index_Integer type of the observation index.
Input_Numeric type of the input data used to build the search index. This is only required to define the knncolle::Prebuilt class and is otherwise ignored.
Distance_Floating-point type of the distances.
Parameters
[in]prebuiltA prebuilt nearest-neighbor search index on the observations of interest.
optionsFurther options.
Returns
A sorted vector of the indices of the subsampled observations.

◆ compute() [4/4]

template<typename Index_ , typename Input_ , typename Distance_ , class Matrix_ = knncolle::Matrix<Index_, Input_>>
std::vector< Index_ > nenesub::compute ( const std::size_t num_dims,
const Index_ num_obs,
const Input_ * data,
const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > & knn_method,
const Options & options )

Overload of compute() for convenient usage with a column-major array of coordinates for each observation.

Template Parameters
Dim_Integer type of the dimension index.
Index_Integer type of the observation index.
Input_Numeric type of the input data.
Distance_Floating-point type of the distances.
Matrix_Class of the input data matrix for the neighbor search. This should satisfy the knncolle::Matrix interface.
Parameters
num_dimsNumber of dimensions for the observation coordinates.
num_obsNumber of observations in the dataset.
[in]dataPointer to a num_dims-by-num_observations column-major array of observation coordinates where rows are dimensions and columns are observations.
knn_methodSpecification of the nearest-neighbor search algorithm, e.g., knncolle::VptreeBuilder, knncolle::KmknnBuilder.
optionsFurther options.
Returns
A sorted vector of the indices of the subsampled observations.