nenesub
Nearest neighbors subsampling
Loading...
Searching...
No Matches
Classes | Functions
nenesub Namespace Reference

Nearest-neighbors subsampling. More...

Classes

struct  Options
 Options for compute(). More...
 

Functions

template<typename Index_ , class GetNeighbors_ , class GetIndex_ , class GetMaxDistance_ >
void compute (Index_ num_obs, GetNeighbors_ get_neighbors, GetIndex_ get_index, GetMaxDistance_ get_max_distance, const Options &options, std::vector< Index_ > &selected)
 
template<typename Index_ , typename Distance_ >
std::vector< Index_ > compute (const knncolle::NeighborList< Index_, Distance_ > &neighbors, const Options &options)
 
template<typename Dim_ , typename Index_ , typename Float_ >
std::vector< Index_ > compute (const knncolle::Prebuilt< Dim_, Index_, Float_ > &prebuilt, const Options &options)
 
template<typename Dim_ , typename Index_ , typename Value_ , typename Float_ >
std::vector< Index_ > compute (Dim_ num_dims, Index_ num_obs, const Value_ *data, const knncolle::Builder< knncolle::SimpleMatrix< Dim_, Index_, Value_ >, Float_ > &knn_method, const Options &options)
 

Detailed Description

Nearest-neighbors subsampling.

Function Documentation

◆ compute() [1/4]

template<typename Index_ , class GetNeighbors_ , class GetIndex_ , class GetMaxDistance_ >
void nenesub::compute ( Index_  num_obs,
GetNeighbors_  get_neighbors,
GetIndex_  get_index,
GetMaxDistance_  get_max_distance,
const Options options,
std::vector< Index_ > &  selected 
)

This function generates a deterministic subsampling of a dataset based on nearest neighbors. We first identify the \(k\)-nearest neighbors of each observation and use that to define its local neighborhood. We select an observation for subsampling if it:

  • Does not belong in the local neighborhood of any previously selected observation.
  • Has the most neighbors that are not selected or in the local neighborhoods of previously selected observations. Ties are broken using the smallest distance to each observation's \(k\)-th neighbor (i.e., the densest region of space).
  • Has at least \(m\) neighbors that are not selected or in the local neighborhoods of any other selected observation.

We repeat this process until there are no more observations that satisfy these requirements.

Each selected observation effectively serves as a representative for up to \(k\) of its nearest neighbors. As such, the rate of subsampling is roughly proportional to the choice of \(k\). A non-zero \(m\) ensures that there are enough neighbors to warrant the selection of an observation, to protect against overrepresentation of outlier points that are not in any observation's neighborhood. Some testing suggests that the dataset is subsampled by a factor of \(k\), though this can increase or decrease for smaller or larger \(m\), respectively.

The nenesub approach ensures that the subsampled points are well-distributed across the dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. In contrast, random sampling does not provide strong guarantees for capture of a rare subpopulation. We also preserve the relative density across the dataset as more representatives will be generated from high-density regions. This simplifies the interpretation of analysis results generated from the subsetted dataset.

Template Parameters
Index_Integer type for the observation indices.
GetNeighbors_Function that accepts an Index_ index and returns a (const reference to a) container-like object. The container should be support the [] operator and have a size() method.
GetIndex_Function that accepts a (const reference to a) container of the type returned by GetNeighbors_ and an Index_ into that container, and returns Index_.
GetNeighbors_Function that accepts an Index_ index and returns a distance value, typically floating-point.
Parameters
num_obsNumber of observations in the dataset.
get_neighborsFunction that accepts an integer observation index in [0, num_obs) and returns a container of that observation's neighbors. Each element of the container specifies the index of a neighboring observation. It is generally expected that the returned containers have the same size for all indices.
get_indexFunction to return the index of each neighbor, given the container returned by get_neighbors and an index into that container.
get_max_distanceFunction that accepts an integer observation index in [0, num_obs) and returns the distance from that observation to its furthest neighbor.
optionsFurther options. Note that Options::num_neighbors and Options::num_threads are ignored here.
[out]selectedOn output, the indices of the observations that were subsampled. These are sorted in ascending order.

◆ compute() [2/4]

template<typename Index_ , typename Distance_ >
std::vector< Index_ > nenesub::compute ( const knncolle::NeighborList< Index_, Distance_ > &  neighbors,
const Options options 
)

Overload to enable convenient usage with pre-computed neighbors from knncolle.

Template Parameters
Index_Integer type for the neighbor indices.
Distance_Floating-point type for the distances.
Parameters
neighborsVector of nearest-neighbor search results for each observation. Each entry is a pair containing a vector of neighbor indices and a vector of distances to those neighbors. Neighbors should be sorted by increasing distance. The same number of neighbors should be present for each observation.
optionsFurther options. Note that Options::num_neighbors and Options::num_threads are ignored here.
Returns
A sorted vector of the indices of the subsampled observations.

◆ compute() [3/4]

template<typename Dim_ , typename Index_ , typename Float_ >
std::vector< Index_ > nenesub::compute ( const knncolle::Prebuilt< Dim_, Index_, Float_ > &  prebuilt,
const Options options 
)

Overload to enable convenient usage with a prebuilt nearest-neighbor search index from knncolle.

Template Parameters
Dim_Integer type for the dimension index.
Index_Integer type for the observation index.
Float_Floating-point type for the distances.
Parameters
[in]prebuiltA prebuilt nearest-neighbor search index on the observations of interest.
optionsFurther options.
Returns
A sorted vector of the indices of the subsampled observations.

◆ compute() [4/4]

template<typename Dim_ , typename Index_ , typename Value_ , typename Float_ >
std::vector< Index_ > nenesub::compute ( Dim_  num_dims,
Index_  num_obs,
const Value_ *  data,
const knncolle::Builder< knncolle::SimpleMatrix< Dim_, Index_, Value_ >, Float_ > &  knn_method,
const Options options 
)

Overload to enable convenient usage with a column-major array of coordinates for each observation.

Template Parameters
Dim_Integer type for the dimension index.
Index_Integer type for the observation index.
Value_Numeric type for the input data.
Float_Floating-point type for the distances.
Parameters
num_dimsNumber of dimensions for the observation coordinates.
num_obsNumber of observations in the dataset.
[in]dataPointer to a num_dims-by-num_observations column-major array of observation coordinates where rows are dimensions and columns are observations.
knn_methodSpecification of the nearest-neighbor search algorithm, e.g., knncolle::VptreeBuilder, knncolle::KmknnBuilder.
optionsFurther options.
Returns
A sorted vector of the indices of the subsampled observations.