Nearest-neighbors subsampling. More...

Classes
struct	Options
	Options for `compute()`. More...

Functions
template<typename Index_ , class GetNeighbors_ , class GetIndex_ , class GetMaxDistance_ >
void	compute (const Index_ num_obs, const GetNeighbors_ get_neighbors, const GetIndex_ get_index, const GetMaxDistance_ get_max_distance, const Options &options, std::vector< Index_ > &selected)

template<typename Index_ , typename Distance_ >
std::vector< Index_ >	compute (const knncolle::NeighborList< Index_, Distance_ > &neighbors, const Options &options)

template<typename Index_ , typename Input_ , typename Distance_ >
std::vector< Index_ >	compute (const knncolle::Prebuilt< Index_, Input_, Distance_ > &prebuilt, const Options &options)

template<typename Index_ , typename Input_ , typename Distance_ , class Matrix_ = knncolle::Matrix<Index_, Input_>>
std::vector< Index_ >	compute (const std::size_t num_dims, const Index_ num_obs, const Input_ *data, const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > &knn_method, const Options &options)

Detailed Description

Nearest-neighbors subsampling.

Function Documentation

◆ compute() [1/4]

template<typename Index_ , class GetNeighbors_ , class GetIndex_ , class GetMaxDistance_ >

void nenesub::compute	(	const Index_	num_obs,
		const GetNeighbors_	get_neighbors,
		const GetIndex_	get_index,
		const GetMaxDistance_	get_max_distance,
		const Options &	options,
		std::vector< Index_ > &	selected )

Deterministically subsample a dataset based on nearest neighbors.

We use the \(k\)-nearest neighbors of each observation to define its local neighborhood. We select an observation in the subsampled dataset if it:

Does not belong in the local neighborhood (i.e., is not a neighbor) of any previously selected observation.
Has at least \(m\) neighbors that are not selected or in the local neighborhoods of any other selected observation.
Has the most neighbors that are not selected or in the local neighborhoods of previously selected observations, out of all observations that satisfy (1) and (2). Ties are broken using the smallest distance to each observation's \(k\)-th neighbor, i.e., it lies in the densest region of space.

We repeat this process until there are no more observations that satisfy these requirements.

Each selected observation effectively serves as a representative for up to \(k\) of its nearest neighbors. As such, the rate of subsampling is roughly proportional to the choice of \(k\). A non-zero \(m\) ensures that there are enough neighbors to warrant the selection of an observation, to protect against overrepresentation of outlier points that are not in any observation's neighborhood. Some testing suggests that the dataset is subsampled by a factor of \(k\), though this can increase or decrease for smaller or larger \(m\), respectively.

The nenesub approach ensures that the subsampled points are well-distributed across the dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. In contrast, random sampling does not provide strong guarantees for capture of a rare subpopulation. We also preserve the relative density across the dataset as more representatives will be generated from high-density regions. This simplifies the interpretation of analysis results generated from the subsampled dataset.

More aggressive subsampling can be achieved by recursively applying the nenesub method, i.e., subsample the subsampled dataset. This allows users to achieve higher subsampling rates without long compute times from a very large \(k\).

Template Parameters

Index_	Integer type of the observation indices.
GetNeighbors_	Function that accepts an `Index_` index and returns a (const reference to a) container-like object. The container should implement the `[]` operator and have a `size()` method.
GetIndex_	Function that accepts a (const reference to a) container of the type returned by `GetNeighbors_` and an `Index_` into that container, and returns `Index_`.
GetMaxDistance_	Function that accepts an `Index_` index and returns a distance, typically floating-point.

Parameters

	num_obs	Number of observations in the dataset.
	get_neighbors	Function that accepts an integer observation index in `[0, num_obs)` and returns a container of that observation's neighbors. Each element of the container specifies the index of a neighboring observation.
	get_index	Function to return the index of each neighbor, given the container returned by `get_neighbors` and a position in that container.
	get_max_distance	Function that accepts an integer observation index in `[0, num_obs)` and returns the distance from that observation to its furthest neighbor.
	options	Further options. Note that `Options::num_neighbors` and `Options::num_threads` are ignored here.
[out]	selected	On output, the indices of the observations that were subsampled. These are sorted in ascending order.

◆ compute() [2/4]

template<typename Index_ , typename Distance_ >

std::vector< Index_ > nenesub::compute	(	const knncolle::NeighborList< Index_, Distance_ > &	neighbors,
		const Options &	options )

Overload of compute() to enable convenient usage with pre-computed neighbors from knncolle.

Template Parameters

Index_	Integer type of the neighbor indices.
Distance_	Floating-point type of the distances.

Parameters

neighbors	Vector of nearest-neighbor search results for each observation. Each entry is a vector containing (index, distance) pairs for all neighbors, should by increasing distance. The same number of neighbors should be present for each observation.
options	Further options. Note that `Options::num_neighbors` and `Options::num_threads` are ignored here.

Returns: A sorted vector of the indices of the subsampled observations.

◆ compute() [3/4]

template<typename Index_ , typename Input_ , typename Distance_ >

std::vector< Index_ > nenesub::compute	(	const knncolle::Prebuilt< Index_, Input_, Distance_ > &	prebuilt,
		const Options &	options )

Overload of compute() for convenient usage with a prebuilt nearest-neighbor search index from knncolle.

Template Parameters

Dim_	Integer type of the dimension index.
Index_	Integer type of the observation index.
Input_	Numeric type of the input data used to build the search index. This is only required to define the `knncolle::Prebuilt` class and is otherwise ignored.
Distance_	Floating-point type of the distances.

Parameters

[in]	prebuilt	A prebuilt nearest-neighbor search index on the observations of interest.
	options	Further options.

Returns: A sorted vector of the indices of the subsampled observations.

◆ compute() [4/4]

template<typename Index_ , typename Input_ , typename Distance_ , class Matrix_ = knncolle::Matrix<Index_, Input_>>

std::vector< Index_ > nenesub::compute	(	const std::size_t	num_dims,
		const Index_	num_obs,
		const Input_ *	data,
		const knncolle::Builder< Index_, Input_, Distance_, Matrix_ > &	knn_method,
		const Options &	options )

Overload of compute() for convenient usage with a column-major array of coordinates for each observation.

Template Parameters

Dim_	Integer type of the dimension index.
Index_	Integer type of the observation index.
Input_	Numeric type of the input data.
Distance_	Floating-point type of the distances.
Matrix_	Class of the input data matrix for the neighbor search. This should satisfy the `knncolle::Matrix` interface.

Parameters

	num_dims	Number of dimensions for the observation coordinates.
	num_obs	Number of observations in the dataset.
[in]	data	Pointer to a `num_dims`-by-`num_observations` column-major array of observation coordinates where rows are dimensions and columns are observations.
	knn_method	Specification of the nearest-neighbor search algorithm, e.g., `knncolle::VptreeBuilder`, `knncolle::KmknnBuilder`.
	options	Further options.

Returns: A sorted vector of the indices of the subsampled observations.

Classes

Functions

Detailed Description

Function Documentation

◆ compute() [1/4]

◆ compute() [2/4]

◆ compute() [3/4]

◆ compute() [4/4]