nenesub
Nearest neighbors subsampling
Loading...
Searching...
No Matches
Nearest-neighbors subsampling

Unit tests Documentation Codecov

Overview

nenesub implements a simple algorithm for deterministic subsampling of a dataset based on nearest neighbors. Starting from dense regions of the high-dimensional space, we select an observation for inclusion into the subsampled set. Every time we select an observation, we remove it and all of its nearest neighbors from the dataset. We then select the next observation with the most remaining neighbors, with ties broken by density; this is repeated until there are no more observations.

The general idea is that each selected observation serves as a representative for its nearest neighbors. This ensures that the subsampled points are well-distributed across the dataset. Low-frequency subpopulations will always have at least a few representatives if they are sufficiently distant from other subpopulations. We also preserve the relative density across the dataset as more representatives will be generated from high-density regions.

Quick start

Given a column-major array of coordinates (possibly in some low-dimensional space), we can subsample the observations using their nearest neighbors:

size_t nobs = 1000;
size_t ndims = 100;
std::vector<double> coordinates(ndims * nobs);
// Fill it with some coordinates...
opt.num_neighbors = 20;
opt.min_remaining = 10;
opt.num_threads = 3;
auto selected = nenesub::compute(
ndims,
nobs,
coordinates.data(),
knncolle::VptreeBuilder<>(), // any NN algorithm can be used here.
opt
);
void compute(Index_ num_obs, GetNeighbors_ get_neighbors, GetIndex_ get_index, GetMaxDistance_ get_max_distance, const Options &options, std::vector< Index_ > &selected)
Definition nenesub.hpp:87
Nearest-neighbors subsampling.
Options for compute().
Definition nenesub.hpp:24
int num_neighbors
Definition nenesub.hpp:29
int num_threads
Definition nenesub.hpp:42
int min_remaining
Definition nenesub.hpp:35

Alternatively, we can supply a precomputed list of neighbors:

// Getting the nearest neighbors:
knncolle::SimpleMatrix kmatrix(ndims, nobs, coordinates.data());
auto prebuilt = algorithm.build_unique(kmatrix);
auto neighbors = knncolle::find_nearest_neighbors(*prebuilt, /* k = */ 20);
auto selected = nenesub::compute(neighbors, opt);
std::unique_ptr< Prebuilt< typename Matrix_::dimension_type, typename Matrix_::index_type, Float_ > > build_unique(const Matrix_ &data) const
NeighborList< Index_, Float_ > find_nearest_neighbors(const Prebuilt< Dim_, Index_, Float_ > &index, int k, int num_threads=1)

Check out the reference documentation for more details.

Building projects

CMake with FetchContent

If you're using CMake, you just need to add something like this to your CMakeLists.txt:

include(FetchContent)
FetchContent_Declare(
nenesub
GIT_REPOSITORY https://github.com/libscran/nenesub
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(nenesub)

Then you can link to nenesub to make the headers available during compilation:

# For executables:
target_link_libraries(myexe libscran::nenesub)
# For libaries
target_link_libraries(mylib INTERFACE libscran::nenesub)

CMake with find_package()

find_package(libscran_nenesub CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE libscran::nenesub)

To install the library, use:

mkdir build && cd build
cmake .. -DNENESUB_TESTS=OFF
cmake --build . --target install

By default, this will use FetchContent to fetch all external dependencies. If you want to install them manually, use -DNENESUB_FETCH_EXTERN=OFF. See the tags in extern/CMakeLists.txt to find compatible versions of each dependency.

Manual

If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This requires the external dependencies listed in extern/CMakeLists.txt, which also need to be made available during compilation.