partisub
Subsampling within partitions

Overview

Consider a dataset where observations are grouped into discrete partitions, e.g., clusters or factors. We would like to sample a subset of observations for further analysis, typically for time-consuming steps where the full dataset would be too large. In doing so, we would also like to preserve the distribution of observations across partitions within the subset. This improves the relevance of the subset's results when extrapolated to the full dataset.

partisub implements a simple algorithm for subsampling a dataset with user-defined partitions. It subsamples observations within each partition to minimize the effect of sampling noise on the relative frequencies of partitions in the subset. Optionally, it can also force every partition to be represented by at least one observation. This is useful for guaranteeing the presence of low-frequency partitions, e.g., rare cell types in single-cell applications.
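
To make the idea concrete, here is a minimal standalone sketch of proportional allocation across partitions. It is not partisub's implementation: the function name `sketch_subsample`, the `force_non_empty` flag, the largest-remainder rounding and the evenly spaced (non-random) picks are all assumptions chosen for illustration. When forcing rare partitions, the sketch may return slightly more than `target` observations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Hypothetical helper, NOT part of partisub: selects roughly `target` observations
// so that each partition's share of the subset tracks its share of the full dataset.
std::vector<int> sketch_subsample(const std::vector<int>& partition, int target, bool force_non_empty = true) {
    assert(target >= 0);
    const int nobs = static_cast<int>(partition.size());
    target = std::min(target, nobs);

    std::vector<int> selected;
    if (nobs == 0) {
        return selected;
    }

    // Group observation indices by partition identifier.
    std::map<int, std::vector<int> > members;
    for (int i = 0; i < nobs; ++i) {
        members[partition[i]].push_back(i);
    }

    // Start each partition at the floor of its proportional share of `target`.
    struct Quota { const std::vector<int>* obs; int count; double remainder; };
    std::vector<Quota> quotas;
    int assigned = 0;
    for (const auto& kv : members) {
        const double ideal = static_cast<double>(kv.second.size()) * target / nobs;
        int count = static_cast<int>(ideal);
        double remainder = ideal - count;
        if (force_non_empty && count == 0) {
            count = 1; // assumption: rare partitions get exactly one observation.
            remainder = 0;
        }
        quotas.push_back({ &kv.second, count, remainder });
        assigned += count;
    }

    // Give leftover slots to the partitions with the largest fractional remainders.
    std::vector<std::size_t> order(quotas.size());
    std::iota(order.begin(), order.end(), static_cast<std::size_t>(0));
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return quotas[a].remainder > quotas[b].remainder;
    });
    for (std::size_t o : order) {
        if (assigned >= target) {
            break;
        }
        if (quotas[o].remainder > 0) {
            ++quotas[o].count;
            ++assigned;
        }
    }

    // Evenly spaced picks stand in for random sampling to keep the sketch deterministic.
    for (const auto& q : quotas) {
        const std::vector<int>& obs = *q.obs;
        for (int k = 0; k < q.count; ++k) {
            selected.push_back(obs[static_cast<std::size_t>(k) * obs.size() / q.count]);
        }
    }
    std::sort(selected.begin(), selected.end());
    return selected;
}
```

With two equally sized partitions and a target of 100, this sketch picks 50 observations from each. If one observation is moved into a third partition, that observation is always picked when `force_non_empty` is true and is skipped otherwise.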

Quick start

int nobs = 1000;
std::vector<int> clusters(nobs); // or some type of partition.
std::fill(clusters.begin() + 500, clusters.end(), 1); // first 500 in partition 0, the rest in partition 1.
// Subsampling to 100 observations.
auto selected = partisub::compute(nobs, clusters.data(), 100, {});
// Each partition will be represented by default, even if it is rare:
clusters[0] = 2; // partition 2 now contains a single observation.
auto selected2 = partisub::compute(nobs, clusters.data(), 100, {});

Check out the reference documentation for more details.
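
One way to check that a subset preserves the partition frequencies is to compare each partition's relative frequency in the subset against the full dataset. The sketch below is standalone and does not call partisub. The helper names `partition_frequencies` and `max_frequency_shift` are hypothetical, and the sketch assumes that the subset is supplied as a vector of `int` observation indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Hypothetical helper, NOT part of partisub: relative frequency of each partition
// among the observations listed in `indices`.
std::map<int, double> partition_frequencies(const std::vector<int>& partition, const std::vector<int>& indices) {
    std::map<int, double> freq;
    if (indices.empty()) {
        return freq;
    }
    for (int i : indices) {
        assert(i >= 0 && static_cast<std::size_t>(i) < partition.size());
        freq[partition[i]] += 1.0;
    }
    for (auto& kv : freq) {
        kv.second /= static_cast<double>(indices.size());
    }
    return freq;
}

// Hypothetical helper: largest absolute change in any partition's relative frequency
// between the full dataset and the subset. Zero means the proportions match exactly.
double max_frequency_shift(const std::vector<int>& partition, const std::vector<int>& subset) {
    std::vector<int> all(partition.size());
    std::iota(all.begin(), all.end(), 0);
    const auto full = partition_frequencies(partition, all);
    const auto sub = partition_frequencies(partition, subset);

    double worst = 0;
    for (const auto& kv : full) {
        const auto it = sub.find(kv.first);
        const double in_subset = (it == sub.end() ? 0.0 : it->second);
        worst = std::max(worst, std::abs(kv.second - in_subset));
    }
    return worst;
}
```

A small value from `max_frequency_shift(clusters, selected)` would indicate that the subset tracks the full dataset closely. Forcing rare partitions into the subset can inflate their frequency slightly, so an exact zero should not be expected in general.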

Building projects

CMake with FetchContent

If you're using CMake, you just need to add something like this to your CMakeLists.txt:

include(FetchContent)

FetchContent_Declare(
  partisub
  GIT_REPOSITORY https://github.com/libscran/partisub
  GIT_TAG master # or any version of interest
)

FetchContent_MakeAvailable(partisub)

Then you can link to partisub to make the headers available during compilation:

# For executables:
target_link_libraries(myexe libscran::partisub)
# For libraries:
target_link_libraries(mylib INTERFACE libscran::partisub)

By default, this will use FetchContent to fetch all external dependencies. Applications should consider pinning versions of all dependencies - see extern/CMakeLists.txt for suggested versions. If you want to install them manually, use -DPARTISUB_FETCH_EXTERN=OFF.

CMake with find_package()

find_package(libscran_partisub CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE libscran::partisub)

To install the library, use:

mkdir build && cd build
cmake .. -DPARTISUB_TESTS=OFF
cmake --build . --target install

Again, this will use FetchContent to retrieve dependencies; see the comments above.

Manual

If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This also requires the external dependencies listed in extern/CMakeLists.txt.