partisub
Subsampling within partitions
Consider a dataset where observations are grouped into discrete partitions, e.g., clusters or factor levels. We would like to sample a subset of observations for further analysis, typically for time-consuming steps where the full dataset would be too large. In doing so, we would also like to preserve the distribution of observations across partitions within the subset. This makes the subset's results more relevant when they are extrapolated to the full dataset.
partisub implements a simple algorithm for subsampling a dataset with user-defined partitions. It subsamples observations within each partition to minimize the effect of sampling noise on the relative frequencies of partitions in the subset. Optionally, it can also force partitions to be represented by at least one observation. This is useful for guaranteeing the presence of low-frequency partitions, e.g., rare cell types in single-cell applications.
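To make the idea concrete, here is a minimal C++17 sketch of one way to subsample within partitions. It is not partisub's actual implementation or API: the function names `allocate_quota` and `subsample`, their signatures, and the use of largest-remainder rounding are all assumptions chosen for illustration. Partitions are assumed to be encoded as integers in 0, 1, ..., P-1.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper (not part of partisub): decide how many observations to keep
// from each partition so that the kept counts track the partition frequencies.
// Uses largest-remainder rounding; assumes sizes[i] * target fits in std::size_t.
std::vector<std::size_t> allocate_quota(const std::vector<std::size_t>& sizes, std::size_t target, bool force_nonempty) {
    const std::size_t total = std::accumulate(sizes.begin(), sizes.end(), std::size_t(0));
    std::vector<std::size_t> quota(sizes.size(), 0);
    if (total == 0) {
        return quota;
    }
    target = std::min(target, total);

    // Give each partition the floor of its exact proportional share.
    std::vector<std::size_t> remainder(sizes.size());
    std::size_t assigned = 0;
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        const std::size_t prod = sizes[i] * target;
        quota[i] = prod / total;
        remainder[i] = prod % total;
        assigned += quota[i];
    }

    // Hand the leftover slots to the partitions with the largest remainders.
    std::vector<std::size_t> order(sizes.size());
    std::iota(order.begin(), order.end(), std::size_t(0));
    std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) { return remainder[a] > remainder[b]; });
    for (std::size_t k = 0; assigned < target; ++k) {
        ++quota[order[k]];
        ++assigned;
    }

    // Optionally guarantee one observation per non-empty partition.
    // In this sketch, that may push the subset slightly above the target size.
    if (force_nonempty) {
        for (std::size_t i = 0; i < sizes.size(); ++i) {
            if (sizes[i] > 0 && quota[i] == 0) {
                quota[i] = 1;
            }
        }
    }
    return quota;
}

// Hypothetical driver: returns the sorted indices of the observations to keep.
std::vector<std::size_t> subsample(const std::vector<int>& partition, std::size_t target, bool force_nonempty, unsigned seed) {
    const std::size_t num_partitions = partition.empty() ? 0 : *std::max_element(partition.begin(), partition.end()) + 1;
    std::vector<std::vector<std::size_t> > members(num_partitions);
    for (std::size_t i = 0; i < partition.size(); ++i) {
        members[partition[i]].push_back(i);
    }

    std::vector<std::size_t> sizes;
    for (const auto& m : members) {
        sizes.push_back(m.size());
    }
    const auto quota = allocate_quota(sizes, target, force_nonempty);

    // Sample without replacement inside each partition.
    std::mt19937_64 rng(seed);
    std::vector<std::size_t> keep;
    for (std::size_t p = 0; p < num_partitions; ++p) {
        std::sample(members[p].begin(), members[p].end(), std::back_inserter(keep), quota[p], rng);
    }
    std::sort(keep.begin(), keep.end());
    return keep;
}
```

Because the number of observations taken from each partition is fixed before any random draw, the partition frequencies in the subset do not fluctuate between runs; only the identity of the chosen observations does. With partition sizes of 90, 9 and 1 and a target of 10, this sketch keeps 9, 1 and 0 observations, or 9, 1 and 1 when every partition is forced to be represented.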
Check out the reference documentation for more details.
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to partisub to make the headers available during compilation:
By default, this will use FetchContent to fetch all external dependencies. Applications should consider pinning versions of all dependencies - see extern/CMakeLists.txt for suggested versions. If you want to install them manually, use -DPARTISUB_FETCH_EXTERN=OFF.
find_package()
To install the library, use:
Again, this will use FetchContent to retrieve dependencies, see comments above.
If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This also requires the external dependencies listed in extern/CMakeLists.txt.