partisub
Subsampling within partitions
partisub Namespace Reference

Subsampling in partitions.

Classes

struct  Options
 Options for compute().
 

Functions

template<typename Index_ , typename Partition_ >
void compute (const Index_ num_obs, const Partition_ *partition, const Index_ target, const Options &options, std::vector< Index_ > &output)
 
template<typename Index_ , typename Partition_ >
std::vector< Index_ > compute (const Index_ num_obs, const Partition_ *partition, const Index_ target, const Options &options)
 

Detailed Description

Subsampling in partitions.

Function Documentation

◆ compute() [1/2]

template<typename Index_ , typename Partition_ >
void partisub::compute ( const Index_ num_obs,
const Partition_ * partition,
const Index_ target,
const Options & options,
std::vector< Index_ > & output )

Subsample observations within each partition.

Consider a dataset where observations are grouped into discrete partitions, e.g., clusters or factors. We would like to sample a subset of observations for further analysis, typically for time-consuming steps where the full dataset would be too large to process. compute() creates a subset of approximately the specified target size while ensuring that each partition is represented. Specifically:

  • Each non-empty partition is always represented by at least one of its observations in the sampled subset. This ensures that even small partitions are present in the subset. As a result, the number of observations returned may exceed target if there are many small partitions. This guarantee can be disabled by setting Options::force_non_empty = false.
  • The number of observations sampled from each partition is roughly proportional to the size of the partition. More specifically, the sampling is done within each partition to minimize the effect of sampling noise on the relative partition frequencies. This aims to preserve differences in frequencies across partitions so that the subset accurately reflects the full dataset. Otherwise, any discrepancies may make it difficult to extrapolate the subset's results to the full dataset.
  • All observations are returned if the requested target is greater than or equal to num_obs.
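
The rules above can be illustrated with a minimal sketch of how a per-partition quota might be derived. This is not the library's implementation: the helper name sketch_quotas, the round-to-nearest rule and the use of std::map are all assumptions made for illustration, and target is assumed to be non-negative.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical helper, not part of partisub. Returns the number of
// observations to keep from each partition under the rules described above.
template<typename Index_, typename Partition_>
std::map<Partition_, Index_> sketch_quotas(
    const Index_ num_obs,
    const Partition_* partition,
    const Index_ target,
    const bool force_non_empty)
{
    assert(partition != nullptr || num_obs == 0);

    // Tabulate the size of every partition that actually occurs.
    std::map<Partition_, Index_> sizes;
    for (Index_ i = 0; i < num_obs; ++i) {
        ++sizes[partition[i]];
    }

    // Keep everything if the target covers the whole dataset.
    if (target >= num_obs) {
        return sizes;
    }

    std::map<Partition_, Index_> quotas;
    for (const auto& kv : sizes) {
        // Proportional share of the target, rounded to the nearest integer
        // (an assumed rounding rule).
        const double share = static_cast<double>(kv.second) *
            static_cast<double>(target) / static_cast<double>(num_obs);
        Index_ quota = static_cast<Index_>(share + 0.5);
        if (quota > kv.second) {
            quota = kv.second; // cannot keep more than the partition holds
        }
        if (force_non_empty && quota == 0) {
            quota = 1; // every non-empty partition keeps at least one
        }
        quotas[kv.first] = quota;
    }
    return quotas;
}
```

Under these assumptions, 100 observations split 90/9/1 across three partitions with a target of 10 give quotas of 9, 1 and 1. The total of 11 exceeds the target because the smallest partition is forced to keep one observation; with force_non_empty set to false in the sketch, its quota is 0.
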
Template Parameters
  Index_       Integer type of the observation indices.
  Partition_   Integer type of the partition assignments.
Parameters
  num_obs          Number of observations. This should be non-negative.
  [in]  partition  Pointer to an array of length num_obs containing partition assignments.
  target           Desired number of observations in the subset. Note that the actual number of observations returned in output may be different.
  options          Further options.
  [out] output     Vector of indices for observations selected in the subset. Indices are guaranteed to be unique and sorted.

◆ compute() [2/2]

template<typename Index_ , typename Partition_ >
std::vector< Index_ > partisub::compute ( const Index_ num_obs,
const Partition_ * partition,
const Index_ target,
const Options & options )

Overload of compute() that allocates the output vector.

Template Parameters
  Index_       Integer type of the observation indices.
  Partition_   Integer type of the partition assignments.
Parameters
  num_obs          Number of observations. This should be non-negative.
  [in]  partition  Pointer to an array of length num_obs containing partition assignments.
  target           Desired number of observations in the subset. Note that the actual number of observations in the returned vector may be different.
  options          Further options.
Returns
  Vector of indices for observations selected in the subset. Indices are guaranteed to be unique and sorted.
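
The relationship between an in-place overload and an allocating overload can be sketched as a simple delegation. Everything in the block is a stand-in: the sketch namespace, the reduced parameter lists and the placeholder selection logic (which keeps every observation) are assumptions, as is the choice to clear the caller's vector before filling it. Whether partisub is written this way is not stated in this reference.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch { // hypothetical stand-in, not the partisub namespace

// Stand-in for the in-place overload: fills a vector owned by the caller.
// The selection logic is a placeholder that keeps every observation.
template<typename Index_, typename Partition_>
void compute(const Index_ num_obs, const Partition_* partition, std::vector<Index_>& output) {
    assert(partition != nullptr || num_obs == 0);
    output.clear(); // assumed behavior: previous contents are discarded
    output.reserve(static_cast<std::size_t>(num_obs));
    for (Index_ i = 0; i < num_obs; ++i) {
        output.push_back(i);
    }
}

// Allocating overload: creates the vector, delegates, and returns by value.
template<typename Index_, typename Partition_>
std::vector<Index_> compute(const Index_ num_obs, const Partition_* partition) {
    std::vector<Index_> output;
    compute(num_obs, partition, output);
    return output;
}

} // namespace sketch
```

With this pattern, the in-place form lets a caller reuse one buffer across repeated calls, while the allocating form is more convenient for one-off use.
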