umappp
A C++ library for UMAP
umappp::Options Struct Reference

Options for initialize().

#include <Options.hpp>

Public Attributes

double local_connectivity = 1
 
double bandwidth = 1
 
double mix_ratio = 1
 
double spread = 1
 
double min_dist = 0.1
 
std::optional< double > a
 
std::optional< double > b
 
double repulsion_strength = 1
 
InitializeMethod initialize_method = InitializeMethod::SPECTRAL
 
bool initialize_random_on_spectral_fail = true
 
irlba::Options initialize_spectral_irlba_options
 
double initialize_spectral_scale = 10
 
bool initialize_spectral_jitter = false
 
double initialize_spectral_jitter_sd = 0.0001
 
double initialize_random_scale = 10
 
RngEngine::result_type initialize_seed = sanisizer::cap<typename RngEngine::result_type>(9876543210)
 
std::optional< int > num_epochs
 
double learning_rate = 1
 
double negative_sample_rate = 5
 
int num_neighbors = 15
 
RngEngine::result_type optimize_seed = sanisizer::cap<typename RngEngine::result_type>(1234567890)
 
int num_threads = 1
 
int parallel_optimization = false
 

Detailed Description

Member Data Documentation

◆ a

std::optional<double> umappp::Options::a

Positive value for the \(a\) parameter for the fuzzy set membership confidence calculations. Larger values yield a sharper decay in membership confidence with increasing distance between observations.

If either this or Options::b is unset, a suitable value for this parameter is automatically determined from Options::spread and Options::min_dist.

◆ b

std::optional<double> umappp::Options::b

Value in \((0, 1)\) for the \(b\) parameter for the fuzzy set membership confidence calculations. Larger values yield an earlier decay in membership confidence with increasing distance between observations.

If either this or Options::a is unset, a suitable value for this parameter is automatically determined from Options::spread and Options::min_dist.
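For orientation, the standard UMAP formulation models the membership confidence in the low-dimensional embedding as \(1 / (1 + a d^{2b})\) for a distance \(d\), and fits \(a\) and \(b\) to a target curve that is flat up to min_dist and then decays exponentially at a scale set by spread. The sketch below assumes that formulation. The names low_dim_confidence and target_confidence are hypothetical helpers for illustration and are not part of umappp.

```cpp
#include <cmath>

// Hypothetical helper (not part of umappp): membership confidence at
// distance d, assuming the standard UMAP curve 1 / (1 + a * d^(2b)).
inline double low_dim_confidence(double d, double a, double b) {
    return 1.0 / (1.0 + a * std::pow(d, 2.0 * b));
}

// Hypothetical helper (not part of umappp): the piecewise target curve that
// a and b are fitted to in the standard UMAP formulation. It is flat at 1 up
// to min_dist and then decays exponentially with scale 'spread'.
inline double target_confidence(double d, double spread, double min_dist) {
    if (d <= min_dist) {
        return 1.0;
    }
    return std::exp(-(d - min_dist) / spread);
}
```

Evaluating low_dim_confidence over a range of distances for two values of a (or b) shows the effect described above: a larger value gives a lower confidence at the same distance.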

◆ bandwidth

double umappp::Options::bandwidth = 1

Effective bandwidth of the kernel when converting the distance to a neighbor into a fuzzy set membership confidence. Larger values reduce the decay in confidence with respect to distance, increasing connectivity and favoring global structure.

◆ initialize_method

InitializeMethod umappp::Options::initialize_method = InitializeMethod::SPECTRAL

How to initialize the embedding.

◆ initialize_random_on_spectral_fail

bool umappp::Options::initialize_random_on_spectral_fail = true

Whether to fall back to random initialization (i.e., same as InitializeMethod::RANDOM) if spectral initialization fails. If false, any existing values in the input array will be used, i.e., same as InitializeMethod::NONE. Only relevant if Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization fails.
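The fallback rule can be summarized as a small decision function. This is only a sketch of the behavior described above: the enum is a local stand-in for umappp::InitializeMethod so that the snippet compiles on its own, and effective_initialization is a hypothetical name that does not exist in the library.

```cpp
// Local stand-in for umappp::InitializeMethod, declared here only so that
// the sketch is self-contained.
enum class InitializeMethod { SPECTRAL, RANDOM, NONE };

// Hypothetical helper (not part of umappp): which initialization effectively
// applies, given the requested method, whether the spectral step succeeded,
// and the value of initialize_random_on_spectral_fail.
inline InitializeMethod effective_initialization(
    InitializeMethod requested,
    bool spectral_succeeded,
    bool random_on_spectral_fail)
{
    // The fallback flag only matters for a failed spectral initialization.
    if (requested != InitializeMethod::SPECTRAL || spectral_succeeded) {
        return requested;
    }
    return random_on_spectral_fail ? InitializeMethod::RANDOM : InitializeMethod::NONE;
}
```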

◆ initialize_random_scale

double umappp::Options::initialize_random_scale = 10

Scale of the randomly generated initial coordinates. Specifically, coordinates are sampled from a uniform distribution over \([-x, x)\), where \(x\) is initialize_random_scale. Only relevant if Options::initialize_method = InitializeMethod::RANDOM, or if Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization fails and Options::initialize_random_on_spectral_fail = true.
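As an illustration of the sampling range, the following sketch draws coordinates uniformly from \([-x, x)\) using the C++ standard library. The function name random_coordinates is hypothetical, and std::mt19937_64 is an arbitrary engine chosen for the example; it is not necessarily the library's RngEngine, so the values will not match those produced by umappp.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of umappp): fill n coordinates with draws
// from the half-open interval [-scale, scale).
inline std::vector<double> random_coordinates(std::size_t n, double scale, unsigned long long seed) {
    std::mt19937_64 rng(seed); // arbitrary engine for this sketch
    std::uniform_real_distribution<double> dist(-scale, scale);
    std::vector<double> out(n);
    for (auto& x : out) {
        x = dist(rng);
    }
    return out;
}
```

Calling random_coordinates twice with the same seed gives the same coordinates, which is the role played by Options::initialize_seed.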

◆ initialize_seed

RngEngine::result_type umappp::Options::initialize_seed = sanisizer::cap<typename RngEngine::result_type>(9876543210)

Seed for the random number generator during initialization. Only relevant if Options::initialize_method = InitializeMethod::RANDOM; or Options::initialize_method = InitializeMethod::SPECTRAL and Options::initialize_spectral_jitter = true; or Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization fails and Options::initialize_random_on_spectral_fail = true.

◆ initialize_spectral_irlba_options

irlba::Options umappp::Options::initialize_spectral_irlba_options

Further options to pass to irlba::compute() for spectral initialization.

◆ initialize_spectral_jitter

bool umappp::Options::initialize_spectral_jitter = false

Whether to jitter the coordinates after spectral initialization to separate duplicate observations (e.g., to avoid overplotting). This is done using normally-distributed noise of mean zero and standard deviation of Options::initialize_spectral_jitter_sd. Only relevant if Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization does not fail.

◆ initialize_spectral_jitter_sd

double umappp::Options::initialize_spectral_jitter_sd = 0.0001

Standard deviation of the jitter to apply after spectral initialization. Only relevant if Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization does not fail and Options::initialize_spectral_jitter = true.
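A minimal sketch of such jittering is shown below, assuming independent noise is added to every coordinate. The function name jitter_coordinates and the choice of std::mt19937_64 are assumptions for illustration, not the library's implementation.

```cpp
#include <random>
#include <vector>

// Hypothetical helper (not part of umappp): add zero-mean normal noise with
// standard deviation sd (which must be positive) to every coordinate.
inline void jitter_coordinates(std::vector<double>& coords, double sd, unsigned long long seed) {
    std::mt19937_64 rng(seed); // arbitrary engine for this sketch
    std::normal_distribution<double> noise(0.0, sd);
    for (auto& x : coords) {
        x += noise(rng);
    }
}
```

With the default standard deviation of 0.0001 and coordinates scaled to a maximum magnitude of 10, duplicate observations are separated without visibly moving them.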

◆ initialize_spectral_scale

double umappp::Options::initialize_spectral_scale = 10

Maximum absolute magnitude of the coordinates after spectral initialization. All initial coordinates are scaled such that the maximum of their absolute values is equal to initialize_spectral_scale. This ensures that outlier observations will not have large absolute distances that may interfere with optimization. Only relevant if Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization does not fail.
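The rescaling can be sketched as follows, assuming a single scaling factor is applied across all coordinates of all observations. The function name rescale_to_max_abs is hypothetical, and the handling of an all-zero input is a choice made for this example only.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper (not part of umappp): scale all coordinates so that
// the largest absolute value equals 'scale'.
inline void rescale_to_max_abs(std::vector<double>& coords, double scale) {
    double max_abs = 0.0;
    for (double x : coords) {
        max_abs = std::max(max_abs, std::abs(x));
    }
    if (max_abs == 0.0) {
        return; // all zeros: left untouched in this sketch
    }
    const double factor = scale / max_abs;
    for (auto& x : coords) {
        x *= factor;
    }
}
```

For example, coordinates of -0.5, 0.25 and 0.1 with a scale of 10 are multiplied by 20, so the most extreme coordinate lands exactly at -10.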

◆ learning_rate

double umappp::Options::learning_rate = 1

Initial learning rate used in the gradient descent. Larger values can accelerate convergence but at the risk of skipping over suitable local optima.
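The word "initial" matters because the rate is reduced over the course of optimization. As an illustration, the sketch below assumes the linear decay schedule of the reference UMAP implementation, in which the rate falls from its initial value to zero across the epochs. The schedule is not stated on this page, and learning_rate_at is a hypothetical name.

```cpp
// Hypothetical helper (not part of umappp): learning rate at a given epoch,
// assuming a linear decay from initial_rate at epoch 0 to zero at num_epochs.
inline double learning_rate_at(int epoch, int num_epochs, double initial_rate) {
    return initial_rate * (1.0 - static_cast<double>(epoch) / num_epochs);
}
```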

◆ local_connectivity

double umappp::Options::local_connectivity = 1

Number of nearest neighbors that are assumed to be always connected, with maximum membership confidence. Larger values increase the connectivity of the embedding and reduce the focus on local structure. This may be a fractional number of neighbors, in which case interpolation is performed when computing the membership confidence.

◆ min_dist

double umappp::Options::min_dist = 0.1

Minimum distance between observations in the final low-dimensional embedding. Smaller values will increase local clustering while larger values favor a more even distribution of observations throughout the low-dimensional space. This is interpreted relative to Options::spread. Ignored if both Options::a and Options::b are provided.

◆ mix_ratio

double umappp::Options::mix_ratio = 1

Mixing ratio in \([0, 1]\) when combining fuzzy sets. This symmetrizes the sets so that the confidence of observation \(A\) belonging to observation \(B\)'s set is the same as that of \(B\) belonging to \(A\)'s set. A mixing ratio of 1 will take the union of confidences, a ratio of 0 will take the intersection, and intermediate values will interpolate between them. Larger values favor connectivity and more global structure.
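The interpolation can be sketched as below, assuming the probabilistic definitions used in the standard UMAP formulation: the intersection of two confidences is their product, and the union is their sum minus the product. The function name combine_confidences is hypothetical.

```cpp
// Hypothetical helper (not part of umappp): symmetrized confidence for a pair
// of observations, where ab is the confidence of A belonging to B's set and
// ba is the confidence of B belonging to A's set.
inline double combine_confidences(double ab, double ba, double mix_ratio) {
    const double intersection = ab * ba;             // product of confidences
    const double set_union = ab + ba - intersection; // probabilistic union
    return mix_ratio * set_union + (1.0 - mix_ratio) * intersection;
}
```

With both confidences at 0.5, a ratio of 1 gives 0.75, a ratio of 0 gives 0.25, and a ratio of 0.5 gives 0.5. The result is the same if the two confidences are swapped, which is the symmetry described above.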

◆ negative_sample_rate

double umappp::Options::negative_sample_rate = 5

Rate of sampling negative observations to compute repulsive forces. Greater values will improve accuracy but increase compute time.

◆ num_epochs

std::optional<int> umappp::Options::num_epochs

Number of epochs for the gradient descent, i.e., optimization iterations. Larger values improve accuracy at the cost of increased compute time. If no value is provided, one is automatically chosen based on the size of the dataset:

  • For datasets with no more than 10000 observations, the number of epochs is set to 500.
  • For datasets with more than 10000 observations, the number of epochs decreases as the number of observations grows. Specifically, it starts at 500 for 10000 observations and decreases asymptotically to a lower limit of 200. This choice aims to reduce computational work for very large datasets.
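One formula consistent with this description is 200 plus a term that is inversely proportional to the number of observations, equal to 300 at 10000 observations. The sketch below assumes that form. The rounding direction and the name default_num_epochs are assumptions, so exact values may differ from those chosen by the library.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of umappp): default number of epochs as a
// function of the number of observations, under the assumed formula
// 200 + ceil(300 * 10000 / num_obs) for datasets above 10000 observations.
inline int default_num_epochs(std::size_t num_obs) {
    constexpr std::size_t limit = 10000;
    constexpr int floor_epochs = 200; // asymptotic lower limit
    constexpr int extra_epochs = 300; // additional epochs at or below the limit
    if (num_obs <= limit) {
        return floor_epochs + extra_epochs;
    }
    const double extra = extra_epochs * (static_cast<double>(limit) / static_cast<double>(num_obs));
    return floor_epochs + static_cast<int>(std::ceil(extra));
}
```

Under this assumption, 20000 observations would give 350 epochs, and the value approaches 200 as the dataset grows.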

◆ num_neighbors

int umappp::Options::num_neighbors = 15

Number of neighbors to use to define the fuzzy sets. Larger values improve connectivity and favor preservation of global structure, at the cost of increased compute time. This argument is only used in certain initialize() overloads that perform identification of the nearest neighbors.

◆ num_threads

int umappp::Options::num_threads = 1

Number of threads to use. The parallelization scheme is determined by parallelize() for most calculations. The exception is the nearest-neighbor search in some of the initialize() overloads, where the scheme is determined by knncolle::parallelize() instead.

If Options::parallel_optimization = true, this option will also affect the layout optimization, i.e., the gradient descent iterations.

◆ optimize_seed

RngEngine::result_type umappp::Options::optimize_seed = sanisizer::cap<typename RngEngine::result_type>(1234567890)

Seed for the random number generator when sampling negative observations in the optimization step.

◆ parallel_optimization

int umappp::Options::parallel_optimization = false

Whether to enable parallel optimization. If set to true, this will use the number of threads specified in Options::num_threads for the layout optimization step.

By default, this is set to false as the increase in the number of threads is usually not cost-effective for layout optimization. Specifically, while CPU usage scales with the number of threads, the time spent does not decrease by the same factor. We also expect that the number of available CPUs is at least equal to the requested number of threads, otherwise contention will greatly degrade performance. Nonetheless, users can enable parallel optimization if cost is no issue. Usually, a higher number of threads (above 4) is required to see a significant speed-up.

If the UMAPPP_NO_PARALLEL_OPTIMIZATION macro is defined, umappp will not be compiled with support for parallel optimization. This may be desirable in environments that have no support for threading or atomics, or to reduce the binary size if parallelization is not of interest. In such cases, enabling parallel optimization and calling Status::run() will throw an error.

◆ repulsion_strength

double umappp::Options::repulsion_strength = 1

Modifier for the repulsive force. Larger values increase repulsion and favor local structure.

◆ spread

double umappp::Options::spread = 1

Scale of the coordinates of the final low-dimensional embedding. Ignored if both Options::a and Options::b are provided.

