umappp
A C++ library for UMAP
Options for initialize().
#include <Options.hpp>
Public Attributes | |
double | local_connectivity = 1 |
double | bandwidth = 1 |
double | mix_ratio = 1 |
double | spread = 1 |
double | min_dist = 0.1 |
std::optional< double > | a |
std::optional< double > | b |
double | repulsion_strength = 1 |
InitializeMethod | initialize_method = InitializeMethod::SPECTRAL |
bool | initialize_random_on_spectral_fail = true |
irlba::Options | initialize_spectral_irlba_options |
double | initialize_spectral_scale = 10 |
bool | initialize_spectral_jitter = false |
double | initialize_spectral_jitter_sd = 0.0001 |
double | initialize_random_scale = 10 |
RngEngine::result_type | initialize_seed = sanisizer::cap<typename RngEngine::result_type>(9876543210) |
std::optional< int > | num_epochs |
double | learning_rate = 1 |
double | negative_sample_rate = 5 |
int | num_neighbors = 15 |
RngEngine::result_type | optimize_seed = sanisizer::cap<typename RngEngine::result_type>(1234567890) |
int | num_threads = 1 |
int | parallel_optimization = false |
std::optional<double> umappp::Options::a |
Positive value for the \(a\) parameter for the fuzzy set membership confidence calculations. Larger values yield a sharper decay in membership confidence with increasing distance between observations.
If either this or Options::b is unset, a suitable value for this parameter is automatically determined from Options::spread and Options::min_dist.
std::optional<double> umappp::Options::b |
Value in \((0, 1)\) for the \(b\) parameter for the fuzzy set membership confidence calculations. Larger values yield an earlier decay in membership confidence with increasing distance between observations.
If either this or Options::a is unset, a suitable value for this parameter is automatically determined from Options::spread and Options::min_dist.
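As a rough illustration of how Options::spread and Options::min_dist relate to \(a\) and \(b\), the sketch below writes out the two curves used in the reference UMAP formulation: a target curve defined by spread and min_dist, and the parametric curve \(1 / (1 + a d^{2b})\) that is fitted to it. This page does not describe the fitting procedure, so the function names and the assumption that umappp follows the reference formulation are illustrative only.

```cpp
#include <cmath>

// Target curve from the reference UMAP formulation (assumption: this page
// does not state the exact form used by umappp). Distances up to min_dist
// map to 1; beyond that, the value decays exponentially at a rate set by spread.
inline double target_curve(double dist, double spread, double min_dist) {
    if (dist <= min_dist) {
        return 1.0;
    }
    return std::exp(-(dist - min_dist) / spread);
}

// Parametric curve controlled by a and b: 1 / (1 + a * dist^(2b)).
// Hypothetical helper name; a and b would be chosen so that this curve
// approximates target_curve() over a range of distances.
inline double membership_curve(double dist, double a, double b) {
    return 1.0 / (1.0 + a * std::pow(dist, 2.0 * b));
}
```

Evaluating both functions over a grid of distances shows how a given pair of \(a\) and \(b\) tracks the target curve; a larger \(a\) lowers the parametric curve at any fixed positive distance.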
double umappp::Options::bandwidth = 1 |
Effective bandwidth of the kernel when converting the distance to a neighbor into a fuzzy set membership confidence. Larger values reduce the decay in confidence with respect to distance, increasing connectivity and favoring global structure.
InitializeMethod umappp::Options::initialize_method = InitializeMethod::SPECTRAL |
How to initialize the embedding.
bool umappp::Options::initialize_random_on_spectral_fail = true |
Whether to fall back to random sampling of the initial coordinates (i.e., same as InitializeMethod::RANDOM; see Options::initialize_random_scale for the sampling distribution) if spectral initialization fails. If false, any existing values in the input array will be used, i.e., same as InitializeMethod::NONE. Only relevant if Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization fails.
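The fallback rule can be summarized as a small decision function. This is only a sketch of the documented behavior: the enum is redeclared locally as a stand-in for the library's InitializeMethod, and effective_method() is a hypothetical helper that does not exist in umappp.

```cpp
// Local stand-in for umappp's InitializeMethod, using the enumerator
// names mentioned on this page.
enum class InitializeMethod { SPECTRAL, RANDOM, NONE };

// Hypothetical helper (not part of umappp): reports which strategy ends up
// supplying the initial coordinates under the documented fallback rule.
inline InitializeMethod effective_method(
    InitializeMethod requested,
    bool spectral_succeeded,
    bool random_on_spectral_fail)
{
    // The fallback only applies when spectral initialization was requested
    // and failed; otherwise the requested method is used as-is.
    if (requested != InitializeMethod::SPECTRAL || spectral_succeeded) {
        return requested;
    }
    return random_on_spectral_fail ? InitializeMethod::RANDOM
                                   : InitializeMethod::NONE;
}
```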
double umappp::Options::initialize_random_scale = 10 |
Scale of the randomly generated initial coordinates. Specifically, coordinates are sampled from a uniform distribution on \([-x, x)\), where \(x\) is initialize_random_scale. Only relevant if Options::initialize_method = InitializeMethod::RANDOM, or if Options::initialize_method = InitializeMethod::SPECTRAL, spectral initialization fails, and Options::initialize_random_on_spectral_fail = true.
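A minimal sketch of this sampling scheme, using only the standard library. Here std::mt19937_64 is an assumed stand-in for the library's RngEngine, and random_coordinates() is a hypothetical helper rather than a umappp function.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fills n coordinates with draws from a uniform
// distribution on [-scale, scale), seeded for reproducibility.
inline std::vector<double> random_coordinates(
    std::size_t n,
    double scale,
    std::mt19937_64::result_type seed)
{
    std::mt19937_64 rng(seed); // assumed stand-in for RngEngine
    std::uniform_real_distribution<double> dist(-scale, scale);
    std::vector<double> out(n);
    for (auto& x : out) {
        x = dist(rng);
    }
    return out;
}
```

Calling the helper twice with the same seed yields identical coordinates, which is the role that Options::initialize_seed plays for the real initialization.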
RngEngine::result_type umappp::Options::initialize_seed = sanisizer::cap<typename RngEngine::result_type>(9876543210) |
Seed for the random number generator during initialization. Only relevant if Options::initialize_method = InitializeMethod::RANDOM; or Options::initialize_method = InitializeMethod::SPECTRAL and Options::initialize_spectral_jitter = true; or Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization fails and Options::initialize_random_on_spectral_fail = true.
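The default seed literal does not fit in a 32-bit unsigned integer, which is presumably why it is wrapped in sanisizer::cap. The sketch below assumes that the cap clamps a value to the largest value representable by the target type; cap_to() is a hypothetical stand-in written for illustration and is not sanisizer's implementation.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the assumed capping behavior: returns value if
// it fits in Target_, otherwise the maximum of Target_. Intended for integer
// targets no wider than 64 bits.
template<typename Target_>
constexpr Target_ cap_to(std::uint64_t value) {
    constexpr std::uint64_t limit = std::numeric_limits<Target_>::max();
    return static_cast<Target_>(value > limit ? limit : value);
}
```

Under this assumption, a 64-bit result_type keeps the seed 9876543210 unchanged, while a 32-bit result_type receives its maximum value instead.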
irlba::Options umappp::Options::initialize_spectral_irlba_options |
Further options to pass to irlba::compute() for spectral initialization.
bool umappp::Options::initialize_spectral_jitter = false |
Whether to jitter the coordinates after spectral initialization to separate duplicate observations (e.g., to avoid overplotting). This is done by adding normally distributed noise with a mean of zero and a standard deviation of Options::initialize_spectral_jitter_sd. Only relevant if Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization does not fail.
double umappp::Options::initialize_spectral_jitter_sd = 0.0001 |
Standard deviation of the jitter to apply after spectral initialization. Only relevant if Options::initialize_method = InitializeMethod::SPECTRAL, spectral initialization does not fail, and Options::initialize_spectral_jitter = true.
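The jitter step can be pictured as adding independent Gaussian noise to every coordinate. The following is a minimal sketch under that reading; add_jitter() is a hypothetical helper and std::mt19937_64 is an assumed stand-in for RngEngine.

```cpp
#include <random>
#include <vector>

// Hypothetical helper: adds zero-mean normal noise with standard deviation
// sd to each coordinate in place, so that duplicates no longer coincide.
inline void add_jitter(
    std::vector<double>& coords,
    double sd,
    std::mt19937_64::result_type seed)
{
    std::mt19937_64 rng(seed); // assumed stand-in for RngEngine
    std::normal_distribution<double> noise(0.0, sd);
    for (auto& x : coords) {
        x += noise(rng);
    }
}
```

With the default standard deviation of 0.0001 and a spectral scale of 10, the perturbation is tiny relative to the spread of the coordinates, which is enough to separate exact duplicates without visibly moving them.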
double umappp::Options::initialize_spectral_scale = 10 |
Maximum absolute magnitude of the coordinates after spectral initialization. All initial coordinates are scaled such that the maximum of their absolute values is equal to initialize_spectral_scale. This ensures that outlier observations will not have large absolute distances that may interfere with optimization. Only relevant if Options::initialize_method = InitializeMethod::SPECTRAL and spectral initialization does not fail.
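The rescaling described here amounts to multiplying every coordinate by a single factor. A minimal sketch follows; rescale_to_max_abs() is a hypothetical helper, and the handling of an all-zero input is an assumption made only to avoid division by zero.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: rescales coords in place so that the largest
// absolute value equals scale.
inline void rescale_to_max_abs(std::vector<double>& coords, double scale) {
    double max_abs = 0.0;
    for (double x : coords) {
        max_abs = std::max(max_abs, std::abs(x));
    }
    if (max_abs == 0.0) {
        return; // assumption: leave an all-zero input untouched
    }
    const double factor = scale / max_abs;
    for (auto& x : coords) {
        x *= factor;
    }
}
```

For example, the coordinates {0.5, -2.0, 1.0} with a scale of 10 become {2.5, -10.0, 5.0}.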
double umappp::Options::learning_rate = 1 |
Initial learning rate used in the gradient descent. Larger values can accelerate convergence but at the risk of skipping over suitable local optima.
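The word "initial" suggests that the rate changes over the course of the optimization. The sketch below assumes a linear decay to zero over the epochs, as in the reference UMAP implementation; this page does not state the schedule used by umappp, and learning_rate_at() is a hypothetical helper.

```cpp
// Hypothetical helper, assuming linear decay: the rate starts at `initial`
// for epoch 0 and shrinks proportionally to the fraction of epochs completed.
inline double learning_rate_at(double initial, int epoch, int num_epochs) {
    return initial * (1.0 - static_cast<double>(epoch) / num_epochs);
}
```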
double umappp::Options::local_connectivity = 1 |
Number of nearest neighbors that are assumed to be always connected, with maximum membership confidence. Larger values increase the connectivity of the embedding and reduce the focus on local structure. This may be a fractional number of neighbors, in which case interpolation is performed when computing the membership confidence.
double umappp::Options::min_dist = 0.1 |
Minimum distance between observations in the final low-dimensional embedding. Smaller values will increase local clustering while larger values favor a more even distribution of observations throughout the low-dimensional space. This is interpreted relative to Options::spread. Ignored if both Options::a and Options::b are provided.
double umappp::Options::mix_ratio = 1 |
Mixing ratio in \([0, 1]\) when combining fuzzy sets. This symmetrizes the sets so that the confidence of observation \(A\) belonging to observation \(B\)'s set is the same as that of \(B\) belonging to \(A\)'s set. A mixing ratio of 1 will take the union of confidences, a ratio of 0 will take the intersection, and intermediate values will interpolate between them. Larger values favor connectivity and more global structure.
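The interpolation between union and intersection can be written out for a single pair of observations. This sketch uses the standard fuzzy-set definitions from the reference UMAP formulation (probabilistic sum for the union, product for the intersection), which are assumed here because this page does not spell them out; combine_confidences() is a hypothetical helper.

```cpp
// Hypothetical helper: symmetrizes two directed confidences.
//   p_ab: confidence of A belonging to B's set
//   p_ba: confidence of B belonging to A's set
// mix_ratio = 1 gives the fuzzy union, mix_ratio = 0 the fuzzy intersection.
inline double combine_confidences(double p_ab, double p_ba, double mix_ratio) {
    const double fuzzy_intersection = p_ab * p_ba;
    const double fuzzy_union = p_ab + p_ba - fuzzy_intersection;
    return mix_ratio * fuzzy_union + (1.0 - mix_ratio) * fuzzy_intersection;
}
```

With both confidences at 0.5, the result is 0.75 for a mixing ratio of 1 and 0.25 for a ratio of 0. Swapping the two arguments does not change the result, which is the symmetry described above.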
double umappp::Options::negative_sample_rate = 5 |
Rate of sampling negative observations to compute repulsive forces. Greater values will improve accuracy but increase compute time.
std::optional<int> umappp::Options::num_epochs |
Number of epochs for the gradient descent, i.e., optimization iterations. Larger values improve accuracy at the cost of increased compute time. If no value is provided, one is automatically chosen based on the size of the dataset.
int umappp::Options::num_neighbors = 15 |
Number of neighbors to use to define the fuzzy sets. Larger values improve connectivity and favor preservation of global structure, at the cost of increased compute time. This argument is only used in certain initialize() overloads that perform identification of the nearest neighbors.
int umappp::Options::num_threads = 1 |
Number of threads to use. The parallelization scheme is determined by parallelize() for most calculations. The exception is the nearest-neighbor search in some of the initialize() overloads, where the scheme is determined by knncolle::parallelize() instead.
If Options::parallel_optimization = true, this option will also affect the layout optimization, i.e., the gradient descent iterations.
RngEngine::result_type umappp::Options::optimize_seed = sanisizer::cap<typename RngEngine::result_type>(1234567890) |
Seed for the random number generator when sampling negative observations in the optimization step.
int umappp::Options::parallel_optimization = false |
Whether to enable parallel optimization. If set to true, this will use the number of threads specified in Options::num_threads for the layout optimization step.
By default, this is set to false as the increase in the number of threads is usually not cost-effective for layout optimization. Specifically, while CPU usage scales with the number of threads, the time spent does not decrease by the same factor. We also expect that the number of available CPUs is at least equal to the requested number of threads; otherwise, contention will greatly degrade performance. Nonetheless, users can enable parallel optimization if cost is no issue. Usually, a higher number of threads (above 4) is required to see a significant speed-up.
If the UMAPPP_NO_PARALLEL_OPTIMIZATION macro is defined, umappp will not be compiled with support for parallel optimization. This may be desirable in environments that have no support for threading or atomics, or to reduce the binary size if parallelization is not of interest. In such cases, enabling parallel optimization and calling Status::run() will throw an error.
double umappp::Options::repulsion_strength = 1 |
Modifier for the repulsive force. Larger values increase repulsion and favor local structure.
double umappp::Options::spread = 1 |
Scale of the coordinates of the final low-dimensional embedding. Ignored if both Options::a and Options::b are provided.