qdtsne
A quick and dirty t-SNE C++ library
|
This repository contains a header-only C++ library implementing the Barnes-Hut t-distributed stochastic neighbor embedding (t-SNE) algorithm (van der Maaten and Hinton, 2008; van der Maaten, 2014). t-SNE is a non-linear dimensionality reduction technique that enables visualization of complex datasets by placing each observation on a low-dimensional (usually 2D) map based on its neighbors. The general idea is that neighboring observations are placed next to each other, thus preserving the local structure of the dataset from its original high-dimensional form. The non-linear nature of the algorithm gives it the flexibility to accommodate complicated structure that is not possible in linear techniques like PCA.
The code here is derived from the C++ code in the Rtsne R package, itself taken from the 2014 paper. It has been updated to use more modern C++, along with some additional options to sacrifice accuracy for speed - see below for details.
Using this library is as simple as including the header file in your source code:
You can change the parameters with the relevant setters:
You can also stop and start the algorithm:
See the reference documentation for more details.
van der Maaten (2014) proposed the use of the Barnes-Hut approximation for the repulsive force calculations in t-SNE. The algorithm consolidates a group of distant points into a single center of mass, avoiding the need to calculate forces between individual points. The definition of "distant" is determined by the theta
parameter, where larger values sacrifice accuracy for speed.
In qdtsne, we introduce an extra max_depth
parameter that bounds the depth of the tree used for the Barnes-Hut force calculations. Setting a maximum depth of $m$ is equivalent to the following procedure:
The approximation is based on ignoring the distribution within each grid cell, which should be acceptable at large $m$ where the intervals are small. Smaller values of $m$ reduce computational time by limiting the depth of the recursion, at the cost of approximation quality for the repulsive force calculation. A value of 7 to 10 seems to be a good compromise for most applications.
We can go even further by using the center of mass for $i$'s leaf node to approximate $i$'s repulsive forces with all other leaf nodes. We compute the repulsive forces once per leaf node, and then re-use those values for all points assigned to the same node. This eliminates near-redundant searches through the tree for neighboring points; the only extra calculation per point $i$ is the repulsion between $i$ and its leaf node. We call this approach the "leaf approximation", which is enabled through the leaf_approximation
parameter. Note that this only has an effect in max_depth
-bounded trees where multiple points are assigned to a leaf node.
Some testing indicates that both approximations can significantly speed up calculation of the embeddings. Timings are shown below in seconds, based on a mock dataset containing 50,000 points (see tests/R/examples/basic.R
for details).
Strategy | Serial | Parallel (OpenMP, n = 4) |
---|---|---|
Default | 136 | 42 |
max_depth = 7 | 85 | 26 |
max_depth = 7 , leaf_approximation = true | 46 | 16 |
FetchContent
If you're already using CMake, you can add something like this to your CMakeLists.txt
:
And then:
find_package()
To install the library, use:
By default, this will use FetchContent
to fetch all external dependencies. If you want to install them manually, use -DQDTSNE_FETCH_EXTERN=OFF
. See extern/CMakeLists.txt
to find compatible versions of each dependency.
If you're not using CMake, you can just copy the header files in include/
into some location that is visible to your compiler. This requires the manual inclusion of the dependencies listed in extern/CMakeLists.txt
.
Most of the code in this repository is MIT-licensed (see the [LICENSE
](LICENSE)). The exception is the code that was derived from the Rtsne R package, which in turn was taken from the van der Maaten (2014) paper. Readers are referred to the relevant source files (SPTree.hpp
and Status.hpp
) for their licensing conditions.
van der Maaten, L.J.P. and Hinton, G.E. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
van der Maaten, L.J.P. (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15, 3221-3245.