kmeans
k-means clustering in C++
RefineHartiganWong.hpp
#ifndef KMEANS_HARTIGAN_WONG_HPP
#define KMEANS_HARTIGAN_WONG_HPP

#include <vector>
#include <algorithm>
#include <numeric>
#include <cstdint>

#include "sanisizer/sanisizer.hpp"

#include "Refine.hpp"
#include "Details.hpp"
#include "QuickSearch.hpp"
#include "parallelize.hpp"
#include "compute_centroids.hpp"
#include "is_edge_case.hpp"
#include "utils.hpp"

namespace kmeans {

/**
 * @brief Options for RefineHartiganWong.
 *
 * Reconstructed from the member documentation; defaults follow the reference implementation.
 */
struct RefineHartiganWongOptions {
    /**
     * Maximum number of optimal transfer iterations.
     */
    int max_iterations = 10;

    /**
     * Maximum number of quick transfer iterations within each optimal transfer iteration.
     */
    int max_quick_transfer_iterations = 50;

    /**
     * Whether to quit early if the quick transfer iterations fail to converge.
     */
    bool quit_on_quick_transfer_convergence_failure = false;

    /**
     * Number of threads to use.
     */
    int num_threads = 1;
};

namespace RefineHartiganWong_internal {

/*
 * Alright, get ready for a data dump, because this is where it gets real.
 * Here I attempt to untangle the code spaghetti from the original Fortran
 * implementation, which is exemplified by the NCP and LIVE arrays.
 *
 * Okay, NCP first. Each element L in NCP has a dual interpretation:
 *
 * - In the optimal-transfer stage, NCP(L) stores the step at which cluster L
 *   was last updated. Each step is just the observation index, as the optimal
 *   transfer only does one pass through the dataset.
 * - In the quick-transfer stage, NCP(L) stores the step at which cluster L was
 *   last updated, plus M (i.e., the number of observations). Here, the step
 *   corresponding to an observation will be 'M * X + obs' for some integer X
 *   >= 0, where X is the iteration of the quick transfer. This means that the
 *   stored value is defined as 'M * (X + 1) + obs'.
 * - After each quick_transfer call, NCP(L) is set back to zero so any existing
 *   values will not carry over into the next optimal_transfer call.
 *
 * Note that these two definitions bleed into each other, as the NCP(L) set by
 * optimal_transfer is still referenced in the first few iterations of
 * quick_transfer before it eventually gets overwritten. The easiest way to
 * interpret this is to consider the optimal transfer as iteration 'X = -1'
 * from the perspective of the quick transfer iterations.
 *
 * In short, the NCP array is used to determine whether a cluster was modified
 * within the last M steps of the optimal_transfer or quick_transfer.
 *
 * Let's move onto LIVE, which has an even more painful interpretation:
 *
 * - Before each optimal_transfer call, LIVE(L) stores the observation at which
 *   cluster L was updated in the _previous_ optimal_transfer call.
 * - During the optimal_transfer call, LIVE(L) is updated to the observation at
 *   which L was updated in this call, plus M (i.e., the number of observations).
 * - After the optimal_transfer call, LIVE(L) is updated by subtracting M, so
 *   that the interpretation is correct in the next call.
 * - After the quick_transfer call, LIVE(L) is set to M + 1 if a quick transfer
 *   took place for L, effectively mimicking a "very recent" update.
 *
 * It basically tells us whether there was a recent transfer (optimal or quick)
 * within the last M steps of optimal_transfer. If so, the cluster is "live".
 * This differs very slightly from NCP, as LIVE only counts optimal transfer
 * steps while NCP counts both optimal and quick transfer steps.
 *
 * We simplify this mess by consolidating NCP and LIVE into a single update
 * history that can serve both purposes. This is possible as LIVE is not used
 * in quick_transfer, while any modifications made to NCP by quick_transfer are
 * ignored and overwritten in optimal_transfer. Thus, there won't be any
 * conflicts between the two meanings in a consolidated update history.
 *
 * The update history for cluster L now holds the iteration ('X') and the
 * observation ('obs') at which the cluster was last modified. This has
 * an obvious mapping to NCP(L), which was already defined in terms of
 * 'X' and 'obs' anyway. For LIVE(L), the mapping is more complex:
 *
 * - If an optimal_transfer occurred but no quick_transfer occurs for L,
 *   we set 'X = -2'. 'obs' is not modified, as it is the observation at
 *   which L was modified in the previous optimal_transfer iteration;
 *   this is basically just the definition of LIVE(L).
 * - If a quick transfer does occur, we set 'X = -2' and 'obs = M + 1'.
 *   Again, this is equivalent to the behavior of LIVE(L).
 * - If neither a quick_transfer nor an optimal_transfer occurs for L,
 *   we set 'X = -3' to indicate that L didn't change in any manner.
 * - In optimal_transfer, we consider a cluster to be live at a particular
 *   observation if its stored 'obs' is greater than that observation's
 *   index and its 'X' is -2.
 *
 * Note that Fortran is 1-based, so the actual code is a little different from
 * that described above for 'obs' - specifically, we just need to set it to 'M'
 * when a quick transfer occurs. The meaning of 'X' is unchanged.
 *
 * Incidentally, we can easily detect if a quick transfer occurred based on
 * whether 'X > -1'. This obviates the need for the ITRAN array.
 */
template<typename Index_>
class UpdateHistory {
private:
    Index_ my_last_observation = 0;

    static constexpr int current_optimal_transfer = -1;
    static constexpr int previous_optimal_transfer = -2;
    static constexpr int ancient_history = -3;

    // Starting at -3 as we can't possibly have any updates if we haven't done
    // any transfers, optimal or quick!
    int my_last_iteration = ancient_history;

public:
    void reset(const Index_ total_obs) {
        if (my_last_iteration > current_optimal_transfer) { // i.e., quick_transfer.
            my_last_observation = total_obs;
            my_last_iteration = previous_optimal_transfer;
        } else if (my_last_iteration == current_optimal_transfer) {
            // Preserve the existing 'my_last_observation', just bump the iteration downwards.
            my_last_iteration = previous_optimal_transfer;
        } else {
            my_last_iteration = ancient_history;
        }
    }

    void set_optimal(const Index_ obs) {
        my_last_observation = obs;
        my_last_iteration = current_optimal_transfer;
    }

    // Here, iter should be from '[0, max_quick_transfer_iterations)'.
    void set_quick(const int iter, const Index_ obs) {
        my_last_iteration = iter;
        my_last_observation = obs;
    }

public:
    bool changed_after(const int iter, const Index_ obs) const {
        if (my_last_iteration == iter) {
            return my_last_observation > obs;
        } else {
            return my_last_iteration > iter;
        }
    }

    bool changed_after_or_at(const int iter, const Index_ obs) const {
        if (my_last_iteration == iter) {
            return my_last_observation >= obs;
        } else {
            return my_last_iteration > iter;
        }
    }

    bool is_live(const Index_ obs) const {
        return changed_after(previous_optimal_transfer, obs);
    }
};

template<typename Float_, typename Index_, typename Cluster_>
struct Workspace {
    // Array arguments in the same order as supplied to R's kmns_ function.
    std::vector<Cluster_> best_destination_cluster; // i.e., IC2
    std::vector<Index_> cluster_sizes; // i.e., NC

    std::vector<Float_> loss_multiplier; // i.e., AN1
    std::vector<Float_> gain_multiplier; // i.e., AN2
    std::vector<Float_> wcss_loss; // i.e., D

    std::vector<UpdateHistory<Index_> > update_history; // i.e., NCP, LIVE, and ITRAN.

    Index_ optra_steps_since_last_transfer = 0; // i.e., INDX

public:
    Workspace(Index_ nobs, Cluster_ ncenters) :
        // Sizes taken from the .Fortran() call in stats::kmeans().
        best_destination_cluster(sanisizer::cast<I<decltype(best_destination_cluster.size())> >(nobs)),
        cluster_sizes(sanisizer::cast<I<decltype(cluster_sizes.size())> >(ncenters)),
        loss_multiplier(sanisizer::cast<I<decltype(loss_multiplier.size())> >(ncenters)),
        gain_multiplier(sanisizer::cast<I<decltype(gain_multiplier.size())> >(ncenters)),
        wcss_loss(sanisizer::cast<I<decltype(wcss_loss.size())> >(nobs)),
        update_history(sanisizer::cast<I<decltype(update_history.size())> >(ncenters))
    {}
};

template<typename Data_, typename Float_>
Float_ squared_distance_from_cluster(const Data_* const data, const Float_* const center, const std::size_t ndim) {
    Float_ output = 0;
    for (I<decltype(ndim)> d = 0; d < ndim; ++d) {
        const Float_ delta = static_cast<Float_>(data[d]) - center[d]; // cast to float for consistent precision regardless of Data_.
        output += delta * delta;
    }
    return output;
}

template<class Matrix_, typename Cluster_, typename Float_>
void find_closest_two_centers(
    const Matrix_& data,
    const Cluster_ ncenters,
    const Float_* const centers,
    Cluster_* const best_cluster,
    std::vector<Cluster_>& best_destination_cluster,
    const int nthreads)
{
    const auto ndim = data.num_dimensions();

    // We assume that there are at least two centers here, otherwise we should
    // have detected that this was an edge case in RefineHartiganWong::run.
    internal::QuickSearch<Float_, Cluster_> index(ndim, ncenters);
    index.reset(centers);

    const auto nobs = data.num_observations();
    parallelize(nthreads, nobs, [&](const int, const I<decltype(nobs)> start, const I<decltype(nobs)> length) -> void {
        auto matwork = data.new_known_extractor(start, length);
        auto qswork = index.new_workspace();
        for (I<decltype(start)> obs = start, end = start + length; obs < end; ++obs) {
            const auto optr = matwork->get_observation();
            const auto res2 = index.find2(optr, qswork);
            best_cluster[obs] = res2[0];
            best_destination_cluster[obs] = res2[1];
        }
    });
}

template<typename Float_>
constexpr Float_ big_number() {
    return 1e30; // Some very big number.
}

template<typename Data_, typename Index_, typename Cluster_, typename Float_>
void transfer_point(
    const std::size_t ndim,
    const Data_* const obs_ptr,
    const Index_ obs_id,
    const Cluster_ l1,
    const Cluster_ l2,
    Float_* const centers,
    Cluster_* const best_cluster,
    Workspace<Float_, Index_, Cluster_>& work)
{
    // Yes, casts to float are deliberate here, so that the
    // multipliers can be computed correctly.
    const Float_ al1 = work.cluster_sizes[l1], alw = al1 - 1;
    const Float_ al2 = work.cluster_sizes[l2], alt = al2 + 1;

    for (I<decltype(ndim)> d = 0; d < ndim; ++d) {
        const Float_ oval = obs_ptr[d]; // cast to float for consistent precision regardless of Data_.
        auto& c1 = centers[sanisizer::nd_offset<std::size_t>(d, ndim, l1)];
        c1 = (c1 * al1 - oval) / alw;
        auto& c2 = centers[sanisizer::nd_offset<std::size_t>(d, ndim, l2)];
        c2 = (c2 * al2 + oval) / alt;
    }

    --work.cluster_sizes[l1];
    ++work.cluster_sizes[l2];

    work.gain_multiplier[l1] = alw / al1;
    work.loss_multiplier[l1] = (alw > 1 ? alw / (alw - 1) : big_number<Float_>());
    work.loss_multiplier[l2] = alt / al2;
    work.gain_multiplier[l2] = alt / (alt + 1);

    best_cluster[obs_id] = l2;
    work.best_destination_cluster[obs_id] = l1;
}

/* ALGORITHM AS 136.1 APPL. STATIST. (1979) VOL.28, NO.1
 * This is the OPtimal TRAnsfer stage.
 * ----------------------
 * Each point is re-assigned, if necessary, to the cluster that will induce a
 * maximum reduction in the within-cluster sum of squares. In this stage,
 * there is only one pass through the data.
 */
template<class Matrix_, typename Cluster_, typename Float_>
bool optimal_transfer(
    const Matrix_& data,
    Workspace<Float_, Index<Matrix_>, Cluster_>& work,
    const Cluster_ ncenters,
    Float_* const centers,
    Cluster_* const best_cluster,
    const bool all_live)
{
    const auto nobs = data.num_observations();
    const auto ndim = data.num_dimensions();
    auto matwork = data.new_known_extractor();

    for (I<decltype(nobs)> obs = 0; obs < nobs; ++obs) {
        ++work.optra_steps_since_last_transfer;

        const auto l1 = best_cluster[obs];
        if (work.cluster_sizes[l1] != 1) {
            const auto obs_ptr = matwork->get_observation(obs);

            // The original Fortran implementation re-used the WCSS loss for
            // each observation, only recomputing it if its cluster had
            // experienced an optimal transfer for an earlier observation. In
            // theory, this sounds great to avoid recomputation, but the
            // existing WCSS loss was computed in a running fashion during the
            // quick transfers. This makes them susceptible to accumulation of
            // numerical errors, even after the centroids are freshly
            // recomputed in the run() loop. So, we simplify matters and
            // improve accuracy by just recomputing the loss all the time,
            // which doesn't take too much extra effort.
            auto& wcss_loss = work.wcss_loss[obs];
            const auto l1_ptr = centers + sanisizer::product_unsafe<std::size_t>(ndim, l1);
            wcss_loss = squared_distance_from_cluster(obs_ptr, l1_ptr, ndim) * work.loss_multiplier[l1];

            // Find the cluster with minimum WCSS gain.
            auto l2 = work.best_destination_cluster[obs];
            const auto original_l2 = l2;
            const auto l2_ptr = centers + sanisizer::product_unsafe<std::size_t>(ndim, l2);
            auto wcss_gain = squared_distance_from_cluster(obs_ptr, l2_ptr, ndim) * work.gain_multiplier[l2];

            const auto update_destination_cluster = [&](const Cluster_ cen) -> void {
                auto cen_ptr = centers + sanisizer::product_unsafe<std::size_t>(ndim, cen);
                auto candidate = squared_distance_from_cluster(obs_ptr, cen_ptr, ndim) * work.gain_multiplier[cen];
                if (candidate < wcss_gain) {
                    wcss_gain = candidate;
                    l2 = cen;
                }
            };

            // If the current assigned cluster is live, we need to check all
            // other clusters as potential transfer destinations, because the
            // gain/loss comparison has changed. Otherwise, we only need to
            // consider other live clusters as transfer destinations; the
            // non-live clusters were already rejected as transfer destinations
            // when compared to the current assigned cluster, so there's no
            // point checking them again if they didn't change in the meantime.
            //
            // The exception is for the first call to optimal_transfer, where we
            // consider all clusters as live (i.e., all_live = true). This is
            // because no observation really knows its best transfer
            // destination yet - the second-closest cluster is just a
            // guesstimate - so we need to compute it exhaustively.
            if (all_live || work.update_history[l1].is_live(obs)) {
                for (Cluster_ cen = 0; cen < ncenters; ++cen) {
                    if (cen != l1 && cen != original_l2) {
                        update_destination_cluster(cen);
                    }
                }
            } else {
                for (Cluster_ cen = 0; cen < ncenters; ++cen) {
                    if (cen != l1 && cen != original_l2 && work.update_history[cen].is_live(obs)) {
                        update_destination_cluster(cen);
                    }
                }
            }

            // Deciding whether to make the transfer based on the change to the WCSS.
            if (wcss_gain >= wcss_loss) {
                work.best_destination_cluster[obs] = l2;
            } else {
                work.optra_steps_since_last_transfer = 0;
                work.update_history[l1].set_optimal(obs);
                work.update_history[l2].set_optimal(obs);
                transfer_point(ndim, obs_ptr, obs, l1, l2, centers, best_cluster, work);
            }
        }

        // Stop if we've iterated through the entire dataset and no transfer of
        // any kind took place, be it optimal or quick.
        if (work.optra_steps_since_last_transfer == nobs) {
            return true;
        }
    }

    return false;
}

/* ALGORITHM AS 136.2 APPL. STATIST. (1979) VOL.28, NO.1
 * This is the Quick TRANsfer stage.
 * --------------------
 * IC1(I) is the cluster which point I currently belongs to.
 * IC2(I) is the cluster which point I is most likely to be transferred to.
 *
 * For each point I, IC1(I) & IC2(I) are switched, if necessary, to reduce
 * within-cluster sum of squares. The cluster centres are updated after each
 * step. In this stage, we loop through the data until no further change is to
 * take place, or we hit an iteration limit, whichever is first.
 */
template<class Matrix_, typename Cluster_, typename Float_>
std::pair<bool, bool> quick_transfer(
    const Matrix_& data,
    Workspace<Float_, Index<Matrix_>, Cluster_>& work,
    Float_* const centers,
    Cluster_* const best_cluster,
    const int quick_iterations)
{
    bool had_transfer = false;

    const auto nobs = data.num_observations();
    const auto ndim = data.num_dimensions();
    auto matwork = data.new_known_extractor();

    I<decltype(nobs)> steps_since_last_quick_transfer = 0; // i.e., ICOUN in the original Fortran implementation.

    for (int it = 0; it < quick_iterations; ++it) {
        const int prev_it = it - 1;

        for (I<decltype(nobs)> obs = 0; obs < nobs; ++obs) {
            ++steps_since_last_quick_transfer;
            const auto l1 = best_cluster[obs];

            if (work.cluster_sizes[l1] != 1) {
                I<decltype(matwork->get_observation(obs))> obs_ptr = NULL;

                // Need to update the WCSS loss if the cluster was updated recently.
                // Otherwise, we must have already updated the WCSS in a previous
                // iteration of the outermost loop, so this can be skipped.
                //
                // Note that we use changed_after_or_at; if the same
                // observation was changed in the previous iteration of the
                // outermost loop, its WCSS loss won't have been updated yet.
                auto& history1 = work.update_history[l1];
                if (history1.changed_after_or_at(prev_it, obs)) {
                    const auto l1_ptr = centers + sanisizer::product_unsafe<std::size_t>(l1, ndim);
                    obs_ptr = matwork->get_observation(obs);
                    work.wcss_loss[obs] = squared_distance_from_cluster(obs_ptr, l1_ptr, ndim) * work.loss_multiplier[l1];
                }

                // If neither the best nor the second-best cluster has changed
                // since the previous iteration in which we visited this
                // observation, there's no point re-evaluating the transfer,
                // because nothing's going to be different anyway.
                const auto l2 = work.best_destination_cluster[obs];
                auto& history2 = work.update_history[l2];
                if (history1.changed_after(prev_it, obs) || history2.changed_after(prev_it, obs)) {
                    if (obs_ptr == NULL) {
                        obs_ptr = matwork->get_observation(obs);
                    }
                    const auto l2_ptr = centers + sanisizer::product_unsafe<std::size_t>(l2, ndim);
                    const auto wcss_gain = squared_distance_from_cluster(obs_ptr, l2_ptr, ndim) * work.gain_multiplier[l2];

                    if (wcss_gain < work.wcss_loss[obs]) {
                        had_transfer = true;
                        steps_since_last_quick_transfer = 0;
                        history1.set_quick(it, obs);
                        history2.set_quick(it, obs);
                        transfer_point(ndim, obs_ptr, obs, l1, l2, centers, best_cluster, work);
                    }
                }
            }

            if (steps_since_last_quick_transfer == nobs) {
                // Quit early if no transfer occurred within the past 'nobs'
                // steps, as we've already converged for each observation.
                return std::make_pair(had_transfer, false);
            }
        }
    }

    return std::make_pair(had_transfer, true);
}

}

/**
 * @brief Implements the Hartigan-Wong algorithm for k-means clustering.
 */
template<typename Index_, typename Data_, typename Cluster_, typename Float_, class Matrix_ = Matrix<Index_, Data_> >
class RefineHartiganWong final : public Refine<Index_, Data_, Cluster_, Float_, Matrix_> {
public:
    RefineHartiganWong(RefineHartiganWongOptions options) : my_options(std::move(options)) {}

private:
    RefineHartiganWongOptions my_options;

public:
    RefineHartiganWongOptions& get_options() {
        return my_options;
    }

public:
    Details<Index_> run(const Matrix_& data, const Cluster_ ncenters, Float_* const centers, Cluster_* const clusters) const {
        const auto nobs = data.num_observations();
        if (internal::is_edge_case(nobs, ncenters)) {
            return internal::process_edge_case(data, ncenters, centers, clusters);
        }

        RefineHartiganWong_internal::Workspace<Float_, Index_, Cluster_> work(nobs, ncenters);

        RefineHartiganWong_internal::find_closest_two_centers(data, ncenters, centers, clusters, work.best_destination_cluster, my_options.num_threads);
        for (Index_ obs = 0; obs < nobs; ++obs) {
            ++work.cluster_sizes[clusters[obs]];
        }
        internal::compute_centroids(data, ncenters, centers, clusters, work.cluster_sizes);

        for (Cluster_ cen = 0; cen < ncenters; ++cen) {
            const Float_ num = work.cluster_sizes[cen]; // yes, cast is deliberate here so that the multipliers can be computed correctly.
            work.gain_multiplier[cen] = num / (num + 1);
            work.loss_multiplier[cen] = (num > 1 ? num / (num - 1) : RefineHartiganWong_internal::big_number<Float_>());
        }

        I<decltype(my_options.max_iterations)> iter = 0;
        int ifault = 0;
        for (; iter < my_options.max_iterations; ++iter) {
            const bool finished = RefineHartiganWong_internal::optimal_transfer(data, work, ncenters, centers, clusters, /* all_live = */ (iter == 0));
            if (finished) {
                break;
            }

            const auto quick_status = RefineHartiganWong_internal::quick_transfer(
                data,
                work,
                centers,
                clusters,
                my_options.max_quick_transfer_iterations
            );

            // Recomputing the centroids to avoid accumulation of numerical
            // errors after many transfers (e.g., adding a whole bunch of
            // values and then subtracting them again leaves behind some
            // cancellation error). Note that we don't have to do this if
            // 'finished = true' as this means that there was no transfer of
            // any kind in the final pass through the dataset.
            internal::compute_centroids(data, ncenters, centers, clusters, work.cluster_sizes);

            if (quick_status.second) { // Hit the quick transfer iteration limit.
                if (my_options.quit_on_quick_transfer_convergence_failure) {
                    ifault = 4;
                    break;
                }
            } else {
                // If quick transfer converged and there are only two clusters,
                // there is no need to re-enter the optimal transfer stage.
                if (ncenters == 2) {
                    break;
                }
            }

            if (quick_status.first) { // At least one quick transfer was performed.
                work.optra_steps_since_last_transfer = 0;
            }

            for (Cluster_ c = 0; c < ncenters; ++c) {
                work.update_history[c].reset(nobs);
            }
        }

        if (iter == my_options.max_iterations) {
            ifault = 2;
        } else {
            ++iter; // make it 1-based.
        }

        return Details(std::move(work.cluster_sizes), iter, ifault);
    }
};

}

#endif