factorize
Create factors from categorical variables

Overview

This repository contains functions to create R-style factors from categorical variables. Each factor is represented by (i) an array of integer codes in the interval $[0, N)$ and (ii) an array of length $N$ containing sorted and unique levels. For any given observation, its value in the categorical variable can be retrieved by indexing the array of levels by its code. Factors are useful as they map arbitrary variables onto integer codes that can be easily processed by other functions.
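To make the codes-and-levels representation concrete, here is a minimal standalone sketch that does not use the library. The names `sketch_factor` and `count_per_level` are hypothetical helpers invented for this illustration and are not part of the factorize API; the approach shown (sort, deduplicate, binary search) is just one simple way to produce the same kind of output, not necessarily how the library does it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not part of the factorize API.
// Returns the sorted, unique levels and fills 'codes' with one code per observation.
template<typename Value_>
std::vector<Value_> sketch_factor(const std::vector<Value_>& values, std::vector<int>& codes) {
    // Levels are the sorted, unique values.
    std::vector<Value_> levels(values);
    std::sort(levels.begin(), levels.end());
    levels.erase(std::unique(levels.begin(), levels.end()), levels.end());

    // Each code is the position of the observation's value among the levels,
    // so every code lies in [0, N) where N is the number of levels.
    codes.resize(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        auto it = std::lower_bound(levels.begin(), levels.end(), values[i]);
        codes[i] = static_cast<int>(it - levels.begin());
    }
    return levels;
}

// Hypothetical helper: count the observations in each level,
// using the codes directly as array indices.
inline std::vector<std::size_t> count_per_level(const std::vector<int>& codes, std::size_t num_levels) {
    std::vector<std::size_t> counts(num_levels);
    for (auto c : codes) {
        assert(c >= 0 && static_cast<std::size_t>(c) < num_levels);
        ++counts[static_cast<std::size_t>(c)];
    }
    return counts;
}
```

The second helper shows why integer codes are convenient for downstream functions: per-level statistics reduce to simple array indexing, with no need to compare the original values again.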

Quick start

We can create a factor from any categorical variable:

std::vector<std::string> group { "A", "B", "C", "A", "B", "C" };
std::vector<int> codes(group.size());
auto levels = factorize::create_factor(group.size(), group.data(), codes.data());
group[0] == levels[codes[0]]; // true

We can also easily create a factor from multiple variables, where the "levels" will be sorted and unique combinations of the variables.

std::vector<char> grouping1 { 'c', 'a', 'b', 'a', 'b', 'c' };
std::vector<char> grouping2 { 'A', 'B', 'C', 'C', 'B', 'A' };
std::vector<int> combined_codes(grouping1.size());
auto combined_levels = factorize::combine_to_factor(
grouping1.size(),
std::vector<const char*>{ grouping1.data(), grouping2.data() },
combined_codes.data()
);
grouping1[0] == combined_levels[0][combined_codes[0]]; // true
grouping2[0] == combined_levels[1][combined_codes[0]]; // true
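The layout of the combined levels (one array per input variable, all indexed by the same code) can be illustrated with a standalone sketch for the two-variable case. `sketch_combine` is a hypothetical helper written for this example, not part of the factorize API, and it assumes both inputs have the same length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper for illustration only; not part of the factorize API.
// Returns two arrays of levels (one per variable) and fills 'codes'.
inline std::vector<std::vector<char> > sketch_combine(
    const std::vector<char>& first,
    const std::vector<char>& second,
    std::vector<int>& codes)
{
    assert(first.size() == second.size());
    const std::size_t n = first.size();

    // Pair up the two variables for each observation.
    std::vector<std::pair<char, char> > combos;
    combos.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        combos.emplace_back(first[i], second[i]);
    }

    // The combined levels are the sorted, unique pairs.
    auto unique_combos = combos;
    std::sort(unique_combos.begin(), unique_combos.end());
    unique_combos.erase(std::unique(unique_combos.begin(), unique_combos.end()), unique_combos.end());

    // Each code is the position of the observation's pair among the unique pairs.
    codes.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        auto it = std::lower_bound(unique_combos.begin(), unique_combos.end(), combos[i]);
        codes[i] = static_cast<int>(it - unique_combos.begin());
    }

    // Split the unique pairs into one array of levels per variable.
    std::vector<std::vector<char> > levels(2);
    for (const auto& combo : unique_combos) {
        levels[0].push_back(combo.first);
        levels[1].push_back(combo.second);
    }
    return levels;
}
```

With the `grouping1` and `grouping2` values from the example above, this sketch yields five unique combinations, because the pair ('c', 'A') occurs twice; the first and last observations therefore share a code.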

Check out the reference documentation for more details.

Building projects

CMake with FetchContent

If you're using CMake, you just need to add something like this to your CMakeLists.txt:

include(FetchContent)
FetchContent_Declare(
factorize
GIT_REPOSITORY https://github.com/libscran/factorize
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(factorize)

Then you can link to factorize to make the headers available during compilation:

# For executables:
target_link_libraries(myexe ltla::factorize)
# For libraries:
target_link_libraries(mylib INTERFACE ltla::factorize)

CMake with find_package()

find_package(ltla_factorize CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE ltla::factorize)

To install the library, use:

mkdir build && cd build
cmake .. -DFACTORIZE_TESTS=OFF
cmake --build . --target install

By default, this will use FetchContent to fetch all external dependencies. If you want to install them manually, use -DFACTORIZE_FETCH_EXTERN=OFF. See the tags in extern/CMakeLists.txt to find compatible versions of each dependency.

Manual

If you're not using CMake, the simplest approach is to copy the files in include/ (either directly or with Git submodules) and add their path during compilation with, e.g., GCC's -I. This also requires the external dependencies listed in extern/CMakeLists.txt.