factorize
Create factors from categorical variables
factorize Namespace Reference

Create factors from categorical variables.

Functions

template<typename Input_ , typename Code_ >
std::vector< std::vector< Input_ > > combine_to_factor (const std::size_t n, const std::vector< const Input_ * > &inputs, Code_ *const codes)
 
template<typename Input_ , typename Number_ , typename Code_ >
std::vector< std::vector< Input_ > > combine_to_factor_unused (const std::size_t n, const std::vector< std::pair< const Input_ *, Number_ > > &inputs, Code_ *const codes)
 
template<typename Input_ , typename Code_ >
std::vector< Input_ > create_factor (const std::size_t n, const Input_ *const input, Code_ *const codes)
 

Detailed Description

Create factors from categorical variables.

Function Documentation

◆ combine_to_factor()

template<typename Input_ , typename Code_ >
std::vector< std::vector< Input_ > > factorize::combine_to_factor ( const std::size_t n,
const std::vector< const Input_ * > & inputs,
Code_ *const codes )
Template Parameters
Input_: Type of the categorical variables to be combined. Any type may be used here as long as it implements the comparison operators.
Code_: Integer type of the codes of the combined factor. This should be large enough to hold the number of unique combinations.
Parameters
n: Number of observations (i.e., cells).
[in] inputs: Vector of pointers to arrays of length n, each containing a different categorical variable.
[out] codes: Pointer to an array of length n in which the codes of the combined factor are to be stored. On output, the code for observation i refers to the factor level defined by indexing into the inner vectors of the output vector, i.e., for j := codes[i], the factor level is defined by the combination (output[0][j], output[1][j], ...).
Returns
Vector of vectors containing the levels of the combined factor. Each inner vector corresponds to a variable in inputs, and all inner vectors have the same length. Corresponding entries of the inner vectors represent a level of the combined factor, in the form of a combination of values from the input variables, i.e., the first level is defined as (output[0][0], output[1][0], ...), the second level is defined as (output[0][1], output[1][1], ...), and so on. Each entry in output[i] is guaranteed to be a value in inputs[i]. Combinations are guaranteed to be unique and lexicographically sorted (i.e., by the value of the first variable, then the second, and so on).
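The relationship between codes and the returned levels can be illustrated with a minimal sketch. The function sketch_combine_to_factor() below is a hypothetical stand-in written only for illustration: it reproduces the documented behaviour (unique, lexicographically sorted combinations, with codes indexing into them) and is not the library's implementation. It assumes that Input_ supports operator<.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for factorize::combine_to_factor(), for illustration only.
template<typename Input_, typename Code_>
std::vector<std::vector<Input_> > sketch_combine_to_factor(
    const std::size_t n,
    const std::vector<const Input_*>& inputs,
    Code_* const codes)
{
    const std::size_t nvar = inputs.size();

    // Lexicographic comparison of two observations: first variable, then second, and so on.
    auto lex_less = [&](const std::size_t a, const std::size_t b) -> bool {
        for (std::size_t v = 0; v < nvar; ++v) {
            if (inputs[v][a] < inputs[v][b]) return true;
            if (inputs[v][b] < inputs[v][a]) return false;
        }
        return false;
    };

    // Sort the observation indices by their combination of values.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), static_cast<std::size_t>(0));
    std::sort(order.begin(), order.end(), lex_less);

    std::vector<std::vector<Input_> > levels(nvar);
    std::size_t nlevels = 0;
    for (std::size_t k = 0; k < n; ++k) {
        const std::size_t obs = order[k];
        // A new level starts whenever the combination differs from the previous one.
        if (k == 0 || lex_less(order[k - 1], obs)) {
            for (std::size_t v = 0; v < nvar; ++v) {
                levels[v].push_back(inputs[v][obs]);
            }
            ++nlevels;
        }
        codes[obs] = static_cast<Code_>(nlevels - 1);
    }
    return levels;
}

// Worked example with two variables and five observations; returns the codes.
inline std::vector<int> example_combine_to_factor() {
    const std::vector<int> first{ 2, 1, 2, 1, 2 };
    const std::vector<int> second{ 20, 10, 10, 10, 20 };
    const std::vector<const int*> inputs{ first.data(), second.data() };
    std::vector<int> codes(first.size());
    const auto levels = sketch_combine_to_factor(first.size(), inputs, codes.data());

    // Each code recovers the observation's original combination of values.
    for (std::size_t i = 0; i < codes.size(); ++i) {
        assert(levels[0][codes[i]] == first[i]);
        assert(levels[1][codes[i]] == second[i]);
    }
    return codes;
}
```

In this example the observed combinations are (1, 10), (2, 10) and (2, 20), so each inner vector has length 3 and the codes for the five observations are 2, 0, 1, 0, 2.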

◆ combine_to_factor_unused()

template<typename Input_ , typename Number_ , typename Code_ >
std::vector< std::vector< Input_ > > factorize::combine_to_factor_unused ( const std::size_t n,
const std::vector< std::pair< const Input_ *, Number_ > > & inputs,
Code_ *const codes )

This function is a variation of combine_to_factor() that considers unobserved combinations of variables.

Template Parameters
Input_: Factor type. Any type may be used here as long as it is comparable.
Number_: Integer type for the number of unique values in each variable.
Code_: Integer type for the combined factor. This should be large enough to hold the number of unique (possibly unused) combinations.
Parameters
n: Number of observations (i.e., cells).
[in] inputs: Vector of pairs, each of which corresponds to a categorical variable. The first element of the pair is a pointer to an array of length n, containing the values of the variable for each observation. The second element is the total number of unique values for this variable, which may be greater than the largest observed level.
[out] codes: Pointer to an array of length n in which the codes of the combined factor are to be stored. On output, each entry determines the corresponding observation's combination of levels by indexing into the inner vectors of the returned object; see the argument of the same name in combine_to_factor() for more details.
Returns
Vector of vectors containing all unique and sorted combinations of the input variables. This has the same structure as the output of combine_to_factor(), with the only difference being that unobserved combinations are also reported.
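A minimal sketch of how unobserved combinations might be enumerated is given below. It rests on an assumption that this page does not state explicitly: each variable's values are taken to be integer codes in [0, N), where N is the second element of the pair. Under that assumption, the sorted combinations form a mixed-radix numbering in which the first variable is the most significant digit. The function sketch_combine_to_factor_unused() is a hypothetical stand-in, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for factorize::combine_to_factor_unused(), for illustration only.
// Assumes every value of a variable is an integer in [0, number of unique values).
template<typename Input_, typename Number_, typename Code_>
std::vector<std::vector<Input_> > sketch_combine_to_factor_unused(
    const std::size_t n,
    const std::vector<std::pair<const Input_*, Number_> >& inputs,
    Code_* const codes)
{
    const std::size_t nvar = inputs.size();

    // Total number of combinations, observed or not.
    std::size_t ncombos = 1;
    for (const auto& in : inputs) {
        ncombos *= static_cast<std::size_t>(in.second);
    }

    // Mixed-radix code: the first variable is the most significant digit,
    // which yields lexicographic ordering of the combinations.
    for (std::size_t i = 0; i < n; ++i) {
        Code_ code = 0;
        for (const auto& in : inputs) {
            code = code * static_cast<Code_>(in.second) + static_cast<Code_>(in.first[i]);
        }
        codes[i] = code;
    }

    // No combinations exist if any variable has zero possible values.
    if (ncombos == 0) {
        return std::vector<std::vector<Input_> >(nvar);
    }

    // Enumerate every combination by decoding each index back into its digits.
    std::vector<std::vector<Input_> > levels(nvar, std::vector<Input_>(ncombos));
    std::size_t block = ncombos; // number of consecutive levels sharing one value
    for (std::size_t v = 0; v < nvar; ++v) {
        const std::size_t num = static_cast<std::size_t>(inputs[v].second);
        block /= num;
        for (std::size_t j = 0; j < ncombos; ++j) {
            levels[v][j] = static_cast<Input_>((j / block) % num);
        }
    }
    return levels;
}

// Worked example: 2 possible values for the first variable, 3 for the second.
inline std::vector<int> example_combine_to_factor_unused() {
    const std::vector<int> first{ 1, 0, 1 };
    const std::vector<int> second{ 2, 0, 0 };
    const std::vector<std::pair<const int*, int> > inputs{
        { first.data(), 2 }, { second.data(), 3 }
    };
    std::vector<int> codes(first.size());
    const auto levels = sketch_combine_to_factor_unused(first.size(), inputs, codes.data());

    // All 6 combinations are reported, although only 3 are observed.
    assert(levels[0].size() == 6);
    for (std::size_t i = 0; i < codes.size(); ++i) {
        assert(levels[0][codes[i]] == first[i]);
        assert(levels[1][codes[i]] == second[i]);
    }
    return codes;
}
```

Here the six levels are (0, 0), (0, 1), (0, 2), (1, 0), (1, 1) and (1, 2), and the three observations receive the codes 5, 0 and 3.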

◆ create_factor()

template<typename Input_ , typename Code_ >
std::vector< Input_ > factorize::create_factor ( const std::size_t n,
const Input_ *const input,
Code_ *const codes )

Convert a categorical variable into a factor. Factors are defined as in the R programming language, i.e., as an array of integer codes, each of which indexes into an array of unique levels.

Template Parameters
Input_: Type of the categorical variable. Any type may be used here as long as it is hashable and has an equality operator.
Code_: Integer type for the output factor codes.
Parameters
n: Number of observations.
[in] input: Pointer to an array of length n containing the input categorical variable.
[out] codes: Pointer to an array of length n in which the factor codes are to be stored. All values are integers in \([0, N)\) where \(N\) is the length of the output vector; all integers in this range are guaranteed to be present at least once in codes.
Returns
A vector of the unique and sorted values of input, i.e., the factor levels. For any observation i, it is guaranteed that output[codes[i]] == input[i].
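The guarantee output[codes[i]] == input[i] can be illustrated with a minimal sketch. The function sketch_create_factor() below is a hypothetical stand-in that mimics the documented result (sorted unique levels and codes that index into them); it is not the library's implementation. For simplicity it relies on operator< and sorting rather than hashing, which is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for factorize::create_factor(), for illustration only.
template<typename Input_, typename Code_>
std::vector<Input_> sketch_create_factor(
    const std::size_t n,
    const Input_* const input,
    Code_* const codes)
{
    // The factor levels are the sorted, unique values of the input.
    std::vector<Input_> levels(input, input + n);
    std::sort(levels.begin(), levels.end());
    levels.erase(std::unique(levels.begin(), levels.end()), levels.end());

    // Each code is the position of the observation's value among the levels.
    for (std::size_t i = 0; i < n; ++i) {
        const auto it = std::lower_bound(levels.begin(), levels.end(), input[i]);
        codes[i] = static_cast<Code_>(it - levels.begin());
    }
    return levels;
}

// Worked example with five observations; returns the codes.
inline std::vector<int> example_create_factor() {
    const std::vector<int> input{ 30, 10, 20, 10, 30 };
    std::vector<int> codes(input.size());
    const auto levels = sketch_create_factor(input.size(), input.data(), codes.data());

    // Indexing the levels by the codes recovers the original values.
    for (std::size_t i = 0; i < input.size(); ++i) {
        assert(levels[codes[i]] == input[i]);
    }
    return codes;
}
```

For the input 30, 10, 20, 10, 30 the levels are 10, 20, 30 and the codes are 2, 0, 1, 0, 2, so every integer in [0, 3) appears at least once in codes.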