Categorical Data Matrix to One-Hot Binary Matrix
one_hot(data, normalize = FALSE)
Argument | Description |
---|---|
data | A matrix or data.frame of categorical variables, stored as character strings or factors. |
normalize | Logical. If FALSE, a binary matrix is returned. If TRUE, the normalization described in the details is applied to each binary-transformed variable. |
A transformed matrix is returned.
The normalization technique is taken from Outlier Analysis (Aggarwal, 2017), section 8.3. For each column j of the binary-transformed matrix, a normalization factor is defined as sqrt(n_i \* p_j \* (1 - p_j)), where n_i is the number of distinct categories in the raw-data variable from which column j was derived, and p_j is the proportion of records taking the value 1 in column j.
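The normalization above can be sketched in base R. This is only an illustration, not the package's implementation: `one_hot_sketch` is a hypothetical helper. It assumes that each binary column is divided by its normalization factor (which agrees with the 1.5 entries in the gender columns of the example output), that there are no missing values, and that every variable has at least two observed categories. Columns are ordered by factor level, which may differ from the ordering used by `one_hot()`.

```r
# Hypothetical helper for illustration; NOT the package's one_hot().
one_hot_sketch <- function(data, normalize = FALSE) {
  data <- as.data.frame(data, stringsAsFactors = TRUE)
  blocks <- lapply(names(data), function(v) {
    f <- factor(data[[v]])
    lev <- levels(f)
    # One 0/1 indicator column per category of variable v
    b <- outer(as.character(f), lev, "==") * 1
    colnames(b) <- paste(v, lev, sep = "_")
    if (normalize) {
      n_i <- length(lev)    # number of distinct categories in variable v
      p_j <- colMeans(b)    # proportion of 1s in each indicator column
      # Assumption: each column is divided by sqrt(n_i * p_j * (1 - p_j))
      b <- sweep(b, 2, sqrt(n_i * p_j * (1 - p_j)), "/")
    }
    b
  })
  do.call(cbind, blocks)
}

# Deterministic toy data: 10 of 15 records are "male", so p = 2/3 and n = 2
x <- data.frame(
  gender = c(rep("male", 10), rep("female", 5)),
  age_cat = rep(c("young", "old", "unknown"), 5)
)
m <- one_hot_sketch(x, normalize = TRUE)
# Non-zero entries of gender_male equal 1 / sqrt(2 * (2/3) * (1/3)) = 1.5
unique(m[, "gender_male"])
```

A variable with a single observed category would give p_j = 1 and a zero factor, so a real implementation would need to guard against that case.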
x <- data.frame(
  gender = sample(c("male", "female"), 15, TRUE),
  age_cat = sample(c("young", "old", "unknown"), 15, TRUE)
)
one_hot(data = x, normalize = TRUE)
#>       gender_male gender_female age_cat_young age_cat_old age_cat_unknown
#>  [1,]         1.5           0.0      1.224745    0.000000        0.000000
#>  [2,]         0.0           1.5      0.000000    1.443376        0.000000
#>  [3,]         0.0           1.5      1.157275    0.000000        0.000000
#>  [4,]         1.5           0.0      0.000000    1.224745        0.000000
#>  [5,]         1.5           0.0      0.000000    0.000000        1.443376
#>  [6,]         1.5           0.0      1.157275    0.000000        0.000000
#>  [7,]         1.5           0.0      0.000000    0.000000        1.224745
#>  [8,]         1.5           0.0      0.000000    0.000000        1.443376
#>  [9,]         0.0           1.5      1.157275    0.000000        0.000000
#> [10,]         0.0           1.5      0.000000    0.000000        1.224745
#> [11,]         0.0           1.5      0.000000    0.000000        1.443376
#> [12,]         1.5           0.0      1.157275    0.000000        0.000000
#> [13,]         1.5           0.0      0.000000    1.224745        0.000000
#> [14,]         1.5           0.0      0.000000    0.000000        1.443376
#> [15,]         1.5           0.0      0.000000    0.000000        1.157275