Categorical Data Matrix to One-Hot Binary Matrix

one_hot(data, normalize = FALSE)

Arguments

data

a matrix or data.frame of categorical variables as character strings or factors.

normalize

logical. If FALSE, the plain binary (0/1) matrix is returned. If TRUE, the normalization described in Details is applied to each column of the binary-transformed matrix.

Value

A numeric matrix with one column per category of each input variable. Columns are named variable_category, for example gender_male.

Details

The normalization technique is taken from Outlier Analysis (Aggarwal, 2017), Section 8.3. For each column j of the binary-transformed matrix, a normalization factor is defined as sqrt(n_i * p_j * (1 - p_j)), where n_i is the number of distinct categories of the raw-data variable i from which column j was derived, and p_j is the proportion of records with a value of 1 in column j. Each column is divided by its factor; in the Examples output, the nonzero gender entries of 1.5 correspond to 1 / sqrt(2 * 2/3 * 1/3).
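
The encoding and normalization described here can be sketched in base R as follows. This is a minimal illustration, not the package source: the name one_hot_sketch is hypothetical, columns are ordered by first appearance of each category (an assumption), and missing values are not handled.

```r
# Hypothetical re-implementation for illustration only; not the package's code.
one_hot_sketch <- function(data, normalize = FALSE) {
  data <- as.data.frame(data)
  blocks <- lapply(names(data), function(v) {
    x  <- as.character(data[[v]])
    lv <- unique(x)                     # categories, in order of first appearance
    # One 0/1 indicator column per category of variable v.
    b <- outer(x, lv, "==") * 1
    colnames(b) <- paste(v, lv, sep = "_")
    if (normalize) {
      p <- colMeans(b)                  # p_j: share of records equal to 1
      # n_i = length(lv). A variable with a single category gives a factor of 0,
      # so its column would become Inf; this sketch does not guard against that.
      b <- sweep(b, 2, sqrt(length(lv) * p * (1 - p)), "/")
    }
    b
  })
  do.call(cbind, blocks)
}

x <- data.frame(
  gender  = c("male", "female", "male"),
  age_cat = c("young", "old", "young")
)
m <- one_hot_sketch(x, normalize = TRUE)
```

Here each variable has two categories with proportions 2/3 and 1/3, so every nonzero entry of m is 1 / sqrt(2 * 2/3 * 1/3) = 1.5. Dividing by the factor gives rare categories a larger weight than common ones.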

Examples

x <- data.frame(
  gender = sample(c("male", "female"), 15, replace = TRUE),
  age_cat = sample(c("young", "old", "unknown"), 15, replace = TRUE)
)
one_hot(data = x, normalize = TRUE)
#>       gender_male gender_female age_cat_young age_cat_old age_cat_unknown
#>  [1,]         1.5           0.0      1.224745    0.000000        0.000000
#>  [2,]         0.0           1.5      0.000000    1.443376        0.000000
#>  [3,]         0.0           1.5      1.157275    0.000000        0.000000
#>  [4,]         1.5           0.0      0.000000    1.224745        0.000000
#>  [5,]         1.5           0.0      0.000000    0.000000        1.443376
#>  [6,]         1.5           0.0      1.157275    0.000000        0.000000
#>  [7,]         1.5           0.0      0.000000    0.000000        1.224745
#>  [8,]         1.5           0.0      0.000000    0.000000        1.443376
#>  [9,]         0.0           1.5      1.157275    0.000000        0.000000
#> [10,]         0.0           1.5      0.000000    0.000000        1.224745
#> [11,]         0.0           1.5      0.000000    0.000000        1.443376
#> [12,]         1.5           0.0      1.157275    0.000000        0.000000
#> [13,]         1.5           0.0      0.000000    1.224745        0.000000
#> [14,]         1.5           0.0      0.000000    0.000000        1.443376
#> [15,]         1.5           0.0      0.000000    0.000000        1.157275