Converting features to dummies

Question

I have this matrix:

quimio = matrix(c(51,33,16,58,29,13,48,42,30,26,38,16), 
            nrow = 4, ncol = 3)

colnames(quimio) = c("Pouca", "Média", "Alta")
rownames(quimio) = c("Tipo I", "Tipo II", "Tipo III", "Tipo IV")

Which looks like this:

          Pouca Média Alta
Tipo I      51    29   30
Tipo II     33    13   26
Tipo III    16    48   38
Tipo IV     58    42   16

I want to turn it into a tibble such that these row and column names are all dummy variables.

To make a bar chart, I reshaped it like this:

library(tidyverse)

tipo = c("Tipo I", "Tipo II", "Tipo III", "Tipo IV")

tipos = rep(tipo, 3)

quimiotb = as_tibble(quimio)  # as.tibble() is deprecated in favour of as_tibble()
quimiotb = gather(quimiotb)
quimiotb$tipo = tipos

quimiotb = rename(quimiotb, reacao = key)
quimiotb$reacao = factor(quimiotb$reacao)
quimiotb$tipo = factor(quimiotb$tipo)

This is what I get:

A tibble: 12 x 3
reacao value tipo    
<fct>  <dbl> <fct>   
1 Pouca     51 Tipo I  
2 Pouca     33 Tipo II 
3 Pouca     16 Tipo III
4 Pouca     58 Tipo IV 
5 Média     29 Tipo I  
6 Média     13 Tipo II 
7 Média     48 Tipo III
8 Média     42 Tipo IV 
9 Alta      30 Tipo I  
10 Alta     26 Tipo II 
11 Alta     38 Tipo III
12 Alta     16 Tipo IV

And while this is fine for a bar chart with ggplot2, I can't run any model on it: that would require tipo to be spread into 4 columns and reacao into 3. Right now the tibble's first row reads as "51 patients with Tipo I cancer had pouca reacao". I've thought about using spread() but can't find the right combination of arguments. Any help would be appreciated.

tl;dr

I need to tidy quimiotb and don't know how

EDIT: Expected output should be something like this

  A tibble: Y x 7
  Pouca Media Alta Tipo I Tipo II Tipo III Tipo IV    
  <fct> <fct> <fct> <fct>  <fct>   <fct>     <fct>
1   0     1    0      0      1       0         0
2   1     0    0      1      0       0         0

Please also add at least a small part of your expected output. — Julius Vainora, Dec 09 '18 at 19:55
R seldom, if ever, needs explicit transformation of factors to dummies, the modeling functions take care of that in a much more tested and safe way. — Rui Barradas, Dec 09 '18 at 19:59
I just want to run an ANOVA to assess if ``tipo`` is related to ``reacao`` — Pedro Cavalcante, Dec 09 '18 at 20:05

G. Grothendieck · Accepted Answer · 2018-12-09T20:52:38.057

2

The modelling routines create a model matrix for you internally, without you having to specify it, so this should be sufficient:

as.data.frame.table(quimio)

model.matrix can create a model matrix from that, but you don't need to call it yourself: the lm calls in this answer work on the data frame directly.
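To see what that internal dummy coding looks like, here is a minimal sketch that calls model.matrix by hand on the same data. It assumes the default treatment contrasts, so each factor gets one column fewer than it has levels, plus an intercept. The name mm is only illustrative.

```r
# Rebuild the matrix from the question so the example is self-contained
quimio <- matrix(c(51, 33, 16, 58, 29, 13, 48, 42, 30, 26, 38, 16),
                 nrow = 4, ncol = 3,
                 dimnames = list(c("Tipo I", "Tipo II", "Tipo III", "Tipo IV"),
                                 c("Pouca", "Média", "Alta")))

# Long format: Var1 = row names (tipo), Var2 = column names (reacao), Freq = counts
DF <- as.data.frame.table(quimio)

# Treatment-coded dummies: intercept + 3 columns for Var1 + 2 columns for Var2
mm <- model.matrix(~ Var1 + Var2, DF)
dim(mm)
colnames(mm)
```

This is the same 0/1 matrix that lm(Freq ~ Var1 + Var2, DF) builds behind the scenes, which is why the explicit conversion is rarely needed.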

Now you can do things like:

DF <- as.data.frame.table(quimio)
fm0 <- lm(Freq ~ Var1, DF) # or maybe you want Var2?
fm1 <- lm(Freq ~ Var1 + Var2, DF) 
anova(fm0, fm1) # compare

or look at the t tests of the coefficients of Var2 in the output of summary(fm1) to see if they are significantly different from zero.
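As a rough illustration of that last suggestion, the coefficient table can be pulled out of summary(fm1) as a matrix and filtered down to the Var2 rows. The object names coefs and var2_rows are made up for this sketch.

```r
# Rebuild the data so the example is self-contained
quimio <- matrix(c(51, 33, 16, 58, 29, 13, 48, 42, 30, 26, 38, 16),
                 nrow = 4, ncol = 3,
                 dimnames = list(c("Tipo I", "Tipo II", "Tipo III", "Tipo IV"),
                                 c("Pouca", "Média", "Alta")))
DF <- as.data.frame.table(quimio)

fm1 <- lm(Freq ~ Var1 + Var2, DF)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fm1))

# Keep only the rows for the Var2 dummies (the non-reference reacao levels)
var2_rows <- coefs[grepl("^Var2", rownames(coefs)), , drop = FALSE]
var2_rows
```

Each row compares one reacao level against the reference level (the first level of Var2), so a small p-value there points to a difference from that baseline, not to an overall effect.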

Or maybe you want to do a chi-squared test on the original data:

chisq.test(quimio)
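If the chi-squared route is the one you want, a small sketch of how the result might be inspected follows. It stores the test object (here called ct, an arbitrary name) and looks at the expected counts under independence and the Pearson residuals, which show which cells drive any association.

```r
# Rebuild the matrix from the question so the example is self-contained
quimio <- matrix(c(51, 33, 16, 58, 29, 13, 48, 42, 30, 26, 38, 16),
                 nrow = 4, ncol = 3,
                 dimnames = list(c("Tipo I", "Tipo II", "Tipo III", "Tipo IV"),
                                 c("Pouca", "Média", "Alta")))

ct <- chisq.test(quimio)

ct$statistic            # chi-squared statistic
ct$parameter            # degrees of freedom: (4 - 1) * (3 - 1)
ct$p.value              # p-value for independence of tipo and reacao

round(ct$expected, 1)   # counts expected if tipo and reacao were independent
round(ct$residuals, 2)  # Pearson residuals: (observed - expected) / sqrt(expected)
```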

In any case, there are many modelling functions in R; you now have the data in the form they need and can explore them.

edited Dec 09 '18 at 20:52

answered Dec 09 '18 at 20:17


But how do I specify a formula for a model then? – Pedro Cavalcante Dec 09 '18 at 20:21
Let's say I want to run an ANOVA, to see if cancer type is related to ``reacao``. How to specify the model's formula? – Pedro Cavalcante Dec 09 '18 at 20:24
Have transferred my comments to answer. – G. Grothendieck Dec 09 '18 at 20:47

score 1 · Answer 2 · answered Dec 09 '18 at 20:20

Less elegant than I wanted, but should work with data.table and mltools:

> df
    Tipo I Tipo II Tipo III Tipo IV Alta Média Pouca value
 1:      1       0        0       0    0     0     1    51
 2:      0       1        0       0    0     0     1    33
 3:      0       0        1       0    0     0     1    16
 4:      0       0        0       1    0     0     1    58
 5:      1       0        0       0    0     1     0    29
 6:      0       1        0       0    0     1     0    13
 7:      0       0        1       0    0     1     0    48
 8:      0       0        0       1    0     1     0    42
 9:      1       0        0       0    1     0     0    30
10:      0       1        0       0    1     0     0    26
11:      0       0        1       0    1     0     0    38
12:      0       0        0       1    1     0     0    16

Code

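library(tidyverse)   # %>%, rownames_to_column(), gather(), mutate()
library(data.table)
library(mltools)

df <- quimio %>% 
    as.data.frame() %>%
    rownames_to_column() %>%
    gather(key, value, -rowname) %>%
    mutate(rowname = as.factor(rowname),
           key = as.factor(key)) %>%
    as.data.table() %>%
    one_hot()

# Strip the "rowname_" / "key_" prefixes that one_hot() adds.
# (The original rename_all() call referred to df before it existed.)
setnames(df, sub("^.+_", "", names(df)))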

score 1 · Answer 3 · answered Dec 09 '18 at 20:37

Another option would be

fun <- function(x, y) {
  setNames(tibble(a = 1, b = 1)[rep(1, quimio[x, y]), ],
           c(rownames(quimio)[x], colnames(quimio)[y]))
}
1 * !is.na(map2_dfr(row(quimio), col(quimio), fun))
#      Tipo I Pouca Tipo II Tipo III Tipo IV Média Alta
# [1,]      1     1       0        0       0     0    0
# [2,]      1     1       0        0       0     0    0
# [3,]      1     1       0        0       0     0    0
# ...

Here fun creates a two-column tibble for a given row and column pair of quimio; its number of rows equals the corresponding entry of quimio. The second line goes over all row and column pairs, creates a tibble for each, and binds them. Binding leaves NA wherever a tibble lacks a column, so 1 * !is.na(...) turns those NA entries into zeros and everything else into ones.
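The same one-row-per-patient layout can also be reached in base R. This is only a sketch of an alternative, not the method used above: it repeats each row of the long table by its count and then asks model.matrix for full indicator columns by dropping the intercept. The names long and dummies, and the responseName "n", are choices made for this example.

```r
# Rebuild the matrix from the question so the example is self-contained
quimio <- matrix(c(51, 33, 16, 58, 29, 13, 48, 42, 30, 26, 38, 16),
                 nrow = 4, ncol = 3,
                 dimnames = list(c("Tipo I", "Tipo II", "Tipo III", "Tipo IV"),
                                 c("Pouca", "Média", "Alta")))

# Long format with the count stored in column n
DF <- as.data.frame.table(quimio, responseName = "n")

# Repeat each (tipo, reacao) combination n times: one row per patient
long <- DF[rep(seq_len(nrow(DF)), DF$n), c("Var1", "Var2")]

# "- 1" drops the intercept, so every level gets its own 0/1 column
dummies <- cbind(model.matrix(~ Var1 - 1, long),
                 model.matrix(~ Var2 - 1, long))

# Remove the "Var1" / "Var2" prefixes that model.matrix adds
colnames(dummies) <- sub("^Var[12]", "", colnames(dummies))

dim(dummies)      # one row per patient, 4 + 3 indicator columns
colSums(dummies)  # marginal totals per tipo and per reacao
```

Each row then has exactly two ones: one marking the tipo and one marking the reacao of that patient.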
