
Let's say I have a very large dataset with results from wine tastings with tasting descriptors stored in one of the variables.

data.frame(c("red","white","rose"),c("grapefruit, raspberry", "sweet, bold", "tannins, long finish"))

The number of possible descriptors is massive. I want to unpack them so that they become usable for analysis with machine learning techniques. Should I put each possible descriptor in its own variable, or is there a more efficient and compact way to store such data?

Thank you in advance!

2 Answers

Try this approach. You have two variables, but one of them holds several comma-separated values. You can reshape the data so that each value gets its own variable. The best layout also depends on which class of ML algorithm you want to apply (perhaps unsupervised learning). Here is the code:

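library(tidyverse)
#Code
dfnew <- df %>% setNames(.,c('v1','v2')) %>%
  # Tag each wine with a row id, then stack both columns into name/value pairs
  mutate(id=row_number()) %>%
  pivot_longer(-id) %>%
  # Split the comma-separated descriptors into one row each and trim spaces
  separate_rows(value,sep=',') %>%
  mutate(value=trimws(value)) %>% select(-name) %>%
  # Number the values within each wine and spread them into columns V1, V2, ...
  group_by(id) %>% mutate(Var=paste0('V',row_number())) %>%
  pivot_wider(names_from = Var,values_from=value) %>%
  ungroup() %>% select(-id)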

Output:

# A tibble: 3 x 3
  V1    V2         V3         
  <chr> <chr>      <chr>      
1 red   grapefruit raspberry  
2 white sweet      bold       
3 rose  tannins    long finish

Some data used:

#Data
df <- data.frame(c("red","white","rose"),c("grapefruit, raspberry", "sweet, bold", "tannins, long finish"))
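If the goal is instead one indicator column per descriptor, as the question suggests, the same reshaping idea can be pushed one step further. The following is only a minimal sketch under assumptions: the names `wines`, `type`, `descriptors`, `long`, `present`, and `indicators` are made up for illustration, and the intermediate long table (one row per wine and descriptor) may already be the more compact storage when the number of descriptors is very large.

```r
library(dplyr)
library(tidyr)

# Hypothetical, named copy of the example data
wines <- data.frame(
  type = c("red", "white", "rose"),
  descriptors = c("grapefruit, raspberry", "sweet, bold", "tannins, long finish"),
  stringsAsFactors = FALSE
)

# Long format: one row per (wine, descriptor) pair
long <- wines %>%
  mutate(id = row_number()) %>%
  separate_rows(descriptors, sep = ",\\s*") %>%
  distinct()

# Wide format: one 0/1 column per descriptor
indicators <- long %>%
  mutate(present = 1L) %>%
  pivot_wider(
    names_from = descriptors,
    values_from = present,
    values_fill = list(present = 0L)
  )
```

The wide table grows by one column per distinct descriptor, so with a massive vocabulary it may be worth keeping `long` and only widening the descriptors a given model needs.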
Duck

We can do this easily in base R, with no packages needed. First rename the columns of the dataset (the data.frame call did not name them, so R generated unwieldy names from the column expressions). Then use read.csv to read the second column; its default separator is a comma, so it splits the values into separate columns.

names(df) <- paste0('v', seq_along(df))
# strip.white = TRUE drops the space that follows each comma
df[c('v2', 'v3')] <- read.csv(text = df$v2, header = FALSE, strip.white = TRUE)

-output

df
#     v1         v2          v3
#1   red grapefruit   raspberry
#2 white      sweet        bold
#3  rose    tannins long finish

data

df <- structure(list(c..red....white....rose.. = c("red", "white", 
"rose"), c..grapefruit..raspberry....sweet..bold....tannins..long.finish.. = c("grapefruit, raspberry", 
"sweet, bold", "tannins, long finish")),
class = "data.frame", row.names = c(NA, 
-3L))
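When the wines do not all carry the same number of descriptors, a fixed set of columns gets awkward. As a hedged sketch, again in base R only, one could split the strings with strsplit, keep a long table of wine id and descriptor, and cross-tabulate it when an indicator matrix is needed. The names `parts`, `long`, and `counts` are illustrative assumptions, not part of the original data.

```r
# Example data with readable column names
df <- data.frame(
  v1 = c("red", "white", "rose"),
  v2 = c("grapefruit, raspberry", "sweet, bold", "tannins, long finish"),
  stringsAsFactors = FALSE
)

# Split on commas and trim the surrounding spaces
parts <- lapply(strsplit(df$v2, ",", fixed = TRUE), trimws)

# Long format: one row per (wine id, descriptor) pair
long <- data.frame(
  id = rep(seq_along(parts), lengths(parts)),
  descriptor = unlist(parts),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are wines, columns are descriptors,
# and a count above zero means the descriptor is present
counts <- table(id = long$id, descriptor = long$descriptor)
```

The long table stores only the descriptors that actually occur, which keeps it compact even when the full vocabulary is large.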
akrun