
I am struggling to convert my dataset into numeric values. The dataset I have looks like this:

customer_id 2012 2013 2013 2014  2015 2016 2017
15251        X     N     U    D     S    C    L

X1 - X7 are marked as factors. The extract from dput(head(df)) is:

    structure(list(`2012` = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("N",
    "X"), class = "factor"), `2013` = structure(c(6L, 6L, 6L, 6L,
    6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L
    ), .Label = c("C", "D", "N", "S", "U", "X"), class = "factor"),
        `2014` = structure(c(8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L,
        8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L), .Label = c("C",
        "D", "L", "N", "R", "S", "U", "X"), class = "factor"), ...

I would like to convert the data to numeric values, but I don't know how to transform them. The goal is to feed the df into a heatmap so that I can visually explore the differences. As far as I know, this is only possible with a numeric matrix, because otherwise I get the error: heatmap.2(input, trace = "none", : `x' must be a numeric matrix

Does someone have any idea?

Many Thanks for your support!

Lebowski
  • Hi Lebowski, I have suggested something below. Next time, try asking the question in a more direct way, for example "how to visualize a matrix of letters". Maybe that will get you fewer downvotes. – StupidWolf Nov 17 '19 at 15:29

1 Answer


It's doable. Next time it would help to include the complete df. heatmap.2 does not work because you gave it a character matrix rather than a numeric one. Displaying a legend that maps colors to letters is a bit more complicated with heatmap.2, so I suggest something below using ggplot2:

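library(ggplot2)
library(dplyr)
library(tidyr)   # pivot_longer() comes from tidyr, not dplyr
library(viridis)

# simulate data: 5 ids, one random letter per year
df = data.frame(id = 1:5,
                replicate(7, sample(LETTERS[1:10], 5)),
                stringsAsFactors = FALSE)
colnames(df)[-1] = 2012:2018

# convert to long format for plotting and refactor
# levels are taken from the values themselves, so this works
# whether the columns are characters or factors
df <- df %>%
  pivot_longer(-id) %>%
  mutate(value = as.character(value),
         value = factor(value, levels = sort(unique(value))))

# define color scale
# one color per letter, sorted in alphabetical order
present_letters = levels(df$value)
COLS = viridis_pal()(length(present_letters))
names(COLS) = present_letters

# plot: one tile per id/year, filled by letter
ggplot(data = df, aes(x = name, y = id, fill = value)) +
  geom_tile() +
  scale_fill_manual(values = COLS)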
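
If a numeric matrix is still needed (the original heatmap.2 error), here is a minimal sketch of one way to encode the letters. The toy df and the names year_cols, alphabet and m are made up for illustration. The one real point is that all columns share a single set of levels, so the same letter gets the same code in every year. Calling as.integer() on each factor separately would not guarantee that, because the columns in the question have different level sets.

```r
# Hypothetical toy data: factor columns with different level sets,
# mimicking the dput() extract in the question.
df <- data.frame(
  customer_id = c(15251, 15252, 15253),
  `2012` = factor(c("X", "N", "X")),
  `2013` = factor(c("N", "U", "D")),
  `2014` = factor(c("U", "X", "L")),
  check.names = FALSE
)

# Build one shared alphabet across all year columns.
year_cols <- setdiff(names(df), "customer_id")
alphabet <- sort(unique(unlist(lapply(df[year_cols], as.character))))

# Map every letter to its position in the shared alphabet.
m <- sapply(df[year_cols], function(col) match(as.character(col), alphabet))
rownames(m) <- df$customer_id

# m is now an integer matrix and can be passed on, e.g.
# gplots::heatmap.2(m, trace = "none")
```

Keep in mind the codes are just labels: the numeric distance between two letters is arbitrary, so any clustering heatmap.2 does on this matrix depends on the chosen letter order.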


StupidWolf
  • Hi StupidWolf. That's great advice! Thank you very much for your solution. Do you know by any chance how I can include the dendrogram? (to somehow sort them in cluster order)? – Lebowski Nov 18 '19 at 15:59
  • Hey @Lebowski, how would you like to cluster? I mean what you have are letters.. unless there's a defined way to convert them to numeric.. – StupidWolf Nov 18 '19 at 18:35
  • Hi StupidWolf. I encoded them into numeric values and gave them numbers from 1-8, but I am not entirely sure if this is correct. What I was thinking of trying is the package stringdist and sequence analysis (BioMedR and msa)... – Lebowski Nov 18 '19 at 19:31
  • @Lebowski you have amino acids? Yeah, you need to align to get the order first. Putting a heatmap together with a dendrogram is not so easy with ggplot2. You can try heatmap.2 or pheatmap. But hey, do you need to do this a lot? If it is once, I would combine them using illustrator lol – StupidWolf Nov 18 '19 at 23:24
  • You can also try asking your question (please provide data and code!) at https://bioinformatics.stackexchange.com/ , someone might have a good solution – StupidWolf Nov 18 '19 at 23:25
  • Hi StupidWolf. Yeah, it was kinda tricky to have nice dendrograms with a heatmap but now I got the trick...but only by converting the characters into numbers. No, the data are actually just Strings of a customer behaviour over time. I created the characters myself. My Professor thought it would be great if I could use the analogy of bioinformatics to extract some "hidden" patterns. Have a good day. – Lebowski Nov 19 '19 at 07:43
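
For readers wondering how letter sequences like these could be clustered without assigning arbitrary numbers, here is a rough base-R sketch. It is not what either poster actually ran. It assumes all sequences have the same length and uses the Hamming distance (number of positions where two rows differ); the toy matrix seqs and the row names c1 to c4 are hypothetical.

```r
# Hypothetical toy sequences: one row per customer, one letter per year.
seqs <- rbind(
  c1 = c("X", "N", "U", "D"),
  c2 = c("X", "N", "U", "S"),
  c3 = c("C", "L", "R", "S"),
  c4 = c("C", "L", "R", "D")
)

# Hamming distance matrix: count mismatching positions for every pair of rows.
n <- nrow(seqs)
d <- matrix(0, n, n, dimnames = list(rownames(seqs), rownames(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- sum(seqs[i, ] != seqs[j, ])
  }
}

# Hierarchical clustering on that distance.
hc <- hclust(as.dist(d), method = "average")

# Reorder the rows in dendrogram order, e.g. before plotting tiles.
seqs_ordered <- seqs[hc$order, ]
```

The resulting tree can be drawn with plot(hc), or converted with as.dendrogram(hc) and supplied as the Rowv argument of heatmap.2 so the rows follow the sequence-based clustering rather than one computed from numeric codes.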