How to create a heatmap for only the 50 highest values

Question

I have a data matrix with thousands of rows, like this:

                                    file_A    file_B     file_C    file_D
Carbohydrate metabolism             69370     67839      68914      67272
Energy metabolism                   40223     40750      39450      39735
Lipid metabolism                    22333     21668      22421      21773
Nucleotide metabolism               18449     18389      17560      18263
Amino acid metabolism               63739     63441      62797      63106
Metabolism of other amino acids     19075     19068      18896      18836

I want to create a heatmap of only the 50 highest-value rows for file_A, file_B, file_C and file_D.

How can I do this?

score 1 · Answer 1 · answered Mar 13 '20 at 13:46

Assuming you want the top 50 rows for the sum of file_A through file_D, you can do so with dplyr pretty easily:

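library(dplyr)

your_dataframe %>% 
  # row-wise sum across the four file columns
  mutate(fileSum = select(., file_A:file_D) %>% rowSums()) %>%
  # largest sums first, then keep the first 50 rows
  arrange(desc(fileSum)) %>%
  head(50)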

From there, you can pipe into ggplot for your desired visual, save it as a separate dataframe, or whatever you need to do.
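As a dependency-free cross-check of the same selection, here is a base R sketch on the six rows shown in the question. The object name `mat` and the cutoff `n = 3` are assumptions for illustration; on the full matrix you would use `n = 50`.

```r
# The six example rows from the question, as a numeric matrix
mat <- matrix(
  c(69370, 67839, 68914, 67272,
    40223, 40750, 39450, 39735,
    22333, 21668, 22421, 21773,
    18449, 18389, 17560, 18263,
    63739, 63441, 62797, 63106,
    19075, 19068, 18896, 18836),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("Carbohydrate metabolism", "Energy metabolism", "Lipid metabolism",
      "Nucleotide metabolism", "Amino acid metabolism",
      "Metabolism of other amino acids"),
    c("file_A", "file_B", "file_C", "file_D")
  )
)

n <- 3  # assumed cutoff for this toy data; use 50 on the full matrix

# Order rows by their sum across the four files, largest first, and keep n
keep <- order(rowSums(mat), decreasing = TRUE)[seq_len(min(n, nrow(mat)))]
top <- mat[keep, , drop = FALSE]

rownames(top)

# Draw the heatmap only in an interactive session
if (interactive()) heatmap(top, Rowv = NA, Colv = NA, scale = "none")
```

On these six rows the three largest sums belong to carbohydrate, amino acid and energy metabolism, in that order.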

user12728748 · Answer 2 · 2020-03-13T14:29:41.607

First, determine the maximum value of each row, then sort in descending order and pick the top 50. Then plot, e.g. using pheatmap.

library(pheatmap)

# toy example
df <- data.frame(iris[, 1:4], row.names=make.unique(as.character(iris$Species)))

# pick top 50 rows with highest values
top <- df[order(apply(df, 1, max), decreasing = TRUE)[1:50],]

# plot heatmap
pheatmap::pheatmap(top)

Created on 2020-03-13 by the reprex package (v0.3.0)

Edit:

If I misunderstood and you want the sums of the rows, then use

top <- df[order(rowSums(df), decreasing = TRUE)[1:50], ]

instead.

Edit #2:

If you want the top 50 for each column (i.e. each file), as suggested by dc37, then you can use

top <- df[unique(unlist(lapply(df, function(x) order(x, decreasing = TRUE)[1:50]))),]

instead.
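The three selections above can return different rows. A small sketch on made-up data (the data frame `toy` and `n = 2` are assumptions for illustration) makes the difference concrete:

```r
# Made-up data: four rows, three columns
toy <- data.frame(
  a = c(10, 1, 5, 2),
  b = c(1, 9, 5, 3),
  c = c(1, 1, 5, 8),
  row.names = c("r1", "r2", "r3", "r4")
)
n <- 2

# 1. Rows with the largest single value in any column
by_max <- rownames(toy)[order(apply(toy, 1, max), decreasing = TRUE)[1:n]]

# 2. Rows with the largest sum across columns
by_sum <- rownames(toy)[order(rowSums(toy), decreasing = TRUE)[1:n]]

# 3. Union of the top n rows of each column
idx <- unique(unlist(lapply(toy, function(x) order(x, decreasing = TRUE)[1:n])))
by_col <- rownames(toy)[idx]

by_max  # "r1" "r2"
by_sum  # "r3" "r4"
by_col  # "r1" "r3" "r2" "r4"
```

Note that the third variant can return more than n rows, because each column contributes its own top n.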

score 0 · Accepted Answer · answered Mar 13 '20 at 14:09

Maybe I misunderstood your question, but from my understanding, you are looking to make a heatmap of the top 50 values of file A, the top 50 values of file B, the top 50 of file C and the top 50 of file D. Am I right?

If that is what you are looking for, it means that you need not just 50 but potentially up to 200 rows (depending on whether the same row is in the top 50 for all files or for only one).
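To see how many distinct rows the per-file top 50 covers in practice, here is a short sketch on simulated data. The matrix `sim`, the seed and the sizes are assumptions; the only guaranteed fact is that the count falls between n and n times the number of columns.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 50

# Simulated counts: 5000 rows, four files
sim <- matrix(sample(10000:99000, 5000 * 4, replace = TRUE), ncol = 4,
              dimnames = list(NULL, c("file_A", "file_B", "file_C", "file_D")))

# Row indices of the top n values in each column (one column of indices per file)
top_idx <- apply(sim, 2, function(x) order(x, decreasing = TRUE)[seq_len(n)])

# Distinct rows that are in the top n of at least one file
n_rows <- length(unique(as.vector(top_idx)))
n_rows
```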

Here is a dummy example of a large dataframe corresponding to your example:

row <- expand.grid(LETTERS, letters, LETTERS)
row$Row = paste(row$Var1, row$Var2, row$Var3, sep = "")
df <- data.frame(row = row$Row, 
                 file_A = sample(10000:99000,nrow(row), replace = TRUE),
                 file_B = sample(10000:99000,nrow(row), replace = TRUE),
                 file_C = sample(10000:99000,nrow(row), replace = TRUE),
                 file_D = sample(10000:99000,nrow(row), replace = TRUE))

> head(df)
  row file_A file_B file_C file_D
1 AaA  54418  65384  43526  86870
2 BaA  57098  75440  92820  27695
3 CaA  71172  59942  12626  53196
4 DaA  54976  25370  43797  30770
5 EaA  56631  73034  50746  77878
6 FaA  45245  57979  72878  94381

To draw a heatmap with ggplot2, the data need the following organization: one column for the x values, one column for the y values, and one column whose values determine the fill colour.

To get that, you need to reshape your dataframe into a longer format. You can use the pivot_longer function from the tidyr package, but as you have thousands of rows, I would rather recommend data.table, which is faster for this kind of operation.
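For comparison, the tidyr route mentioned above might look like the sketch below. The small data frame `wide` holds three rows and two files copied from the `head(df)` output above; `names_to` and `values_to` are chosen to match the column names used in the rest of this answer.

```r
library(tidyr)

# Small wide data: one row per label, one column per file
wide <- data.frame(
  row = c("AaA", "BaA", "CaA"),
  file_A = c(54418, 57098, 71172),
  file_B = c(65384, 75440, 59942)
)

# One row per (row, File) pair, with the count in Value
long <- pivot_longer(wide, cols = -row,
                     names_to = "File", values_to = "Value")
long
```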

library(data.table)
DF <- melt(setDT(df), measure = list(c("file_A","file_B","file_C","file_D")), value.name = "Value", variable.name = "File")

   row   File Value
1: AaA file_A 54418
2: BaA file_A 57098
3: CaA file_A 71172
4: DaA file_A 54976
5: EaA file_A 56631
6: FaA file_A 45245

Now we can use dplyr to keep only the top 50 values for each file:

library(dplyr)
Extract_DF <- DF %>% 
  group_by(File) %>% 
  arrange(desc(Value)) %>% 
  slice(1:50)

# A tibble: 200 x 3
# Groups:   File [4]
   row   File   Value
   <fct> <fct>  <int>
 1 PaH   file_A 98999
 2 RwX   file_A 98996
 3 JjQ   file_A 98992
 4 SfA   file_A 98990
 5 TrI   file_A 98989
 6 WgU   file_A 98975
 7 DnZ   file_A 98969
 8 TdK   file_A 98965
 9 YlS   file_A 98954
10 FeZ   file_A 98954
# … with 190 more rows
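With dplyr 1.0.0 or later, the same per-group selection can be sketched with `slice_max()`. The tiny data frame `long` and `n = 2` are assumptions for illustration; `with_ties = FALSE` caps each group at exactly n rows even when values tie, which `slice(1:50)` after sorting also does.

```r
library(dplyr)

# Made-up long data: two files, four rows each
long <- data.frame(
  row = rep(c("r1", "r2", "r3", "r4"), times = 2),
  File = rep(c("file_A", "file_B"), each = 4),
  Value = c(10, 40, 30, 20, 5, 5, 9, 1)
)

# Keep the 2 largest values within each file
top2 <- long %>%
  group_by(File) %>%
  slice_max(Value, n = 2, with_ties = FALSE) %>%
  ungroup()
top2
```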

Now to plot this as a heatmap we can do:

library(ggplot2)
ggplot(Extract_DF, aes(y = row, x = File, fill = Value))+
  geom_tile(color = "black")+
  scale_fill_gradient(low = "red", high = "green")

This gives a heatmap with one column per file and one row per retained label.

I intentionally kept the y-axis labels, even though they are not elegant, so that you can see how the graph is organized. The white gaps are rows that are in the top 50 for one column but not for the others.
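If the gaps are not wanted, one option (an assumption about what you might prefer, not part of the approach above) is to keep every file's value for any row that reaches the top n in at least one file. A base R sketch on a made-up long table, where `long`, `n = 2` and the other names are hypothetical:

```r
# Made-up long data: four rows measured in two files
long <- data.frame(
  row = rep(c("r1", "r2", "r3", "r4"), times = 2),
  File = rep(c("file_A", "file_B"), each = 4),
  Value = c(10, 40, 30, 20, 50, 5, 9, 1)
)
n <- 2

# Per file, the row labels holding the n largest values
top_rows <- lapply(split(long, long$File), function(d) {
  d$row[order(d$Value, decreasing = TRUE)[seq_len(n)]]
})

# Union of those labels across files
keep <- unique(unlist(top_rows))

# Keep all values for the retained rows, so every tile is filled
filled <- long[long$row %in% keep, ]
filled
```

Passing a table like `filled` to the same `ggplot()` call would then draw a tile for every retained row in every file.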

If you are looking for only the top 50 values across all columns, you can use @Jon's answer and then the last part of my answer to draw the heatmap with ggplot2.

score 0 · Answer 4 · answered Mar 13 '20 at 14:24

Here is another approach using rank. I am using a matrix, but it should work just as well on a data.frame. Using the volcano dataset, each column is reverse ranked (i.e. the highest value gets the lowest rank), and the function returns TRUE (1) for values with a rank of 50 or less and FALSE (0) otherwise. I include a plot of the scaled version of the matrix to show that the result correctly identifies the highest values in each column of the matrix.

# example data
M <- volcano

# for reference - each column is centered and scaled
Msc <- scale(M)

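# return TRUE if the value is among the 50 highest in its column
Ma <- apply(M, 2, function(x){
  # reverse rank: 1 for the highest value (the + 1 keeps the cutoff at 50, not 51)
  ran <- length(x) + 1 - rank(x, ties.method = "average")
  ran <= 50
})
colSums(Ma)  # number of selected values per column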


png("tmp.png", width = 7.5, height = 2.5, units = "in", res = 400)
op <- par(mfcol = c(1,3), mar = c(1,1,1.5,1), oma = c(2,2,0,0))
image(M, xlab = "", ylab = "", xaxt = "n", yaxt = "n"); mtext("original")
image(Msc, xlab = "", ylab = "", xaxt = "n", yaxt = "n"); mtext("scaled")
image(Ma, xlab = "", ylab = "", xaxt = "n", yaxt = "n"); mtext("top 50 for each column")
mtext(text = "rows", side = 1, line = 0, outer = TRUE)
mtext(text = "columns", side = 2, line = 0, outer = TRUE)
par(op)
dev.off()
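A quick check of the ranking logic on a small vector, as a sketch (the vector `x` and `n = 2` are made up): ranking the negated values gives rank 1 to the largest value, so `<= n` keeps exactly the n largest when there are no ties at the cutoff.

```r
x <- c(5, 1, 9, 7, 3)
n <- 2

# Descending rank: 1 for the largest value, 2 for the next, and so on
desc_rank <- rank(-x, ties.method = "min")
desc_rank

# TRUE for the n largest values
is_top <- desc_rank <= n
is_top
```

Here only 9 and 7 are flagged, so `sum(is_top)` equals `n`.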
