Maybe I misunderstood your question, but as I understand it, you are looking to make a heatmap of the top 50 values of file A, the top 50 of file B, the top 50 of file C and the top 50 of file D. Am I right?
If that is what you are looking for, it means you may need not just 50 but up to 200 rows (depending on whether the same row is in the top 50 for all files or for only one).
Here is a dummy example of a large dataframe corresponding to your example:
# Build 26 * 26 * 26 row names (AaA, BaA, CaA, ...)
row <- expand.grid(LETTERS, letters, LETTERS)
row$Row <- paste(row$Var1, row$Var2, row$Var3, sep = "")
# One random value per row for each of the four files
df <- data.frame(row = row$Row,
                 file_A = sample(10000:99000, nrow(row), replace = TRUE),
                 file_B = sample(10000:99000, nrow(row), replace = TRUE),
                 file_C = sample(10000:99000, nrow(row), replace = TRUE),
                 file_D = sample(10000:99000, nrow(row), replace = TRUE))
> head(df)
row file_A file_B file_C file_D
1 AaA 54418 65384 43526 86870
2 BaA 57098 75440 92820 27695
3 CaA 71172 59942 12626 53196
4 DaA 54976 25370 43797 30770
5 EaA 56631 73034 50746 77878
6 FaA 45245 57979 72878 94381
To get a heatmap with ggplot2, your data needs to be organized as follows: one column for the x values, one column for the y values, and one column holding the values used for the fill.
To get that, you need to reshape your dataframe into a longer format. You can use the pivot_longer function from the tidyr package, but as you have thousands of rows, I would rather recommend data.table, which is faster for this kind of operation.
library(data.table)
# Reshape to long format: one line per (row, File) pair
DF <- melt(setDT(df),
           measure.vars = c("file_A", "file_B", "file_C", "file_D"),
           variable.name = "File", value.name = "Value")
row File Value
1: AaA file_A 54418
2: BaA file_A 57098
3: CaA file_A 71172
4: DaA file_A 54976
5: EaA file_A 56631
6: FaA file_A 45245
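For comparison, the tidyr route mentioned above could be sketched as follows. This is only a minimal example on a small made-up dataframe (`small_df` and `long_df` are hypothetical names, not part of the data above). It should give the same three columns, but the line order differs from `melt` because `pivot_longer` keeps the values of each original row together.

```r
library(tidyr)

# Small made-up dataframe with the same layout as df (hypothetical values)
small_df <- data.frame(row    = c("AaA", "BaA", "CaA"),
                       file_A = c(5, 3, 9),
                       file_B = c(2, 8, 1))

# One line per (row, File) pair, with the measurement in Value
long_df <- pivot_longer(small_df,
                        cols      = c("file_A", "file_B"),
                        names_to  = "File",
                        values_to = "Value")
long_df
```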
Now, we can use dplyr to keep only the top 50 values for each file:
library(dplyr)
# Sort by decreasing Value, then keep the first 50 lines of each File group
Extract_DF <- DF %>%
  group_by(File) %>%
  arrange(desc(Value)) %>%
  slice(1:50)
# A tibble: 200 x 3
# Groups: File [4]
row File Value
<fct> <fct> <int>
1 PaH file_A 98999
2 RwX file_A 98996
3 JjQ file_A 98992
4 SfA file_A 98990
5 TrI file_A 98989
6 WgU file_A 98975
7 DnZ file_A 98969
8 TdK file_A 98965
9 YlS file_A 98954
10 FeZ file_A 98954
# … with 190 more rows
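Because the same row can be in the top 50 of several files, these 200 lines cover anywhere between 50 and 200 distinct rows. A quick way to check this is sketched below on a reduced, made-up example (top 2 instead of top 50; `toy` and `top2` are hypothetical names). It uses `slice_max()`, available in dplyr 1.0.0 and later, as a shorter alternative to `arrange()` followed by `slice()`; `with_ties = FALSE` is set so that exactly `n` lines are kept per group, as `slice(1:50)` does.

```r
library(dplyr)

# Made-up long-format data: 4 rows measured in 2 files
toy <- data.frame(row   = rep(c("AaA", "BaA", "CaA", "DaA"), times = 2),
                  File  = rep(c("file_A", "file_B"), each = 4),
                  Value = c(10, 40, 30, 20,    # file_A
                            15, 45,  5, 35))   # file_B

# Keep the 2 largest values of each file
top2 <- toy %>%
  group_by(File) %>%
  slice_max(Value, n = 2, with_ties = FALSE) %>%
  ungroup()

nrow(top2)             # 4 lines: 2 per file
n_distinct(top2$row)   # 3 distinct rows: BaA is in the top 2 of both files
```

On the real data, `n_distinct(Extract_DF$row)` gives the number of lines the heatmap will have on its y axis.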
Now, to plot this as a heatmap, we can do:
library(ggplot2)
# One tile per (File, row) pair, coloured by Value
ggplot(Extract_DF, aes(y = row, x = File, fill = Value)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "red", high = "green")
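And you get a heatmap with the files on the x axis, the row names on the y axis, and tiles filled from red (low values) to green (high values).
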
I intentionally kept the y labels, even though they are not elegant, so that you can see how the graph is organized. All the white spots are rows that are in the top 50 in one column but not in the others.
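If the y labels are too cluttered for the final figure, they can be hidden through the theme. Here is a minimal sketch on made-up data (`toy` and `p` are hypothetical names): the (AaA, file_B) combination is absent from the data, so no tile is drawn there, which is how the gaps in the plot arise.

```r
library(ggplot2)

# Made-up long-format data; there is no line for (AaA, file_B)
toy <- data.frame(row   = c("AaA", "BaA", "BaA"),
                  File  = c("file_A", "file_A", "file_B"),
                  Value = c(10, 40, 45))

p <- ggplot(toy, aes(x = File, y = row, fill = Value)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "red", high = "green") +
  # Hide the y axis labels and tick marks
  theme(axis.text.y  = element_blank(),
        axis.ticks.y = element_blank())
p
```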
If you are looking for only the top 50 values across all columns, you can use @Jon's answer and then the last part of my answer to get a heatmap with ggplot2.