How to convert huge file with huge number of rows from long to wide format in R?

Question

I have a dataframe ACC which looks like following: The dataframe shows miRNA and Target interactions:

ACC:

  miRNAs        Targets
MIMAT0000075    HIF1A
MIMAT0000449    CXCR4
MIMAT0000421    CYP7A1
MIMAT0000279    STAT5A
MIMAT0000076    RASGRP1

I converted the above long format to wider format using mutate and pivot_wider. I used the following snippet for the conversion: If there is interaction it will be 1 else 0

library(dplyr)
library(tidyr)

validated_targets <- ACC %>%
  mutate(n = 1) %>%
  pivot_wider(names_from = miRNAs, values_from = n, values_fill = list(n = 0))

The output looks like below:

Help needed:

The above dataframe is just a small example. I have a dataframe with 400639 rows and two columns. The dimension of my original data is below:

dim(originaldata)
[1] 400639      2

I'm not able to convert my original data from long to wider format in Rstudio. Can anyone please tell me how to convert such huge file from long to wider format like the output given above?

thanq

Try [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html). — JBGruber, Oct 12 '19 at 09:39
@JBGruber If you don't mind could you please give me an example. — beginner, Oct 12 '19 at 09:51
There might be a better way to achieve what you want which will save you having to make it wide. — Hearkz, Oct 12 '19 at 09:59
But for some of my work, I need the table like the above example output. There is no other way. This is what I need. — beginner, Oct 12 '19 at 10:07
From the [tidyr::pivot vignette](https://tidyr.tidyverse.org/reference/pivot_wider.html) - "It’s relatively rare to need pivot_wider() to make tidy data, but it’s often useful for creating summary tables for presentation, or data in a format needed by other tools." You may get a more helpful response if you can say why you need it in a wide format. — Hearkz, Oct 12 '19 at 10:08

JBGruber · Accepted Answer · 2019-10-12T10:18:34.680

I believe what you are looking for is the data.table equivalent to pivot_wider - dcast:

library(data.table)
library(dplyr)

dt_wide <- dt %>%
  mutate(n = 1) %>% 
  dcast(Targets ~ miRNAs, value.var = "n", fill = 0)

dt_wide
#>   Targets MIMAT0000075 MIMAT0000076 MIMAT0000279 MIMAT0000421 MIMAT0000449
#> 1   CXCR4            0            0            0            0            1
#> 2  CYP7A1            0            0            0            1            0
#> 3   HIF1A            1            0            0            0            0
#> 4 RASGRP1            0            1            0            0            0
#> 5  STAT5A            0            0            1            0            0

data

dt <- fread("miRNAs        Targets
MIMAT0000075    HIF1A
MIMAT0000449    CXCR4
MIMAT0000421    CYP7A1
MIMAT0000279    STAT5A
MIMAT0000076    RASGRP1")


dt
#>          miRNAs Targets
#> 1: MIMAT0000075   HIF1A
#> 2: MIMAT0000449   CXCR4
#> 3: MIMAT0000421  CYP7A1
#> 4: MIMAT0000279  STAT5A
#> 5: MIMAT0000076 RASGRP1

benchmarking

The data.table function is supposed to be quicker. So let's try how much of a difference this makes in this case and with data that is the same size as the data described in the question:

# simulate data
dt_big <- data.table(
  miRNAs = sample(dt$miRNAs, 400639, replace = TRUE),
  Targets = sample(dt$Targets, 400639, replace = TRUE)
)

# create mock functions for easier benchmarking
dt_dcast <- function(x) {
  dt %>%
    mutate(n = 1) %>% 
    dcast(Targets ~ miRNAs, value.var = "n", fill = 0)
}

tidyr_pivot_wider <- function(variables) {
  dt %>%
    mutate(n = 1) %>%
    pivot_wider(names_from = miRNAs, values_from = n, values_fill = list(n = 0))
}

bench::mark(
  dt_dcast(dt_big),
  tidyr_pivot_wider(dt_big),
  check = FALSE
)
#> # A tibble: 2 x 6
#>   expression                     min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt_dcast(dt_big)             1.8ms      2ms      462.     973KB     10.5
#> 2 tidyr_pivot_wider(dt_big)   4.02ms   4.52ms      218.     826KB     13.5

As you can see, data.table takes less than half the time to complete the task. However, neither function really takes long on that size of a dataset and I'm a little surprised you are having problems with pivot_wider in the first place.

But I would like to know Will I get the output quickly, Or will take huge time for conversion. Asking this because with "pivot_wider" I also get the same output, but it is taking hours of time and didn't get required output with the huge input file. — beginner, Oct 12 '19 at 10:13
I was about to add this to the answer. `data.table` is usually fast than `dplyr`/`tidyr`. — JBGruber, Oct 12 '19 at 10:19
I'm really surprised though that you weren't able to use `pivot_wider` on your data. Maybe it would make sense to describe how exactly `R` failed to perform the task. — JBGruber, Oct 12 '19 at 10:23
Wow this is awesome. I got the required output with "data.table" in few mins, but the problem with dplyr/tidyr and using "pivot_wider" is its been already more than hour for conversion, but it is still running. But anyways "data.table" is super fast as you said. — beginner, Oct 12 '19 at 10:31
Great! One stupid question though: Have you checked if either `miRNAs` or `Targets` is a factor? I imagine that would increase the time for computation by a lot. — JBGruber, Oct 12 '19 at 11:02
The I assumethis should be pretty quick as well: `dt %>% mutate(n = 1) %>% mutate_if(is.factor, as.character) %>% pivot_wider(names_from = miRNAs, values_from = n, values_fill = list(n = 0))` — JBGruber, Oct 12 '19 at 12:42

How to convert huge file with huge number of rows from long to wide format in R?

1 Answers1

data

benchmarking