I have a very large data frame with 10,703,009 rows. I want to remove the rows containing NAs, but I get this error: 'Calloc' could not allocate memory of 10703009 bytes. My input data frame is 'a', and many of its rows contain NAs:

IDs Codes
1 C493
1 NA
2 E348
3 NA

I need an output that contains only the rows without NAs:

IDs Codes
1 C493
2 E348

I tried both of the following, but each one gives the memory error:

drop_na(a,Codes)
subset(a,Codes)

Please suggest the solution to this in R.

James Z
Muhammad

1 Answer

A data frame with 10,703,009 rows is no problem for R, as shown below. I generated a tibble with exactly that number of rows, in which the variable Codes is NA with probability probNA = 0.3.

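library(tidyverse)

n <- 10703009   # same number of rows as in the question
probNA <- 0.3   # probability that a given Codes value is NA

# Random codes: one letter A-J followed by a three-digit number;
# each value is then replaced by NA with probability probNA.
df <- tibble(IDs = 1:n,
             Codes = paste0(sample(LETTERS[1:10], n, replace = TRUE),
                            sample(100:999, n, replace = TRUE))) %>%
  mutate(Codes = ifelse(sample(c(TRUE, FALSE), n, replace = TRUE,
                               prob = c(probNA, 1 - probNA)), NA, Codes))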

df

output

# A tibble: 10,703,009 x 2
     IDs Codes
   <int> <chr>
 1     1 I586 
 2     2 A188 
 3     3 H674 
 4     4 D641 
 5     5 A793 
 6     6 B455 
 7     7 B837 
 8     8 A805 
 9     9 NA   
10    10 E380 
# ... with 10,702,999 more rows

The size of such a tibble, as returned by object.size(df), is 128,941,096 bytes (roughly 129 MB).
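To read such a size in friendlier units, a minimal sketch could look like the one below. It uses a small hypothetical data frame `x` (not the full `df`) so that it runs instantly; `object.size()` and its `format()` method come from the utils package that ships with R.

```r
# Hypothetical small data frame, used only to illustrate the calls.
x <- data.frame(IDs = 1:1000,
                Codes = rep(c("C493", NA), 500),
                stringsAsFactors = FALSE)

sz <- object.size(x)                  # estimated memory use in bytes
print(sz)                             # raw byte count

size_kb <- format(sz, units = "KB")   # same estimate, formatted in kilobytes
size_kb
```

Calling `format(object.size(df), units = "MB")` on the full tibble should report the same figure as above in megabytes.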

Now let's remove the rows with NA values.

df %>% filter(!is.na(Codes))

output

# A tibble: 7,490,809 x 2
     IDs Codes
   <int> <chr>
 1     1 I586 
 2     2 A188 
 3     3 H674 
 4     4 D641 
 5     5 A793 
 6     6 B455 
 7     7 B837 
 8     8 A805 
 9    10 E380 
10    11 C231 
# ... with 7,490,799 more rows
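For comparison, the same filtering can be sketched in base R, without dplyr. The small data frame `a` below is rebuilt from the four example rows in the question, and the result names are made up for illustration. Note that `subset()` expects a logical condition, so `subset(a, Codes)` as written in the question cannot work regardless of memory.

```r
# Example data taken from the question (four rows, two of them with NA).
a <- data.frame(IDs = c(1, 1, 2, 3),
                Codes = c("C493", NA, "E348", NA),
                stringsAsFactors = FALSE)

# Logical indexing: keep rows where Codes is not NA.
res_index <- a[!is.na(a$Codes), ]

# subset() with an explicit logical condition.
res_subset <- subset(a, !is.na(Codes))

# complete.cases() drops rows with an NA in any column.
res_complete <- a[complete.cases(a), ]

res_index
```

All three variants should keep only the rows with IDs 1 and 2 for this example.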

Now let's replace all NA values with an empty string.

df %>% mutate(Codes = ifelse(is.na(Codes), "", Codes))

output

# A tibble: 10,703,009 x 2
     IDs Codes 
   <int> <chr> 
 1     1 "I586"
 2     2 "A188"
 3     3 "H674"
 4     4 "D641"
 5     5 "A793"
 6     6 "B455"
 7     7 "B837"
 8     8 "A805"
 9     9 ""    
10    10 "E380"
# ... with 10,702,999 more rows
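The replacement can also be sketched in base R with indexed assignment, which only writes to the positions that are NA. The data frame `a` is again the four-row example from the question; whether this uses noticeably less memory than `ifelse()` on the full data is an assumption that would need to be checked on the real machine.

```r
# Example data taken from the question.
a <- data.frame(IDs = c(1, 1, 2, 3),
                Codes = c("C493", NA, "E348", NA),
                stringsAsFactors = FALSE)

# Overwrite only the NA entries of Codes with an empty string.
a$Codes[is.na(a$Codes)] <- ""

a
```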

As you can see, both operations run smoothly on a data frame of this size, without any memory problems.

Marek Fiołka