
I have been trying to import a huge .csv file in chunks, with filters, but my code only reads part of the file (20 million of 45 million rows).

I also tried data.table, but without success.

library(dplyr)

arq_grande <- file("cnpj_dados_cadastrais_pj.csv", "r")
tam_chunk <- 5000
df1 <- read.csv(arq_grande, nrows = 10, header = T, sep = "#", dec = ".")
for(i in 1:ncol(df1)){df1[,i] <- df1[,i] %>% iconv(from = 'UTF-8', to = 'latin1')}
df_filtrado <- df1 %>% filter(codigo_natureza_juridica == c("2143","2330")) %>%  select(cnpj,everything())
write.table(df_filtrado, "/cnpj_dados_cadastrais_pj_filtrado_coop.csv", row.names = F, sep = "#", dec = ".")
names(df1)
nrow <- 1
totalRows <- 0

repeat {
  df <- read.csv(arq_grande, header=FALSE, sep="#", col.names = names(df1), nrows = tam_chunk)
  for(i in 1:ncol(df)){df[,i] <- df[,i] %>% iconv(from = 'UTF-8', to = 'latin1')}
  nRow = nrow(df)
  totalRows <- totalRows + nRow
  cat("Lendo", nrow(df), "linhas, total lido", totalRows, "\n")
  if (nrow(df) == 0)
    break

  df_filtrado <- df %>% filter(codigo_natureza_juridica == c("2143","2330")) %>%  select(cnpj,everything())
  write.table(df_filtrado, "/cnpj_dados_cadastrais_pj_filtrado_coop.csv", append = T, col.names = F, row.names = F, sep = "#", dec = ".")
}
close(arq_grande)

I saw other examples here, but nothing worked. Sorry, I'm new to this kind of data.

I just want to read all lines of my .csv.

RxT

2 Answers


You can read a csv file in chunks with readr (read_csv, or read_delim for other separators such as `#`) using the skip and n_max arguments: skip is the number of lines to skip before reading, and n_max is the maximum number of rows to read afterwards.

library("readr")

# Example uses `#` as the separator
file <- "
lineno#X#Y#Z
1#a#b#c
2#d#e#f
3#g#h#i
4#j#k#l
5#m#n#o
6#p#q#r
7#s#t#u
8#v#w#
9#x#y#z
"

# Increase the chunk size appropriately
chunk_size <- 3

# Assumption: There is a header on the first line
# but we don't know what it is.
col_names <- TRUE
line_num <- 1

while (TRUE) {
  chunk <- read_delim(
    file, "#",
    skip = line_num,
    n_max = chunk_size,
    # On the first iteration, col_names is TRUE
    # so the first line "lineno#X#Y#Z" is read as the header.
    # On any subsequent iteration, col_names is a character vector
    # of the actual column names
    col_names = col_names
  )

  # If the chunk has no rows, we have reached the end of the file
  if (!nrow(chunk)) {
    break
  }

  # Do something with the chunk of data
  print(chunk)

  # Update `col_names` so that it equals the actual column names
  col_names <- colnames(chunk)

  # Move to the next chunk. Add 1 for the header.
  line_num <- line_num + chunk_size + (line_num == 1)
}
#> # A tibble: 3 x 4
#>   lineno X     Y     Z    
#>    <dbl> <chr> <chr> <chr>
#> 1      1 a     b     c    
#> 2      2 d     e     f    
#> 3      3 g     h     i    
#> # A tibble: 3 x 4
#>   lineno X     Y     Z    
#>    <dbl> <chr> <chr> <chr>
#> 1      4 j     k     l    
#> 2      5 m     n     o    
#> 3      6 p     q     r    
#> # A tibble: 3 x 4
#>   lineno X     Y     Z    
#>    <dbl> <chr> <chr> <chr>
#> 1      7 s     t     u    
#> 2      8 v     w     <NA> 
#> 3      9 x     y     z

Created on 2019-10-31 by the reprex package (v0.3.0)
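
For the file in the question, the same loop can be adapted to filter each chunk and append the result to an output file. This is only a sketch: the `#` separator, the column names codigo_natureza_juridica and cnpj, and the filter values are taken from the question, and the chunk size is an arbitrary guess to be tuned to the available memory.

library(readr)
library(dplyr)

in_file  <- "cnpj_dados_cadastrais_pj.csv"
out_file <- "cnpj_dados_cadastrais_pj_filtrado_coop.csv"

chunk_size <- 500000   # arbitrary; tune to the available memory
col_names <- TRUE
line_num <- 0

repeat {
  chunk <- read_delim(
    in_file, "#",
    skip = line_num,
    n_max = chunk_size,
    col_names = col_names,
    # Read everything as character so type guessing cannot change between chunks
    col_types = cols(.default = "c")
  )
  if (!nrow(chunk)) break

  chunk %>%
    # `%in%` matches either value on every row; `==` would recycle the vector
    filter(codigo_natureza_juridica %in% c("2143", "2330")) %>%
    select(cnpj, everything()) %>%
    write_delim(out_file, delim = "#", append = line_num > 0)

  col_names <- colnames(chunk)
  # Move to the next chunk. Add 1 for the header on the first pass.
  line_num <- line_num + chunk_size + (line_num == 0)
}

Because write_delim uses col_names = !append by default, only the first chunk writes the header row; every later chunk is appended as data.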

dipetkov
  • Thanks! But I'm having some problems with this. The `skip = line_num` doesn't let me find the column names. – RxT Oct 29 '19 at 16:26
  • What do you mean by finding the column names? Does your csv have the usual structure (header on the first row)? If not, it would help to explain what the structure is (other than that the file has 45 million rows). Or better, show an example. – dipetkov Oct 30 '19 at 02:08
  • Yes, the csv does! But when I try to apply `df %>% filter(codigo_natureza_juridica == c("2143","2330"))`, I can't find "codigo_natureza_juridica". – RxT Oct 30 '19 at 14:28
  • Updated the example, so that the header on the first line is used for each chunk, not just the first chunk. – dipetkov Oct 30 '19 at 16:13
  • Thanks! Now the problem is the separator "#". I tried read_delim but it did not work. – RxT Oct 30 '19 at 22:12
  • Then your csv might be malformed. It is easy to check that the example still works if you replace the commas with `#`s and use `read_delim(file, "#", ...)` instead of `read_csv(file, ...)`. – dipetkov Oct 30 '19 at 23:58

You can use the fread function from the data.table package to load large .csv files quickly.

Below is an example:

library(data.table)

n = 1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
                 b=sample(1:1000,n,replace=TRUE),
                 c=rnorm(n),
                 d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
                 e=rnorm(n),
                 f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]

write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):", round(file.info("test.csv")$size/1024^2),"\n")
# 50 MB (1e6 rows x 6 columns)

system.time(DF1 <-read.csv("test.csv",stringsAsFactors=FALSE))
# 60 sec (first time in fresh R session)

system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
# 30 sec (immediate repeat is faster, varies)

system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",
    stringsAsFactors=FALSE,comment.char="",nrows=n,
    colClasses=c("integer","integer","numeric",
                 "character","numeric","integer")))

Now using fread from data.table

require(data.table)

system.time(DT <- fread("test.csv"))
  #  3 sec (faster and friendlier)

The same effect is seen when you scale your data up to gigabytes.

Below is the documentation of the fread function:

https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/fread
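
If fread runs out of memory, one option is to read only the columns needed for the filter via its `select` argument. A minimal sketch, assuming the `#` separator and the column names from the question:

require(data.table)

dt <- fread(
  "cnpj_dados_cadastrais_pj.csv",
  sep = "#",
  # Read only the columns used downstream to reduce memory usage
  select = c("cnpj", "codigo_natureza_juridica"),
  # Keep cnpj as character so leading zeros are not lost
  colClasses = list(character = "cnpj")
)

dt_filtrado <- dt[codigo_natureza_juridica %in% c("2143", "2330")]

This is a trade-off: the result keeps only the two selected columns, but a 45-million-row file is far more likely to fit in memory this way.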

ashwin agrawal
  • Thank you! But look: `system.time(DT <- fread("C:/cnpj/cnpj_dados_cadastrais_pj.csv", sep='#'))` gives `Error: cannot allocate vector of size 316.7 Mb`. – RxT Oct 29 '19 at 16:40
  • I tried to increase `memory.limit()`, but it is still very slow. I'll tell you if it works. – RxT Oct 29 '19 at 17:27
  • Hi, sorry, but it did not work. I think my .csv is too large even for fread. – RxT Oct 30 '19 at 14:29