
My task is to load a big CSV file (9 GB), extract some specific rows, and save these rows to a new CSV file. I do this with a function, so in my console I load the function with the source() command and then execute it with myfun().

As soon as the CSV file exceeds 6 GB, my computer hangs.

Approaches that I have tried without success:

  • How can I remove all objects but one from the workspace in R?

    • Because my code runs inside a function, the variables aren’t in my workspace, so I can’t remove them this way…
  • The gc() command

    • There are various posts here on Stack Overflow about this topic
    • Recently I used gc() to release my memory – and it works
    • But now I need my three variables (start, quantity and l), so not all variables may be deleted (see the sketch after this list)
    • With gc() the computer hangs in the fifth or sixth iteration of the for loop
    • Note: without the gc() command I only reach the second or third iteration
      • So the gc() command does have an effect
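
To make this concrete, here is a minimal toy sketch (the generated data frame only stands in for the fread() chunk, and all names are placeholders): inside the function I can rm() just the big per-iteration objects, keep start, quantity and l, and only then call gc().

    myfun_toy = function() {
        start = 0
        quantity = 1e6
        for (l in 1:3) {
            DT = data.frame(x = rnorm(quantity))   # stand-in for the fread() chunk
            data.df = DT[seq(1, nrow(DT), by = 40), , drop = FALSE]
            # ... write data.df here ...
            rm(DT, data.df)           # remove the big local objects first
            gc()                      # so gc() can actually reclaim their memory
            start = start + quantity  # start, quantity and l are untouched
        }
    }
    myfun_toy()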

An additional note to my csv file:

  • It has 6 columns
  • I need to extract every fortieth or hundredth row
    • I first have to detect this row spacing myself (see the sketch after this list)
    • And I’m not sure whether the spacing is constant over the whole csv file
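
For illustration, a sketch of the extraction step under the assumption that the spacing n is constant within a chunk (the function name and n = 40 are just placeholders; detecting n is the part I still have to do):

    # Sketch: keep every n-th row of a chunk, assuming a constant spacing n.
    extract_every_nth = function(DT, n = 40) {
        DT[seq(1, nrow(DT), by = n), , drop = FALSE]
    }

    # Example with a toy chunk:
    toy = data.frame(col_1 = 1:200, col_2 = rnorm(200))
    extract_every_nth(toy, n = 40)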

My PC is a 64-bit Windows 7 machine with 16 GB of RAM.

And now my question: is there a way to avoid the hang-ups? Maybe a better position for gc() in my code, or other arguments to gc()?

If you need further information, please comment and I will edit my post.

Many thanks in advance!

Now my code:

    library(data.table)  # because of the fread() command

    myfun=function () {

    start=0            # number of rows already read; used as fread()'s skip argument
    quantity=2.2*10^7  # number of rows per chunk; about 1.2 GB of the csv file
    for (l in 1:12) {  # the 12 is guessed… perhaps there is also a better solution here

        DT=fread("C:\\user1\\AllRows.csv",sep = ";",stringsAsFactors=FALSE,drop=7,header=FALSE,nrows= quantity,skip=start,data.table=FALSE)
        colnames(DT)=c("col_1","col_2","col_3","col_4","col_5","col_6")

        # Detect the distance of rows and extract the corresponding rows
        # and save it in data.df

        # and now data.df will be saved
        file=file.path("C:\\user1\\ExtractedRows.csv")
        if (l==1) {write.table(data.df,file=file,sep=";",dec=",",row.names=FALSE,col.names=c("col_1","col_2","col_3","col_4","col_5","col_6"),append=FALSE)}
        if (l!=1) {write.table(data.df,file=file,sep=";",dec=",",row.names=FALSE,col.names=FALSE,append=TRUE)}

        # release the internal memory
        gc(reset=T)

        # incrementing start
        start = start + quantity

    }  # end of for loop

    }  # end of function
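
And this is how I call it from the console (the file name myfun.R is just an assumed name for the script that holds the function):

    source("C:\\user1\\myfun.R")   # load the function
    myfun()                        # run the extraction
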
tueftla

1 Answer


When working with data that is too big to fit in memory, you can use SQLite to hold the data on disk and then query out only what you need.

Here is a reference to help you get started: https://www.r-bloggers.com/r-and-sqlite-part-1/

This approach is probably better suited to your problem than the method you outlined in your question.
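
As a rough sketch (not tested against your data; the database file, the table name all_rows, the 12-chunk loop mirroring your question, and the every-fortieth-row query are all placeholder assumptions), the workflow with the RSQLite package could look like this:

    library(DBI)
    library(RSQLite)
    library(data.table)

    con = dbConnect(RSQLite::SQLite(), "C:\\user1\\AllRows.sqlite")

    # Import the big csv in chunks so it never has to fit in memory at once.
    quantity = 2.2*10^7
    start = 0
    for (l in 1:12) {                       # 12 chunks, guessed as in the question
        DT = tryCatch(
            fread("C:\\user1\\AllRows.csv", sep = ";", header = FALSE, drop = 7,
                  nrows = quantity, skip = start, data.table = FALSE),
            error = function(e) NULL)       # stop once we have read past the end
        if (is.null(DT) || nrow(DT) == 0) break
        dbWriteTable(con, "all_rows", DT, overwrite = (l == 1), append = (l > 1))
        start = start + quantity
        rm(DT); gc()
    }

    # Query out e.g. every fortieth row; rowid is SQLite's built-in row number.
    extracted = dbGetQuery(con, "SELECT * FROM all_rows WHERE rowid % 40 = 0")
    write.table(extracted, "C:\\user1\\ExtractedRows.csv",
                sep = ";", dec = ",", row.names = FALSE)

    dbDisconnect(con)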

Ian Wesley