My task is loading a big csv file (9 gb) and extracting some specific rows and save theses specific rows in a new csv file. This proceeding I’m doing with a function. Therefore in my console I load my function with the source()
command and then execute the function with myfun()
.
Exceeds the csv file 6 gb my computer hang-up.
Solution approaches that I tried with no success:
How can I remove all objects but one from the workspace in R?
- Because I have a function my variables aren’t in my workspace and so I can’t remove them…
The gc() command
- Here on stackoverflow are some various post to this topic
- Recently I used gc() to release my memory – and it works
- But now I need my three variables (start, quantity and l) – means not all variables are allowed to delete
- In the fifth or sixth for-loop the computer hang-up
- Note: without the gc() command I only reach the second or third for-loop
- The gc() command has an effect
An additional note to my csv file:
- It has 6 columns
- I need to extract every fortieth or hundredth row
- The distance of rows I have to detect
- And I’m not sure if the distance of rows is constant over the whole csv file
My pc is a win7 machine with 64 bit and 16 mb internal memory.
And now my question: Is there a way to avoid hang-ups? Maybe a better position of gc() in my code or some other arguments for gc()?
When you need further information please comment – I will edit my post.
Many thanks in advance!
Now my code:
library(data.table) # because of the fread() command
myfun=function () {
start=i
quantity=2.2*10^7 # this is the number of rows and this amount is about 1.2 gb of the csv file
for (l in 1:12) { # the 12 is guessed… perhaps here exists also a better solution
DT=fread("C:\\user1\\AllRows.csv",sep = ";",stringsAsFactors=FALSE,drop=7,header=FALSE,nrows= quantity,skip=start,data.table=FALSE)
colnames(DT)=c("col_1"," col_2"," col_3"," col_4"," col_5"," col_6")
# Detect the distance of rows and extract the corresponding rows
# and save it in data.df
# and now data.df will be saved
file=file.path("C:\\user1\\ExtractedRows.csv"))
if (l==1) {write.table(data.df,file=file,sep=";",dec=",",row.names=FALSE,col.names= c("col_1"," col_2"," col_3"," col_4"," col_5"," col_6"),append=FALSE)}
if (l!=1) {write.table(data.df,file=file,sep=";",dec=",",row.names=FALSE,col.names=FALSE,append=TRUE)}
# release the internal memory
gc(reset=T)
# incrementing start
start = start + quantity
} # end of for loop
} # end of function