1

I've been trying unsuccessfully for the past two days to convert a large CSV (9 GB) into XDF format using the rxImport function.

The process starts off well, with R Server reading in the data chunk by chunk, but after a few minutes it slows to a crawl, and after around 6 hours it fails completely, with Windows stopping the server because it has run out of RAM.

The code I'm using is as follows:

pd_in_file <- RxTextData("cca_pd_entity.csv", delimiter = ",")  # file to import
pd_out_file <- file.path("cca_pd_entity.xdf")                   # desired output file
pd_data <- rxImport(inData = pd_in_file, outFile = pd_out_file,
                    stringsAsFactors = TRUE, overwrite = TRUE)

I'm running Microsoft R Server version 9.0.1 on a Windows 7 machine with 16 GB of RAM.

Thanks

  • See if setting the `colInfo` argument helps – Hong Ooi Jun 03 '17 at 10:42
  • Thank you for the suggestion, I'll give that a try. I have nearly 300 columns of data, so is it correct to assume I can import a subset of the data and use the rxGetVarInfo command to extract the column information and pass that to the rxImport command, instead of having to manually specify each column separately? – Serban Dragne Jun 05 '17 at 09:20
  • It worked!!! Arg this is so awesome :D Thank you thank you thank you – Serban Dragne Jun 05 '17 at 15:09

1 Answer

2

It's been solved using Hong Ooi's recommendation to set the colInfo argument in RxTextData. I'm not sure why it made such a big difference, but it converted the entire 9 GB dataset in less than 2 minutes, where previously the import failed completely after several hours.
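For reference, a minimal sketch of that approach, assuming the same file names as in the question (the 1,000-row sample size is just illustrative):

    # Read a small sample first so the column types can be inferred
    pd_in_file <- RxTextData("cca_pd_entity.csv", delimiter = ",")
    sample_data <- rxImport(inData = pd_in_file, numRows = 1000,
                            stringsAsFactors = TRUE)
    ColumnInfo <- rxGetVarInfo(sample_data)

    # Pass the inferred column info so the full import doesn't have to
    # re-detect types while streaming the 9 GB file
    pd_in_file <- RxTextData("cca_pd_entity.csv", delimiter = ",",
                             colInfo = ColumnInfo)
    pd_data <- rxImport(inData = pd_in_file,
                        outFile = "cca_pd_entity.xdf",
                        overwrite = TRUE)

One caveat: factor levels inferred from a small sample may not cover every value in the full file, so it's worth checking the result with rxGetVarInfo on the XDF after the import.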

  • what did you set the colInfo argument to? – gibbz00 Aug 08 '17 at 18:47
  • 1
    @gibbz00 - I imported a sample of data from the CSV and then assigned ColumnInfo <- rxGetVarInfo(sample_data). Then, when importing with rxDataStep (or rxImport if you want), I passed "colInfo = ColumnInfo". That worked very well. – Serban Dragne Aug 09 '17 at 18:40