
How can I use the R packages zoo or xts with very large data sets (~100 GB)? I know there are packages such as bigrf, ff, and bigmemory that can deal with data of this size, but they only offer a limited set of commands, they don't have the functions of zoo or xts, and I don't know how to make zoo or xts use them.

I've also seen some other, database-related options, such as sqldf, HadoopStreaming, RHadoop, and some of the tools used by Revolution R. What do you advise? Anything else?

I just want to aggregate the series, clean them, and run some cointegration tests and plots. I would rather not have to code and implement new functions for every operation I need, working on small pieces of data every time.

Added: I'm on Windows

Joshua Ulrich
skan
  • This is not a quantitative finance question. I'm sending this to Stack Overflow. – chrisaycock Mar 27 '13 at 03:56
  • @skan You can have a look at the `mmap` package, which was created by Jeff Ryan (the author of xts) – CHP Mar 27 '13 at 06:54
  • Also see this post http://r.789695.n4.nabble.com/xts-timeseries-as-shared-memory-objects-with-bigmemory-package-tp3385186p3385252.html – CHP Mar 27 '13 at 07:07
  • But I'm using R for Windows, and mmap works on Linux. So do you think I cannot use packages such as ff, RevoScaleR, or RHIPE with zoo, or to perform cointegration or wavelet analysis? – skan Mar 27 '13 at 10:00
  • The mmap package works on Windows. Did you even look at the package? – Joshua Ulrich Mar 27 '13 at 16:41
  • Yes I did. And I read that mmap is a typical Unix function. Wikipedia says there is something similar for Windows called MapViewOfFile, but only for 32-bit. Anyway, MapViewOfFile doesn't seem to be a program but rather a call in an API related to internal memory management. – skan Mar 27 '13 at 23:55
  • 2
  • The mmap package uses `mmap` on unix-alikes and `MapViewOfFile` on Windows. You don't need to know any of that to use the package, which is why I asked if you actually looked at (i.e. tried) the package. There's a vignette with examples and Jeff has several presentations floating around on the web. – Joshua Ulrich Apr 02 '13 at 01:37
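
For later readers, a minimal sketch of what using the `mmap` package looks like. The file name and the real64 storage mode below are assumptions for illustration; they have to match how the binary file was actually written.

```
library(mmap)

# Memory-map a flat binary file of doubles. "prices.bin" and real64() are
# placeholders -- they must correspond to the file's actual layout.
m <- mmap("prices.bin", mode = real64())

m[1:10]      # index it like an ordinary vector; only the touched pages are read

munmap(m)    # release the mapping when done
```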

1 Answer


I have had a similar problem (albeit I was only playing with 9-10 GB). My experience is that there is no way R can handle that much data on its own, especially since your dataset appears to contain time series data.

If your dataset contains a lot of zeros, you may be able to handle it using sparse matrices - see the Matrix package ( http://cran.r-project.org/web/packages/Matrix/index.html ); this tutorial may also come in handy ( http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/ ).
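
For example, a minimal sketch with the Matrix package (the indices, values, and dimensions below are made up purely for illustration):

```
library(Matrix)

# Triplet form: row indices, column indices, and the non-zero values
i <- c(1, 3, 5)
j <- c(2, 4, 6)
x <- c(1.5, 2.0, 3.7)

m <- sparseMatrix(i = i, j = j, x = x, dims = c(10000, 10000))

object.size(m)                            # only the non-zero entries are stored
dense_slice <- as.matrix(m[1:10, 1:10])   # densify just the piece you need
```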

I used PostgreSQL - the relevant R package is RPostgreSQL ( http://cran.r-project.org/web/packages/RPostgreSQL/index.html ). It allows you to query your PostgreSQL database using SQL syntax, and the results come back into R as a data frame. It may be slow (depending on the complexity of your query), but it is robust and can be handy for data aggregation.
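
A rough sketch of what that looks like; the connection details and the table/column names (prices, ts, price) are placeholders, not anything from your actual setup:

```
library(RPostgreSQL)
library(xts)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mydb", host = "localhost",
                 user = "me", password = "secret")

# Let the database do the heavy lifting: only the daily aggregate comes back to R
daily <- dbGetQuery(con, "
  SELECT date_trunc('day', ts) AS day, avg(price) AS avg_price
  FROM prices
  GROUP BY 1
  ORDER BY 1")

dbDisconnect(con)

# The result is an ordinary data frame, so it feeds straight into xts/zoo
daily_xts <- xts(daily$avg_price, order.by = as.Date(daily$day))
```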

Drawback: you would need to upload the data into the database first. Your raw data needs to be clean and saved in some readable format (txt/csv). This is likely to be the biggest issue if your data is not already in a sensible format. Uploading "well-behaved" data into the DB, however, is easy ( see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV file data into a PostgreSQL table? ).
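
A hedged sketch of the upload step from R, using the same placeholder connection as above; the table name and file paths are made up:

```
library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mydb", host = "localhost",
                 user = "me", password = "secret")

# Option 1: write a (chunk of a) data frame straight into a table
chunk <- read.csv("prices_part1.csv")
dbWriteTable(con, "prices", chunk, append = TRUE, row.names = FALSE)

# Option 2: for very large files, server-side COPY is much faster
# (the file must be readable by the PostgreSQL server process)
dbGetQuery(con, "COPY prices FROM '/path/to/prices.csv' WITH CSV HEADER")

dbDisconnect(con)
```

For a 100 GB raw file you would split it into chunks or rely on COPY rather than reading it all into R first.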

I would recommend using PostgreSQL or any other relational database for your task. I have not tried Hadoop, but using CouchDB nearly drove me round the bend. Stick with good old SQL.

Skif
  • Thanks. If anybody is still interested, there are some other ways: RevoScaleR could be an option, though it needs more functions added. Using Hadoop with RHadoop could be another option, though Hadoop MapReduce is quite complicated. – skan Sep 24 '13 at 12:01
  • The sparse matrix option sounds nice, Skif, but only for some cases. How can I use the database (for example SQLite) and perform a time aggregation without loading everything into memory? Would I need to use SQL joins instead of R's functions? – skan Sep 24 '13 at 12:06
  • Yes, using SQL joins and other SQL would be the best option. Perhaps I confused you - you can extract data from your PostgreSQL database into R one bit at a time. No need to download everything into R in one go. Say you have time-series data. One thing you can try is to load data into R one time period at a time and aggregate it that way. The alternative is to do all the aggregation through SQL queries. I used the first option in my work, but the second option should also be doable. – Skif Sep 25 '13 at 15:06
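
For anyone who lands here later, a rough sketch of the "one time period at a time" approach described in the comment above. The table/column names (prices, ts, price), the year range, and the connection details are placeholders:

```
library(RPostgreSQL)
library(zoo)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mydb", host = "localhost",
                 user = "me", password = "secret")

monthly <- list()
for (y in 2000:2012) {
  # Pull one year into R, aggregate it to monthly means, then discard the raw rows
  qry <- sprintf("SELECT ts, price FROM prices
                  WHERE ts >= '%d-01-01' AND ts < '%d-01-01'", y, y + 1)
  chunk <- dbGetQuery(con, qry)
  if (nrow(chunk) == 0) next
  z <- zoo(chunk$price, order.by = as.POSIXct(chunk$ts))
  monthly[[length(monthly) + 1]] <- aggregate(z, as.yearmon, mean)
}
dbDisconnect(con)

result <- do.call(c, monthly)   # one small zoo series covering the whole table
```

Only one year of raw data is ever held in memory; the aggregated result stays small regardless of how large the table is.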