14

I bought Kibot's stock data and it is enormous. I have about 125,000,000 rows to load (1000 stocks * 125k rows/stock [1-minute bar data since 2010-01-01], each stock in a CSV file whose fields are Date,Time,Open,High,Low,Close,Volume). I'm totally new to python (I chose it because it's free and well-supported by a community) and I chose SQLite to store the data because of python's built-in support for it. (And I know the SQL language very well. SQLiteStudio is a gem of a free program.)

My loader program is working well, but is getting slower. The SQLite db is about 6 GB and it's only halfway loaded. I'm getting about 500k rows/hour loaded, using INSERT statements and committing the transaction after each stock (approx. 125k rows).

So here's the question: is PyTables substantially faster than SQLite, making it worth the effort to learn? (And since I'm in learning mode, feel free to suggest alternatives to these two.) One thing that bothers me about PyTables is that the free version is really bare bones, almost like saving a binary file: no "where clause" functions or indexing, so you wind up scanning for the rows you need.

After I get the data loaded, I'm going to be doing statistical analysis (rolling regression & correlation, etc.) using something based on NumPy: Timeseries, larry, pandas, or a scikit. I haven't chosen the analysis package yet, so if you have a recommendation, and that recommendation is best used with either PyTables or pandas (or whatever), please factor that into your response.
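As an aside on that analysis step, a rolling correlation is a one-liner once the bars are in a pandas structure. Below is a minimal sketch with synthetic data standing in for two tickers' close prices; the rolling API shown is the modern pandas one, which postdates this question.

    import numpy as np
    import pandas as pd

    # Synthetic stand-ins for two tickers' 1-minute closes (real data would
    # come from the MinuteBarPrices table).
    idx = pd.date_range("2010-01-04 09:31", periods=390, freq="min")
    rng = np.random.default_rng(0)
    a = pd.Series(100 + rng.standard_normal(390).cumsum(), index=idx)
    b = pd.Series(50 + rng.standard_normal(390).cumsum(), index=idx)

    # 30-bar rolling correlation between the two series.
    print(a.rolling(window=30).corr(b).tail())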

(For @John) Python 2.6;
Windows XP SP3 32-bit;
Manufactured strings used as INSERT statements;
Memory usage is rock solid at 750 MB of the 2 GB physical memory;
CPU usage is 10% +/- 5%;
Totally i/o bound (disk is always crunching).
DB schema:

create table MinuteBarPrices (
    SopDate smalldatetime not null,
    Ticker  char( 5 )     not null,
    Open    real,
    High    real,
    Low     real,
    Close   real          not null,
    Volume  int,
    primary key ( SopDate, Ticker )
);
create unique index MinuteBarPrices_IE1 on MinuteBarPrices (
    Ticker,
    SopDate
);
pklaus
jdmarino
  • Have you profiled loading a slightly smaller data set, to see if the bottleneck is reading the CSV or adding to the database? – Thomas K May 21 '11 at 18:57
  • The timing I quoted is just for the inserts. Not the csv reading, nor the commit. – jdmarino May 21 '11 at 23:42
  • You should say what version of Python, what version of what operating system, and show your database schema and your code. Are you using a parameterised INSERT statement or are you manufacturing complete INSERT statements yourself? Are you monitoring memory usage? Any evidence of swapping? Any evidence of lots of free memory? Are you CPU bound or IO bound? – John Machin May 22 '11 at 00:11
  • For that volume of information, using another DBMS might be a good idea. Have you tried MySQL? Is there any specific reason to stick with SQLite? – Stefano Sanfilippo Nov 16 '14 at 20:20

2 Answers

10
  1. Back in 2003, Francesc Alted, the author of PyTables, wrote a scientific paper comparing PyTables and SQLite. It shows that PyTables is usually faster, but not always.

  2. On your point that PyTables feels 'bare bones': I would say h5py is the bare-bones way of accessing HDF5 from Python; PyTables adds all kinds of extra stuff, like querying and indexing, which HDF5 doesn't natively have.

Example of querying:

 example_table = h5file.root.spamfolder.hamtable
 some_ndarray = example_table.readWhere('(gender == "male") & (age > 40)')

Note that PyTables Pro, which has even fancier options, has just ceased to exist as a separate product; the Pro features will be free from now on. This means yet more options to play with.

dirkjot
4

Suggestions:

  1. You have 1 GB of memory that's not being used. Try using the cache_size pragma (see the SQLite PRAGMA docs). Other pragmas of interest: synchronous and page_size ... maybe too late for the latter.

  2. Sometimes it is faster to load the base table without any index(es), then create the index(es).

  3. "Manufactured strings used as INSERT statements" or any other SQL statements is a bad idea, both speed wise and security wise (google("SQL injection attack")). Get out of the habit now. Use parameterised SQL statements.

John Machin
  • Follow up: it's all about the indexes (and not the Python code). My primary key was the tuple (datetime, ticker), but I was loading the data in (ticker, datetime) order, because that's how it comes. Why the tuple design? My background is in cross-sections: I always queried all the stocks for a given date. This project, however, calls for querying one stock across all dates, so (ticker, datetime) is obviously better. – jdmarino Jun 09 '11 at 19:28
  • By changing the primary key to the natural one of the data and my future queries, the load time went from linear in the number of tickers loaded (e.g. the first ticker loaded in 50 seconds, the 100th ticker loaded 20 minutes) to constant time per ticker (50 seconds for the 1st and 100th). – jdmarino Jun 09 '11 at 19:30
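Concretely, the fix described in these follow-up comments amounts to flipping the key order so it matches both the load order and the per-ticker query pattern. A hypothetical sketch of the revised DDL (same columns as the schema above, executed from Python against a made-up database file):

    import sqlite3

    conn = sqlite3.connect("prices.db")  # hypothetical file name
    # Same columns as before; only the primary key order changes, so inserts
    # arrive in key order and per-ticker range scans stay contiguous.
    conn.execute("""
        create table if not exists MinuteBarPrices (
            SopDate smalldatetime not null,
            Ticker  char( 5 )     not null,
            Open    real,
            High    real,
            Low     real,
            Close   real          not null,
            Volume  int,
            primary key ( Ticker, SopDate )
        )
    """)
    conn.close()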