
I am using Python for some data analysis. I have two tables: the first (let's call it 'A') has 10 million rows and 10 columns, and the second ('B') has 73 million rows and 2 columns. They have one column of common IDs, and I want to intersect the two tables based on that column. In particular, I want the inner join of the tables.

I could not load table B into memory as a pandas DataFrame to use the normal merge function of pandas. I tried reading the file of table B in chunks, intersecting each chunk with A, and then concatenating these intersections (the output of the inner joins). This is OK in terms of speed, but every now and then it spits out a segmentation fault ... not so great. The error is difficult to reproduce, but it happens on two different machines (Mac OS X v10.6 (Snow Leopard) and UNIX, Red Hat Linux).
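Schematically, the chunked approach was something like this (file names, the chunk size, and the 'id' join column are placeholders, not my real ones):

import pandas as pd

a = pd.read_csv('table_a.csv')               # table A fits in memory

pieces = []
for chunk in pd.read_csv('table_b.csv', chunksize=1000000):
    # inner join of this chunk of B against A on the shared id column
    pieces.append(a.merge(chunk, on='id', how='inner'))

result = pd.concat(pieces, ignore_index=True)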

I finally tried the combination of pandas and PyTables, by writing table B to disk and then iterating over table A and selecting the matching rows from table B. This last option works, but it is slow. Table B in PyTables has already been indexed by default.
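Schematically, again with placeholder names; this assumes the id column was written to the store as a queryable data column, and I am not certain this is the best where syntax for long id lists:

import pandas as pd

a = pd.read_csv('table_a.csv')               # table A fits in memory

pieces = []
with pd.HDFStore('B.h5') as b_store:
    # iterate over A and select the matching rows from B;
    # 'id=ids' acts as an "in" filter on the queryable 'id' column,
    # though with long id lists it may not run as an in-kernel search
    for start in range(0, len(a), 100000):
        ids = a['id'].iloc[start:start + 100000].unique().tolist()
        pieces.append(b_store.select('df', where='id=ids'))
b_matches = pd.concat(pieces, ignore_index=True)

result = a.merge(b_matches, on='id', how='inner')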

How do I tackle this problem?

user2027051
  • Why can't you just compute this in a database? – eLRuLL Jan 30 '13 at 22:24
  • I am just not very familiar with databases and SQL. PyTables seems to be quite fast in comparison to SQLite and others, so I thought it would be a good way forward. I'll try SQLite and see how it goes. – user2027051 Jan 31 '13 at 21:46
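A minimal sketch of that database route (file and column names are assumptions): stream both tables into SQLite, index the join column, and let the database do the inner join.

import sqlite3
import pandas as pd

con = sqlite3.connect('tables.db')

# stream table B into SQLite in chunks so it never has to fit in memory
for chunk in pd.read_csv('table_b.csv', chunksize=1000000):
    chunk.to_sql('b', con, if_exists='append', index=False)
pd.read_csv('table_a.csv').to_sql('a', con, if_exists='replace', index=False)

# an index on the join column makes the inner join far cheaper
con.execute('CREATE INDEX IF NOT EXISTS idx_b_id ON b (id)')

result = pd.read_sql_query('SELECT * FROM a INNER JOIN b USING (id)', con)
con.close()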

1 Answer


This is a little pseudo-codish, but I think it should be quite fast.

It is a straightforward disk-based merge, with all tables on disk. The key is that you are not doing selection per se, just indexing into the table via start/stop, which is quite fast.

Selecting the rows in B that meet a criterion (using A's IDs) won't be very fast, because I think it might be bringing the data into Python space rather than doing an in-kernel search. (I am not sure, but you may want to investigate further in the in-kernel optimization section on pytables.org; there is a way to tell whether a query will be in-kernel or not.)

Also, if you are up to it, this is a very parallel problem (just don't write the results to the same file from multiple processes; PyTables is not write-safe for that).

See this answer for a comment on how doing a join operation will actually be an 'inner' join.

For your merge_a_b operation I think you can use a standard pandas join which is quite efficient (when in-memory).

One other option (depending on how 'big' A is) might be to separate A into two pieces that are indexed the same, using a smaller table (maybe a single column) as the first piece; instead of storing the merge results per se, store the row index, and later pull out the data you need (a bit like using an indexer and take); a sketch of this appears after the code below. See http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries

from pandas import HDFStore

A = HDFStore('A.h5')
B = HDFStore('B.h5')
store = HDFStore('result.h5')   # this is your result store

nrows_a = A.get_storer('df').nrows
nrows_b = B.get_storer('df').nrows
a_chunk_size = 1000000
b_chunk_size = 1000000

def merge_a_b(a, b):
    # Return an operation on the passed frames, a and b.
    # It could be a merge, join, concat, or another operation that
    # results in a single frame, e.g. an inner join on the shared
    # id column ('id' here is a placeholder for your column name).
    return a.merge(b, on='id', how='inner')

for i in range(int(nrows_a / a_chunk_size) + 1):

    a_start_i = i * a_chunk_size
    a_stop_i  = min((i + 1) * a_chunk_size, nrows_a)

    a = A.select('df', start=a_start_i, stop=a_stop_i)

    for j in range(int(nrows_b / b_chunk_size) + 1):

        b_start_i = j * b_chunk_size
        b_stop_i = min((j + 1) * b_chunk_size, nrows_b)

        b = B.select('df', start=b_start_i, stop=b_stop_i)

        # join this chunk of A with this chunk of B and append
        # any matches to the result store
        m = merge_a_b(a, b)

        if len(m):
            store.append('df_result', m)
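For the row-index variant mentioned above, a loose sketch (assuming the shared column was stored as a queryable data column called 'id'; keys and file names are placeholders): find the coordinates of the matching rows first, then pull the actual rows later, much like an indexer followed by take.

import pandas as pd

a_ids = pd.read_hdf('A.h5', 'df')['id'].unique().tolist()

with pd.HDFStore('B.h5') as b_store:
    # store only the coordinates (row numbers) of B's matching rows;
    # with millions of ids this may be evaluated as a filter rather
    # than an in-kernel search
    coords = b_store.select_as_coordinates('df', 'id=a_ids')
    # later, fetch just those rows
    matching_b = b_store.select('df', where=coords)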
Jeff
  • Thanks for your help, Jeff. I cannot use the multiple-table query options of pandas because they require tables A and B to have the same number of rows. – user2027051 Jan 31 '13 at 21:38
  • I am avoiding the join/concat/merge options of pandas since they all spit out a "segmentation fault: 11" every now and then. I struggle to reproduce this error: a) it is not on a particular row of either table, and b) it happens on different computers/OSes. It does not happen all the time, but it still means I cannot use that route. So far I am making use of indexes to fill in a column of A with values from B, which is probably like an inefficient "join". Any other ideas are welcome. – user2027051 Jan 31 '13 at 21:45
  • The multiple-table query is only meant for your A table, and only if you have, say, lots of columns that are only needed at the end of the computation. Did you give the above algorithm a try? I am interested to see what kind of performance you get; also, how big are the files on disk? – Jeff Jan 31 '13 at 22:50
  • Hi, table A does not have lots of columns, only 8. I still need to get the info from table B, which is the bottleneck. The files on disk are about 500 MB for A and 1.5 GB for B. I'll try your approach and see how fast it is. – user2027051 Feb 01 '13 at 15:18
  • I finally used a solution very similar to yours. Table A can be held in memory. Tables A and B are indexed by the column of interest. I fetch 1 million rows at a time from the HDF5 table B and intersect; it is actually quite fast (see the sketch below). If table A could not be loaded safely into memory, then your code would have been the solution. Thanks! – user2027051 Feb 13 '13 at 22:22
  • @user2027051: I am searching for this solution. Can you share any sample code? – avinash Aug 15 '19 at 19:04
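For reference, a rough sketch of the approach user2027051 describes above (file names, the 'df' key, and the 'id' join column are assumptions, not the original code): keep A in memory, fetch B from the HDF5 store one million rows at a time, and intersect each chunk.

import pandas as pd

a = pd.read_hdf('A.h5', 'df')                # table A fits in memory

pieces = []
with pd.HDFStore('B.h5') as b_store:
    nrows_b = b_store.get_storer('df').nrows
    chunk_size = 1000000
    # fetch B one million rows at a time and intersect with A
    for start in range(0, nrows_b, chunk_size):
        b = b_store.select('df', start=start, stop=start + chunk_size)
        m = a.merge(b, on='id', how='inner')
        if len(m):
            pieces.append(m)

result = pd.concat(pieces, ignore_index=True)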