
There seems to be a problem with the join_by function in numpy.lib.recfunctions when doing an outer join on multiple keys. The equivalent matplotlib.mlab function works correctly. The recfunctions version seems to mix up parts of the keys: I had two keys, 001258 and 001670, and recfunctions produced keys 001270 and 001658 in addition to 001258 and 001670. Has anyone run into this issue?

I have two text files, test.csv and test2.csv, that contain the following.

test.csv:

gvkey,fyr,ogpoilq,datadate,cusip  
001258,12,,03/31/2002,13916P209  
001258,12,,06/30/2002,13916P209  
001258,12,,09/30/2002,13916P209  
001258,12,31.0000,12/31/2002,13916P209  
001678,12,74968.0000,12/31/2003,037411105  
001678,12,,03/31/2004,037411105  
001678,12,,06/30/2004,037411105  
001678,12,,09/30/2004,037411105  
001678,12,84736.0000,12/31/2004,037411105  
001678,12,,03/31/2005,037411105  
001678,12,,06/30/2005,037411105  
001678,12,,09/30/2005,037411105  
001678,12,85434.0000,12/31/2005,037411105  
001678,12,,03/31/2006,037411105  
001678,12,,06/30/2006,037411105  
001678,12,,09/30/2006,037411105  
001678,12,81971.0000,12/31/2006,037411105  

test2.csv:

gvkey,datadate,fyearq,fqtr,ciderglq,cisecglq    
001258,12/31/2001,2001,4,,  
001258,03/31/2002,2002,1,,  
001258,06/30/2002,2002,2,,  
001258,09/30/2002,2002,3,,  
001258,12/31/2002,2002,4,,  
001258,03/31/2003,2003,1,,  
001258,06/30/2003,2003,2,,  
001678,03/31/2004,2004,1,,  
001678,06/30/2004,2004,2,,  
001678,09/30/2004,2004,3,,  
001678,12/31/2004,2004,4,,  
001678,03/31/2005,2005,1,-136.9970,0.0000  
001678,06/30/2005,2005,2,-7.8000,0.0000  
001678,09/30/2005,2005,3,-164.6470,0.0000  
001678,12/31/2005,2005,4,73.3180,0.0000  
001678,03/31/2006,2006,1,71.6100,0.0000  
001678,06/30/2006,2006,2,5.5850,0.0000  

The following code produces the correct and incorrect merged tables:

import datetime
import numpy as np
import numpy.lib.recfunctions as rf
import matplotlib.mlab as ml
date_converter = lambda x: datetime.date(int(x[-4:]), int(x[:2]), int(x[3:5]))
prod_df = np.genfromtxt("../data/test.csv", filling_values=np.nan, converters={3:date_converter}, dtype="S10, f8, O4", names="gvkey, prod, date", delimiter=",", usecols=(0,2,3), skip_header=1)        
hedge_df = np.genfromtxt("../data/test2.csv", filling_values=np.nan, converters={1:date_converter}, dtype="S10, O4, f8", names="gvkey, date, hedgepnl", delimiter=",", usecols=(0,1,4), skip_header=1)

correct_outer_merge = ml.rec_join(["gvkey", "date"], prod_df, hedge_df, "outer")
incorrect_outer_merge = rf.rec_join(["gvkey", "date"], prod_df, hedge_df, "outer")
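One way to judge which of the two results is right, independently of either library, is to compute the outer join by hand on a few rows. The following is only a minimal sketch: `load` and `outer_join` are hypothetical helpers (not part of numpy or matplotlib), the rows are a small subset copied from the two files above, and it assumes each (gvkey, date) pair occurs at most once per file.

```python
import csv
import datetime
import io

# A few rows copied from test.csv and test2.csv above.
PROD_CSV = """gvkey,fyr,ogpoilq,datadate,cusip
001258,12,,09/30/2002,13916P209
001258,12,31.0000,12/31/2002,13916P209
001678,12,84736.0000,12/31/2004,037411105
"""

HEDGE_CSV = """gvkey,datadate,fyearq,fqtr,ciderglq,cisecglq
001258,12/31/2002,2002,4,,
001678,12/31/2004,2004,4,,
001678,03/31/2005,2005,1,-136.9970,0.0000
"""


def parse_date(text):
    """Parse an MM/DD/YYYY string into a datetime.date."""
    return datetime.datetime.strptime(text, "%m/%d/%Y").date()


def to_float(text):
    """Convert a CSV cell to float, mapping empty cells to NaN."""
    return float(text) if text else float("nan")


def load(text, value_column):
    """Map (gvkey, date) -> value; gvkey stays a string so zeros are kept."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        (row["gvkey"], parse_date(row["datadate"])): to_float(row[value_column])
        for row in rows
    }


def outer_join(left, right):
    """Outer join two keyed dicts; a value missing on one side becomes NaN."""
    nan = float("nan")
    keys = sorted(set(left) | set(right))
    return [
        (gvkey, date, left.get((gvkey, date), nan), right.get((gvkey, date), nan))
        for gvkey, date in keys
    ]


prod = load(PROD_CSV, "ogpoilq")
hedge = load(HEDGE_CSV, "ciderglq")
merged = outer_join(prod, hedge)

# Every gvkey in the result must already exist in one of the inputs.
merged_gvkeys = sorted({row[0] for row in merged})
for row in merged:
    print(row)
```

Because the join key is the whole (gvkey, date) tuple, an outer join can only ever return key pairs that already exist in one of the inputs. Any gvkey in a library's output that is absent from `merged_gvkeys` would point to a problem in that library's join rather than in the data.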
Alex
  • Maybe you could show some code that reproduces the issue? – Sven Marnach Apr 20 '11 at 10:02
  • Alex: Please edit your answer and include the code there. There will be a button for code formatting when editing. After this, you can delete your comments. – Sven Marnach Apr 20 '11 at 14:29
  • @Alex - I can't reproduce your problem... What version of python and numpy are you using? Also, stackoverflow is a poor place to report bugs. There's nothing wrong with asking to see if other people are having similar issues on SO (which is what you're doing), but for the most part, the numpy devs are unlikely to check stackoverflow. If you do actually want to report a bug, you should use the numpy/scipy bug tracker: http://www.scipy.org/BugReport This is a nice, self contained example though! – Joe Kington Apr 20 '11 at 18:25
  • @Alex - While I'm not reproducing your exact problem, I am getting some oddities with some of the datetimes and `nan`s in the `numpy.lib.recfunctions` version, for whatever it's worth... – Joe Kington Apr 20 '11 at 18:29
  • @Joe, apologies, I am not aware of the etiquette yet, just started messing around with Python. Indeed, it seems that recfunctions were not well debugged. I will report the potential bug at the URL you mentioned. To others: If I am misusing join_by, please let me know so I don't report it as a bug. Thanks! – Alex Apr 20 '11 at 19:43
  • @Alex - No worries! It's not as much etiquette, as it is just a matter of putting it somewhere where it will get a developer's attention. – Joe Kington Apr 20 '11 at 22:48
