0

I am trying to efficiently read some legacy DBs contents into a numpy (rec-)array. I was following this What's the most efficient way to convert a MySQL result set to a NumPy array? and this MySQLdb query to Numpy array posts. Now it happens that some entries in the DB contains NULL, which are returned as None.

So np.fromiter will react like this e.g.

TypeError: long() argument must be a string or a number, not 'NoneType'

I would like to kind of, tell it how it should behave in case it encounters None. Is that even possible?

Here is (something like) my code:

cur = db.cursor()
query = ("SELECT a, b, c from Table;")
cur.execute(query)                                                               

dt = np.dtype([                                                                  
    ('a', int),                                                              
    ('b', int),                                                              
    ('c', float),                                                                
])                                                                               
r = np.fromiter(cur.fetchall(), count=-1, dtype=dt)    

And I would like to be able to specify, that the resulting array should contain np.nan in case None is encountered in column 'c', while it should contain the number 9999 when None is found for column 'a' or 'b'. Is something like that possible?

Or is there another (beautiful) method to get MySQL DB contents into numpy arrays, in case some values are unknown?

Community
  • 1
  • 1
Dominik Neise
  • 1,179
  • 1
  • 10
  • 23

1 Answers1

2

I would be very hesitant to suggest that this is the best way of doing this, but np.rec.fromrecords has worked well for me in the past.

The fix_duplicate_field_names function is there to ensure that numpy doesn't bork when MySQL returns multiple columns with the same name (it just fudges new names).

In the get_output function, the some info is parsed out of the cursor to get field names for the rec array, after which numpy is allowed to decide the data type of the MySQL data.

def fix_duplicate_field_names(self,names):
    """Fix duplicate field names by appending an integer to repeated names."""
    used = []
    new_names = []
    for name in names:
        if name not in used:
            new_names.append(name)
        else:
            new_name = "%s_%d"%(name,used.count(name))
            new_names.append(new_name)
        used.append(name)
    return new_names    

def get_output(cursor):
    """Get sql data in numpy recarray form."""
    if cursor.description is None:
        return None
    names = [i[0] for i in cursor.description]
    names = fix_duplicate_field_names(names)
    output = cursor.fetchall()
    if not output or len(output) == 0:
        return None
    else:
        return np.rec.fromrecords(output,names=names)
ebarr
  • 7,704
  • 1
  • 29
  • 40