9

Is there a way to retain the SqlAlchemy attribute names when you query the data into a pandas dataframe?

Here's a simple mapping of my database. For the school table, I've renamed 'SchoolDistrict', the DB name, to a shorter 'district'. I'm several layers removed from the DBA, so changing them in the source isn't feasible.

class School(Base):
    __tablename__ = 'DimSchool'

    id = Column('SchoolKey', Integer, primary_key=True)
    name = Column('SchoolName', String)
    district = Column('SchoolDistrict', String)


class StudentScore(Base):
    __tablename__ = 'FactStudentScore'

    SchoolKey = Column('SchoolKey', Integer, ForeignKey('DimSchool.SchoolKey'), primary_key = True)
    PointsPossible = Column('PointsPossible', Integer)
    PointsReceived = Column('PointsReceived', Integer)

    school = relationship("School", backref='studentscore')

So when I query something like:

query = session.query(StudentScore, School).join(School)
df = pd.read_sql(query.statement, query.session.bind)

I end up with the underlying 'SchoolDistrict' name for the column, not my attribute name, in the returned DataFrame df.

EDIT: An even more annoying case is when duplicate column names exist across tables. For example:

class Teacher(Base):
    __tablename__ = 'DimTeacher'

    id = Column('TeacherKey', Integer, primary_key=True)
    fname = Column('FirstName', String)
    lname = Column('FirstName', String)

class Student(Base):
    __tablename__ = 'DimStudent'

    id = Column('StudentKey', Integer, primary_key=True)
    fname = Column('FirstName', String)
    lname = Column('FirstName', String)

So a query across both tables (like the one below) produces a dataframe with duplicate FirstName and LastName columns.

query = session.query(StudentScore, Student, Teacher).join(Student).join(Teacher)

Would it be possible to rename these columns at the moment of the query? Right now I'm having trouble keeping my head straight with these two systems of column names.

lrnzcig
  • 3,868
  • 4
  • 36
  • 50
AZhao
  • 13,617
  • 7
  • 31
  • 54
  • Hi, you could try to do a `select as` like in this [SO answer](http://stackoverflow.com/questions/3576382/select-as-in-sqlalchemy) or you could just change the name of the column after reading the dataframe. Hope it helps. – lrnzcig Jul 01 '15 at 10:08
  • Hm. So the select approach would work but forces me to move away from a more pure orm approach. The rename columns will also be a little inconvenient because of there being many variables with bad and ambiguous names. I'm really looking for a way to rename at the mapping level and forget the actual names. – AZhao Jul 01 '15 at 12:53
  • 1
    I may be wrong of course, but I don't think it is possible. Pandas `read_sql` is using the sql statement as provided by `sqlalchemy`, and it is going to have the wrong name unless you change the statement (`sqlalchemy` maps it afterwards, i.e. after the query is run, when processing the results). And in pandas there is no way to do a mapping like the one you propose; why there would be, if you can change the name of the column? Good luck anyway! – lrnzcig Jul 01 '15 at 14:08
  • Darn. So maybe I need to just build out a dict of all the columns and map it all after the data frame is pulled... – AZhao Jul 01 '15 at 14:10
  • Yes, I guess that's the simplest way. But keep in mind you can use reflection to get your dictionary from the `sqlalchemy` classes. You know what I mean? – lrnzcig Jul 01 '15 at 14:17
  • No - mind elaborating? Also I updated the answer to show why I can't easily rename the dataframes after the query. – AZhao Jul 01 '15 at 17:32
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/82118/discussion-between-azhao-and-lrnzcig). – AZhao Jul 01 '15 at 17:37

2 Answers2

2

I am not a SQLAlchemy expert by any means, but I have come up with a more generalized solution (or at least a start).

Caveats

  • Will not handle mapped columns with the same name across different Models. You should deal with this by adding suffix, or you could modify my answer below to create pandas columns as <tablename/model name>.<mapper column name>.

It involves four key steps:

  1. Qualify your query statement with labels, which will result in column names in pandas of <table name>_<column name>:
df = pd.read_sql(query.statement, query.session.bind).with_labels()
  1. Separate table name from (actual) column name
table_name, col = col_name.split('_', 1)
  1. Get the Model based on tablename (from this question's answers)
for c in Base._decl_class_registry.values():
            if hasattr(c, '__tablename__') and c.__tablename__ == tname:
                return c
  1. Find the correct mapped name
for k, v in sa_class.__mapper__.columns.items():
        if v.name == col:
            return k

Bringing it all together, this is the solution I have come up with, with the main caveat being it will result in duplicate column names in your dataframe if you (likely) have duplicate mapped names across classes.

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class School(Base):
    __tablename__ = 'DimSchool'

    id = Column('SchoolKey', Integer, primary_key=True)
    name = Column('SchoolName', String)
    district = Column('SchoolDistrict', String)


class StudentScore(Base):
    __tablename__ = 'FactStudentScore'

    SchoolKey = Column('SchoolKey', Integer, ForeignKey('DimSchool.SchoolKey'), primary_key = True)
    PointsPossible = Column('PointsPossible', Integer)
    PointsReceived = Column('PointsReceived', Integer)

    school = relationship("School", backref='studentscore')


def mapped_col_name(col_name):
    ''' Retrieves mapped Model based on
    actual table name (as given in pandas.read_sql)
    '''

    def sa_class(table_name):
        for c in Base._decl_class_registry.values():
            if hasattr(c, '__tablename__') and c.__tablename__ == tname:
                return c

    table_name, col = col_name.split('_', 1)
    sa_class = sa_class(table_name)

    for k, v in sa_class.__mapper__.columns.items():
        if v.name == col:
            return k

query = session.query(StudentScore, School).join(School)
df = pd.read_sql(query.statement, query.session.bind).with_labels()
df.columns = map(mapped_col_name, df.columns)
telemark
  • 81
  • 6
1

This is the kind of solution I would bitterly complain about if I had to maintain the code afterwards. But your question has so many constraints that I cannot find anything better.

First you construct a dictionary with the equivalences of schema and class columns using introspection like this (I'm using the first example you've posted):

In [132]:

def add_to_dict(c_map, t_map, table):
    name = table.__tablename__
    t_map[name] = table.__name__
    #print name
    c_map[name] = {}
    for column in dir(table):
        c_schema_name = table.__mapper__.columns.get(column)
        if isinstance(c_schema_name, Column):
            #print column, c_schema_name.name
            c_map[name][c_schema_name.name] = column

c_map = {}
t_map = {}
add_to_dict(c_map, t_map, School)
add_to_dict(c_map, t_map, StudentScore)
print c_map['DimSchool']['SchoolKey']
print c_map['FactStudentScore']['SchoolKey']
print t_map['DimSchool']
id
SchoolKey
School

[EDIT: clarifications on the way to build the dictionary with introspection

  • c_map is the dictionary of correspondences of column names
  • t_map is the dictionary of correspondences of table names
  • needs to be called for each class of each table
  • for table names the correspondence is easy, since it is just attributes of the class of the table
  • for the column names of the class, 1st iterate the attributes of the class using dir
  • for each of the attributes of the class (which will be the columns of the table, but also many other things) try to get the database column name using sqlalchemy mapper
  • the mapper will return a Column object only if the attribute is really a column
  • thus for the Column objects, add them to the column names dictionary. The database name is obtained with .name and the other is just the attribute

Run this just once after creating all the objects in the database, calling it once per table class.]

Then you take your sql statement and build up a list of the translation of the columns you are going to get:

In [134]:

df_columns = []
for column in str(query.statement).split('FROM')[0].split('SELECT')[1].split(','):
    table = column.split('.')[0].replace('"', '').strip()
    c_schema = column.split('.')[1].replace('"', '').strip()
    df_columns += [t_map[table] + '.' + eq[table][c_schema]]
print df_columns
​
['StudentScore.SchoolKey', 'StudentScore.PointsPossible', 'StudentScore.PointsReceived', 'School.id', 'School.name', 'School.district']

Finally, you read the dataframe as in your question and change the names of the columns:

In [137]:

df.columns = df_columns
In [138]:

df
Out[138]:
StudentScore.SchoolKey  StudentScore.PointsPossible StudentScore.PointsReceived School.id   School.name School.district
0   1   1   None    1   School1 None

(The data is just a silly register I've created).

Hope it helps!

lrnzcig
  • 3,868
  • 4
  • 36
  • 50
  • implementing now, but can you clarify how the add_to_dict function works? As the dude who's "going to bitterly complain about maintaining it" I'd love to be able to solidify my understanding :D – AZhao Jul 02 '15 at 16:55