
We've put a lot of effort into developing a full dimensional database model of our problem, and now it's time to start coding. Our previous projects used hand-crafted queries built by string manipulation.

Is there any best/standard practice for interfacing between python and a complex database layout?

I've briefly evaluated SQLAlchemy, SQLObject, and Django-ORM, but (I may easily be missing something) they seem tuned for small web-style (OLTP) transactions, while I'm doing high-volume analytical (OLAP) work.

Some of my requirements, that may be somewhat different than usual:

  1. load large amounts of data relatively quickly
  2. update/insert small amounts of data quickly and easily
  3. handle large numbers of rows easily (300 entries per minute over 5 years)
  4. allow for modifications in the schema, for future requirements

Writing these queries is easy, but writing the code to get the data all lined up is tedious, especially as the schema evolves. This seems like something that a computer might be good at?

bukzor
    I'll add this as a comment, but typically ORMs are used for OLTP, and trying to shoehorn them into mapping a data warehouse is usually more trouble than it's worth. – nos Sep 23 '10 at 20:55
  • While ORMs can be used for OLTP, they can also simplify certain kinds of data warehouse processing. Most DW queries **should** be simple `select sum/count and group-by` operations. Having these generated by an ORM tool can be a big time-saver. – S.Lott Sep 23 '10 at 21:03
  • Are queries going to be generated by end-users? What interface are you providing, and what levels of freedom? – shmichael Sep 24 '10 at 14:44
  • @shmichael: as much as I can, but we'll start with drop-down boxes with a few values from three or four dimensions. – bukzor Sep 24 '10 at 23:29

3 Answers


Don't get confused by your requirements. One size does not fit all.

load large amounts of data relatively quickly

Why not use the database's native loaders for this? Use Python to prepare the files, but use the database's own tools to load them. You'll find that this is amazingly fast.
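A sketch of that split, with Python doing only the file preparation; the loader command and table name below are assumptions, not part of the question:

```python
import csv
import io

def prepare_staging(rows):
    # Python's job: format fact rows as CSV text for the bulk loader.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

csv_text = prepare_staging([(20100923, 42, 17.5), (20100923, 43, 3.0)])
print(csv_text)

# The database's job: bulk-load the staged file with its native tool, e.g.
#   psql -c "\copy fact_measurements FROM 'staging.csv' WITH (FORMAT csv)"
# or Oracle's SQL*Loader / MySQL's LOAD DATA INFILE.
```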

update/insert small amounts of data quickly and easily

That starts to bend the rules of a data warehouse, unless you're talking about Master Data Management updates to the reporting attributes of a dimension.

That's what ORMs and web frameworks are for.
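A minimal sketch of that kind of dimension maintenance, using the stdlib sqlite3 module in place of a real ORM (the table, columns, and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dim_machine (
    machine_key INTEGER PRIMARY KEY, name TEXT, status TEXT)""")
conn.execute("INSERT INTO dim_machine VALUES (1, 'node-01', 'active')")

# An MDM-style change: one row, one attribute, inside a transaction --
# the small, transactional workload ORMs and web frameworks handle well.
with conn:
    conn.execute("UPDATE dim_machine SET status = ? WHERE machine_key = ?",
                 ("retired", 1))

print(conn.execute(
    "SELECT status FROM dim_machine WHERE machine_key = 1").fetchone()[0])
```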

handle large numbers of rows easily (300 entries per minute over 5 years)

Again, that's why you use a pipeline of Python front-end processing, while the actual INSERTs are done by database tools, not Python.
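For instance, Python cleans and reshapes the raw rows, and the database's bulk loader does the writes; the `psql` invocation and table name here are assumptions for illustration:

```python
def transform(raw_lines):
    # Python front end: parse raw log lines into loader-ready rows.
    for line in raw_lines:
        ts, machine, value = line.split()
        yield f"{ts}\t{machine}\t{value}\n"

rows = list(transform(["20100923T1200 node01 17.5",
                       "20100923T1201 node02 3.0"]))
print("".join(rows), end="")

# The INSERTs themselves go through the database's loader, not Python.
# Sketch (psql and the table name are assumed):
# import subprocess
# proc = subprocess.Popen(["psql", "-c", "\\copy fact_log FROM STDIN"],
#                         stdin=subprocess.PIPE, text=True)
# proc.communicate("".join(rows))
```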

alter schema (along with python interface) easily, for future requirements

You have almost no use for automating this. It's certainly your lowest priority task for "programming". You'll often do this manually in order to preserve data properly.

BTW, "hand-crafted queries constructed by string manipulation" is probably the biggest mistake ever. These are hard for the RDBMS parser to handle -- they're slower than using queries that have bind variables inserted.

S.Lott
  • Thanks. What I meant by "alter" was "don't lock down the schema". Many times when interacting with a database, you reach a certain threshold of lines of code to be modified where it's not worth it to edit the schema anymore. I'd like to avoid that situation entirely if possible. – bukzor Sep 23 '10 at 21:14
  • When loading into the database, how do you handle new data that overlaps the old? Or do you simply load all the data each time? That seems prohibitively inefficient. – bukzor Sep 23 '10 at 21:20
  • @bukzor: Data Warehouse (almost always) means insert-only. New data --- by definition -- is *new* and does not *overlap* the old. It may share some common keys, but it's *new*. Please find a copy of Kimball's The Data Warehouse Toolkit before proceeding. – S.Lott Sep 23 '10 at 22:37
  • I got it last week but I'm still on chapter 2... Does that mean you always start with a blank database? Beyond that, at least some of my data will "overlap": jobs end, machines are upgraded, etc. I note that nearly all of the slowly-changing-dimension types require updating the old data to expire it. – bukzor Sep 24 '10 at 00:40
  • @bukzor: "update reporting attributes of a dimension". I believe that's the use case for proper SQL. "you start with a blank database always?" Kind of. You often know the dimensions in advance. You rarely know the facts in advance. "jobs end, machines are upgraded" sounds like dimension updates to me. These are small, isolated, specialized, rare things. Not like fact loading, which is 80-90% of the database change. – S.Lott Sep 24 '10 at 15:06

I'm using SQLAlchemy with a pretty big data warehouse, and I'm using it for the full ETL process with success, especially for sources with complex transformation rules or heterogeneous sources (such as web services). I'm not using the SQLAlchemy ORM but rather its SQL Expression Language, because I don't really need to map anything to objects in the ETL process.

Worth noting: when I'm bringing in a verbatim copy of a source, I'd rather use the database tools for that, such as PostgreSQL's dump utility. You can't beat them.

The SQL Expression Language is the closest you will get with SQLAlchemy (or any ORM, for that matter) to handwriting SQL, but since you can programmatically generate the SQL from Python, you will save time, especially if you have some really complex transformation rules to follow.

One thing though: I'd rather modify my schema by hand. I don't trust any tool for that job.
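A minimal sketch of generating a rollup query with the SQL Expression Language; the fact table and column names are made up for illustration:

```python
from sqlalchemy import Column, Integer, MetaData, Table, func, select

metadata = MetaData()
# Hypothetical fact table -- in practice, reflect or declare your real schema.
fact_sales = Table(
    "fact_sales", metadata,
    Column("date_key", Integer),
    Column("product_key", Integer),
    Column("amount", Integer),
)

def rollup(*dims):
    # Programmatically build a sum/group-by query over any set of dimensions,
    # instead of hand-assembling SQL strings.
    return (select(*dims, func.sum(fact_sales.c.amount).label("total"))
            .group_by(*dims))

print(rollup(fact_sales.c.product_key))
```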

Mariano
  • +1: "modify my schema by hand". It's too hard to write a tool for this -- too many special cases. – S.Lott Sep 23 '10 at 22:36
  • Do you have the sources and targets on the same server? – hugo24 Nov 16 '10 at 22:10
  • Let me see...: one source is in the same server, and three other sources are in three different servers (two postgresql, 1 mysql and 1 oracle are my sources). Plus I'm gathering data from a SalesForce SOAP WS and data from some Google Apps (Analytics and Calendar). – Mariano Nov 26 '10 at 03:19

SQLAlchemy, definitely. Compared to SQLAlchemy, all other ORMs look like children's toys, especially the Django ORM. What Hibernate is to Java, SQLAlchemy is to Python.

dekomote
  • I've tried to learn the SA ORM three times now and each time I come against completely bizarre errors that make me not want to touch it ever again. How did you learn it? – bukzor Sep 23 '10 at 21:17
  • "completely bizarre errors" or "assumptions of mine that turned out to be untrue out SA"? Often, we make assumptions, find that the software doesn't match our assumptions, and call the whole thing "bizarre". Without details, it's impossible to evaluate your assertion. If you have problems, please post them as questions. – S.Lott Sep 23 '10 at 22:54
  • @S.Lott: You're probably right, but after repeated attempts, I'm tempted to think that if my assumptions are repeatedly wrong, then probably it's not the tool for me. I'm interested in @user184757's approach; the SA core expression language seems more sane. – bukzor Sep 24 '10 at 00:48