What's a good data model for cross-tabulation?

Question

I'm implementing a cross-tabulation library in Python as a programming exercise for my new job, and I've got an implementation of the requirements that works but is inelegant and redundant. I'd like a better model for it, something that allows a nice, clean movement of data between the base model, stored as tabular data in flat files, and all of the statistical analysis results that might be asked of this.

Right now, I have a progression from a set of tuples for each row in the table, to a histogram counting the frequencies of the appearances of the tuples of interest, to a serializer that -- somewhat clumsily -- compiles the output into a set of table cells for display. However, I end up having to go back up to the table or to the histogram more often than I want to because there's never enough information in place.

So, any ideas?

Edit: Here's an example of some data, and what I want to be able to build from it. Note that "." denotes a bit of 'missing' data, that is only conditionally counted.

If I were looking at the correlation between columns 0 and 2 above, this is the table I'd have:

    . 1 2 3 4
1   0 1 0 3 0
2   2 1 1 0 1

In addition, I'd want to be able to calculate ratio of frequency/total, frequency/subtotal, &c.

It's not very clear to me what you're trying to do. Can you either describe the requirements or give an example with some data and show the transformations you want to do? — James Thompson, Jun 19 '09 at 19:51
In this example you'd be better off writing it in column major order. — jonnii, Jun 19 '09 at 20:49
jonnii, I'm not much of a statistician; can you clarify what you mean by that? — Chris R, Jun 21 '09 at 17:47

Roberto Bonvallet · Accepted Answer · 2009-06-20T23:18:07.347

You could use an in-memory sqlite database as a data structure, and define the desired operations as SQL queries.

import sqlite3

c = sqlite3.Connection(':memory:')
c.execute('CREATE TABLE data (a, b, c)')

c.executemany('INSERT INTO data VALUES (?, ?, ?)', [
    (1, None,    1),
    (1,    0,    3),
    (1,    0,    3),
    (1,    2,    3),
    (2, None,    1),
    (2,    0, None),
    (2,    2,    2),
    (2,    2,    4),
    (2,    2, None),
])

# queries
# ...

score 1 · Answer 2 · answered Jul 15 '11 at 02:11

S W has posted a good basic recipe for this on activestate.com.

The essence seems to be...

Define xsort=[] and ysort=[] as arrays of your axes. Populate them by iterating through your data, or some other way.
Define rs={} as a dict of dicts of your tabulated data, by iterating through your data and incrementing rs[yvalue][xvalue]. Create missing keys if/when needed.

Then for example the total for row y would be sum([rs[y][x] for x in xsort])

score 0 · Answer 3 · answered Jun 19 '09 at 19:57

Since this is an early programming exercise for Python, they probably want you to see what Python built-in mechanisms would be appropriate for the initial version of the problem. The dictionary structure seems a good candidate. The first column value from your tab-sep file can be the key into a dictionary. The entry found by that key can itself be a dictionary, whose key is the second column value. The entries of the subdictionary would be a count, initialized to 1 when you add a new subdictionary when a pair is first encountered.

jonnii · Answer 4 · 2009-06-19T19:34:09.767

-1

Why not store it using HTML Tables? It might not be the best, but you could then, very easily, view it in a browser.

Edit:

I just re-read the question and you're asking for data model, not a storage model. To answer that question...

It all depends on how you're going to be reporting on the data. For example if you're going to be doing a lot of pivoting or aggregation it might make more sense to store it in column major order, this way you can just sum a column to get counts, for example.

It'll help a lot if you explain what kind of information you're trying to extract.

edited Jun 19 '09 at 19:34

answered Jun 19 '09 at 19:29

jonnii

28,019
8
80
108

I'm not sure what type of data, actually; the exercises are being parceled out one step at a time (step 1: read tab-separated file and count pairs in columns 1/2 (pivot table? /me wanders off to wikipedia...) Assume that I want to do everything in here: http://en.wikipedia.org/wiki/Cross_tabulation#Statistics_related_to_cross_tabulations – Chris R Jun 19 '09 at 19:41
1

What do you mean by count pairs in a column? – jonnii Jun 19 '09 at 19:44

What's a good data model for cross-tabulation?

4 Answers4