I'm trying to make a reporting / analysis app.
I want to move data from a simple Django 'polls' application into a DataFrame for analysis (basic stats initially). The expected volume of information is low so I am not optimising for performance.
frame = pd.DataFrame.from_records([row for row in answer_set.values(col_names)], **kwargs)
I can then 'join' subsequent frames
(for different questions) on the 'user_id' column, provided I make it an index of the dataframe.
My problem is that an answer is also indexed (but orthogonally) by 'question' and as a result I have only been able to get to a table like this (this is transposed because it was impractically wide):
answer_fields User1 User2
questionpk 1 1
question__text 'How?' 'How?'
answerfield1 'x' 'x'
answerfield2 'y' 'z'
questionpk 2 2
etc...
I want to get a table more like this:
table User User
question question_text 1 2
1 'How?' field1 'x' 'x'
field2 'y' 'z'
But as far as I can see that requires a MultiIndex on the question dimension, making 'users' the columns, and if I cared about organising Users by demographics, I don't think I can have a multiIndex on that dimension too. I prefer this condensed format because I want to offer the ability to build a DataFrame with columns that an analyst user wants to profile and compare (comparing answers with subsequent answers matched on user, most obviously), and while the question_text is redundant as a variable, it has a useful value for reporting purposes. I'd like to offer some tools from statsmodels and patsy formulae, so easy names for columns and rows will be important.
[I have a sense that some kind of set logic among the foreignKeys would help me generalise this to other apps, but having done so badly on what I took to be a simple problem has left me unsure that I understand what a table even is.]