1

I'm trying to make a reporting / analysis app.

I want to move data from a simple Django 'polls' application into a DataFrame for analysis (basic stats initially). The expected volume of information is low so I am not optimising for performance.

frame = pd.DataFrame.from_records([row for row in answer_set.values(col_names)], **kwargs)

I can then 'join' subsequent frames(for different questions) on the 'user_id' column, provided I make it an index of the dataframe.

My problem is that an answer is also indexed (but orthogonally) by 'question' and as a result I have only been able to get to a table like this (this is transposed because it was impractically wide):

answer_fields User1 User2
questionpk 1 1 question__text 'How?' 'How?' answerfield1 'x' 'x' answerfield2 'y' 'z' questionpk 2 2 etc...

I want to get a table more like this:

table User User question question_text 1 2 1 'How?' field1 'x' 'x' field2 'y' 'z'

But as far as I can see that requires a MultiIndex on the question dimension, making 'users' the columns, and if I cared about organising Users by demographics, I don't think I can have a multiIndex on that dimension too. I prefer this condensed format because I want to offer the ability to build a DataFrame with columns that an analyst user wants to profile and compare (comparing answers with subsequent answers matched on user, most obviously), and while the question_text is redundant as a variable, it has a useful value for reporting purposes. I'd like to offer some tools from statsmodels and patsy formulae, so easy names for columns and rows will be important.

[I have a sense that some kind of set logic among the foreignKeys would help me generalise this to other apps, but having done so badly on what I took to be a simple problem has left me unsure that I understand what a table even is.]

Atcrank
  • 439
  • 3
  • 11
  • my suggestion is to just input the dirty data from your Django app (see: https://stackoverflow.com/questions/11697887/converting-django-queryset-to-pandas-dataframe) and then just cleaning it up using pandas – zero May 10 '19 at 01:59
  • 1
    Thanks, yeah I think that'll be best. I have just learned in the cookbook that you can multi-index both Index and Columns, which is pretty important. Also found (via that answer) a helpful project - (https://github.com/chrisdev/django-pandas/). – Atcrank May 10 '19 at 06:31

0 Answers0