
From reading other questions, I am under the impression that pandas handles boxplots and statistical analysis best when the data is in the following format:

    stimulus    vote
0          1       0
1          1       1
2          1       1
3          1       1
4          1       2
5          1       2
6          1       2
7          1       2
8          1       2
9          1       2
10         1       3
11         1       3
12         1       3
13         1       3

where stimulus is my independent variable and vote is each score given to it.

However, my data already comes grouped by rating, with another column, votes, showing the .count() of each rating.

    stimulus  rating  votes
0          1       0      1
1          1       1      3
2          1       2      6
3          1       3      4

Again, stimulus is my IV, rating is the score scale and votes is the number of votes given to each score.

I am now having trouble working with that format, and I can't work out how to transform this data back into the "stacked" or "record" format.

In the end I want to

  • plot the data as a boxplot
  • execute a Kruskal-Wallis H-test
Nils Werner

1 Answer

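import numpy as np
import pandas as pd

# 'data' is the whitespace-delimited file holding the stimulus/rating/votes table
df = pd.read_table('data', sep=r'\s+')

# Repeat each row's rating and stimulus `votes` times: one row per individual vote
stacked = pd.DataFrame({key: np.repeat(df[key].values, df['votes'])
                        for key in ('rating', 'stimulus')})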

yields

    rating  stimulus
0        0         1
1        1         1
2        1         1
3        1         1
4        2         1
5        2         1
6        2         1
7        2         1
8        2         1
9        2         1
10       3         1
11       3         1
12       3         1
13       3         1
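
As a sanity check, the stacked frame can be collapsed back into counts and compared with the original table. The sketch below is only an illustration: it inlines the question's four rows through a hypothetical `raw` string instead of reading the 'data' file, and the name `counts` is an assumption of this example.

```python
import io

import numpy as np
import pandas as pd

# The question's grouped table, inlined so the example is self-contained
# (the `raw` string stands in for the 'data' file; it is an assumption).
raw = """\
stimulus  rating  votes
1  0  1
1  1  3
1  2  6
1  3  4
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Expand to one row per vote, as in the answer's code
stacked = pd.DataFrame({key: np.repeat(df[key].values, df["votes"])
                        for key in ("rating", "stimulus")})

# Collapse back: count the rows for each (stimulus, rating) pair
counts = (stacked.groupby(["stimulus", "rating"])
                 .size()
                 .reset_index(name="votes"))
```

If the expansion is right, `counts["votes"]` should match the `votes` column of the original table.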

What you posted as the vote column I am calling the rating column. If I understand your situation correctly, the values in the stacked vote/rating column are ratings, so I think it is appropriate to call the column rating. (Moreover, it allows me to use a dict comprehension which is -- okay, I admit it -- the real reason for the name change. :)
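
Once the data is in stacked form, the Kruskal-Wallis H-test mentioned in the question could be run roughly as follows. This is a minimal sketch, assuming SciPy is available; the posted data has only one stimulus, so the counts for stimulus 2 are invented purely for illustration, since the test needs at least two groups.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Counts for stimulus 1 come from the question; stimulus 2 is hypothetical.
df = pd.DataFrame({
    "stimulus": [1, 1, 1, 1, 2, 2, 2, 2],
    "rating":   [0, 1, 2, 3, 0, 1, 2, 3],
    "votes":    [1, 3, 6, 4, 5, 4, 3, 2],
})

# Expand to one row per vote, as above
stacked = pd.DataFrame({key: np.repeat(df[key].values, df["votes"])
                        for key in ("rating", "stimulus")})

# One array of ratings per stimulus, passed as separate samples
groups = [g["rating"].to_numpy() for _, g in stacked.groupby("stimulus")]
h_stat, p_value = stats.kruskal(*groups)
```

With matplotlib installed, `stacked.boxplot(column='rating', by='stimulus')` should draw one box per stimulus from the same frame.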

unutbu