5

I am a beginner with pandas at best and I couldn't find a solution to this problem anywhere.

Let's say I have two variables: variable1, variable2.

They can have the following predefined values:

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']

However, the current data set only has some of those values:

df = pd.DataFrame({variable1 : ['1', '9', '20'],
                  variable2 : ['2', '2', '6']})

When crossing the variables:

pd.crosstab(df.variable1, df.variable2)

I get:

variable2  2  6
variable1      
1          1  0
20         0  1
9          1  0

Is there a way to put all the possible categorical values in both the columns and the rows even if the current data set does not have all of them? The goal is to have a table of the same size when running the script with an updated data set which may have the values that were not present in the previous data set.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252

3 Answers3

8

Use DataFrame.reindex:

variable1 = ['1', '4', '9', '15', '20']
variable2 = ['2', '5', '6']


df = pd.DataFrame({'variable1' : ['1', '9', '20'],
                  'variable2' : ['2', '2', '6']})

print (df)                  
  variable1 variable2
0         1         2
1         9         2
2        20         6

df = pd.crosstab(df.variable1, df.variable2)
df = df.reindex(index=variable1, columns=variable2, fill_value=0)
print (df)
variable2  2  5  6
variable1         
1          1  0  0
4          0  0  0
9          1  0  0
15         0  0  0
20         0  0  1

from collections import OrderedDict


valuelabels = OrderedDict([('S8', [['1', 'Medical oncology'], 
                                   ['2', 'Hematology'], 
                                   ['3', 'Hematology/Oncology'], 
                                   ['4', 'Other']]), 
                           ('S9', [['1', 'Academic / Teaching Hospital'], 
                                   ['2', 'Community-Based Solo Private Practice'], 
                                   ['3', 'Community-Based Group Private Practice (record practice size )'], ['4', 'Community Non-Teaching Hospital'], 
                                   ['5', 'Comprehensive Cancer Center'], 
                                   ['6', 'Other (specify)']])])
#print (valuelabels)


df = pd.DataFrame({'variable1' : ['1', '2', '4'],
                  'variable2' : ['2', '3', '1']})

table = pd.crosstab(df.variable1, df.variable2)      
print (table)
variable2  1  2  3
variable1         
1          0  1  0
2          0  0  1
4          1  0  0
d1 = dict(list(zip([a[0] for a in valuelabels['S8']], [a[1] for a in valuelabels['S8']])))
print (d1)
{'4': 'Other', '1': 'Medical oncology', '2': 'Hematology', '3': 'Hematology/Oncology'}

d2 = dict(list(zip([a[0] for a in valuelabels['S9']], [a[1] for a in valuelabels['S9']])))
print (d2)
{'1': 'Academic / Teaching Hospital', 
'3': 'Community-Based Group Private Practice (record practice size )', 
'4': 'Community Non-Teaching Hospital', 
'6': 'Other (specify)', 
'2': 'Community-Based Solo Private Practice', 
'5': 'Comprehensive Cancer Center'}
table = table.reindex(index=[a[0] for a in valuelabels['S8']], 
                      columns=[a[0] for a in valuelabels['S9'], fill_value=0)
print (table)
variable2  1  2  3  4  5  6
variable1                  
1          0  1  0  0  0  0
2          0  0  1  0  0  0
3          0  0  0  0  0  0
4          1  0  0  0  0  0

table.index = table.index.to_series().map(d1).values
table.columns = table.columns.to_series().map(d2).values
print (table)
                     Academic / Teaching Hospital  \
Medical oncology                                0   
Hematology                                      0   
Hematology/Oncology                             0   
Other                                           1   

                     Community-Based Solo Private Practice  \
Medical oncology                                         1   
Hematology                                               0   
Hematology/Oncology                                      0   
Other                                                    0   

                     Community-Based Group Private Practice (record practice size )  \
Medical oncology                                                     0                
Hematology                                                           1                
Hematology/Oncology                                                  0                
Other                                                                0                

                     Community Non-Teaching Hospital  \
Medical oncology                                   0   
Hematology                                         0   
Hematology/Oncology                                0   
Other                                              0   

                     Comprehensive Cancer Center  Other (specify)  
Medical oncology                               0                0  
Hematology                                     0                0  
Hematology/Oncology                            0                0  
Other                                          0                0  
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Works perfectly! Thanks! – somethingsomethingdangerzone Oct 27 '16 at 13:08
  • Glad can help you! – jezrael Oct 27 '16 at 13:09
  • I am stuck again - now I cannot pass my value labels with pd.crosstab parameters - rownames and colnames. I get `AssertionError: arrays and names must have the same length`. Is there any way of working around that? – somethingsomethingdangerzone Oct 27 '16 at 14:33
  • `table = pd.crosstab(df.S8, df.S9) table = table.reindex(index=[a[0] for a in valuelabels['S8']], columns=[a[0] for a in valuelabels['S9']], fill_value=0)` I started with this, then tried adding the value label as parameters in pd.crosstab and eventually got that error. If I pass the number of labels that actually exist in my data set everything works fine. – somethingsomethingdangerzone Oct 27 '16 at 14:39
  • really interesting. How works crosstab alternative `df = df.pivot_table(index='variable1', columns='variable2', aggfunc=len, fill_value=0)` ? – jezrael Oct 27 '16 at 14:41
  • `valuelabels` is another dataframe? – jezrael Oct 27 '16 at 14:43
  • If `valuelabels = pd.DataFrame({'S8':pd.Series(['1', '4', '9', '15', '20']), 'S9':pd.Series( ['2', '5', '6'])})` you need `table = pd.crosstab(df.S8, df.S9)` and `table = table.reindex(index=valuelabels['S8'], columns=valuelabels['S9'], fill_value=0)` – jezrael Oct 27 '16 at 14:46
  • Sorry need `table = table.reindex(index=valuelabels['S8'], columns=valuelabels['S9'].dropna(), fill_value=0)` – jezrael Oct 27 '16 at 14:48
  • valuelables is actually an ordereddict where all the possible values and their labels are stored for all variables. In my list comprehension a[0] returns all the possible codes and a[1] would return all the values. – somethingsomethingdangerzone Oct 27 '16 at 15:27
  • `.dropna()` comes up with ~ `AttributeError: 'list' object has no attribute 'dropna'`. I guess it kind of makes sense since `valuelables['S8']` returns a list of lists - `[['code 1', 'code 1 label], ['code2', 'code 2 label'], ['code3', 'code 3 label'], ['code4', 'code 4 label']]` I just don't understand what the `dropna()` is supposed to accomplish at this particular case. – somethingsomethingdangerzone Oct 27 '16 at 15:48
  • Hmm, it is something like `valuelabels = {'S8':[['1', 'code'],['4', 'code1'],['9', 'code'],['15', 'code1'],['20', 'code']], 'S9':[['2', 'code4'],['5', 'code5'],['6', 'code5']]}` ? – jezrael Oct 27 '16 at 16:13
  • And if use `from collections import OrderedDict` and `valuelabels = OrderedDict({'S8':[['1', 'code'],['4', 'code1'],['9', 'code'],['15', 'code1'],['20', 'code']], 'S9':[['2', 'code4'],['5', 'code5'],['6', 'code5']]})` it works also well. – jezrael Oct 27 '16 at 16:15
  • Not sure if I should be posting this here as it is 'work stuff', but here goes nothing :) `valuelabels` returns `OrderedDict([('S8', [['1', 'Medical oncology'], ['2', 'Hematology'], ['3', 'Hematology/Oncology'], ['4', 'Other']]), ('S9', [['1', 'Academic / Teaching Hospital'], ['2', 'Community-Based Solo Private Practice'], ['3', 'Community-Based Group Private Practice (record practice size )'], ['4', 'Community Non-Teaching Hospital'], ['5', 'Comprehensive Cancer Center'], ['6', 'Other (specify)']])])` – somethingsomethingdangerzone Oct 27 '16 at 16:30
  • see my edit, for me it works nice. What is your pandas version? `print (pd.show_versions())` ? – jezrael Oct 27 '16 at 16:36
  • pandas: 0.19.0, maybe the problem is with how I parse my data: `reader = csv.DictReader(open(csvinput)) result = {} for row in reader: for column, value in row.items(): result.setdefault(column, []).append(value) df = pd.DataFrame(result, columns = [key for key in result.keys()])` – somethingsomethingdangerzone Oct 27 '16 at 16:44
  • what is `df.info()` and `print (type(df.ix[0,'variable1']))` ? It looks weird if it is standars dataframe. – jezrael Oct 27 '16 at 16:51
  • variable1 and variable2 were only examples when i started explaining the problem. Now I am trying to cross actual variables named 'S8' and 'S9'. `df.info()` is ` RangeIndex: 100 entries, 0 to 99 Columns: 196 entries, S9v3 to Q9_4 dtypes: object(196) memory usage: 76.6+ KB None` `print (type(df.ix[0,'S8']))` is `` – somethingsomethingdangerzone Oct 27 '16 at 16:57
  • Yes, and types are `dict` or `list` ? – jezrael Oct 27 '16 at 16:58
  • Sorry, not following - types of what? – somethingsomethingdangerzone Oct 27 '16 at 17:06
  • I keep hitting enter and end up editing my comments so its not your eyes playing tricks on you... – somethingsomethingdangerzone Oct 27 '16 at 17:09
  • Are data confidental? Why dont use [`read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) ? – jezrael Oct 27 '16 at 17:09
  • It is kind of confidential, let me try your option for reading the data. – somethingsomethingdangerzone Oct 27 '16 at 17:15
  • Everything is the same with pandas read_csv... I think the problem is with my codes/labels dict. `print(type(valuelabels['S8']))` returns `` and the error message is telling me that something is wrong there - `'list' object has no attribute 'dropna'`. What do you think? – somethingsomethingdangerzone Oct 27 '16 at 17:21
  • Sorry, `dropna` is wrong, because I dont know what is `valuelabels` before. So your solution works nice for me with test data in last edit. – jezrael Oct 27 '16 at 17:23
  • And one question `AssertionError` return `crosstab` or `reindex` ? – jezrael Oct 27 '16 at 17:25
  • `crosstab`. It displays the data fine, but what i wanted to do is to display value labels instead of the actual codes for example "Medical oncology" instead of "1", "Hematology" instead of "2" in the downbreak. Is that possible? – somethingsomethingdangerzone Oct 27 '16 at 17:31
  • Yes, you can use map. – jezrael Oct 27 '16 at 17:34
  • Give me a sec, I modify solution. – jezrael Oct 27 '16 at 17:34
  • See my edit. Use [`map`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) by `dicts`, but need convert `index` to series before by [`to_series`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.to_series.html). – jezrael Oct 27 '16 at 17:54
  • `print(d1)` returns `{'4': 'Other', '2': 'Hematology', '1': 'Medical oncology', '3': 'Hematology/Oncology'}`, but somehow my `print(table)` returns ` NaN NaN NaN NaN NaN NaN NaN 9 4 2 1 1 0 NaN 0 0 0 0 0 0 NaN 33 8 29 8 5 0 NaN 0 0 0 0 0 0` – somethingsomethingdangerzone Oct 27 '16 at 18:11
  • Ahh it's all distorted here... The numbers are ok, but for some reason I get NaN instead of labels. – somethingsomethingdangerzone Oct 27 '16 at 18:12
  • are keys in dicts strings? And `table.index` is string also? (`Index(['1', '2', '3', '4'], dtype='object', name='variable1')`). Or both types are `int` ? – jezrael Oct 27 '16 at 18:15
  • Keys in dicts are strings, but `print(type(table.index))` yields `` – somethingsomethingdangerzone Oct 27 '16 at 18:20
  • so need cast `index` to `str`, but before to `int` - `df.index = df.index.astype(int).astype(str)`. – jezrael Oct 27 '16 at 18:22
  • I've added `table.index = table.index.astype(int).astype(str) table.columns = table.columns.astype(int).astype(str)` as it seemed I needed it for both the index and columns and it finally works! thanks a lot for your patience and time! – somethingsomethingdangerzone Oct 27 '16 at 18:33
4

You can use reindex:

ct = pd.crosstab(df.variable1, df.variable2)
ct.reindex(index=variable1, columns=variable2).fillna(0).astype('int')
Out: 
variable2  2  5  6
variable1         
1          1  0  0
4          0  0  0
9          1  0  0
15         0  0  0
20         0  0  1
ayhan
  • 70,170
  • 20
  • 182
  • 203
-1
def TargetPercentByNominal (
 targetVar,       # target variable
 predictor):      # nominal predictor

countTable = pandas.crosstab(index = predictor, columns = targetVar, margins = True, dropna = True)
  x = countTable.drop('All', 1)
  percentTable = countTable.div(x.sum(1), axis='index')*100

  print("Frequency Table: \n")
  print(countTable)
  print( )
  print("Percent Table: \n")
  print(percentTable)

  return
Aniket G
  • 3,471
  • 1
  • 13
  • 39