46

I have the following code:

import numpy as np
import pandas as pd

df = pd.read_csv(CsvFileName)

p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)

p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]] = p[["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]].astype(int)

It has always worked until a csv file came along that didn't cover all the weekdays. For example, with the following .csv file:

DOW,Hour,Changes
4Wed,01,237
3Tue,07,2533
1Sun,01,240
3Tue,12,4407
1Sun,09,2204
1Sun,01,240
1Sun,01,241
1Sun,01,241
3Tue,11,662
4Wed,01,4
2Mon,18,4737
1Sun,15,240
2Mon,02,4
6Fri,01,1
1Sun,01,240
2Mon,19,2300
2Mon,19,2532

I'll get the following error:

KeyError: "['5Thu' '7Sat'] not in index"

It seems to have a very easy fix, but I'm just too new to Python to know how to fix it.

asked by xpt, edited by nbro
    Is simply `p = p.astype(int)` insufficient? It's hard to tell why you're indexing the columns like that, the two lists look identical and from your input data it doesn't look like you're *excluding* columns this way. – jedwards Jul 19 '16 at 15:40
  • Even if you subset the columns you look at (i.e. exclude `['5Thu' '7Sat']`), you still will run into errors casting `np.nan` to an `int`. You can use a different fill value (for missing entries) to skirt this problem. – Alex Jul 19 '16 at 15:44
  • Eventually I'll need all the week days in the index (e.g., '5Thu' '7Sat'), because I'll plot them next. So I'm looking for ways to add them all, without affecting the averages. – xpt Jul 19 '16 at 15:45

4 Answers

48

Use reindex to get all the columns you need. It preserves the columns that are already there and inserts the missing ones (filled with NaN by default, or with fill_value if you pass one).

p = p.reindex(columns=['1Sun', '2Mon', '3Tue', '4Wed', '5Thu', '6Fri', '7Sat'], fill_value=0)

So, your entire code example should look like this:

import numpy as np
import pandas as pd

df = pd.read_csv(CsvFileName)

p = df.pivot_table(index=['Hour'], columns='DOW', values='Changes', aggfunc=np.mean).round(0)
p.fillna(0, inplace=True)

columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
p = p.reindex(columns=columns, fill_value=0)  # fill the new columns with 0, not NaN
p[columns] = p[columns].astype(int)

Note the fill_value=0: without it, the columns created by reindex are full of NaN, and the astype(int) cast on the next line fails because NaN cannot be converted to an integer.
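To make this concrete, here is a minimal, self-contained sketch using a few rows of the question's sample data (read from an in-memory string rather than a file, so it runs as-is):

```python
import io

import pandas as pd

# Sample rows from the question: '5Thu' and '7Sat' never appear in DOW.
csv = io.StringIO(
    "DOW,Hour,Changes\n"
    "4Wed,01,237\n"
    "3Tue,07,2533\n"
    "1Sun,01,240\n"
)

df = pd.read_csv(csv)
p = df.pivot_table(index=["Hour"], columns="DOW",
                   values="Changes", aggfunc="mean").round(0)
p = p.fillna(0)  # NaNs for missing (Hour, DOW) combinations in existing columns

columns = ["1Sun", "2Mon", "3Tue", "4Wed", "5Thu", "6Fri", "7Sat"]
# fill_value=0 makes the newly created columns 0 rather than NaN,
# so the integer cast below cannot fail on non-finite values.
p = p.reindex(columns=columns, fill_value=0)
p[columns] = p[columns].astype(int)

print(list(p.columns))  # all seven weekday columns are present
```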
answered by piRSquared, edited by nbro
30

I had a very similar issue. I got the same error because the csv contained spaces in the header: my csv had the header "Gender " (with a trailing space), while I referenced it as:

[['Gender']]

If it's easy enough for you to open the csv, you can use the Excel formula TRIM() to trim the spaces from the cells.

or strip them directly in pandas:

df.columns = df.columns.str.strip()
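A small runnable sketch of this failure mode (the "Gender " header and the in-memory csv are illustrative):

```python
import io

import pandas as pd

# The real column name is "Gender " with a trailing space, not "Gender".
csv = io.StringIO("Gender ,Age\nM,30\nF,25\n")
df = pd.read_csv(csv)

try:
    df[["Gender"]]  # raises: "['Gender'] not in index"
except KeyError as e:
    print("KeyError:", e)

# Stripping whitespace from every column name fixes the lookup.
df.columns = df.columns.str.strip()
print(df[["Gender"]].shape)  # (2, 1)
```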

answered by ILikeWhiskey, edited by diimdeep
  • I am afraid that mine is a totally different case: _"It has always been working until the csv file doesn't have enough coverage"_, even though the symptoms are the ***same***. But thanks for your input. – xpt Nov 23 '17 at 10:55
  • You've covered a really non-obvious case, thank you very much – inthy Jan 02 '20 at 12:36
  • This should be marked as the right answer, as this is a GENERIC answer – abhijat_saxena Jul 07 '20 at 13:44
  • Spot on! Saved another 3 hours of mindless frustration... – bonCodigo Feb 16 '23 at 15:51
4

Try this to clean and normalize your column names:

df.columns = (df.columns.str.strip().str.upper()
              .str.replace(' ', '_')
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))

The regex=False is needed because '(' and ')' are regex metacharacters; without it, str.replace treats the pattern as a regular expression and fails on the unbalanced parenthesis.
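A quick sketch of what the cleanup produces (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(columns=[" First Name ", "Salary (USD)"])

# Strip whitespace, uppercase, replace spaces with underscores,
# and drop parentheses (treated literally via regex=False).
df.columns = (df.columns.str.strip().str.upper()
              .str.replace(' ', '_')
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))

print(list(df.columns))  # ['FIRST_NAME', 'SALARY_USD']
```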
answered by Edwin Paul, edited by Peter
0

I had the same issue.

During initial development I used a .csv file (comma as separator) that I modified slightly before saving it. After saving, the commas had become semicolons.

On Windows this depends on the "Regional and Language Options" customize screen, where you'll find a List separator setting. This is the character Windows applications expect to be the CSV separator.

When testing with a brand-new file I ran into this issue.

I removed the 'sep' argument from the read_csv call. Before:

df1 = pd.read_csv('myfile.csv', sep=',')

after:

df1 = pd.read_csv('myfile.csv');

That way, the issue disappeared.
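If you're unsure which separator a file actually uses, pandas can also detect it for you: pass sep=None with the Python engine, and the delimiter is sniffed from the data. A minimal sketch (the semicolon-separated sample is made up to mimic the locale issue described above):

```python
import io

import pandas as pd

# Semicolon-separated data, as produced by Excel under some Windows locales.
csv = io.StringIO("DOW;Hour;Changes\n4Wed;01;237\n3Tue;07;2533\n")

# sep=None asks pandas to detect the delimiter (via csv.Sniffer);
# this requires the slower pure-Python parsing engine.
df = pd.read_csv(csv, sep=None, engine="python")

print(list(df.columns))  # ['DOW', 'Hour', 'Changes']
```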

answered by Emma