
I am running a Python script (a Kaggle script). It works in a Python 3.4.5 virtualenv, but not in a 3.5.2 one.

I am not sure why and I am not familiar with the [[0]] syntax. Below is the snippet.

import pandas as pd
data = pd.read_csv(r'path\train.csv')
labels_flat = data[[0]].values.ravel()

It should produce a list of values from the csv's first column.

In 3.5.2 I get this error:

KeyError: '[0] not in index'

I tried to replicate the value with

labels_flat = []
lf = data.values.tolist()
for row in lf:
    labels_flat.append(row[0])

But I don't think it is the same thing.

Chad Crowe
  • Then you need to use `data[0]` not `data[[0]]`. – Willem Van Onsem Jul 28 '17 at 17:53
  • No, it is a dataframe, not a list. having data[0] produces the error "KeyError: 0". This behavior occurs in both 3.4.5 and 3.5.2 – Chad Crowe Jul 28 '17 at 17:55
  • The non-ambiguous way to get the first column by integer index is `data.iloc[:, 0]`. – Igor Raush Jul 28 '17 at 17:56
  • Place that as the answer and I will accept it Igor Raush. That worked on 3.4.5 and 3.5.2. I am still not sure about the [[0]] syntax. An explanation would be great. – Chad Crowe Jul 28 '17 at 17:59
  • @ChadCrowe Print out df.columns. What do you see? – cs95 Jul 28 '17 at 18:02
  • @cᴏʟᴅsᴘᴇᴇᴅ it prints out the dataframe's columns. Index(['label', 'pixel0', 'pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5', 'pixel6', 'pixel7', 'pixel8', ... 'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779', 'pixel780', 'pixel781', 'pixel782', 'pixel783'], dtype='object', length=785) This is for a 28x28 picture MNIST dataset in Kaggle. The first column is the correct label for this training data. – Chad Crowe Jul 28 '17 at 18:04
  • @ChadCrowe In that case you must index it by data[['label']] or else use iloc as mentioned. – cs95 Jul 28 '17 at 18:15

1 Answer


I don't think the problem is with the syntax; your DataFrame just does not contain the column label you are looking for.

For me this works:

In [1]: data = pd.DataFrame({0:[1,2,3], 1:[4,5,6], 2:[7,8,9]})
In [2]: data[[0]]
Out[2]: 
   0
0  1
1  2
2  3
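By contrast, here is a minimal sketch of what I assume happens with your data. The small frame below is a made-up stand-in for train.csv (only the column names 'label', 'pixel0', 'pixel1' come from your comment; the values are invented). With a recent pandas version, asking for the label 0 fails because no column is named 0, while selecting by name or by position works:

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle CSV: string column labels, none named 0.
data = pd.DataFrame({"label": [1, 0, 4], "pixel0": [0, 0, 0], "pixel1": [0, 0, 0]})

# Looking up the label 0 fails, because 0 is not one of the column names.
try:
    data[[0]]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Select the first column by its name ...
by_label = data[["label"]].values.ravel()

# ... or by its integer position, which does not depend on the names at all.
by_position = data.iloc[:, 0].values
```

Both by_label and by_position hold the values of the first column as a flat NumPy array.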

I think what confuses you about the [[0]] syntax is that square brackets are used in Python for two completely different things, and the [[0]] expression uses both:

A. [] is used to create a list. In the above example [0] creates a list with the single element 0.

B. [] is also used to access an element of a list (or dict, ...). So data[0] returns the element of data stored under index (or key) 0.
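A small plain-Python sketch of the two uses, with made-up variable names. It also shows that an ordinary list does not accept a list as its index:

```python
# A. Square brackets create a list.
single = [0]                 # a list holding the one element 0

# B. Square brackets look up an element of a container.
letters = ["a", "b", "c"]
first = letters[0]           # the element at index 0

# Using both at once on a plain list is an error: list indices must be
# integers or slices, not lists.
try:
    letters[[0]]
    list_index_rejected = False
except TypeError:
    list_index_rejected = True
```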

The next confusing thing is that while ordinary Python lists are indexed by integers (e.g. data[4] is the element at index 4, i.e. the fifth element), pandas DataFrames can also be indexed by lists. This is syntactic sugar for accessing multiple columns of the DataFrame at once. So in my example above, to get columns 0 and 1 you can do:

In [3]: data[[0, 1]]
Out[3]: 
   0  1
0  1  4
1  2  5
2  3  6

Here the inner [0, 1] creates a list with the elements 0 and 1. The outer [ ] retrieves the columns of the DataFrame, using the inner list as the set of column labels to look up.

For more readability, look at this; it is exactly the same:

In [4]: l = [0, 1]

In [5]: data[l]
Out[5]: 
   0  1
0  1  4
1  2  5
2  3  6

If you only want the first column (column 0) you get this:

In [6]: data[[0]]
Out[6]: 
   0
0  1
1  2
2  3

Which is exactly what you were looking for.
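One detail worth spelling out, as a sketch on the same example frame: a one-element list still returns a DataFrame (one column wide), whereas a bare label returns a Series. That is why the original snippet calls .values.ravel() to flatten the result. The loop from the question gives the same values, just as a Python list instead of a NumPy array.

```python
import pandas as pd

data = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})

frame = data[[0]]     # list of labels  -> DataFrame with shape (3, 1)
series = data[0]      # single label    -> Series with shape (3,)

# Flatten the one-column DataFrame to a 1-D NumPy array, as in the question.
labels_flat = data[[0]].values.ravel()

# Same values, selected by position and returned as a plain Python list.
labels_list = data.iloc[:, 0].tolist()

# The loop from the question, written as a list comprehension.
labels_loop = [row[0] for row in data.values.tolist()]
```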

Johannes