
I have a pandas DataFrame with two columns, "review" (text) and "sentiment" (1/0), which I split into training and test sets:

X_train = df.loc[0:25000, 'review'].values
y_train = df.loc[0:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

But after converting to NumPy arrays with the .values attribute, I get arrays of the following shapes:

print(df.shape)       # (50000, 2)
print(X_train.shape)  # (25001,)
print(y_train.shape)  # (25001,)
print(X_test.shape)   # (25000,)
print(y_test.shape)   # (25000,)

So, as you can see, taking .values seems to have added one extra row to the training arrays. This is really strange and I can't find the error.

Mike Müller
mokebe

1 Answer


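df.loc is label-based, so a slice includes both endpoints: df.loc[0:25000] selects the rows labeled 0 through 25000, which is 25001 rows. The extra row comes from the slice, not from .values. Use iloc: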

df.iloc[:25000, 1].values  # 1 stands for the integer position of the 'review' column, for example

if you want NumPy-like slicing.

With iloc you need to supply both rows and columns as integers or integer slices.
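
Applied to the split in the question, a minimal sketch could look like the following. The small DataFrame is made-up stand-in data (6 rows instead of 50000), n_train is a hypothetical name, and df.columns.get_loc is used here to look up the integer column positions instead of hard-coding them:

```python
import pandas as pd

# Stand-in data (assumption): 6 rows instead of the 50000 in the question.
df = pd.DataFrame({
    "review": ["a", "b", "c", "d", "e", "f"],
    "sentiment": [1, 0, 1, 0, 1, 0],
})

n_train = 3  # hypothetical split point; this would be 25000 in the question

# iloc needs integer positions, so look up where each column sits.
review_col = df.columns.get_loc("review")
sentiment_col = df.columns.get_loc("sentiment")

# Position-based slices: the end is exclusive, so the two halves do not overlap.
X_train = df.iloc[:n_train, review_col].values
y_train = df.iloc[:n_train, sentiment_col].values
X_test = df.iloc[n_train:, review_col].values
y_test = df.iloc[n_train:, sentiment_col].values

print(X_train.shape, X_test.shape)  # (3,) (3,)
```

Note that in the original loc version the row labeled 25000 ends up in both the training and the test set, because df.loc[0:25000] and df.loc[25000:] each include it.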

Example

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df
   a  b
0  1  4
1  2  5
2  3  6

This is label based, i.e. upper bound inclusive:

>>> df.loc[:1, 'a']
0    1
1    2
Name: a, dtype: int64

This works like slicing in NumPy, i.e. upper bound exclusive:

>>> df.iloc[:2, 0]
0    1
1    2
Name: a, dtype: int64
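
The difference is easier to see when the index labels are not 0, 1, 2, ... The following sketch uses a made-up index of 10, 20, 30 (an assumption for illustration only) to show that loc slices by label and iloc slices by position:

```python
import pandas as pd

# Made-up index labels (assumption) so that labels and positions differ.
df = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])

by_label = df.loc[10:20, "a"]   # labels 10 through 20, both endpoints included
by_position = df.iloc[0:2, 0]   # positions 0 and 1, end excluded

print(by_label.tolist())     # [1, 2]
print(by_position.tolist())  # [1, 2]

# Same stop value, different meaning: label 20 versus position 1.
print(len(df.loc[:20, "a"]))   # 2
print(len(df.iloc[:1, 0]))     # 1
```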
Mike Müller