I'm pulling data from Impala using impyla and converting it to a DataFrame with as_pandas. I'm using pandas 0.18.0 and Python 2.7.9.

I'm trying to calculate the sum of every column in the DataFrame and keep only the columns whose sum is greater than a threshold.

self.data = self.data.loc[:,self.data.sum(axis=0) > 15]

But when I run this, I get the following error:

pandas.core.indexing.IndexingError: Unalignable boolean Series key provided

Then I compared the length of the sums with the number of columns:

print 'length : ',len(self.data.sum(axis = 0)),' all columns : ',len(self.data.columns)

The two lengths are different:

length : 78 all columns : 83

I also get the following warning:

C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception

To achieve my goal, I tried another approach:

for column in self.data.columns:
    sum = self.data[column].sum()
    if( sum < 15 ):
        self.data = self.data.drop(column,1) 

Now I get a different error:

TypeError: unsupported operand type(s) for +: 'Decimal' and 'float'
C:\Python27\lib\decimal.py:1150: RuntimeWarning: tp_compare didn't return -1 or -2 for exception

Then I printed the data type of each column:

print 'dtypes : ', self.data.dtypes

Every column is one of int64, object, or float64. So I tried converting the object columns to numeric:

self.data.convert_objects(convert_numeric=True)

I still get the same errors. Please help me solve this.

Note: None of the columns contain strings (characters), missing values, or empty values. I checked this using self.data.to_csv.

I'm new to pandas and Python, so please bear with me if this is a silly question. I just want to learn.

Manoj Kumar

1 Answer

Please review the simple code below; it should help you understand the cause of the error.

import pandas as pd
import numpy as np


df = pd.DataFrame(np.random.random([3,3]))
df.iloc[0,0] = np.nan

print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]

df.iloc[0,0] = 'string'

print df
print df.sum(axis=0) > 1.5
print df.loc[:, df.sum(axis=0) > 1.5]

          0         1         2
0       NaN  0.336250  0.801349
1  0.930947  0.803907  0.139484
2  0.826946  0.229269  0.367627

0     True
1    False
2    False
dtype: bool

          0
0       NaN
1  0.930947
2  0.826946

          0         1         2
0    string  0.336250  0.801349
1  0.930947  0.803907  0.139484
2  0.826946  0.229269  0.367627

1    False
2    False
dtype: bool

Traceback (most recent call last):
...
pandas.core.indexing.IndexingError: Unalignable boolean Series key provided

In short, your data needs additional preprocessing. Check which columns have the object dtype:

df.select_dtypes(include=['object'])

If those columns hold strings that represent numbers, you can convert them with df.astype(); otherwise you should remove them.
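
As a minimal sketch of that preprocessing, assume the object columns hold Decimal values (which the traceback in the question suggests, but which is not confirmed) and use the made-up column names a, b, and c. The object columns are located with select_dtypes, converted with astype, and the converted frame is assigned back before selecting columns by their sum.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical data: column 'a' holds Decimal objects, so pandas stores it as object.
df = pd.DataFrame({
    'a': [Decimal('10.5'), Decimal('7.25')],
    'b': [1.0, 2.0],
    'c': [20, 30],
})

# Names of the columns stored as generic Python objects.
object_cols = df.select_dtypes(include=['object']).columns

# astype returns a new DataFrame; nothing changes unless the result is assigned back.
df = df.astype({col: float for col in object_cols})

# Every column is numeric now, so the boolean mask has one entry per column.
selected = df.loc[:, df.sum(axis=0) > 15]
print(list(selected.columns))  # ['a', 'c']
```

Column b sums to 3.0 and is dropped; a (17.75) and c (50) are kept.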

su79eu7k
  • All of the columns contain only numbers, with no strings or NaN values. I've added this point to my question. – Manoj Kumar May 06 '16 at 12:44
  • @ManojKumar `to_csv()` does not tell you the value types in your DataFrame; it only shows the values after they have been written out. Did you check dtypes again after `self.data.convert_objects(convert_numeric=True)`? Are the `object` columns gone now? If not, you probably didn't assign the result back, as in `self.data = self.data.convert_objects(convert_numeric=True)`. Please check. – su79eu7k May 06 '16 at 13:00
  • It's working now. I was missing the assignment. Thanks @su79eu7k – Manoj Kumar May 06 '16 at 13:29
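
For reference, the TypeError quoted in the question can be reproduced with the standard library alone, which is why converting Decimal values to float first makes the sum work. This is only an illustrative sketch; the variable names are made up.

```python
from decimal import Decimal

price = Decimal('10.5')  # hypothetical value, like one cell of a DECIMAL column

raised = False
try:
    price + 0.5  # Decimal and float cannot be added directly
except TypeError as exc:
    raised = True
    print(exc)

# Converting explicitly to float makes the addition well defined.
total = float(price) + 0.5
print(total)  # 11.0
```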