
I'm using Pandas read_csv. The last column is missing data on most rows, as shown in the sample below, but on a few rows the data is there. Instead of treating the empty cells as null, pandas seems to be treating them as NaN. I was trying to create an if statement to show just the rows that have data in that column.

(Sample extract from an American Express export to CSV):

01/01/2018 Mon,,"GOOGLE *SVCSAPPS_NEALW - CC@GOOGLE.COM, CA",Neal Walters,XXXX-XXXXXX-XXXXX,,,4.16,,,GOOGLE SERVICES,"1600 AMPHITHEATRE PKWYMOUNTAIN VIEWCA","94043-1351UNITED STATES",'320180020394601453',

import pandas as pd

colnames = ['DateTime', 'NotUsed2', 'PayeeLong', 'NotUsed4', 'NotUsed5', 'NotUsed6', 'NotUsed7', 'Amount', 'NotUsed9',
            'NotUsed10', 'Payee', 'PayeeAddress', 'PayeeCountry', 'NotUsedX', 'AmexCategory']
data = pd.read_csv(filenameAmexGold, names=colnames, header=None)
# Preview the first 5 lines of the loaded data
print(data.head())

for j in range(len(data)):
    #if not(math.isnan(data['AmexCategory'][j])):
    #    if data['AmexCategory'][j] > ' ':
            print("Row ", j, data['DateTime'][j], data['Payee'][j], data['Amount'][j],
                 "AmexCat=", data['AmexCategory'][j],
                 "PayeeLong=", data['PayeeLong'][j] )

Sample output of data.head():

          DateTime  NotUsed2  ...              NotUsedX AmexCategory
0  01/01/2018  Mon       NaN  ...  '320180021453'          NaN
1  01/02/2018  Tue       NaN  ...  '320180035375'          NaN
2  01/04/2018  Thu       NaN  ...  '320180043184'          NaN
3  01/08/2018  Mon       NaN  ...  '320180080899'   'Software'
4  01/13/2018  Sat       NaN  ...  '320180133142'          NaN

When I include the two commented-out if statements, I get this error:

TypeError: must be real number, not str

Part 2

Similarly, Row 19 has no PAYEE, since it's a payment, not a charge.

01/26/2018 Fri,20,AUTOPAY PAYMENT - THANK YOU,Neal Walters,XXXX-XXXXXX-XXXXX,,,-347.52,,,,,,'320180260752306017',

I know this row shows as NaN in data.head(20), so I want to know how to test it for null or NaN. When I list the dtypes, Payee shows as an object (not a float). To me it's just a string field, but I guess that's an object.
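Checking the underlying types (a quick sketch against the same frame) suggests why the behaviour looks inconsistent: even though the column's dtype is object, the missing cell itself comes back as a float (NaN), while a filled cell comes back as a str:

print(dfAmexGold['Payee'].dtype)      # object
print(type(dfAmexGold['Payee'][19]))  # <class 'float'> -> the empty cell is float('nan')
print(type(dfAmexGold['Payee'][20]))  # <class 'str'>   -> a real payee string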

import math

# This test works: row 19's Payee is NaN (a float), so math.isnan accepts it
print("Test1", dfAmexGold['Payee'][19])
if math.isnan(dfAmexGold['Payee'][19]):
    print("found a NaN value")

# This test fails: row 20's Payee is a real string
print("Test2", dfAmexGold['Payee'][20])
if math.isnan(dfAmexGold['Payee'][20]):
    print("found a NaN value")

The test for row 20 blows up with this:

TypeError: must be real number, not str

The question is how to do if tests on individual items, and why pandas isn't consistent about using null for empty cells instead of NaN.

I also tried the following, but it does not show the row as NULL (it doesn't blow up either):

if dfAmexGold['Payee'][19] is None:
    print("found a NULL value")
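A minimal sketch of the kind of per-value test I'm after, using pandas' own pd.isna / pd.notna (which, as far as I can tell, accept both strings and NaN without raising), assuming the same dfAmexGold frame as above:

import pandas as pd

# pd.isna works on scalars of any type, so it does not raise on strings
if pd.isna(dfAmexGold['Payee'][19]):
    print("Row 19: Payee is missing")

if pd.notna(dfAmexGold['Payee'][20]):
    print("Row 20: Payee =", dfAmexGold['Payee'][20])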

NealWalters
  • Pandas' default conversion uses `numpy.str` and NaN. Specifying the data type, e.g. `pd.read_csv(sourceObj, dtype='string')`, sets the column type to string, and NULL / empty values become empty strings. Pandas also allows setting the data type per column (see the sketch after these comments). Version: Pandas v1.5 – dank8 Mar 03 '23 at 02:25
  • @dank8 - using dtype='string' treats all columns as strings, and empty columns get converted to '' which makes things worse. – Dean Schulze Mar 13 '23 at 19:05
  • @DeanSchulze agree. Is there an alternative? I have been experiencing this pain and have basically resorted to `.fillna(` after every `.read_csv(` or `.merge(`. Are there benefits to sticking with `dtype='str'`? – dank8 Mar 15 '23 at 09:35
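A sketch of the per-column variant mentioned in the comments above, assuming the filenameAmexGold and colnames from the question (behaviour may vary by pandas version; dtype='string' requires pandas 1.0+):

import pandas as pd

# Only the text columns are forced to the nullable string dtype here,
# so numeric columns such as Amount keep their float dtype.
data = pd.read_csv(
    filenameAmexGold,
    names=colnames,
    header=None,
    dtype={'Payee': 'string', 'AmexCategory': 'string'},
)
# With the nullable string dtype, missing cells show up as <NA>,
# and data['AmexCategory'].isna() still identifies them.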

3 Answers


to show just the rows that have data in that column

Just drop the rows with missing values in a particular subset of columns:

data = pd.read_csv(filenameAmexGold, names=colnames, header=None)
data.dropna(subset=['AmexCategory'])
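Note that dropna returns a new DataFrame by default (it does not modify data in place), so you can assign the result to a separate variable and keep the original frame intact; a small sketch, where rows_with_category is just an illustrative name:

# dropna leaves `data` untouched unless inplace=True is passed
rows_with_category = data.dropna(subset=['AmexCategory'])
print(rows_with_category[['DateTime', 'Payee', 'Amount', 'AmexCategory']])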
RomanPerekhrest
  • I don't want to permanently drop them. My first pass is just to show rows that have non-blank data in that column. I may use it in if statements to help categorize the rows that have it, and the other rows, I will use some other logic to categorize. So I need to test if that row has the data there or not without getting the weird errors about NaN. – NealWalters Dec 01 '19 at 19:39
  • I also have cases where Payee is blank, for example on a payment to the credit card company. I tried dfAmexGold = dfAmexGold.dropna(subset=['Payee']), but it didn't remove that row. I read the doc but find it very confusing; it says subset is the columns to "include". – NealWalters Dec 01 '19 at 20:50

You can use pandas.DataFrame.isnull on the column with the missing values and remove the affected rows via boolean indexing:

colnames=['DateTime', 'NotUsed2', 'PayeeLong', 'NotUsed4', 'NotUsed5', 'NotUsed6', 'NotUsed7', 'Amount', 'NotUsed9',
          'NotUsed10', 'Payee', 'PayeeAddress', 'PayeeCountry', 'NotUsedX', 'AmexCategory']

data = pd.read_csv(filenameAmexGold, names=colnames, header=None)

data = data[~data['AmexCategory'].isnull()]

print(data.head())
tmsss
  • Don't want to remove the rows, just loop through them and test the values. – NealWalters Dec 01 '19 at 20:36
  • Why are they are NaN instead of null? – NealWalters Dec 01 '19 at 20:37
  • Tried the above code on my Payee column, which has an "empty cell" when a payment is made (vs a charge being made). The field is nan, not null, so the .isnull in your example doesn't seem to work. – NealWalters Dec 01 '19 at 20:55
  • In that case you can do data = data[data['Payee'] != '']. To loop over the dataframe, I find it easier to use pandas' iterrows function, as sketched below: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html – tmsss Dec 01 '19 at 23:07
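A sketch of that looping approach, combining iterrows with pd.isna so the per-row test works whether the cell holds a string or NaN (column names as in the question):

import pandas as pd

for idx, row in data.iterrows():
    # pd.isna handles real NaN values and ordinary strings without raising
    if pd.isna(row['AmexCategory']):
        continue  # skip rows with no category
    print("Row", idx, row['DateTime'], row['Payee'], row['Amount'],
          "AmexCat=", row['AmexCategory'])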

Not sure this is the best answer, but for now it gets me moving forward:

data.fillna('', inplace=True)

Pandas Doc: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

StackOverflow: Get pandas.read_csv to read empty values as empty string instead of nan
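With the NaNs replaced by empty strings, the per-row test from the question becomes a plain string comparison; a small sketch, assuming fillna('') has been applied as above:

# After fillna(''), missing cells are just empty strings
for j in range(len(data)):
    if data['AmexCategory'][j] != '':
        print("Row", j, data['DateTime'][j], data['Payee'][j],
              "AmexCat=", data['AmexCategory'][j])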

NealWalters