0

I have a dataset with several hundred Account numbers and their Descriptions. It has been imported from Excel into a Python dataframe. The Descriptions, in Excel, have varying numbers of leading and trailing white spaces. The Account number is an integer, Description is an object, and End Balance is a float.

I've tried stripping leading and trailing spaces, replacing multiple white space with single but when I use groupby it does not recognize the Descriptions as identical. If I only groupby Account I get 435 rows, which is correct. If I groupby Description I get over 1100 which is not correct (that's the original number of rows). Grouping by Account and Description yields same result as grouping by Description. This implies to me that the Descriptions are still not seen as identical.

I've also tried not stripping at all and leaving as original with no joy.

Any thoughts of how to make the Descriptions identical?

# Replaces multiple white spaces in string to a single whitespace
PE5901_df['Description'] = PE5901_df['Description'].str.replace('\s+', ' ', regex=True)

# Strip leading and trailing spaces from fields to avoid groupby, concat, and merge issues later.
PE5901_df['Description'] = PE5901_df['Description'].str.strip()

# Groupby Account number and Asset name - sums individual rows with identical account numbers.
PE5901_df=PE5901_df.groupby(['Account','Description'],as_index=False).sum()

Dataframe

Wolfe
  • 77
  • 2
  • 9
  • 2
    welcome to SO! Please avoid using screenshots and paste data directly and format as code – Vivek Kalyanarangan Jan 18 '22 at 15:56
  • Hi, just wanted to add visual of what the Descriptions look like. – Wolfe Jan 18 '22 at 15:57
  • 2
    This seems to be more data quality issue than programming one... I would recommend to get a few random samples of `Account`, and check unique values of `Description` for them. Maybe, you can find any pattern you have missed so far. – Jaroslav Bezděk Jan 18 '22 at 16:02
  • Hi! Thank you for suggestion. I ran 'PE5901_df.nunique()' and, as expected, got 435 unique values for Account and 1136 for Description. – Wolfe Jan 18 '22 at 16:10
  • 1
    Can you include / paste a couple of "descriptions" from your dataframe that should be identical but are not identical. Maybe there are other non-printable characters? – Mortz Jan 18 '22 at 16:12
  • Mortz, thank you. You are right. About 100 or so characters to the far right of Excel cells there are weird characters getting tacked on which aren't visible- some weird import thing I hadn't run across before. *sigh*. I appreciate the guidance y'all have given me. – Wolfe Jan 18 '22 at 16:42

1 Answers1

0

Here is one way to inspect the data in the Descriptions column. This would show if the issue is whitespace, or something else.

import pandas as pd

description = [
    '111001 cash deposit', '111001 cash deposit ', '111001 cash deposit  ',
    ' 111001 cash deposit', '  111001 cash deposit', '   111001 cash deposit',
]

elements = pd.Series(description).sort_values().unique()

for element in elements:
    print(f">>{element}<<")

Print-out is:

>>   111001 cash deposit<<
>>  111001 cash deposit<<
>> 111001 cash deposit<<
>>111001 cash deposit<<
>>111001 cash deposit <<
>>111001 cash deposit  <<

One can remove leading/trailing whitespace with the .str accessor:

elements = pd.Series(description).str.strip().sort_values().unique()
jsmart
  • 2,921
  • 1
  • 6
  • 13
  • Thank you- I had used: 'PE5901_df['Description'] = PE5901_df['Description'].str.strip()' in the original code to try to eliminate issues like that. But it didn't solve weird hidden characters. That was a new issue for me to try to figure out how to solve. I really appreciate your taking the time to offer help. – Wolfe Jan 18 '22 at 16:46
  • The built-in functions `ord()` and `chr()` may also be helpful. This expression looks at an en-dash and an ordinary dash: `(ord("–"), ord("-")) == (8211, 45)` -- it evaluates to True. – jsmart Jan 18 '22 at 17:19