-1

I would like to know how to write a formula that would identify/display records of string/object data type on a Pandas DataFrame that contains leading or trailing spaces.

The purpose for this is to get an audit on a Jupyter notebook of such records before applying any strip functions.

The goal is for the script to identify these records automatically without having to type the name of the columns manually. The scope should be any column of str/object data type that contains a value that includes either a leading or trailing spaces or both.

Please notice. I would like to see the resulting output in a dataframe format.

Thank you!

Link to sample dataframe data

mmunozjr
  • 1
  • 1

1 Answers1

1

You can use:

df['col'].str.startswith(' ')
df['col'].str.endswith(' ')

or with a regex:

df['col'].str.match(r'\s+')
df['col'].str.contains(r'\s+$')

Example:

df = pd.DataFrame({'col': [' abc', 'def', 'ghi ', ' jkl ']})

df['start'] = df['col'].str.startswith(' ')
df['end'] = df['col'].str.endswith(' ')
df['either'] = df['start'] | df['stop']

     col  start    end  either
0    abc   True  False    True
1    def  False  False   False
2   ghi   False   True    True
3   jkl    True   True    True

However, this is likely not faster than directly stripping the spaces:

df['col'] = df['col'].str.strip()

   col
0  abc
1  def
2  ghi
3  jkl
updated answer

To detect the columns with leading/traiing spaces, you can use:

cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()
cols[cols].index

example on the provided link:

Index(['First Name', 'Team'], dtype='object')
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Hi, thank you for your prompt reply. Could you please tell me how I could modify this script so it would show the columns automatically for the records that have values with leading or trailing spaces? Thanks! – mmunozjr Sep 03 '22 at 17:04
  • The challenge with typing the name of the column(s) manually is that the dataframe may contain dozens of columns. Ideally, it would be great to automate the identification process of those records. – mmunozjr Sep 03 '22 at 17:15
  • @mmunozjr then you can use something like: `cols = df.select_dtypes('object').apply(lambda c: c.str.contains(r'^\s+|\s+$')).any() ; cols[cols].index` – mozway Sep 03 '22 at 17:17
  • Sorry, I am getting this error message: AttributeError: Can only use .str accessor with string values! – mmunozjr Sep 03 '22 at 17:23
  • This means you don't have only string data. Please provide a reproducible example in the question – mozway Sep 03 '22 at 17:32
  • Thanks! I just created a link to the sample data set that I am using. – mmunozjr Sep 03 '22 at 18:52
  • OK, see updated answer – mozway Sep 03 '22 at 19:11