1

I have a dataframe where the first column, lets call it: df['Name'], looks like the "actual" column, and Id like to change it to look the "desired" column in order to do operations on following columns. Here are the actual and desired outputs:

Name (actual) Name (desired)
string1 string1
Number string1
Number string1
Number string1
string2 string2
Number string2
Number string2
Number string2
Number string2
string3 string3
Number string3
Number string3
string4 string4
Number string4
etc etc

There is no fixed number of 'numbers', between the names. Could be 3, could be 300.

I have the following code to forward fill the names as far as the next name:

df['Name'].fillna(method = 'ffill', inplace = True)

but it only works when the cells with numbers are empty.

So, I need to remove all the numbers from the ['Name'] series first, leaving empty cells:

Name
String1
blank
blank
blank
String2
blank
etc...

I cant find a way to remove the numbers. Ive tried some suggestions I found in other similar posts:

1)

df[df['Name'].apply(lambda x: isinstance(x, str))]

but it seems to do nothing.

2)

df['Name'] = df['Name'].apply(lambda x: isinstance(x, str))

turns the whole ['Name'] series to True, both strings and numbers.

3)

df['Name'] = df[df['Name'].apply(lambda x: isinstance(x, str))]

which gives a value error.

I found the result to 2) strange, but discovered df['Name'].dtype gave me dtype('O'), which Id never seen before, but suggests the names (strings) and numbers (integers/floats) in the ['Name'] series are the same type (numpy objects). Not sure if/how its relevant, but I understood it to mean that Python sees both the text and numbers as being the same type.

Im stuck. Any suggestions on how to remove the numbers and fill the way I explained?

Thanks!

2 Answers2

2

You're close. Try this :

import pandas as pd
import numpy as np

df = pd.DataFrame({'Name (actual)': ['string1', 334, 34, 124, 'string2', 23, 11, 89, 76, 'string3', 53, 4]})

df['Name (desired)'] = df['Name (actual)'].apply(lambda x: x if isinstance(x, str) else np.nan).ffill()

>>> print(df)

enter image description here

Timeless
  • 22,580
  • 4
  • 12
  • 30
  • Hi. Thanks for that! Your code works fine, on the df I have, it doesnt: it leaves everything as-is... the text as text the numbers as numbers. – AndysPythonStuff Aug 27 '22 at 21:25
  • Maybe it means that the format of your real `df` doesn't correspond to the one you posted in your question. – Timeless Aug 27 '22 at 21:31
  • Yes, youre right! My df is quite big, and imported from a csv. I simplified it for the question, cos I didnt think it would make any difference. (sorry... noob mistake!) I kind of get the feeling isinstance is reading every entry as a str, even when its a number - it must be if it doesnt reach 'else ffill' ... Im really confused! – AndysPythonStuff Aug 27 '22 at 22:01
  • Are you specifying any `dtype` when using `pd.read_csv()` ? – Timeless Aug 27 '22 at 22:12
  • No: Like this: with open(file_to_use, 'r') as fileObject: reader_object = pandas.read_csv(fileObject, delimiter=';', decimal=',', header=0, names=('Noticia','Ativo','Variacao','Maximo','Ultimo','Minimo','Negocios','Semana',)) The 'Noticia' column is the one that has the text entries followed by numbers and sometimes empty. Its this column Im trying to clean up, removing all numbers, then filling all empty spaces with the text. – AndysPythonStuff Aug 27 '22 at 22:19
  • The automatic dtype of the column Im trying to clean is 'object'. An all-text column is also object, the others are all float64. – AndysPythonStuff Aug 27 '22 at 22:38
  • You're right, when reading the column from a .csv, the cleanup doesn't work with the suggestion I gave. However, @mozway's solution works 100%. You should use it instead. – Timeless Aug 27 '22 at 22:44
  • Yes, it does seem to: Ive tried it on a simplified file, later Ill try it on the original. L'Artise... thanks for your help! Even though it didnt work, Ive learned a bunch, especially about dtyes, so thanks for your effort. I found another way that I might try later, writing a function to return nothing when not text, and using converters parameter when reading in. – AndysPythonStuff Aug 27 '22 at 22:59
  • You’re welcome @AndysPythonStuff. I learned new things as well. Happy coding ! – Timeless Aug 27 '22 at 23:01
2

Using apply is not efficient, prefer a vectorial method:

# identify numbers:
m = pd.to_numeric(df['Name'], errors='coerce').notna()

# mask and ffill:
df['Name'] = df['Name'].mask(m).ffill()

Example (assigning to new column "Name 2" for clarity);

       Name    Name2
0   string1  string1
1       123  string1
2       123  string1
3       123  string1
4   string2  string2
5       123  string2
6       123  string2
7       123  string2
8       123  string2
9   string3  string3
10      123  string3
11      123  string3
mozway
  • 194,879
  • 13
  • 39
  • 75
  • 1
    Hi Mozway. Sorry didnt upvote this, I didnt know it was a thing at the time. I used your idea in the end and it works fine. Thanks for the idea and the lesson! – AndysPythonStuff Apr 01 '23 at 00:21