3

I have a pandas dataframe.

import pandas as pd

df = pd.DataFrame(['Donald Dump','Make America Great Again!','Donald Shrimp'],
                   columns=['text'])

I would like to add another column to the DataFrame that holds the length of each string in the 'text' column.

For the example above, it would be:

                        text  text_length
0                Donald Dump           11
1  Make America Great Again!           25
2              Donald Shrimp           13

I know I can loop through the rows and get each length, but is there any way to vectorize this operation? I have a few million rows.

jezrael
aerin

2 Answers

6

Use str.len:

print(df.text.str.len())
0    11
1    25
2    13
Name: text, dtype: int64

Sample:

import pandas as pd

df = pd.DataFrame(['Donald Dump','Make America Great Again!','Donald Shrimp'],
                   columns=['text'])
print (df)
                        text
0                Donald Dump
1  Make America Great Again!
2              Donald Shrimp

df['text_length'] = df.text.str.len()
print (df)
                        text  text_length
0                Donald Dump           11
1  Make America Great Again!           25
2              Donald Shrimp           13
jezrael
2

I think the easiest way is to use the apply method on the column (a pandas Series). With apply you can run any function you want on each value.

You could do something like:

df['text_length'] = df['text'].apply(len)

to create a new column with the data you want.


Edit: After seeing @jezrael's answer I was curious and decided to time both approaches with %timeit. I created a DataFrame filled with lorem ipsum sentences (101000 rows), and the difference is quite small. I got:

In [59]: %timeit df['text_length'] = (df.text.str.len())
10 loops, best of 3: 20.6 ms per loop

In [60]: %timeit df['text_length'] = df['text'].apply(len)
100 loops, best of 3: 17.6 ms per loop
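
For anyone who wants to repeat this comparison outside IPython, here is a rough sketch using the standard timeit module. The data is a hypothetical stand-in (slices of one lorem ipsum sentence), not the original test frame, so the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical stand-in data: 101000 strings of varying length cut from
# one lorem ipsum sentence (not the frame used for the timings above).
sentence = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
df = pd.DataFrame({'text': [sentence[:10 + i % 40] for i in range(101000)]})

runs = 10

# Average wall-clock time per call for each approach
t_str = timeit.timeit(lambda: df['text'].str.len(), number=runs) / runs
t_apply = timeit.timeit(lambda: df['text'].apply(len), number=runs) / runs

print(f"str.len():  {t_str * 1e3:.1f} ms per call")
print(f"apply(len): {t_apply * 1e3:.1f} ms per call")
```

Both calls return the same lengths on this data, so the comparison is like for like as long as the column has no missing values.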
pekapa
  • Thanks for the timing. Interesting to see apply is faster than inbuilt str.len! – aerin Jun 07 '16 at 20:17
  • The problem with apply, aside from not being idiomatic, is that it will not work on NaN values; stick to the string methods – Jeff Jun 08 '16 at 00:03
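
To illustrate the NaN point from the comment above, here is a minimal sketch. It assumes a variant of the example frame where one row is missing (the np.nan row is added for illustration and is not in the question's data).

```python
import numpy as np
import pandas as pd

# Variant of the example frame with one missing value (assumption for this demo)
df = pd.DataFrame({'text': ['Donald Dump', np.nan, 'Donald Shrimp']})

# .str.len() skips the missing value and leaves NaN in the result
lengths = df['text'].str.len()
print(lengths)

# apply(len) ends up calling len() on a float NaN, which raises TypeError
try:
    df['text'].apply(len)
    apply_failed = False
except TypeError as exc:
    apply_failed = True
    print("apply(len) failed:", exc)
```

Note that with a missing value present, the result of .str.len() is a float column, since NaN cannot be stored in a plain int64 column.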