
This is my code to read a CSV file and convert every word in one of its columns from plural to singular form:

import pandas as pd
from textblob import TextBlob as tb
data = pd.read_csv(r'path\to\data.csv')

for i in range(len(data)):
    blob = tb(data['word'][i])
    singular = blob.words.singularize()  # This makes singular a list
    data['word'][i] = ''.join(singular)  # Converting the list back to a string

But this code has been running for minutes now (and might keep running for hours if I don't stop it?). Why is that? When I checked a few words individually, the conversion happened instantly. There are only 1060 rows (words to convert) in the file.

EDIT: It finished running in about 10-12 minutes.

Here's some sample data:

Input:

word
development
investment
funds
slow
company
commit
pay
claim
finances
customers
claimed
insurance
comment
rapid
bureaucratic
affairs
reports
policyholders
detailed

Output:

word
development
investment
fund
slow
company
commit
pay
claim
finance
customer
claimed
insurance
comment
rapid
bureaucratic
affair
report
policyholder
detailed
Kristada673
  • You are iterating over a data frame. Performance will be terrible. – rafaelc Jul 10 '18 at 02:10
  • @RafaelC Oh! I didn't know that! Why is that so? And what should I use to store the file if not a dataframe? I find multidimensional lists a pain in the a** to work with in Python - it's not as intuitive as, say, in C. – Kristada673 Jul 10 '18 at 02:12
  • Because you're constantly shuffling your data across the Python/C threshold, which is expensive. Also, `.words` is a pretty complex operation; `.singularize` might be the fastest thing you have in your code. – Amadan Jul 10 '18 at 02:15
  • Can you provide some sample input/output? Also, you are doing `data['word'][i]` and are probably getting a warning that you're changing a copy and not your df? – rafaelc Jul 10 '18 at 02:21
  • @RafaelC Yes, it did throw that warning! I edited the question to include a portion of the input and output. – Kristada673 Jul 10 '18 at 02:28
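
To illustrate the point made in the comments, here is a minimal sketch that pulls the column out once, transforms it in plain Python, and assigns it back in a single step, instead of reading and writing one DataFrame cell per loop iteration. The `naive_singularize` function is a hypothetical stand-in so the sketch runs without textblob; in real use you would presumably call `Word(w).singularize()` in its place. The four sample words are taken from the question.

```python
import pandas as pd


def naive_singularize(word: str) -> str:
    """Hypothetical stand-in for textblob's Word(word).singularize().

    It only strips a trailing "s" and is not a real singularizer.
    """
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


data = pd.DataFrame({"word": ["funds", "slow", "customers", "claimed"]})

# Read the column once as a plain list, so the loop never indexes the DataFrame
words = data["word"].tolist()

# Assign the whole column back in one step; no chained data['word'][i] = ... writes
data["word"] = [naive_singularize(w) for w in words]

print(data["word"].tolist())
```

Assigning the whole column at once also avoids the chained-assignment warning mentioned above, because no write goes through `data['word'][i]`.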

1 Answer


What about this?

In [1]: import pandas as pd

In [2]: from textblob import Word

In [3]: s = pd.read_csv('text', squeeze=True, memory_map=True)

In [4]: type(s)
Out[4]: pandas.core.series.Series

In [5]: s = s.apply(lambda w: Word(w).singularize())

In [6]: s
Out[6]:
0      development
1       investment
2             fund
3             slow
4          company
5           commit
6              pay
7            claim
8          finance
9         customer
10         claimed
11       insurance
12         comment
13           rapid
14    bureaucratic
15          affair
16          report
17    policyholder
18        detailed
Name: word, dtype: object

I use `squeeze` here so that `read_csv` returns a Series instead of a DataFrame, because the word file has only one column. In addition, `memory_map` can be used if the word file is large.

Can you test the performance with your data?
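
A note on the `squeeze` option used above: newer pandas releases deprecated the `squeeze` argument of `read_csv` and later dropped it, so the call may fail depending on your version. A minimal sketch of an equivalent, assuming a one-column file, is to load the DataFrame and call `.squeeze("columns")` on it. The in-memory `csv_text` below is a hypothetical stand-in for the real file.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the one-column word file
csv_text = "word\nfunds\nslow\nreports\n"

# Load as a DataFrame, then collapse the single column into a Series
s = pd.read_csv(io.StringIO(csv_text)).squeeze("columns")

print(type(s).__name__)
print(s.name)
```

The rest of the approach, `s.apply(...)`, works the same on the resulting Series.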