
summary: I am searching for misspellings across a bunch of data and it is taking forever

I am iterating through a few CSV files (a million lines total?), and in each I am iterating through a JSON sub-value that has maybe 200 strings to search for. For each loop of the JSON values, I am adding a column to each dataframe, then using a lambda function with the Levenshtein distance algorithm to find misspellings. I then output the result of any row that contains a potential misspelling. Code:

for file in file_list:  #20+ files
    df = pd.read_csv(file, usecols=["search column","a bunch of other columns..."]) #50k lines each-ish
    for v in json_data.values():  #30 ish json values
        for row in v["json_search_string"]:    #200 ish substrings
            df_temp = df
            df_temp['jelly'] = row
            df_temp['difference'] = df_temp.apply(lambda x: jellyfish.levenshtein_distance(x['search column'],x['jelly']), axis=1)
            df_agg = df_temp[df_temp['difference'] <3]
            if os.path.isfile(filepath+"levenshtein.csv"):
                with open(filepath+"levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath+"levenshtein.csv")

I've tried the same algorithm before, but just to keep it short, instead of iterating through all JSON values for each CSV, I just did a single JSON value like this:

for file in file_list:  #20+ files             
    df = pd.read_csv(file, usecols=["search column","a bunch of other columns..."]) #50k lines each-ish
    for row in data['z']['json_search_string']:
        #levenshtein algorithm above

The above loop took about 100 minutes to run through! (Edit: it takes about 1-3 seconds for the lambda function to run each time.) And there are about 30 of them in the JSON file. Any ideas on how I can condense the algorithm and make it faster? I've thought maybe I could take all 200-ish JSON substrings and add each as a column to the df, and somehow run a lambda function that searches all columns at once, but I am not sure how to do that yet. This way I would only iterate the 20 files 30 times each, as opposed to the thousands of extra iterations that the third layer of for loops adds. Thoughts?

Notes: Here is an example of what the data might look like. JSON data:

{
"A": {
    "email": "blah",
    "name": "Joe Blah",
    "json_search_string": [
        "Company A",
        "Some random company name",
        "Company B",
        "etc",
        "..."

And the csv columns:

ID, Search Column,            Other Columns
1,  Clompany A,               XYZ
2,  Company A,                XYZ
3,  Some misspelled company,  XYZ
etc
jleatham
  • Can you give a sample of the first 5 rows of your df? Where is the json coming in? – cs95 Sep 27 '17 at 13:50
  • I am just pulling in the csv file into the df, so it looks similar to the csv example above. The json file is a static file that I am loading into the json_data variable. I can iterate through it using the .values() function – jleatham Sep 27 '17 at 14:52
  • Opening and writing are costly operations, have you thought of doing it only at the end? – Tbaki Sep 27 '17 at 15:43
  • Also, you can store df['search column'] before the loop in, say, temp, then use df_temp = pd.DataFrame([temp,row], columns=['search column',"jelly"]), and finish with df_agg = df[df_temp['difference'] <3]; this should cut some computation and access time. – Tbaki Sep 27 '17 at 15:51
  • @Tbaki Thanks! I actually did have the write operation at the end; I moved it up because the operation took so long that it never finished, and I didn't have ANY of the data that was processed (it ran for 4 hours or so). So I figured I would put the write in the loop. I will try and digest your second comment... I think I get where you are going with it and will try to implement it. – jleatham Sep 27 '17 at 16:55

2 Answers


Well, it is really hard to answer a performance-enhancement question. Depending on the effort required and the expected gain, here are some suggestions.

  1. Small tweak by re-arranging your code logic. Effort: small. Expected enhancement: small. Going through your code, I see that you are comparing the words from each file (20+ of them) against one fixed JSON file. Instead of walking the JSON data again for every file, why not first prepare the fixed word list from the JSON file once, and then use it for all following comparisons? The logic is like:

    # prepare fixed words from JSON DATA
    fixed_words = [] 
    for v in json_data.values():
        fixed_words += v["json_search_string"]
    # looping over each file, and compare them with word from fixed words
    for f in file_list:
        # do the comparison and save.
    
  2. Using multiprocessing. Effort: small. Expected enhancement: medium. Since all your work is similar, why not try multiprocessing? You could apply multiprocessing per file OR when doing dataframe.apply. There are lots of resources on multiprocessing; please have a look. It is easy to implement for your case (a per-file sketch appears at the end of this answer).

  3. Using other implementations of the Levenshtein distance. The bottleneck of your code is the computation of the Levenshtein distance. You used the jellyfish python package, which is pure python (of course, performance is not good for a large set). Here are some other options:

    a. An existing python package with a C/C++ implementation. Effort: small. Expected enhancement: high. Thanks to the comment from @Corley Brigman, editdistance is one option you can use (a short usage sketch follows this list).

    b. Self-implementation in Cython. Effort: medium. Enhancement: medium or high. Check the pandas documentation on enhancing performance.

    c. Self-implementation in C/C++ as a wrapper. Effort: high; expected enhancement: high. Check Wrapping with C/C++.
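For option 3a, here is a minimal usage sketch of editdistance as a drop-in replacement for jellyfish (this is my illustration, not part of the original answer; the tiny dataframe and the threshold of 3 just mirror the question):

import pandas as pd
import editdistance  # pip install editdistance -- a C++ Levenshtein implementation

# Tiny stand-in for one ("search column", "jelly") pairing from the question's loop.
df_temp = pd.DataFrame({"search column": ["Clompany A", "Company A", "Some other name"],
                        "jelly": "Company A"})  # scalar is broadcast to every row
df_temp["difference"] = df_temp.apply(
    lambda x: editdistance.eval(x["search column"], x["jelly"]), axis=1)
df_agg = df_temp[df_temp["difference"] < 3]  # keep likely misspellings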

You could use several of my suggestions together to gain higher performance. Hope this is helpful.
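For suggestion 2, a minimal per-file multiprocessing sketch might look like the following (this is only an illustration of the idea; process_file is a hypothetical helper, and fixed_words / file_list are assumed to be built as in the question and in suggestion 1):

import multiprocessing as mp
import pandas as pd
import jellyfish

def process_file(args):
    # Hypothetical helper: compare one CSV against the precomputed word list
    # and return the rows that look like misspellings (distance < 3).
    file, fixed_words = args
    df = pd.read_csv(file, usecols=["search column"])
    hits = []
    for word in fixed_words:
        diff = df["search column"].apply(lambda s: jellyfish.levenshtein_distance(s, word))
        hits.append(df[diff < 3])
    return pd.concat(hits) if hits else pd.DataFrame()

if __name__ == "__main__":
    # fixed_words and file_list come from the question / suggestion 1 (assumed here).
    with mp.Pool() as pool:
        results = pool.map(process_file, [(f, fixed_words) for f in file_list])
    pd.concat(results).to_csv("levenshtein.csv", index=False)

Each worker handles a whole file, so the files are processed in parallel instead of one after another; writing happens once at the end, which also avoids the repeated open/append inside the original loop.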

rojeeer
  • Good ideas. Just wanted to add that there are already some C++ implementations of levenshtein, e.g. https://pypi.python.org/pypi/editdistance ... – Corley Brigman Sep 27 '17 at 15:35
  • Thanks. Actually, https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance lists the implementation of this algorithm for all modern programming languages. – rojeeer Sep 27 '17 at 15:51
  • Right... I mean, there are already C++ implementations wrapped up in pip-able python packages, better than pure-python implementations like jellyfish, which would accomplish your 3/4 options without much work. – Corley Brigman Sep 27 '17 at 16:00
  • Thanks very much. I updated the answer and added your comment as one option. – rojeeer Sep 27 '17 at 16:10
  • Awesome guys! Will implement a few of the changes and report back. – jleatham Sep 27 '17 at 16:48
  • @rojeeer Keeping everything else the same and swapping jellyfish with editdistance, I got equal performance. About 8 seconds to process 10 loops. I will now try some of your other suggestions. – jleatham Sep 27 '17 at 17:16
  • I will just say that iterating directly through a pandas dataframe using apply is not a good use case for it, and is often very slow, definitely across rows. Oddly, if you transpose your source dataframe and work on columns instead of rows (exactly the same, but axis=0), it is often (but not always) much faster. – Corley Brigman Sep 27 '17 at 17:42

You could change your code to:

for file in file_list:  #20+ files             
    df = pd.read_csv(file, usecols=["search column","a bunch of other columns..."]) #50k lines each-ish
    x_search  = x['search column']
    for v in json_data.values():  #30 ish json values
        for row in v["json_search_string"]:    #200 ish substrings
            mask = [jellyfish.levenshtein_distance(s1,s2) < 3  for s1,s2 in zip(x_search, row) ] 
            df_agg = df[mask]
            if os.path.isfile(filepath+"levenshtein.csv"):
                with open(filepath+"levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filtered_filepath+"levenshtein.csv") 

apply returns a copy of a Series, which can be more expensive:

a = range(10**4)
b = range(10**4,2*(10**4))
%timeit [ (x*y) <3 for x,y in zip(a,b)]
%timeit pd.DataFrame([a,b]).apply(lambda x: x[0]*x[1] < 3 )

1000 loops, best of 3: 1.23 ms per loop

1 loop, best of 3: 668 ms per loop

Tbaki
  • Thanks! I have never used zip before. First, I changed x['search column'] to df['search column']. Then, with the -> for row in v["json..."]: <-- loop, I found out that it is just pulling the row variable as a string and loading it into zip. When you do this, zip will only pull the first letter of the string. So I needed to first run a for loop and append all the json values into a list, then plug that into the zip(x_search, new_list) function. When I do that, however, it doesn't match every list item against every other; it only compares list item 1 to x_search list item 1, and so on. – jleatham Sep 28 '17 at 15:51
  • Thanks for all your help. I used itertools.product() in place of zip() to create a matrix, and was then able to run your mask function. However, I am still unclear how you can then apply the mask to the dataframe and get the results I want. I ended up just making two large lists from the CSVs, combining them with itertools.product(), running editdistance.eval() against each pair, and placing any match into a master list. I then run df[df['search column'].isin(master_list)] to get the original data. It went from 100 minutes to 100 seconds. – jleatham Sep 30 '17 at 03:43
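A minimal sketch of the approach described in the comment above (my reconstruction, not the asker's actual code; I assume one list comes from the CSVs' search column, the other from the JSON search strings, and that df and json_data are loaded as in the question):

import itertools
import editdistance

# All values to check, and all ~200 strings gathered from the JSON data (assumed names).
csv_values = df["search column"].tolist()
search_strings = [s for v in json_data.values() for s in v["json_search_string"]]

# itertools.product builds the full "matrix" of (csv value, search string) pairs.
master_list = [value for value, target in itertools.product(csv_values, search_strings)
               if editdistance.eval(value, target) < 3]

# Pull the original rows whose search column matched a potential misspelling.
result = df[df["search column"].isin(master_list)]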