summary: I am searching for misspellings across a large set of CSV data and it is taking forever
I am iterating through a few CSV files (about a million lines total), and for each one I am iterating through a JSON sub-value that has maybe 200 strings to search for. For each item of the JSON value, I add a column to the dataframe, then use a lambda function with the Levenshtein distance algorithm to find misspellings. I then output any row that contains a potential misspelling. Code:
import os
import jellyfish
import pandas as pd

# file_list, json_data, and filepath are defined elsewhere
for file in file_list:  # 20+ files
    # ~50k lines each; "a bunch of other columns..." stands in for the real column list
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])
    for v in json_data.values():  # ~30 JSON values
        for row in v["json_search_string"]:  # ~200 substrings
            df_temp = df.copy()  # copy so the original frame isn't mutated each pass
            df_temp['jelly'] = row
            df_temp['difference'] = df_temp.apply(
                lambda x: jellyfish.levenshtein_distance(x['search column'], x['jelly']), axis=1)
            df_agg = df_temp[df_temp['difference'] < 3]
            if os.path.isfile(filepath + "levenshtein.csv"):
                with open(filepath + "levenshtein.csv", 'a') as f:
                    df_agg.to_csv(f, header=False)
            else:
                df_agg.to_csv(filepath + "levenshtein.csv")
I've tried the same algorithm before, but to keep it short, instead of iterating through all JSON values for each CSV, I just did a single JSON value, like this:
for file in file_list:  # 20+ files
    df = pd.read_csv(file, usecols=["search column", "a bunch of other columns..."])  # ~50k lines each
    for row in data['z']['json_search_string']:
        # Levenshtein algorithm from above
The above loop took about 100 minutes to run! (Edit: each call to the lambda function takes about 1-3 seconds.) And there are about 30 such JSON values in the JSON file. Any ideas on how I can condense the algorithm and make it faster? I've thought maybe I could take all 200-ish JSON substrings, add each one as a column to each df, and somehow run a lambda function that searches all columns at once, but I am not sure how to do that yet. That way I would only iterate over the 20 files 30 times each, as opposed to the thousands of iterations that the third layer of for loop adds. Thoughts?
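Here is a minimal sketch of that idea, assuming the df, json_data, and 'search column' names from the code above: flatten all the substrings into one batch, build one distance column per substring in a single pass over the search column, then keep any row whose smallest distance clears the threshold.

import jellyfish
import pandas as pd

# Sketch only: flatten every "json_search_string" list into one set of
# substrings (the set dedupes any company names repeated across JSON values).
substrings = {s for v in json_data.values() for s in v["json_search_string"]}
search = df['search column']

# One distance column per substring, each computed with a column-wise map
# instead of a row-wise apply.
distances = pd.DataFrame(
    {s: search.map(lambda name: jellyfish.levenshtein_distance(name, s))
     for s in substrings},
    index=df.index)

# Keep rows that are within distance 3 of at least one substring.
df_agg = df[distances.min(axis=1) < 3]

This doesn't reduce the raw number of distance computations, but it drops the per-substring frame copies and repeated CSV appends, and Series.map is usually cheaper than apply with axis=1.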
Notes: Here is an example of what the data might look like. JSON data:
{
    "A": {
        "email": "blah",
        "name": "Joe Blah",
        "json_search_string": [
            "Company A",
            "Some random company name",
            "Company B",
            "etc",
            "..."
        ]
    }
}
And the CSV columns:
ID, Search Column, Other Columns
1, Clompany A, XYZ
2, Company A, XYZ
3, Some misspelled company, XYZ
etc
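For reference, a quick check of why row 1 would be flagged under the difference < 3 cutoff, using the strings from the sample above:

import jellyfish

# "Clompany A" needs one deletion to become "Company A", so the distance is 1,
# which is under the < 3 threshold used in the filter above.
assert jellyfish.levenshtein_distance("Clompany A", "Company A") == 1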