
I'm trying to create a pandas DataFrame from a list of School objects, each of which holds one row of information. The problem is that it takes hours to complete: running it in a Jupyter notebook, it crashes after about an hour. I have an ordered list of School objects, defined as follows:

class School:
    def __init__(self, distance, row):
        self.distance_to_origin = distance
        self.row = row
        self.name = row['name']
        self.lat = row['lat']
        self.lon = row['lon']
    def get_distance(self):
        return self.distance_to_origin
    def get_lat_lon(self):
        return [self.lat, self.lon]
    def get_name(self):
        return self.name
    def get_row(self):
        return self.row
    def __str__(self):
        return str(self.distance_to_origin)
    def __repr__(self):
        return str(self.distance_to_origin)

I'm then trying to create a pandas DataFrame from this list. The overall goal is to remove duplicate schools, where a duplicate is a school within 1600 meters of another that also has a similar name.

The code that removes schools is the following:

def get_duplicates(ordered_list):
    total_dups = 0
    new_frame = pd.DataFrame()
    for i in trange(len(ordered_list) - 1):
        new_frame = new_frame.append(ordered_list[i].get_row())
        ite = i + 1
        while ite < len(ordered_list) and abs(ordered_list[i].get_distance() - ordered_list[ite].get_distance()) < 1600:
            if vincenty(ordered_list[i].get_lat_lon(), ordered_list[ite].get_lat_lon()).meters < 1600:
                if fuzzy_match(ordered_list[i].get_name(), ordered_list[ite].get_name()):  # it's a match, don't add
                    total_dups += 1
                else:  # within distance, but the name doesn't match
                    new_frame = new_frame.append(ordered_list[ite].get_row())
            else:  # not within distance
                new_frame = new_frame.append(ordered_list[ite].get_row())
            ite += 1
    print(total_dups)
    return new_frame
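As a point of comparison for the `append` pattern above (toy rows, not the real data): each append into a growing DataFrame copies every existing row into a brand-new frame, so n appends cost O(n²) row copies, while accumulating plain dicts in a Python list and building the frame once is linear. A minimal self-contained sketch of the two patterns:

```python
import pandas as pd

# Hypothetical rows standing in for the school records.
rows = [{'name': f'School {i}', 'lat': 40.0 + i, 'lon': -75.0 - i}
        for i in range(5)]

# Quadratic pattern: every step copies the whole frame built so far
# (same cost profile as DataFrame.append in a loop).
slow = pd.DataFrame()
for r in rows:
    slow = pd.concat([slow, pd.DataFrame([r])], ignore_index=True)

# Linear pattern: accumulate in a plain list, build the frame once.
kept = []
for r in rows:
    kept.append(r)  # O(1) amortized
fast = pd.DataFrame(kept)

print(len(fast))  # 5
```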

`vincenty` is from `geopy.distance`
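`vincenty(p1, p2).meters` takes two `(lat, lon)` pairs and returns the ellipsoidal distance in meters. For readers without geopy installed, here is a rough self-contained stand-in using the haversine formula (which assumes a spherical Earth, so its result differs slightly from Vincenty's):

```python
import math

def haversine_m(p1, p2):
    # Great-circle distance in meters between two (lat, lon) pairs;
    # a rough stand-in for geopy.distance.vincenty(p1, p2).meters.
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))  # mean Earth radius

print(round(haversine_m((0.0, 0.0), (0.0, 1.0))))  # ~111 km per degree at the equator
```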

`fuzzy_match` is:

from nltk import stem, tokenize
from nltk.metrics import edit_distance

stemmer = stem.PorterStemmer()

def normalize(s):
    words = tokenize.wordpunct_tokenize(s.lower().strip())
    return ' '.join(stemmer.stem(w) for w in words)

def fuzzy_match(s1, s2, max_dist=3):
    return edit_distance(normalize(s1), normalize(s2)) <= max_dist
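`edit_distance` computes the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions to turn one string into the other. For readers without nltk handy, a self-contained pure-Python equivalent (assuming the default unit costs):

```python
def levenshtein(a, b):
    # Classic dynamic-programming Levenshtein distance; mirrors what
    # nltk.metrics.edit_distance computes with default costs.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

print(levenshtein('lincoln high', 'lincon high'))  # 1
```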

`edit_distance` is from `nltk.metrics`

What am I doing wrong that is causing this to take hours? Is there a way to optimize it? Thanks!

39fredy
  • How big is your `ordered_list`? Because you are first iterating over all its elements (n operations), then in the while loop you scan `ite` up toward len(ordered_list) (another n operations). So your code is O(n²) –  Mar 08 '18 at 15:20
  • Roughly 43k. I've created a DataFrame from another DataFrame and that took about 20 minutes (same 43k schools) – 39fredy Mar 08 '18 at 15:23
  • Not sure why you are using `trange` or how it performs compared to a normal `range`. Also, if your list is ordered (as the name suggests) you don't need to iterate over all elements: once one element has a distance gap > 1600, all the following will be at a greater distance –  Mar 08 '18 at 15:29
  • https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html : "Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once." – avigil Mar 08 '18 at 15:32
  • also you are calculating distance for every entry at least twice – avigil Mar 08 '18 at 15:34
  • And even a better approach is not to request the distance of each school again and again. Create a numpy array of distances and only check the edit distance (expensive operation) of those ones where the distance is lower than 1600. –  Mar 08 '18 at 15:34
  • trange is function from the tqdm library that displays a progress bar. I tried it with range and it was the same rate. – 39fredy Mar 08 '18 at 15:38
  • The reason I calculate the distance twice is that the list itself is ordered by distance to the equator. I then have to check if the distance between two schools is within a mile, because their equatorial distances can be relatively close while their actual distance is not (i.e., both are 1000 meters from (0, 0), but the distance between them is much greater) – 39fredy Mar 08 '18 at 15:40
  • @SembeiNorimaki What I'm doing is taking a stepping approach: I'm at school i and check all schools with an equatorial distance gap of less than a mile. So I go from i to i + delta_0 (where delta_0 < 1600), then from i + 1 to i + delta_1 (where delta_1 < 1600), etc. – 39fredy Mar 08 '18 at 15:44
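Pulling the comment suggestions together (mark duplicates once instead of re-checking them, stop the inner scan as soon as the sorted distance gap exceeds 1600 m, and build the frame with a single constructor call), a hedged sketch of one possible restructuring with toy data; `same_school` here is a stand-in for the real vincenty-plus-`fuzzy_match` test:

```python
import pandas as pd

# Toy stand-ins; the real code uses School objects, vincenty and fuzzy_match.
schools = [
    {'name': 'Lincoln High',     'dist': 1000.0, 'row': {'name': 'Lincoln High'}},
    {'name': 'Lincoln Hi',       'dist': 1500.0, 'row': {'name': 'Lincoln Hi'}},
    {'name': 'Far Away Academy', 'dist': 9000.0, 'row': {'name': 'Far Away Academy'}},
]
dists = [s['dist'] for s in schools]  # precomputed once, not re-fetched per pair

def same_school(a, b):
    # Hypothetical stand-in for: vincenty(...).meters < 1600 and fuzzy_match(...)
    return a['name'][:7] == b['name'][:7]

kept, is_dup = [], [False] * len(schools)
for i, s in enumerate(schools):
    if is_dup[i]:
        continue                      # already matched to an earlier school
    kept.append(s['row'])
    j = i + 1
    # sorted by distance, so we can stop as soon as the gap exceeds 1600 m
    while j < len(schools) and dists[j] - dists[i] < 1600:
        if same_school(s, schools[j]):
            is_dup[j] = True
        j += 1

frame = pd.DataFrame(kept)            # one constructor call, no repeated appends
print(len(frame))  # 2
```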

0 Answers