
I'm trying to return a dictionary that aggregates tweets by their nearest state center. I'm iterating over all the tweets, and for each tweet, I'm checking all the states to see which state is the closest.

What would be a better way to do this?

def group_tweets_by_state(tweets):
    """Return a dictionary that groups tweets by their nearest state center.

    The keys of the returned dictionary are state names, and the values are
    lists of tweets that appear closer to that state center than any other.

    tweets -- a sequence of tweet abstract data types
    """
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        min, result_state = 100000, 'CA'
        for state in us_states:
            if geo_distance(position, find_state_center(us_states[state])) < min:
                min = geo_distance(position, find_state_center(us_states[state]))
                result_state = state
        if result_state not in tweets_by_state:
            tweets_by_state[result_state] = []
        tweets_by_state[result_state].append(tweet)
    return tweets_by_state
  • http://c2.com/cgi/wiki?PrematureOptimization Does your code work? If so, do you really care about the time it takes to run? Is it really that slow? –  Dec 26 '14 at 18:43
  • But right now, it is very slow. – Jobs Dec 26 '14 at 18:45
  • Maybe that's because you've got a lot of tweets to iterate over. In that case, I agree it is better to optimize for time. If not, the slowness may come from some other code fragment around this one. –  Dec 26 '14 at 18:46
  • It takes minutes to run. Yeah I do have a lot of tweets to iterate over. – Jobs Dec 26 '14 at 18:46

1 Answer


Every little enhancement in that big nested loop yields a huge performance gain when the number of tweets is extremely large. There are a few things I can think of:

1. Call geo_distance() only once per state, especially when it's costly

distance = geo_distance(position, find_state_center(us_states[state]))
if distance < min:
    min = distance

rather than

if geo_distance(position, find_state_center(us_states[state])) < min:
    min = geo_distance(position, find_state_center(us_states[state]))

2. If positions tend to be repeated often, cache the closest state per position:

position_closest_state = {}  # cache: position -> closest state, saves repeated scans
tweets_by_state = {}
for tweet in tweets:
    position = tweet_location(tweet)

    if position in position_closest_state:
        result_state = position_closest_state[position]
    else:
        min_distance, result_state = float('inf'), 'CA'
        for state in us_states:
            distance = geo_distance(position, find_state_center(us_states[state]))
            if distance < min_distance:
                min_distance = distance
                result_state = state
        position_closest_state[position] = result_state  # record once per new position

    tweets_by_state.setdefault(result_state, []).append(tweet)

So, if you have 1000 tweets from 200 distinct positions and us_states has 50 entries, your original algorithm would call geo_distance() 1000 * 50 * 2 times; now that is reduced to 200 * 50 * 1 invocations.
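
As a side note, the same per-position caching can be expressed with functools.lru_cache, assuming positions are hashable (a sketch of an alternative, not part of the original code):

from functools import lru_cache

@lru_cache(maxsize=None)  # memoizes by position; positions must be hashable
def closest_state(position):
    # built-in min() with a key function replaces the manual comparison loop
    return min(us_states,
               key=lambda state: geo_distance(position, find_state_center(us_states[state])))

Each distinct position then triggers the full scan over states only once, just like the explicit dictionary above.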

3. Reduce the number of calls to find_state_center()

Similar to #2: right now it is called redundantly for every tweet and every state, even though the state centers never change.

state_center_dict = {}  # precompute each state's center exactly once
for state in us_states:
    state_center_dict[state] = find_state_center(us_states[state])

position_closest_state = {}  # cache: position -> closest state
tweets_by_state = {}
for tweet in tweets:
    position = tweet_location(tweet)

    if position in position_closest_state:
        result_state = position_closest_state[position]
    else:
        min_distance, result_state = float('inf'), 'CA'
        for state in us_states:
            distance = geo_distance(position, state_center_dict[state])
            if distance < min_distance:
                min_distance = distance
                result_state = state
        position_closest_state[position] = result_state  # record once per new position

    tweets_by_state.setdefault(result_state, []).append(tweet)

Now find_state_center() is called only 50 times (once per state) rather than 50 * 1000 times (once per state per tweet). Another huge improvement!
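
The precomputation loop can also be written as a dict comprehension; same 50 calls, just more compact:

state_center_dict = {state: find_state_center(us_states[state])
                     for state in us_states}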

Summary of performance gains

#1 roughly doubles the speed by halving the number of geo_distance() calls. #2 speeds things up by a factor of (number of tweets / number of distinct positions). #3 is the biggest of the three: the number of find_state_center() calls drops to 1/(number of tweets) of the original code.
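
If you want to measure the gain on your own data, a quick timing sketch (the function and the tweets sequence come from this thread; everything else is standard library):

import time

start = time.perf_counter()  # Python 3.3+; use time.time() on older versions
tweets_by_state = group_tweets_by_state(tweets)
print('grouped into %d states in %.2f seconds'
      % (len(tweets_by_state), time.perf_counter() - start))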

  • lol. I see. But is there a way to prevent iterating over every single state? – Jobs Dec 26 '14 at 18:54
  • That's a very nice suggestion! Thanks Paul. – Jobs Dec 26 '14 at 19:06
  • @Steve I just edited the post to avoid the unnecessary calls to `find_state_center()` as well, enjoy :p – Paul Lo Dec 26 '14 at 19:23
  • 1
    #4. You can probably get a bit more performance by building a kd-tree (or some other space partitioning data structure) of the state centers. Reducing the number of states checked per tweet from 50 to around 9. #2: If some approximation is acceptable, this could be improved by using a geohash to place an upper limit on the number of unique coordinates. Rather than say 12000 tweets with 4000 unique locations, you could limit it to only say 512 or 256 or 128 or 64 unique locations depending on how much accuracy your willing to sacrifice. – Nuclearman Dec 28 '14 at 09:40
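
A minimal sketch of the kd-tree idea from that last comment, assuming SciPy is available and that find_state_center() returns a (latitude, longitude) pair that can be treated as planar coordinates (a rough approximation; an exact great-circle nearest-neighbor search needs more care):

from scipy.spatial import cKDTree

state_names = list(us_states)
centers = [find_state_center(us_states[state]) for state in state_names]
tree = cKDTree(centers)  # build once: O(n log n) for n state centers

tweets_by_state = {}
for tweet in tweets:
    _, index = tree.query(tweet_location(tweet))  # nearest center in O(log n)
    tweets_by_state.setdefault(state_names[index], []).append(tweet)

Note that Euclidean distance on raw latitude/longitude is not the same metric as geo_distance(), so results can differ for tweets near state borders.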