Every little enhancement in that huge for-loop would result in huge performance gain for time complexity when the number of tweets is extremely big, there are few things I can think of:
1. Just call geo_distance()
once especially when it's costly
distance = geo_distance(position, find_state_center(us_states[state]))
if distance < min:
min = distance
rather than
if geo_distance(position, find_state_center(us_states[state]))< min:
min = geo_distance(position, find_state_center(us_states[state]))
2. If positions tend to be be repeated often:
position_closest_state = {} # save known result
tweets_by_state = {}
for tweet in tweets:
position = tweet_location(tweet)
min, result_state = 100000, 'CA'
if position in position_closest_state:
result_state = position_closest_state[position]
else:
for state in us_states:
distance = geo_distance(position, find_state_center(us_states[state]))
if distance < min:
min = distance
result_state = state
position_closest_state[position] = result_state
So, let's say if you have 1000 tweets from 200 different positions, and us_states
is 50, your origin algorithm would call geo_distance()
1000*50*2 times , now it can be reduced to 200*50*1 times of invocation.
3. Reduce the times of invocation on find_state_center()
Similar to #2, it is called redundantly for every tweet every state now.
state_center_dict = {}
for state in us_states:
state_center_dict[state] = find_state_center(us_states[state])
position_closest_state = {} # save known result
tweets_by_state = {}
for tweet in tweets:
position = tweet_location(tweet)
min, result_state = 100000, 'CA'
if position in position_closest_state:
result_state = position_closest_state[position]
else:
for state in us_states:
distance = geo_distance(position, state_center_dict[state])
if distance < min:
min = distance
result_state = state
position_closest_state[position] = result_state
Now find_state_center()
is only called 50 times (number of state) rather than 50*1000 (number of tweets), we made another huge improvement!
Performance achievement summary
By #1, we double the performance. #2 we enhance it by (number of tweets/number of positions) times. #3 is the biggest among the three, we reduce the time complexity to only 1/(number of tweets) comparing with origin code.