0

I am trying to match the stop_id in stop_times.csv to the stop_id in stops.csv in order to copy over the stop_lat and stop_lon to their respective columns in stop_times.csv.

Gist files:

stops.csv LINK

stop_times.csv LINK

Here's my code:

import pandas as pd

st = pd.read_csv('csv/stop_times.csv', sep=',')
st.set_index(['trip_id','stop_sequence'])
stops = pd.read_csv('csv/stops.csv')

for i in range(len(st)):
    for x in range(len(stops)):
        if st['stop_id'][i] == stops['stop_id'][x]:
            st['stop_lat'][i] = stops['stop_lat'][x]
            st['stop_lon'][i] = stops['stop_lon'][x]

st.to_csv('csv/stop_times.csv', index=False)

I'm aware that the script is applying a copy, but I'm not sure what other method to go about doing this, as I'm fairly new to pandas.

  • 1
    Aside: `set_index` doesn't act in place, so the `st.set_index` line doesn't change `st`. – DSM Dec 30 '14 at 07:13

1 Answers1

1

You can merge the two DataFrames:

pd.merge(stops, st, on='stop_id')

Since there are stop_lat columns in each, it will give you stop_lat_x (the good one) and stop_lat_y (the always-zero one). You can then remove or ignore the bad column and output the resulting DataFrame however you want.

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • How would you implement the `pd.merge`? I tried just doing `st = pd.merge(stops,st,on='stop_id')` and that messed up my csv pretty badly. The answer is not very obvious to me, so if you could extrapolate a little further that would be appreciated. –  Dec 30 '14 at 07:23
  • 1
    I would look at the result from `merge` and remove the columns I didn't want. And I would be more specific than "it messed up badly." :) – John Zwinck Dec 30 '14 at 07:27
  • I see. By messed up, I realized that what it did was change the ordering around. I'll just apply a sort again and get it in the correct order. This answer worked, thanks a bunch! I spent 6 hours trying different methods. :( –  Dec 30 '14 at 07:30
  • 1
    You're welcome. When you find yourself writing loops in Pandas/NumPy/SciPy, you usually should take a step back and figure out how to do it the "right" way, with vectorized operations. It's much faster to execute anyway. – John Zwinck Dec 30 '14 at 07:37