CSV Manipulation in Python Origin Destination Matrix Formulation

Question

I have a csv which containts the name and lat/lng location information of the Underground Stations in London. It looks like this:

Station Lat Lng
Abbey Road  51.53195199 0.003737786
Abbey Wood  51.49078408 0.120286371
Acton   51.51688696 -0.267675543
Acton Central   51.50875781 -0.263415792
Acton Town  51.50307148 -0.280288296

I wish to transform this csv to create an origin destination matrix of all the possible combinations of these stations. There are 270 stations, thus there are 72,900 possible combinations.

Ultimately I wish to turn this matrix into a csv with the following format

O_Station   O_lat   O_lng   D_Station   D_lat   D_lng
Abbey Road  51.53195199 0.003737786 Abbey Wood  51.49078408 0.120286371
Abbey Road  51.53195199 0.003737786 Acton   51.51688696 -0.267675543
Abbey Road  51.53195199 0.003737786 Acton Central   51.50875781 -0.263415792
Abbey Wood  51.49078408 0.120286371 Abbey Road  51.53195199 0.003737786
Abbey Wood  51.49078408 0.120286371 Acton   51.51688696 -0.267675543
Abbey Wood  51.49078408 0.120286371 Acton Central   51.50875781 -0.263415792
Acton   51.51688696 -0.267675543    Abbey Road  51.53195199 0.003737786
Acton   51.51688696 -0.267675543    Abbey Wood  51.49078408 0.120286371
Acton   51.51688696 -0.267675543    Acton Central   51.50875781 -0.263415792

The first step would be to pair any station using a loop with all of the other possible stations. I would then need to remove the 0 combinations where an origin and destination were the same station.

Ive tried using the NumPy function column_stack. However this gives a strange result.

import csv
import numpy
from pprint import pprint
numpy.set_printoptions(threshold='nan')

with open('./London stations.csv', 'rU') as csvfile:
    reader = csv.DictReader(csvfile)
    Stations = ['{O_Station}'.format(**row) for row in reader]
print(Stations)
O_D = numpy.column_stack(([Stations],[Stations]))
pprint(O_D)

OUTPUT

Stations =

['Abbey Road', 'Abbey Wood', 'Acton', 'Acton Central', 'Acton Town']

O_D =

array([['Abbey Road', 'Abbey Wood', 'Acton', 'Acton Central', 'Acton Town',
        'Abbey Road', 'Abbey Wood', 'Acton', 'Acton Central', 'Acton Town']], 
      dtype='|S13')

I am ideally looking for more suitable function and finding it hard to locate it in the Numpy manual.

score 0 · Answer 1 · answered Aug 19 '15 at 17:45

This is an incomplete answer, but I would skip numpy and head right into pandas:

csv_file = '''Station Lat Lng
Abbey Road  51.53195199 0.003737786
Abbey Wood  51.49078408 0.120286371
Acton   51.51688696 -0.267675543
Acton Central   51.50875781 -0.263415792
Acton Town  51.50307148 -0.280288296'''

This is tough since it isn't really comma-delimited, otherwise we could just call pandas.read_csv():

names = [' '.join(x.split()[:-2]) for x in stations]
lats = [x.split()[-2] for x in stations]
lons = [x.split()[-1] for x in stations]

stations_dict = {names[i]: (lats[i], lons[i]) for i, _ in enumerate(stations)}

df = pd.DataFrame(stations_dict).T    # Transpose it
df.columns = ['Lat', 'Lng']
df.index.name = 'Station'

So we end up with df.head() yielding:

                       Lat           Lng
Station
Abbey Road     51.53195199   0.003737786
Abbey Wood     51.49078408   0.120286371
Acton          51.51688696  -0.267675543
Acton Central  51.50875781  -0.263415792
Acton Town     51.50307148  -0.280288296

Getting the permutations might mean we need to not have the Stations as the index... Not sure for the moment. Hopefully this helps a bit!

rgalbo · Answer 2 · 2015-08-19T22:35:09.590

0

When working with tabular data like this I prefer to use pandas. It makes controlling your data structure simple.

import pandas as pd

#read in csv
stations = pd.read_csv('london stations.csv', index_col = 0)

#create new dataframe
O_D = pd.DataFrame(columns = ['O_Station','O_lat','O_lng','D_Station','D_lat','D_lng'])

#iterate through the stations

new_index= 0
for o_station in stations.index:
    for d_station in stations.index:
        ls = [o_station,stations.Lat.loc[o_station],stations.Lng.loc[o_station],d_station, stations.Lat.loc[d_station], stations.Lng.loc[d_station]]
        O_D.loc[new_index] = ls
        new_index+=1

#remove double stations
O_D = O_D[O_D.O_Station != O_D.D_Station]

This should do the trick for your data transform.

edited Aug 19 '15 at 22:35

answered Aug 19 '15 at 18:49

rgalbo

4,186
1
19
29

Thanks @rgalbo. However, I am getting an IndexError - IndexError: iloc cannot enlarge its target object on line "O_D.O_Station.iloc[new_index] = o_station" – LearningSlowly Aug 19 '15 at 20:36
Excellent! Many thanks. Now. to get my head around how this works. – LearningSlowly Aug 19 '15 at 22:30
It does the same looping structure from the last version. It takes the info and creates a list of the same format as the dataframe. Then it uses `.loc` to add the dataframe at that given index. then it simply increments the index. The last line simply says select all the rows where the origin is not equal to the destination – rgalbo Aug 19 '15 at 22:33
ok. Slight issue. It appears this gives the last series of combinations and overwrites all previous combinations. So the O_D data frame is being overwritten each time it does the second loop. – LearningSlowly Aug 20 '15 at 11:43

CSV Manipulation in Python Origin Destination Matrix Formulation

2 Answers2