I'm trying to write a function that takes a list of URLs which each go through one or more 301 hops and flattens every chain, so that each URL points directly to its final destination. I want to save the result as a CSV that I can hand to the developers, who can then implement it and get rid of the intermediate 301 hops.
For example, my crawler will produce this list of 301 hops:
URL1 | URL2 | URL3 | URL4
example.com/url1 | example.com/url2 | |
example.com/url3 | example.com/url4 | example.com/url5 |
example.com/url6 | example.com/url7 | example.com/url8 | example.com/10
example.com/url9 | example.com/url7 | example.com/url8 |
example.com/url23 | example.com/url10 | |
example.com/url24 | example.com/url45 | example.com/url46 |
example.com/url25 | example.com/url45 | example.com/url46 |
example.com/url26 | example.com/url45 | example.com/url46 |
example.com/url27 | example.com/url45 | example.com/url46 |
example.com/url28 | example.com/url45 | example.com/url46 |
example.com/url29 | example.com/url45 | example.com/url46 |
example.com/url30 | example.com/url45 | example.com/url46 |
The output I'm trying to get is:
URL1 | URL2
example.com/url1 | example.com/url2
example.com/url3 | example.com/url5
example.com/url4 | example.com/url5
example.com/url6 | example.com/10
example.com/url7 | example.com/10
example.com/url8 | example.com/10
example.com/url23 | example.com/url10
...
I've converted the pandas DataFrame to a list of lists using the code below:
import pandas as pd
import numpy as np
# header=None: treat the first line of the file as data, not as column names
csv1 = pd.read_csv('Example_301_sheet.csv', header=None)
def link_flat(csv):
    # Collect every DataFrame row as a plain Python list
    outlist = []
    for index, data in csv.iterrows():
        outlist.append(data.tolist())
    return outlist
outlist = link_flat(csv1)
This returns each row as a list, with all of the rows nested inside one outer list, like this:
[['example.com/url1', 'example.com/url2', nan, nan],
['example.com/url3', 'example.com/url4', 'example.com/url5', nan],
['example.com/url6',
'example.com/url7',
'example.com/url8',
'example.com/10'],
['example.com/url9', 'example.com/url7', 'example.com/url8', nan],
['example.com/url23', 'example.com/url10', nan, nan],
['example.com/url24', 'example.com/url45', 'example.com/url46', nan],
['example.com/url25', 'example.com/url45', 'example.com/url46', nan],
['example.com/url26', 'example.com/url45', 'example.com/url46', nan],
['example.com/url27', 'example.com/url45', 'example.com/url46', nan],
['example.com/url28', 'example.com/url45', 'example.com/url46', nan],
['example.com/url29', 'example.com/url45', 'example.com/url46', nan],
['example.com/url30', 'example.com/url45', 'example.com/url46', nan]]
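As an aside, the same nested list can probably be built without the explicit loop. The sketch below is only an illustration: `sample_csv` is a hypothetical in-memory stand-in for `Example_301_sheet.csv`, and it assumes the file is comma-separated with four columns per row.

```python
import io

import pandas as pd

# Hypothetical stand-in for Example_301_sheet.csv: two rows, four columns each
sample_csv = io.StringIO(
    "example.com/url1,example.com/url2,,\n"
    "example.com/url3,example.com/url4,example.com/url5,\n"
)

# header=None: treat the first line as data, not as column names
frame = pd.read_csv(sample_csv, header=None)

# One inner list per DataFrame row; empty cells come back as float NaN
rows = frame.values.tolist()
print(rows)
```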
How do I pair each URL in a nested list with the last non-NaN URL in that same list, so that I end up with the two-column output shown above?
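To make the goal concrete, here is a minimal sketch of the pairing I have in mind. It is not a definitive implementation. The names `is_missing`, `flatten_hops` and `write_pairs` are hypothetical helpers made up for illustration, and the sketch assumes empty cells arrive as float NaN (as in the list above) and that duplicate pairs should be written only once.

```python
import csv
import math


def is_missing(value):
    """Return True for None or for the float NaN that marks an empty cell."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def flatten_hops(rows):
    """Pair every URL in a row with the last non-empty URL of that same row."""
    pairs = []
    seen = set()
    for row in rows:
        urls = [url for url in row if not is_missing(url)]
        if len(urls) < 2:
            continue  # a single URL has no redirect target
        final = urls[-1]
        for url in urls[:-1]:
            pair = (url, final)
            if pair not in seen:  # keep first occurrence, preserve order
                seen.add(pair)
                pairs.append(pair)
    return pairs


def write_pairs(pairs, path):
    """Write the (source, target) pairs to a two-column CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["URL1", "URL2"])
        writer.writerows(pairs)


# First three rows of the nested list above, with NaN for the empty cells
nan = float("nan")
sample = [
    ["example.com/url1", "example.com/url2", nan, nan],
    ["example.com/url3", "example.com/url4", "example.com/url5", nan],
    ["example.com/url6", "example.com/url7", "example.com/url8", "example.com/10"],
]

for source, target in flatten_hops(sample):
    print(source, "->", target)
```

Because this only looks within a single row, the same source URL can end up with two different targets when rows share intermediate hops. In my sample data, example.com/url7 would be paired with example.com/10 from one row and with example.com/url8 from the next, so the result may still need a second pass that follows chains across rows.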