
I'm trying to make a function where I feed it a list of URLs that each go through a chain of 301 hops, and it flattens each chain for me. I want to save the resulting list as a CSV so I can hand it to the developers, who can implement the direct redirects and get rid of the 301 hops.

For example, my crawler will produce this list of 301 hops:

URL1              | URL2              | URL3              | URL4
example.com/url1  | example.com/url2  |                   |
example.com/url3  | example.com/url4  | example.com/url5  |
example.com/url6  | example.com/url7  | example.com/url8  | example.com/10
example.com/url9  | example.com/url7  | example.com/url8  |
example.com/url23 | example.com/url10 |                   |
example.com/url24 | example.com/url45 | example.com/url46 |
example.com/url25 | example.com/url45 | example.com/url46 |
example.com/url26 | example.com/url45 | example.com/url46 |
example.com/url27 | example.com/url45 | example.com/url46 |
example.com/url28 | example.com/url45 | example.com/url46 |
example.com/url29 | example.com/url45 | example.com/url46 |
example.com/url30 | example.com/url45 | example.com/url46 |

The output I'm trying to get is:

URL1              | URL2 
example.com/url1  | example.com/url2
example.com/url3  | example.com/url5
example.com/url4  | example.com/url5
example.com/url6  | example.com/10
example.com/url7  | example.com/10
example.com/url8  | example.com/10
example.com/url23 | example.com/url10
...

I've converted the pandas DataFrame to a list of lists using the code below:

import pandas as pd

csv1 = pd.read_csv('Example_301_sheet.csv', header=None)

def link_flat(csv):
    # Collect each DataFrame row as a plain Python list.
    outlist = []
    for index, data in csv.iterrows():
        outlist.append(data.tolist())
    return outlist
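
Calling it on the DataFrame read above builds the nested list shown next:

outlist = link_flat(csv1)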

This returns each row as a list, all nested together in a single outer list, like below:

[['example.com/url1', 'example.com/url2', nan, nan],
 ['example.com/url3', 'example.com/url4', 'example.com/url5', nan],
 ['example.com/url6',
  'example.com/url7',
  'example.com/url8',
  'example.com/10'],
 ['example.com/url9', 'example.com/url7', 'example.com/url8', nan],
 ['example.com/url23', 'example.com/url10', nan, nan],
 ['example.com/url24', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url25', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url26', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url27', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url28', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url29', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url30', 'example.com/url45', 'example.com/url46', nan]]

How do I match each URL in each nested list with the last URL in the same list to produce the above list?
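
For reference, here's a minimal plain-Python sketch of the matching I have in mind (the `flatten_hops` name is mine), though a pandas approach would be preferable:

def flatten_hops(rows):
    # Map every URL in each chain to the chain's final URL (the one that returns a 200).
    pairs = []
    seen = set()
    for row in rows:
        # Keep only real URLs; pandas pads shorter rows with float('nan').
        urls = [u for u in row if isinstance(u, str)]
        if not urls:
            continue
        target = urls[-1]      # last URL in the row resolves with a 200
        for url in urls[:-1]:  # every earlier URL should redirect straight to it
            if (url, target) not in seen:
                seen.add((url, target))
                pairs.append([url, target])
    return pairs

# e.g. flatten_hops(outlist)[0] == ['example.com/url1', 'example.com/url2']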

  • Mind explaining a bit more how you get this output? What is a hop? – cs95 May 10 '18 at 08:18
  • A hop is when a URL is redirected to a URL which itself redirects. In the list above, if a user or crawler clicks the first link, they will be redirected through all the URLs in that row until they get to the last URL in the row, which resolves with a 200. The aim here is to "flatten" the redirects, so if a user clicks a URL it redirects straight to a 200 URL – jceg316 May 10 '18 at 08:22
  • In the 3rd row, shouldn't it be `example.com/url6 -> example.com/10`? – cs95 May 10 '18 at 08:25
  • The 3rd row should be `example.com/url4 -> example.com/url5`, as I also need to include URLs which don't hop, to avoid confusion. Also, I think it would be easier to include these. It doesn't matter if there are duplicates in the output CSV or if it contains existing 301s – jceg316 May 10 '18 at 08:31
  • Ahhhh, I finally get it. Thanks. – cs95 May 10 '18 at 08:32

1 Answer


You'll need to determine the last valid item per row using `groupby` + `last`, then reshape the DataFrame into a two-column mapping using `melt`.

# df is the raw DataFrame (read with header=None, like csv1 above)
df.columns = range(len(df.columns))
df = (
    df.assign(URL2=df.stack().groupby(level=0).last())  # last non-NaN URL per row
      .melt('URL2', value_name='URL1')                  # pair every column's URL with that target
      .drop(columns='variable')                         # discard the melted column label
      .dropna()                                         # drop the NaN padding
      .drop_duplicates()
      .query('URL1 != URL2')                            # a final URL shouldn't map to itself
      .sort_index(axis=1)                               # order the columns URL1, URL2
      .reset_index(drop=True)
)

df
                 URL1               URL2
0    example.com/url1   example.com/url2
1    example.com/url3   example.com/url5
2    example.com/url6     example.com/10
3    example.com/url9   example.com/url8
4   example.com/url23  example.com/url10
5   example.com/url24  example.com/url46
6   example.com/url25  example.com/url46
7   example.com/url26  example.com/url46
8   example.com/url27  example.com/url46
9   example.com/url28  example.com/url46
10  example.com/url29  example.com/url46
11  example.com/url30  example.com/url46
12   example.com/url4   example.com/url5
13   example.com/url7     example.com/10
14   example.com/url7   example.com/url8
15  example.com/url45  example.com/url46
16   example.com/url8     example.com/10
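
To produce the CSV for the developers, you can then write the result out with `to_csv`; a minimal sketch (the filename is just an example):

# Write the URL1 -> URL2 mapping without the pandas row index.
df.to_csv('flattened_301s.csv', index=False)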
– cs95