I have a list of URLs with query parameters:

http://example.com/?param1=apple&param2=tomato&param3=carrot
http://sample.com/?param1=banana&param3=potato&param4=berry
http://example.org/?param2=apple&param3=tomato&param4=carrot

Each URL may contain any of 4 parameters.

I want to extract the URL parameters and add them to a pandas DataFrame. The DataFrame should have a URL column and 4 parameter columns. If a parameter is not present in a URL, the corresponding cell should be empty:

URL    param1    param2     param3    param4
...    apple     tomato     carrot
...    banana               potato    berry
...              apple      tomato    carrot

I was planning to use Python's built-in urlparse module, which makes it easy to extract parameters:

import urlparse
url = 'http://example.com/?param1=apple&param2=tomato&param3=carrot'
par = urlparse.parse_qs(urlparse.urlparse(url).query)
print par['param1'], par['param2']

Out: ['apple'] ['tomato']
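
For readers on Python 3, where the urlparse module became part of urllib.parse, the same extraction might look roughly like the sketch below. The `flat` dictionary is an illustrative addition that is not in the original post, and it assumes no parameter name is repeated within a URL.

```python
# Python 3: the functions of the old urlparse module live in urllib.parse
from urllib.parse import urlparse, parse_qs

url = 'http://example.com/?param1=apple&param2=tomato&param3=carrot'

# parse_qs maps each parameter name to a list of values,
# because a name may appear more than once in a query string
par = parse_qs(urlparse(url).query)
print(par['param1'], par['param2'])

# Illustrative only: keep the first value of each parameter
# (assumes parameter names are not repeated within one URL)
flat = {name: values[0] for name, values in par.items()}
print(flat)
```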

With urlparse I can get a list of the parameters found in each URL:

import pandas as pd

urls = ['http://example.com/?param1=apple&param2=tomato&param3=carrot',
        'http://sample.com/?param1=banana&param3=potato&param4=berry',
        'http://example.org/?param2=apple&param3=tomato&param4=carrot']

df = pd.DataFrame(urls, columns=['url'])
params = [urlparse.parse_qs(urlparse.urlparse(url).query) for url in urls]
print params

Out: [{'param1': ['apple'], 'param2': ['tomato'], 'param3': ['carrot']},
      {'param1': ['banana'], 'param3': ['potato'], 'param4': ['berry']},
      {'param2': ['apple'], 'param3': ['tomato'], 'param4': ['carrot']}]
...

I don't know how to add the extracted parameters to the DataFrame. Is there a better way of doing this? The original file contains ~1 million URLs.

chilliq

3 Answers

I would recommend the urlparse library. The benefit of this approach is that you don't need to know the query's field names ('param1' etc.) in advance:

In [278]:

import urlparse
import pandas as pd

In [279]:

T = ['http://example.com/?param1=apple&param2=tomato&param3=carrot',
     'http://sample.com/?param1=banana&param3=potato&param4=berry',
     'http://example.org/?param2=apple&param3=tomato&param4=carrot']

In [280]:

# Build a one-row DataFrame per URL and stack them; missing parameters become NaN
df = pd.concat(map(lambda x: pd.DataFrame(urlparse.parse_qs(urlparse.urlparse(x).query)), T))
# df['URL'] = T  # optionally add a column with the original URLs
print df

   param1  param2  param3  param4
0   apple  tomato  carrot     NaN
0  banana     NaN  potato   berry
0     NaN   apple  tomato  carrot
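
The output above has every row labelled 0 and no URL column. A minimal Python 3 sketch of the same idea that renumbers the rows and adds the URLs could look like this. It assumes every URL carries at least one parameter and that no parameter name is repeated within a URL; otherwise the per-URL frames would be empty or fail to build.

```python
import pandas as pd
from urllib.parse import urlparse, parse_qs

T = ['http://example.com/?param1=apple&param2=tomato&param3=carrot',
     'http://sample.com/?param1=banana&param3=potato&param4=berry',
     'http://example.org/?param2=apple&param3=tomato&param4=carrot']

# One single-row DataFrame per URL; parse_qs returns {name: [value]}
frames = [pd.DataFrame(parse_qs(urlparse(u).query)) for u in T]

# concat aligns the columns by name and fills the gaps with NaN;
# ignore_index=True numbers the rows 0, 1, 2 instead of 0, 0, 0
df = pd.concat(frames, ignore_index=True, sort=False)

# Put the original URLs in the first column
df.insert(0, 'URL', T)
print(df)
```

Creating one DataFrame per URL is likely to be slow for a file of ~1m URLs, so this is better suited to small inputs.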
CT Zhu

Alternatively, since pandas 0.18.0 (March 13, 2016), you can use the pandas.Series.str.extractall() method:

params = df.url.str.extractall('[?&](?P<parameter>[^?#=]+)=?(?P<value>[^&]*)')
print params
Out[1]:
        parameter   value
  match                  
0 0        param1   apple
  1        param2  tomato
  2        param3  carrot
1 0        param1  banana
  1        param3  potato
  2        param4   berry
2 0        param2   apple
  1        param3  tomato
  2        param4  carrot

Or something adapted from that.
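
One possible adaptation, sketched here under the assumption that each parameter name occurs at most once per URL (unstack raises an error on duplicates), is to pivot the long parameter/value table into one column per parameter and join it back onto the URLs. The names `wide` and `result` are illustrative.

```python
import pandas as pd

urls = ['http://example.com/?param1=apple&param2=tomato&param3=carrot',
        'http://sample.com/?param1=banana&param3=potato&param4=berry',
        'http://example.org/?param2=apple&param3=tomato&param4=carrot']
df = pd.DataFrame(urls, columns=['url'])

# Long format: one row per (URL row, match number)
params = df['url'].str.extractall(r'[?&](?P<parameter>[^?#=]+)=?(?P<value>[^&]*)')

# Drop the match counter, index by (URL row, parameter name),
# then turn the parameter names into columns
wide = (params.reset_index(level='match', drop=True)
              .set_index('parameter', append=True)['value']
              .unstack('parameter'))

# Align on the original row index; absent parameters stay NaN
result = df.join(wide)
print(result)
```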

It would be nice to have (better) access to the regexes used in the urlparse library, so that they could be used directly with this method and with pandas.Series.str.extract().

Finally, be sure to bookmark the "Working with Text Data" section of the pandas documentation; it has a bunch of useful examples.

bluu

You can use a dictionary comprehension to collect the values parameter by parameter from the params list built in the question. I'm not sure whether you want the final values in list form; if not, it is easy to unwrap them.

>>> pd.DataFrame({p: [d.get(p) for d in params] 
                  for p in ['param1', 'param2', 'param3', 'param4']})
     param1    param2    param3    param4
0   [apple]  [tomato]  [carrot]      None
1  [banana]      None  [potato]   [berry]
2      None   [apple]  [tomato]  [carrot]

or...

>>> pd.DataFrame({p: [d[p][0] if p in d else None for d in params] 
                  for p in ['param1', 'param2', 'param3', 'param4']})
   param1  param2  param3  param4
0   apple  tomato  carrot    None
1  banana    None  potato   berry
2    None   apple  tomato  carrot
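
A variation on the same idea, sketched for Python 3, builds one plain dictionary per URL with parse_qsl and lets the DataFrame constructor line up the columns. This is an illustrative alternative rather than the approach above; it assumes the four column names are known in advance, and dict() keeps only the last value if a name is repeated.

```python
import pandas as pd
from urllib.parse import urlparse, parse_qsl

urls = ['http://example.com/?param1=apple&param2=tomato&param3=carrot',
        'http://sample.com/?param1=banana&param3=potato&param4=berry',
        'http://example.org/?param2=apple&param3=tomato&param4=carrot']

# parse_qsl yields (name, value) pairs, so dict() gives {name: value}
# without the one-element lists that parse_qs produces
rows = [dict(parse_qsl(urlparse(u).query)) for u in urls]

# One row per dict; keys missing from a dict become NaN
df = pd.DataFrame(rows, columns=['param1', 'param2', 'param3', 'param4'])
df.insert(0, 'url', urls)
print(df)
```

Because only a single DataFrame is constructed, this form may be a reasonable starting point for the ~1m URLs mentioned in the question.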
Alexander