I have a list of URLs with query parameters:
http://example.com/?param1=apple&param2=tomato&param3=carrot
http://sample.com/?param1=banana&param3=potato&param4=berry
http://example.org/?param2=apple&param3=tomato&param4=carrot
Each URL may contain any of the 4 parameters.
I want to extract the URL parameters and add them to a pandas DataFrame. The DataFrame should have a URL column and 4 parameter columns, one per parameter. If a parameter is not present in a URL, its cell should be empty:
URL    param1    param2    param3    param4
...    apple     tomato    carrot
...    banana              potato    berry
...              apple     tomato    carrot
I was planning to use Python's built-in urlparse module, which makes it easy to extract the parameters:
import urlparse
url = 'http://example.com/?param1=apple&param2=tomato&param3=carrot'
par = urlparse.parse_qs(urlparse.urlparse(url).query)
print par['param1'], par['param2']
Out: ['apple'] ['tomato']
With urlparse I can get the parameters for every URL in the list:
import pandas as pd
urls = ['http://example.com/?param1=apple&param2=tomato&param3=carrot',
        'http://sample.com/?param1=banana&param3=potato&param4=berry',
        'http://example.org/?param2=apple&param3=tomato&param4=carrot']
df = pd.DataFrame(urls, columns=['url'])
params = [urlparse.parse_qs(urlparse.urlparse(url).query) for url in urls]
print params
Out: [{'param1': ['apple'], 'param2': ['tomato'], 'param3': ['carrot']},
{'param1': ['banana'], 'param3': ['potato'], 'param4': ['berry']},
{'param2': ['apple'], 'param3': ['tomato'], 'param4': ['carrot']}]
...
I don't know how to add the extracted parameters to the DataFrame. Is there a better way to do this? The original file contains about 1 million URLs.
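For reference, here is a minimal sketch of one possible way to get from the parsed dictionaries to the table described above. It is not necessarily the fastest option for ~1 million URLs. It assumes the four parameter names are known in advance, that each parameter appears at most once per URL (only the first value is kept), and that an empty string is an acceptable "empty cell". The names `param_names` and `first_values` are made up for this illustration. The import is written so that it works with either the Python 2 `urlparse` module or Python 3's `urllib.parse`.

```python
import pandas as pd

try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

urls = ['http://example.com/?param1=apple&param2=tomato&param3=carrot',
        'http://sample.com/?param1=banana&param3=potato&param4=berry',
        'http://example.org/?param2=apple&param3=tomato&param4=carrot']

# Assumption: the full set of parameter names is known up front.
param_names = ['param1', 'param2', 'param3', 'param4']


def first_values(url):
    """Map each query parameter of one URL to its first value."""
    query = parse_qs(urlparse(url).query)
    # parse_qs returns lists; keep only the first value (assumes no repeats).
    return {name: values[0] for name, values in query.items()}


# Build all rows first, then construct the DataFrame once.
rows = [first_values(url) for url in urls]

# Parameters missing from a URL become NaN in the corresponding column.
params_df = pd.DataFrame(rows, columns=param_names)
params_df.insert(0, 'url', urls)

# Replace NaN with an empty string so absent parameters show as empty cells.
df = params_df.fillna('')
print(df)
```

Building the list of dictionaries in one pass and calling the `DataFrame` constructor a single time avoids growing the frame row by row. Note that `parse_qs` drops parameters with blank values by default, so a query such as `?param1=` would also end up as an empty cell here.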