Python: extract the search query text from a URL

Question

I am trying to extract users' search queries from URLs. I searched for an answer, but I only found how to parse a query string. My problem is that I have to handle a lot of URLs from different search engines, and the parameter that holds the query text differs between them. This is what I tried:

import re
import urllib
from urlparse import parse_qs

# urls is a list of URL strings (not shown here)
pat = re.compile(r"\?\w+=(.*)")
search = ['yandex.ru/search', 'youtube.com/results', 'google.com/search', 'google.ru/search', 'go.mail.ru/search', 'search.yahoo.com/search', 'market.yandex.ru/search', 'bing.com/search']
for i in urls:
    u = re.findall(pat, i)
    if any(ext in i for ext in search):
        if len(u) > 0:
            str = urllib.unquote(u[0])
            print str
            print {k: [s for s in v] for k, v in parse_qs(str).items()}

The output looks like this:

chromesearch&clid=2196598&text=королевы крика смотреть онлайн&lr=213&redircnt=1467230336.1
{'text': ['\xd0\xba\xd0\xbe\xd1\x80\xd0\xbe\xd0\xbb\xd0\xb5\xd0\xb2\xd1\x8b \xd0\xba\xd1\x80\xd0\xb8\xd0\xba\xd0\xb0 \xd1\x81\xd0\xbc\xd0\xbe\xd1\x82\xd1\x80\xd0\xb5\xd1\x82\xd1\x8c \xd0\xbe\xd0\xbd\xd0\xbb\xd0\xb0\xd0\xb9\xd0\xbd'], 'clid': ['2196598'], 'lr': ['213'], 'redircnt': ['1467230336.1']}
минималистичный+стиль&newwindow=1&biw=1280&bih=909&source=lnms&tbm=isch&sa=X&ved=0ahUKEwikhI2M_s3NAhXBBiwKHfbEBEEQ_AUIBigB#imgrc=Er7qLiHoEGPIGM:
{'bih': ['909'], 'newwindow': ['1'], 'source': ['lnms'], 'ved': ['0ahUKEwikhI2M_s3NAhXBBiwKHfbEBEEQ_AUIBigB#imgrc=Er7qLiHoEGPIGM:'], 'tbm': ['isch'], 'biw': ['1280'], 'sa': ['X']}
минималистичный+стиль&newwindow=1&biw=1280&bih=909&source=lnms&tbm=isch&sa=X&ved=0ahUKEwikhI2M_s3NAhXBBiwKHfbEBEEQ_AUIBigB#imgrc=Er7qLiHoEGPIGM:
{'bih': ['909'], 'newwindow': ['1'], 'source': ['lnms'], 'ved': ['0ahUKEwikhI2M_s3NAhXBBiwKHfbEBEEQ_AUIBigB#imgrc=Er7qLiHoEGPIGM:'], 'tbm': ['isch'], 'biw': ['1280'], 'sa': ['X']}
rjulf+ddjlbim+ytdthysq+gby+rjl+d+,fyrjvfn&ie=utf-8&oe=utf-8&gws_rd=cr&ei=ezZ0V-7iOoab6ASvlJe4Dg
{'ie': ['utf-8'], 'oe': ['utf-8'], 'gws_rd': ['cr'], 'ei': ['ezZ0V-7iOoab6ASvlJe4Dg']}
маскаи гейла&lr=10750&clid=1985551-210&win=213
{'win': ['213'], 'clid': ['1985551-210'], 'lr': ['10750']}
1&q=как+выбрать+смартфон
{'q': ['\xd0\xba\xd0\xb0\xd0\xba \xd0\xb2\xd1\x8b\xd0\xb1\xd1\x80\xd0\xb0\xd1\x82\xd1\x8c \xd1\x81\xd0\xbc\xd0\xb0\xd1\x80\xd1\x82\xd1\x84\xd0\xbe\xd0\xbd']}
Jade+Jantzen&ie=utf-8&oe=utf-8&gws_rd=cr&ei=FQB0V9WbIoahsAH5zZGACg
{'ie': ['utf-8'], 'oe': ['utf-8'], 'gws_rd': ['cr'], 'ei': ['FQB0V9WbIoahsAH5zZGACg']}

Is there any way to get only the query text from all of these strings?

mhawke · Answer 1 · 2016-09-22T12:41:54.187

You can access the text using a dictionary lookup to get the list, then access the first element of the list:

d = {'text': ['\xd0\xba\xd0\xbe\xd1\x80\xd0\xbe\xd0\xbb\xd0\xb5\xd0\xb2\xd1\x8b \xd0\xba\xd1\x80\xd0\xb8\xd0\xba\xd0\xb0 \xd1\x81\xd0\xbc\xd0\xbe\xd1\x82\xd1\x80\xd0\xb5\xd1\x82\xd1\x8c \xd0\xbe\xd0\xbd\xd0\xbb\xd0\xb0\xd0\xb9\xd0\xbd'], 'clid': ['2196598'], 'lr': ['213'], 'redircnt': ['1467230336.1']}
text = d['text'][0]

>>> print text
королевы крика смотреть онлайн

Or you can get it directly from the parse_qs result:

>>> print urlparse.parse_qs(s)['text'][0]
королевы крика смотреть онлайн

To apply that to your code such that it will work for all values:

print {k: v[0] for k, v in parse_qs(str).items()}

i.e. take the first item of each value list.
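
As a small illustration of that comprehension, here is a sketch in Python 3, where `parse_qs` lives in `urllib.parse` rather than in the `urlparse` module used above. The query string is made up for the example. Note that a repeated key (here `lr`) keeps only its first value:

```python
from urllib.parse import parse_qs

# Hypothetical query string; 'lr' is repeated on purpose.
query = "clid=2196598&text=hello+world&lr=213&lr=10750"

# parse_qs maps every key to a list of values and decodes '+' as a space.
params = parse_qs(query)

# Keep only the first value for each key.
first_values = {k: v[0] for k, v in params.items()}
print(first_values)
```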

If you want to print the dictionaries and have the strings appear in the proper representation, i.e. not as produced by repr, you could use the json module to dump the dictionary objects as strings, then print them:

import json

d = {'text': ['\xd0\xba\xd0\xbe\xd1\x80\xd0\xbe\xd0\xbb\xd0\xb5\xd0\xb2\xd1\x8b \xd0\xba\xd1\x80\xd0\xb8\xd0\xba\xd0\xb0 \xd1\x81\xd0\xbc\xd0\xbe\xd1\x82\xd1\x80\xd0\xb5\xd1\x82\xd1\x8c \xd0\xbe\xd0\xbd\xd0\xbb\xd0\xb0\xd0\xb9\xd0\xbd'], 'clid': ['2196598'], 'lr': ['213'], 'redircnt': ['1467230336.1']}
s = json.dumps(d, ensure_ascii=False)

>>> print s
{"text": ["королевы крика смотреть онлайн"], "clid": ["2196598"], "lr": ["213"], "redircnt": ["1467230336.1"]}
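
In the question's output the query text sits under `text` in some URLs and under `q` in others, and in several it is missing from the dictionary altogether, because the pattern `\?\w+=(.*)` consumes the name of the first parameter. One possible way around that, sketched here for Python 3 (`urllib.parse`; in Python 2 the same functions are in `urlparse`), is to parse the whole URL and try a list of candidate keys. The function name `extract_search_text`, the `QUERY_KEYS` tuple and the sample URLs are assumptions for illustration. Only `text` and `q` come from the output above, and other search engines may need other keys.

```python
from urllib.parse import urlsplit, parse_qs

# Keys seen in the question's output; extend this for other search engines.
QUERY_KEYS = ("text", "q")

def extract_search_text(url, keys=QUERY_KEYS):
    """Return the first non-empty value found under one of `keys`, or None."""
    # urlsplit separates the query from the '#fragment', so nothing after '#'
    # leaks into a parameter value.
    query = urlsplit(url).query
    # parse_qs percent-decodes the values and turns '+' into spaces.
    params = parse_qs(query)
    for key in keys:
        values = params.get(key)
        if values:
            return values[0]
    return None

# Hypothetical sample URLs.
urls = [
    "https://www.google.com/search?q=python+urlparse&ie=utf-8#imgrc=abc",
    "https://yandex.ru/search/?lr=213&text=hello%20world",
    "https://example.com/page?id=1",
]
for u in urls:
    print(extract_search_text(u))
```

Because the full query string is parsed, the first parameter keeps its name, so a query stored there is found like any other.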

@PetrPetrov: I've misunderstood your question. Accessing the text is a simple dictionary lookup. — mhawke, Sep 22 '16 at 12:27
I described my problem in the question. Can you look at the strings with attributes? In some strings the text is in the attribute `text`, in some it is in the attribute `q`, and in some it is not in any attribute (`маскаи гейла&lr=10750&clid=1985551-210&win=213 {'win': ['213'], 'clid': ['1985551-210'], 'lr': ['10750']}`). I want to get the text from all of the strings — Petr Petrov, Sep 22 '16 at 12:37
@PetrPetrov: answer updated to show how to get the text from the `parse_qs()` result and use it to build a dictionary. — mhawke, Sep 22 '16 at 12:37
@PetrPetrov: sorry, but your question is extremely unclear. I think I understand now, check updated answer. — mhawke, Sep 22 '16 at 12:42
