2

I have a string variable called link that holds the response from a remote site. How can I extract the value after the equals sign in token=? For example, I want to grab "234132421reafdfasdfsdfdsf3234423edfasfdsf" from the following line.

file: "http://www.aaastreams.com/playlist.m3u8?token=234132421reafdfasdfsdfdsf3234423edfasfdsf" 
});

Python code:

import urllib2

req = urllib2.Request('http://www.somesite.com/test.php')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0')
response = urllib2.urlopen(req)
link = response.read()

Sample response value from "print link;":

.......rest of response
    <script>

    jwplayer("container").setup({
    width:700,
    height:220,
    primary: "hls",
    title:"streams",
    autostart:true,

    image: "./1.jpg",
    file: "http://www.aaastreams.com/playlist.m3u8?token=234132421reafdfasdfsdfdsf3234423edfasfdsf" 
    });

    jwplayer().onError(function(){
    jwplayer().load({file:"http://www.aaa.com/jwplayer/ads.mp4",image:"http://aaa.com/2.png"});
    jwplayer().play();
    });

    </script>
.......rest of response
user1788736

5 Answers

3

A better approach to parsing a URL is to use the urlparse module.

Here's an example:

from urlparse import urlparse, parse_qs

url = "http://www.aaastreams.com/playlist.m3u8?token=234132421reafdfasdfsdfdsf3234423edfasfdsf"
query = urlparse(url).query
params = parse_qs(query)

params will hold a dictionary with your token and any other query parameters in the URL. Each key maps to a list of values, so the token itself is params['token'][0].
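
To make the last step concrete, here is a minimal sketch of reading the token out of params. It assumes Python 3, where urlparse and parse_qs live in urllib.parse rather than in the urlparse module; the variable name token is only illustrative.

```python
from urllib.parse import urlparse, parse_qs  # Python 3 location of these functions

url = "http://www.aaastreams.com/playlist.m3u8?token=234132421reafdfasdfsdfdsf3234423edfasfdsf"

# parse_qs maps each parameter name to a list of values,
# because a name may appear more than once in a query string.
params = parse_qs(urlparse(url).query)

# Take the first (here: only) value stored under "token".
token = params["token"][0]
print(token)
```

Note that this applies only once the URL has been isolated as a string of its own; it does not search a larger block of HTML.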

eugecm
  • Thanks for the reply. But my link string is not a URL; it is a chunk of HTML code in which "file:.." appears twice. Is there a way to use this information to limit where to look for the token value? – user1788736 Dec 01 '15 at 00:56
2

After trying different solutions, I came up with what I think is the easiest way to solve this problem:

  import re
  # [^"]* stops at the first double quote after token=
  tokenValue = re.search('token=([^"]*)"', link)
  print tokenValue.group(1)
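
Since "file:" appears more than once in the response, one possible refinement is to anchor the pattern on the file: key so that only a token inside that quoted URL is considered. The following is a sketch under assumptions, written for Python 3: the link text is a shortened stand-in for the real response, and the names pattern and token_value are illustrative.

```python
import re

# Shortened stand-in for the response text shown in the question
# (assumption: the real page has the same shape).
link = '''
    jwplayer("container").setup({
    image: "./1.jpg",
    file: "http://www.aaastreams.com/playlist.m3u8?token=234132421reafdfasdfsdfdsf3234423edfasfdsf"
    });
    jwplayer().onError(function(){
    jwplayer().load({file:"http://www.aaa.com/jwplayer/ads.mp4",image:"http://aaa.com/2.png"});
    });
'''

# file:\s*"   the file key and its opening quote
# [^"]*       the URL up to the parameter, never crossing a quote
# [?&]token=  the token parameter itself
# ([^"&]+)    the value: everything up to the next quote or ampersand
pattern = re.compile(r'file:\s*"[^"]*[?&]token=([^"&]+)')

match = pattern.search(link)
token_value = match.group(1) if match else None  # None when nothing matches
print(token_value)
```

Checking for a failed match before calling group(1) avoids an AttributeError when the page layout changes.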
user1788736
1

Rather than parsing the URL any further, and assuming there is only one equals sign in the whole string, I would suggest some simple string manipulation like this:

In [1]: s = "http://www.aaastreams.com/playlist.m3u8?token=234132421reafdfasdfsdfdsf3234423edfasfdsf"
In [2]: s.split('=')[1]
Out[2]: '234132421reafdfasdfsdfdsf3234423edfasfdsf'
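
When s is a larger chunk of text rather than a bare URL, the same idea can be sketched by splitting once on the literal token= and then once on the closing double quote. The string below is a shortened, hypothetical stand-in for the response, and the names after_token and token are illustrative.

```python
# Shortened stand-in for the HTML chunk (assumption: token= is present).
s = 'width:700,\nfile: "http://www.aaastreams.com/playlist.m3u8?token=234132421reafdfasdfsdfdsf3234423edfasfdsf"\n});'

# Split once on the literal 'token=' and keep what follows it.
# This raises IndexError if 'token=' does not occur in s.
after_token = s.split('token=', 1)[1]

# Split once on the next double quote and keep what precedes it.
token = after_token.split('"', 1)[0]
print(token)
```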
albert
  • Thanks for the reply. As I noted in my post, there might be other equals signs in the link string, so I would probably get the wrong value with your approach. But "file:" appears only twice in the whole string; is there a way to use that information to avoid getting the wrong value? – user1788736 Dec 01 '15 at 00:54
  • @user1788736 I think you can give split a second argument, the maximum number of splits you want: `splitresult = s.split("=", 1)`. Then do something like `splitresult[1] if len(splitresult) > 1 else ""`. – Busturdust Dec 01 '15 at 00:57
  • I used tokenValue = link.split('token=')[1] and I get the whole string after token=, but I want only the token value. Is there a way to get only the token value and not everything in the link string after token=? – user1788736 Dec 01 '15 at 01:20
  • According to the snippet you provided in your question and my answer, the token value should be `234132421reafdfasdfsdfdsf3234423edfasfdsf`, shouldn't it? What do you mean by 'token value' and 'whole link string'? – albert Dec 01 '15 at 01:26
  • The text below "sample response value from "print link;":" in my question is the "s" variable in your example, so I want to get the value between token= and the first occurrence of a double quote. – user1788736 Dec 01 '15 at 01:37
  • So, instead of `234132421reafdfasdfsdfdsf3234423edfasfdsf` you got `234132421reafdfasdfsdfdsf3234423edfasfdsf"`? If so, you can change my input line [2] from `s.split('=')[1]` to `s.split('=')[1][:-1]` in order to select the part up to the double quote only. – albert Dec 01 '15 at 01:40
  • The "s" variable is a chunk of HTML, and I want the value between token= and the first occurrence of a double quote (I want 234132421reafdfasdfsdfdsf3234423edfasfdsf, not the whole HTML chunk after token=). – user1788736 Dec 01 '15 at 01:49
  • So, step by step: Were you able to extract the `file` string including your url from your response in order to have this url as a single string? – albert Dec 01 '15 at 01:52
0

You can use regular expressions with capture groups.

A complete explanation of this can be found here; scroll down to the section labeled "Groups".

"Groups are marked by the '(', ')' metacharacters. '(' and ')' have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as , +, ?, or {m,n}. For example, (ab) will match zero or more repetitions of ab.

>>> import re
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'

So in this case you could use a regular expression like:

'^ *[a-zA-Z]+ *=(.*)$'

Group 0 is the entire match; groups 1 and onward correspond to the parts of the expression enclosed in parentheses, numbered by the position of each opening parenthesis. These groups can be nested.

Note, this is a general approach and not specific to URLs
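
As a small illustration of how the numbered groups come back, here is a sketch that matches a "name = value" line. The sample line is hypothetical, and the pattern assumes a letters-only identifier on the left of the equals sign; it is not tailored to the HTML in the question.

```python
import re

# Hypothetical "name = value" line, used only to illustrate the groups.
line = '  token = 234132421reafdfasdfsdfdsf3234423edfasfdsf'

# Group 1 captures the identifier, group 2 everything after the equals sign.
pattern = re.compile(r'^ *([a-zA-Z]+) *=(.*)$')
m = pattern.match(line)

whole = m.group(0)   # the entire match
name = m.group(1)    # the identifier, surrounding spaces excluded
value = m.group(2)   # the right-hand side, leading space included
print(name, value.strip())
```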

Ken Clement
  • Could you please provide a full answer including a code snippet showing what you suggest in order to improve your answer's quality? – albert Dec 01 '15 at 00:47
  • @Albert, will that do? – Ken Clement Dec 01 '15 at 01:03
  • It's a general explanation of regex and not for the given problem of parsing the token out of the URL. Please adapt it to solve the specific problem in the question above. – albert Dec 01 '15 at 01:05
  • Thanks for the replies. So how do I use a regular expression to get the value of token= and exclude the rest of the link string? (this value => token=.......") – user1788736 Dec 01 '15 at 01:22
  • @user1788736, you place in parentheses the part of the value you want to extract, so the regex specified above could be modified to '^ *([a-zA-Z]+) *=(.*)$'. In this case group 0 is everything and group 1 is the identifier (surrounding blank spaces excluded). Right now the identifier is specified as alphabetic; you can add 0-9 to include numerals, and also - (dash) and _ (underscore), prefixing those with \ to make clear you are not using them as metacharacters. Group 2 is the right-hand side. – Ken Clement Dec 01 '15 at 01:33
0

Consider the built-in string method:

str.partition(sep)

Split the string at the first occurrence of sep, and return a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing the string itself, followed by two empty strings.

In your case you could use "=" as the separator (sep). str is the long string with "=" in it.
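
A minimal sketch of this approach follows. Because the question notes that other equals signs may appear in the response, the sketch uses "token=" as the separator instead of "=" (a deliberate substitution, not what the answer literally suggests), and then partitions again on the closing double quote. The link text is a shortened stand-in for the real response.

```python
# Shortened stand-in for the response text (assumption).
link = 'image: "./1.jpg",\nfile: "http://www.aaastreams.com/playlist.m3u8?token=234132421reafdfasdfsdfdsf3234423edfasfdsf"\n});'

# Everything after the first 'token='. If the separator is absent,
# found and rest are both empty strings.
_, found, rest = link.partition('token=')

# Everything before the next double quote.
token, _, _ = rest.partition('"')

print(token if found else None)
```

Unlike indexing into the result of split, partition always returns three items, so no IndexError can occur when the separator is missing.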

Riccati