2

I have an input text like this (the actual text file also contains lots of garbage characters surrounding these two strings):

(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)

I am trying to parse the text and store the results as value1="xxx" and value2="yyy". I wrote the following Python code:

value1_start = content.find('value')
value1_end = content.find(';', value1_start)

value2_start = content.find('value')
value2_end = content.find(';', value2_start)


print "%s" %(content[value1_start:value1_end])
print "%s" %(content[value2_start:value2_end])

But it always returns:

value=xxx
value=xxx

Could anyone tell me how I can parse the text so that the output is:

value=xxx
value=yyy
Mike Müller
weefwefwqg3

4 Answers

1

For this input:

content = '(random_garbage_char_here)**value=xxx**;(random_garbage_char_here)**value=yyy**;(random_garbage_char_here)'

use a simple regex and manually strip off the first and last two characters:

import re

values = [x[2:-2] for x in re.findall(r'\*\*value=.*?\*\*', content)]
for value in values:
    print(value)

Output:

value=xxx
value=yyy

Here the assumption is that there are always two leading and two trailing * as in **value=xxx**.
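
As a possible variation (a sketch, not part of the original answer), a capture group can do the stripping instead of the `[2:-2]` slice. It relies on the same assumption that every string is wrapped in exactly two asterisks on each side; the shortened `content` string below is only illustrative.

```python
import re

# Illustrative input; the real file is assumed to look similar
content = '(garbage)**value=xxx**;(garbage)**value=yyy**;(garbage)'

# With one capture group, findall returns only the captured text,
# so the surrounding ** markers are dropped automatically
values = re.findall(r'\*\*(value=.*?)\*\*', content)
print(values)  # ['value=xxx', 'value=yyy']
```

With a single group in the pattern, `re.findall` returns the group contents rather than the whole match, which is why no slicing is needed.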

Mike Müller
  • Sorry, I just edited my question. The text file does not contain only that string; it also contains a lot of non-printing and garbage characters surrounding it. – weefwefwqg3 Dec 30 '16 at 07:54
1

Use a regex approach:

re.findall(r'\bvalue=[^;]*', s)

Or, if the key can be any sequence of one or more word characters (letters, digits, underscores):

re.findall(r'\b\w+=[^;]*', s)

See the regex demo

Details:

  • \b - word boundary
  • value= - a literal char sequence value=
  • [^;]* - zero or more chars other than ;.

See the Python demo:

import re
rx = re.compile(r"\bvalue=[^;]*")
s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"
res = rx.findall(s)
print(res)
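
If the goal is to end up with the two values in separate variables, as the question describes, one option is to capture only the part after `value=`. This is a sketch that assumes the input contains exactly two matches; the names `value1` and `value2` come from the question, and `found` is just an illustrative name.

```python
import re

s = "$%$%&^(&value=xxx;$%^$%^$&^%^*value=yyy;%$#^%"

# Capture only the text after "value=", up to (not including) the next ";"
found = re.findall(r"\bvalue=([^;]*)", s)

# Assumption: exactly two matches; unpacking raises ValueError otherwise
value1, value2 = found
print(value1, value2)  # xxx yyy
```

For an unknown number of matches, keeping the list `found` as it is would be safer than unpacking.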
Wiktor Stribiżew
  • my text has something like this: (random_text)avalue=xxx;(random_text),value=yyy; so I had to remove \b to parse both values. If \b is there, the code only parses the second value=yyy. Btw, it works now. Thank you for your dedicated answer. – weefwefwqg3 Dec 30 '16 at 08:26
  • 1
    Great you could adjust the pattern as per your real data, that's why I always provide explanation of the patterns I suggest. Yes, `\b` requires a non-word char or start of string before `v`, and if you need to match all attributes that *end with* `value`, you might try `\w*value=[^;]*`. – Wiktor Stribiżew Dec 30 '16 at 08:34
  • Hi, could you please tell me what the regex should be if the string I want to parse ends with ;;$ (two consecutive semicolons and a dollar sign)? I tried re.compile(r"param=[^;;$]*") to get the value, but it did not work. – weefwefwqg3 Jan 02 '17 at 18:15
  • 1
    No, a negated character class matches only one char at a time (any char not listed in it), so it cannot exclude a multi-character sequence. You need `r'param=(.*?);;\$'`
  • OH I see. Thank you so much for your help. – weefwefwqg3 Jan 02 '17 at 18:24
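
The difference between the two patterns from the last comments can be seen in a small sketch. The input string `s` below is made up for illustration; only the two patterns come from the comments.

```python
import re

# Hypothetical input: the first value itself contains ';' and '$'
s = "@#param=a;b$c;;$!!param=second;;$"

# [^;;$] is the same class as [^;$]: matching stops at the first ';' or '$'
stops_early = re.findall(r"param=[^;;$]*", s)
print(stops_early)  # ['param=a', 'param=second']

# Lazy match up to the literal three-character terminator ';;$'
up_to_terminator = re.findall(r"param=(.*?);;\$", s)
print(up_to_terminator)  # ['a;b$c', 'second']
```

Note that `.` does not match a newline by default, so a value spanning several lines would need the `re.DOTALL` flag.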
1

Use regex to filter the data you want from the "junk characters":

>>> import re
>>> _input = '#4@5%value=xxx#38u952035983049;3^&^*(^%$3#value=yyy#%$#^&*^%;$#%$#^'
>>> matches = re.findall(r'[a-zA-Z0-9]+=[a-zA-Z0-9]+', _input)
>>> matches
['value=xxx', 'value=yyy']
>>> for match in matches:
...     print(match)
...
value=xxx
value=yyy

Summary of the regular expression:

  • [a-zA-Z0-9]+: One or more alphanumeric characters
  • =: literal equal sign
  • [a-zA-Z0-9]+: One or more alphanumeric characters
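
One caveat worth checking against real data: because both character classes accept any letter or digit, alphanumeric garbage that sits directly next to the key or the value becomes part of the match. The sketch below illustrates this with made-up inputs; `PAIR` and `extract_pairs` are hypothetical names, and the capture groups are an addition that splits each match into key and value.

```python
import re

# Same pattern as above, with groups around the key and the value
PAIR = re.compile(r'([a-zA-Z0-9]+)=([a-zA-Z0-9]+)')

def extract_pairs(text):
    """Return a list of (key, value) tuples found in text."""
    return PAIR.findall(text)

# Garbage next to the pairs is non-alphanumeric: clean result
clean = '#%value=xxx;^&*value=yyy;$#'
print(extract_pairs(clean))  # [('value', 'xxx'), ('value', 'yyy')]

# Alphanumeric garbage touching the pair is absorbed into the match
noisy = '#%3value=xxx38;'
print(extract_pairs(noisy))  # [('3value', 'xxx38')]
```

If the key is known in advance and the value always ends at `;`, anchoring on the literal key and stopping at the delimiter is likely to be more robust.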
Christian Dean
1

You already have good answers based on the re module. That would certainly be the simplest way.

If for any reason (performance?) you prefer to use str methods, that is also possible, but you must start the search for the second string past the end of the first one:

value2_start = content.find('value', value1_end)
value2_end = content.find(';', value2_start)
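
The same idea extends to any number of occurrences by repeating the search from the end of the previous match. This is only a sketch: the function name `find_values` and its parameters are hypothetical, and it assumes each value is terminated by `;`.

```python
def find_values(content, key='value', terminator=';'):
    """Collect every 'key...terminator' slice using only str.find."""
    results = []
    start = content.find(key)
    while start != -1:
        end = content.find(terminator, start)
        if end == -1:
            # No terminator left: keep the rest of the string and stop
            results.append(content[start:])
            break
        results.append(content[start:end])
        # Resume the search past the end of the previous match
        start = content.find(key, end)
    return results

content = '#@!value=xxx;%^&value=yyy;$$'
print(find_values(content))  # ['value=xxx', 'value=yyy']
```

The second argument of `str.find` is what prevents the loop from returning the first occurrence again, which was the bug in the question's code.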
Serge Ballesta