How can I split text using pyparsing with a specific token?

Question

PLEASE NOTE: Splitting text into lines with pyparsing is about parsing a file using a single token at the end of each line, \n, which is fairly easy. My question is different: I am having a hard time making the free text entered before the filters stop at the word that precedes a :, so that this word is excluded from the free text.

In our API I receive user input such as `some free text port:45 title:welcome to our website`, and after parsing I need to end up with 2 parts -> [some free text, port:45 title:welcome]

from pyparsing import *
token = "some free text port:45 title:welcome to our website"
t = Word(alphas, " "+alphanums) + Word(" "+alphas,":"+alphanums)
t.parseString(token)  # raises the ParseException shown below

This gives me an error:

pyparsing.ParseException: Expected W:( ABC..., :ABC...), found ':'  (at char 21), (line:1, col:22)

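This happens because the first Word consumes everything up to and including some free text port, which leaves :45 title:welcome to our website for the second Word, and that cannot start with a colon.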

How can I get all data before port: in a separate group and port:.... in another group using pyparsing?

I don't know about pyparsing, but you can use split: `x = token.split(' ')`, then loop over the items, check whether one equals port, and take action — DeadSec, Mar 16 '21 at 09:08
@DeadSec that's very impractical in production and I tend to use `pyparsing` or similar tools to handle this issue. Moreover it is not a fixed string like port. it can be anything! — Alireza, Mar 16 '21 at 09:12
Does this answer your question? [Splitting text into lines with pyparsing](https://stackoverflow.com/questions/31564199/splitting-text-into-lines-with-pyparsing) — DeadSec, Mar 16 '21 at 09:16

score 1 · Answer 1 · answered Mar 16 '21 at 09:35

I know that the question is about pyparsing, but for this specific use case I think a regex is far more standard and simpler, whereas pyparsing is probably better suited to more complicated parsing problems.

Here is one possible working regex: ^(.+port:\d+) (title:.+)$

And here is the Python code:

import re
pattern = r"^(.+port:\d+) (title:.+)$"  # raw string, so \d is not treated as a string escape
token = "some free text port:45 title:welcome to our website"
m = re.match(pattern, token)
if m:
    grp1, grp2 = m.group(1), m.group(2)
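
If the keyword is not fixed, the same regex idea can be loosened to split at the first `key:value` token instead of at a hard-coded `port:`. This is only a minimal sketch under an assumption that is not stated in the question: a filter is any whitespace-delimited token that starts with word characters followed by a colon. The names `FIRST_FILTER` and `split_at_first_filter` are hypothetical.

```python
import re

# Assumption: a filter is any "key:value" token whose key consists of word
# characters and which starts at the beginning of the input or after whitespace.
FIRST_FILTER = re.compile(r"(?<!\S)\w+:\S")

def split_at_first_filter(text):
    """Return (free_text, filters), split where the first key:value token starts."""
    m = FIRST_FILTER.search(text)
    if m is None:
        # No filter found: everything is free text
        return text.strip(), ""
    return text[:m.start()].strip(), text[m.start():].strip()

free_text, filters = split_at_first_filter(
    "some free text port:45 title:welcome to our website"
)
print(free_text)  # some free text
print(filters)    # port:45 title:welcome to our website
```

Note that this would also treat something like a URL in the free text as a filter. With a known list of keywords, the `\w+` part could be replaced by an alternation of those keywords to avoid such false matches.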

Thank you for your answer. `port` is only a sample search keyword here; we have more than 30 search keywords that a user can enter. — Alireza, Mar 16 '21 at 09:43

score 0 · Accepted Answer · answered Mar 17 '21 at 06:01

Adding " " as one of the valid characters in a Word pretty much always has this problem, and so is in general a pyparsing anti-pattern. Word does its character repetition matching inside its parse() method, so there is no way to add any kind of lookahead.

To get spaces in your expressions, you will probably need a OneOrMore, wrapped in originalTextFor, like this:

import pyparsing as pp

word = pp.Word(pp.printables, excludeChars=":")

non_tag = word + ~pp.FollowedBy(":")

# tagged value is two words with a ":"
tag = pp.Group(word + ":" + word)

# one or more non-tag words - use originalTextFor to get back 
# a single string, including intervening white space
phrase = pp.originalTextFor(non_tag[1, ...])

parser = (phrase | tag)[...]

parser.runTests("""\
    some free text port:45 title:welcome to our website
    """)

Prints:

some free text port:45 title:welcome to our website
['some free text', ['port', ':', '45'], ['title', ':', 'welcome'], 'to our website']
[0]:
  some free text
[1]:
  ['port', ':', '45']
[2]:
  ['title', ':', 'welcome']
[3]:
  to our website
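
For an API handler, it may be more convenient to separate the phrases from the tagged values after parsing. The following is only a sketch that reuses the grammar from this answer; the helper name `separate_tags` is hypothetical (it is not part of pyparsing), and it assumes that a repeated key may simply overwrite the earlier value.

```python
import pyparsing as pp

# Same grammar as in the answer
word = pp.Word(pp.printables, excludeChars=":")
non_tag = word + ~pp.FollowedBy(":")
tag = pp.Group(word + ":" + word)
phrase = pp.originalTextFor(non_tag[1, ...])
parser = (phrase | tag)[...]

def separate_tags(text):
    """Hypothetical helper: split the parse results into phrases and a tag dict."""
    phrases, tags = [], {}
    for item in parser.parseString(text):
        if isinstance(item, str):
            phrases.append(item)        # a phrase comes back as a single string
        else:
            key, _colon, value = item   # a Group holding [key, ':', value]
            tags[key] = value           # a repeated key overwrites the earlier one
    return phrases, tags

phrases, tags = separate_tags("some free text port:45 title:welcome to our website")
print(phrases)  # ['some free text', 'to our website']
print(tags)     # {'port': '45', 'title': 'welcome'}
```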
