
Let's say I'm reading a txt file in Python that looks something like this:

.. Keywords- key1; key2, key3; key4 Abstract .. ..

Now I want to parse the file until I find the word "Keywords", and then put all the keywords into a list, so that the list looks something like this: ["key1", "key2", "key3", "key4"]

So the keywords are everything between "Keywords-" and the word "Abstract", and they can be separated by a comma (,), a semicolon (;), or a combination of both.

How should I go about this?

mihika

2 Answers


Here's one way using regex

import re

input_str = "this is a test Keywords- key1; key2, key3; key4 Abstract other stuff here"
p = re.compile(r'Keywords- (.+?)Abstract')
# Search once; split the captured text on ';' or ',' and trim whitespace.
match = p.search(input_str)
output = [v.strip() for v in re.split(';|,', match.group(1))] if match else []

This gives an empty list if there is no match, or otherwise a list of the keywords with whitespace trimmed. For this example the resulting list is:

['key1', 'key2', 'key3', 'key4']

I use re.split because it supports splitting on multiple separators, so if you have additional separators you can just add them as further pipe-separated alternatives in the pattern.
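Since the question is about text read from a file, here is a minimal sketch of the same idea wrapped in a function. The name extract_keywords, the KEYWORDS_RE constant, and the use of re.DOTALL (so the keyword list may wrap across lines) are my own assumptions and not something the question specifies.

```python
import re

# Hypothetical helper names; re.DOTALL is an assumption that lets the
# keyword list span several lines between "Keywords-" and "Abstract".
KEYWORDS_RE = re.compile(r'Keywords-\s*(.+?)\s*Abstract', re.DOTALL)

def extract_keywords(text):
    """Return the keywords between 'Keywords-' and 'Abstract', or [] if absent."""
    match = KEYWORDS_RE.search(text)
    if match is None:
        return []
    # Split on ';' or ',' and drop empty pieces left by doubled separators.
    parts = re.split(';|,', match.group(1))
    return [p.strip() for p in parts if p.strip()]
```

With a file you would read the whole content first, for example extract_keywords(f.read()) inside a with open(...) block, rather than going line by line, so that a keyword list broken over two lines is still found.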

Steve Mapes

Here is another regex version. It's the same as Steve's, but without the list comprehension.


import re

s = '''Keywords- key1; key2, key3; key4 Abstract stuff
 some of other text Keywords- key1; key2, key3; key4 Abstract
Keywords- key1; key2, key3; key4 Abstract
Keywords- key1; key2, key3; key4 Abstract'''

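# Capture the text between "Keywords-" and the next "Abstract" (non-greedy)
extract = r'Keywords-\s(.*?)\sAbstract'
keywordList = re.findall(extract, s)

# Each keyword is treated as a run of word characters
reg = r'\w+'

keywords = []
for chunk in keywordList:
    keywords += re.findall(reg, chunk)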

print(keywords)


# ['key1', 'key2', 'key3', 'key4', 'key1', 'key2', 'key3', 'key4', 'key1', 'key2', 'key3', 'key4', 'key1', 'key2', 'key3', 'key4']
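One thing to watch with the \w+ approach: it also breaks up keywords that contain spaces. A small sketch of the difference, using a made-up keyword string (the sample text and variable names are only for illustration):

```python
import re

# Made-up sample: keywords that contain spaces.
chunk = "machine learning; neural nets, NLP"

# \w+ matches every run of word characters, so multi-word keywords are broken up.
by_word = re.findall(r'\w+', chunk)

# Splitting on the separators keeps each keyword whole.
by_separator = [k.strip() for k in re.split(';|,', chunk) if k.strip()]

print(by_word)       # ['machine', 'learning', 'neural', 'nets', 'NLP']
print(by_separator)  # ['machine learning', 'neural nets', 'NLP']
```

If your keywords are always single words, both give the same result.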
KJDII