3

I want to split a string of keywords that are separated by a comma, a semicolon, or a hyphen with a space on each side.
The reason is the inconsistent structure of a website I am scraping with Scrapy.
So far, I am able to split comma- or semicolon-separated keywords with the following code:

for i in response.xpath('//meta[@name="keywords"]/@content').extract():
        if ',' or ';' in i:
            for k in i.split(',') or i.split(';'):
                keywords.append([k.strip()])
        else:
            keywords.append([i.strip()])

That works if the words are separated like:

  • keyword1, keyword2, keyword3
  • keyword1; keyword2; keyword3

But sometimes the keywords are also stored as follows:

keyword1 - keyword2 - keyword3

I don't know how to split them properly, because the spaces around the hyphens are giving me a headache :). Help is very much appreciated!

סטנלי גרונן
Dan
  • `the spaces in between the hyphens are...` - How are they causing a problem for you? - you should be more explicit about that in your question. – wwii Nov 23 '19 at 05:24
  • At first I thought Python isn't able to recognize the spaces by simply adding a space like ' - '. So, I thought I need to specify that there is a space in my code. But as I posted below just now, I could simply solve it by using an elif-statement. – Dan Nov 24 '19 at 14:26

7 Answers

2

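You may want to use regular expressions. re.split(r'\s*-\s*', mystring) should do the job. Note the raw string, and that this pattern splits on every hyphen, with or without surrounding spaces.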
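
As a minimal sketch of that suggestion (the sample string is made up for illustration): requiring at least one whitespace character on each side of the hyphen, `\s+-\s+`, is an assumption about the data that keeps hyphenated keywords such as `e-mail` intact.

```python
import re

# Hypothetical sample; "e-mail" contains a hyphen that is not a separator.
mystring = "keyword1 - e-mail - keyword3"

# \s*-\s* splits on every hyphen, including the one inside "e-mail".
loose = re.split(r'\s*-\s*', mystring)

# \s+-\s+ only splits on hyphens surrounded by whitespace.
strict = re.split(r'\s+-\s+', mystring)

print(loose)
print(strict)
```

Which variant fits depends on whether the scraped keywords can themselves contain hyphens.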

Patol75
0

Have you tried:

"keyword1 - keyword2 - keyword3".split(' - ')
#  ['keyword1', 'keyword2', 'keyword3']

oppressionslayer
  • can you post the website, i'll take a look, i don't mind – oppressionslayer Nov 23 '19 at 05:23
  • Thank you for taking the time to check it. But as I posted just now, I could solve the problem in a different and very simple way. Just made a mistake in my code by trying to use the "or"-statement. The "elif"-statement was the solution. – Dan Nov 24 '19 at 14:29
  • @Dan, nice. you should use the answer your own question option, i'll +1 it. – oppressionslayer Nov 24 '19 at 21:53
0

You may want to look into regular expressions:

import re

lines = """keyword1, keyword2, keyword3
keyword1; keyword2; keyword3
keyword1 - keyword2 - keyword3
""".splitlines()

delim = re.compile(r'\s*[-,;]\s+')
for line in lines:
    print(delim.split(line))
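
For reference, a small check of what this pattern does (the edge-case strings are hypothetical, not from the question). Because it requires whitespace after the delimiter, it leaves a hyphen inside a word alone, but it also does not split on a comma that has no following space.

```python
import re

delim = re.compile(r'\s*[-,;]\s+')

# The three formats from the question all give the same list.
for line in ("keyword1, keyword2, keyword3",
             "keyword1; keyword2; keyword3",
             "keyword1 - keyword2 - keyword3"):
    assert delim.split(line) == ["keyword1", "keyword2", "keyword3"]

# Hypothetical edge cases, not taken from the question:
hyphenated = delim.split("e-mail, web-design")   # inner hyphens are kept
no_space = delim.split("keyword1,keyword2")      # no whitespace after the comma
print(hyphenated)
print(no_space)
```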
0

Calling replace(' - ', '; ') on each string converts keywords separated by a hyphen with a space on each side into keywords separated by a semicolon and a space. Add that to your code before the if statement and you should be good to go.

Code:

data = ['Keyword1 - Keyword2 - Keyword3','Keyword4 - Keyword5']

final = [item.replace(" - ", "; ") for item in data]

print(final)

Output:

['Keyword1; Keyword2; Keyword3', 'Keyword4; Keyword5']
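
To connect this to the splitting step, here is a sketch that normalises the hyphen separator and then splits on the semicolon. The names `data` and `keywords` mirror the snippets in this thread; collecting a flat list of strings is an assumption (the question appends one-element lists instead).

```python
data = ['Keyword1 - Keyword2 - Keyword3', 'Keyword4; Keyword5']

keywords = []
for item in data:
    # Turn ' - ' separators into '; ' so one split handles both formats.
    normalised = item.replace(' - ', '; ')
    keywords.extend(k.strip() for k in normalised.split(';'))

print(keywords)
```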
Moein Kameli
Jkiefn1
0

You can first use strip(), then split:

"keyword1 - keyword2 - keyword3".strip().split(' - ')
hamzeh_pm
0

You can simply replace all the special characters that are giving you a headache with whitespace, then split.

import re
string = "keyword - keyword; keyword,keyword-keyword"
re.sub("[-;,]", " ", string).split()

Output:

['keyword', 'keyword', 'keyword', 'keyword', 'keyword']
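
One caveat worth checking before using this: `split()` with no argument splits on any whitespace, so multi-word or hyphenated keywords are broken apart as well. A sketch with a made-up string:

```python
import re

# Hypothetical keywords containing a space and an inner hyphen.
string = "machine learning; e-mail"

# Replacing the delimiters with spaces and splitting on whitespace
# also splits inside the keywords.
pieces = re.sub("[-;,]", " ", string).split()
print(pieces)
```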
Moein Kameli
0

The problem turned out to be in the code I posted in my original question, not in the spaces around the hyphens. I could solve it simply by using elif statements, as follows:

for i in response.xpath('//meta[@name="keywords"]/@content').extract():
        if ',' in i:
            for k in i.split(','):
                keywords.append([k.strip()])
        elif ';' in i:
            for k in i.split(';'):
                keywords.append([k.strip()])
        elif ' – ' in i:
            for k in i.split(' – '):
                keywords.append([k.strip()])
        else:
            keywords.append([i.strip()])
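
For anyone wondering why the original `if ',' or ';' in i` did not work: Python parses it as `',' or (';' in i)`, and the non-empty string `','` is always truthy, so the `else` branch is never reached. Likewise, `i.split(',') or i.split(';')` always returns the first list, because `split` with an explicit separator never returns an empty list. A small sketch of the same idea as a function; `split_keywords` is a hypothetical name, returning a flat list is an assumption, and it uses the plain hyphen shown in the question.

```python
def split_keywords(text):
    """Split on the first delimiter found: ',', ';', or ' - '."""
    for sep in (',', ';', ' - '):
        if sep in text:
            return [k.strip() for k in text.split(sep)]
    return [text.strip()]

# The original condition is always truthy, whatever the string contains.
always_true = bool(',' or ';' in "keyword1 - keyword2")

print(always_true)
print(split_keywords("keyword1; keyword2; keyword3"))
print(split_keywords("keyword1 - keyword2 - keyword3"))
```

If the site uses a different dash character, add that separator (with its surrounding spaces) to the tuple.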

Anyway, thank you all for your suggestions on solving this issue.

Dan