I'm trying to clean up sentences in order to create better word clouds, and I'm having an issue with words that belong together being split up at their hyphens.
An extreme case is the following, where I am dropping all numbers. 2-Mics should be found in the image instead of just Mics:
"text": "ReSpeaker 2-Mics Pi HAT - Seeed Wiki",
"lang": "English",
"confidence": 97.0,
"tags": [
[
"Mics",
"NUM"
],
[
"Pi",
"NOUN"
],
[
"HAT",
"PROPN"
],
[
"Seeed",
"NUM"
],
[
"Wiki",
"NOUN"
]
]
},
Likewise, K2-18b would be more meaningful than having K2 and, somewhere else in the word cloud, 18b:
{
    "text": "Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE",
    "lang": "German",
    "confidence": 98.0,
    "tags": [
        ["Supererde", "PROPN"],
        ["Forscher", "NOUN"],
        ["finden", "VERB"],
        ["Wasser", "NOUN"],
        ["K2", "PROPN"],
        ["18b", "PROPN"],
        ["SPIEGEL", "PROPN"],
        ["ONLINE", "PROPN"]
    ]
},
Standalone dashes can be removed; that is completely fine, for example the one between K2-18b and SPIEGEL in the segment K2-18b - SPIEGEL.
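Just to illustrate the distinction I mean: a hyphen only matters when it sits directly between word characters. A minimal, hypothetical sketch (the regex and the helper name are mine, not anything built into polyglot):

import re

# One or more hyphen-joined word chunks, e.g. "2-Mics", "K2-18b" or
# "docker-spacy-alpine"; the free-standing " - " separator never matches.
HYPHENATED = re.compile(r'\w+(?:-\w+)+')

def hyphenated_compounds(title):
    """Return all hyphenated compounds found in a raw title string."""
    return HYPHENATED.findall(title)

print(hyphenated_compounds("Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE"))
# ['K2-18b']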
Here's another case where respecting the hyphens would be meaningful:
{
    "text": "docker-spacy-alpine/Dockerfile at master \u00b7 cluttered-code/docker-spacy-alpine",
    "lang": "English",
    "confidence": 98.0,
    "tags": [
        ["docker", "NUM"],
        ["spacy", "NUM"],
        ["Dockerfile", "NUM"],
        ["master", "NOUN"],
        ["cluttered", "VERB"],
        ["code", "NOUN"],
        ["docker", "NUM"],
        ["spacy", "NUM"],
        ["alpine", "ADJ"]
    ]
},
since this would then end up as docker-spacy-alpine, Dockerfile and cluttered-code in the image, with docker-spacy-alpine being more prominent.
This is the code I'm using:

from polyglot.text import Text
import traceback
#...
for item in result:
    if 'title' in item:
        text = Text(item['title'])
        if text.language.code in ['en', 'de']:
            tags = []
            try:
                unfiltered_tags = text.pos_tags
                for tag in unfiltered_tags:
                    try:
                        # drop tokens that parse as plain numbers
                        x = float(tag[0])
                    except ValueError:
                        if tag[1] in ['NUM', 'ADJ', 'VERB', 'PROPN', 'INTJ', 'NOUN']:
                            tags.append(tag)
            except Exception:
                traceback.print_exc()
            titles.append({
                'text': item['title'],
                'lang': text.language.code,
                'confidence': text.language.confidence,
                'tags': tags,
            })
Is there a way to tune polyglot so it does not do this splitting, or do I need to do some manual post-processing on the sentences?
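If manual post-processing turns out to be the way to go, this is roughly what I would try. It is an untested sketch with made-up helper names, and it relies on two assumptions: that polyglot's tokenizer keeps underscore-joined tokens together (which I have not verified), and that the titles do not already contain underscores (restore_hyphens would turn those into hyphens as well).

import re
from polyglot.text import Text

# An in-word hyphen is a "-" directly between word characters ("2-Mics",
# "docker-spacy-alpine"), as opposed to the free-standing " - " separator.
IN_WORD_HYPHEN = re.compile(r'(?<=\w)-(?=\w)')

def protect_hyphens(title):
    # Swap in-word hyphens for "_" so the tokenizer hopefully keeps the compound intact.
    return IN_WORD_HYPHEN.sub('_', title)

def restore_hyphens(token):
    # Map the placeholder back to a hyphen for the word cloud.
    return token.replace('_', '-')

text = Text(protect_hyphens("ReSpeaker 2-Mics Pi HAT - Seeed Wiki"))
tags = [(restore_hyphens(word), pos) for word, pos in text.pos_tags]
print(tags)
# Hoping for something like [..., ('2-Mics', ...), ('Pi', ...), ('HAT', ...), ...]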