
I'm trying to clean up sentences in order to create better word clouds, and I'm having an issue with hyphens: they split up words that belong together.

An extreme case is the following, where I am dropping all number tokens. 2-Mics should appear in the image instead of just Mics:

{
  "text": "ReSpeaker 2-Mics Pi HAT - Seeed Wiki",
  "lang": "English",
  "confidence": 97.0,
  "tags": [
    [
      "Mics",
      "NUM"
    ],
    [
      "Pi",
      "NOUN"
    ],
    [
      "HAT",
      "PROPN"
    ],
    [
      "Seeed",
      "NUM"
    ],
    [
      "Wiki",
      "NOUN"
    ]
  ]
},

Likewise, K2-18b would be more meaningful than K2 in one spot of the word cloud and 18b somewhere else.

{
  "text": "Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE",
  "lang": "German",
  "confidence": 98.0,
  "tags": [
    [
      "Supererde",
      "PROPN"
    ],
    [
      "Forscher",
      "NOUN"
    ],
    [
      "finden",
      "VERB"
    ],
    [
      "Wasser",
      "NOUN"
    ],
    [
      "K2",
      "PROPN"
    ],
    [
      "18b",
      "PROPN"
    ],
    [
      "SPIEGEL",
      "PROPN"
    ],
    [
      "ONLINE",
      "PROPN"
    ]
  ]
},

Dashes that stand alone can be removed entirely; that is fine. For example, the one between K2-18b and SPIEGEL in the segment K2-18b - SPIEGEL.
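
One way to drop only those standalone dashes while keeping intra-word hyphens is a regex that matches a hyphen surrounded by whitespace (a sketch; it assumes separator dashes always have spaces on both sides, as in the titles above):

```python
import re

title = "Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE"

# Remove dashes that stand alone between whitespace; intra-word hyphens survive.
cleaned = re.sub(r'\s-\s', ' ', title)
print(cleaned)
# Supererde: Forscher finden erstmals Wasser auf K2-18b SPIEGEL ONLINE
```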

Here's another case, where respecting the hyphens would be meaningful:

{
  "text": "docker-spacy-alpine/Dockerfile at master \u00b7 cluttered-code/docker-spacy-alpine",
  "lang": "English",
  "confidence": 98.0,
  "tags": [
    [
      "docker",
      "NUM"
    ],
    [
      "spacy",
      "NUM"
    ],
    [
      "Dockerfile",
      "NUM"
    ],
    [
      "master",
      "NOUN"
    ],
    [
      "cluttered",
      "VERB"
    ],
    [
      "code",
      "NOUN"
    ],
    [
      "docker",
      "NUM"
    ],
    [
      "spacy",
      "NUM"
    ],
    [
      "alpine",
      "ADJ"
    ]
  ]
},

since this would then end up as docker-spacy-alpine Dockerfile cluttered-code in the image, with docker-spacy-alpine being the most prominent.
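
Independent of the tagger, the hyphenated compounds themselves can be collected straight from the raw title with a regex for word-hyphen-word runs (a sketch; `\w` also matches digits, so 2-Mics and K2-18b are caught too):

```python
import re

title = "docker-spacy-alpine/Dockerfile at master · cluttered-code/docker-spacy-alpine"

# One or more word characters, followed by at least one "-word" continuation.
compounds = re.findall(r'\w+(?:-\w+)+', title)
print(compounds)
# ['docker-spacy-alpine', 'cluttered-code', 'docker-spacy-alpine']
```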

This is the code I'm using:

import traceback

from polyglot.text import Text

# ...

for item in result:
    if 'title' in item:
        text = Text(item['title'])
        if text.language.code in ['en', 'de']:
            tags = []
            try:
                for tag in text.pos_tags:
                    try:
                        float(tag[0])  # drop tokens that are pure numbers
                    except ValueError:
                        if tag[1] in ['NUM', 'ADJ', 'VERB', 'PROPN', 'INTJ', 'NOUN']:
                            tags.append(tag)
            except Exception:
                traceback.print_exc()
            titles.append({
                'text': item['title'],
                'lang': text.language.code,
                'confidence': text.language.confidence,
                'tags': tags,
            })

Is there a way to tune polyglot so it does not do this splitting, or do I need to do some manual post-processing on the sentences?
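
For the post-processing route, here is a minimal sketch (`merge_hyphenated` is a made-up name; it assumes `pos_tags` yields `(token, tag)` pairs in document order) that re-joins consecutive tokens whenever their hyphenated form occurs verbatim in the original title. It would need to run before the number filter so that tokens like the 2 in 2-Mics are still around to be merged:

```python
def merge_hyphenated(text, tags):
    """Re-join consecutive (token, tag) pairs whose tokens appear
    hyphenated in the source text; the first token's tag is kept."""
    merged = []
    i = 0
    while i < len(tags):
        token, pos = tags[i]
        # Greedily absorb following tokens while "token-next" occurs in the text.
        while i + 1 < len(tags) and (token + '-' + tags[i + 1][0]) in text:
            token = token + '-' + tags[i + 1][0]
            i += 1
        merged.append((token, pos))
        i += 1
    return merged

title = "Supererde: Forscher finden erstmals Wasser auf K2-18b - SPIEGEL ONLINE"
tags = [("Supererde", "PROPN"), ("Forscher", "NOUN"), ("finden", "VERB"),
        ("Wasser", "NOUN"), ("K2", "PROPN"), ("18b", "PROPN"),
        ("SPIEGEL", "PROPN"), ("ONLINE", "PROPN")]
print(merge_hyphenated(title, tags))
# [('Supererde', 'PROPN'), ('Forscher', 'NOUN'), ('finden', 'VERB'),
#  ('Wasser', 'NOUN'), ('K2-18b', 'PROPN'), ('SPIEGEL', 'PROPN'), ('ONLINE', 'PROPN')]
```

Note that the substring check is a blunt instrument: if the same token occurs both hyphenated and unhyphenated in one title, the unhyphenated occurrence would be merged too. For word-cloud input that is probably acceptable.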

  • Could you provide the code snippet or the command line command that you're using to generate these results? – Tiago Duque Sep 12 '19 at 10:47
  • @TiagoDuque I've added it. – Daniel F Sep 12 '19 at 10:59
  • The thing is that you would have to modify the tokenizing module of the pipeline. However, I can't find a way to do it. There's even an issue open on the Polyglot git since 2016 (https://github.com/aboSamoor/polyglot/issues/73) about the inability to train a custom NER (I would extend that to the other tasks). – Tiago Duque Sep 12 '19 at 11:06
  • Thanks. I can see a way to do this via postprocessing with regex and set operations, so it's not that big of an issue. I was hoping for an integrated solution mostly for performance reasons. – Daniel F Sep 12 '19 at 11:10
  • You need polyglot for language detection, right? Anyway, you could make a mix with other libraries for better tuning the NER part. – Tiago Duque Sep 12 '19 at 11:13
  • ATM I have only `polyglot` and `nltk` at my disposal. I tried to compile `spacy` in an Alpine Docker container, but that failed. I think I'd need to use a `Debian` container for `spacy`, which is something I want to do in the long run, but ATM I'm just getting my feet wet with NLP and thought that this was a good task to start with. Could `nltk` help me here? I have it installed in the same container. – Daniel F Sep 12 '19 at 11:17
  • Hmmm. I never tried to modify its pipeline. I've seen a couple cases in spacy though... Spacy works well in debian environment. – Tiago Duque Sep 12 '19 at 11:29
