I need help with regex in spaCy for Japanese. I have this text: 道が凍っているから気を付けなさい。 I need to match every word up to and including the character "を", so essentially I need to get "道が凍っているから気を". I tried this code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("ja_core_news_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": "^.*?[を]"}}]
matcher.add("mypattern", [pattern])
Verbwithnoun = "道が凍っているから気を付けなさい。"
doc = nlp(Verbwithnoun)
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    print(doc[start:end])

But it prints nothing. When I try the pattern "^.*?[を]" on Python regex test sites such as Regex101 or Pythex, it works perfectly and returns the correct part of the sentence, but in spaCy it doesn't. Can somebody please help me?

Laz22434

1 Answer

As Wiktor noted, the Matcher matches against individual tokens, not the whole sentence. Assuming you only want to match on the object marker, and not on words that merely contain を like ををがけ or something, you can just walk the tokens.

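import spacy

nlp = spacy.load("ja_core_news_sm")

def until_wo(doc):
    """Return the span from the start of doc through the first を token, or None."""
    for tok in doc:
        if tok.text == 'を':
            # tok.i + 1 so the slice includes を itself, as the question asks
            return doc[:tok.i + 1]
    return None  # no を token found

text = "..."
doc = nlp(text)
print(until_wo(doc))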
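
To see why the pattern behaves differently on a regex test site, here is a small sketch that uses only Python's re module, no spaCy. The token list is a hypothetical segmentation written by hand for illustration; the tokens the Japanese tokenizer actually produces may differ. The point is only that a regex tested against one token's text at a time can never return a multi-token stretch, while the same regex run over the raw sentence can.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"^.*?[を]")

text = "道が凍っているから気を付けなさい。"

# Hypothetical segmentation, hand-written for illustration only;
# spaCy's Japanese tokenizer may split the sentence differently.
tokens = ["道", "が", "凍っ", "て", "いる", "から", "気", "を", "付け", "なさい", "。"]

# Token-level view: the regex is tested against each token's text on its own,
# so only tokens that themselves contain を can match.
token_hits = [tok for tok in tokens if PATTERN.search(tok)]

# String-level view: the regex sees the whole sentence at once,
# which is what sites like Regex101 do.
whole = PATTERN.search(text)
prefix = whole.group(0) if whole else None

print(token_hits)
print(prefix)
```

If a plain string is all you need, running the regex on the raw text (or on doc.text) like this may be enough. Walking the tokens as above is the better fit when you want a spaCy span back, or when you want to match を only where it is a token of its own.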
polm23