spaCy regex with Japanese characters

Question

I need help with regex in spaCy for Japanese. I have this text: 道が凍っているから気を付けなさい。 I need to match every word up to and including the character "を", so essentially I need to get "道が凍っているから気を". I tried this code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("ja_core_news_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"TEXT": {"REGEX": "^.*?[を]"}}]
matcher.add("mypattern", [pattern])
doc = nlp(Verbwithnoun)  # Verbwithnoun holds the text above
matches = matcher(doc)

for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    print(doc[start:end])

But it prints nothing. When I try the pattern "^.*?[を]" on Python regex test sites like Regex101 or Pythex, it works perfectly and returns the correct sentence, but in spaCy it doesn't work. Can somebody please help me?

The regex you are using in `TEXT` only applies to each token, not to the whole `doc`ument. — Wiktor Stribiżew, Jan 02 '22 at 12:55
@WiktorStribiżew Thank you very much, i just finished reading the documentation, you're right. But is there another way to get all characters before "を"? — Laz22434, Jan 02 '22 at 13:05
Maybe you want just `doc = nlp(Verbwithnoun[:Verbwithnoun.find("を")+1])`? — Wiktor Stribiżew, Jan 02 '22 at 13:21
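
To illustrate the point from the comments, here is a minimal sketch using only the standard `re` module. The token list is a hand-written, hypothetical segmentation for illustration, not actual spaCy output; the variable names are also assumptions. It shows that the pattern finds the whole prefix when applied to the full string, while applied token by token it can at best hit the single token を.

```python
import re

text = "道が凍っているから気を付けなさい。"
pattern = re.compile(r"^.*?を")

# Applied to the whole string, the regex captures everything through the first を.
m = pattern.search(text)
prefix_regex = m.group(0) if m else None

# The same result with plain string slicing, as suggested in the comment above.
idx = text.find("を")
prefix_slice = text[: idx + 1] if idx != -1 else None

# Hypothetical token split (an assumption, not real tokenizer output).
tokens = ["道", "が", "凍っ", "て", "いる", "から", "気", "を", "付け", "なさい", "。"]

# Applied per token, the pattern only ever sees one token at a time.
per_token_hits = [t for t in tokens if pattern.search(t)]

print(prefix_regex)
print(prefix_slice)
print(per_token_hits)
```

If the prefix needs to be a spaCy object rather than a string, the sliced string can then be passed to `nlp(...)`, as in the comment.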

score 1 · Accepted Answer · answered Jan 03 '22 at 05:17

As Wiktor noted, the Matcher matches against individual tokens, not the whole sentence. Assuming you only want to match the object marker を, and not words like ををがけ that merely contain it, you can just walk the tokens.

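def until_wo(doc):
    # Return the span from the start of the doc through the first を token.
    for tok in doc:
        if tok.text == 'を':
            return doc[:tok.i + 1]
    return None  # no standalone を token found

text = "..."
doc = nlp(text)
print(until_wo(doc))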
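
The same idea can be tried without a model installed. The sketch below works on a plain list of token strings; the function name `prefix_through_marker` and the token lists are hypothetical, written by hand for illustration. With a real doc you could presumably pass `[tok.text for tok in doc]`. It also shows why comparing whole tokens, rather than searching for a substring, skips words that merely contain を.

```python
from typing import Optional, Sequence


def prefix_through_marker(tokens: Sequence[str], marker: str = "を") -> Optional[str]:
    """Join tokens from the start through the first token equal to `marker`."""
    for i, tok in enumerate(tokens):
        if tok == marker:  # exact token match, not a substring test
            return "".join(tokens[: i + 1])
    return None  # marker never appears as a standalone token


# Hypothetical segmentation of the sentence from the question.
tokens = ["道", "が", "凍っ", "て", "いる", "から", "気", "を", "付け", "なさい", "。"]
print(prefix_through_marker(tokens))

# A token that only contains を is ignored.
print(prefix_through_marker(["ををがけ", "だ"]))
```

Joining with an empty string is fine here because the Japanese text contains no spaces between tokens.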

polm23

As always thank you! – Laz22434 Jan 07 '22 at 18:27
