
I want to remove stopwords from a given text with GATE. To do that I use a Tokenizer and a Gazetteer; the Gazetteer returns the stopwords I want to delete. As far as I know there is no GATE plugin for deleting words, is there? So I want to do it with a Groovy script, but I don't know how. I think I should be able to get the positions of the stopwords from the Gazetteer.

And I know there is the method edit(), but it doesn't work as expected:

Long start = //startPosition of a stopwords
Long end = //endPosition of a stopwords
doc.edit(start, end, DocumentContentImpl(""))

The last line throws an exception, and I couldn't figure out how to use edit() correctly, or what else I could do to remove stopwords.

Will
Munchkin

1 Answer


How document edit works is described here. In my experience, however, it's tricky and risky. I think this method is what gets used when you edit a document in the UI, and in my observation some annotations usually get messed up.
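One general reason offset-based deletion is fiddly, independent of GATE: every deletion shifts the offsets of everything after it. The sketch below works on a plain Java string only and does not call GATE's edit(); the class SpanRemover and its method are hypothetical names made up for illustration. It deletes spans from the back of the text to the front so that the offsets still to be processed stay valid.

```java
import java.util.ArrayList;
import java.util.List;

// GATE-independent illustration; SpanRemover is a hypothetical helper, not a GATE class.
class SpanRemover {

    // Removes the given [start, end) spans from text.
    // Each span is an int[]{start, end}; spans are assumed not to overlap.
    static String removeSpans(String text, List<int[]> spans) {
        List<int[]> sorted = new ArrayList<>(spans);
        // Sort by start offset, descending: deleting from the back keeps
        // the offsets of the spans that are still to be deleted valid.
        sorted.sort((a, b) -> Integer.compare(b[0], a[0]));
        StringBuilder sb = new StringBuilder(text);
        for (int[] s : sorted) {
            sb.delete(s[0], s[1]);
        }
        return sb.toString();
    }
}
```

Note that the surrounding whitespace is left behind, and any annotation offsets held elsewhere would still have to be adjusted separately, which is part of why working on annotations instead of the text is usually simpler.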

I don't know what your task is, but it may be enough to just remove the Token (or all) annotations over the stopwords, or to put a Stop annotation over them.

If the subsequent processing is not too complex, you can adjust it to ignore text that's not tokenized. With JAPE that's trivial, as it only works on annotations anyway. Exploit the level of abstraction GATE gives you over the actual text.

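EDIT: How to remove tokens that match your stopword list in JAPE:

// Phase header so the grammar can be loaded on its own; the phase name is arbitrary.
Phase: RemoveStopwords
Input: Token Lookup

Rule: removeStopwords
(
  {Token, Lookup.majorType == "stopword"}
):t
-->
:t {
    // tAnnots is the annotation set bound to :t; remove only its Token annotations
    outputAS.removeAll(tAnnots.get("Token"));
}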

How to traverse all tokens in your Groovy script:

inputAS.findAll {
  it.type == "Token"
}.each {
  // your code here
}

Once you have removed the stopword tokens, this loop will only see the remaining, non-stopword tokens.

Another option would be to first create a new annotation "WorthyToken" in JAPE, matching Token, !Lookup.majorType == "stopword", and then use WorthyToken in Groovy.
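The filtering idea behind both options (skip tokens that coincide with a stopword lookup, keep the rest, never touch the text) can be sketched without GATE. This is only an illustration under stated assumptions: Span is a hypothetical stand-in for an annotation with offsets and a type, not a GATE class, and the type names "Token" and "Stopword" are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// GATE-independent sketch; StopwordFilter and Span are hypothetical names.
class StopwordFilter {

    // Minimal stand-in for an annotation: start offset, end offset, type.
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Returns the text of every "Token" span that has no "Stopword" span
    // with exactly the same offsets. The text itself is never modified.
    static List<String> keepNonStopwords(String text, List<Span> spans) {
        List<String> words = new ArrayList<>();
        for (Span token : spans) {
            if (!token.type.equals("Token")) {
                continue;
            }
            boolean isStopword = false;
            for (Span other : spans) {
                if (other.type.equals("Stopword")
                        && other.start == token.start
                        && other.end == token.end) {
                    isStopword = true;
                    break;
                }
            }
            if (!isStopword) {
                words.add(text.substring(token.start, token.end));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "the quick fox";
        List<Span> spans = Arrays.asList(
            new Span(0, 3, "Token"), new Span(0, 3, "Stopword"),
            new Span(4, 9, "Token"),
            new Span(10, 13, "Token"));
        System.out.println(keepNonStopwords(text, spans)); // prints [quick, fox]
    }
}
```

The result is the word list without stopwords, collected in the order the token spans are given.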

Hope this helps.

Yasen
  • Thank you! All I need is to receive a string or list of words (without stopwords!) in a groovy script; and after a little algorithm I want to return a new string / wordlist. As a GATE beginner I don't know how to solve this: "adjust it to ignore text that's not tokenized" and get it into my script. – Munchkin Aug 22 '14 at 06:34
  • I hope the edited answer helps you; please also edit your question for completeness. – Yasen Aug 25 '14 at 14:00
  • great, it works as I wanted! But you missed a ")" in your code ;-) – Munchkin Aug 25 '14 at 14:20