6

I tried this, but it doesn't work:

from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')
print(stopwords_list)

Update [January 2018]: The nltk data repository has included Arabic stopwords since October, 2017, so this issue no longer arises. The above code will work as expected.

alexis
lina
  • The declaration of the source code encoding has nothing to do with the data you use (load/import), it is completely unrelated to your problem. – lenz Mar 06 '17 at 12:55
  • Yes, I know, but I need this for another thing – lina Mar 06 '17 at 13:40

3 Answers

7

As of October, 2017, the nltk includes a collection of Arabic stopwords. If you ran nltk.download() after that date, this issue will not arise. If you have been a user of nltk for some time and you now lack the Arabic stopwords, use nltk.download() to update your stopwords corpus.

  1. If you call nltk.download() without arguments, you'll find that the stopwords corpus is shown as "out of date" (in red). Download the current version that includes Arabic.

  2. Alternately, you can simply update the stopwords corpus by running the following code once, from the interactive prompt:

    >>> import nltk
    >>> nltk.download("stopwords")
    

Note:

Looking words up in a list is really slow. Use a set, not a list. E.g.,

arb_stopwords = set(nltk.corpus.stopwords.words("arabic"))
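To illustrate the note above, here is a minimal sketch of filtering tokens against a stopword set. The four sample words and the helper name remove_stopwords are made up for illustration; in practice the words would come from nltk.corpus.stopwords.words("arabic") once the corpus is installed.

```python
# Hypothetical sample data standing in for the real nltk stopword list.
sample_stopwords = ["في", "من", "على", "إلى"]

# Converting once to a set gives constant-time membership tests on average,
# instead of scanning the whole list for every token.
stopword_set = set(sample_stopwords)


def remove_stopwords(tokens, stopword_set):
    """Return the tokens that are not in stopword_set, preserving order."""
    return [tok for tok in tokens if tok not in stopword_set]


tokens = ["ذهب", "الولد", "إلى", "المدرسة"]
filtered = remove_stopwords(tokens, stopword_set)
print(filtered)
```

The difference is negligible for a handful of tokens, but it adds up when every token of a large corpus is checked against several hundred stopwords.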

Original answer (still applicable to languages that are not included)

Why don't you just check what the stopwords collection contains:

>>> from nltk.corpus import stopwords
>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian',
 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish',
 'turkish']

So no, there's no list for Arabic. I'm not sure what you mean by "add it", but the stopwords lists are just lists of words. They don't even do morphological analysis, or other things you might want in an inflecting language. So if you have (or can put together) a list of Arabic stopwords, just put them in a set() and you're one step ahead of where you'd be if your code worked.
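If your own list lives in a plain text file, a sketch along these lines would load it into a set without involving nltk at all. It assumes one word per line in UTF-8; the function name load_stopwords and the file name in the comment are hypothetical.

```python
from pathlib import Path


def load_stopwords(path, encoding="utf-8"):
    """Read one stopword per line and return the words as a set.

    Surrounding whitespace is stripped and blank lines are skipped.
    """
    text = Path(path).read_text(encoding=encoding)
    return {line.strip() for line in text.splitlines() if line.strip()}


# Example call (hypothetical file name):
# arb_stopwords = load_stopwords("arabic_stopwords.txt")
```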

alexis
5

There's an Arabic stopword list here:

https://github.com/mohataher/arabic-stop-words/blob/master/list.txt

If you save this file in the corpora/stopwords subdirectory of your nltk_data directory, with the filename arabic, you will then be able to load it through nltk using your code above, which was:

from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')

(Note that the possible locations of your nltk_data directory can be seen by typing nltk.data.path in your Python interpreter).

You can also use alexis' suggestion to check if it is found.

Do heed his advice to convert the stopwords list to a set: stopwords_set = set(stopwords.words('arabic')), as it can make a real difference to performance.

PrettyHands
  • IOError: No such file or directory: u'C:\\Users\\Lamiaa\\AppData\\Roaming\\nltk_data\\corpora\\stopwords\\arabic' i get this error – lina Mar 06 '17 at 18:41
  • One at a time, try putting it in every one of the directories listed when you type nltk.data.path – PrettyHands Mar 06 '17 at 20:04
  • If that doesn't work, try putting this at the top of your file: `import nltk` `nltk.data.path.append(u'C:\Users\Lamiaa\AppData\Roaming\nltk_data\corpora\stopwords')` – PrettyHands Mar 06 '17 at 20:08
  • Nice that you found a stopwords list, but 1) Don't drop the file into the nltk corpus area, read it from your own folder with `nltk.corpus.WordListCorpusReader`. (Adapt [this](http://stackoverflow.com/a/10519171/699305) answer). 2) Write your path as a "raw" string. You've got embedded newlines. – alexis Mar 06 '17 at 22:53
  • @alexis Could you explain why it's a bad idea not to put additional stopword files in the nltk corpus area? Are they in danger of being overwritten when nltk is updated? – PrettyHands Mar 07 '17 at 15:54
  • Yes, among other things. The downloader will show you the stopwords corpus as "out of date" (or used to) because of the extra files. But mainly it's for the same reason that you shouldn't hack the nltk source itself to add new corpora: Keep your code in your project folders, and let libraries manage their own resources. – alexis Mar 07 '17 at 23:43
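The suggestion in the comments above, reading the list from your own folder with WordListCorpusReader, could look roughly like this. It is a sketch, assuming a UTF-8 file with one word per line; the helper name read_wordlist and the folder and file names in the comment are hypothetical.

```python
from nltk.corpus.reader import WordListCorpusReader


def read_wordlist(folder, filename):
    """Load a one-word-per-line file from `folder` and return a set of words."""
    # The reader is rooted at a project folder, so nothing is written
    # into the nltk_data directory.
    reader = WordListCorpusReader(folder, [filename], encoding="utf-8")
    return set(reader.words(filename))


# Example call (hypothetical folder and file name):
# arb_stopwords = read_wordlist("my_project/data", "arabic_stopwords.txt")
```

On Windows, pass the folder as a raw string (r"C:\...") so that backslash sequences such as \n are not interpreted as escapes.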
1

You could use the library called Arabic-Stopwords. Install it with pip:

pip install Arabic-Stopwords

Once it is installed, you can import it with:

import arabicstopwords.arabicstopwords as stp

It is much better than the list that ships with nltk.

Andres Gardiol