0

I'm trying to analyse a corpus and was wondering if the NLTK Part of Speech (PoS) tagging supports all languages?

For example, if I download this averaged_perceptron_tagger and use the tagger as follows, can the tagger be used for Turkish?

Does it still count verbs even if the corpus is Turkish? I have set the stop words to Turkish but I'm not sure if this also influences the PoS tagging.

from newspaper import Article, fulltext   # pip install newspaper3k 
import requests, string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
my_stopwords=set(stopwords.words('turkish'))
from collections import Counter

url = 'https://www.haber7.com/guncel/haber/3319295-erdoganin-dogal-gaz-mujdesinden-rahatsiz-oldular'
article = Article(url)
text = article.download().parse().text

clean_text = text.translate(str.maketrans('','', string.punctuation)).lower()
print(clean_text)

tokenized = nltk.word_tokenize(clean_text)

words=[]
for token in tokenized:
    if token not in my_stopwords:
        words.append(token)

def pos_taggin(tokens):
    tags=nltk.pos_tag(tokens)
    counts= Counter(tag for word, tag in tags)
    return counts

pos = pos_taggin(tokenized)
print(pos)

# List of pos-tags
nltk.download('tagsets')
nltk.help.upenn_tagset()
Kyle F Hartzenberg
  • 2,567
  • 3
  • 6
  • 24
Yucaib
  • 1
  • Obviously, in order to be able to know the POS of a token, the underlying framework has to know the grammar and lexiicon of the language. The duplicate I found is pretty old so some things might have changed since then, but the easiest way to test is probably to try it with `lang='tur'` and see if you get an error. [The current documentation](https://www.nltk.org/api/nltk.tag.pos_tag.html) seems to suggest that at least Russian might be supported in addition to English. – tripleee Apr 21 '23 at 03:49
  • thank you for your answer it made things obvious for me, I will be add lang parameter to tags=nltk.pos_tag(tokens, lang='turkish'). – Yucaib Apr 21 '23 at 14:56

0 Answers0