
I would like to calculate the frequency of function words in Python/NLTK. I see two ways to go about it:

  • Use a part-of-speech (POS) tagger and sum the counts over the POS tags that correspond to function words
  • Create a list of function words and perform a simple lookup

The catch in the first case is that my data is noisy and I don't know (for sure) which POS tags correspond to function words. The catch in the second case is that I don't have a list, and since my data is noisy the lookup won't be accurate.

I would prefer the first approach over the second, but I'm open to any other example that would give me more accurate results.
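For concreteness, here is a minimal sketch of the second approach, assuming an already-tokenized text and a deliberately tiny, illustrative word list (not a real function-word inventory):

import nltk  # nltk.download('punkt') may be needed once for word_tokenize

# Tiny illustrative list; a real application needs a proper inventory.
FUNCTION_WORDS = {'the', 'a', 'an', 'of', 'in', 'on', 'and', 'or', 'it', 'is', 'to'}

tokens = nltk.word_tokenize("The cat sat on the mat and it purred.")
fdist = nltk.FreqDist(t.lower() for t in tokens if t.lower() in FUNCTION_WORDS)
print(fdist.most_common())  # e.g. [('the', 2), ('on', 1), ('and', 1), ('it', 1)]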

Dexter
  • It's not really clear what you're asking here. If you're asking which approach would be better, you should state that. It seems, however, like you're asking a "show me the codez" type question. You're likely to get a better response if you show what you've tried and explain why it isn't working. If you give some code, you will most likely receive some back. – Wilduck Apr 28 '11 at 18:21
  • What do you mean by "noisy"? And, what do you mean by "function word"? It might help if you gave an example of a sentence or two from your data, with the function words identified. – rmalouf Apr 29 '11 at 15:57
  • @rmalouf: the definitions of *noisy* and *function word* are context-sensitive, but only mildly so. – Fred Foo Apr 30 '11 at 10:26
  • rmalouf, Noisy data consists of data that is informal in nature; chat/SMS-style language can constitute noisy data. As far as function words go, they are words with little meaning (I thought the NLP tag was enough). Do check here: http://en.wikipedia.org/wiki/Function_word – Dexter Apr 30 '11 at 21:57
  • I know what noisy data and function words are in general -- I was wondering how *specifically* your data is noisy and what *specifically* you meant by function words. And it sounds like you're still not entirely clear on what you mean... otherwise it would be easy to determine which POS tags to search for. – rmalouf May 01 '11 at 02:26
  • rmalouf, Is there a measure to tell how noisy my data is? I would really like to know _different_ meanings of function words; I am aware of just one definition, and anything which conforms to that definition is a function word. Now, my question was which tags would constitute the same? I can't see an easy way to explain my question. Moreover, I am leaving the question open rather than being specific, to allow for other answers. You may want to take that into consideration before questioning my understanding. – Dexter May 01 '11 at 14:52
  • Seriously, I'm just trying to help you. Data can be "noisy" in the sense that it's informal, unedited text, or it could be SMS data, or it could be written in multiple languages, or it could have lots of misspellings, or it could be written by non-native speakers, or it could be the output of an OCR system, or it could be transcribed speech. Those are all "noisy", but each would call for a different approach. I'm just trying to get a sense of what it is you're trying to do. – rmalouf May 03 '11 at 04:18
  • rmalouf, Thanks! My data is noisy in the sense it's informal, contains transliterated text and is a chat/SMS style data. – Dexter May 04 '11 at 22:48

2 Answers


I just used the LIWC 2007 English dictionary (I paid for it) and performed a simple lookup for now. Any other answers are most welcome.

I must say I am a little surprised by the impulsiveness of a couple of the responses here. Since someone asked for code, here's what I did:

import nltk

def get_func_word_freq(words, funct_words):
    """Return a dict mapping each function word to its frequency in `words`."""
    funct_set = set(funct_words)  # set for O(1) membership tests
    # Count occurrences of function words in the text, not the other way
    # around (iterating over funct_words would cap every count at 1).
    fdist = nltk.FreqDist(word for word in words if word in funct_set)
    funct_freq = {}
    for key, value in fdist.items():  # iteritems() under Python 2
        funct_freq[key] = value
    return funct_freq

def load_liwc_funct():
    """Read the LIWC 2007 English dictionary and extract the function words."""
    funct_words = set()
    with open(liwc_dict_file) as data_file:  # liwc_dict_file: path to the .dic file
        for line in data_file:
            row = line.rstrip().split("\t")
            if '1' in row:  # category 1 is the LIWC "funct" (function words) category
                # Wildcard entries such as 'abov*' are stored without the asterisk
                # (prefix matching itself is not implemented here).
                if row[0].endswith('*'):
                    funct_words.add(row[0][:-1])
                else:
                    funct_words.add(row[0])
    return list(funct_words)
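
For reference, a hypothetical call site (the dictionary path and the sample text below are placeholders, and nltk.word_tokenize is just one way to produce the words list):

import nltk

liwc_dict_file = '/path/to/LIWC2007_English.dic'  # placeholder; point this at your copy

funct_words = load_liwc_funct()
words = nltk.word_tokenize("the cat is on the mat and it is asleep")
print(get_func_word_freq(words, funct_words))  # counts depend on the dictionary contents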

Anyone who has written some code in Python will tell you that performing a lookup or extracting words with specific POS tags isn't rocket science. To add, the tags on the question, NLP (Natural Language Processing) and NLTK (Natural Language Toolkit), should be enough indication to the astute-minded.

Anyway, I understand and respect the sentiments of the people who reply here, since most of this help is given for free, but I think the least we can do is show a bit of respect to question posters. As is rightly pointed out, help is received when you help others; similarly, respect is received when one respects others.

Dexter
  • I tried the first part of your code in Python 3, but get the error 'FreqDist' object has no attribute 'iteritems'. How do I solve this? – Bambi Apr 26 '17 at 15:55
  • In Python 3, `dict.iteritems()` has been replaced with `dict.items()`. – Dexter Apr 28 '17 at 12:06

You don't know which approach will work until you try. I recommend the first approach, though; I've used it with success on very noisy data, where the "sentences" were email subject headers (short texts, not proper sentences) and even the language was unknown (some 85% English; the Cavnar & Trenkle algorithm broke down quickly). Success was defined as increased retrieval performance in a search engine; if you just want to count frequencies, the problem may be easier.

Make sure you use a POS tagger that takes context into account (most do). Inspect the list of words and frequencies you get, and maybe eliminate some words that you don't consider function words, or even filter out words that are too long; that will eliminate false positives.

(Disclaimer: I was using the Stanford POS tagger, not NLTK, so YMMV. I used one of the default models for English, trained, I think, on the Penn Treebank.)
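
In NLTK, a comparable tag-based filter might look like the sketch below; the Penn Treebank tag selection (determiners, prepositions, conjunctions, pronouns, modals, particles) is an assumption to tune, not a definitive inventory:

import nltk  # may need nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

# Penn Treebank tags often treated as closed-class (function-word) tags.
# This particular selection is an assumption -- adjust it for your data.
FUNCTION_TAGS = {'DT', 'PDT', 'WDT', 'IN', 'CC', 'PRP', 'PRP$', 'WP', 'WP$', 'TO', 'MD', 'EX', 'RP'}

def function_word_freq_by_tag(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))  # context-aware tagging
    return nltk.FreqDist(word.lower() for word, tag in tagged if tag in FUNCTION_TAGS)

print(function_word_freq_by_tag("She said that the results were on the server.").most_common())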

Fred Foo
  • Larsmans, Thanks for the reply! I just used the LIWC 2007 English dictionary (I paid for it) and performed a simple lookup for now. It is giving me pretty decent results. I use the default NLTK POS tagger, which I guess is a Sequential Backoff Tagger, and it works pretty well. The problem with using POS tags, though, is identifying *which* POS tags correspond to function words. – Dexter Apr 30 '11 at 21:52
  • @Denzil: well, certainly not nouns, adjectives and verbs other than auxiliaries. Determiners and pronouns are function words; for adverbs, the situation is more complex. Punctuation marks are function "words" for most purposes except sentiment analysis. – Fred Foo Apr 30 '11 at 23:52
  • Larsmans, Thanks. I guess in my case a lookup would be the easiest solution, as described in my answer. – Dexter May 01 '11 at 14:53