
I think this is an interesting question, at least for me.


I have a list of words, let's say:

photo, free, search, image, css3, css, tutorials, webdesign, tutorial, google, china, censorship, politics, internet

and I have a list of contexts:

  • Programming
  • World news
  • Technology
  • Web Design

I need to try and match words with the appropriate context/contexts if possible.

Maybe discovering word relationships in some way.

(image: a word-relationship tree centered on "Speech")


Any ideas?

Help would be much appreciated!

dmcer
RadiantHex
  • I don't think there's a ready-made solution to that. Maybe some machine learning algorithms? – Łukasz Mar 23 '10 at 14:43
  • Please try and better frame the problem. For example: a) are the words within the "list of words" [a priori] completely independent, or can we infer some of their "context" from neighboring words? b) is the list of contexts pre-defined, or should the algorithm discover these? c) can a word simultaneously belong to multiple contexts? d) how is this related to the word tree centered on "Speech" in the image? – mjv Mar 23 '10 at 14:51
  • @RadiantHex: In view of the few answers so far, you can see why I suggested better framing the problem... `Vague questions beget vague answers!` – mjv Mar 23 '10 at 15:09
  • @mjv: you are right, if I framed the question better I would have had more useful answers. Reason I wasn't specific enough is that I am not quite sure if or what could be done. – RadiantHex Mar 23 '10 at 18:19

4 Answers


This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.

I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser try NLTK.

adam

Where do these words come from? Do they come from real texts? If so, this is a classic data mining problem. What you need to do is turn your set of documents into a matrix, where the rows represent the documents the words came from and the columns represent the words in the documents.

For example if you have two documents like this:

D1: Need to find meaning.
D2: Need to separate Apples from oranges

your matrix will look like this:

      Need  to  find  meaning  Apples  Oranges  Separate  From
D1:    1    1    1       1       0        0        0        0
D2:    1    1    0       0       1        1        1        1

This is called a term-by-document matrix.

Having collected these statistics, you can use an algorithm like K-Means to group similar documents together. Since you already know how many concepts you have, your task should be somewhat easier. K-Means is a very slow algorithm, so you can try to speed it up by first reducing the matrix's dimensionality with techniques such as SVD.
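The matrix above takes only a few lines to build. Here is a minimal pure-Python sketch of the same two-document example (in practice you would use a library vectorizer):

```python
# Build the term-by-document matrix for the D1/D2 example above.
docs = {
    "D1": "need to find meaning",
    "D2": "need to separate apples from oranges",
}

# Vocabulary: every distinct word across all documents.
vocab = sorted({word for text in docs.values() for word in text.split()})

# One 0/1 row per document: does the document contain the term?
matrix = {
    doc: [1 if term in text.split() else 0 for term in vocab]
    for doc, text in docs.items()
}

print(vocab)
for doc, row in matrix.items():
    print(doc, row)
```

The rows of `matrix` are exactly the vectors you would then feed to a clustering routine such as K-Means.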

Vlad

I just found this a couple days ago: ConceptNet

It's a commonsense ontology, so it might not be as specific as you would like, but it has a python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.

If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.

tgray

The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this, you need to figure out how fine-grained the concepts should be. WordNet, as suggested in other answers, will give you synsets, i.e. sets of terms which are more or less synonymous, but which you will have to map to concepts like 'Web Design' or 'World News' by some other mechanism, since these are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes, which differentiate (going up the hierarchy), e.g., humans from animals, animate beings from plants, substances from solids, concrete from abstract things, etc.

Another kind of taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea; there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library - the idea would be to find a Wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (e.g. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do, this last step (choosing the best of several given categories) may or may not be difficult.
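For that last step, one simple criterion is token overlap between the extracted category names and your target contexts. The sketch below is entirely hypothetical — the category names are hard-coded for illustration and the overlap heuristic is an assumption, not part of any Wikipedia API:

```python
# Hypothetical sketch: match extracted Wikipedia categories to target contexts
# by shared tokens. The category names below are hard-coded for illustration.
contexts = ["programming", "world news", "technology", "web design"]
article_categories = ["style sheet languages", "web design", "web development"]

def matching_contexts(categories, contexts):
    """Return the contexts sharing at least one word with any category name."""
    category_tokens = {tok for name in categories for tok in name.split()}
    return [ctx for ctx in contexts
            if any(tok in category_tokens for tok in ctx.split())]

print(matching_contexts(article_categories, contexts))
```

A real criterion would likely weight categories by depth or frequency rather than bare word overlap.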

See here for a list of other ontologies / knowledge bases you could use.

ferdystschenko
  • @ferdy Oh my god!! I had the idea of using Google API to search for related Wikipedia articles last night, as keywords like 'css3' might give problems. I think I might go with your suggestion, thanks for the very informative answer! – RadiantHex Mar 24 '10 at 17:54