Let me begin with the final purpose of this question: my aim is to build a word-based neural network that takes a simple sentence and selects, for each individual word, the meaning it is supposed to carry in that sentence. It should then learn something about the language (for example the possible correlation between two given words, the probability of finding both in a single sentence, and so on) and, at the final stage (after the learning phase), try to build some very simple sentences of its own from some input.

In order to do this I need some kind of database representing the vocabulary of a given language, from which I can extract information such as the word list, definitions, synonyms, et cetera. The database should be structured so that I can build C data structures containing the needed information, such as

typedef struct _dictEntry DictionaryEntry;
typedef struct _dict Dictionary;

struct _dictEntry {
    const char *word;               // Word string
    const char **definitions;       // Array of definition strings
    DictionaryEntry **synonyms;     // Array of pointers to synonym words
    Dictionary *dictionary;         // Pointer to parent dictionary
};

struct _dict {
    const char *language;           // Language identification string
    int count;                      // Number of elements in the dictionary
    float **correlations;           // Correlation matrix between i-th and j-th entries
    DictionaryEntry *entries;       // Array of dictionary entries
};

or equivalent Obj-C objects.
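
As a rough illustration of how the correlations field could be filled, here is a minimal sketch that treats the "correlation" between two entries as the fraction of training sentences containing both, which is one reading of "the probability of finding both in a single sentence". The function names are hypothetical, and each sentence is assumed to be given as an array of entry indices with every index appearing at most once.

```c
#include <stdlib.h>

/* Allocates a count x count matrix of zeros as an array of row pointers,
   matching the float **correlations layout. Returns NULL on failure. */
float **correlations_alloc(int count)
{
    float **m = calloc((size_t)count, sizeof *m);
    if (m == NULL) return NULL;
    for (int i = 0; i < count; ++i) {
        m[i] = calloc((size_t)count, sizeof **m);
        if (m[i] == NULL) {
            while (i-- > 0) free(m[i]);   /* release the rows built so far */
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Releases a matrix obtained from correlations_alloc. */
void correlations_free(float **m, int count)
{
    if (m == NULL) return;
    for (int i = 0; i < count; ++i) free(m[i]);
    free(m);
}

/* Records one sentence: every pair of distinct entries in it gets +1,
   symmetrically. Assumes each index occurs at most once in ids. */
void correlations_add_sentence(float **m, const int *ids, int n)
{
    for (int a = 0; a < n; ++a) {
        for (int b = a + 1; b < n; ++b) {
            if (ids[a] != ids[b]) {
                m[ids[a]][ids[b]] += 1.0f;
                m[ids[b]][ids[a]] += 1.0f;
            }
        }
    }
}

/* Turns raw co-occurrence counts into per-sentence frequencies. */
void correlations_normalize(float **m, int count, int sentences)
{
    if (sentences <= 0) return;
    for (int i = 0; i < count; ++i)
        for (int j = 0; j < count; ++j)
            m[i][j] /= (float)sentences;
}
```

With four sentences, three of which contain both entry 0 and entry 1, m[0][1] and m[1][0] end up at 0.75 after normalization. A pointwise-mutual-information or cosine measure could be substituted later without changing the matrix layout.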

I know (from Searching the Mac OSX system dictionaries?) that the Apple-provided dictionaries are licensed, so I cannot use them to create my data structures. Basically, what I want to do is the following: given an arbitrary word A, I want to fetch all the dictionary entries that have a definition containing A, and select only those definitions. I will then implement some kind of intersection procedure to select the most appropriate definition and synonyms based on the rest of the sentence, and build a correlation matrix.
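
The lookup described above (all definitions containing a word A) could be sketched as follows. This is only an illustration under stated assumptions: SketchEntry is a simplified, hypothetical stand-in for DictionaryEntry whose definitions array is assumed to be NULL-terminated, matching is ASCII-only and case-insensitive, and the function names are made up for the example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Simplified entry for illustration; definitions is NULL-terminated. */
typedef struct {
    const char *word;
    const char **definitions;
} SketchEntry;

/* One match: which entry, and which of its definitions. */
typedef struct {
    size_t entry;
    size_t definition;
} DefinitionHit;

/* Returns 1 if word occurs in text as a whole word, 0 otherwise. */
int contains_word(const char *text, const char *word)
{
    size_t n = strlen(word);
    if (n == 0) return 0;
    for (const char *p = text; *p; ++p) {
        /* A match may only start at the beginning of a word. */
        if (p != text && isalpha((unsigned char)p[-1])) continue;
        size_t i = 0;
        while (i < n && p[i] != '\0' &&
               tolower((unsigned char)p[i]) == tolower((unsigned char)word[i]))
            ++i;
        /* It must also end at a word boundary. */
        if (i == n && !isalpha((unsigned char)p[n])) return 1;
    }
    return 0;
}

/* Scans every definition of every entry for word. Stores up to max_hits
   matches in hits and returns the total number of matches found. */
size_t find_definitions(const SketchEntry *entries, size_t count,
                        const char *word,
                        DefinitionHit *hits, size_t max_hits)
{
    size_t found = 0;
    for (size_t e = 0; e < count; ++e) {
        for (size_t d = 0; entries[e].definitions[d] != NULL; ++d) {
            if (contains_word(entries[e].definitions[d], word)) {
                if (found < max_hits) {
                    hits[found].entry = e;
                    hits[found].definition = d;
                }
                ++found;
            }
        }
    }
    return found;
}
```

Because the return value is the total number of matches, a caller can pass max_hits as 0 to size a buffer first and then call again. A linear scan like this visits every definition on each query; an inverted index from words to definitions would avoid that, at the cost of a preprocessing pass.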

Let me give a small example. Suppose I type a sentence containing "play". I want to fetch all the entries (such as "game", "instrument", "actor", etc.) that the word "play" can be correlated to, and for each of them select only the relevant definition (for example, I don't want to extract the definition of "instrument" that corresponds to the "tool" meaning, since you cannot "play a tool"). I will then select the most appropriate of these definitions by looking at the rest of the sentence: if it also contains the word "actor", I will assign to "play" the meaning "drama" or another suitable definition. The most basic way to do this is to scan every definition in the dictionary for the word "play", so I need unrestricted access to all definitions, and as I understand it this cannot be done with the dictionaries located under /Library/Dictionaries. Sadly, this work MUST be done offline.
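
The "look at the rest of the sentence" step could be approximated by a simple word-overlap score, similar in spirit to the simplified Lesk heuristic: each candidate definition is scored by how many of the other words in the sentence it contains, and the highest score wins. This is a stand-in sketch, not the intersection procedure itself, which is not specified here; the function names are hypothetical and tokenization is ASCII-only.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the token text[0..len) with word. */
static int token_equals(const char *text, size_t len, const char *word)
{
    if (strlen(word) != len) return 0;
    for (size_t i = 0; i < len; ++i)
        if (tolower((unsigned char)text[i]) != tolower((unsigned char)word[i]))
            return 0;
    return 1;
}

/* Counts how many context words occur as whole tokens in definition.
   Each context word is counted at most once. */
int context_overlap(const char *definition,
                    const char *const *context, size_t ncontext)
{
    int score = 0;
    for (size_t c = 0; c < ncontext; ++c) {
        const char *p = definition;
        while (*p) {
            while (*p && !isalpha((unsigned char)*p)) ++p;  /* skip separators */
            const char *start = p;
            while (isalpha((unsigned char)*p)) ++p;         /* consume a token */
            if (p > start &&
                token_equals(start, (size_t)(p - start), context[c])) {
                ++score;
                break;
            }
        }
    }
    return score;
}

/* Returns the index of the definition sharing the most words with the
   context, or -1 if none shares any. Ties go to the earliest definition. */
int pick_definition(const char *const *definitions, size_t ndefs,
                    const char *const *context, size_t ncontext)
{
    int best = -1;
    int best_score = 0;
    for (size_t d = 0; d < ndefs; ++d) {
        int s = context_overlap(definitions[d], context, ncontext);
        if (s > best_score) {
            best_score = s;
            best = (int)d;
        }
    }
    return best;
}
```

In practice, very common words ("the", "a", "to") would need to be filtered out of the context first, since they match almost every definition and drown out informative words like "actor".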

Is there any available resource I can download that lets me get my hands on all the definitions and fetch this information? At the moment I'm not interested in any particular file format (it could be a database, XML, or anything else), but it must be something I can parse and put into a data structure. I tried to google it but, whatever keywords I use, if I include the word "vocabulary" or "dictionary" I (pretty obviously) only get pages with the definitions of my other keywords on some online dictionary site! I guess this is not the best thing to search for...

I hope the question is clear. If it is not, I'll try to explain it in a different way! Anyway, thanks in advance to all of you for any helpful information.

gianluca

1 Answer

A free ontology such as http://www.eat.rl.ac.uk would probably help you. Several are available in the university sector.

Wolfgang Wilke
  • Thanks, this is a good resource. Unfortunately it's not what I am looking for. I want to build a similar correlation matrix, but EAT takes a different approach: they show a word to a large number of people, ask them to reply with the first word that comes to mind, and then gather all the association data. Their data has been generated by people (i.e. REAL BRAINS). What my program should do is build all the data by itself using ONLY the vocabulary definitions. Furthermore, the database it provides does not contain any definitions. It only contains a word list and the association norms. – gianluca Jan 20 '13 at 00:37
  • Just to clarify WHY their data is not suitable for me: being generated by real people, the associations are NOT guaranteed to be coherent. It may happen (and it does all the time) that a word given as a stimulus produces a response that has nothing to do with the stimulus itself for anyone but the person who gave it. For example, if you say "wind" I could easily answer "Trieste", since I live there and in Trieste there is A LOT of wind! However, Trieste has nothing to do with the concept of "wind" in itself, and anyone who has never been there will NEVER reply "Trieste" to your stimulus. – gianluca Jan 20 '13 at 00:45