
We want to create a multi-language phrasebook / dictionary for a specific area.

And now I'm thinking about the best data structure / data model for that.

Since it should be more of a phrasebook than a dictionary, we want to keep the data model / structure simple at first. It should only be used for fast translation, i.e. the user selects two languages, types a word and gets a translation. The article and description parts are just for display, not for search.

There are some specific cases I'm thinking about:

  • One term can be expressed with several (1..n) words in any language
  • Any term can also be translated into several (1..m) words in another language
  • In some languages the word's article could be important to know
  • For some words a description could be important (e.g. for words from dialects)

I'm not sure about one point: am I reinventing the wheel by creating a data model myself? So far I couldn't find any existing solutions.

I've just created a JSON data model, but I'm not sure whether it is good enough:

[
    {
        "wordgroup-id": 1,
        "en": [
            {"word": "car", "plural": "cars"},
            {"word": "auto", "plural": "autos"},
            {"word": "vehicle", "plural": "vehicles"}
        ],
        "de": [
            {"word": "Auto", "article": "das", "description": "Some explanation, e.g. when to use this word", "plural": "Autos"},
            {"word": "Fahrzeug", "article": "das", "plural": "Fahrzeuge"}
        ],
        "ru": [...],
        ...
    },
    {
        "wordgroup-id": 2,
        ...
    },
    ...
]
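To make the intended use concrete (select two languages, type a word, get the translations), here is a minimal Python sketch of a lookup over data in this shape. The names `build_index` and `translate` are made up for illustration, the sample data is a shortened copy of the draft, and matching on the case-folded whole term is an assumption rather than a requirement of the model.

```python
from collections import defaultdict

# Shortened sample data in the shape of the draft (illustrative only).
PHRASEBOOK = [
    {
        "wordgroup-id": 1,
        "en": [{"word": "car", "plural": "cars"},
               {"word": "vehicle", "plural": "vehicles"}],
        "de": [{"word": "Auto", "article": "das", "plural": "Autos"},
               {"word": "Fahrzeug", "article": "das", "plural": "Fahrzeuge"}],
    },
]


def build_index(phrasebook):
    """Map (language, case-folded term) to the word groups containing it."""
    index = defaultdict(list)
    for group in phrasebook:
        for lang, entries in group.items():
            if lang == "wordgroup-id":
                continue  # the id is metadata, not a language
            for entry in entries:
                index[(lang, entry["word"].casefold())].append(group)
    return index


def translate(index, term, source, target):
    """Return the target-language entries of every group matching the term."""
    results = []
    for group in index.get((source, term.casefold()), []):
        results.extend(group.get(target, []))
    return results


INDEX = build_index(PHRASEBOOK)
for entry in translate(INDEX, "Car", "en", "de"):
    # article and description are only displayed, never searched
    print(entry.get("article", ""), entry["word"])
```

Because the whole term is the key, multi-word phrases work the same way as single words, and a term that belongs to several word groups simply returns the entries of all of them.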

I also thought about the "corner" cases @tripleee wrote about. I would solve them with some kind of redundancy. Only the word group id and the word within a language should be unique.
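As a rough sketch of what this uniqueness rule could mean in practice, the Python check below accepts the same word in several word groups (the redundancy) but rejects a repeated group id or a word listed twice for one language inside one group. The function name `validate` and the sample entries are assumptions for illustration; the Spanish pair *rincón* (inside corner) and *esquina* (outside corner) stands in for the partial-overlap case.

```python
def validate(phrasebook):
    """Return a list of problems; an empty list means the rule holds."""
    problems = []
    seen_ids = set()
    for group in phrasebook:
        group_id = group["wordgroup-id"]
        if group_id in seen_ids:
            problems.append(f"duplicate word group id {group_id}")
        seen_ids.add(group_id)
        for lang, entries in group.items():
            if lang == "wordgroup-id":
                continue
            seen_words = set()
            for entry in entries:
                key = entry["word"].casefold()
                if key in seen_words:
                    problems.append(
                        f"group {group_id}: '{entry['word']}' repeated in '{lang}'")
                seen_words.add(key)
    return problems


# "corner" appears in two groups on purpose: one group per meaning.
GROUPS = [
    {"wordgroup-id": 1,
     "en": [{"word": "corner", "description": "inside corner, e.g. of a room"}],
     "es": [{"word": "rincón"}]},
    {"wordgroup-id": 2,
     "en": [{"word": "corner", "description": "outside corner, e.g. of a street"}],
     "es": [{"word": "esquina"}]},
]

print(validate(GROUPS))  # no problems expected for the redundant "corner"
```

With this layout a search for "corner" would match both groups, and the description field is what tells the user which translation fits.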

I would be very thankful for any feedback on this first draft of the data model.

Tima
  • It's not just grammatical gender that you need to annotate. If you want to be portable to many languages, you need to prepare for language-specific subcategories. (English has them too; countable vs. uncountable, animate vs. inanimate, etc.) – tripleee May 18 '15 at 15:37
  • Your model assumes that words overlap completely or not at all. But some languages make distinctions where others don't -- many languages don't distinguish between "roof" and "ceiling", while many have different words for the inside of a corner and the outside of a corner. Do you have a plan for corner cases like this (pardon the pun)? – tripleee May 18 '15 at 15:40
  • @tripleee Thank you for the comments. We want to keep the data model / structure simple at first, since it should be more of a phrasebook than a dictionary. It should only be used for fast translation, i.e. the user selects two languages, types a word and gets a translation. The article and description parts are just for display, not for search. About the "corner" cases: I thought about some redundancy. Only the word group id and the word within a language should be unique. – Tima May 18 '15 at 20:54
  • @tripleee I modified the draft a bit by adding 'plural' to the 'word', but I'm still not really happy with it. As I wrote, I can imagine there were other people before me who tried to solve the same problem. I couldn't find any documents or papers, maybe because my Google search query wasn't good / correct enough for this. – Tima May 18 '15 at 21:08
  • Adding just one inflected form is also very anglo-centric. German has four cases times singular and plural, Russian a few more, plus the number system is more complex; Finnish legendarily dozens, plus it's an agglutinating language, so you have thousands of forms with and without the various inflections, particles, and suffixes which can get tacked on. – tripleee May 19 '15 at 04:11
  • A common approach is to mark each word with just a declension code; then you can form all inflections for all words with the same code with a number of transformations (so, for example, German *Rad* should be marked up as a neuter word which gets its plural with *-er\** where the asterisk signifies *umlaut*: *Räder*) – tripleee May 19 '15 at 04:12
  • That still doesn't tell you how to go from *kaupoissammekinko* to "in our shops, too, you mean?" where the Finnish root word *kauppa* is inflected in the inessive plural, with a first person plural possessive suffix, and two clitics. – tripleee May 19 '15 at 04:14
  • But in summary, the data model alone is just a fragment, which only really makes sense when you see how it's used; you can read some NLP books with code examples to see how lexicon design is commonly done, but I can't point you to any where it is the focus. – tripleee May 19 '15 at 04:23
  • Googling for [nlp lexicon design](https://www.google.com/search?q=nlp+lexicon+design) gets me some promising hits. – tripleee May 19 '15 at 04:25
  • @tripleee Thank you for the comments. It seems our "simple" idea is not that simple. Thank you also for the search term. Since a phrasebook would be the easier form in our case, I'll try to find something on this topic. One link looks very promising: http://nlp.hivefire.com/articles/share/31597/ – Tima May 19 '15 at 07:40
  • @tripleee Another idea may be to find software which has already solved the data model problem, use it for creating (create, fill, map) the "dictionary" itself, and then just use the result from the app. – Tima May 19 '15 at 07:42
  • @tripleee I've found something interesting https://github.com/translate/pootle – Tima May 19 '15 at 07:56
  • That's mainly for translating user interface strings like "Enter PIN:" so there is no space for part-of-speech annotations etc. – tripleee May 19 '15 at 08:09
  • If you want to collect translation equivalences, a more versatile tool is a "translation memory". This is a piece of software used by professional translators to build a database of word and phrase correspondences. – tripleee May 19 '15 at 08:10
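The declension-code idea from the comments can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the code format ("-er*", where a trailing asterisk means umlaut) follows the *Rad* example above, the function names `umlaut` and `plural` are invented here, and only the nominative plural is produced, not the full case paradigm.

```python
UMLAUT = {"a": "ä", "o": "ö", "u": "ü", "A": "Ä", "O": "Ö", "U": "Ü"}


def umlaut(stem):
    """Umlaut the last a/o/u of the stem; return it unchanged if none."""
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in UMLAUT:
            # In the diphthong "au" the umlaut falls on the "a" (Haus -> Häus-).
            if stem[i] in "uU" and i > 0 and stem[i - 1] in "aA":
                i -= 1
            return stem[:i] + UMLAUT[stem[i]] + stem[i + 1:]
    return stem


def plural(word, code):
    """Apply a declension code such as '-er*' ('*' = umlaut the stem)."""
    suffix = code.lstrip("-")
    if suffix.endswith("*"):
        return umlaut(word) + suffix[:-1]
    return word + suffix


# Hypothetical lexicon entries: one short code replaces stored plural forms.
LEXICON = [
    {"word": "Rad", "article": "das", "declension": "-er*"},
    {"word": "Haus", "article": "das", "declension": "-er*"},
    {"word": "Kind", "article": "das", "declension": "-er"},
    {"word": "Auto", "article": "das", "declension": "-s"},
]

for entry in LEXICON:
    print(entry["word"], "->", plural(entry["word"], entry["declension"]))
```

Such a table covers regular suffixing patterns only; it does nothing for analysing heavily agglutinated forms like the Finnish example above.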

0 Answers