
I recall taking a class several years ago where I was given an interesting example of a finite state machine in which each state contained a letter and had multiple paths leading to other letters that commonly followed it in a word. Some of the letters also had paths leading to a termination, so by starting at any point in the finite state machine and following valid paths to a termination, you could chain the letters together and (almost) always end up with a valid word. Of course, this was only a subset of the words in the language (unfortunately, I've forgotten which language the FSM was meant for). This leads me to several related questions:

  • Is this a viable way to randomly generate a "pseudo" word? By that I mean a word that isn't necessarily valid, but one that's spelled in a way that looks valid.
  • Is this technique used in, or as part of, any well-known random word generation algorithms, and if so, which ones?
  • Are there other common alternatives that generate a random word one letter at a time, or that take a random string generated in this manner and coerce it into a "pseudo" word?
Patrick Roberts

1 Answer


The rules

The correct answer for your case depends on the meaning of "pseudo-word", and how you want multiple generated pseudo-words to relate to each other. Since you've tagged this question with "procedural generation", I'll assume you want to construct a fake natural language; so:

  1. Every word should be pronounceable. For example, 'gotrobit' would be acceptable, but 'grrhjklmpp' would not.
  2. The general 'feel' of different words should be comparable; you don't want a set of Finnish-sounding words intermixed with French-sounding words.

General issues with FSMs

You can most certainly use a finite state machine to do this, but there are two possible pitfalls:

  • If the FSM contains cycles, you can have wildly varying word lengths; this can be very bad for requirement #2. If your FSM does not contain cycles, you will end up with a huge FSM in order to generate a reasonable lexicon.
  • You will need to be very careful when constructing your FSM, or you will end up with words that do not satisfy #1.

You could add a post-processing step where you filter out 'stupid' results, but as I will show later on, there are better options.

Markov Chains

With these pitfalls in mind, a common way of seeding your FSM would be to use Markov chains.

For example, you can generate a non-deterministic FSM where each state represents a character (or termination); you then analyze a corpus of, say, English texts to calculate the probability that a specific character is followed by another character, and use those probabilities to create your transitions.
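
As a rough illustration of that construction (a minimal sketch; the tiny corpus and the '^'/'$' sentinel characters are just placeholders for a real word list):

    import random
    from collections import defaultdict

    def build_transitions(words):
        """For each character, record which characters follow it in the corpus."""
        transitions = defaultdict(list)
        for word in words:
            word = word.lower() + "$"        # '$' marks termination
            prev = "^"                       # '^' marks the start state
            for ch in word:
                transitions[prev].append(ch)
                prev = ch
        return transitions

    def generate(transitions):
        """Walk the chain from the start state until the termination marker."""
        result, state = [], "^"
        while True:
            state = random.choice(transitions[state])   # frequency-weighted pick
            if state == "$":
                return "".join(result)
            result.append(state)

    corpus = ["art", "train", "garden", "rotate"]        # stand-in for a real corpus
    chain = build_transitions(corpus)
    print(generate(chain))

Because each successor list contains duplicates in proportion to how often a pair occurred in the corpus, random.choice effectively samples the transition probabilities.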

Using Markov chains makes it easier to reach goal #2; by using e.g. a corpus of German texts, you get a completely different set of words that still somewhat resemble each other.

As mentioned, the pitfalls remain. For example, look at the words "art" and "train". These imply that a 't' can follow an 'r', but also that an 'r' can follow a 't'. Based on these examples, you can end up with words like "trtrtrain", which in my eyes violates #1.

This can be somewhat alleviated by having each state represent a combination of 2 characters, instead of one, but this will quickly lead to a state explosion.
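
If you want to try that variant, one way to sketch it (again with placeholder data) is to make the state the last two characters instead of one:

    import random
    from collections import defaultdict

    def build_transitions2(words):
        """Like before, but the state is the last two characters seen."""
        transitions = defaultdict(list)
        for word in words:
            word = word.lower() + "$"
            state = "^^"                     # two-character start state
            for ch in word:
                transitions[state].append(ch)
                state = state[1] + ch        # slide the window by one character
        return transitions

    def generate2(transitions):
        result, state = [], "^^"
        while True:
            ch = random.choice(transitions[state])
            if ch == "$":
                return "".join(result)
            result.append(ch)
            state = state[1] + ch

    print(generate2(build_transitions2(["art", "train", "garden", "rotate"])))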

Syllables

A much more promising approach is to not generate your words letter by letter, but syllable by syllable. You start by generating a list of allowed syllables, determine your preferred word length in syllables, and pick that many syllables.

For example, you can start by using a list of all consonant+vowel syllables. This will give you words like "tokuga" and "potarovo". You can also use a list of vowel+consonant syllables, which would give you words like "otukag" and "opatorov": a completely different 'language' with the same simple rules.
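
A minimal sketch of this idea (the consonant/vowel inventories and the word-length range are arbitrary choices):

    import random

    CONSONANTS = "ptkbdgmnrsv"
    VOWELS = "aeiou"

    # All consonant+vowel syllables: 'ta', 'ko', 'gu', ...
    CV_SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]
    # All vowel+consonant syllables: 'at', 'ok', 'ug', ...
    VC_SYLLABLES = [v + c for c in CONSONANTS for v in VOWELS]

    def make_word(syllables, min_len=2, max_len=4):
        """Pick a word length in syllables, then concatenate random syllables."""
        length = random.randint(min_len, max_len)
        return "".join(random.choice(syllables) for _ in range(length))

    print(make_word(CV_SYLLABLES))   # e.g. 'tokuga'
    print(make_word(VC_SYLLABLES))   # e.g. 'otukag'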

Of course, this gets tricky when you, for example, allow both consonant+vowel and single-vowel syllables. Now you can end up with words like "tokuuauga", which may or may not be what you want.

You can go a bit further, classify the types of syllables and add some simple rules such as: "at most two single-vowel syllables may follow each other"; or "every consonant-vowel syllable followed by a single-vowel syllable must be followed by a consonant-vowel-consonant syllable". Now you can end up with words like "tokuugat".
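
The first of those rules could be sketched along these lines (rejection-based, with an invented syllable inventory):

    import random

    CONSONANTS = "ptkbdgmnrsv"
    VOWELS = "aeiou"
    SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS] + list(VOWELS)

    def is_single_vowel(syllable):
        return len(syllable) == 1

    def make_word(min_len=2, max_len=4):
        """Concatenate syllables, rejecting a third single-vowel syllable in a row."""
        length = random.randint(min_len, max_len)
        word = []
        while len(word) < length:
            syllable = random.choice(SYLLABLES)
            if (is_single_vowel(syllable) and len(word) >= 2
                    and is_single_vowel(word[-1]) and is_single_vowel(word[-2])):
                continue                    # rule: at most two single vowels in a row
            word.append(syllable)
        return "".join(word)

    print(make_word())   # e.g. 'tokuuga', but never three bare vowels in a row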

By choosing the set of allowed syllables and rules, you can get different 'languages' that feel somewhat coherent.

Using phonemes

If you want to make even better words, you should start by using phonemes instead of letters. This allows you to easily represent sounds that have no single-letter representation, such as "ng", "sh" and (tongue-click). You then follow the algorithm as described above, followed by a "transliteration" step where you change the phonemes into 'readable' letters.

By using different transliterations, you can give the result an even stronger 'language' feel. For example, you can transliterate /sh/ as 'sh' (English), 'ch' (French) or 'sch' (Dutch).
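
A minimal sketch of such a transliteration step (the phoneme inventory and spelling tables here are invented for illustration):

    # Map each phoneme to a spelling; swapping tables changes the 'feel' of the output.
    ENGLISH_STYLE = {"sh": "sh", "k": "k", "t": "t", "a": "a", "o": "o", "u": "u"}
    FRENCH_STYLE  = {"sh": "ch", "k": "qu", "t": "t", "a": "a", "o": "au", "u": "ou"}

    def transliterate(phonemes, table):
        """Turn a list of phonemes into a readable word using one spelling table."""
        return "".join(table[p] for p in phonemes)

    word = ["t", "o", "k", "u", "sh", "a"]      # output of the phoneme-level generator
    print(transliterate(word, ENGLISH_STYLE))   # 'tokusha'
    print(transliterate(word, FRENCH_STYLE))    # 'tauquoucha'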

Phonological rules

Phonological rules are basically a formal way of describing the rules from the previous section, going a bit further than my previous example. By choosing the correct set of rules, you can create 'hard' languages, 'soft' languages, etc. For example, you can choose to change 'vowel+r+k+vowel' into 'vowel+r+r+vowel' (resulting in a language that sounds like a motor) or into 'vowel+k+h+vowel' (resulting in a typical dwarfish language). The possibilities are endless.
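
A rule like that could be sketched as a simple regular-expression rewrite applied after the word is generated (a rough illustration, not a full phonological engine):

    import re

    def apply_rules(word, rules):
        """Apply each (pattern, replacement) rewrite rule in order."""
        for pattern, replacement in rules:
            word = re.sub(pattern, replacement, word)
        return word

    # vowel + 'rk' + vowel  ->  vowel + 'rr' + vowel  (the 'motor' language)
    MOTOR_RULES = [(r"([aeiou])rk([aeiou])", r"\1rr\2")]
    # vowel + 'rk' + vowel  ->  vowel + 'kh' + vowel  (the 'dwarfish' language)
    DWARF_RULES = [(r"([aeiou])rk([aeiou])", r"\1kh\2")]

    print(apply_rules("torkuga", MOTOR_RULES))  # 'torruga'
    print(apply_rules("torkuga", DWARF_RULES))  # 'tokhuga'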

Phonological research has produced a lot of these rules, helping you to create more earth-like languages.

A nice example of this approach is Drift, a Python program that uses a list of syllables and a set of phonological rules to generate 'real' words.

Leaving the randomness and computer-generated aspects aside, I believe this is more or less the approach Tolkien used when he created his Elvish languages and dialects.

Conclusion

To sum up the answers:

  • Yes, using an FSM is a viable approach
  • Markov chains are a popular technique to create such FSMs
  • You get better results by using syllables and the research done in phonology
publysher