Decompose words into letters with Ruby

Question

In my language there are composite (compound) letters, which consist of more than one character, e.g. "ty", "ny" and even "tty" and "nny". I would like to write a Ruby method (spell) which tokenizes words into letters, according to this alphabet:

abc=[*%w{tty ccs lly ggy ssz nny dzs zzs sz zs cs gy ny dz ty ly q w r t z p l k j h g f d s x c v b n m y}.map{|z| [z,"c"]},*"eéuioöüóőúűáía".split(//).map{|z| [z,"v"]}].to_h

The keys of the resulting hash are the letters and composite letters of the alphabet, and the values show whether each letter is a consonant ("c") or a vowel ("v"), because later I would like to use this hash to decompose words into syllables. Of course, the method does not need to resolve compound words in which a composite letter is accidentally formed at the boundary between the two words.

Examples:

spell("csobolyó") => [ "cs", "o", "b", "o", "ly", "ó" ]
spell("nyirettyű") => [ "ny", "i", "r", "e", "tty", "ű" ]
spell("dzsesszmuzsikus") => [ "dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s" ]

What have you tried so far? This is going to be really complicated and so if you can limit it to just a specific area you need help with I think you'll have better luck here. As it stands, there are a ton of edge cases that those who don't natively speak your language (and maybe those that do speak it) aren't going to be able to work through...for instance if I see `dzs` in a string, that could be `["dzs"]`, or `["d", "zs"]` or `["dz", "s"]` or `["d", "z", "s"]` and without a dictionary of words (or knowing a lot about this language), I don't think we'll be able to determine which is correct — Simple Lime, Sep 20 '17 at 23:17
This is why I sorted the letters in the alphabet: if a letter appears earlier, then it should be recognized instead of its simple letters. When a word contains "dzs" it should be considered to "dzs" and not to "d" and "zs". It would give some false results in rare cases, but the majority of the decompositions would work. I don't know how to do it efficiently. Maybe some built in string tokenizer, or something. — Konstantin, Sep 20 '17 at 23:21

score 2 · Simple Lime · Accepted Answer · 2017-09-21T02:31:50.277

You might be able to get started by looking at String#scan, which appears to give decent results for your examples:

"csobolyó".scan(Regexp.union(abc.keys))
# => ["cs", "o", "b", "o", "ly", "ó"]
"nyirettyű".scan(Regexp.union(abc.keys))
# => ["ny", "i", "r", "e", "tty", "ű"]
"dzsesszmuzsikus".scan(Regexp.union(abc.keys))
# => ["dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s"]

The last case didn't match the expected output originally posted in the question, but it matches your statement in the comments:

I sorted the letters in the alphabet: if a letter appears earlier, then it should be recognized instead of its simple letters. When a word contains "dzs" it should be considered to "dzs" and not to "d" and "zs"
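To make the scan approach reusable, it could be wrapped in the `spell` method the question asks for. The following is only a sketch: the constant names `ABC` and `LETTER_PATTERN` and the helper `cv_pattern` are illustrative assumptions, and sorting the keys by length is an added safeguard that the answer itself does not use (it relies on the order in which the alphabet was written).

```ruby
# Alphabet from the question: each letter maps to "c" (consonant) or "v" (vowel).
ABC = [
  *%w{tty ccs lly ggy ssz nny dzs zzs sz zs cs gy ny dz ty ly
      q w r t z p l k j h g f d s x c v b n m y}.map { |z| [z, "c"] },
  *"eéuioöüóőúűáía".chars.map { |z| [z, "v"] }
].to_h

# Build the pattern once. Longest keys first is an assumption added here so the
# result does not depend on how the alphabet happens to be ordered.
LETTER_PATTERN = Regexp.union(ABC.keys.sort_by { |k| -k.length })

# Tokenize a word into letters of the alphabet.
def spell(word)
  word.scan(LETTER_PATTERN)
end

# Hypothetical helper: the consonant/vowel pattern of a word, as a first
# step towards the syllable decomposition mentioned in the question.
def cv_pattern(word)
  spell(word).map { |letter| ABC[letter] }.join
end

p spell("dzsesszmuzsikus")
# => ["dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s"]
p cv_pattern("csobolyó")
# => "cvcvcv"
```

Note that `String#scan` silently skips any character that is not in the alphabet, so input containing digits, uppercase letters or punctuation would need to be handled separately.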

edited Sep 21 '17 at 02:31

answered Sep 21 '17 at 00:31


In general `Regexp.union` is safer than `join("|")`, but it may not matter in this case, since we're only dealing with word chars. – Mark Thomas Sep 21 '17 at 02:27
Ah, yeah good point, don't deal with dynamic regexes very much and completely forgot `union` existed. Updated – Simple Lime Sep 21 '17 at 02:30
Yes, it works as expected, I typed wrong result in the example, now I fixed it. – Konstantin Sep 21 '17 at 12:38
String#scan is the winner, and Regexp.union, however the order of the keys counts, when tokenizing, because some of the regex patterns are prefixes of others. – Konstantin Sep 21 '17 at 13:07
`abc.keys` should return the keys in insertion order for recent versions of Ruby (> 2.0, possibly earlier), so order of the keys should be honored – Simple Lime Sep 21 '17 at 19:04
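The two points raised in these comments can be checked with a small experiment. Ruby tries the alternatives of a regular expression from left to right, so a shorter letter listed before a longer one wins. Hash insertion order has been preserved since Ruby 1.9, which is why `abc.keys` keeps the longest-first order of the question's alphabet. The variable names below are illustrative only.

```ruby
# Same letters, different order: alternation picks the first alternative
# that matches at the current position.
short_first = Regexp.union(%w[d z s zs dz dzs])
long_first  = Regexp.union(%w[dzs dz zs d z s])

p "dzs".scan(short_first)  # => ["d", "z", "s"]
p "dzs".scan(long_first)   # => ["dzs"]

# Regexp.union escapes regex metacharacters; a plain join("|") does not.
p Regexp.union(["a.b", "c"]).source          # => "a\\.b|c"
p Regexp.new(["a.b", "c"].join("|")).source  # => "a.b|c"
```

For this alphabet the escaping makes no difference, since every key consists of word characters, but the ordering does.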

score 1 · Answer 2 · answered Sep 21 '17 at 00:31

I didn't use the order in which you sorted the letters. Instead, longer candidates take precedence over shorter ones: at each position the code first tries a three-character letter, then a two-character one, then a single character.

def spell word
  abc=[*%w{tty ccs lly ggy ssz nny dzs zzs sz zs cs gy ny dz ty ly q w r t z p l k j h g f d s x c v b n m y}.map{|z| [z,"c"]},*"eéuioöüóőúűáía".split(//).map{|z| [z,"v"]}].to_h
  current_position = 0
  maximum_current_position = 2
  maximum_possible_position = word.length
  split_word = []
  while current_position < maximum_possible_position do 
    current_word = set_current_word word, current_position, maximum_current_position
    if abc[current_word] != nil
      current_position, maximum_current_position = update_current_position_and_max_current_position current_position, maximum_current_position
      split_word.push(current_word)
    else
      maximum_current_position = update_max_current_position maximum_current_position
      current_word = set_current_word word, current_position, maximum_current_position
      if abc[current_word] != nil
        current_position, maximum_current_position = update_current_position_and_max_current_position current_position, maximum_current_position
        split_word.push(current_word)
      else
        maximum_current_position = update_max_current_position maximum_current_position
        current_word = set_current_word word, current_position, maximum_current_position
        if abc[current_word] != nil
          current_position, maximum_current_position = update_current_position_and_max_current_position current_position, maximum_current_position          
          split_word.push(current_word)
        else
          puts 'This word cannot be formed in the current language'
          break
        end
      end
    end
  end
  split_word
end

# Shrink the candidate window by one character.
def update_max_current_position(max_current_position)
  max_current_position - 1
end

# Move past the letter just matched and open a new three-character window.
def update_current_position_and_max_current_position(current_position, max_current_position)
  current_position = max_current_position + 1
  [current_position, current_position + 2]
end

def set_current_word word, current_position, max_current_position
  word[current_position..max_current_position]
end

puts "csobolyó => #{spell("csobolyó")}"
puts "nyirettyű => #{spell("nyirettyű")}"
puts "dzsesszmuzsikus => #{spell("dzsesszmuzsikus")}"

Output

csobolyó => ["cs", "o", "b", "o", "ly", "ó"]
nyirettyű => ["ny", "i", "r", "e", "tty", "ű"]
dzsesszmuzsikus => ["dzs", "e", "ssz", "m", "u", "zs", "i", "k", "u", "s"]
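The same longest-match-first idea can be written more compactly by looping over the candidate lengths instead of nesting the three checks. This is a sketch of the technique, not this answer's code: the names `ABC`, `MAX_LETTER_LENGTH` and `spell_longest_first` are assumptions, and it raises an error where the code above prints a message and stops.

```ruby
# Alphabet from the question: each letter maps to "c" (consonant) or "v" (vowel).
ABC = [
  *%w{tty ccs lly ggy ssz nny dzs zzs sz zs cs gy ny dz ty ly
      q w r t z p l k j h g f d s x c v b n m y}.map { |z| [z, "c"] },
  *"eéuioöüóőúűáía".chars.map { |z| [z, "v"] }
].to_h

# Longest letter in the alphabet (3 for this one).
MAX_LETTER_LENGTH = ABC.keys.map(&:length).max

def spell_longest_first(word)
  letters = []
  pos = 0
  while pos < word.length
    # Candidates starting at pos, longest first; take the first that is a letter.
    letter = MAX_LETTER_LENGTH.downto(1)
                              .map { |n| word[pos, n] }
                              .find { |candidate| ABC.key?(candidate) }
    if letter.nil?
      raise ArgumentError, "no letter matches at position #{pos} of #{word.inspect}"
    end
    letters << letter
    pos += letter.length
  end
  letters
end

p spell_longest_first("nyirettyű")
# => ["ny", "i", "r", "e", "tty", "ű"]
```

Because the maximum length is derived from the hash, adding a longer composite letter to the alphabet would not require changing the method.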

score 0 · Answer 3 · answered Sep 21 '17 at 17:22

Meanwhile, I managed to write a method that works, but it is 5x slower than String#scan:

abc=[*%w{tty ccs lly ggy ssz nny dzs zzs sz zs cs gy ny dz ty ly q w r t z p l k j h g f d s x c v b n m y}.map{|z| [z,"c"]},*"eéuioöüóőúűáía".split(//).map{|z| [z,"v"]}].to_h

def spell(w, abc)
  letters = []  # letters recognized so far
  buf = ""      # characters read but not yet assigned to a letter

  w.each_char do |ch|
    buf += ch
    next if buf.size < 3

    # Three characters are buffered: emit the longest prefix that is a letter.
    # An unknown character is emitted on its own so the buffer cannot grow.
    len = [3, 2, 1].find { |n| abc.key?(buf[0, n]) } || 1
    letters.push buf[0, len]
    buf = buf[len..-1]
  end

  # Flush the last one or two characters, again longest prefix first.
  until buf.empty?
    len = buf.size.downto(1).find { |n| abc.key?(buf[0, n]) } || 1
    letters.push buf[0, len]
    buf = buf[len..-1]
  end

  letters
end
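A speed comparison like the one mentioned above could be reproduced with the standard `benchmark` library. The sketch below makes several assumptions: `spell_scan` and `spell_loop` are hypothetical stand-ins (the loop is a simple longest-match-first tokenizer, not necessarily the method in this answer), and the word list and iteration count are arbitrary. Timings depend on the machine and Ruby version, so no expected figures are given.

```ruby
require "benchmark"

# Alphabet from the question.
ABC = [
  *%w{tty ccs lly ggy ssz nny dzs zzs sz zs cs gy ny dz ty ly
      q w r t z p l k j h g f d s x c v b n m y}.map { |z| [z, "c"] },
  *"eéuioöüóőúűáía".chars.map { |z| [z, "v"] }
].to_h
PATTERN = Regexp.union(ABC.keys)

# Regex-based tokenizer, as in the accepted answer.
def spell_scan(word)
  word.scan(PATTERN)
end

# Hand-written stand-in: try three, two, then one character at each position.
def spell_loop(word)
  letters = []
  pos = 0
  while pos < word.length
    letter = 3.downto(1).map { |n| word[pos, n] }.find { |c| ABC.key?(c) }
    break if letter.nil? # stop at a character outside the alphabet
    letters << letter
    pos += letter.length
  end
  letters
end

words = %w[csobolyó nyirettyű dzsesszmuzsikus]
n = 10_000

Benchmark.bm(12) do |x|
  x.report("String#scan:") { n.times { words.each { |w| spell_scan(w) } } }
  x.report("manual loop:") { n.times { words.each { |w| spell_loop(w) } } }
end
```

Checking that both methods return the same tokens for the test words before timing them helps ensure the comparison is between equivalent results.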
