I need to learn how to convert a text from one transliteration system to another writing system. Apparently the best way would involve regular expressions and Perl, probably from the command line? I have used regular expressions before in Notepad++ and TextWrangler, so I already know some basics. If there is a really good (and relatively easy and customizable) way to do this in Ruby or something else, I can start learning that as well. In my field, Uralic linguistics, there is a constant need to transliterate linguistic sample texts, because many different variants of transliteration systems are in use. So it is worth investing some time.

The material I have now consists of lines with one sentence on each line. Some lines contain other data, such as numbers, and those should stay as they are. I also want to keep the punctuation marks as they are; this is only about converting one set of Unicode letter characters to another. I searched the site, but most of what I found was about converting from ASCII to Unicode and so on, which is not the problem here.

So the original text is like this (in broad Finno-Ugric Transcription):

mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.

And I would need it in a form like this:

мӧдiс иван велӧччыны печораӧ щӧтӧвӧднэй курс вылӧ.

This continues for some thousand lines.

There is a clear correspondence between the characters used, but it is sometimes complex and involves dealing first with digraphs, consonant + vowel combinations, and so on. As you can see from the example, in some situations Latin i corresponds to Cyrillic и, but in some positions it can remain as i. Different texts use different solutions, so I would need to adjust the rules in each case. I understand that I would need to run a long series of regular expressions in a very specific order to make it work. I will figure out this order myself, but I need to know what kind of tool I should feed these rules into and how to do it.
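To make the idea of "a series of rules in a specific order" concrete, here is a minimal Python sketch. It is not a complete or authoritative transliteration: the names `CONTEXT_RULES`, `SINGLE_LETTERS` and `transliterate` are made up for illustration, the letter table is deliberately incomplete (lower case only), and the rule that i becomes Cyrillic і (U+0456) after d, t, z, s, l, n is purely an assumption standing in for whatever positional rule a given text needs.

```python
import re
import unicodedata


def nfc(text):
    """Normalise to composed form so that 'ö' is always one code point."""
    return unicodedata.normalize("NFC", text)


# Step 1: context-sensitive rules, applied in the order listed.
# They run while the text is still Latin, so they can inspect Latin neighbours.
CONTEXT_RULES = [
    # Digraph first, before any single-letter rule could split it up.
    (re.compile(nfc("šč")), "щ"),
    # ASSUMPTION (illustrative only): i after these consonants becomes U+0456.
    (re.compile(r"(?<=[dtzsln])i"), "\u0456"),
]

# Step 2: plain one-to-one letters (hypothetical, incomplete table).
SINGLE_LETTERS = str.maketrans({
    nfc(latin): cyrillic
    for latin, cyrillic in [
        ("a", "а"), ("d", "д"), ("e", "е"), ("i", "и"), ("k", "к"),
        ("l", "л"), ("m", "м"), ("n", "н"), ("o", "о"), ("p", "п"),
        ("r", "р"), ("s", "с"), ("u", "у"), ("v", "в"), ("y", "ы"),
        ("ć", "ч"), ("ö", "ӧ"),
    ]
})


def transliterate(text):
    """Apply the ordered context rules, then the one-to-one letter table."""
    text = nfc(text)
    for pattern, replacement in CONTEXT_RULES:
        text = pattern.sub(replacement, text)
    # Characters not in the table (digits, punctuation, spaces) pass through.
    return nfc(text.translate(SINGLE_LETTERS))


print(transliterate("mödis ivan"))
```

The order matters here: the i rule looks for a Latin d, so it has to run before d is turned into д. Adapting the sketch to another text would mean editing the two tables, not the function.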

I also often have situations where I would like to have the original sentence and the transliterated one separated by a tab, so that the lines would have a form like this:

mödis ivan velöććyny pećoraö ščötövödnej kurs vylö.    мӧдiс иван велӧччыны печораӧ щӧтӧвӧдней курс вылӧ.
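The tab-separated layout can be produced by keeping each original line and appending its converted form. The following Python sketch assumes UTF-8 files; `pair_lines`, `convert_file` and `demo_convert` are hypothetical names, and `demo_convert` is only a four-letter stand-in for a real set of transliteration rules.

```python
def pair_lines(lines, convert):
    """Yield 'original<TAB>converted' for every input line."""
    for line in lines:
        original = line.rstrip("\n")
        yield f"{original}\t{convert(original)}"


def convert_file(source_path, target_path, convert):
    """Read a UTF-8 text file and write the paired lines to another file."""
    with open(source_path, encoding="utf-8") as source, \
         open(target_path, "w", encoding="utf-8") as target:
        for paired in pair_lines(source, convert):
            target.write(paired + "\n")


# Stand-in converter (ASSUMPTION: handles only the letters k, u, r, s).
DEMO_TABLE = str.maketrans({"k": "к", "u": "у", "r": "р", "s": "с"})


def demo_convert(text):
    return text.translate(DEMO_TABLE)


for paired in pair_lines(["kurs\n", "12\n"], demo_convert):
    print(paired)
```

Lines that contain only numbers come out with the same content on both sides of the tab, because nothing in them matches a rule.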

Of course there are many more questions, but after learning these basics I think I can move forward independently. Learning this would help me a lot. Thanks in advance!

Niko

Jorge Campos
nikopartanen
  • 2
    [Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.](http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html) (I should note that the linked article takes a far more nuanced stance than this quote may suggest. I recommend you read it and reconsider whether you really *need* regexes.) – JDB Dec 12 '13 at 18:58
  • 8
    This is a very nicely worded question, and it is clear what you need. However, it is mostly a spec for "I am a beginner, please teach me the basics for coding in my chosen problem domain". SO works best when there is a more specific question, about actual code that you have written. Could you perhaps show an example of a regex you have written that is an attempt at your sample sentence, and explain where it goes wrong and what is preventing you moving forward with that one sentence? – Neil Slater Dec 12 '13 at 19:00
  • 1
    Well, I'm not a programmer of any sort, so there is no code that I have ever written. But I can start learning if that's necessary. For now it would already help me a lot to know the best direction to take with this kind of issue and, if it needs some programming, in what language. I think I could just open the file in any text editor and start making search-and-replace operations, like "find all instances of 'ńa' and replace them with 'ня'". If I make 50-100 little changes like this I'll get the result I want, but there must be a better way. Maybe some program has a tool for this? – nikopartanen Dec 12 '13 at 19:12
  • 2
    @user3096576 Probably not many programs do, but almost every modern programming language does. Perl is widely and justly renowned for its facility at text handling, and has good Unicode support; if you've never done any programming, you've got a few basic concepts to pick up, but once you've got them in hand you'll have little difficulty packaging up your 50-100 discrete transforms into a single program -- with the benefit that, if you need to do something more complex than "replace this string with that string", you'll already be using a tool with the flexibility to make that possible. – Aaron Miller Dec 12 '13 at 19:17
  • 1
    There are a few things you should read. Start by looking at http://www.perl-tutorial.org, or get a book about Perl. *Learning Perl* by Randal L. Schwartz might be useful. If you already have some knowledge of programming, look at *Beginning Perl* by Curtis Ovid Poe, which is [available for free on archive.org](http://web.archive.org/web/20120709053246/http://ofps.oreilly.com/titles/9781118013847/index.html). This looks like an interesting task that can be done in a very pragmatic way, to solve your immediate problem, or in a more sophisticated way. In the latter case, please put it on CPAN. – simbabque Dec 12 '13 at 20:13
