
I have a text file that contains 32000 lines. The data is in Arabo-Persian script; however, each line also contains the Roman transcription of the first word.

دێان diêyan بنووڕه‌ ‌دگان نگا دگان‌

دێان‌ شكنه diêyan şêkêne دگان‌ شكنه

دیدن dîdin بنووڕه‌ ‌دید نگا دید و تركیباتش

I need to put a comma before and after the Roman transcription. I have written this, but it puts a comma next to every character of the Roman transcription:

import re

output = open("output.txt","w")
input = open("sample.txt").read()

for word in input:
    output.write(re.sub(r'^([a-z])', r',\1', word))


output.close() 

Any suggestions?

Z Azin

2 Answers


Try:

re.sub('([a-z].*[a-z])', r',\1,', word)

Output File:

دێان ,diêyan, بنووڕه‌ ‌دگان نگا دگان‌

دێان‌ شكنه ,diêyan şêkêne, دگان‌ شكنه

دیدن ,dîdin, بنووڕه‌ ‌دید نگا دید و تركیباتش

Note that this only works if the transcription starts and ends with a character in [a-z]; the characters in between can be anything.
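One likely reason the loop in the question misbehaves: `open(...).read()` returns a single string, and iterating over a string yields one character at a time, so `re.sub` runs on each character separately. Below is a minimal sketch that applies the pattern once per line instead. The helper names `mark_line` and `mark_text` are made up for illustration.

```python
import re

# Greedy span from the first a-z letter to the last a-z letter on the line.
SPAN = re.compile(r'([a-z].*[a-z])')

def mark_line(line):
    """Put a comma before and after the a-z ... a-z span of one line."""
    return SPAN.sub(r',\1,', line, count=1)

def mark_text(text):
    """Apply mark_line to every line of a multi-line string."""
    # splitlines(keepends=True) yields whole lines, not single characters.
    return "".join(mark_line(line) for line in text.splitlines(keepends=True))
```

If a transcription begins or ends with a letter outside a-z (for example `ş`), the commas land next to the nearest a-z letter rather than at the true word boundary.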

Sakurai
  • It didn't work. ê is not Roman, but it is not important because it comes in the middle of the word. We only need to mark the beginning and end of the words. – Z Azin Feb 03 '21 at 12:31

Give this a try:

re.sub(r'(([a-zêîş]+ ?)+)', r',\1', word)

It will produce the following output for the sample text you've provided (note that it only adds the comma before the transcription):

دێان ,diêyan بنووڕه‌ ‌دگان نگا دگان‌
دێان‌ شكنه ,diêyan şêkêne دگان‌ شكنه
دیدن ,dîdin بنووڕه‌ ‌دید نگا دید و تركیباتش

You'll need to add any other special characters that occur in your transcriptions to the character class in the pattern.
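For a comma on both sides, here is a rough sketch along the same lines. It assumes the transcription is lowercase and that `êîş` are the only extra letters. `EXTRA_LETTERS`, `add_commas` and `convert` are hypothetical names, and the file names passed to `convert` would be the ones from the question.

```python
import re

# Assumption: extend this string with any other special letters in your data.
EXTRA_LETTERS = "êîş"

# One or more transcription words separated by single spaces; the match
# never ends in a space, so the closing comma sits right after the last letter.
TRANSCRIPTION = re.compile(
    rf"([a-z{EXTRA_LETTERS}]+(?: [a-z{EXTRA_LETTERS}]+)*)"
)

def add_commas(line):
    """Wrap the first run of transcription words on the line in commas."""
    return TRANSCRIPTION.sub(r",\1,", line, count=1)

def convert(src_path, dst_path):
    """Read src_path line by line and write the marked lines to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # iterating a file object yields whole lines
            dst.write(add_commas(line))
```

Calling `convert("sample.txt", "output.txt")` would process the whole file. The explicit `encoding="utf-8"` is an assumption about how the file is stored.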