Insert a comma between Arabic(Persian) and English words in a text using Regex in Python

Question

I have a txt file which contains 32000 lines. The data is in Arabo-Persian script; however, each line also contains the Roman transcription of the first word.

دێان diêyan بنووڕه‌ ‌دگان نگا دگان‌

دێان‌ شكنه diêyan şêkêne دگان‌ شكنه

دیدن dîdin بنووڕه‌ ‌دید نگا دید و تركیباتش

I need to put a comma before and after the Roman transcription. I have written this, but it puts a comma before every character of the Roman transcription:

import re

output = open("output.txt","w")
input = open("sample.txt").read()

for word in input:
    output.write(re.sub(r'^([a-z])', r',\1', word))


output.close()

Any suggestions?

Sakurai · Answer 1 · 2021-02-04T02:17:35.677

1

Try

re.sub('([a-z].*[a-z])', r',\1,', word)

Output File:

دێان ,diêyan, بنووڕه‌ ‌دگان نگا دگان‌

دێان‌ شكنه ,diêyan şêkêne, دگان‌ شكنه

دیدن ,dîdin, بنووڕه‌ ‌دید نگا دید و تركیباتش

But the word has to start and end with [a-z].
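For completeness, here is a minimal sketch of how that substitution could be applied line by line rather than character by character (iterating over the result of `read()` yields single characters, which is why the original loop misbehaves). The names `add_commas` and `convert` and the UTF-8 encoding are assumptions for illustration, not part of the question.

```python
import re

# Greedy match from the first ASCII lowercase letter on the line to the last one.
LATIN_SPAN = re.compile(r'([a-z].*[a-z])')

def add_commas(line):
    # Wrap the matched span in commas; lines without ASCII letters are unchanged.
    return LATIN_SPAN.sub(r',\1,', line, count=1)

def convert(src_path, dst_path):
    # Hypothetical helper: assumes both files are UTF-8 encoded.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # iterate over lines, not characters
            dst.write(add_commas(line))
```

Calling `convert("sample.txt", "output.txt")` would process the whole file. The same limitation applies: a transcription that begins or ends with a letter outside a-z (for example ş) will be marked in the wrong place.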

edited Feb 04 '21 at 02:17

answered Feb 03 '21 at 06:06

Sakurai


It didn't work. ê is not Roman, but it is not important because it comes in the middle of the word. We only need to mark the beginning and end of the words. – Z Azin Feb 03 '21 at 12:31

score 1 · Answer 2 · answered Feb 03 '21 at 19:37

Give this a try:

re.sub(r'(([a-zêîş]+ ?)+)', r',\1', word)

It will produce the following output for the sample text you've provided:

دێان ,diêyan بنووڕه‌ ‌دگان نگا دگان‌
دێان‌ شكنه ,diêyan şêkêne دگان‌ شكنه
دیدن ,dîdin بنووڕه‌ ‌دید نگا دید و تركیباتش

You'll need to add any special characters you might have in the pattern.
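If listing every special character by hand is impractical, one possible alternative is to match a range of Latin code points instead. The sketch below is only a sketch under an assumption: that all transcription letters fall in basic Latin or in U+00C0 to U+024F (which covers ê, î and ş, but also includes a couple of non-letter signs such as ×). It also adds the comma on both sides, as the question asks. The name `mark_transcription` is hypothetical.

```python
import re

# Assumption: transcription letters are basic Latin or lie in U+00C0..U+024F
# (Latin-1 Supplement letters and Latin Extended-A/B).
LATIN = r'[A-Za-z\u00C0-\u024F]'

# One or more Latin words separated by single spaces, with no trailing space.
LATIN_RUN = re.compile(rf'{LATIN}+(?: {LATIN}+)*')

def mark_transcription(line):
    # \g<0> refers to the whole match, so the run is wrapped in commas.
    return LATIN_RUN.sub(r',\g<0>,', line)
```

Because the pattern stops at the last Latin letter, the space before the following Arabic-script word stays outside the commas.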
