
I work on subtitles and I have an Arabic file. When I open it in Notepad, right-click, and select SHOW UNICODE CONTROL CHARACTERS, it shows me some weird characters on the left of every line. I tried many ways to remove them but failed. I also tried Notepad++, but that failed too.

Notepad ++ SUBTITLE EDIT EXCEL WORD

288 00:24:41,960 --> 00:24:43,840 ‫أتعلم، قللنا من شأنك فعلاً‬

289 00:24:44,000 --> 00:24:47,120 ‫كان علينا تجنيدك لتكون جاسوساً‬ ‫مكان (كاي سي)‬

290 00:24:47,280 --> 00:24:51,520 ‫لا تعلمون كم أنا سعيد‬ ‫لسماع ذلك‬

291 00:24:54,800 --> 00:24:58,160 ‫لا تقلق، سيستيقظ نشيطاً غداً‬

292 00:24:58,320 --> 00:25:00,800 ‫ولن يتذكر ما حصل‬ ‫في الساعات الـ٦‬

The Unicode characters do not show up in the sample above. The character is U+202B, which shows up as a ¶ sign; after googling it, I think it is called a PILCROW.

The issue is that, with this character present, the subtitles are not displayed correctly in the PS4 app.

I need this PILCROW sign to go away. This website shows the issue in the file: https://www.soscisurvey.de/tools/view-chars.php

2 Answers


The PILCROW is used by various software and publishers to show the end of a line in a document. The actual Unicode character does not exist in your file so you can't get rid of it.

JGNI
  • But we gave this file to an agency to remove them, and it took them a day, but they did it! Links to the files are here: https://drive.google.com/file/d/138J1uxpWpIn7U9axn5seDEYL0trar1cn/view?usp=sharing https://drive.google.com/file/d/1H7Q5C9VbAvgMiwSk3FqLLYyhknn2-xZC/view?usp=sharing – Cassal Michael Jun 12 '19 at 11:39

The Unicode characters in these lines are 'RIGHT-TO-LEFT EMBEDDING' (code \u202b) and 'POP DIRECTIONAL FORMATTING' (code \u202c). They are used to indicate that the enclosed text should be rendered right-to-left instead of the Western left-to-right direction.
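To check which invisible characters a piece of text actually contains, a short Python sketch along these lines may help. The helper name `format_chars` is made up for illustration, and it assumes that the characters of interest all fall in the Unicode "Cf" (format) category, which is the case for U+202B and U+202C.

```python
import unicodedata
from collections import Counter


def format_chars(text):
    """Count invisible format characters (Unicode category Cf) in text.

    Hypothetical helper: returns a dict mapping "U+XXXX NAME" to a count.
    """
    counts = Counter(ch for ch in text if unicodedata.category(ch) == "Cf")
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}": n
        for ch, n in counts.items()
    }


# One subtitle line wrapped in the two control characters, written as escapes
sample = "\u202b" + "لا تقلق" + "\u202c"
for label, count in format_chars(sample).items():
    print(label, count)
```

Running it on the full contents of a subtitle file (read with the right encoding) should list every such control character together with how often it occurs.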

Now, these characters are included as hints to the application displaying the text, rather than to actually reverse the text, so they can likely be removed without compromising how the text is displayed.

Now, this is a programming Q&A site, but you did not indicate any programming language you are familiar with, at least enough to run a program. So it is very hard to know how to give an answer that is suitable for you.

Python can be used to create a small program to filter such characters from a file, but I am not willing to write a full-fledged GUI program, or a web app that you could run, just as an answer here.

A program that can work from the command line just to filter out a few characters is another thing - as it is just a few lines of code.

You have to store the following listing as a file named, say, "fixsubtitles.py", and then, in a terminal ("cmd" if you are on Windows), type python3 fixsubtitles.py \path\to\subtitlefile.txt and press Enter. (On Windows the command may be python or py instead of python3.)

That, of course, after installing the Python 3 runtime from http://python.org (on Mac or Linux it is usually already installed).

import sys
from pathlib import Path

encoding = "utf-8"
# Translation table that deletes RIGHT-TO-LEFT EMBEDDING and POP DIRECTIONAL FORMATTING
remove_set = str.maketrans("", "", "\u202b\u202c")

if len(sys.argv) < 2:
    print("Usage: python3 fixsubtitles.py [filename]", file=sys.stderr)
    sys.exit(1)

path = Path(sys.argv[1])
data = path.read_text(encoding=encoding)
# The file is overwritten in place, so keep a backup copy of the original
path.write_text(data.translate(remove_set), encoding=encoding)
print("Done")

You may need to adjust the encoding, as Windows does not always use UTF-8 (the files can be in, for example, "cp1256"; if you get a Unicode error when running the program, try using this in place of "utf-8"). You may also need to add more characters to the set of characters to be removed; the tool you linked in the question should show you other such characters, if any. Other than that, the program above should work.
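As a rough sketch of both adjustments, the variant below tries "utf-8" first and falls back to "cp1256", and removes a wider set of directional control characters. The names `strip_bidi` and `clean_file` are made up for illustration, the choice of which characters to strip (embeddings/overrides U+202A to U+202E and isolates U+2066 to U+2069) is an assumption you should check against what the character viewer reports for your file, and writing to a separate output file instead of overwriting is a design choice of this sketch.

```python
from pathlib import Path

# Assumed removal set: embedding/override controls (U+202A..U+202E)
# and isolate controls (U+2066..U+2069). Adjust to match your file.
BIDI_CONTROLS = "".join(
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
)
STRIP_TABLE = str.maketrans("", "", BIDI_CONTROLS)


def strip_bidi(text):
    """Return text with the control characters listed above removed."""
    return text.translate(STRIP_TABLE)


def clean_file(src, dst, encodings=("utf-8", "cp1256")):
    """Decode src with the first encoding that works, strip, write to dst.

    Returns the name of the encoding that was used.
    """
    raw = Path(src).read_bytes()
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
        # Re-encode with the same encoding; line endings are left untouched
        Path(dst).write_bytes(strip_bidi(text).encode(enc))
        return enc
    raise ValueError(f"could not decode {src} with any of {encodings}")


# Quick check on a string wrapped in the two characters from the question
print(strip_bidi("\u202babc\u202c") == "abc")
```

For example, `clean_file("subtitles.srt", "subtitles_clean.srt")` (both file names are placeholders) would leave the original file alone and write the cleaned copy next to it.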

jsbueno