Removing newline \n from tesseract return values

Question

I have a bunch of images, each one corresponding to a name, that I'm passing to Pytesseract for recognition. Some of the names are a bit long and had to be written on multiple lines, so passing them for recognition and saving the output to a .txt file results in each part being written on a new line.

Here's an example

This is being recognized as

MARTHE
MVUMBI

While I need them to be on the same line.

Another Example

It should be MOHAMED ASSAD YVES but it's actually being stored as:

MOHAMED

ASSAD YVES

I thought I was filtering out this sort of thing, but apparently it's not working. Here's the code for recognition, storing and filtering that I'm using.

import glob
import os
import re

import pytesseract

# Adding custom options
folder = r"C:\Users\lenovo\PycharmProjects\SoftOCR_PFE\name_results"
custom_config = r'--oem 3 --psm 6'
words = []
filenames = os.listdir(folder)
filenames.sort()
# Run OCR on every PNG in each sub-directory, one "\n" entry per directory
for directory in filenames:
    print(directory)
    for img in glob.glob(rf"name_results\{directory}\*.png"):
        text = pytesseract.image_to_string(img, config=custom_config)
        words.append(text)
    words.append("\n")
# Keep upper-case entries only and drop the NOM / PRENOM labels
all_caps = [s.strip() for s in words if s == s.upper() and s != 'NOM' and s != 'PRENOM']

no_blank = [string for string in all_caps if string != ""]

with open('temp.txt', 'w+') as filehandle:
    for listitem in no_blank:
        filehandle.write(f'{listitem}\n')
with open("temp.txt") as f:
    uncleanText = f.read()
# Strip everything except letters, digits and whitespace
cleanText = re.sub(r'[^A-Za-z0-9\s\d]+', '', uncleanText)
with open('saved_names.txt', 'w') as f:
    f.write(cleanText)

I had to post again since my last question was posted really late at night and didn't get any action.

You could add that to a list and then use ```''.join(your_list)``` — , Jun 06 '21 at 10:37

Yuri Khristich · Accepted Answer · 2021-06-06T18:50:04.933


I would try to add after the line:

text = pytesseract.image_to_string(img, config=custom_config)

This line:

text = text.replace("\n", " ")
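A possible variation, in case the OCR output contains blank lines or trailing whitespace (as in the MOHAMED / ASSAD YVES example): splitting on any whitespace and re-joining with a single space avoids double spaces and a trailing space. This is only a sketch; the helper name `flatten_ocr_text` and the sample strings are made up for illustration and are not actual Tesseract output.

```python
def flatten_ocr_text(text: str) -> str:
    """Collapse every run of whitespace (newlines, form feeds, spaces) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces,
    # so leading/trailing whitespace and blank lines disappear as well.
    return " ".join(text.split())


# Hypothetical strings shaped like the recognised names in the question
print(flatten_ocr_text("MARTHE\nMVUMBI\n"))             # MARTHE MVUMBI
print(flatten_ocr_text("MOHAMED\n\nASSAD YVES\n\x0c"))  # MOHAMED ASSAD YVES
```

In the loop from the question this would be applied right after the `image_to_string` call, e.g. `text = flatten_ocr_text(text)`, before appending to `words`.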

Update

There was another problem: how to join every two consecutive lines in the file with ", " and save the result back to the file. It can be done this way:

with open("temp.txt", "r") as f:
    names = f.readlines()

names = [n.replace("\n", "") for n in names]
names = [", ".join(names[i:i+2]) for i in range(0, len(names), 2)]

with open("temp.txt", "w") as f:
    f.write("\n".join(names))
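To see what the pairing step does without touching any file, here is a small in-memory sketch of the same slicing idea. The function name `join_pairs` and the sample names are hypothetical. Unlike the snippet above, it also skips blank lines, which is an assumption about what is wanted.

```python
def join_pairs(lines, sep=", "):
    """Join consecutive lines two by two; a leftover last line is kept on its own."""
    # Drop surrounding whitespace and skip empty lines (assumption: blanks are noise)
    cleaned = [line.strip() for line in lines if line.strip()]
    # Step through the list two items at a time and join each slice
    return [sep.join(cleaned[i:i + 2]) for i in range(0, len(cleaned), 2)]


# Made-up sample data: surname line followed by given-name line
sample = ["DUPONT\n", "MARTHE MVUMBI\n", "BEN SALEM\n", "MOHAMED ASSAD YVES\n", "LEFTOVER\n"]
for row in join_pairs(sample):
    print(row)
# DUPONT, MARTHE MVUMBI
# BEN SALEM, MOHAMED ASSAD YVES
# LEFTOVER
```

With an odd number of lines the last slice has a single element, so it is written out unchanged rather than raising an error.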

edited Jun 06 '21 at 18:50

answered Jun 06 '21 at 14:01

Yuri Khristich


Could I possibly message you in private. A lot of file writing is going on and it's driven me quite insane at this point. – Moudhaffer Bouallegui Jun 06 '21 at 16:02
Okay. But I'm not sure how it works here. – Yuri Khristich Jun 06 '21 at 16:05
it doesn't, u can't message on SO. If you have Discord maybe? or some other form of communication – Moudhaffer Bouallegui Jun 06 '21 at 16:28
I just followed you twitter.com/MoudhafferBoua1, I don't really use twitter dunno how I could get in touch with you but here's my discord also B_moudhaffer#8596 – Moudhaffer Bouallegui Jun 06 '21 at 16:56
