I have a bunch of images, each one corresponding to a name, that I'm passing to Pytesseract for recognition. Some of the names are a bit long and had to be written on multiple lines, so passing them for recognition and saving the result to a .txt file puts each part on a new line.
Here's an example:
This is being recognized as:
MARTHE
MVUMBI
But I need them to be on the same line.
Another example:
It should be MOHAMED ASSAD YVES, but it's actually being stored as:
MOHAMED
ASSAD YVES
I thought I was filtering out this sort of thing, but apparently it's not working. Here's the code I'm using for recognition, storing and filtering.
import glob
import os
import re

import pytesseract

# Adding custom options
folder = r"C:\Users\lenovo\PycharmProjects\SoftOCR_PFE\name_results"
custom_config = r'--oem 3 --psm 6'
words = []
filenames = os.listdir(folder)
filenames.sort()
for directory in filenames:
    print(directory)
    # Run OCR on every PNG in this sub-folder
    for img in glob.glob(rf"name_results\{directory}\*.png"):
        text = pytesseract.image_to_string(img, config=custom_config)
        words.append(text)
        words.append("\n")
# Keep only all-uppercase results and drop the NOM / PRENOM labels
all_caps = [s.strip() for s in words if s == s.upper() and s != 'NOM' and s != 'PRENOM']
no_blank = [string for string in all_caps if string != ""]
with open('temp.txt', 'w+') as filehandle:
    for listitem in no_blank:
        filehandle.write(f'{listitem}\n')
# Strip everything except letters, digits and whitespace
with open("temp.txt") as filehandle:
    uncleanText = filehandle.read()
cleanText = re.sub(r'[^A-Za-z0-9\s]+', '', uncleanText)
with open('saved_names.txt', 'w') as filehandle:
    filehandle.write(cleanText)
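For reference, the behaviour I'm after can be sketched without Tesseract at all. This is only a minimal sketch, assuming each image contains exactly one name, so that every line break inside a single `image_to_string` result can be treated as a space. The helper name `join_ocr_lines` and the sample strings are made up for illustration; they are not part of pytesseract.

```python
def join_ocr_lines(text: str) -> str:
    """Collapse every run of whitespace (newlines, form feeds, spaces)
    in one OCR result into a single space.

    Assumption: the whole string belongs to a single name.
    """
    # str.split() with no arguments splits on any whitespace and drops
    # empty pieces, so leading/trailing newlines disappear as well.
    return " ".join(text.split())


# Hypothetical strings shaped like the per-image OCR output shown above.
samples = ["MARTHE\nMVUMBI\n", "MOHAMED\nASSAD YVES\n\x0c"]
names = [join_ocr_lines(s) for s in samples]

for name in names:
    print(name)
```

Something like this would have to be applied to each `text` before it is appended to `words`, since once everything is written to temp.txt the line breaks inside a name can no longer be told apart from the breaks between names.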
I had to post again since my last question was posted really late at night and didn't get any action.