I have a function that converts PDF files to text files. It all works fine, except that I need to clean up paragraphs that contain bullet points. For each line that starts with the marker "Bullet_Point", I want to replace the trailing \n with "end_bullet". Then I want to join the lines that start with "Bullet_Point" and end with "end_bullet", using ";" as the separator, so that I can treat them as one paragraph rather than as separate lines when I scrape the text later.

I can't figure out how to remove the \n for only those specific lines; I want to keep all the other \n in the file. This is what I have so far, and it works fine. I want to do this inside the function below, and I have placed a comment exactly where I want to remove the \n from the end of the line so that I can join those lines with ";" to create a paragraph.
import os

import pytesseract
from PIL import Image

# destination_jpg, save_text_path and junkjpgs() are defined elsewhere (not shown)


def convert2text(name):
    # get the jpgs
    jpgFiles = os.listdir(destination_jpg)
    jpgFiles.sort()
    this_text = open(save_text_path + name, 'a', encoding="utf-8")
    pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract-OCR/tesseract.exe"
    tessdata_dir_config = '--tessdata-dir "C:/Program Files/Tesseract-OCR/tessdata"'
    #replacement = ""
    # this works fine
    for i in range(len(jpgFiles)):
        # Parentheses replace the backslash continuations, which cannot be
        # followed by a comment on the same line.
        Text1 = (
            pytesseract.image_to_string(
                Image.open(destination_jpg + jpgFiles[i]),
                # pass the variable itself, not its name inside a string
                config=tessdata_dir_config + " --psm 6 --oem 1")
            .replace('\n\n', " new_paragraph ")
            .replace('\n', " ")
            .replace("new_paragraph", '\n')
            # I want to remove the \n from the lines marked by these three replacements
            .replace("«", " Bullet_Point ")
            .replace("*", " Bullet_Point ")
            .replace("»", " Bullet_Point ")
        )
        print('Page ' + str(i + 1) + ' done')
        this_text.write(Text1)
        print('Processing next page')
    this_text.close()
    print('removing the jpgs ... ')
    junkjpgs(destination_jpg)
    print('finished this PDF ... ')
Example input:
this is a test
bullet_point bla bla bla end_bullet
bullet_point abc abc abc end_bullet
bullet_point def def def end_bullet
bullet_point hij hij hij end_bullet
Desired result: this is a test; bla bla bla; abc abc abc; def def def; hij hij hij.
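
One possible way to get from the example input to the result above is a separate post-processing step applied to the text before it is written out. The following is only a minimal sketch under stated assumptions: the name `join_bullets` and its `sep` and `end` parameters are hypothetical, each bullet is assumed to sit on its own line starting with the marker, the "end_bullet" marker is treated as optional, and the marker is matched case-insensitively so that both "Bullet_Point" and "bullet_point" are handled.

```python
import re

# Assumption: the marker opens the line (leading spaces allowed), any capitalization.
BULLET_START = re.compile(r"^\s*bullet_point\b\s*", re.IGNORECASE)
# Assumption: "end_bullet", if present, closes the line.
BULLET_END = re.compile(r"\s*\bend_bullet\s*$", re.IGNORECASE)


def join_bullets(text, sep="; ", end="."):
    """Merge each run of bullet lines into the line before it (hypothetical helper)."""
    result = []        # output lines built so far
    open_para = False  # True while the last output line is still collecting bullets
    for line in text.split("\n"):
        if BULLET_START.match(line):
            # Strip both markers and keep only the bullet's own text.
            item = BULLET_END.sub("", BULLET_START.sub("", line)).strip()
            if result and result[-1].strip():
                # Attach to the previous non-empty line; this drops the "\n" between them.
                result[-1] = result[-1].rstrip() + sep + item
            else:
                # No usable previous line: the bullet starts a new paragraph.
                result.append(item)
            open_para = True
        else:
            if open_para:
                result[-1] += end  # close the joined paragraph
                open_para = False
            result.append(line)    # all other lines and their "\n" are kept
    if open_para:
        result[-1] += end
    return "\n".join(result)


sample = (
    "this is a test\n"
    "bullet_point bla bla bla end_bullet\n"
    "bullet_point abc abc abc end_bullet\n"
    "bullet_point def def def end_bullet\n"
    "bullet_point hij hij hij end_bullet"
)
print(join_bullets(sample))
# prints: this is a test; bla bla bla; abc abc abc; def def def; hij hij hij.
```

Inside convert2text this could, for instance, be called on Text1 just before this_text.write(Text1), provided the bullet markers still start their own lines at that point. Lines that do not start with the marker are passed through unchanged, so every other \n in the file is preserved.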