I have a function that converts PDF files to text files. It works fine, except that I need to clean up paragraphs that contain bullet points. For each line that starts with the word "Bullet_Point", I want to replace the trailing \n with "end_bullet", and then join the lines that start with "Bullet_Point" and end with "end_bullet" using a ";" separator. That way I can treat a bulleted list as one paragraph, rather than as separate lines, when I scrape the text later on.

I can't figure out how to remove the \n for just those lines; I want to keep every other \n in the file. This is what I have so far, and it works. I'd like to do the cleanup inside the function below. I have placed comments exactly where I want to remove the \n from the end of the line, so that I can join these lines with ";" to create a paragraph.

import os

import pytesseract
from PIL import Image

# destination_jpg, save_text_path and junkjpgs() are assumed to be defined elsewhere in the script

def convert2text(name):
    # get the jpgs
    jpgFiles = os.listdir(destination_jpg)
    jpgFiles.sort()
    this_text = open(save_text_path + name, 'a', encoding="utf-8")
    pytesseract.pytesseract.tesseract_cmd = "C:/Program Files/Tesseract-OCR/tesseract.exe"
    tessdata_dir_config = '--tessdata-dir "C:/Program Files/Tesseract-OCR/tessdata"'
    #replacement = ""
    # this works fine
    for i in range(len(jpgFiles)):
        # the chain is wrapped in parentheses: a comment after a trailing backslash is a syntax error
        # the config is built from the variable, not from the literal text "tessdata_dir_config"
        Text1 = (pytesseract.image_to_string(Image.open(destination_jpg + jpgFiles[i]),
                                             config=tessdata_dir_config + " --psm 6 --oem 1")
                 .replace('\n\n', " new_paragraph ")
                 .replace('\n', " ")
                 .replace("new_paragraph", '\n')
                 .replace("«", " Bullet_Point ")  # I want to remove the \n from these string lines
                 .replace("*", " Bullet_Point ")  # I want to remove the \n from these string lines
                 .replace("»", " Bullet_Point "))  # I want to remove the \n from these string lines
        print('Page ' + str(i + 1) + ' done')
        this_text.write(Text1)
        print('Processing next page')
    this_text.close()
    print('removing the jpgs ... ')
    junkjpgs(destination_jpg)
    print('finished this PDF ... ')

Example: this is a test
bullet_point bla bla bla end_bullet
bullet_point abc abc abc end_bullet
bullet_point def def def end_bullet
bullet_point hij hij hij end_bullet

Result: this is a test; bla bla bla; abc abc abc; def def def; hij hij hij.
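
For reference, the transformation in the example can be sketched on plain strings, independent of the OCR step. This is only a minimal sketch under some assumptions: the function name join_bullets is hypothetical, the markers are matched case-insensitively, and each run of bullet lines is merged into the line directly before it. It does not add the trailing "." shown in the result.

```python
import re

# Marker names taken from the example; matching is case-insensitive (an assumption).
BULLET_RE = re.compile(r"^\s*bullet_point\s*", re.IGNORECASE)
END_RE = re.compile(r"\s*end_bullet\s*$", re.IGNORECASE)


def join_bullets(text, sep="; "):
    """Merge each run of bullet lines into the line that precedes it."""
    out = []
    for line in text.split("\n"):
        if BULLET_RE.match(line):
            # Strip the leading and trailing markers, keeping only the item text.
            item = END_RE.sub("", BULLET_RE.sub("", line))
            if out and out[-1].strip():
                # Append to the previous non-empty line instead of starting a new one.
                out[-1] = out[-1].rstrip() + sep + item
            else:
                out.append(item)
        else:
            # Non-bullet lines keep their own line break.
            out.append(line)
    return "\n".join(out)


sample = (
    "this is a test\n"
    "bullet_point bla bla bla end_bullet\n"
    "bullet_point abc abc abc end_bullet\n"
    "bullet_point def def def end_bullet\n"
    "bullet_point hij hij hij end_bullet\n"
    "A separate paragraph stays on its own line."
)
print(join_bullets(sample))
```

All other line breaks are left alone, so only the bullet lines are pulled up into one paragraph.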

1 Answer

I managed to find a solution, so I'm posting it, as it might help someone:

for i in range(len(jpgFiles)):
    # the config is built from the variable, not from the literal text "tessdata_dir_config"
    Text1 = pytesseract.image_to_string(Image.open(destination_jpg + jpgFiles[i]),
                                        config=tessdata_dir_config + " --psm 6 --oem 1")\
        .replace('\n\n', " new_paragraph ")\
        .replace('\n', " ")\
        .replace("new_paragraph", '\n')\
        .replace("«", " Bullet_Point ")\
        .replace("*", " Bullet_Point ")\
        .replace("»", " Bullet_Point ")\
        .replace("; \n", " ")\
        .replace(" Bullet_Point ", ";")\
        .replace(": \n", ":")\
        .replace("and \nNew_Paragraph", "and")\
        .replace("\nNew_Paragraph", " ")\
        .replace("  ; ", ", ")\
        .replace(" ; ", ", ")\
        .replace(":, ", ":")
    # note: the capitalised "New_Paragraph" is not produced by the earlier steps,
    # so those two replaces only match if that exact text is already in the OCR output
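
To try the chain without running Tesseract, the same replacements can be wrapped in a small function and fed a hand-written string. This is just a sketch: the name clean_page and the sample text are made up for illustration, and real OCR output will have different spacing.

```python
def clean_page(raw):
    """Apply the same replace chain as above to one page of raw text."""
    return (
        raw.replace("\n\n", " new_paragraph ")
        .replace("\n", " ")
        .replace("new_paragraph", "\n")
        .replace("«", " Bullet_Point ")
        .replace("*", " Bullet_Point ")
        .replace("»", " Bullet_Point ")
        .replace("; \n", " ")
        .replace(" Bullet_Point ", ";")
        .replace(": \n", ":")
        .replace("and \nNew_Paragraph", "and")
        .replace("\nNew_Paragraph", " ")
        .replace("  ; ", ", ")
        .replace(" ; ", ", ")
        .replace(":, ", ":")
    )


# Hypothetical page: a heading, two bullets, then a new paragraph.
sample = "Items:\n* apples\n* pears\n\nNext paragraph"
print(repr(clean_page(sample)))
# prints 'Items:apples, pears \n Next paragraph'
```

With this input the bullets end up separated by ", " rather than ";", because the last three replaces turn " ; " into ", ". Drop those if you want to keep the ";" separator from the question.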