I am trying to extract text and images from a PDF in Python using the PyMuPDF library, but I can't preserve the position of the images relative to the text. For example, an image placed at the start of a page ends up at the bottom of that page's extracted output. That is wrong, because the image may then be attached to a different document. Please find my code below.

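import io
import re
import traceback

import fitz  # PyMuPDF
from PIL import Image

# doc is assumed to be an already opened PyMuPDF document, e.g. doc = fitz.open(...)
lst = []
img_name = 1
img_regex = r"(?:<p><img\s(.*?)</p>)"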

for i in range(len(doc)):

 page1 = doc.load_page(i)
 page1text = page1.get_text("xhtml")
 page1text = page1text.strip()
 page1text = page1text.strip('\n')
 page1text = re.sub(r'\s+', ' ', page1text)


 image_list = page1.get_images()

 if image_list:
     img_tag_pos = re.findall(img_regex, page1text, re.MULTILINE)
     d = page1.get_text("dict")
     blocks = d["blocks"] # the list of block dictionaries
     imgblocks = [b for b in blocks if b["type"] == 1]    
     for idx in range(len(imgblocks)):
    
        try:
            image = Image.open(io.BytesIO(imgblocks[idx]['image'] ))
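            # img_href was undefined in the original snippet; this tag format is assumed from the output shown below
            img_href = "<img src ='https://img_{}.jpeg' >".format(img_name)
            image.save("img_{}.jpeg".format(img_name))
            page1text = page1text.replace(img_tag_pos[idx], img_href)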
            img_name += 1
        except Exception as e:
            traceback.print_exc()

 lst.append(page1text)

My output looks like this:

PO-1935 CLINICAL OUTCOMES AND RADIO-BIOLOGICAL FEATURES CORRELATION IN EARLY PCa: AN EXPLORATORY ANALYSIS G. Corrao1,2, G. Marvaso1,2, M. Zaffaroni1, C.I. Fodor1, S. Volpe1,2, L. Bergamaschi3,2, D. Zerini1, A. Vingiani2,4, G. Petralia2,5, S. Alessi6, P. Pricolo6, G. Renne7, R. Orecchia8, B.A. Jereczek-Fossa1,2 1IEO European Institute of Oncology IRCSS, Radiation Oncology, Milan, Italy; 2University of Milan, Oncology and Hemato- Oncology, Milan, Italy; 3IEO European Institute of Oncology IRCSS, Oncology and Hemato-Oncology, Milan, Italy; 4INT Istituto Nazionale Tumori IRCSS, Pathology, Milan, Italy; 5IEO European Institute of Oncology IRCSS, Precision imaging and research unit, Milan, Italy; 6IEO European Institute of Oncology IRCSS, Radiology, Milan, Italy; 7IEO European Institute of Oncology IRCSS, Pathology and Laboratory Medicine, Milan, Italy; 8IEO European Institute of Oncology IRCSS, Scientific Directorate, Milan, Italy Purpose or Objective Phosphatase and tensin homolog (PTEN) deletion and Ki-67 expression are two of the most promising biomarkers PCa. Multiparametric magnetic resonance imaging (mp-MRI)-guided biopsy is a powerful and well-recognized tool for precision diagnosis and staging of PCa. The aim of the study is to assess whether a correlation can be identified between the pathological stage defined by a mp-MRI guided biopsy and both Ki-67 expression and PTEN deletion. Such correlation, if present, might be informative for staging purposes and for treatment personalization in PCa. Materials and Methods The study was conducted in the context of the phase II clinical study “Short-term radiotherapy for early prostate cancer with concomitant boost to the dominant lesion” (AIRC IG-13218). Nineteen patients accepted to undergo a further mp-MRI guided biopsy on Dominant Intraprostatic Lesion (DIL), and a new Gleason Score (GS) was assessed. All samples were analyzed with Immunohistochemistry (IHC) to assess Ki-67 expression and PTEN assessment. 
A correlation between up/downstaging, Ki-67 expression and PTEN loss was analyzed, and related with PCa outcomes (overall survival, biochemical and clinical relapse). This study was part of research notified to our Ethical Committee (nr N79). Results By the end of recruitment, 19 patients performed a mp-MRI- guided biopsy of the DIL without complications. All patients had clinical stage cT1c–cT2c cN0 cM0 according to TNM 8th edition and a PSA < 10 ng/ml, except 3 patients which had PSA > 10 ng/ml. For 11 patients the MRI-guided biopsy confirmed the findings of the first random-biopsy. On the contrary, for 5 patients GS was upgraded, with 4 patients re-classified as intermediate-risk instead of low-risk and 1 patient as high-risk instead of intermediate-risk. Finally, for 3 patients there was a down-grading, two of them from intermediate- to low-risk and 1 from intermediate favorable to intermediate unfavorable risk. An extensive representation of results is reported in Table 1. PTEN loss and Ki-67 expression were detected on available samples. Six patients had loss of PTEN, and Ki-67 ranged from 6% to 40%. Ki-67 was assessed for 18 patients and only one had Ki-67<6%. Follow-up at two years is available and 18 patients are still alive without evidence of disease (NED). One patient had a local and a clinical relapse of disease and underwent a partial prostate re-irradiation (35 Gy in 5 fr) in 2018 and is currently with NED. Of note, this relapse was observed in the only case of upstaging (from intermediate to high risk) described above. No correlations between up/down- staging, PTEN deletion, Ki-67 expression and mp-MRI characteristics were observed in the cohort analyzed.

<img src ='https://img_1689.jpeg' >

Screenshot from the PDF: (image not preserved)

Link to the PDF: https://cld.bz/3g6jJy

  • A PDF is a very complex file: every element, even every character, may have its own coordinates `(x, y)`. The file can store elements in a different order and still display them correctly by using those coordinates. You may need to use the coordinates to get the items in the expected order. – furas Sep 13 '22 at 08:41
  • Could you please help me with how I can change/append coordinates in my code? Thanks in advance. – Sourav Singh Sep 13 '22 at 09:20
  • Sorry, I didn't understand. Could you please explain that in layman's terms? – Sourav Singh Sep 14 '22 at 11:25
  • So how can we solve this problem? If possible, where should I make the changes in my code to handle it? – Sourav Singh Sep 23 '22 at 05:21
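
The coordinate-based ordering suggested in the first comment could look roughly like the sketch below. It is not a tested solution for the linked PDF. It assumes that each block returned by `page.get_text("dict")["blocks"]` carries a `bbox` of the form `(x0, y0, x1, y1)`, that image blocks have `type == 1` with the raw bytes under `image` and the file extension under `ext`, and that text blocks have `type == 0`. The helper names `blocks_in_reading_order`, `page_to_html`, and `extract_pdf` are made up for this example.

```python
def blocks_in_reading_order(blocks):
    """Sort block dicts top-to-bottom, then left-to-right, using bbox (x0, y0, x1, y1)."""
    return sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0]))


def page_to_html(blocks, first_img_no=1):
    """Return (html_text, images) for one page, with images kept in position.

    images is a list of (file_name, image_bytes) tuples; nothing is written
    to disk here, so the function can be tried on hand-made block dicts.
    """
    parts = []
    images = []
    img_no = first_img_no
    for block in blocks_in_reading_order(blocks):
        if block["type"] == 1:  # image block
            name = "img_{}.{}".format(img_no, block.get("ext", "png"))
            images.append((name, block["image"]))
            parts.append("<img src='{}'>".format(name))
            img_no += 1
        else:  # text block: join all spans and collapse whitespace
            text = " ".join(
                span["text"] for line in block["lines"] for span in line["spans"]
            )
            text = " ".join(text.split())
            if text:
                parts.append(text)
    return " ".join(parts), images


def extract_pdf(pdf_path):
    """Apply page_to_html to every page of a PDF and save the images."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    pages = []
    img_no = 1
    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("dict")["blocks"]
            html, images = page_to_html(blocks, img_no)
            for name, data in images:
                with open(name, "wb") as fh:
                    fh.write(data)
            img_no += len(images)
            pages.append(html)
    return pages
```

Because the image tag is appended at the point where its block occurs in the sorted list, no regex replacement on the XHTML output is needed. A plain top-to-bottom sort will interleave the columns of a multi-column page, so a layout like the one in the linked PDF would probably need the blocks grouped by column (for example by their `x0` value) before sorting each column by `y0`.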
