
I'm extracting text from a PDF using pdfminer.six. The source page looks like this:

[image: the original text as laid out in the PDF]

After parsing it, my result is as below:

Nr 48. Promująco na rozwój chorób alergicznych i wystąpienie objawów alergii 
działa zwiększenie aktywności/ilości: 

1) limfocytów Th1; 
2) limfocytów Th2; 
3) limfocytów Th17; 
Prawidłowa odpowiedź to: 
B. tylko 2. 
A. 1,4. 

4) IL-5. 
5) IL-12. 

C. 1,3. 

D. 2,4. 

E. 3,5. 

The order of the lines is mixed up. Is there a way to prevent this, for example by forcing pdfminer to read the file line by line? I have tried converting the PDF to HTML, but the result is a mess of separate span tags, one for each word.

mik.ro

1 Answer

OK, I've found a solution: increasing char_margin to 20 in LAParams.

from pdfminer.layout import LAParams
laparams = LAParams(char_margin=20)
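For completeness, here is a minimal sketch of how that setting might be passed to the extractor. It assumes pdfminer.six's high-level `extract_text` function; the helper name `extract_with_wide_char_margin` and the file name in the usage comment are hypothetical. As far as I understand it, `char_margin` is measured relative to character width and defaults to 2.0, so widely spaced columns on the same visual row end up in separate text boxes; a larger value lets them be merged into one line.

```python
def extract_with_wide_char_margin(pdf_path, char_margin=20.0):
    """Extract text from a PDF, merging widely spaced fragments on a row.

    Hypothetical helper for illustration. pdfminer.six is imported inside
    the function so this module still loads when it is not installed.
    """
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # Characters closer together than char_margin (relative to character
    # width) are treated as part of the same line. The library default
    # is 2.0; 20 is the value that worked for the PDF in this question.
    laparams = LAParams(char_margin=char_margin)
    return extract_text(pdf_path, laparams=laparams)


# Usage (the file name is a placeholder):
# text = extract_with_wide_char_margin("questions.pdf")
# print(text)
```

The value 20 is what worked for this particular layout; other documents may need a different margin, and a very large value could merge columns that should stay separate.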
mik.ro
  • Note: Just because the text has a certain sequence on the page doesn't mean it has the same sequence in the file. (I'm not familiar with pdfminer - which sounds really useful - but it might not do a grand job of sorting the fragments.) – Martin Packer Apr 05 '23 at 09:35