pdfminer mixes order of lines

Question

I'm extracting text from a PDF using pdfminer.six. I have the following text:

After parsing it, my result is as below:

Nr 48. Promująco na rozwój chorób alergicznych i wystąpienie objawów alergii 
działa zwiększenie aktywności/ilości: 

1) limfocytów Th1; 
2) limfocytów Th2; 
3) limfocytów Th17; 
Prawidłowa odpowiedź to: 
B. tylko 2. 
A. 1,4. 

4) IL-5. 
5) IL-12. 

C. 1,3. 

D. 2,4. 

E. 3,5.

The order of the lines is mixed up. Is there a way to prevent this, for example by forcing pdfminer to read the file line by line? I have tried converting the PDF to HTML, but the result is a mess of separate span tags, one for each word.

score 0 · Answer 1 · answered Apr 05 '23 at 09:32

OK, I've found a solution: increasing char_margin to 20 in LAParams.

from pdfminer.layout import LAParams
laparams = LAParams(char_margin=20)
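
A minimal sketch of how that setting might be passed to an extraction call. pdfminer.six documents char_margin as the distance, relative to character width, below which two characters are treated as part of the same line, which is presumably why a larger value keeps widely spaced fragments on one row together. The helper name extract_text_wide_margin is hypothetical; extract_text and LAParams come from pdfminer.six.

```python
def extract_text_wide_margin(pdf_path: str, char_margin: float = 20.0) -> str:
    """Extract text with a larger char_margin (hypothetical helper)."""
    # Imported inside the function so this sketch can be loaded even where
    # pdfminer.six is not installed; calling it still requires the package.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # The library default for char_margin is 2.0; 20 is the value from the answer.
    laparams = LAParams(char_margin=char_margin)
    return extract_text(pdf_path, laparams=laparams)
```

Calling extract_text_wide_margin("questions.pdf") (file name assumed) would return the extracted text as a single string. The right margin likely depends on the document's layout, so 20 is a starting point rather than a universal value.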

mik.ro

Note: Just because the text has a certain sequence on the page doesn't mean it has the same sequence in the file. (I'm not familiar with pdfminer - which sounds really useful - but it might not do a grand job of sorting the fragments.) – Martin Packer Apr 05 '23 at 09:35
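
To illustrate the point about sorting fragments, here is a rough sketch in plain Python, not pdfminer's actual algorithm. It assumes each fragment is a hypothetical (x0, y0, text) tuple in PDF coordinates, where the origin is at the bottom-left of the page, and it groups fragments whose y values fall within an assumed tolerance into one visual line.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (x0, y0, text) in PDF page coordinates.
Fragment = Tuple[float, float, str]


def sort_reading_order(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments into visual lines and return them top to bottom."""
    # Larger y means higher on the page, so sort by descending y first.
    ordered = sorted(fragments, key=lambda frag: -frag[1])

    lines: List[List[Fragment]] = []
    for frag in ordered:
        # A fragment joins the current line if its y is within the tolerance
        # of that line's first fragment; otherwise it starts a new line.
        if lines and abs(lines[-1][0][1] - frag[1]) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within each line, read fragments left to right.
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda frag: frag[0]))
        for line in lines
    ]
```

A fixed tolerance is a simplification: pages with mixed font sizes or multi-column body text would likely need a more careful grouping rule.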
