
I tag PDFs using PDFBox. Most of the tagged PDFs are read correctly by screen readers.

Case 1. In some PDFs, the screen reader reads some letters separately.

Case 2. For some PDFs, after tagging, Adobe shows the content under each tag as individual TJ text pieces. (In this case screen readers still read the text correctly.)

I want to solve both of these problems. Please help me understand what is happening and how I can fix it.

Example explanation:

Case 1: An example PDF is here. In the last part of the content, for the words "signature:", "Name:" and "Email:", the first character is read separately, like 'E' space 'mail'. After tagging, my content stream looks like this:

q
0 0 612 792 re
W*
n
BT
11.04 0 0 11.04 18 56.184 Tm
/P << /MCID 16 >> BDC
  (E) Tj
ET
EMC
Q
BT
11.04 0 0 11.04 23.3765 56.191 Tm
/P << /MCID 17 >> BDC
[ (m) -5 (ail) 4 (:) ] TJ
EMC
11.04 0.0 0.0 11.04 45.53378 56.191 Tm
/P << /MCID 15 >> BDC
( ) Tj
11.04 0.0 0.0 11.04 47.98466 56.191 Tm
[ (__) 7 (__) -3 (_) 9 (__) -3 (_) 9 (__) 7 (__) -3 (_) 9 (__) -3 (_) 9 (__) -3 (__) 7 
(__) -3 (_) 9 (__) -3 (_) 9 (__) -3 (_) 9 (__) -3 (_) 9 (__) 7 (__) -3 (_) 9 (_) 9 (__) 
-3 (__) 7 (__) -3 (_) 9 (__) -3 (_) 9 (__) -3 (_) 9 (__) -3 (_) 9 (__) 7 (__) -3 (_) 9 
(_) 9 (__) -3 (__) 7 (__) -3 (_) 9 (__) -3 (_) 9 (__) -3 (_) 9 (__) -3 (_) 9 (__) 7 (__) 
-3 (_) 9.1 (_) 8.9 (__) -2.9 (__) 6.9 (__) -3 (_) 9.1 (__) -3.1 (_) 9.1 (__) -3.1 (_) 
9.1 (__ ) ] TJ
EMC
ET
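
To make the split visible, here is a minimal sketch that groups the shown text by MCID. It is plain Java with a regular expression, not PDFBox API; the class name `McidTextGrouper` and the method `textByMcid` are made up for illustration, and strings with escaped or nested parentheses are not handled. It only assumes that each `BDC ... EMC` pair with an MCID is treated as its own unit.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class McidTextGrouper {

    // Matches one of: a BDC operator whose property list carries an MCID,
    // an EMC operator, or a simple literal string such as (ail).
    private static final Pattern TOKEN = Pattern.compile(
            "/MCID\\s+(\\d+)\\s*>>\\s*BDC|\\bEMC\\b|\\(([^()]*)\\)");

    // Returns the concatenated literal strings found inside each
    // marked-content sequence, keyed by MCID, in order of appearance.
    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Integer current = null;
        Matcher m = TOKEN.matcher(contentStream);
        while (m.find()) {
            if (m.group(1) != null) {
                // Start of a marked-content sequence with an MCID.
                current = Integer.valueOf(m.group(1));
                result.putIfAbsent(current, "");
            } else if (m.group(2) != null) {
                // Literal string; text outside any sequence is ignored.
                if (current != null) {
                    result.merge(current, m.group(2), String::concat);
                }
            } else {
                // EMC closes the current sequence.
                current = null;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Excerpt of the content stream shown above.
        String stream = String.join("\n",
                "BT",
                "11.04 0 0 11.04 18 56.184 Tm",
                "/P << /MCID 16 >> BDC",
                "  (E) Tj",
                "ET",
                "EMC",
                "Q",
                "BT",
                "11.04 0 0 11.04 23.3765 56.191 Tm",
                "/P << /MCID 17 >> BDC",
                "[ (m) -5 (ail) 4 (:) ] TJ",
                "EMC");
        System.out.println(textByMcid(stream)); // prints {16=E, 17=mail:}
    }
}
```

In this excerpt the 'E' ends up under MCID 16 and "mail:" under MCID 17, so the word is split across two marked-content sequences in addition to the q ... Q difference.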

What I know is that the graphics state for "E" is different from that of the "mail" text. Is that the reason why it is read separately? If so, how can I remove the graphics state for 'E'?

FYI, after tagging in Adobe, the content stream changes as shown below (and it is read perfectly).

[Screenshot: content stream after tagging in Adobe]

Here Adobe has removed the q, re, W* and Q operators, so the graphics state is gone. There are some use cases where the graphics state should not be removed. How do I know when to delete it and when not to?

Case 2: In this case, when I tag the PDF, the Adobe tag tree looks like this:

[Two screenshots: Adobe tag tree after tagging with my code]

If I tag it with Adobe instead, it looks like this:

[Two screenshots: Adobe tag tree after tagging with Adobe]

How do I prepare my content stream before tagging to get the same result as Adobe?

Can I alter the content stream in the same way using PDFBox?

The code I am using to tag the PDFs can be found here.
