
I want to extract the PDF pages that contain more than 2000 characters per page, using the Tika parser in Python. In the code below I have read the [metadata] and used pdf:charsPerPage to apply a minimum character limit per page (2000). However, I have failed to connect the pdf:charsPerPage filtering to the [content] data returned by the parser. Here is the code:

import tika
from tika import parser

parsed = parser.from_file('C:/User/xyz/file.pdf')

# per-page character counts reported by the PDF parser (list of strings)
parsed["metadata"]['pdf:charsPerPage']

# convert the strings to ints so we can compare against the 2000-char threshold
test_list = [int(i) for i in parsed["metadata"]['pdf:charsPerPage']]
[i for i in test_list if i >= 2000]

# Sample ['pdf:charsPerPage'] data: ['1319','4930','6971','5548','5646','5974','5352','6096','6054']

Actual output from the above data: [4930, 6971, 5548, 5646, 5974, 5352, 6096, 6054]

In the sample ['pdf:charsPerPage'] data, only the first element has fewer than 2000 chars, and the operation above filters it out, leaving the pages that meet the 2000-char limit. Now I want to extract/parse the [content] of only those pages that have more than 2000 chars per page.
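
A minimal sketch of the next step, assuming the pdf:charsPerPage list is in page order: pair each count with its 1-based page number so you know which pages to pull content from, rather than keeping only the counts.

chars_per_page = [int(c) for c in parsed["metadata"]['pdf:charsPerPage']]
# keep the 1-based page numbers whose character count meets the 2000-char threshold
wanted_pages = [page_no for page_no, chars in enumerate(chars_per_page, start=1)
                if chars >= 2000]
# for the sample data above this gives [2, 3, 4, 5, 6, 7, 8, 9]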

  • Grab the XHTML version of the text (rather than the plain text version as you are now), then split on the page divs to get the page text, then grab the pages you want, then down-sample back to plain text? – Gagravarr Jun 22 '20 at 05:53
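
A rough sketch of that suggestion (my wording, not the commenter's), assuming your tika-python version accepts the xmlContent=True keyword and that Tika wraps each PDF page in a <div class="page"> element; the tag stripping below uses a simple regex, so a proper HTML parser would be more robust:

import re
from tika import parser

# request the XHTML version of the content instead of plain text
parsed = parser.from_file('C:/User/xyz/file.pdf', xmlContent=True)
xhtml = parsed["content"]

# split on the per-page divs; the first chunk precedes page 1, so skip it
raw_pages = xhtml.split('<div class="page">')[1:]

# down-sample each page back to plain text by stripping the remaining tags
pages = [re.sub(r'<[^>]+>', '', p) for p in raw_pages]

# keep only the pages whose plain text has at least 2000 characters
long_pages = [p for p in pages if len(p.strip()) >= 2000]

The character counts of the stripped text may differ slightly from pdf:charsPerPage, so you could also select pages by the wanted_pages indices computed above instead of re-counting.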
