I want to extract only the PDF pages that have more than 2000 chars per page, using the tika parser in Python. With the code below I read the [metadata] and used pdf:charsPerPage
to apply a minimum chars limit per page (2000). What I could not work out is how to use the pdf:charsPerPage
values to fetch the matching pages from the parser's [content] data. Here is the code:
import tika
from tika import parser
parsed = parser.from_file('C:/User/xyz/file.pdf')
parsed["metadata"]['pdf:charsPerPage']
# converting string to int to perform greater than operation
test_list = [int(i) for i in parsed["metadata"]['pdf:charsPerPage']]
[i for i in test_list if i >= 2000]
# Sample ['pdf:charsPerPage'] data: ['1319','4930','6971','5548','5646','5974','5352','6096','6054']
Actual output from the above data (integers, because of the int conversion): [4930, 6971, 5548, 5646, 5974, 5352, 6096, 6054]
In the ['pdf:charsPerPage'] list above,
the first element has fewer than 2000 chars, and the filter above drops it. Now I want to extract/parse from [content] only the pages that have more than 2000 chars per page.
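
For reference, here is a minimal sketch of the result I am after, not a confirmed solution. It assumes that parser.from_file(..., xmlContent=True) returns the [content] as XHTML in which Tika wraps each PDF page in a <div class="page"> element, and that those divs appear in the same order as the pdf:charsPerPage values. The names PageTextCollector, pages_over_limit and pages_over_limit_from_pdf are hypothetical helpers made up for this illustration. The >= comparison matches the filter in the code above.

```python
from html.parser import HTMLParser


class PageTextCollector(HTMLParser):
    """Collect the text of every <div class="page"> element, in document order."""

    def __init__(self):
        super().__init__()
        self.pages = []      # finished page texts
        self._chunks = None  # text pieces of the page being read; None outside a page
        self._depth = 0      # nesting depth of <div> tags inside the current page

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._chunks is None:
            # Start collecting only when a page div opens (assumed Tika layout).
            if "page" in (dict(attrs).get("class") or "").split():
                self._chunks = []
                self._depth = 1
        else:
            self._depth += 1  # a nested div inside the current page

    def handle_endtag(self, tag):
        if tag != "div" or self._chunks is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the page div itself has closed
            self.pages.append("".join(self._chunks).strip())
            self._chunks = None

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)


def pages_over_limit(xhtml, chars_per_page, limit=2000):
    """Return (page_number, text) for each page whose char count is >= limit."""
    collector = PageTextCollector()
    collector.feed(xhtml)
    collector.close()
    counts = [int(c) for c in chars_per_page]
    # Page numbers are 1-based; pairing by position assumes both lists are in page order.
    return [
        (number, text)
        for number, (count, text) in enumerate(zip(counts, collector.pages), start=1)
        if count >= limit
    ]


def pages_over_limit_from_pdf(path, limit=2000):
    """Run tika on a PDF file and keep only the pages at or above the limit."""
    from tika import parser  # imported here so the helpers above work without tika

    parsed = parser.from_file(path, xmlContent=True)
    counts = parsed["metadata"]["pdf:charsPerPage"]
    if isinstance(counts, str):  # assumption: a one-page PDF may yield a single string
        counts = [counts]
    return pages_over_limit(parsed["content"], counts, limit)
```

With the file from the question, the call would be pages_over_limit_from_pdf('C:/User/xyz/file.pdf'), and each returned tuple would hold the page number and that page's text. If the number of page divs does not match the number of pdf:charsPerPage entries, zip silently stops at the shorter list, so that is worth checking on a real file.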