3

I am using a function to count occurrences of a given word in PDF files with PyPDF2. While the function is running, I get this message in the terminal:

FloatObject (b'0.000000000000-14210855') invalid; use 0.0 instead

My code:

import os
import re

import PyPDF2


def count_words(word):
    print()
    print('Counting words..')

    files = os.listdir('./pdfs')
    counted_words = []

    for idx, file in enumerate(files, 1):
        with open(f'./pdfs/{file}', 'rb') as pdf_file:
            ReadPDF = PyPDF2.PdfFileReader(pdf_file, strict=False)
            pages = ReadPDF.numPages

            words_count = 0

            for page in range(pages):
                pageObj = ReadPDF.getPage(page)
                data = pageObj.extract_text()
                # re.escape keeps regex metacharacters in the search word literal
                words_count += len(re.findall(rf'\b{re.escape(word)}\b', data, flags=re.I))

            counted_words.append(words_count)
        
        print(f'File: {idx}')
    
    return counted_words

How can I get rid of this message?

Martin Thoma

2 Answers

0

See https://pypdf2.readthedocs.io/en/latest/user/suppress-warnings.html

import logging

logger = logging.getLogger("PyPDF2")
logger.setLevel(logging.ERROR)
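
If raising the level for the whole library is too broad, a logging filter can drop only this one message. The following is a minimal sketch, assuming the message reaches the terminal through the "PyPDF2" logger hierarchy as the linked page implies; the names DropFloatObjectWarnings and install_filter are hypothetical helpers, not part of PyPDF2.

```python
import logging
import sys


class DropFloatObjectWarnings(logging.Filter):
    """Reject log records whose message mentions FloatObject."""

    def filter(self, record):
        # Returning False drops the record; all other records pass through.
        return "FloatObject" not in record.getMessage()


def install_filter(stream=sys.stderr):
    """Attach a filtered handler to the "PyPDF2" logger (hypothetical helper)."""
    logger = logging.getLogger("PyPDF2")
    handler = logging.StreamHandler(stream)
    # The filter goes on the handler, not the logger: logger-level filters
    # are not applied to records that propagate up from child loggers.
    handler.addFilter(DropFloatObjectWarnings())
    logger.addHandler(handler)
    # Stop propagation so the root logger does not print an unfiltered copy.
    logger.propagate = False
    return handler
```

Call install_filter() once before opening any PDF. Other warnings from the library are still printed, so real problems with a file remain visible.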
Martin Thoma
0

The PDF specification has never allowed floats in scientific (mantissa/exponent) notation, which is roughly what 0.000000000000-14210855 looks like. A careless PDF producer has therefore written a malformed PDF file. PyPDF2's choice to read the value as 0.0 seems a sound response.
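
To make the point concrete, here is a rough sketch of a strict parser for PDF real numbers (optional sign, digits, optional decimal point, no exponent) that falls back to 0.0 on anything else. This is not PyPDF2's actual implementation; parse_pdf_real and the regular expression are illustrative assumptions.

```python
import re

# Optional sign, then digits with an optional decimal point.
# Exponents such as "1e-5" are deliberately not accepted.
PDF_REAL = re.compile(rb"[+-]?(\d+\.?\d*|\.\d+)")


def parse_pdf_real(token: bytes) -> float:
    """Parse a PDF real-number token, returning 0.0 if it is malformed."""
    if PDF_REAL.fullmatch(token):
        return float(token.decode("ascii"))
    # Malformed token: mirror the fallback named in the warning message.
    return 0.0


print(parse_pdf_real(b"34.5"))
print(parse_pdf_real(b"-.002"))
print(parse_pdf_real(b"0.000000000000-14210855"))
```

The token from the question fails the pattern because of the minus sign in the middle of the digits, so the sketch returns 0.0 for it, just as the warning says PyPDF2 does.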

johnwhitington