-1

I recently became the maintainer of PyPDF2, a library for reading / writing PDF files. In order to get more confident/quick with merging PRs, I introduced quite a lot of tests.

I use pytest and coverage to detect if I lack unit tests for some areas.

The lines which are not covered by now might even be unreachable.

I have several thousand PDFs and sample code to execute with those PDFs. For example:

from PyPDF2 import PdfReader


def get_text(path):
    reader = PdfReader(path)
    for page in reader.pages:
        text = page.extract_text()

Is there a way to iterate over those thousands of PDFs and store the ones in a list that would increase the coverage?

Pseudo-code

I'm uncertain if I explained well what I'm looking for, so here is some pseudo-code.

What I imagine is something like this to generate the .coverage file

$ python -m coverage run -m pytest tests -vv

And then:

test_cov = load_cov_file(".coverage")

# test_cov.missing / test_cov.partial / test_cov.covered could be
# lists of (path, line) tuples:
test_cov_missing = set(test_cov.missing)
test_cov_partial = set(test_cov.partial)

detected_new = []

for path in pdf_files:
    with get_coverage() as cov:
        get_text(path)
    covered_lines = set(cov.covered)
    cov_partial = set(cov.partial)
    new_lines = (
        test_cov_missing.intersection(covered_lines)
        + test_cov_missing.intersection(cov_partial)
        + test_cov_partial.intersection(covered_lines)
    )
    detected_new.append((path, new_lines))
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958

1 Answers1

0

I managed to get it myself :-)

from rich.progress import track
import coverage
from pathlib import Path
from PyPDF2 import PdfReader
import json
import pytest

def get_text(path):
    reader = PdfReader(path)
    for page in reader.pages:
        text = page.extract_text()

def store(file2cov):
    with open("pdf-coverage.json" , "w") as fp:
        fp.write(json.dumps(file2cov, indent=4))

root = Path("pdf")
paths = root.glob("*.pdf")

file2cov = {}

source_files = [
    "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py",
    "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_cmap.py",
    "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_page.py",
    "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/generic.py",
    "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_utils.py",
]


tst_cov = coverage.Coverage()
tst_cov.start()
pytest.main(["/home/moose/Github/py-pdf/PyPDF2"])
tst_cov.stop()
tst_cov.save()
data = tst_cov.get_data()
file2cov_base = {}
for src_file in source_files:
    file2cov_base[src_file] = data.lines(src_file)
    assert file2cov_base[src_file] is not None


for path in track(paths):
    cov = coverage.Coverage()
    cov.start()
    get_text(path)
    cov.stop()
    cov.save()
    data = cov.get_data()
    
    str_path = str(path)
    file2cov[str_path] = {}
    for src_file in source_files:
        added = data.lines(src_file)
        if added is None:
            continue
        new_lines = sorted(list(set(added) - set(file2cov_base[src_file])))
        if new_lines:
            file2cov[str_path][src_file] = new_lines
            store(file2cov)

store(file2cov)
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958