I recently became the maintainer of PyPDF2, a library for reading and writing PDF files. To get more confident and quicker at merging PRs, I introduced quite a lot of tests. I use pytest and coverage to detect areas that lack unit tests. Some of the lines that are still uncovered might even be unreachable.
I have several thousand PDFs and sample code to execute with those PDFs. For example:
```python
from PyPDF2 import PdfReader

def get_text(path):
    reader = PdfReader(path)
    for page in reader.pages:
        text = page.extract_text()
```
Is there a way to iterate over those thousands of PDFs and collect, in a list, the ones that would increase the coverage?
Pseudo-code

I'm not sure whether I explained clearly what I'm looking for, so here is some pseudo-code. What I imagine is something like this to generate the `.coverage` file:
```
$ python -m coverage run -m pytest tests -vv
```
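For this first half, coverage.py already ships a programmatic API for reading the `.coverage` data file back in. A minimal sketch, assuming coverage.py 5+ (`CoverageData` is its public data-file class) — note that the file stores *executed* lines, so "missing" lines would have to be derived separately, e.g. by comparing against each file's executable statements:

```python
import os
from coverage import CoverageData

# Hedged sketch: read the executed lines out of an existing ".coverage"
# file. CoverageData defaults to the ".coverage" file in the cwd.
covered = set()
if os.path.exists(".coverage"):
    data = CoverageData()
    data.read()
    covered = {
        (filename, lineno)
        for filename in data.measured_files()
        for lineno in (data.lines(filename) or [])
    }
```

The resulting `covered` set uses the same `(path, line)` shape as the pseudo-code below, so it can serve directly as the baseline to compare new runs against.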
And then:
```python
test_cov = load_cov_file(".coverage")
# test_cov.missing / test_cov.partial / test_cov.covered could be
# lists of (path, line) tuples:
test_cov_missing = set(test_cov.missing)
test_cov_partial = set(test_cov.partial)

detected_new = []
for path in pdf_files:
    with get_coverage() as cov:
        get_text(path)
    covered_lines = set(cov.covered)
    cov_partial = set(cov.partial)
    # union of the pairwise intersections (sets combine with "|", not "+")
    new_lines = (
        test_cov_missing.intersection(covered_lines)
        | test_cov_missing.intersection(cov_partial)
        | test_cov_partial.intersection(covered_lines)
    )
    if new_lines:
        detected_new.append((path, new_lines))
```