I read two questions here:
- How do I handle multiple students and thresholds?
- How do I make this fast?
I'll focus on an answer to "How do I make this fast" since three answers have already mentioned how to handle multiple students.
Here are two steps:
- Profile your code to find where most time is spent. Anecdotally, programs usually spend the majority of their runtime in a small handful of tight loops. I've skipped this step here, but it is needed for more complicated examples.
- Benchmark snippets and optimize, even rewriting in a lower-level language as needed.
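As a rough illustration of the profiling step, here is a minimal sketch using the standard library's `cProfile` and `pstats`. The helper names (`average`, `check_all`, `profile_report`) and the scores are made up for the example and are not taken from the question.

```python
import cProfile
import io
import pstats


def average(scores):
    return sum(scores) / len(scores)


def check_all(data, threshold=0.7):
    # Hypothetical stand-in for the code you want to speed up.
    return [average(scores) >= threshold for scores in data]


def profile_report(func, *args, top=5):
    """Run func(*args) under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


data = [[0.8, 0.9, 0.75]] * 10_000  # made-up scores
report = profile_report(check_all, data)
print(report)
```

The functions at the top of the report are the ones worth benchmarking and optimizing first.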
Here's a version of your code combined with @nbrix's answer. I've modified the `check_grade` method to return `True` or `False` to be consistent with the `numpy` version shown next.
class StudentGrades:
    def __init__(self, scores):
        self.scores = scores

    def average(self):
        return sum(self.scores) / len(self.scores)

    def check_grade(self, threshold=0.7):
        avg = self.average()
        if avg >= threshold:
            return True
        return False


def run_student_grades(data):
    return [StudentGrades(scores).check_grade() for scores in data]
And here is a function I've written using `numpy`, which calculates each student's mean and checks whether it meets the 0.7 threshold (using `>=`, to match `check_grade`):

import numpy as np


def run_numpy_student_grades(data):
    # One mean per row (student), compared against the threshold
    return np.mean(data, axis=1) >= 0.7
For small inputs (two students, two assignments) the difference is negligible in absolute terms: a few microseconds. In fact, the `numpy` version is about 1.7x slower:
--- benchmark 'Small Input: Two Students, Two Assignments': 2 tests ---
Name (time in us)                  Mean              Median
-------------------------------------------------------------------
test_pure_python_small_input     4.9058 (1.0)      4.9330 (1.0)
test_numpy_small_input           8.5494 (1.74)     8.5580 (1.73)
-------------------------------------------------------------------
For big inputs (here: 1000 students, each with 100 assignments) the difference is substantial: the `numpy` version is ~250x faster than the Python version, which initializes an object per student and runs a list comprehension over them.
------ benchmark 'Big Input: 1000 Students, 100 Assignments': 2 tests -----
Name (time in us)                      Mean                  Median
---------------------------------------------------------------------------
test_numpy_big_input                55.5528 (1.0)          56.1480 (1.0)
test_pure_python_big_input      13,675.3789 (246.17)   13,865.2100 (246.94)
---------------------------------------------------------------------------
Which version is the better choice in practice will depend on your data and other outside factors, e.g. how many students and assignments you will realistically be working with.
Here is the benchmark code, assuming the `run_*` functions are implemented in `demo_plain.py` and `demo_numpy.py`:
# File: `benchmark.py`
# Install: `pip install pytest pytest-benchmark numpy`
# Run with: `pytest benchmark.py`
import numpy as np
import pytest
from numpy.random import default_rng

from demo_numpy import run_numpy_student_grades
from demo_plain import run_student_grades

rng = default_rng(42)
two_students_two_assignments = np.array([[0.8, 0.9], [0.6, 0.2]])
thousand_students_hundred_assignments = rng.standard_normal(size=(1000, 100))


@pytest.mark.benchmark(group="Small Input: Two Students, Two Assignments")
def test_pure_python_small_input(benchmark):
    result = benchmark(run_student_grades, two_students_two_assignments)


@pytest.mark.benchmark(group="Small Input: Two Students, Two Assignments")
def test_numpy_small_input(benchmark):
    result = benchmark(run_numpy_student_grades, two_students_two_assignments)


@pytest.mark.benchmark(group="Big Input: 1000 Students, 100 Assignments")
def test_pure_python_big_input(benchmark):
    result = benchmark(run_student_grades, thousand_students_hundred_assignments)


@pytest.mark.benchmark(group="Big Input: 1000 Students, 100 Assignments")
def test_numpy_big_input(benchmark):
    result = benchmark(run_numpy_student_grades, thousand_students_hundred_assignments)
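If you only want a quick feel for where the crossover between the two approaches lies, a lighter-weight check with the standard library's `timeit` may be enough. This is a minimal sketch, not the benchmark that produced the numbers above: the function names (`pure_python`, `with_numpy`, `time_both`), the input sizes, and the use of uniform random scores in [0, 1) are all assumptions made for the example.

```python
import timeit

import numpy as np


def pure_python(data, threshold=0.7):
    # Plain-Python mean per student, without the class wrapper
    return [sum(row) / len(row) >= threshold for row in data]


def with_numpy(data, threshold=0.7):
    return np.mean(data, axis=1) >= threshold


def time_both(n_students, n_assignments, repeats=20):
    """Return the average seconds per call for (pure_python, with_numpy)."""
    rng = np.random.default_rng(0)
    data = rng.random((n_students, n_assignments))  # made-up scores in [0, 1)
    t_py = timeit.timeit(lambda: pure_python(data), number=repeats) / repeats
    t_np = timeit.timeit(lambda: with_numpy(data), number=repeats) / repeats
    return t_py, t_np


for shape in [(2, 2), (10, 10), (100, 100), (1000, 100)]:
    t_py, t_np = time_both(*shape)
    print(f"{shape}: python {t_py * 1e6:.1f} us, numpy {t_np * 1e6:.1f} us")
```

The exact timings will vary by machine, so treat the output as a guide for your own data sizes rather than a fixed result.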
As a separate point, here's a version that handles multiple thresholds:
import numpy as np


def run_numpy_student_grades_thresholds(data, thresholds):
    _avg = np.mean(data, axis=1)
    # Each row of the result corresponds to one threshold,
    # each column to one student.
    return np.c_[
        [_avg >= threshold for threshold in thresholds]
    ]


print(run_numpy_student_grades_thresholds([[0.8, 0.9], [0.6, 0.2]], [0.7, 0.9]))
# [[ True False]
#  [False False]]
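The list comprehension over thresholds can also be replaced by numpy broadcasting. The following is a minimal sketch of that alternative; the function name `grades_vs_thresholds` is made up for the example, it uses `>=` to match `check_grade`, and note that it returns one row per student (the transpose of the layout above).

```python
import numpy as np


def grades_vs_thresholds(data, thresholds):
    avg = np.mean(data, axis=1)          # shape: (n_students,)
    thresholds = np.asarray(thresholds)  # shape: (n_thresholds,)
    # avg[:, None] has shape (n_students, 1); broadcasting it against
    # thresholds yields a boolean array of shape (n_students, n_thresholds).
    return avg[:, None] >= thresholds


result = grades_vs_thresholds([[0.8, 0.9], [0.6, 0.2]], [0.3, 0.7, 0.9])
print(result)
# [[ True  True False]
#  [ True False False]]
```

Whether rows should be students or thresholds is a matter of taste; transpose with `.T` if you prefer the other orientation.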