k4bench.results.reliability
Conservative pass/fail reliability check for a benchmark run.
The goal here is not to score performance quality on a continuous scale, but to detect runs that were likely affected by environmental interference and should be excluded from comparisons.
A run is judged RELIABLE only when every hard criterion either passes or
has no data to judge it. Any single hard failure makes the run UNRELIABLE.
The guiding principle is conservative and asymmetric:
- only positive evidence of interference can reject a run: a criterion with no data (UNKNOWN) never fails it;
- some signals are advisory (WARN) and are reported but never reject a run.
Hard criteria (a single FAIL rejects the run)
* CPU efficiency ≥ 95% of the ideal for the thread count. For a
single-threaded run that is total_cpu / wall ≥ 0.95; for an
n-threaded run the floor scales to 0.95 * n. Low efficiency means
the thread spent time off-CPU — waiting, contended, preempted, or doing
I/O. This is the primary contention detector for a single-core workload.
* Load average ≤ physical core count. Load above the physical core count
means the host was oversubscribed.
* No swap activity — any swap-in/swap-out during the run rejects it.
* No thermal throttling — any throttle event during the run rejects it.
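The hard criteria above reduce to a handful of comparisons. The following is a minimal sketch of those comparisons, not the module's actual implementation: the helper names and the plain-string statuses are hypothetical, and treating a partially known pair of counters as judgeable is an assumption. Only the thresholds (0.95 per thread, load at most the physical core count, zero swap or throttle events) come from the description above.

```python
from __future__ import annotations

EFFICIENCY_FLOOR_PER_THREAD = 0.95  # from the text: at least 95% of the ideal


def check_cpu_efficiency(cpu_efficiency: float | None, n_threads: int = 1) -> str:
    """Hypothetical helper: total_cpu / wall must reach 0.95 * n_threads."""
    if cpu_efficiency is None:
        return "UNKNOWN"  # no data never rejects a run
    floor = EFFICIENCY_FLOOR_PER_THREAD * n_threads
    return "PASS" if cpu_efficiency >= floor else "FAIL"


def check_load(pre: float | None, post: float | None, physical_cores: int | None) -> str:
    """Hypothetical helper: the highest available load sample must not exceed the core count."""
    samples = [s for s in (pre, post) if s is not None]
    if not samples or physical_cores is None:
        return "UNKNOWN"
    return "PASS" if max(samples) <= physical_cores else "FAIL"


def check_zero_events(*counts: int | None) -> str:
    """Hypothetical helper for swap and thermal counters: any positive count fails."""
    known = [c for c in counts if c is not None]
    if not known:
        return "UNKNOWN"
    # Assumption: one known counter is enough to judge the criterion.
    return "FAIL" if any(c > 0 for c in known) else "PASS"
```

For example, a 4-thread run with an efficiency of 3.6 falls below the scaled floor of 3.8 and would fail, while the same value for a 3-thread run (floor 2.85) would pass.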
Advisory criteria (reported, never reject)
* Involuntary context switches ≤ 10× a baseline established from prior
clean runs, normalised per CPU-second so it is robust to run length and
core count. With no baseline the value is only reported (WARN). Either
way, this signal never rejects a run.
* RAM utilisation > 90% — a warning only. High RAM use alone is not
harmful as long as no swapping occurs (which the swap criterion covers).
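The advisory criteria can be sketched in the same way. This is an illustration under stated assumptions rather than the module's code: the helper names are hypothetical, and returning WARN when the rate exceeds ten times the baseline is an assumption that follows from the criterion being advisory. The factor of 10, the per-CPU-second normalisation, and the 90% RAM threshold are taken from the description above.

```python
from __future__ import annotations

CTX_SWITCH_FACTOR = 10.0  # from the text: at most 10x the baseline
RAM_WARN_FRACTION = 0.90  # from the text: warn above 90%


def ctx_switch_rate(involuntary_ctx_switches: int, total_cpu_s: float) -> float:
    """Involuntary context switches per CPU-second."""
    return involuntary_ctx_switches / total_cpu_s


def check_ctx_switches(
    involuntary_ctx_switches: int | None,
    total_cpu_s: float | None,
    baseline_per_cpu_s: float | None,
) -> str:
    """Hypothetical advisory check: never returns FAIL."""
    # A missing count, or a missing or zero CPU time, leaves nothing to normalise.
    if involuntary_ctx_switches is None or not total_cpu_s:
        return "UNKNOWN"
    if baseline_per_cpu_s is None:
        return "WARN"  # no baseline: the value is reported only
    rate = ctx_switch_rate(involuntary_ctx_switches, total_cpu_s)
    return "PASS" if rate <= CTX_SWITCH_FACTOR * baseline_per_cpu_s else "WARN"


def check_ram(ram_used_fraction: float | None) -> str:
    """Hypothetical advisory check on RAM utilisation (fraction in 0..1)."""
    if ram_used_fraction is None:
        return "UNKNOWN"
    return "WARN" if ram_used_fraction > RAM_WARN_FRACTION else "PASS"
```

With a baseline of 5 switches per CPU-second, 500 switches over 20 CPU-seconds (25 per second) stays within the 50 per second limit, whereas the same 500 switches over 2 CPU-seconds (250 per second) would be flagged.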
This module is intentionally dependency-free (no pandas/streamlit) so it can be unit-tested in isolation; the dashboard consumes it today, and the runner/reporter can reuse it without pulling in UI dependencies.
Status
Bases: str, Enum
Outcome of a single reliability criterion.
Criterion
dataclass
Criterion(name: str, status: Status, detail: str, measured: str = '—', limit: str = '—', hard: bool = True)
The evaluated result of one reliability criterion.
ReliabilityVerdict
dataclass
ReliabilityVerdict(criteria: tuple[Criterion, ...])
Aggregate verdict over all criteria for a single run.
failures
property
failures: list[Criterion]
Hard criteria that failed (the reasons a run is rejected).
warnings
property
warnings: list[Criterion]
Advisory criteria that flagged a concern without rejecting the run.
reliable
property
Overall verdict.
False if any hard criterion failed. None (unknown) if no hard
criterion could be evaluated at all — there is no evidence either way.
True otherwise (at least one hard criterion passed and none failed).
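The three-valued verdict described above can be sketched as follows. This is a simplified, self-contained mirror of the documented behaviour, not the module's source: the `Status` member names and values are assumptions (the page names only UNKNOWN, WARN, and FAIL explicitly), the `Criterion` class here carries only the fields the logic needs, and `overall` is a hypothetical stand-in for the `reliable` property.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    # Member names and values are assumed for illustration.
    PASS = "pass"
    FAIL = "fail"
    WARN = "warn"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class Criterion:
    # Reduced version of the documented dataclass.
    name: str
    status: Status
    hard: bool = True


def overall(criteria: tuple[Criterion, ...]) -> bool | None:
    """Hypothetical stand-in for ReliabilityVerdict.reliable."""
    hard_statuses = [c.status for c in criteria if c.hard]
    if Status.FAIL in hard_statuses:
        return False  # any hard failure rejects the run
    if Status.PASS not in hard_statuses:
        return None  # nothing could be evaluated: no evidence either way
    return True  # at least one hard pass and no hard failure
```

Advisory criteria are filtered out before the decision, so a WARN on its own leaves the verdict at `None` rather than `True` or `False`.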
evaluate_reliability
evaluate_reliability(*, cpu_efficiency: float | None = None, n_threads: int = 1, involuntary_ctx_switches: int | None = None, total_cpu_s: float | None = None, ctx_switch_baseline_per_cpu_s: float | None = None, load_avg_1m_pre: float | None = None, load_avg_1m_post: float | None = None, physical_cores: int | None = None, swap_in_pages: int | None = None, swap_out_pages: int | None = None, thermal_throttle_events: int | None = None, ram_used_fraction: float | None = None) -> ReliabilityVerdict
Evaluate all reliability criteria and return an aggregate verdict.
All inputs are optional; any that are None yield an UNKNOWN criterion
that cannot reject the run. See the module docstring for the full model.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cpu_efficiency` | `float \| None` | Total CPU time divided by wall time; tested against a floor of `0.95 * n_threads`. | `None` |
| `n_threads` | `int` | Number of benchmark threads; scales the efficiency floor (default 1). | `1` |
| `involuntary_ctx_switches` | `int \| None` | The switch count is normalised per CPU-second (using `total_cpu_s`) and compared against `ctx_switch_baseline_per_cpu_s`. | `None` |
| `total_cpu_s` | `float \| None` | The switch count is normalised per CPU-second (using `total_cpu_s`) and compared against `ctx_switch_baseline_per_cpu_s`. | `None` |
| `ctx_switch_baseline_per_cpu_s` | `float \| None` | The switch count is normalised per CPU-second (using `total_cpu_s`) and compared against `ctx_switch_baseline_per_cpu_s`. | `None` |
| `load_avg_1m_pre` | `float \| None` | 1-minute load average sampled before / after the run; the peak is tested. | `None` |
| `load_avg_1m_post` | `float \| None` | 1-minute load average sampled before / after the run; the peak is tested. | `None` |
| `physical_cores` | `int \| None` | Physical core count the load is compared against. | `None` |
| `swap_in_pages` | `int \| None` | Pages swapped in/out during the run (deltas, not absolute levels). | `None` |
| `swap_out_pages` | `int \| None` | Pages swapped in/out during the run (deltas, not absolute levels). | `None` |
| `thermal_throttle_events` | `int \| None` | Throttle events accumulated during the run. | `None` |
| `ram_used_fraction` | `float \| None` | Fraction of RAM in use (0–1); advisory warning above 90%. | `None` |
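As an illustration of where some of these inputs might come from, the sketch below gathers a subset of them with the Python standard library on a Unix host. It is not part of k4bench: the `measure` helper is hypothetical, it assumes the workload runs in the current process, and it leaves `physical_cores` unset because `os.cpu_count()` reports logical rather than physical CPUs. Only the dictionary keys are taken from the signature above.

```python
from __future__ import annotations

import os
import resource
import time
from typing import Any, Callable


def measure(workload: Callable[[], Any], n_threads: int = 1) -> dict[str, Any]:
    """Hypothetical helper: run `workload` in-process and collect some inputs (Unix only)."""
    load_pre = os.getloadavg()[0]  # 1-minute load average before the run
    before = resource.getrusage(resource.RUSAGE_SELF)
    wall_start = time.perf_counter()

    workload()

    wall_s = time.perf_counter() - wall_start
    after = resource.getrusage(resource.RUSAGE_SELF)
    load_post = os.getloadavg()[0]  # 1-minute load average after the run

    # User plus system CPU time consumed by this process during the run.
    total_cpu_s = (after.ru_utime + after.ru_stime) - (before.ru_utime + before.ru_stime)
    return {
        "cpu_efficiency": total_cpu_s / wall_s if wall_s > 0 else None,
        "n_threads": n_threads,
        "involuntary_ctx_switches": after.ru_nivcsw - before.ru_nivcsw,
        "total_cpu_s": total_cpu_s,
        "load_avg_1m_pre": load_pre,
        "load_avg_1m_post": load_post,
        "physical_cores": None,  # left unknown: the stdlib reports logical CPUs only
    }
```

A dictionary built this way could be passed on as `evaluate_reliability(**inputs)`. The inputs it omits (swap, thermal, RAM, and the context-switch baseline) default to `None`, as does the unset `physical_cores`.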