k4bench.results.reliability

Conservative pass/fail reliability check for a benchmark run.

The goal here is not to score performance quality on a continuous scale, but to detect runs that were likely affected by environmental interference and should be excluded from comparisons.

A run is judged RELIABLE only when every hard criterion either passes or has no data to judge it. Any single hard failure makes the run UNRELIABLE. The guiding principle is conservative and asymmetric:

  • only positive evidence of interference can reject a run — a criterion with no data (UNKNOWN) never fails it;
  • some signals are advisory (WARN) and are reported but never reject a run.

Hard criteria (a single FAIL rejects the run):

  • CPU efficiency ≥ 95% of the ideal for the thread count. For a single-threaded run that is total_cpu / wall ≥ 0.95; for an n-threaded run the floor scales to 0.95 * n. Low efficiency means the thread spent time off-CPU: waiting, contended, preempted, or doing I/O. This is the primary contention detector for a single-core workload.
  • Load average ≤ physical core count. Load above the physical core count means the host was oversubscribed.
  • No swap activity: any swap-in or swap-out during the run rejects it.
  • No thermal throttling: any throttle event during the run rejects it.
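The four hard checks above reduce to simple threshold comparisons. The following is a minimal sketch of that arithmetic only, not the module's implementation: the function names and the `EFFICIENCY_FLOOR` constant are hypothetical, and taking the larger of the two load samples follows the "peak is tested" wording in the parameter description of `evaluate_reliability`.

```python
from __future__ import annotations

# Fraction of the ideal CPU time a clean run is expected to reach (from the text above).
EFFICIENCY_FLOOR = 0.95


def cpu_efficiency_ok(total_cpu_s: float, wall_s: float, n_threads: int = 1) -> bool:
    """True when total_cpu / wall meets the 0.95 * n_threads floor."""
    return total_cpu_s / wall_s >= EFFICIENCY_FLOOR * n_threads


def load_ok(load_pre: float, load_post: float, physical_cores: int) -> bool:
    """True when the peak of the two load samples does not exceed the core count."""
    return max(load_pre, load_post) <= physical_cores


def swap_ok(swap_in_pages: int, swap_out_pages: int) -> bool:
    """True when no pages were swapped in or out during the run."""
    return swap_in_pages == 0 and swap_out_pages == 0


def thermal_ok(throttle_events: int) -> bool:
    """True when no throttle event was recorded during the run."""
    return throttle_events == 0
```

For example, a 4-thread run that accumulates 36 CPU-seconds over 10 wall-clock seconds has an efficiency of 3.6, which is below the 3.8 floor and would therefore count as a hard failure under this reading.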

Advisory criteria (reported, never reject):

  • Involuntary context switches, compared against 10× a baseline established from prior clean runs and normalised per CPU-second so the comparison is robust to run length and core count. With no baseline the value is reported only (WARN); it is never a sole cause for rejection.
  • RAM utilisation > 90%: a warning only. High RAM use alone is not harmful as long as no swapping occurs (which the swap criterion covers).
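To illustrate the normalisation described above, here is a small sketch of the two advisory checks. It is an assumption-laden illustration rather than the module's code: the function names, the plain `"PASS"`/`"WARN"` strings, and the choice of a strict `>` comparison at the 10× and 90% boundaries are all hypothetical.

```python
from __future__ import annotations

CTX_SWITCH_FACTOR = 10.0   # multiple of the clean-run baseline, from the text above
RAM_WARN_FRACTION = 0.90   # advisory RAM threshold, from the text above


def ctx_switch_status(
    involuntary_switches: int,
    total_cpu_s: float,
    baseline_per_cpu_s: float | None,
) -> str:
    """Return "WARN" or "PASS" for the context-switch advisory (never a rejection)."""
    if baseline_per_cpu_s is None:
        # No baseline: the value can only be reported, so it stays advisory.
        return "WARN"
    # Normalise per CPU-second so long or multi-core runs are comparable.
    rate = involuntary_switches / total_cpu_s
    # Assumption: strictly exceeding 10x the baseline triggers the warning.
    return "WARN" if rate > CTX_SWITCH_FACTOR * baseline_per_cpu_s else "PASS"


def ram_warns(ram_used_fraction: float) -> bool:
    """True when RAM utilisation is above the 90% advisory threshold."""
    return ram_used_fraction > RAM_WARN_FRACTION
```

With a baseline of 5 switches per CPU-second, a run with 1200 involuntary switches over 60 CPU-seconds (20 per CPU-second) stays under the 50 per CPU-second limit, while 6000 switches over the same CPU time (100 per CPU-second) would be flagged.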

This module is intentionally dependency-free (no pandas/streamlit) so it can be unit-tested in isolation; the dashboard consumes it today, and the runner/reporter can reuse it without pulling in UI dependencies.

Status

Bases: str, Enum

Outcome of a single reliability criterion.

Criterion dataclass

Criterion(name: str, status: Status, detail: str, measured: str = '—', limit: str = '—', hard: bool = True)

The evaluated result of one reliability criterion.

rejecting property

rejecting: bool

Whether this criterion is a hard failure that rejects the run.
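As a reading aid, the sketch below shows one way the documented `Status` and `Criterion` shapes could fit together. The field list and the `str, Enum` bases come from the signatures above; the member names `PASS`, `WARN`, `FAIL`, `UNKNOWN`, their string values, and the body of `rejecting` are assumptions inferred from the module description, not the actual source.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    # Member names mirror the states mentioned in the module description;
    # the string values are assumptions made for this sketch.
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"
    UNKNOWN = "unknown"


@dataclass
class Criterion:
    name: str
    status: Status
    detail: str
    measured: str = "\u2014"  # same placeholder default as the documented signature
    limit: str = "\u2014"
    hard: bool = True

    @property
    def rejecting(self) -> bool:
        # Assumed logic: only a hard criterion that failed rejects the run;
        # advisory (hard=False) and UNKNOWN criteria never do.
        return self.hard and self.status is Status.FAIL
```

Under this reading, `Criterion("swap", Status.FAIL, "pages swapped out").rejecting` is true, while the same status on an advisory criterion (`hard=False`) is not rejecting.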

ReliabilityVerdict dataclass

ReliabilityVerdict(criteria: tuple[Criterion, ...])

Aggregate verdict over all criteria for a single run.

failures property

failures: list[Criterion]

Hard criteria that failed (the reasons a run is rejected).

warnings property

warnings: list[Criterion]

Advisory criteria that flagged a concern without rejecting the run.

reliable property

reliable: bool | None

Overall verdict.

False if any hard criterion failed. None (unknown) if no hard criterion could be evaluated at all — there is no evidence either way. True otherwise (at least one hard criterion passed and none failed).
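The three-valued rule above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the property's real body: criteria are represented here as hypothetical `(status, hard)` pairs with plain string statuses instead of `Criterion` objects.

```python
from __future__ import annotations


def overall_verdict(criteria: list[tuple[str, bool]]) -> bool | None:
    """Combine (status, hard) pairs into True, False, or None (unknown).

    status is one of "PASS", "WARN", "FAIL", "UNKNOWN" (names assumed for this sketch).
    """
    # Advisory criteria are ignored: only hard criteria decide the verdict.
    hard_statuses = [status for status, hard in criteria if hard]
    if "FAIL" in hard_statuses:
        return False   # any single hard failure rejects the run
    if "PASS" not in hard_statuses:
        return None    # nothing could be evaluated: no evidence either way
    return True        # at least one hard pass and no hard failure
```

Note the asymmetry: an `UNKNOWN` hard criterion alongside a passing one still yields `True`, whereas a run with only `UNKNOWN` hard criteria yields `None` rather than `True`.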

evaluate_reliability

evaluate_reliability(*, cpu_efficiency: float | None = None, n_threads: int = 1, involuntary_ctx_switches: int | None = None, total_cpu_s: float | None = None, ctx_switch_baseline_per_cpu_s: float | None = None, load_avg_1m_pre: float | None = None, load_avg_1m_post: float | None = None, physical_cores: int | None = None, swap_in_pages: int | None = None, swap_out_pages: int | None = None, thermal_throttle_events: int | None = None, ram_used_fraction: float | None = None) -> ReliabilityVerdict

Evaluate all reliability criteria and return an aggregate verdict.

All inputs are optional; any that are None yield an UNKNOWN criterion that cannot reject the run. See the module docstring for the full model.

Parameters:

  • cpu_efficiency (float | None, default None): total_cpu_s / wall_time_s for the run (≈ 1.0 for a clean single thread).
  • n_threads (int, default 1): Number of benchmark threads; scales the efficiency floor.
  • involuntary_ctx_switches (int | None), total_cpu_s (float | None), ctx_switch_baseline_per_cpu_s (float | None), each default None: The switch count is normalised by total_cpu_s and compared against the baseline (switches per CPU-second from prior clean runs). Advisory only.
  • load_avg_1m_pre, load_avg_1m_post (float | None, default None): 1-minute load average sampled before and after the run; the peak is tested.
  • physical_cores (int | None, default None): Physical core count the load is compared against.
  • swap_in_pages, swap_out_pages (int | None, default None): Pages swapped in/out during the run (deltas, not absolute levels).
  • thermal_throttle_events (int | None, default None): Throttle events accumulated during the run.
  • ram_used_fraction (float | None, default None): Fraction of RAM in use (0–1); advisory warning above 90%.
Source code in k4bench/results/reliability.py
def evaluate_reliability(
    *,
    cpu_efficiency: float | None = None,
    n_threads: int = 1,
    involuntary_ctx_switches: int | None = None,
    total_cpu_s: float | None = None,
    ctx_switch_baseline_per_cpu_s: float | None = None,
    load_avg_1m_pre: float | None = None,
    load_avg_1m_post: float | None = None,
    physical_cores: int | None = None,
    swap_in_pages: int | None = None,
    swap_out_pages: int | None = None,
    thermal_throttle_events: int | None = None,
    ram_used_fraction: float | None = None,
) -> ReliabilityVerdict:
    """Evaluate all reliability criteria and return an aggregate verdict.

    All inputs are optional; any that are ``None`` yield an ``UNKNOWN`` criterion
    that cannot reject the run. See the module docstring for the full model.

    Parameters
    ----------
    cpu_efficiency:
        ``total_cpu_s / wall_time_s`` for the run (≈ 1.0 for a clean single thread).
    n_threads:
        Number of benchmark threads; scales the efficiency floor (default 1).
    involuntary_ctx_switches, total_cpu_s, ctx_switch_baseline_per_cpu_s:
        The switch count is normalised by ``total_cpu_s`` and compared against the
        baseline (switches per CPU-second from prior clean runs). Advisory only.
    load_avg_1m_pre, load_avg_1m_post:
        1-minute load average sampled before / after the run; the peak is tested.
    physical_cores:
        Physical core count the load is compared against.
    swap_in_pages, swap_out_pages:
        Pages swapped in/out *during* the run (deltas, not absolute levels).
    thermal_throttle_events:
        Throttle events accumulated during the run.
    ram_used_fraction:
        Fraction of RAM in use (0–1); advisory warning above 90%.
    """
    criteria = (
        _eval_cpu_efficiency(cpu_efficiency, n_threads),
        _eval_load(load_avg_1m_pre, load_avg_1m_post, physical_cores),
        _eval_swap(swap_in_pages, swap_out_pages),
        _eval_thermal(thermal_throttle_events),
        _eval_ctx_switches(involuntary_ctx_switches, total_cpu_s, ctx_switch_baseline_per_cpu_s),
        _eval_ram(ram_used_fraction),
    )
    return ReliabilityVerdict(criteria=criteria)
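Several of the inputs above can be gathered with the Python standard library on Unix-like systems. The helper below is a hypothetical sketch (the names `measure_inputs` and `_load_avg_1m` are not part of k4bench) that wraps a workload and returns a dictionary keyed by the documented parameter names. It assumes `total_cpu_s` means user plus system CPU time of the current process, and it leaves `physical_cores` as `None` because `os.cpu_count()` reports logical CPUs.

```python
from __future__ import annotations

import os
import resource  # Unix-only module
import time
from typing import Any, Callable


def _load_avg_1m() -> float | None:
    """1-minute load average, or None where the platform cannot provide it."""
    try:
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return None


def measure_inputs(workload: Callable[[], Any], n_threads: int = 1) -> dict[str, Any]:
    """Run workload() and collect a subset of the evaluate_reliability inputs."""
    load_pre = _load_avg_1m()
    usage_pre = resource.getrusage(resource.RUSAGE_SELF)
    wall_start = time.perf_counter()

    workload()

    wall_s = time.perf_counter() - wall_start
    usage_post = resource.getrusage(resource.RUSAGE_SELF)
    load_post = _load_avg_1m()

    # Assumption: total CPU time is user + system time consumed during the run.
    total_cpu_s = (usage_post.ru_utime + usage_post.ru_stime) - (
        usage_pre.ru_utime + usage_pre.ru_stime
    )
    return {
        "cpu_efficiency": total_cpu_s / wall_s if wall_s > 0 else None,
        "n_threads": n_threads,
        "involuntary_ctx_switches": usage_post.ru_nivcsw - usage_pre.ru_nivcsw,
        "total_cpu_s": total_cpu_s,
        "load_avg_1m_pre": load_pre,
        "load_avg_1m_post": load_post,
        "physical_cores": None,  # unknown here; an UNKNOWN criterion cannot reject
    }
```

Because every parameter of `evaluate_reliability` is keyword-only and optional, a dictionary like this could be passed as `evaluate_reliability(**measure_inputs(my_workload))`, where `my_workload` is a placeholder for the benchmark callable; inputs that were not collected (swap, thermal, RAM) would then be reported as UNKNOWN.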