`k4bench.results.reliability`

Conservative pass/fail reliability check for a benchmark run.

The goal here is not to score performance quality on a continuous scale, but to detect runs that were likely affected by environmental interference and should be excluded from comparisons.

A run is judged RELIABLE only when every hard criterion either passes or has no data to judge it. Any single hard failure makes the run UNRELIABLE. The guiding principle is conservative and asymmetric:

* Only positive evidence of interference can reject a run: a criterion with no data (UNKNOWN) never fails it.
* Some signals are advisory (WARN); they are reported but never reject a run.

Hard criteria (a single FAIL rejects the run):

* CPU efficiency ≥ 95% of the ideal for the thread count. For a single-threaded run that is `total_cpu_s / wall_time_s` ≥ 0.95; for an n-threaded run the floor scales to 0.95 * n. Low efficiency means the threads spent time off-CPU: waiting, contended, preempted, or doing I/O. This is the primary contention detector for a single-core workload.
* Load average ≤ physical core count. The 1-minute load average is sampled before and after the run and the peak is tested; load above the physical core count means the host was oversubscribed.
* No swap activity: any swap-in or swap-out during the run rejects it.
* No thermal throttling: any throttle event during the run rejects it.
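The hard criteria can be sketched as small pure functions. This is an illustrative reconstruction from the description above, not the module's actual code: the function names, the `PASS` member, and the handling of partially missing data (judging on whatever samples are available) are assumptions.

```python
from enum import Enum


class Status(str, Enum):
    # Assumed members; the documentation only names FAIL, WARN and UNKNOWN.
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"


def check_cpu_efficiency(cpu_efficiency, n_threads=1):
    """CPU efficiency must reach 95% of the ideal for the thread count."""
    if cpu_efficiency is None:
        return Status.UNKNOWN  # no data never rejects a run
    floor = 0.95 * n_threads  # 0.95 for one thread, 3.8 for four
    return Status.PASS if cpu_efficiency >= floor else Status.FAIL


def check_load(load_pre, load_post, physical_cores):
    """Peak 1-minute load must not exceed the physical core count."""
    samples = [s for s in (load_pre, load_post) if s is not None]
    if not samples or physical_cores is None:
        return Status.UNKNOWN
    return Status.PASS if max(samples) <= physical_cores else Status.FAIL


def check_no_events(*counts):
    """Swap and thermal checks: any positive count is a failure."""
    known = [c for c in counts if c is not None]
    if not known:
        return Status.UNKNOWN
    return Status.FAIL if any(c > 0 for c in known) else Status.PASS


print(check_cpu_efficiency(0.97))       # single thread, above the 0.95 floor
print(check_cpu_efficiency(3.5, 4))     # four threads, below the 3.8 floor
print(check_load(1.2, 9.0, 8))          # peak load 9.0 on 8 physical cores
print(check_no_events(0, 0))            # swap-in and swap-out deltas
```

Each check returns `UNKNOWN` rather than `FAIL` when its input is missing, which is what keeps the model asymmetric.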

Advisory criteria (reported, never reject):

* Involuntary context switches ≤ 10× a baseline established from prior clean runs, normalised per CPU-second so the comparison is robust to run length and core count. With no baseline the value is reported only (WARN). This criterion is never a sole cause for rejection.
* RAM utilisation ≤ 90%; anything above that raises a warning only. High RAM use alone is not harmful as long as no swapping occurs, which the swap criterion covers.
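As a worked example of the advisory checks, the sketch below normalises the switch count per CPU-second and compares it with ten times the baseline. The function names, the string statuses, and the returned tuple are hypothetical; only the thresholds (10× the baseline, 90% RAM) come from the description above.

```python
def ctx_switch_status(involuntary_ctx_switches, total_cpu_s,
                      baseline_per_cpu_s, factor=10.0):
    """Return (status, switches per CPU-second) for the advisory check."""
    if involuntary_ctx_switches is None or not total_cpu_s:
        return "UNKNOWN", None  # nothing to normalise
    rate = involuntary_ctx_switches / total_cpu_s
    if baseline_per_cpu_s is None:
        return "WARN", rate  # no baseline: the value is reported only
    status = "PASS" if rate <= factor * baseline_per_cpu_s else "WARN"
    return status, rate


def ram_status(ram_used_fraction):
    """Advisory warning when more than 90% of RAM is in use."""
    if ram_used_fraction is None:
        return "UNKNOWN"
    return "WARN" if ram_used_fraction > 0.90 else "PASS"


# 1200 switches over 60 CPU-seconds is 20 per CPU-second.
# Against a baseline of 5 per CPU-second the limit is 50, so this passes.
print(ctx_switch_status(1200, 60.0, 5.0))
# 6000 switches over the same CPU time is 100 per CPU-second: a warning.
print(ctx_switch_status(6000, 60.0, 5.0))
print(ram_status(0.95))
```

Neither function can return a failing status, which mirrors the rule that advisory signals never reject a run.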

This module is intentionally dependency-free (no pandas/streamlit) so it can be unit-tested in isolation; the dashboard consumes it today, and the runner/reporter can reuse it without pulling in UI dependencies.

Status

Bases: str, Enum

Outcome of a single reliability criterion.

Criterion `dataclass`

Criterion(name: str, status: Status, detail: str, measured: str = '—', limit: str = '—', hard: bool = True)

The evaluated result of one reliability criterion.

rejecting `property`

rejecting: bool

Whether this criterion is a hard failure that rejects the run.

ReliabilityVerdict `dataclass`

ReliabilityVerdict(criteria: tuple[Criterion, ...])

Aggregate verdict over all criteria for a single run.

failures `property`

failures: list[Criterion]

Hard criteria that failed (the reasons a run is rejected).

warnings `property`

warnings: list[Criterion]

Advisory criteria that flagged a concern without rejecting the run.

reliable `property`

reliable: bool | None

Overall verdict.

* `False` if any hard criterion failed.
* `None` (unknown) if no hard criterion could be evaluated at all, so there is no evidence either way.
* `True` otherwise (at least one hard criterion passed and none failed).
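The three-valued verdict can be illustrated with a minimal stand-in for the two dataclasses. This is a sketch under assumptions, not the module's source: the `Criterion` here carries only three of the documented fields, the `PASS` member is assumed, and treating every `WARN` status as a warning is a guess at how `warnings` is computed.

```python
from dataclasses import dataclass
from enum import Enum


class Status(str, Enum):
    # PASS is an assumed member name; FAIL, WARN and UNKNOWN appear in the docs.
    PASS = "PASS"
    FAIL = "FAIL"
    WARN = "WARN"
    UNKNOWN = "UNKNOWN"


@dataclass(frozen=True)
class Criterion:
    name: str
    status: Status
    hard: bool = True

    @property
    def rejecting(self) -> bool:
        # Only a hard criterion that actually failed rejects the run.
        return self.hard and self.status is Status.FAIL


@dataclass(frozen=True)
class ReliabilityVerdict:
    criteria: tuple

    @property
    def failures(self):
        return [c for c in self.criteria if c.rejecting]

    @property
    def warnings(self):
        return [c for c in self.criteria if c.status is Status.WARN]

    @property
    def reliable(self):
        if self.failures:
            return False
        if not any(c.hard and c.status is Status.PASS for c in self.criteria):
            return None  # no hard criterion could be evaluated
        return True


verdict = ReliabilityVerdict((
    Criterion("cpu_efficiency", Status.PASS),
    Criterion("load", Status.UNKNOWN),
    Criterion("ram", Status.WARN, hard=False),
))
print(verdict.reliable, [c.name for c in verdict.warnings])
```

Returning `None` instead of `True` when nothing could be judged keeps "no evidence" distinguishable from "evidence of a clean run".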

evaluate_reliability

evaluate_reliability(*, cpu_efficiency: float | None = None, n_threads: int = 1, involuntary_ctx_switches: int | None = None, total_cpu_s: float | None = None, ctx_switch_baseline_per_cpu_s: float | None = None, load_avg_1m_pre: float | None = None, load_avg_1m_post: float | None = None, physical_cores: int | None = None, swap_in_pages: int | None = None, swap_out_pages: int | None = None, thermal_throttle_events: int | None = None, ram_used_fraction: float | None = None) -> ReliabilityVerdict

Evaluate all reliability criteria and return an aggregate verdict.

All inputs are optional; any that are None yield an UNKNOWN criterion that cannot reject the run. See the module docstring for the full model.

Parameters:

Name	Type	Description	Default
`cpu_efficiency`	`float \| None`	`total_cpu_s / wall_time_s` for the run (≈ 1.0 for a clean single thread).	`None`
`n_threads`	`int`	Number of benchmark threads; scales the efficiency floor (default 1).	`1`
`involuntary_ctx_switches`	`int \| None`	Involuntary context switches counted during the run. The count is normalised by `total_cpu_s` and compared against the baseline. Advisory only.	`None`
`total_cpu_s`	`float \| None`	Total CPU time of the run in seconds, used to normalise the switch count. Advisory only.	`None`
`ctx_switch_baseline_per_cpu_s`	`float \| None`	Baseline switch rate (switches per CPU-second from prior clean runs) that the normalised count is compared against. Advisory only.	`None`
`load_avg_1m_pre`	`float \| None`	1-minute load average sampled before the run; the peak of the pre and post samples is tested.	`None`
`load_avg_1m_post`	`float \| None`	1-minute load average sampled after the run; the peak of the pre and post samples is tested.	`None`
`physical_cores`	`int \| None`	Physical core count the load is compared against.	`None`
`swap_in_pages`	`int \| None`	Pages swapped in during the run (a delta, not an absolute level).	`None`
`swap_out_pages`	`int \| None`	Pages swapped out during the run (a delta, not an absolute level).	`None`
`thermal_throttle_events`	`int \| None`	Throttle events accumulated during the run.	`None`
`ram_used_fraction`	`float \| None`	Fraction of RAM in use (0–1); advisory warning above 90%.	`None`
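Several of the inputs in the table can be collected with the Python standard library on a Unix host. The helper below is a hypothetical sketch, not how k4bench itself gathers measurements: `measure` is an invented name, it only accounts for work done in the current process, and it leaves every input it cannot measure at its `None` default so that the corresponding criterion would come out as UNKNOWN.

```python
import os
import resource
import time


def measure(workload):
    """Run ``workload`` and return keyword arguments for evaluate_reliability."""
    load_pre = os.getloadavg()[0]  # 1-minute load average before the run
    before = resource.getrusage(resource.RUSAGE_SELF)
    wall_start = time.perf_counter()

    workload()

    wall_s = time.perf_counter() - wall_start
    after = resource.getrusage(resource.RUSAGE_SELF)
    load_post = os.getloadavg()[0]

    # User plus system CPU time consumed by this process during the run.
    total_cpu_s = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "cpu_efficiency": total_cpu_s / wall_s if wall_s > 0 else None,
        "n_threads": 1,
        "involuntary_ctx_switches": after.ru_nivcsw - before.ru_nivcsw,
        "total_cpu_s": total_cpu_s,
        "load_avg_1m_pre": load_pre,
        "load_avg_1m_post": load_post,
        # os.cpu_count() reports logical CPUs, so the physical count stays unknown.
        "physical_cores": None,
    }


kwargs = measure(lambda: sum(i * i for i in range(200_000)))
print(sorted(kwargs))

# With k4bench installed, the result could then be passed on:
# from k4bench.results.reliability import evaluate_reliability
# verdict = evaluate_reliability(**kwargs)
```

Swap, thermal, and RAM inputs are omitted here because reading them is platform-specific; since all parameters are keyword-only and optional, they can be added later without changing the call.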

Source code in k4bench/results/reliability.py

def evaluate_reliability(
    *,
    cpu_efficiency: float | None = None,
    n_threads: int = 1,
    involuntary_ctx_switches: int | None = None,
    total_cpu_s: float | None = None,
    ctx_switch_baseline_per_cpu_s: float | None = None,
    load_avg_1m_pre: float | None = None,
    load_avg_1m_post: float | None = None,
    physical_cores: int | None = None,
    swap_in_pages: int | None = None,
    swap_out_pages: int | None = None,
    thermal_throttle_events: int | None = None,
    ram_used_fraction: float | None = None,
) -> ReliabilityVerdict:
    """Evaluate all reliability criteria and return an aggregate verdict.

    All inputs are optional; any that are ``None`` yield an ``UNKNOWN`` criterion
    that cannot reject the run. See the module docstring for the full model.

    Parameters
    ----------
    cpu_efficiency:
        ``total_cpu_s / wall_time_s`` for the run (≈ 1.0 for a clean single thread).
    n_threads:
        Number of benchmark threads; scales the efficiency floor (default 1).
    involuntary_ctx_switches, total_cpu_s, ctx_switch_baseline_per_cpu_s:
        The switch count is normalised by ``total_cpu_s`` and compared against the
        baseline (switches per CPU-second from prior clean runs). Advisory only.
    load_avg_1m_pre, load_avg_1m_post:
        1-minute load average sampled before / after the run; the peak is tested.
    physical_cores:
        Physical core count the load is compared against.
    swap_in_pages, swap_out_pages:
        Pages swapped in/out *during* the run (deltas, not absolute levels).
    thermal_throttle_events:
        Throttle events accumulated during the run.
    ram_used_fraction:
        Fraction of RAM in use (0–1); advisory warning above 90%.
    """
    criteria = (
        _eval_cpu_efficiency(cpu_efficiency, n_threads),
        _eval_load(load_avg_1m_pre, load_avg_1m_post, physical_cores),
        _eval_swap(swap_in_pages, swap_out_pages),
        _eval_thermal(thermal_throttle_events),
        _eval_ctx_switches(involuntary_ctx_switches, total_cpu_s, ctx_switch_baseline_per_cpu_s),
        _eval_ram(ram_used_fraction),
    )
    return ReliabilityVerdict(criteria=criteria)
