Reliability
===========

Overview
--------

Four reliability controls let LoadDensity run unattended in CI without
turning into a flake factory:

* **Adaptive retry**: exponential backoff + jitter + per-error-class
  budgets, so transient flakes recover while real bugs surface immediately.
* **Failure budget / circuit breaker**: sliding-window error rate; the run
  aborts itself if a regression starts cascading.
* **Network conditioner**: inject latency / jitter / loss per task without
  kernel ``tc`` or external proxies.
* **Process supervisor**: kill orphan Locust / gevent workers and enforce a
  hard wall-clock timeout on any callable.

Each control is independently optional. They all live under
``je_load_density.utils.reliability``.

Adaptive retry
--------------

``classify_error`` buckets an exception into one of three categories:

* ``transient``: connection failures, timeouts, remote disconnects
  (default budget: 5).
* ``flaky``: ``AssertionError``, ``JSONDecodeError`` (default budget: 2).
* ``permanent``: everything else (budget: 0, raised immediately).

.. code-block:: python

   from je_load_density import AdaptiveRetryPolicy, run_with_retry

   policy = AdaptiveRetryPolicy(
       transient_budget=5,
       flaky_budget=2,
       base_delay=0.1,
       max_delay=2.0,
       backoff_factor=2.0,
       jitter=0.25,
   )
   # do_request stands in for your own zero-argument callable
   run_with_retry(lambda: do_request(), policy=policy)

Per-task declaration inside an action JSON:

.. code-block:: json

   {
     "method": "post",
     "request_url": "${var.base}/x",
     "retry": {"transient": 3, "flaky": 1, "base_delay": 0.2}
   }

Failure budget
--------------

.. code-block:: python

   from je_load_density import install_failure_budget

   # env is the Locust Environment of the current run
   budget = install_failure_budget(
       threshold=0.05,        # >5% errors
       window_seconds=30,     # …in the last 30s
       min_samples=50,        # …once at least 50 requests have run
       runner_quit_callback=lambda: env.runner.quit(),
   )

Tripping fires the ``runner_quit_callback`` once; subsequent failures are
ignored. ``current_budget()`` returns the currently installed budget for
inspection.

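
The sliding-window rule behind the failure budget (error rate above
``threshold`` within ``window_seconds``, checked only after ``min_samples``
requests, tripping once) can be illustrated with a small standalone sketch.
This is not LoadDensity's implementation: ``SlidingWindowBudget``,
``record``, ``on_trip`` and ``clock`` are hypothetical names chosen for the
example, and the injectable clock exists only to make the demo
deterministic.

.. code-block:: python

   import time
   from collections import deque


   class SlidingWindowBudget:
       """Hypothetical sketch of a sliding-window failure budget."""

       def __init__(self, threshold, window_seconds, min_samples,
                    on_trip, clock=time.monotonic):
           self.threshold = threshold
           self.window_seconds = window_seconds
           self.min_samples = min_samples
           self.on_trip = on_trip
           self.clock = clock
           self.tripped = False
           self._events = deque()  # (timestamp, failed) pairs, oldest first
           self._failures = 0      # failures currently inside the window

       def record(self, failed):
           # Once tripped, later results are ignored.
           if self.tripped:
               return
           now = self.clock()
           self._events.append((now, failed))
           self._failures += int(failed)
           # Evict samples that have aged out of the window.
           cutoff = now - self.window_seconds
           while self._events and self._events[0][0] < cutoff:
               _, old_failed = self._events.popleft()
               self._failures -= int(old_failed)
           total = len(self._events)
           if total >= self.min_samples and self._failures / total > self.threshold:
               self.tripped = True
               self.on_trip()


   # Demo with a fake clock: 10% of requests fail, the threshold is 5%.
   trips = []
   fake_now = [0.0]
   budget = SlidingWindowBudget(
       threshold=0.05,
       window_seconds=30,
       min_samples=50,
       on_trip=lambda: trips.append("quit"),
       clock=lambda: fake_now[0],
   )
   for i in range(60):
       fake_now[0] += 0.1
       budget.record(failed=(i % 10 == 0))

In the demo the budget stays silent for the first 49 samples and trips on
the 50th, when ``min_samples`` is reached; the remaining ten calls are
no-ops, so ``on_trip`` runs exactly once.
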
Network conditioner
-------------------

Inject latency / jitter / packet loss per task. Drops are simulated by
raising a ``ConnectionError`` before the request fires, so retry budgets
see them as transient.

.. code-block:: python

   from je_load_density import install_network_conditioner

   install_network_conditioner(
       latency_ms=50,
       jitter_ms=20,
       loss_rate=0.01,
       name_filter="/checkout",  # only this endpoint
   )

Process supervisor
------------------

.. code-block:: python

   from je_load_density import ProcessSupervisor, with_watchdog

   # Kill orphan Locust / gevent processes (psutil is an optional dependency)
   killed_pids = ProcessSupervisor().kill_orphans()

   # Raise once a hard wall-clock limit of N seconds is exceeded
   result = with_watchdog(
       lambda: execute_action(action_json),
       timeout_seconds=600,
       on_timeout=lambda: print("dumping state…"),
   )

The watchdog runs the callable in a daemon thread. On timeout it raises
:class:`TimeoutError` in the caller and abandons the worker thread, which
keeps running until the process exits.

Action JSON commands
--------------------

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Command
     - Summary
   * - ``LD_install_failure_budget``
     - Subscribe to Locust request events with a sliding-window budget.
   * - ``LD_uninstall_failure_budget``
     - Detach the budget listener.
   * - ``LD_install_network_conditioner``
     - Install a global latency / jitter / loss injector.
   * - ``LD_uninstall_network_conditioner``
     - Detach the conditioner.

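
The daemon-thread watchdog described under *Process supervisor* can be
approximated with the standard library alone. The sketch below is an
illustration under assumptions, not the code behind ``with_watchdog``:
``watchdog_sketch`` is a hypothetical name, and running ``on_timeout``
before raising is an assumed ordering.

.. code-block:: python

   import threading
   import time


   def watchdog_sketch(func, timeout_seconds, on_timeout=None):
       """Run func in a daemon thread; raise TimeoutError if it overruns."""
       outcome = {}

       def target():
           # Capture either the return value or the exception for the caller.
           try:
               outcome["value"] = func()
           except BaseException as exc:
               outcome["error"] = exc

       worker = threading.Thread(target=target, daemon=True)
       worker.start()
       worker.join(timeout_seconds)

       if worker.is_alive():
           # The thread cannot be killed from outside; it is simply abandoned
           # and disappears when the process exits (daemon=True).
           if on_timeout is not None:
               on_timeout()
           raise TimeoutError(f"callable exceeded {timeout_seconds} seconds")

       if "error" in outcome:
           raise outcome["error"]
       return outcome["value"]


   # A fast callable returns its value; a slow one raises TimeoutError.
   fast_result = watchdog_sketch(lambda: 42, timeout_seconds=1.0)

   notes = []
   try:
       watchdog_sketch(lambda: time.sleep(1.0), timeout_seconds=0.05,
                       on_timeout=lambda: notes.append("timed out"))
   except TimeoutError:
       notes.append("raised")

Because Python threads cannot be forcibly stopped, a timed-out callable
keeps consuming resources until the interpreter exits. Process-level
isolation is the usual alternative when that matters.
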