Why Didn't That Lead Go Out? Tracing a Decision That Happens in 800ms

A while back I wrote about the lead distribution engine — the piece of LeadSwitchboard that takes a lead, figures out which buyers are eligible, charges credits, and fans it out, all in roughly 800ms. That post was about getting the decision right.

This one is about a question the engine couldn't answer: why did it decide that?

Because the failure mode that generates support tickets isn't "the engine crashed." It's a buyer calling to say "I should have gotten that lead — it's in my zip, my balance is fine, why didn't it come to me?" And the honest answer, for a long time, was: I don't know. The engine made a dozen branching decisions in under a second and returned a single word — distributed, or failed — with a one-line reason if you were lucky. Everything in between was gone.

So I built a trace.

The shape of the problem

Distribution is a decision tree, not a straight line. A lead comes in and the engine resolves a campaign, prices it, composes a pool of enrolled buyers, filters that pool by service area and onboarding answers, ranks what survives, locks credit balances, and assigns. Every step can end the journey: no enrolled buyers, everyone filtered out by geography, the lead failed a tier's content filter, a credit lock timed out.

The old logging told me the outcome of that tree. What I needed was the path through it — for one lead, on one attempt, which branch we took at each fork, and why. The constraints made it interesting:

It has to be per-attempt. A lead can be distributed more than once — a retry after a transient failure, a manual re-assign by an operator. "Why didn't buyer X get it" is meaningless without knowing which run you're asking about.
It can never slow down or break distribution. This is observability bolted onto the revenue-critical path. If the trace write fails, the lead still has to go out. Tracing that can break the thing it traces is worse than no tracing.
It has to be queryable in the product, not grepped out of a log aggregator. The person who needs the answer is an operator looking at a lead in the admin UI, not me with a terminal open.

The trace itself

The trace is an append-only table, LeadDistributionEvent. Every meaningful fork in the engine emits a row. The events form a vocabulary:

distribution_started
campaign_resolved
pricing_computed
buyer_pool_composed
buyer_skipped        ← one per buyer the pool dropped, with a reason
buyer_selected
distributed          ← terminal, success
failed               ← terminal, with a skip_reason
gated_for_verification
ping_post_opened

Each row carries typed columns I query on — buyer_id, cost, outcome, skip_reason, duration_ms — plus a data JSONB blob for step-specific detail not worth a column. Every row is stamped with the attempt_id, so a trace is just all rows for one attempt, in order: the engine narrating its own decision.

The recorder is deliberately boring:

def _record_distribution_event(db, *, lead_id, agency_id, attempt_id, event, **fields):
    """Append a LeadDistributionEvent row to the active transaction.

    Commit is the caller's responsibility; this helper must not flush or
    commit so that event writes stay atomic with the distribution work they
    describe. Any failure to construct the row is swallowed — tracing must
    never break distribution itself.
    """
    try:
        db.add(LeadDistributionEvent(lead_id=..., attempt_id=attempt_id,
                                     event=event, **fields))
    except Exception:
        logger.exception("distribution_event_record_failed", ...)

Two design decisions are doing all the work here, and they pull in opposite directions.

It writes on the caller's transaction. I don't flush or commit — I just add the row to whatever session the distribution work is already using. That makes the trace atomic with the work it describes: if the assignment commits, its trace commits with it; if the assignment rolls back, so does the trace. A trace that says "buyer selected, lead assigned" for an assignment that never committed would be a liar, and a lying trace is worse than none.

It swallows everything. The body is wrapped in a try/except that only logs. Tracing sits on the revenue path; it gets zero votes on whether a lead goes out. A malformed data payload or a bad identifier can't be allowed to take down a distribution.

That second decision is uncontroversial. The first — atomicity with the caller's transaction — is the one that had a bug hiding in it.

Bug one: the trace vanishes exactly when you need it

Here's the failure that prompted all of this. A lead comes in, the engine starts composing the buyer pool, and it hits credit-lock contention — two distributions racing for the same buyer's balance, and this one loses the lock. The correct response is to roll back and let the lead retry. So the engine rolls back the transaction.

And rolling back the transaction throws away every trace row written during it.

So the operator opens the lead, looks at the distribution drawer, and sees nothing. The trace is emptiest precisely when someone is most likely to come asking — a lead that mysteriously didn't go out. The atomicity that keeps the trace honest also lets a rollback erase the evidence.

The fix is to recognize that a terminal failure is different from the in-flight steps. The in-flight steps (buyer_pool_composed, pricing_computed) should die with a rollback — they describe work that didn't happen. But the fact that the lead failed, and why is true regardless. That fact needs to outlive the transaction.

So terminal failures on a rolled-back path get written on a fresh session:

async def _emit_failure_trace_fresh(self, *, lead_id, attempt_id, skip_reason, data=None):
    """Persist a terminal `failed` event on a FRESH session.

    Paths that roll back the caller's transaction (tier lock contention)
    discard every in-transaction event row, leaving the drawer empty. Writing
    the trace on its own session keeps a record of *why* the lead didn't fan
    out even after the rollback.

    Best-effort: this runs after the caller transaction was rolled back, so a
    fresh-session failure (the lead row was never committed, or was deleted
    concurrently and the FK rejects the trace) must NOT propagate — otherwise
    it would mask the intended distribution outcome with a generic error.
    """
    try:
        async with AsyncSessionLocal() as trace_db:
            _record_distribution_event(trace_db, ..., event="failed",
                                       outcome="failed", skip_reason=skip_reason,
                                       data=data or {})
            await trace_db.commit()
    except Exception:
        logger.exception("distribution_failure_trace_persist_failed", ...)

This deliberately breaks the atomicity rule — and that's the point. The general events ride the transaction so they can't lie about work that rolled back; the terminal failure breaks out of it because a rolled-back distribution still happened, and the operator needs the record of why.

The best-effort swallow matters more than it looks. This runs after the caller rolled back. The lead row might not be committed yet, or might have been deleted concurrently, in which case the foreign key rejects the insert. Let that error propagate and it surfaces as a generic 500 that masks the real outcome — a database error instead of "this lead lost a credit lock and will retry." The write whose entire job is to explain what happened must never become the thing that obscures it.

Bug two: a transient label on a permanent failure

The same change fixed a subtler bug — not in the trace plumbing, but in telling the truth in the skip_reason.

When the buyer pool comes back empty, the engine has to classify why, because the classification decides what happens next. A transient failure (no_enrolled_buyers_transient) means "try again in a moment" — back on the retry path. A permanent failure (no_enrolled_buyers, no_eligible_buyers) means "stop; a human needs to fix the config" — surface it to an operator. Get that wrong in either direction and you either retry forever against a campaign that will never have an eligible buyer, or park a perfectly distributable lead for manual review.

The old code had the first bug. For round-robin campaigns, when the pool came back empty it counted the campaign's total active enrollments, and if that was greater than zero, called the failure transient — "buyers exist, they must all be locked, retry." But that count was campaign-wide. It happily counted buyers filtered out by service area or onboarding answers — buyers who were never eligible for this lead. A campaign full of buyers who don't cover the lead's zip got flagged "all locked, transient" and the lead retried against a wall.

The fix classifies by which filter emptied the pool:

# Genuine lock contention is raised as _TierLockContention from the selection
# loop (handled above), so reaching here means no tier produced a candidate
# for a NON-transient reason. Classify by which filter emptied the pool so the
# trace and failure reason are accurate — these are permanent config/coverage
# gaps, not retryable.

tier_filter_rejections = sum(
    1 for r in tier_skip_reasons if r.get("reason") == "lead_filter_rejected"
)
pool_filtered_out = (
    allowed_buyer_ids is not None
    and len(allowed_buyer_ids) == 0
    and len(all_enrolled_ids) > 0
)

if tier_filter_rejections > 0:
    failure_reason = "no_eligible_buyers_after_tier_filters"  # lead failed a content filter
elif pool_filtered_out:
    failure_reason = "no_eligible_buyers"                     # everyone filtered out by geo/onboarding
else:
    failure_reason = "no_enrolled_buyers"                     # campaign genuinely has nobody

The key move: genuine lock contention now has its own exception (_TierLockContention), raised from the selection loop and handled separately — the one truly transient path, and the one that gets the fresh-session failure trace from bug one. By the time control reaches this block, the empty pool is not a locking problem; it's a permanent gap, and the only question is which filter caused it. Three filters, three different conversations with the operator, three different skip_reasons in the trace.

This is the whole reason the trace exists. "No eligible buyers" is a useless answer to a buyer's question. "Every enrolled buyer was removed by the service-area filter because none of them cover 30312" is an answer the operator can act on — go enroll a buyer who covers that zip.

Surfacing it where the question gets asked

A trace nobody reads is just an expensive log. The events land in the lead's detail drawer in the admin UI, rendered in order — the same drawer where I also surfaced the lead's campaign and state, because "which campaign was this even routing through" turned out to start every investigation.

So now: buyer asks "why didn't I get it," operator opens the lead, reads the trace top to bottom. distribution_started → campaign_resolved → buyer_pool_composed → a failed row with skip_reason: no_eligible_buyers and a data blob showing eight enrolled buyers, zero surviving the service-area filter. Answer in fifteen seconds, no engineer involved.

What this taught me about tracing

Three things I'd carry to any system where a fast, branching decision needs to be explainable after the fact:

A trace is bound to an attempt; a log is bound to a moment. The most useful column in the table is attempt_id — it's what turns scattered rows into a story you can read end to end. Logs tell you what was happening at 14:32:07; a trace tells you what happened to this lead on this run. The product question is always the second kind.

Atomicity is a feature until the failure path needs an exception. Riding the caller's transaction is right for events describing work that might roll back — it keeps the trace from lying. But the record of why something failed has to outlive the rollback that caused it, which means deliberately breaking the rule for that one case. Knowing which events are "part of the work" and which are "the verdict on the work" is the whole design.

The classification is the product, not the plumbing. I could have built the event table perfectly and still shipped a useless feature, because the value isn't in having a skip_reason — it's in its being true. A trace that confidently says "transient, retrying" about a permanent coverage gap is worse than silence, because now someone trusts it. The engineering that matters most in an observability feature is the part that decides what to call things.

The distribution engine answers "where does this lead go." The trace answers "why" — and the second question, it turns out, is the one that was costing me afternoons.