Runs vs. Steps and retry behavior

The automation engine has two layers of state worth understanding:

Layer	Granularity
Run	One per trigger event. A run is the lifecycle of a single end-to-end execution of a flow.
Step	One per node executed within a run. A run with 5 nodes produces up to 5 step records.

Knowing the difference matters when debugging, because a Run can be marked “Completed” even when individual Steps failed.

Run statuses

The flow_run_status_enum has these values:

Status	Meaning
`scheduled`	The run is queued; will execute at its scheduled time
`running`	Currently executing
`completed`	The run’s last step ran successfully (today; see caveat below)
`failed`	The run aborted at some step
`retrying`	A step failed transiently and the run is between retry attempts
`interrupted`	The run was paused by a platform-level event (e.g. a graceful worker restart); resumable
`cancelled`	The run was explicitly cancelled (by a manual operator action or by the removal of a downstream dependency)

“Overdue” is not an enum value. It is a derived view: runs whose status is `scheduled` and whose scheduledFor is in the past. The ops dashboard surfaces them, but in the database they are still `scheduled`.
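As a minimal sketch of that derived view, the filter below selects overdue runs from an in-memory list. The `Run` dataclass and the `overdue_runs` helper are hypothetical names used for illustration only; the real view is computed by the platform, and only the `scheduled` status and the scheduledFor field come from this article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Run:
    id: str
    status: str              # one of the flow_run_status_enum values
    scheduled_for: datetime  # stands in for the scheduledFor field


def overdue_runs(runs: list[Run], now: datetime) -> list[Run]:
    """Return runs that are still `scheduled` but whose scheduled time has passed."""
    return [r for r in runs if r.status == "scheduled" and r.scheduled_for < now]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    Run("a", "scheduled", now - timedelta(hours=2)),  # overdue
    Run("b", "scheduled", now + timedelta(hours=2)),  # not due yet
    Run("c", "failed", now - timedelta(hours=2)),     # in the past, but not scheduled
]
overdue_ids = [r.id for r in overdue_runs(runs, now)]
print(overdue_ids)
```

Only run "a" qualifies: it is the only one that is both `scheduled` and past its scheduled time.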

Today the completed status is a coarse signal: it means “the last step finished without throwing”. A run with 5 steps where step 3 silently produced empty output and step 5 succeeded still reports completed. This is a known limitation; a future refinement will mark a run failed if any step failed.
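The difference between the current rule and the planned refinement can be sketched as two rollup functions over a run's step statuses. Both function names are hypothetical, and the model is deliberately simplified: it assumes a run that kept going after an earlier step was recorded as failed, which is not a statement about how the engine is implemented.

```python
def run_status_today(step_statuses: list[str]) -> str:
    """Coarse rule: only the outcome of the last step decides the run status."""
    if not step_statuses:
        raise ValueError("a run needs at least one step record")
    return "completed" if step_statuses[-1] == "completed" else "failed"


def run_status_strict(step_statuses: list[str]) -> str:
    """Planned refinement: any failed step marks the whole run as failed."""
    if not step_statuses:
        raise ValueError("a run needs at least one step record")
    return "failed" if "failed" in step_statuses else "completed"


# Five steps; step 3 failed but the last step succeeded.
steps = ["completed", "completed", "failed", "completed", "completed"]
print(run_status_today(steps))   # completed
print(run_status_strict(steps))  # failed
```

When a run reports completed but the customer-visible outcome is wrong, this gap is the reason to open the per-step breakdown rather than trust the run status alone.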

Step statuses

The flow_run_step_status_enum is similar but with no cancelled:

Status	Meaning
`scheduled`	Step queued; waiting for upstream to complete
`running`	Currently executing
`completed`	Step finished without error
`failed`	Step errored and exhausted retries
`retrying`	Step is between retry attempts
`interrupted`	Step was paused mid-execution
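For client code or scripts that read run history, the two status sets can be mirrored as enums. This is a sketch: the class names `FlowRunStatus` and `FlowRunStepStatus` are illustrative, and only the status values themselves come from the tables above.

```python
from enum import Enum


class FlowRunStatus(str, Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"
    INTERRUPTED = "interrupted"
    CANCELLED = "cancelled"


class FlowRunStepStatus(str, Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"
    INTERRUPTED = "interrupted"


# The only status a run can have that a step cannot.
run_only = {s.value for s in FlowRunStatus} - {s.value for s in FlowRunStepStatus}
print(run_only)  # {'cancelled'}
```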

Retry behavior

When a step fails:

The engine retries with increasing delays (a few seconds, then a minute, then five minutes) and gives up once the retry budget is exhausted.
HTTP Request nodes against transient network errors typically succeed on retry.
Errors due to bad config (missing variable, deleted Custom Field, unauthorized API key) won’t be fixed by retry and just exhaust the retry budget.
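The behavior in the list above can be sketched as a small retry loop. The delay values (5, 60, and 300 seconds), the number of attempts, and the function name `run_step_with_retries` are assumptions chosen to match the rough description here; they are not the engine's actual configuration. The `sleep` function is injected so the example runs instantly.

```python
from typing import Callable, Optional

# Assumed delays: "a few seconds, then a minute, then five minutes".
RETRY_DELAYS_SECONDS = [5, 60, 300]


def run_step_with_retries(
    step: Callable[[], object],
    sleep: Callable[[float], None],
) -> tuple[str, int]:
    """Run a step, retrying on any error; return (final_status, attempts)."""
    attempts = 0
    # One delay after each failed attempt; None marks the final attempt.
    schedule: list[Optional[int]] = [*RETRY_DELAYS_SECONDS, None]
    for delay in schedule:
        attempts += 1
        try:
            step()
            return "completed", attempts
        except Exception:
            if delay is None:
                return "failed", attempts  # retry budget exhausted
            sleep(delay)
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky_http() -> str:
    """Fails once with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient network error")
    return "ok"


def bad_config() -> str:
    """Fails the same way every time; retrying cannot help."""
    raise KeyError("missing variable")


waits: list[float] = []
flaky_result = run_step_with_retries(flaky_http, waits.append)
config_result = run_step_with_retries(bad_config, waits.append)
print(flaky_result)   # ('completed', 2)
print(config_result)  # ('failed', 4)
```

The transient failure recovers on the second attempt, while the config error burns through every delay and still ends as failed, which is why fixing the configuration has to come before any re-trigger.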

Once the underlying issue is fixed, an operator can manually re-trigger a failed run that has exhausted its retries. A failed run cannot be resumed in flight: a re-trigger creates a new run from the start.

What’s “overdue” vs. “failed”

Failed means the engine tried and the step’s code or external dependency returned an error.
Overdue means the engine never got around to trying. This happens during platform outages or large backlogs.

If the platform was fully down for a day, a queue of “should have fired by now” runs accumulates as Overdue. When the engine comes back online, it processes the queue and converts overdue runs into either Completed or Failed.

Nothing is lost in an outage: runs persist in the queue and re-execute when the engine recovers. The cost is that time-sensitive runs (like a “1 day before” appointment reminder) fire late, or not at all if the appointment time has already passed.
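To reason about the impact of an outage on time-sensitive runs, an operator could triage overdue reminders along these lines. This is a sketch for illustration only: `OverdueReminder`, its `event_at` field, and the `triage` helper are hypothetical and do not describe how the engine itself decides what to do.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class OverdueReminder:
    run_id: str
    scheduled_for: datetime  # when the reminder should have fired
    event_at: datetime       # hypothetical: the appointment it refers to


def triage(reminder: OverdueReminder, now: datetime) -> str:
    """Classify an overdue reminder once the engine is back online."""
    if reminder.event_at <= now:
        return "too late"    # the appointment has already passed
    return "fires late"      # still useful, just delayed


now = datetime(2024, 1, 3, 9, 0, tzinfo=timezone.utc)
afternoon = OverdueReminder(
    "r1",
    scheduled_for=datetime(2024, 1, 2, 15, 0, tzinfo=timezone.utc),
    event_at=datetime(2024, 1, 3, 15, 0, tzinfo=timezone.utc),
)
morning = OverdueReminder(
    "r2",
    scheduled_for=datetime(2024, 1, 2, 8, 0, tzinfo=timezone.utc),
    event_at=datetime(2024, 1, 3, 8, 0, tzinfo=timezone.utc),
)
triage_results = {r.run_id: triage(r, now) for r in (afternoon, morning)}
print(triage_results)
```

Here the afternoon appointment's reminder is still worth sending, while the morning appointment has already passed by the time the engine recovers.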

When to escalate

If, as an operator, you notice:

A spike in Failed runs across multiple flows in a short window — likely a deployed change has broken something. Escalate immediately.
A growing Overdue count — engine is falling behind on its queue. Escalate.
A single customer reporting “I never got my email” — most likely a per-flow config issue. Check the run history for that specific flow and look at the failed-step details.
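The first and third cases above differ mainly in breadth: many flows failing at once versus one flow failing repeatedly. A rough check for that distinction might look like the sketch below. The thresholds (15 minutes, 3 flows, 10 failures), the `FailedRun` dataclass, and the function name are all assumptions for illustration, not platform settings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class FailedRun:
    flow_id: str
    failed_at: datetime


def looks_like_failure_spike(
    failures: list[FailedRun],
    now: datetime,
    window: timedelta = timedelta(minutes=15),  # assumed "short window"
    min_flows: int = 3,                         # assumed "multiple flows"
    min_failures: int = 10,                     # assumed "spike"
) -> bool:
    """True when recent failures are both numerous and spread across flows."""
    recent = [f for f in failures if now - window <= f.failed_at <= now]
    flows = {f.flow_id for f in recent}
    return len(recent) >= min_failures and len(flows) >= min_flows


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
# Twelve failures in the last 12 minutes, spread over four flows.
broad = [FailedRun(f"flow-{i % 4}", now - timedelta(minutes=i)) for i in range(12)]
# Twelve failures in the same period, all in one flow.
narrow = [FailedRun("flow-0", now - timedelta(minutes=i)) for i in range(12)]

spike_broad = looks_like_failure_spike(broad, now)
spike_narrow = looks_like_failure_spike(narrow, now)
print(spike_broad, spike_narrow)  # True False
```

The broad pattern suggests a platform-level regression worth escalating; the narrow one points back to a single flow's configuration.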

For the dashboards used in day-to-day operations, see the ops dashboard article (gated).

Debugging a single failed run

In the console: Automations → [flow] → Run history. Each run is listed with its trigger event, status, and a per-step breakdown. Click a failed step to see:

The variables it received as input.
The error message it produced.
The HTTP request/response if the failure was in an HTTP node.

Most failures fall into:

Missing variable — the variable path doesn’t resolve. Usually a typo or a deleted custom field. See Variable placeholders.
External API error — a third-party service (SendGrid, Make.com, a customer’s CRM) returned an error. Read the error body for the actual cause.
Authentication failure — an API key or token has been rotated or revoked. Update the integration config.

What’s next

You’re at the end of the public Administration handbook. Internal-only operator notes (workarounds, infra, roadmap) are in the Internal section, gated behind operator authentication.