
Runs vs. Steps + retry behavior

The automation engine has two layers of state worth understanding:

| Layer | Granularity |
| --- | --- |
| Run | One per trigger event. A run is the lifecycle of a single end-to-end execution of a flow. |
| Step | One per node executed within a run. A run with 5 nodes produces up to 5 step records. |

The difference matters when debugging, because a Run can be marked “Completed” even when individual Steps failed.

The flow_run_status_enum has these values:

| Status | Meaning |
| --- | --- |
| scheduled | The run is queued; will execute at its scheduled time |
| running | Currently executing |
| completed | The run’s last step ran successfully (today; see caveat below) |
| failed | The run aborted at some step |
| retrying | A step failed transiently and the run is between retry attempts |
| interrupted | The run was paused by a platform-level event (e.g. a graceful worker restart); resumable |
| cancelled | The run was explicitly cancelled (manual operator action or by a downstream dependency removal) |

“Overdue” is not an enum value — it’s a derived view: runs in scheduled whose scheduledFor is in the past. The ops dashboard surfaces them, but in the database they’re still scheduled.
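The derived view can be sketched as a simple filter. This is a minimal illustration, not the dashboard's actual query: the `Run` class and the `overdue_runs` function are hypothetical names, and only the `scheduled` status and the `scheduledFor` timestamp come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Run:
    # Hypothetical in-memory shape of a run record (assumption for illustration).
    id: str
    status: str
    scheduled_for: datetime  # stands in for the scheduledFor field


def overdue_runs(runs, now=None):
    """Return runs still in `scheduled` whose scheduled time is in the past."""
    now = now or datetime.now(timezone.utc)
    return [r for r in runs if r.status == "scheduled" and r.scheduled_for < now]
```

Note that nothing is written back: an overdue run keeps the `scheduled` status, and "overdue" exists only as the result of the filter.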

The completed status today is a coarse signal — it means “the last step finished without throwing”. A run with 5 steps where step 3 silently produced empty output and step 5 succeeded also reports completed. This isn’t ideal but it’s where things are; a future refinement will mark a run failed if any step failed.
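The gap between the current rule and the planned refinement can be shown side by side. The two functions below are a hedged sketch of the rules as described here, not the engine's code; both names are hypothetical, and each takes the list of step statuses in execution order.

```python
def run_status_today(step_statuses):
    """Current coarse rule: only the outcome of the last step is considered."""
    return "completed" if step_statuses[-1] == "completed" else "failed"


def run_status_refined(step_statuses):
    """Planned refinement as described: any failed step fails the run."""
    return "failed" if "failed" in step_statuses else "completed"


# A 5-step run where step 3 failed but step 5 succeeded.
steps = ["completed", "completed", "failed", "completed", "completed"]
print(run_status_today(steps))    # completed
print(run_status_refined(steps))  # failed
```

A step that silently produces empty output is still recorded as `completed`, so neither rule catches that case by status alone; the per-step output has to be inspected.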

The flow_run_step_status_enum is similar, but it has no cancelled value:

| Status | Meaning |
| --- | --- |
| scheduled | Step queued; waiting for upstream to complete |
| running | Currently executing |
| completed | Step finished without error |
| failed | Step errored and exhausted retries |
| retrying | Step is between retry attempts |
| interrupted | Step was paused mid-execution |
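For scripts that read run history, it can help to mirror both enums in code so that typos in status strings fail loudly. The class names below are hypothetical; the values are copied from the two tables in this article.

```python
from enum import Enum


class FlowRunStatus(str, Enum):
    # Mirrors flow_run_status_enum (class name is an assumption).
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"
    INTERRUPTED = "interrupted"
    CANCELLED = "cancelled"


class FlowRunStepStatus(str, Enum):
    # Mirrors flow_run_step_status_enum: same values, minus cancelled.
    SCHEDULED = "scheduled"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRYING = "retrying"
    INTERRUPTED = "interrupted"


# The only value present on runs but not on steps:
run_only = {s.value for s in FlowRunStatus} - {s.value for s in FlowRunStepStatus}
print(run_only)  # {'cancelled'}
```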

When a step fails:

  • The engine retries with exponential backoff: after a few seconds, then a minute, then five minutes, and so on until the retry budget is exhausted and the step is marked failed.
  • HTTP Request nodes that hit transient network errors typically succeed on retry.
  • Errors due to bad config (missing variable, deleted Custom Field, unauthorized API key) won’t be fixed by retry and just exhaust the retry budget.
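The retry behaviour above can be sketched as a loop. This is an illustration under stated assumptions, not the engine's implementation: the exact delays (5 s, 60 s, 300 s) and the size of the retry budget are assumed, and `run_step_with_retries` is a hypothetical name. The sketch retries every error the same way, which is why a configuration error simply burns through the budget.

```python
import time

# Assumed delays; the article only says "a few seconds, a minute, five minutes".
ASSUMED_DELAYS_SECONDS = (5, 60, 300)


def run_step_with_retries(step, delays=ASSUMED_DELAYS_SECONDS, sleep=time.sleep):
    """Call `step` until it succeeds or the retry budget is exhausted.

    Returns (final_status, attempts_made).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            step()
            return "completed", attempts
        except Exception:
            if attempts > len(delays):
                return "failed", attempts  # budget exhausted
            sleep(delays[attempts - 1])    # step is "retrying" while waiting


def bad_config_step():
    # A missing variable fails identically on every attempt.
    raise KeyError("contact.custom_field")


# Pass a no-op sleep so the example finishes instantly.
print(run_step_with_retries(bad_config_step, sleep=lambda s: None))  # ('failed', 4)
```

With three delays, a step gets four attempts in total: the initial one plus one after each wait.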

A failed run that’s exhausted retries can be manually re-triggered by an operator if the underlying issue is fixed. There’s no in-flight resumption — a re-trigger creates a new run from the start.

Failed and Overdue describe different situations:

  • Failed means the engine tried and the step’s code or external dependency returned an error.
  • Overdue means the engine never got around to trying. This happens during platform outages or large backlogs.

If the platform was fully down for a day, a queue of “should have fired by now” runs accumulates as Overdue. When the engine comes back online, it processes the queue and converts overdue runs into either Completed or Failed.

Nothing is lost in an outage — runs persist in the queue and re-execute when the engine recovers. The cost is that time-sensitive runs (like a “1 day before” appointment reminder) fire late or not at all if the appointment time has already passed.
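The "late or not at all" outcome for a time-sensitive run can be illustrated with a small decision function. This is a hypothetical sketch: the function name, its return values, and the idea that the check compares the current time against the appointment time are assumptions made for illustration, not documented engine behaviour.

```python
from datetime import datetime, timedelta, timezone


def reminder_decision(scheduled_for, appointment_at, now):
    """Decide what happens to a reminder run when the engine picks it up."""
    if now < scheduled_for:
        return "not_due"  # still waiting in the queue
    if now >= appointment_at:
        return "skip"     # appointment already passed; the reminder is pointless
    return "send"         # on time, or late but still before the appointment


appointment = datetime(2024, 3, 10, 9, 0, tzinfo=timezone.utc)
scheduled = appointment - timedelta(days=1)  # the "1 day before" reminder

# Engine recovers 6 hours after the reminder was due: it fires late.
print(reminder_decision(scheduled, appointment, scheduled + timedelta(hours=6)))   # send
# Engine recovers 30 hours after it was due: the appointment has passed.
print(reminder_decision(scheduled, appointment, scheduled + timedelta(hours=30)))  # skip
```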

If as an operator you notice:

  • A spike in Failed runs across multiple flows in a short window — likely a deployed change has broken something. Escalate immediately.
  • A growing Overdue count — engine is falling behind on its queue. Escalate.
  • A single customer reporting “I never got my email” — most likely a per-flow config issue. Check the run history for that specific flow and look at the failed-step details.

For the dashboards used in day-to-day operations, see the ops dashboard article (gated).

In the console: Automations → [flow] → Run history. Each run is listed with its trigger event, status, and a per-step breakdown. Click a failed step to see:

  • The variables it received as input.
  • The error message it produced.
  • The HTTP request/response if the failure was in an HTTP node.

Most failures fall into one of three categories:

  1. Missing variable — the variable path doesn’t resolve. Usually a typo or a deleted custom field. See Variable placeholders.
  2. External API error — a third-party (SendGrid, Make.com, a customer’s CRM) returned an error. Read the error body for the actual cause.
  3. Authentication failure — an API key or token has been rotated or revoked. Update the integration config.

You’re at the end of the public Administration handbook. Internal-only operator notes (workarounds, infra, roadmap) are in the Internal section, gated behind operator authentication.