Slack / Discord notifications
A training job is two waits back-to-back: GPU allocation at the start (variable, can be several minutes when a worker starts from cold; see the Quickstart for why) and then the training run itself (about 7 to 12 minutes for the templates). Nobody actually watches a browser tab for that long. The terminal onCompleted and onFailed callbacks are the natural spot to fan a status message out to wherever your team already lives, so you only look back when the run is genuinely done.
This recipe uses Slack incoming webhooks; Discord, Microsoft Teams, and arbitrary HTTP endpoints work the same way. Anything you can fetch, you can notify.
If you only want a desktop ping for your own runs, Studio already shows a browser notification, an in-page toast, and a tab-title indicator on training.completed / training.failed once you click Run training (and accept the permission prompt). This recipe is for fanning the same signal into a shared channel.
The pattern
The <!here> mention only fires on failure, so successful runs do not page anyone. Adjust the urgency to match how often your team’s training jobs actually fail.
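The pattern can be sketched roughly as follows. This is a minimal illustration, not the SDK's own code: the RunEvent type, buildSlackPayload, and notifySlack are hypothetical names, and the real arguments passed to onCompleted and onFailed are defined by the SDK. The only parts taken from Slack itself are the incoming-webhook JSON body with a text field and the <!here> mention syntax.

```typescript
// Hypothetical event shape for illustration; the SDK defines the real callback arguments.
type RunEvent =
  | { kind: "completed"; jobName: string }
  | { kind: "failed"; jobName: string; error: string };

// Build the Slack message body. Only failures carry the <!here> mention.
function buildSlackPayload(event: RunEvent): { text: string } {
  if (event.kind === "failed") {
    // error is already a string, so it is embedded directly.
    return { text: `<!here> Training run "${event.jobName}" failed: ${event.error}` };
  }
  return { text: `Training run "${event.jobName}" completed.` };
}

// Post to a Slack incoming webhook. Never rejects: every failure is logged and swallowed.
// Call it as notifySlack(event, process.env.SLACK_WEBHOOK_URL) from the lifecycle callbacks.
async function notifySlack(event: RunEvent, webhookUrl: string | undefined): Promise<boolean> {
  if (!webhookUrl) return false; // Not configured: skip quietly.
  try {
    const res = await fetch(webhookUrl, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(buildSlackPayload(event)),
    });
    if (!res.ok) {
      console.warn(`Slack webhook returned HTTP ${res.status}`);
      return false;
    }
    return true;
  } catch (err) {
    // Network errors (outage, DNS) end up here instead of rejecting the callback.
    console.warn("Slack webhook request failed:", err);
    return false;
  }
}
```

Under these assumptions, onCompleted would call notifySlack with a "completed" event and onFailed with a "failed" event. Because the helper returns a boolean instead of throwing, the callbacks themselves never reject.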
Why the inner try / catch matters
If the webhook request throws (Slack outage, DNS hiccup, a non-2xx response that your code throws on), the callback rejects. The Arkor runtime catches that rejection and routes it through the SSE reconnect loop (SDK § Lifecycle callbacks). With maxReconnectAttempts at its default of unlimited, a flaky webhook can quietly retry forever, and Last-Event-ID advancing across the retry can swallow the original event.
Treat the webhook as a side effect, not as part of the run’s success criterion. Catch inside; log if you want to know.
Variations
Per-step progress pings. Combine with onLog to post a one-line progress message every N steps.
Gate it behind process.env.NOTIFY_PROGRESS === "1" if you only want it for important runs.
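One way to throttle progress pings is sketched below. The LogEntry shape and the makeProgressNotifier helper are hypothetical; the fields the SDK actually passes to onLog may differ, so treat step and loss as placeholders.

```typescript
// Hypothetical log entry shape; the fields the SDK passes to onLog may differ.
type LogEntry = { step: number; loss?: number };

// Returns a handler that forwards one message every `everyN` steps.
// `send` is whatever delivers the text (for example, a webhook POST).
function makeProgressNotifier(
  everyN: number,
  enabled: boolean,
  send: (text: string) => Promise<void>,
): (entry: LogEntry) => Promise<void> {
  return async (entry: LogEntry): Promise<void> => {
    if (!enabled) return;
    // Skip step 0 and every step that is not a multiple of everyN.
    if (entry.step === 0 || entry.step % everyN !== 0) return;
    const loss = entry.loss === undefined ? "" : `, loss ${entry.loss.toFixed(4)}`;
    try {
      await send(`step ${entry.step}${loss}`);
    } catch (err) {
      // A failed ping must not reject the callback.
      console.warn("progress notification failed:", err);
    }
  };
}
```

Passing process.env.NOTIFY_PROGRESS === "1" as the enabled argument keeps the pings opt-in per run.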
Mid-run sample sharing. Combine with the Mid-run evaluation recipe: post each checkpoint sample to a review channel so colleagues can react with emoji while the run continues.
A capture() call, a Datadog event, a database insert: the shape is the same. Put the side effect behind an async helper that swallows its own errors and call it from the lifecycle callbacks. The trainer file does not need any extra orchestration.
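A generic version of such a helper might look like the sketch below. The name safely and its signature are assumptions made for illustration; nothing here is part of the SDK.

```typescript
// Hypothetical helper: run any side effect and never let its failure escape.
// Returns true on success and false if the effect threw or rejected.
async function safely(label: string, effect: () => Promise<unknown>): Promise<boolean> {
  try {
    await effect();
    return true;
  } catch (err) {
    // Log for visibility, then swallow so the lifecycle callback still resolves.
    console.warn(`[notify] ${label} failed:`, err);
    return false;
  }
}
```

Inside a lifecycle callback, each destination then becomes a single line such as await safely("analytics", () => sendEvent()), where sendEvent stands in for whatever client call you use.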
What to keep in mind
- Inner try / catch is mandatory. Notifications are nice to have; an outage in your webhook should never silently retry your training event stream.
- Keep secrets out of the trainer file. The example reads SLACK_WEBHOOK_URL from process.env so the webhook does not land in git. Same idea for any token-based destination.
- Remember error is a string. onFailed’s error argument is the string the backend sent (SDK § Lifecycle callbacks), not an Error instance. Embed it directly; do not call .message on it.