Detecting Slow Flow Runs

The Hyperscience Platform provides built-in capabilities to detect flow runs (or submissions) that take longer than expected. These regular checks flag potential delays in real time through application logs and OpenTelemetry metrics, allowing users to set up alerts for Service Level Agreement (SLA) risks without relying on manual monitoring.

Availability

This feature is available in v39.0 and later.

In this article, you’ll learn:

  • How to configure the detection of slow flow runs

  • What predefined thresholds are used and how they can be customized

  • Where the checks for slow flows are reported

  • What application log messages are and how to interpret them

  • What OpenTelemetry metrics are available for monitoring

  • How to set up alerts for potential SLA risks

Activating slow flow detection

To activate the detection of slow flow runs, add the following line to your deployment configuration:

  • For Docker Compose: Update the .env file:

HYPERFLOW_DETECT_SLOW_FLOWS=true

  • For Kubernetes: Update the values.yaml file:

HYPERFLOW_DETECT_SLOW_FLOWS: true
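
If your Kubernetes deployment passes application settings through an environment block, the placement might look like the minimal values.yaml sketch below. The enclosing env key is an assumption about your chart layout, not a documented requirement; adjust it to match your deployment:

# values.yaml (sketch; the enclosing env block is an assumption about your chart layout)
env:
  HYPERFLOW_DETECT_SLOW_FLOWS: true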

Predefined thresholds

Hyperscience uses predefined thresholds to identify slow flow runs and tasks. These thresholds determine when alerts are triggered and can be customized through environment variables. The standard thresholds are listed below:

Flow-run thresholds

  • Machine Processing Time (total time, Supervision excluded):

    • Thresholds: 5m, 10m, 15m, 30m, 1h, 2h, 4h, 6h, 12h, 1d, 2d, 3d, 1w, 2w, 5w, 10w, 30w, 60w, 120w, 300w.

  • Total Flow Duration (machine + manual time):

    • Thresholds: 10m, 30m, 1h, 2h, 4h, 6h, 12h, 1d, 2d, 3d, 1w, 2w, 5w, 10w, 30w, 60w, 120w, 300w.

Task thresholds

  • Tasks Pending and Running:

    • Thresholds: 1m, 5m, 10m, 15m, 30m, 1h, 2h, 4h, 6h, 12h, 1d, 2d, 3d, 1w, 2w, 5w, 10w, 30w, 60w, 120w, 300w.

Logs

The Hyperscience Platform generates detailed log entries whenever flow runs or tasks exceed predefined thresholds. These logs provide valuable, actionable insights, allowing you to identify potential SLA risks and system bottlenecks quickly. Below is an overview of the types of issues detected and their corresponding log messages to help you interpret and act on them effectively.

Application Logs

Application logs capture real-time messages when flow runs or tasks exceed thresholds, offering clear and actionable details to quickly identify delays or potential issues. See details and descriptions in the list below:

  • Condition: Flow run exceeds machine processing time
    • Log value: Workflow has been in machine processing for machine_duration=362.252s, correlation_id=<correlation_id>
    • Notes: machine_duration and correlation_id values vary.

  • Condition: Flow run not completing
    • Log value: Workflow has been running for total_duration=623.251s, correlation_id=<correlation_id>
    • Notes: total_duration and correlation_id values vary.

  • Condition: Task pending for too long
    • Log value: Task has been scheduled for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, duration=60s
    • Notes: Task-related identifiers and duration values vary.

  • Condition: Task running for too long
    • Log value: Task has been running for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, duration=60s
    • Notes: Task-related identifiers and duration values vary.

  • Condition: Tasks in terminal state
    • Log value: We have number_tasks=15 tasks in terminal state waiting for hyperflow_engine to process them. The oldest one has been waiting for duration=60s.
    • Notes: number_tasks and duration values vary.

  • Condition: Failed flow runs
    • Log value: There are num_flows=10 FAILED flows of type flow_name=DOCUMENT_PROCESSING. They failed between max=734.835s ago at oldest_failed_correlation_id=<id> and min=245.7246s ago at <id>.
    • Notes: Provides time since failure, correlation IDs, and flow type.

OpenTelemetry metrics

OpenTelemetry metrics provide detailed, customizable telemetry data for monitoring flow and task performance. These metrics can be integrated with external tools like Prometheus or Grafana to track and analyze trends. Metrics are emitted in milliseconds and help track performance thresholds.

Parameters

le (“less than or equal to”) represents a duration threshold in milliseconds. It captures data for events with durations up to and including the specified value. For example, le="600000.0" tracks events lasting 10 minutes or less, while le="inf" includes all durations.

name indicates the name of the flow or task.
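
As a worked example, if these metrics are scraped into Prometheus, a selector like the one below counts the currently running flows of a given type whose machine processing time is at most 10 minutes. The flow name is an illustrative placeholder, and PromQL itself is an assumption about your monitoring stack:

# PromQL (sketch): running DOCUMENT_PROCESSING flows with machine duration <= 10 minutes
hyperflow_running_flows_by_machine_duration_count_gauge{name="DOCUMENT_PROCESSING", le="600000.0"}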

Flow-run metrics

  • hyperflow_running_flows_by_machine_duration_count_gauge
    • Description: Number of currently running flows. Only flows currently in machine processing are included. Sorted by flow name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_flows_by_machine_duration_total_gauge
    • Description: Sum of machine durations of currently running flows. Only flows currently in machine processing are included. Sorted by flow name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_flows_by_total_duration_count_gauge
    • Description: Number of currently running flows. Sorted by flow name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_flows_by_total_duration_total_gauge
    • Description: Sum of total durations of currently running flows. Sorted by flow name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_flows_max_machine_duration_gauge
    • Description: Maximum machine time that any currently running flow has spent. Sorted by flow name.
    • Parameters: name

  • hyperflow_running_flows_max_total_duration_gauge
    • Description: Maximum total time (machine + manual) that any currently running flow has spent. Sorted by flow name.
    • Parameters: name

  • hyperflow_failed_flows_count
    • Description: Number of FAILED flows. Sorted by flow name.
    • Parameters: name

  • hyperflow_failed_flows_oldest
    • Description: Milliseconds since the oldest flow failure. Sorted by flow name.
    • Parameters: name

  • hyperflow_failed_flows_newest
    • Description: Milliseconds since the most recent flow failure. Sorted by flow name.
    • Parameters: name
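
As an illustration of how these gauges can be used (assuming they are exported under these names to a Prometheus-compatible backend), the expression below converts the failure-age metric into minutes per flow name:

# PromQL (sketch): minutes since the oldest flow failure, per flow name
hyperflow_failed_flows_oldest / 60000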

Task metrics

  • hyperflow_scheduled_tasks_by_duration_count_gauge
    • Description: Number of currently scheduled tasks. Sorted by task name and duration threshold.
    • Parameters: name, le

  • hyperflow_scheduled_tasks_by_duration_total_gauge
    • Description: Sum of time (in milliseconds) that currently scheduled (pending) tasks have spent in this state. Sorted by task name and duration threshold.
    • Parameters: name, le

  • hyperflow_scheduled_tasks_max_duration_gauge
    • Description: Maximum time (in milliseconds) that any currently scheduled (pending) task has spent in this state. Sorted by task name.
    • Parameters: name

  • hyperflow_running_tasks_by_duration_count_gauge
    • Description: Number of currently running tasks. Sorted by task name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_tasks_by_duration_total_gauge
    • Description: Sum of time (in milliseconds) that currently running tasks have spent in this state. Sorted by task name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_tasks_max_duration_gauge
    • Description: Maximum time (in milliseconds) that any currently running task has spent in this state. Sorted by task name.
    • Parameters: name

  • hyperflow_still_terminal_tasks_max_duration_gauge
    • Description: Maximum time (in milliseconds) that any current terminal task has spent in the terminal state.
    • Parameters: none

  • hyperflow_still_terminal_tasks_mean_duration_gauge
    • Description: Average time (in milliseconds) that current terminal tasks have spent in the terminal state.
    • Parameters: none

  • hyperflow_still_terminal_tasks_count
    • Description: Number of tasks in the terminal state.
    • Parameters: none
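
As a quick illustration (again assuming a Prometheus-compatible backend), the expression below surfaces the longest current pending time per task name, which pairs naturally with the Task Pending Time alert in the next section:

# PromQL (sketch): longest time any scheduled (pending) task has been waiting, per task name
max by (name) (hyperflow_scheduled_tasks_max_duration_gauge)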

Recommended alerts

Configure alerts for the following key metrics. A sample alerting-rules sketch follows the list:

  1. Flow Processing Time

    • Metric: hyperflow_running_flows_max_machine_duration_gauge

    • Log message: Workflow has been in machine processing for machine_duration=7200s, correlation_id=<correlation_id>

    • Recommended threshold: 7,200,000ms (2 hours).

    • Alert description: If this alert is triggered, a flow has been in machine processing for more than 2 hours.

  2. Total Flow Duration

    • Metric: hyperflow_running_flows_max_total_duration_gauge

    • Log message: Workflow has been running for total_duration=86400s, correlation_id=<correlation_id>

    • Recommended threshold: 86,400,000ms (24 hours)

    • Alert description: If this alert is triggered, a flow has taken more than 24 hours to complete. The alert is triggered regardless of time spent in machine or manual processing.

  3. Task Pending Time

    • Metric: hyperflow_scheduled_tasks_max_duration_gauge

    • Log message: Task has been scheduled for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, reference_name=<reference_name>, duration=600s

    • Recommended threshold: 600,000ms (10 minutes)

    • Alert description: If this alert is triggered, the system may be underprovisioned (not enough workers), or workers may be experiencing issues.

  4. Task Running Time

    • Metric: hyperflow_running_tasks_max_duration_gauge

    • Log message: Task has been running for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, reference_name=<reference_name>, duration=1800s

    • Recommended threshold: 1,800,000ms (30 minutes)

    • Alert description: This alert indicates that a worker may be stalled while processing a task.

  5. Task in Terminal State

    • Metric: hyperflow_still_terminal_tasks_max_duration_gauge

    • Log message: We have number_tasks=15 tasks in terminal state waiting for hyperflow_engine to process them. The oldest one has been waiting for duration=60s.

    • Recommended threshold: 60,000ms (1 minute)

    • Alert description: The Hyperflow Engine cannot catch up, possibly due to underprovisioning or other system issues.
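
To make these recommendations concrete, below is a minimal Prometheus alerting-rules sketch covering alerts 1, 3, and 5. It assumes the metrics are exported under the names above to a Prometheus-compatible backend; the group name, alert names, severity labels, and "for" durations are illustrative choices, not platform requirements:

# prometheus-rules.yaml (sketch; names, labels, and "for" durations are illustrative)
groups:
  - name: hyperscience-slow-flows
    rules:
      - alert: FlowMachineProcessingTooLong
        # 7,200,000 ms = 2 hours of machine processing (alert 1)
        expr: hyperflow_running_flows_max_machine_duration_gauge > 7200000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Flow {{ $labels.name }} has been in machine processing for over 2 hours"
      - alert: TaskPendingTooLong
        # 600,000 ms = 10 minutes in the scheduled (pending) state (alert 3)
        expr: hyperflow_scheduled_tasks_max_duration_gauge > 600000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Task {{ $labels.name }} has been pending for over 10 minutes"
      - alert: TerminalTasksBacklog
        # 60,000 ms = 1 minute waiting for hyperflow_engine to pick up terminal tasks (alert 5)
        expr: hyperflow_still_terminal_tasks_max_duration_gauge > 60000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Terminal tasks have been waiting over 1 minute for hyperflow_engine"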

Learn more about monitoring the application in our Monitoring Hyperscience section.