The Hyperscience Platform provides built-in capabilities to detect flow runs (or submissions) that take longer than expected. These regular checks ensure that potential delays are flagged in real time through application logs and OpenTelemetry metrics, allowing users to set up alerts for Service Level Agreement (SLA) risks without relying on manual monitoring.
Availability
This feature is available in v39.0 and later.
In this article, you’ll learn:
How to configure the detection of slow flow runs
What predefined thresholds are used and how they can be customized
Where the checks for slow flows are reported
What application log messages are and how to interpret them
What OpenTelemetry metrics are available for monitoring
How to set up alerts for potential SLA risks
Activating slow flow detection
To activate the detection of slow flow runs, add the following line to your deployment configuration:
For Docker Compose, update the `.env` file:
`HYPERFLOW_DETECT_SLOW_FLOWS=true`
For Kubernetes, update the `values.yaml` file:
`HYPERFLOW_DETECT_SLOW_FLOWS: true`
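For reference, a minimal sketch of how the flag might appear in a Kubernetes `values.yaml` is shown below. The surrounding `env` block is an assumption, as the exact location for application environment variables depends on your Hyperscience chart:

```yaml
# Sketch only: place HYPERFLOW_DETECT_SLOW_FLOWS wherever your chart
# defines application environment variables; the "env" key below is an
# assumed structure, not a documented one.
env:
  HYPERFLOW_DETECT_SLOW_FLOWS: true
```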
Predefined thresholds
Hyperscience uses predefined thresholds to identify slow flow-runs or tasks. These thresholds determine when alerts are triggered and can be customized through environment variables. Below are the standard thresholds:
Flow-Run thresholds
Machine Processing Time (total time, Supervision excluded):
Thresholds: 5m, 10m, 15m, 30m, 1h, 2h, 4h, 6h, 12h, 1d, 2d, 3d, 1w, 2w, 5w, 10w, 30w, 60w, 120w, 300w.
Total Flow Duration (machine + manual time):
Thresholds: 10m, 30m, 1h, 2h, 4h, 6h, 12h, 1d, 2d, 3d, 1w, 2w, 5w, 10w, 30w, 60w, 120w, 300w.
Task thresholds
Tasks Pending and Running
Thresholds: 1m, 5m, 10m, 15m, 30m, 1h, 2h, 4h, 6h, 12h, 1d, 2d, 3d, 1w, 2w, 5w, 10w, 30w, 60w, 120w, 300w.
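Because the OpenTelemetry metrics described below are emitted in milliseconds, each threshold corresponds to a millisecond value. A few illustrative conversions (not a configuration file):

```yaml
# Illustrative conversions from threshold labels to milliseconds.
5m: 300000        # 5 minutes
10m: 600000       # 10 minutes, e.g. le="600000.0"
1h: 3600000       # 1 hour
1d: 86400000      # 1 day
1w: 604800000     # 1 week
```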
Logs
The Hyperscience Platform generates detailed log entries whenever flow runs or tasks exceed predefined thresholds. These logs provide valuable, actionable insights, allowing you to identify potential SLA risks and system bottlenecks quickly. Below is an overview of the types of issues detected and their corresponding log messages to help you interpret and act on them effectively.
Application Logs
Application logs capture real-time messages when flow runs or tasks exceed thresholds, offering clear and actionable details to quickly identify delays or potential issues. See details and descriptions in the table below:
| Condition | Log Value | Notes |
|---|---|---|
| Flow run exceeds machine processing time | Workflow has been in machine processing for machine_duration=<duration>, correlation_id=<correlation_id> | Provides the machine-processing duration and the correlation ID of the flow run. |
| Flow run not completing | Workflow has been running for total_duration=<duration>, correlation_id=<correlation_id> | Provides the total (machine + manual) duration and the correlation ID of the flow run. |
| Task pending for too long | Task has been scheduled for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, reference_name=<reference_name>, duration=<duration> | Task-related identifiers and the time the task has been pending. |
| Task running for too long | Task has been running for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, reference_name=<reference_name>, duration=<duration> | Task-related identifiers and the time the task has been running. |
| Tasks in terminal state | We have number_tasks=<number_tasks> tasks in terminal state waiting for hyperflow_engine to process them. The oldest one has been waiting for duration=<duration>. | Provides the number of waiting tasks and how long the oldest one has been waiting. |
| Failed flow-runs | There are | Provides time since failure, correlation IDs, and flow type. |
OpenTelemetry metrics
OpenTelemetry metrics provide detailed, customizable telemetry data for monitoring flow and task performance. These metrics can be integrated with external tools like Prometheus or Grafana to track and analyze trends. Metrics are emitted in milliseconds and help track performance thresholds.
Parameters
LE (“less than or equal to”) represents a duration threshold in milliseconds. It captures data for events with durations up to and including the specified value. For example, `le="600000.0"` tracks events lasting 10 minutes or less, while `le="inf"` includes all durations.
NAME indicates the name of the flow or the task.
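As an illustration of how the LE and NAME parameters could be used, the sketch below defines a Prometheus recording rule that counts, per flow name, the flows that have been running longer than 10 minutes by subtracting the `le="600000.0"` bucket from the `le="inf"` bucket. The metric name `hyperflow_running_flows_gauge` and the label keys `name` and `le` are placeholders; substitute the actual metric names from the tables below and the labels your collector exports.

```yaml
# A minimal sketch, assuming the metrics are scraped into Prometheus and
# that the NAME and LE parameters surface as "name" and "le" labels.
# hyperflow_running_flows_gauge is a placeholder metric name.
groups:
  - name: hyperscience-parameters-example
    rules:
      - record: hyperscience:flows_running_over_10m
        # All running flows minus those at or under 600000 ms (10 minutes).
        expr: |
          sum by (name) (hyperflow_running_flows_gauge{le="inf"})
            - sum by (name) (hyperflow_running_flows_gauge{le="600000.0"})
```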
Flow-run metrics
| Metric Name | Description | Parameters |
|---|---|---|
| | Number of currently running flows. Only flows that are currently in machine processing will be included. Sorted by flow name and duration threshold. | LE, NAME |
| | Sum of machine durations of currently running flows. Only flows that are currently in machine processing will be included. Sorted by flow name and duration threshold. | LE, NAME |
| | Number of currently running flows. Sorted by flow name and duration threshold. | LE, NAME |
| | Sum of total durations of running flows. Sorted by flow name and duration threshold. | LE, NAME |
| hyperflow_running_flows_max_machine_duration_gauge | Maximum machine time that any currently running flow has spent. Sorted by flow name. | NAME |
| hyperflow_running_flows_max_total_duration_gauge | Maximum total time (machine + manual) that any currently running flow has spent. Sorted by flow name. | NAME |
| | Number of FAILED flows. Sorted by flow name. | NAME |
| | Milliseconds since the oldest flow failure. Sorted by flow name. | NAME |
| | Milliseconds since the most recent flow failure. Sorted by flow name. | NAME |
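For trend dashboards, the max-duration gauges above can be recorded or graphed directly. Below is a minimal sketch of a Prometheus recording rule, assuming the NAME parameter is exposed as a `name` label (adjust to the labels your collector actually exports):

```yaml
# Sketch only: records the longest machine-processing time per flow name,
# in milliseconds, from the gauge documented above.
groups:
  - name: hyperscience-flow-run-recording
    rules:
      - record: hyperscience:flow_machine_duration_ms:max
        expr: max by (name) (hyperflow_running_flows_max_machine_duration_gauge)
```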
Task Metrics
| Metric Name | Description | Parameters |
|---|---|---|
| | Number of currently scheduled tasks. Sorted by task name and duration threshold. | LE, NAME |
| | Sum of time (in milliseconds) that currently scheduled (pending) tasks have spent in this state. Sorted by task name and duration threshold. | LE, NAME |
| hyperflow_scheduled_tasks_max_duration_gauge | Maximum time (in milliseconds) that currently scheduled (pending) tasks have spent in this state. Sorted by task name. | NAME |
| | Number of currently running tasks. Sorted by task name and duration threshold. | LE, NAME |
| | Sum of time (in milliseconds) that currently running tasks have spent in this state. Sorted by task name and duration threshold. | LE, NAME |
| hyperflow_running_tasks_max_duration_gauge | Maximum time (in milliseconds) that currently running tasks have spent in this state. Sorted by task name and duration threshold. | LE, NAME |
| hyperflow_still_terminal_tasks_max_duration_gauge | Maximum time spent in terminal state (in milliseconds) of current terminal tasks. | - |
| | Average time spent in the terminal state (in milliseconds) of any current terminal task. | - |
| | Number of tasks in the terminal state. | - |
Recommended alerts
Configure alerts for the following key metrics:
Flow Processing Time
Metric:
hyperflow_running_flows_max_machine_duration_gauge
Log message:
Workflow has been in machine processing for machine_duration=7200s, correlation_id=<correlation_id>
Recommended threshold: 7,200,000ms (2 hours).
Alert description: If this alert is triggered, a flow has been in machine processing for more than 2 hours.
Total Flow Duration
Metric:
hyperflow_running_flows_max_total_duration_gauge
Log message:
Workflow has been running for total_duration=86400s, correlation_id=<correlation_id>
Recommended threshold: 86,400,000ms (24 hours)
Alert description: If this alert is triggered, a flow has taken more than 24 hours to complete. The alert is triggered regardless of time spent in machine or manual processing.
Task Pending Time
Metric:
hyperflow_scheduled_tasks_max_duration_gauge
Log message:
Task has been scheduled for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, reference_name=<reference_name>, duration=600s
Recommended threshold: 600,000ms (10 minutes)
Alert description: If this alert is triggered, the system may be underprovisioned (not enough workers), or workers may be experiencing issues.
Task Running Time
Metric:
hyperflow_running_tasks_max_duration_gauge
Log message:
Task has been running for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, reference_name=<reference_name>, duration=1800s
Recommended threshold: 1,800,000ms (30 minutes)
Alert description: This alert indicates that a worker may be stalled while processing a task.
Task in Terminal State
Metric:
hyperflow_still_terminal_tasks_max_duration_gauge
Log message:
We have number_tasks=15 tasks in terminal state waiting for hyperflow_engine to process them. The oldest one has been waiting for duration=60s.
Recommended threshold: 60,000ms (1 minute)
Alert description: The Hyperflow Engine cannot catch up, possibly due to underprovisioning or other system issues.
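If these metrics are scraped into Prometheus, the recommended alerts above can be expressed roughly as the following alerting rules. This is a minimal sketch: the metric names and thresholds come from this article, while the group and alert names, `for:` windows, and severity labels are placeholders to adapt to your own alerting conventions.

```yaml
groups:
  - name: hyperscience-sla   # group name is a placeholder
    rules:
      - alert: FlowMachineProcessingTooLong
        # A flow has been in machine processing for more than 2 hours.
        expr: hyperflow_running_flows_max_machine_duration_gauge > 7200000
        for: 5m
        labels:
          severity: warning
      - alert: FlowTotalDurationTooLong
        # A flow has taken more than 24 hours in total (machine + manual).
        expr: hyperflow_running_flows_max_total_duration_gauge > 86400000
        for: 5m
        labels:
          severity: warning
      - alert: TaskPendingTooLong
        # Tasks pending more than 10 minutes; workers may be underprovisioned.
        expr: hyperflow_scheduled_tasks_max_duration_gauge > 600000
        for: 5m
        labels:
          severity: warning
      - alert: TaskRunningTooLong
        # A task has been running for more than 30 minutes; a worker may be stalled.
        expr: hyperflow_running_tasks_max_duration_gauge > 1800000
        for: 5m
        labels:
          severity: warning
      - alert: TerminalTasksNotProcessed
        # Completed tasks have waited more than 1 minute for the Hyperflow Engine.
        expr: hyperflow_still_terminal_tasks_max_duration_gauge > 60000
        for: 5m
        labels:
          severity: critical
```

A short `for:` window helps avoid alerts firing on a single delayed scrape; tune it alongside your scrape interval.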
Learn more about monitoring the application in our Monitoring Hyperscience section.