Detecting Slow Flow Runs

The Hyperscience Platform provides built-in capabilities to detect flow runs (or submissions) that take longer than expected. These regular checks flag potential delays in real time through application logs and OpenTelemetry metrics, allowing users to set up alerts for Service Level Agreement (SLA) risks without relying on manual monitoring.

Availability

This feature is available in v39.0 and later.

In this article, you’ll learn:

  • How to configure the detection of slow flow runs

  • What predefined thresholds are used and how they can be customized

  • Where the checks for slow flows are reported

  • What application log messages are and how to interpret them

  • What OpenTelemetry metrics are available for monitoring

  • How to set up alerts for potential SLA risks

Activating slow flow detection

To activate the detection of slow flow runs, add the following line to your deployment configuration:

  • For Docker Compose: Update the .env file:

HYPERFLOW_DETECT_SLOW_FLOWS=true

  • For Kubernetes: Update the values.yaml file:

HYPERFLOW_DETECT_SLOW_FLOWS: true
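
If your Kubernetes deployment passes application settings through an environment block, the placement might look like the minimal values.yaml sketch below. The enclosing env key is an assumption about your chart layout, not a documented requirement; adjust it to match your deployment:

# values.yaml (sketch; the enclosing env block is an assumption about your chart layout)
env:
  HYPERFLOW_DETECT_SLOW_FLOWS: true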

Predefined thresholds

Hyperscience uses predefined thresholds to identify slow flow runs and tasks. These thresholds determine when alerts are triggered and can be customized through environment variables. The standard thresholds are listed below:

Flow-run thresholds

  • Machine Processing Time (total time, Supervision excluded):

    • Thresholds: 5m, 10m, 15m, 30m, 1h, 2h, 4h, 6h, 12h, 1d, 2d, 3d, 1w, 2w, 5w, 10w, 30w, 60w, 120w, 300w.

  • Total Flow Duration (machine + manual time):

    • Thresholds: 10m, 30m, 1h, 2h, 4h, 6h, 12h, 1d, 2d, 3d, 1w, 2w, 5w, 10w, 30w, 60w, 120w, 300w.

Task thresholds

  • Tasks Pending and Running:

    • Thresholds: 1m, 5m, 10m, 15m, 30m, 1h, 2h, 4h, 6h, 12h, 1d, 2d, 3d, 1w, 2w, 5w, 10w, 30w, 60w, 120w, 300w.

Logs

The Hyperscience Platform generates detailed log entries whenever flow runs or tasks exceed predefined thresholds. These logs provide valuable, actionable insights, allowing you to identify potential SLA risks and system bottlenecks quickly. Below is an overview of the types of issues detected and their corresponding log messages to help you interpret and act on them effectively.

Application Logs

Application logs capture real-time messages when flow runs or tasks exceed thresholds, offering clear and actionable details to quickly identify delays or potential issues. See details and descriptions in the list below:

  • Condition: Flow run exceeds machine processing time
    • Log value: Workflow has been in machine processing for machine_duration=362.252s, correlation_id=<correlation_id>
    • Notes: machine_duration and correlation_id values vary.

  • Condition: Flow run not completing
    • Log value: Workflow has been running for total_duration=623.251s, correlation_id=<correlation_id>
    • Notes: total_duration and correlation_id values vary.

  • Condition: Task pending for too long
    • Log value: Task has been scheduled for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, duration=60s
    • Notes: Task-related identifiers and duration values vary.

  • Condition: Task running for too long
    • Log value: Task has been running for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, duration=60s
    • Notes: Task-related identifiers and duration values vary.

  • Condition: Tasks in terminal state
    • Log value: We have number_tasks=15 tasks in terminal state waiting for hyperflow_engine to process them. The oldest one has been waiting for duration=60s.
    • Notes: number_tasks and duration values vary.

  • Condition: Failed flow runs
    • Log value: There are num_flows=10 FAILED flows of type flow_name=DOCUMENT_PROCESSING. They failed between max=734.835s ago at oldest_failed_correlation_id=<id> and min=245.7246s ago at <id>.
    • Notes: Provides time since failure, correlation IDs, and flow type.

OpenTelemetry metrics

OpenTelemetry metrics provide detailed, customizable telemetry data for monitoring flow and task performance. These metrics can be integrated with external tools like Prometheus or Grafana to track and analyze trends. Metrics are emitted in milliseconds and help track performance thresholds.

Parameters

le (“less than or equal to”) represents a duration threshold in milliseconds. It captures data for events with durations up to and including the specified value. For example, le="600000.0" tracks events lasting 10 minutes or less, while le="inf" includes all durations.

name indicates the name of the flow or task.
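
As a worked example, if these metrics are scraped into Prometheus, a selector like the one below counts the currently running flows of a given type whose machine processing time is at most 10 minutes. The flow name is an illustrative placeholder, and PromQL itself is an assumption about your monitoring stack:

# PromQL (sketch): running DOCUMENT_PROCESSING flows with machine duration <= 10 minutes
hyperflow_running_flows_by_machine_duration_count_gauge{name="DOCUMENT_PROCESSING", le="600000.0"}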

Flow-run metrics

  • hyperflow_running_flows_by_machine_duration_count_gauge
    • Description: Number of currently running flows. Only flows currently in machine processing are included. Sorted by flow name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_flows_by_machine_duration_total_gauge
    • Description: Sum of machine durations of currently running flows. Only flows currently in machine processing are included. Sorted by flow name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_flows_by_total_duration_count_gauge
    • Description: Number of currently running flows. Sorted by flow name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_flows_by_total_duration_total_gauge
    • Description: Sum of total durations of currently running flows. Sorted by flow name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_flows_max_machine_duration_gauge
    • Description: Maximum machine time that any currently running flow has spent. Sorted by flow name.
    • Parameters: name

  • hyperflow_running_flows_max_total_duration_gauge
    • Description: Maximum total time (machine + manual) that any currently running flow has spent. Sorted by flow name.
    • Parameters: name

  • hyperflow_failed_flows_count
    • Description: Number of FAILED flows. Sorted by flow name.
    • Parameters: name

  • hyperflow_failed_flows_oldest
    • Description: Milliseconds since the oldest flow failure. Sorted by flow name.
    • Parameters: name

  • hyperflow_failed_flows_newest
    • Description: Milliseconds since the most recent flow failure. Sorted by flow name.
    • Parameters: name
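
As an illustration of how these gauges can be used (assuming they are exported under these names to a Prometheus-compatible backend), the expression below converts the failure-age metric into minutes per flow name:

# PromQL (sketch): minutes since the oldest flow failure, per flow name
hyperflow_failed_flows_oldest / 60000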

Task metrics

  • hyperflow_scheduled_tasks_by_duration_count_gauge
    • Description: Number of currently scheduled tasks. Sorted by task name and duration threshold.
    • Parameters: name, le

  • hyperflow_scheduled_tasks_by_duration_total_gauge
    • Description: Sum of time (in milliseconds) that currently scheduled (pending) tasks have spent in this state. Sorted by task name and duration threshold.
    • Parameters: name, le

  • hyperflow_scheduled_tasks_max_duration_gauge
    • Description: Maximum time (in milliseconds) that any currently scheduled (pending) task has spent in this state. Sorted by task name.
    • Parameters: name

  • hyperflow_running_tasks_by_duration_count_gauge
    • Description: Number of currently running tasks. Sorted by task name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_tasks_by_duration_total_gauge
    • Description: Sum of time (in milliseconds) that currently running tasks have spent in this state. Sorted by task name and duration threshold.
    • Parameters: name, le

  • hyperflow_running_tasks_max_duration_gauge
    • Description: Maximum time (in milliseconds) that any currently running task has spent in this state. Sorted by task name.
    • Parameters: name

  • hyperflow_still_terminal_tasks_max_duration_gauge
    • Description: Maximum time (in milliseconds) that any current terminal task has spent in the terminal state.
    • Parameters: none

  • hyperflow_still_terminal_tasks_mean_duration_gauge
    • Description: Average time (in milliseconds) that current terminal tasks have spent in the terminal state.
    • Parameters: none

  • hyperflow_still_terminal_tasks_count
    • Description: Number of tasks in the terminal state.
    • Parameters: none
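
As a quick illustration (again assuming a Prometheus-compatible backend), the expression below surfaces the longest current pending time per task name, which pairs naturally with the Task Pending Time alert in the next section:

# PromQL (sketch): longest time any scheduled (pending) task has been waiting, per task name
max by (name) (hyperflow_scheduled_tasks_max_duration_gauge)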

Recommended alerts

Configure alerts for the following key metrics. A sample alerting-rules sketch follows the list:

  1. Flow Processing Time

    • Metric: hyperflow_running_flows_max_machine_duration_gauge

    • Log message: Workflow has been in machine processing for machine_duration=7200s, correlation_id=<correlation_id>

    • Recommended threshold: 7,200,000ms (2 hours).

    • Alert description: If this alert is triggered, a flow has been in machine processing for more than 2 hours.

  2. Total Flow Duration

    • Metric: hyperflow_running_flows_max_total_duration_gauge

    • Log message: Workflow has been running for total_duration=86400s, correlation_id=<correlation_id>

    • Recommended threshold: 86,400,000ms (24 hours)

    • Alert description: If this alert is triggered, a flow has taken more than 24 hours to complete. The alert is triggered regardless of time spent in machine or manual processing.

  3. Task Pending Time

    • Metric: hyperflow_scheduled_tasks_max_duration_gauge

    • Log message: Task has been scheduled for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, reference_name=<reference_name>, duration=600s

    • Recommended threshold: 600,000ms (10 minutes)

    • Alert description: If this alert is triggered, the system may be underprovisioned (not enough workers), or workers may be experiencing issues.

  4. Task Running Time

    • Metric: hyperflow_running_tasks_max_duration_gauge

    • Log message: Task has been running for a while. task_uuid=<task_uuid>, correlation_id=<correlation_id>, workflow_instance_id=<workflow_instance_id>, task_name=<task_name>, reference_name=<reference_name>, duration=1800s

    • Recommended threshold: 1,800,000ms (30 minutes)

    • Alert description: This alert indicates that a worker may be stalled while processing a task.

  5. Task in Terminal State

    • Metric: hyperflow_still_terminal_tasks_max_duration_gauge

    • Log message: We have number_tasks=15 tasks in terminal state waiting for hyperflow_engine to process them. The oldest one has been waiting for duration=60s.

    • Recommended threshold: 60,000ms (1 minute)

    • Alert description: The Hyperflow Engine cannot catch up, possibly due to underprovisioning or other system issues.
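
To make these recommendations concrete, below is a minimal Prometheus alerting-rules sketch covering alerts 1, 3, and 5. It assumes the metrics are exported under the names above to a Prometheus-compatible backend; the group name, alert names, severity labels, and "for" durations are illustrative choices, not platform requirements:

# prometheus-rules.yaml (sketch; names, labels, and "for" durations are illustrative)
groups:
  - name: hyperscience-slow-flows
    rules:
      - alert: FlowMachineProcessingTooLong
        # 7,200,000 ms = 2 hours of machine processing (alert 1)
        expr: hyperflow_running_flows_max_machine_duration_gauge > 7200000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Flow {{ $labels.name }} has been in machine processing for over 2 hours"
      - alert: TaskPendingTooLong
        # 600,000 ms = 10 minutes in the scheduled (pending) state (alert 3)
        expr: hyperflow_scheduled_tasks_max_duration_gauge > 600000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Task {{ $labels.name }} has been pending for over 10 minutes"
      - alert: TerminalTasksBacklog
        # 60,000 ms = 1 minute waiting for hyperflow_engine to pick up terminal tasks (alert 5)
        expr: hyperflow_still_terminal_tasks_max_duration_gauge > 60000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Terminal tasks have been waiting over 1 minute for hyperflow_engine"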

Learn more about monitoring the application in our Monitoring Hyperscience section.