OpenTelemetry

This feature is available in v37 and later.

OpenTelemetry allows you to monitor the usage and health of your Hyperscience instance alongside that of other applications in your IT infrastructure. The data stream includes metrics for submission volume and throughput, time to completion for blocks and submissions, response times, error rates, and connectivity issues, among others.

The Hyperscience application emits only traces and metrics telemetry and is instrumented with the OpenTelemetry Python SDK. For more details, see the SDK’s GitHub repository page. Logs are not supported in OpenTelemetry format at this time.

Enabling OpenTelemetry in Hyperscience

Follow the steps below to expose an OpenTelemetry data stream to your application-performance monitoring tool.

Kubernetes deployments

1. Enable OpenTelemetry metrics and traces

To configure the Hyperscience Helm chart to export OpenTelemetry metrics and traces to an OpenTelemetry collector, add the following to the values.yaml file:

opentelemetry:
  enabled: true
  metrics:
    endpoint: 'http://<collector-fqdn>:<port>/v1/metrics'
  traces:
    endpoint: 'http://<collector-fqdn>:<port>/v1/traces'

By default, the Hyperscience application uses an OTLP exporter with the HTTP + protobuf protocol. See OpenTelemetry’s OTLP Exporter Configuration for more information.

NOTE: In principle, Hyperscience could be configured with the gRPC protocol. However, we strongly discourage the use of OTLP/gRPC with Hyperscience. Our internal tests surfaced critical issues with the currently used versions of the Python OpenTelemetry client and gRPC: in a multiprocessing environment, this combination results in segfaults, hanging threads, and gRPC exceptions. For more details, see the "Forking with PeriodicExportingMetricReader results in ValueError" issue in GitHub.
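On the receiving side, the collector needs an OTLP receiver with the HTTP protocol enabled. The snippet below is a minimal sketch of an OpenTelemetry Collector configuration for this purpose; the debug exporter is only a stand-in for whichever backend exporter you actually use, and 4318 is the collector's default OTLP/HTTP port:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug: {}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]
    traces:
      receivers: [otlp]
      exporters: [debug]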

2. Secure an OTLP connection

It’s possible to enable TLS for the telemetry stream emitted from the Hyperscience application. However, only TLS is supported in OpenTelemetry Python; mTLS is not currently supported (see details in the "Support mtls for otlp exporter" issue in GitHub).

Note that, if you’re exporting OTLP data with TLS to an OpenTelemetry collector, then the collector must be configured to receive encrypted data.
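For example, a collector's OTLP/HTTP receiver is typically configured with a server certificate and private key. The sketch below assumes illustrative file paths, and the exact key names can vary slightly between collector versions:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: /etc/otelcol/certs/server.crt
          key_file: /etc/otelcol/certs/server.key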

To enable TLS in the Hyperscience application:

a. Add the server certificate to your cluster as a Kubernetes Secret object (an example kubectl command is shown after the list below). Then, add the following snippet to the values.yaml file for the Helm chart (shown below are the default values from the chart, which are used if a given property is omitted):

opentelemetry:
  tls:
    certSecretName: ''
    certName: 'certificate.pem'

  • opentelemetry.tls.certSecretName should be the name of the created Kubernetes Secret holding the server certificate.

  • opentelemetry.tls.certName should be the name of the item inside the Secret data where the certificate is stored (defaults to certificate.pem).
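For example, a matching Secret could be created with kubectl. The Secret name (otel-server-cert) used here is illustrative; replace it and the namespace with your own values:

kubectl create secret generic otel-server-cert \
  --namespace <hyperscience-namespace> \
  --from-file=certificate.pem=./certificate.pem

With this Secret, certSecretName would be set to 'otel-server-cert', and certName would keep its default of 'certificate.pem'.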

b. The scheme of the endpoints must be changed to “https://”. Additionally, the port must be changed to the TLS receiving port of the OpenTelemetry collector (see Secure an OTLP connection under "Docker Compose deployments" in this article for more information).

opentelemetry:
  enabled: true
  metrics:
    endpoint: 'https://<collector-fqdn>:<TLS-port>/v1/metrics'
  traces:
    endpoint: 'https://<collector-fqdn>:<TLS-port>/v1/traces'

NOTE: The value of the OTEL_EXPORTER_OTLP_CERTIFICATE “.env” file variable is automatically set by the Hyperscience Helm chart; it shouldn’t be configured manually!

3. Add extra configuration variables as needed.

OpenTelemetry Python's opentelemetry.sdk.environment_variables describes the variables provided by the SDK. Hyperscience-specific environment variables are also available; they are described at the end of this document.

Environment variables should be added to both the app and trainer sections in values.yaml. For example, if you would like metrics to be exported every 10 seconds (instead of the default 30 seconds):

app:
  dotenv:
    OTEL_METRIC_EXPORT_INTERVAL=10000
…
trainer:
  env:
    OTEL_METRIC_EXPORT_INTERVAL=10000

Docker Compose deployments

1.  Enable OpenTelemetry metrics and traces.

To enable OpenTelemetry in the Hyperscience application to export metrics and traces to an OpenTelemetry collector, the following environment variables must be set in the “.env” file:

OTEL_SDK_DISABLED=false
OTEL_PYTHON_LOG_CORRELATION=true

OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://<collector-fqdn>:<port>/v1/metrics
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://<collector-fqdn>:<port>/v1/traces

Setting OTEL_SDK_DISABLED to false enables metrics and traces instrumentation in the application, and setting OTEL_PYTHON_LOG_CORRELATION to true adds span and trace IDs to the application logs.

Our application also sets default values for the following environment variables:

OTEL_SERVICE_NAME=hyperscience

OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp

OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf'

OTEL_METRIC_EXPORT_INTERVAL=30000

By default, the Hyperscience application uses an OTLP exporter with the HTTP + protobuf protocol. See OpenTelemetry’s OTLP Exporter Configuration for more information.

NOTE: In principle, Hyperscience could be configured with the gRPC protocol. However, we strongly discourage the use of OTLP/gRPC with Hyperscience. Our internal tests surfaced critical issues with the currently used versions of the Python OpenTelemetry client and gRPC: in a multiprocessing environment, this combination results in segfaults, hanging threads, and gRPC exceptions. For more details, see the "Forking with PeriodicExportingMetricReader results in ValueError" issue in GitHub.

If you do not add these variables to your “.env” file, the default values above are used once you set OTEL_SDK_DISABLED=false in your ".env" file and run one of the commands described in Editing the “.env” file and running the application.

Many more options are available for configuring OpenTelemetry exporters and protocols in the Hyperscience application. Refer to OpenTelemetry’s official documentation for details.

2.  Secure an OTLP connection.

For OTLP specifically, if you want to use a secure connection, the scheme must be changed to “https://”, and extra environment variables must be set (e.g., pointing to the certificate path, key, etc.). The certificate must be available to the Hyperscience application. The steps below give an example configuration and instructions for adding it.

  1. Create the directory that will hold the certificate, if it does not already exist. Assuming “/mnt/hs/” is used as the HS_PATH, run the following command to create the directory:

    mkdir -p /mnt/hs/certs
  2. Copy the certificate into this directory, and then set its ownership by running the following command:

    chown -R 1000:1000 /mnt/hs/certs
  3. If SELinux is enabled, execute:

    chcon -t container_file_t -R /mnt/hs/certs/

    Each time a file is added to the certs directory above for any reason, you will need to execute the chcon command again.

  4. Add the following to the “.env” file:

    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=https://<collector-fqdn>:<TLS-port>/v1/metrics
    OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=https://<collector-fqdn>:<TLS-port>/v1/traces
    OTEL_EXPORTER_OTLP_CERTIFICATE=/etc/nginx/certs/<certificate-file-name>

More information is provided in OpenTelemetry’s documentation.

3.  Add extra configuration variables as needed.

OpenTelemetry Python's opentelemetry.sdk.environment_variables describes the variables provided by the SDK. Hyperscience-specific environment variables are also available; they are described at the end of this document. As usual, any environment variables should be configured in the “.env” file.
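For example, to export metrics every 10 seconds instead of the default 30 seconds (the same change shown for Kubernetes deployments above), add the following line to the “.env” file:

OTEL_METRIC_EXPORT_INTERVAL=10000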

OpenTelemetry traces

Hyperscience creates and emits a trace for every created submission and every API request received. Traces are useful when troubleshooting issues.

Application logs contain trace_id and span_id data, as shown below. These IDs are added to the logs automatically when the OTEL_PYTHON_LOG_CORRELATION “.env” file variable is set to true.

[Screenshot: application log entries showing trace_id and span_id values]

When troubleshooting a problem (e.g., latency, errors), you can copy the trace_id from the corresponding logs and search for it in your tracing backend, if one is configured. An example with Grafana and Tempo is shown below.

[Screenshot: an example trace viewed in Grafana with Tempo]

Note that when viewing traces of submissions, gaps in time are possible in the trace. They occur when the submission is waiting for manual tasks to be performed by keyers (e.g., Transcription Supervision).

Available Hyperscience metrics

| Group | Metric | What do we use it for | Notes |
| --- | --- | --- | --- |
| Hyperflow | hyperflow_task_count | Number of tasks; set when a task changes status. | attributes: name, ref_name, status |
| Hyperflow | hyperflow_task_backlog_duration | Duration of a task in the backlog; set when the task becomes IN_PROGRESS. | attributes: name, ref_name |
| Hyperflow | hyperflow_task_run_duration | Duration of task run time; set when the task terminates. | attributes: name, ref_name |
| Hyperflow | hyperflow_task_poll_duration | Duration of fetching a limited count of pending tasks. | attributes: name, count |
| Hyperflow | hyperflow_workflow_count | Number of flows; set when a flow changes status. | attributes: name, top_level, version, status. The 'top_level=True' tag is useful for tracking the number of submissions. |
| Hyperflow | hyperflow_workflow_run_duration | Duration of a flow run; set when the flow terminates. | attributes: name, top_level, version |
| Hyperflow | hyperflow_payload_offload_count | Number of WFE payloads offloaded to the object store. | attributes: bucket_10k (size in 10s of KiB) |
| Hyperflow | hyperflow_payload_store_duration | Duration of storing a WFE payload into the object store. | attributes: bucket_10k (size in 10s of KiB) |
| Hyperflow | hyperflow_payload_fetch_duration | Duration of fetching a WFE payload from the object store. |  |
| Hyperflow | hyperflow_engine_poll_workflows_duration | Duration of fetching workflows ready to be advanced. |  |
| Hyperflow | hyperflow_engine_advance_workflow_duration | Duration of advancing a single workflow instance. |  |
| Job queue | jobs_count | Number of jobs; set when a job changes state. | attributes: type, state |
| Job queue | jobs_backlog_duration | Duration jobs wait in the queue. | attributes: type |
| Job queue | jobs_exec_duration | Duration of the job run time. | attributes: type |
| Job queue | jobs_cpu_time | CPU time for running the job. | attributes: type |
| Job queue | jobs_system_time | System time for running the job. | attributes: type |
| Job queue | jobs_query_duration | Total duration of the DB queries executed in the job. | attributes: type |
| Job queue | jobs_worker_duration | Total duration of job execution (queries + everything else). | attributes: type |
| Job queue | jobs_worker_query_count | Number of DB queries executed in the job. | attributes: type |
| Transcription | hs_task_time_taken_ms | Milliseconds a human or machine took to transcribe a single field. | attributes: task_type (transcription), entry_type (machine/human); optional attributes: user_is_staff, username |
| Transcription | hs_machine_field_transcriptions | Milliseconds a machine took to transcribe a single field with non-zero confidence. | attributes: confidence_rd5, ml_model |
| Transcription | hs_completed_human_entries_count | Number of completed fields in a Transcription SV task. | attributes: worker (username), task_type, status (DONE) |
| Transcription | hs_finished_qa_records | Number of completed Transcription QA supervision tasks. | attributes: (multiple) |
| Submission pages | hs_submission_page_count | Number of created submission pages. |  |
| Submission pages | hs_submission_page_completed_count | Number of completed submission pages. | attributes: (optional) error_type |
| TDM (fka KDM) | hs_kdm_table_loaded_count | Number of loaded (shown) training documents for Tables through the TDM API. | example dashboard |
| TDM (fka KDM) | hs_kdm_table_saved_count | Number of updated training documents for Tables through the TDM API. |  |
| Table Layouts | hs_live_layout_tables_count | Number of tables in live layouts (sent once daily). | attributes: n_items |
| Table Layouts | hs_live_layout_columns_count | Number of table columns in live layouts (sent once daily). | attributes: n_items |
| Table Layouts | hs_working_layout_tables_count | Number of tables in draft layouts (sent once daily). | attributes: n_items |
| Table Layouts | hs_working_layout_columns_count | Number of table columns in draft layouts (sent once daily). | attributes: n_items |
| Table SV | hs_copycat_time_taken_ms | Milliseconds of copy-cat algorithm runtime (part of an API call). |  |
| Table SV | hs_table_id_qa_tasks_until_consensus | Number of times consensus was reached in Table ID QA, tagged with the number of QA tasks used. | attributes: num_qa_tasks |
| SV | task_response_time_taken_ms | Time spent on a manual supervision task. | attributes: worker (username), task_type, status (DONE) |
| SV | task_response_submit_success | Number of successfully completed manual supervision tasks. | attributes: worker (username), task_type, status (DONE) |
| SV | task_response_submit_fail | Number of invalid responses for manual supervision tasks. | attributes: worker (username), task_type, status (DONE) |
| SV | crowd_user_activity | Number of times a user starts or stops working on a manual supervision task. | attributes: worker (username), activity (take a break/start working) |
| SV | crowd_query_next_task | Miscellaneous timing for several small DB queries when fetching the next manual tasks. |  |
| DB deadlocks | hs_retry_db_transaction_count | Number of times a DB transaction is retried, e.g., due to a DB deadlock. | attributes: deadlock_retrial_count (optional, 1 when set), retry_transaction_exception_count (optional, 1 when set) |

Hyperscience-specific environment variables

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION=
[0,100,250,500,750,1000,2500,5000,7500,10sec,15sec,20sec,30sec,45sec,1min,2min,3min,
4min,5min,10min,30min,60min,3h,12h,24h]

This variable configures the bucket boundaries for all Histogram metrics that have a name ending in "duration" or "time" (e.g., http.client.duration or jobs_cpu_time). The bucket boundaries are in milliseconds; for display purposes in this document, we have used abbreviations (e.g. 60min). If you wish to change the above default values, all time representations MUST be substituted with their millisecond equivalents (e.g., 60min would become 3600000).
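For reference, here are the same default boundaries written out entirely in milliseconds, in the form the value would need to take if you set it explicitly (a direct conversion of the defaults above, mirroring the format shown in this document):

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION=
[0,100,250,500,750,1000,2500,5000,7500,10000,15000,20000,30000,45000,60000,120000,180000,
240000,300000,600000,1800000,3600000,10800000,43200000,86400000]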

The Hyperscience application uses Explicit Bucket Histograms. Explicit buckets are stated in terms of their upper boundary. Buckets are exclusive of their lower boundary and inclusive of their upper boundary, except at positive infinity. Each measurement belongs to the lowest-numbered bucket whose upper boundary is greater than or equal to the measurement. For more information, see OpenTelemetry’s Metrics SDK documentation in GitHub.

Too few buckets lead to less accurate metrics, while too many can cause high RAM usage, high disk usage, and slower performance.

We do not recommend changing the boundaries once the Hyperscience application is deployed and running, as doing so creates incompatible bucket ranges. These ranges are problematic, for example, when calculating quantiles over periods of time that overlap the old and new bucket layouts. Also, excessive buckets can lead to high cardinality, which can cause high RAM usage, high disk usage, and slower performance.

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS=
[0,1000,5000,15sec,30sec,1min,5min,15min,30min,60min]

This variable configures the bucket boundaries for Histogram metrics that have a name ending in "_duration_tasks" (e.g., hyperflow_task_backlog_duration_tasks or hyperflow_task_run_duration_tasks).
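For reference, these default boundaries written out entirely in milliseconds (a direct conversion of the defaults above, mirroring the format shown in this document):

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS=
[0,1000,5000,15000,30000,60000,300000,900000,1800000,3600000]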

By default, these metrics have fewer buckets, because they have higher cardinality than the rest. Otherwise, they would generate higher load on the observability infrastructure, leading to higher RAM usage, higher disk usage, and slower query performance.

Also, see the details on buckets given in OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION.