OpenTelemetry

This feature is available in v37 and later.

OpenTelemetry allows you to monitor the usage and health of your Hyperscience instance alongside that of other applications in your IT infrastructure. The data stream includes metrics for submission volume and throughput, time to completion for blocks and submissions, response times, error rates, and connectivity issues, among others.

The Hyperscience application emits only traces and metrics telemetry and is instrumented with the OpenTelemetry Python SDK. For more details, see the SDK’s GitHub repository page. Logs are not supported in OpenTelemetry format at this time.

Enabling OpenTelemetry in Hyperscience

Follow the steps below to expose an OpenTelemetry data stream to your application-performance monitoring tool.

Kubernetes deployments

1. Enable OpenTelemetry metrics and traces

To configure the Hyperscience Helm chart to export OpenTelemetry metrics and traces to an OpenTelemetry collector, add the following to the values.yaml file:

opentelemetry:
  enabled: true
  metrics:
    endpoint: 'http://<collector-fqdn>:<port>/v1/metrics'
  traces:
    endpoint: 'http://<collector-fqdn>:<port>/v1/traces'

By default, the Hyperscience application uses an OTLP exporter with the HTTP + protobuf protocol. See OpenTelemetry’s OTLP Exporter Configuration for more information.

NOTE: In principle, Hyperscience could be configured with the gRPC protocol. However, we strongly discourage the use of OTLP/gRPC with Hyperscience. Our internal tests surfaced critical issues with the currently used versions of the Python OpenTelemetry client and gRPC: in a multiprocessing environment, this combination results in segfaults, hanging threads, and gRPC exceptions. For more details, see the "Forking with PeriodicExportingMetricReader results in ValueError" issue in GitHub.
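On the receiving side, the collector needs an OTLP receiver with the HTTP protocol enabled. The snippet below is a minimal sketch of an OpenTelemetry Collector configuration for this purpose; the debug exporter is only a stand-in for whichever backend exporter you actually use, and 4318 is the collector's default OTLP/HTTP port:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug: {}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]
    traces:
      receivers: [otlp]
      exporters: [debug]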

2. Secure an OTLP connection

It’s possible to enable TLS for the telemetry stream emitted from the Hyperscience application. However, only TLS is supported in OpenTelemetry Python; mTLS is not currently supported (see details in the "Support mtls for otlp exporter" issue in GitHub).

Note that, if you’re exporting OTLP data with TLS to an OpenTelemetry collector, then the collector must be configured to receive encrypted data.
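For example, a collector's OTLP/HTTP receiver is typically configured with a server certificate and private key. The sketch below assumes illustrative file paths, and the exact key names can vary slightly between collector versions:

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
        tls:
          cert_file: /etc/otelcol/certs/server.crt
          key_file: /etc/otelcol/certs/server.key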

To enable TLS in the Hyperscience application:

a. Add the server certificate to your cluster as a Kubernetes Secret object (an example kubectl command is shown after the list below). Then, add the following snippet to the values.yaml file for the Helm chart (shown below are the default values from the chart, which are used if a given property is omitted):

opentelemetry:
  tls:
    certSecretName: ''
    certName: 'certificate.pem'

  • opentelemetry.tls.certSecretName should be the name of the created Kubernetes Secret holding the server certificate.

  • opentelemetry.tls.certName should be the name of the item inside the Secret data where the certificate is stored (defaults to certificate.pem).
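For example, a matching Secret could be created with kubectl. The Secret name (otel-server-cert) used here is illustrative; replace it and the namespace with your own values:

kubectl create secret generic otel-server-cert \
  --namespace <hyperscience-namespace> \
  --from-file=certificate.pem=./certificate.pem

With this Secret, certSecretName would be set to 'otel-server-cert', and certName would keep its default of 'certificate.pem'.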

b. The scheme of the endpoints must be changed to “https://”. Additionally, the port must be changed to the TLS receiving port of the OpenTelemetry collector (see Secure an OTLP connection under "Docker Compose deployments" in this article for more information).

opentelemetry:
  enabled: true
  metrics:
    endpoint: 'https://<collector-fqdn>:<TLS-port>/v1/metrics'
  traces:
    endpoint: 'https://<collector-fqdn>:<TLS-port>/v1/traces'

NOTE: The value of the OTEL_EXPORTER_OTLP_CERTIFICATE “.env” file variable is automatically set by the Hyperscience Helm chart; it shouldn’t be configured manually!

3. Add extra configuration variables as needed.

OpenTelemetry Python's opentelemetry.sdk.environment_variables describes the variables provided by the SDK. Hyperscience-specific environment variables are also available; they are described at the end of this document.

Environment variables should be added to both the app and trainer sections in values.yaml. For example, if you would like metrics to be exported every 10 seconds (instead of the default 30 seconds):

app:
  dotenv:
    OTEL_METRIC_EXPORT_INTERVAL=10000
…
trainer:
  env:
    OTEL_METRIC_EXPORT_INTERVAL=10000

Docker Compose deployments

1.  Enable OpenTelemetry metrics and traces.

To enable OpenTelemetry in the Hyperscience application to export metrics and traces to an OpenTelemetry collector, the following environment variables must be set in the “.env” file:

OTEL_SDK_DISABLED=false
OTEL_PYTHON_LOG_CORRELATION=true

OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://<collector-fqdn>:<port>/v1/metrics
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://<collector-fqdn>:<port>/v1/traces

Setting OTEL_SDK_DISABLED to false enables metrics and traces instrumentation in the application, and setting OTEL_PYTHON_LOG_CORRELATION to true adds span and trace IDs to the application logs.

Our application also sets default values for the following environment variables:

OTEL_SERVICE_NAME=hyperscience

OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp

OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf'

OTEL_METRIC_EXPORT_INTERVAL=30000

By default, the Hyperscience application uses an OTLP exporter with the HTTP + protobuf protocol. See OpenTelemetry’s OTLP Exporter Configuration for more information.

NOTE: In principle, Hyperscience could be configured with the gRPC protocol. However, we strongly discourage the use of OTLP/gRPC with Hyperscience. Our internal tests surfaced critical issues with the currently used versions of the Python OpenTelemetry client and gRPC: in a multiprocessing environment, this combination results in segfaults, hanging threads, and gRPC exceptions. For more details, see the "Forking with PeriodicExportingMetricReader results in ValueError" issue in GitHub.

If you do not add these variables to your “.env” file, the default values above are used once you set OTEL_SDK_DISABLED=false in your ".env" file and run one of the commands described in Editing the “.env” file and running the application.

Many more options are available for configuring OpenTelemetry exporters and protocols in the Hyperscience application. Refer to OpenTelemetry’s official documentation for details.

2.  Secure an OTLP connection.

For OTLP specifically, if you want to use a secure connection, the scheme must be changed to “https://”, and extra environment variables must be set (e.g., pointing to the certificate path, key, etc.). The certificate must be available to the Hyperscience application. The steps below give an example configuration and instructions for adding it.

  1. Create the directory that will hold the certificate, if it does not already exist. Assuming “/mnt/hs/” is used as the HS_PATH, run the following command to create the directory:

    mkdir -p /mnt/hs/certs
  2. Copy the certificate into this directory, and then set its ownership by running the following command:

    chown -R 1000:1000 /mnt/hs/certs
  3. If SELinux is enabled, execute:

    chcon -t container_file_t -R /mnt/hs/certs/

    Each time a file is added to the certs directory above for any reason, you will need to execute the chcon command again.

  4. Add the following to the “.env” file:

    OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=https://<collector-fqdn>:<TLS-port>/v1/metrics
    OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=https://<collector-fqdn>:<TLS-port>/v1/traces
    OTEL_EXPORTER_OTLP_CERTIFICATE=/etc/nginx/certs/<certificate-file-name>

More information is provided in OpenTelemetry’s documentation.

3.  Add extra configuration variables as needed.

OpenTelemetry Python's opentelemetry.sdk.environment_variables describes the variables provided by the SDK. Hyperscience-specific environment variables are also available; they are described at the end of this document. As usual, any environment variables should be configured in the “.env” file.
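For example, to export metrics every 10 seconds instead of the default 30 seconds (the same change shown for Kubernetes deployments above), add the following line to the “.env” file:

OTEL_METRIC_EXPORT_INTERVAL=10000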

OpenTelemetry traces

Hyperscience creates and emits a trace for every created submission and every API request received. Traces are useful when troubleshooting issues.

Application logs contain trace_id and span_id data, as shown below. These IDs are added to the logs automatically when the OTEL_PYTHON_LOG_CORRELATION “.env” file variable is set to true.

[Screenshot: application log entries showing trace_id and span_id values]

When troubleshooting a problem (e.g., latency, errors), you can copy the trace_id from the corresponding logs and search for it in your tracing backend, if one is configured. An example with Grafana and Tempo is shown below.

[Screenshot: an example trace viewed in Grafana with Tempo]

Note that when viewing traces of submissions, gaps in time are possible in the trace. They occur when the submission is waiting for manual tasks to be performed by keyers (e.g., Transcription Supervision).

Available Hyperscience metrics

| Group | Metric | What do we use it for | Notes |
| --- | --- | --- | --- |
| Hyperflow | hyperflow_task_count | Number of tasks; set when a task changes status. | attributes: name, ref_name, status |
| Hyperflow | hyperflow_task_backlog_duration | Duration of a task in the backlog; set when the task becomes IN_PROGRESS. | attributes: name, ref_name |
| Hyperflow | hyperflow_task_run_duration | Duration of task run time; set when the task terminates. | attributes: name, ref_name |
| Hyperflow | hyperflow_task_poll_duration | Duration of fetching a limited count of pending tasks. | attributes: name, count |
| Hyperflow | hyperflow_workflow_count | Number of flows; set when a flow changes status. | attributes: name, top_level, version, status. The 'top_level=True' tag is useful for tracking the number of submissions. |
| Hyperflow | hyperflow_workflow_run_duration | Duration of a flow run; set when the flow terminates. | attributes: name, top_level, version |
| Hyperflow | hyperflow_payload_offload_count | Number of WFE payloads offloaded to the object store. | attributes: bucket_10k (size in 10s of KiB) |
| Hyperflow | hyperflow_payload_store_duration | Duration of storing a WFE payload into the object store. | attributes: bucket_10k (size in 10s of KiB) |
| Hyperflow | hyperflow_payload_fetch_duration | Duration of fetching a WFE payload from the object store. |  |
| Hyperflow | hyperflow_engine_poll_workflows_duration | Duration of fetching workflows ready to be advanced. |  |
| Hyperflow | hyperflow_engine_advance_workflow_duration | Duration of advancing a single workflow instance. |  |
| Job queue | jobs_count | Number of jobs; set when a job changes state. | attributes: type, state |
| Job queue | jobs_backlog_duration | Duration jobs wait in the queue. | attributes: type |
| Job queue | jobs_exec_duration | Duration of the job run time. | attributes: type |
| Job queue | jobs_cpu_time | CPU time for running the job. | attributes: type |
| Job queue | jobs_system_time | System time for running the job. | attributes: type |
| Job queue | jobs_query_duration | Total duration of the DB queries executed in the job. | attributes: type |
| Job queue | jobs_worker_duration | Total duration of job execution (queries + everything else). | attributes: type |
| Job queue | jobs_worker_query_count | Number of DB queries executed in the job. | attributes: type |
| Transcription | hs_task_time_taken_ms | Milliseconds a human or machine took to transcribe a single field. | attributes: task_type (transcription), entry_type (machine/human); optional attributes: user_is_staff, username |
| Transcription | hs_machine_field_transcriptions | Milliseconds a machine took to transcribe a single field with non-zero confidence. | attributes: confidence_rd5, ml_model |
| Transcription | hs_completed_human_entries_count | Number of completed fields in a Transcription SV task. | attributes: worker (username), task_type, status (DONE) |
| Transcription | hs_finished_qa_records | Number of completed Transcription QA supervision tasks. | attributes: (multiple) |
| Submission pages | hs_submission_page_count | Number of created submission pages. |  |
| Submission pages | hs_submission_page_completed_count | Number of completed submission pages. | attributes: (optional) error_type |
| TDM (fka KDM) | hs_kdm_table_loaded_count | Number of loaded (shown) training documents for Tables through the TDM API. | example dashboard |
| TDM (fka KDM) | hs_kdm_table_saved_count | Number of updated training documents for Tables through the TDM API. |  |
| Table Layouts | hs_live_layout_tables_count | Number of tables in live layouts (sent once daily). | attributes: n_items |
| Table Layouts | hs_live_layout_columns_count | Number of table columns in live layouts (sent once daily). | attributes: n_items |
| Table Layouts | hs_working_layout_tables_count | Number of tables in draft layouts (sent once daily). | attributes: n_items |
| Table Layouts | hs_working_layout_columns_count | Number of table columns in draft layouts (sent once daily). | attributes: n_items |
| Table SV | hs_copycat_time_taken_ms | Milliseconds of copy-cat algorithm runtime (part of an API call). |  |
| Table SV | hs_table_id_qa_tasks_until_consensus | Number of times consensus was reached in Table ID QA, tagged with the number of QA tasks used. | attributes: num_qa_tasks |
| SV | task_response_time_taken_ms | Time spent on a manual supervision task. | attributes: worker (username), task_type, status (DONE) |
| SV | task_response_submit_success | Number of successfully completed manual supervision tasks. | attributes: worker (username), task_type, status (DONE) |
| SV | task_response_submit_fail | Number of invalid responses for manual supervision tasks. | attributes: worker (username), task_type, status (DONE) |
| SV | crowd_user_activity | Number of times a user starts or stops working on a manual supervision task. | attributes: worker (username), activity (take a break/start working) |
| SV | crowd_query_next_task | Miscellaneous timing for several small DB queries when fetching the next manual tasks. |  |
| DB deadlocks | hs_retry_db_transaction_count | Number of times a DB transaction is retried, e.g., due to a DB deadlock. | attributes: deadlock_retrial_count (optional, 1 when set), retry_transaction_exception_count (optional, 1 when set) |

Hyperscience-specific environment variables

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION=
[0,100,250,500,750,1000,2500,5000,7500,10sec,15sec,20sec,30sec,45sec,1min,2min,3min,
4min,5min,10min,30min,60min,3h,12h,24h]

This variable configures the bucket boundaries for all Histogram metrics that have a name ending in "duration" or "time" (e.g., http.client.duration or jobs_cpu_time). The bucket boundaries are in milliseconds; for display purposes in this document, we have used abbreviations (e.g. 60min). If you wish to change the above default values, all time representations MUST be substituted with their millisecond equivalents (e.g., 60min would become 3600000).
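For reference, here are the same default boundaries written out entirely in milliseconds, in the form the value would need to take if you set it explicitly (a direct conversion of the defaults above, mirroring the format shown in this document):

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION=
[0,100,250,500,750,1000,2500,5000,7500,10000,15000,20000,30000,45000,60000,120000,180000,
240000,300000,600000,1800000,3600000,10800000,43200000,86400000]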

The Hyperscience application uses Explicit Bucket Histograms. Explicit buckets are stated in terms of their upper boundary. Buckets are exclusive of their lower boundary and inclusive of their upper boundary, except at positive infinity. Each measurement belongs to the lowest-numbered bucket whose upper boundary is greater than or equal to the measurement. For more information, see OpenTelemetry’s Metrics SDK documentation in GitHub.

Too few buckets lead to less accurate metrics, while too many can cause high RAM usage, high disk usage, and slower performance.

We do not recommend changing the boundaries once the Hyperscience application is deployed and running, as doing so creates incompatible bucket ranges. These ranges are problematic, for example, when calculating quantiles over periods of time that overlap the old and new bucket layouts. Also, excessive buckets can lead to high cardinality, which can cause high RAM usage, high disk usage, and slower performance.

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS=
[0,1000,5000,15sec,30sec,1min,5min,15min,30min,60min]

This variable configures the bucket boundaries for Histogram metrics that have a name ending in "_duration_tasks" (e.g., hyperflow_task_backlog_duration_tasks or hyperflow_task_run_duration_tasks).
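For reference, these default boundaries written out entirely in milliseconds (a direct conversion of the defaults above, mirroring the format shown in this document):

OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION_TASKS=
[0,1000,5000,15000,30000,60000,300000,900000,1800000,3600000]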

By default, these metrics have fewer buckets, because they have higher cardinality than the rest. Otherwise, they would generate higher load on the observability infrastructure, leading to higher RAM usage, higher disk usage, and slower query performance.

Also, see the details on buckets given in OTEL_HISTOGRAM_BUCKET_BOUNDARIES_DURATION.