Application Monitoring

Monitoring your Hyperscience Platform effectively is critical for identifying and addressing issues promptly. This article provides instructions for monitoring the system, focusing on resource utilization, job failures, and workflow processing. For more information, contact your Hyperscience representative or Hyperscience Support.

Standard Monitoring

To ensure the stability and performance of your VM hosting the Hyperscience application, monitor the following resources:

Ram

The application uses up to 95% of available RAM when processing submissions.

Recommendation
Set alerts if RAM usage exceeds 95%.

Storage

Monitor all types of storage:

Local Storage: Independently monitor all volumes.
Networked File Storage (NFS): Verify that NFS is mounted correctly and accessible from all VMs.
Database Storage: Track DB storage usage.

Recommendation
Set alerts if any storage exceeds 95% usage.

CPU

Hyperscience maximizes CPU utilization during submission processing.

CPU Metrics
100% CPU usage over extended periods does not necessarily indicate an issue. Use CPU metrics for debugging rather than alerting.

Health Check API

Periodically check the health of the application using the Health Check API.
Impact: If an issue is detected, processing on the affected host will pause until resolved. Other hosts will remain unaffected.

Application-specific monitoring

Hyperscience processes submissions asynchronously through background jobs. Failures may occasionally occur, requiring monitoring and manual intervention.

Monitoring failures

Monitor for log entries indicating job or flow failures:

Job failures:
- Set up alerts for the following log messages:
  - WORKER_FAIL
  - WORKER_JOB_FAIL
- These indicate job failures requiring manual intervention.
Flow failures:
- Most halted submissions now produce a log line containing Workflow failed.

Recommendation
Monitor this logline to detect issues early.

Actions for handling failures

When failures are detected, follow these steps to resolve them:

Handle halted jobs
1. Go to the list of halted jobs:
  <application_URL>/administration/jobs?state=HALTED
2. Review halted jobs and address the issue causing the failure.
Handle failed flows
1. Go to the list of failed flows:
  <application_URL>/administration/flows?state=FAILED
2. Use the Actions drop-down menu to select Retry failed flows runs in filter after addressing the root cause.
Gather failure details
1. For jobs:
  - In the list of halted jobs, use the job’s menu to view more details (View Jobs), and review the State Description field for insights into the issue.
2. Include relevant information when reaching out to Hyperscience Support for assistance.

Best practices

Configure log triggers for:
- WORKER_FAIL
- WORKER_JOB_FAIL
- Workflow failed
Monitor for container restarts, as they may indicate underlying system issues.
Ensure sufficient resources (RAM, storage, etc.) are allocated to avoid preventable failures.
Regularly review and retry failed jobs and flows via the provided URLs.