Monitoring your Hyperscience Platform effectively is critical for identifying and addressing issues promptly. This article provides instructions for monitoring the system, focusing on resource utilization, job failures, and workflow processing. For more information, contact your Hyperscience representative or Hyperscience Support.
Standard Monitoring
To ensure the stability and performance of the VMs hosting the Hyperscience application, monitor the following resources:
RAM
The application uses up to 95% of available RAM when processing submissions.
Recommendation
Set alerts if RAM usage exceeds 95%.
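For illustration, a minimal alerting sketch in Python is shown below. It assumes the psutil library is installed on the VM and that print is replaced with your own alerting mechanism; adjust the threshold and scheduling for your environment.

import psutil

RAM_ALERT_THRESHOLD = 95.0  # percent, matching the recommendation above

def check_ram():
    usage = psutil.virtual_memory().percent
    if usage > RAM_ALERT_THRESHOLD:
        # Replace print with your alerting mechanism (email, PagerDuty, etc.)
        print(f"ALERT: RAM usage at {usage:.1f}% exceeds {RAM_ALERT_THRESHOLD}%")

if __name__ == "__main__":
    check_ram()  # schedule via cron or your monitoring agent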
Storage
Monitor all types of storage:
Local Storage: Independently monitor all volumes.
Networked File Storage (NFS): Verify that NFS is mounted correctly and accessible from all VMs.
Database Storage: Track DB storage usage.
Recommendation
Set alerts if any storage exceeds 95% usage.
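As an illustration, the sketch below checks every locally mounted volume against the 95% threshold and verifies that an NFS mount point is present. It assumes psutil is installed; the path /mnt/hyperscience_nfs is a placeholder, so substitute your actual NFS mount path and alerting mechanism. Database storage is typically monitored with your database platform's own tooling.

import os
import psutil

STORAGE_ALERT_THRESHOLD = 95.0       # percent, per the recommendation above
NFS_MOUNT = "/mnt/hyperscience_nfs"  # placeholder; use your actual NFS mount path

def check_volumes():
    # Independently check each locally mounted volume
    for part in psutil.disk_partitions(all=False):
        usage = psutil.disk_usage(part.mountpoint).percent
        if usage > STORAGE_ALERT_THRESHOLD:
            print(f"ALERT: {part.mountpoint} at {usage:.1f}% usage")

def check_nfs():
    # Verify the NFS share is actually mounted on this VM
    if not os.path.ismount(NFS_MOUNT):
        print(f"ALERT: NFS mount {NFS_MOUNT} is not mounted")

if __name__ == "__main__":
    check_volumes()
    check_nfs()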
CPU
Hyperscience maximizes CPU utilization during submission processing.
CPU Metrics
100% CPU usage over extended periods does not necessarily indicate an issue. Use CPU metrics for debugging rather than alerting.
Health Check API
Periodically check the health of the application using the Health Check API.
Impact: If the health check detects an issue, processing on the affected host pauses until the issue is resolved. Other hosts remain unaffected.
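For reference, a simple polling sketch is shown below. The endpoint path /healthcheck, the reliance on HTTP status codes, and the polling interval are assumptions; consult the Health Check API documentation for your release for the exact URL and expected response format.

import requests

HEALTH_URL = "https://<application_URL>/healthcheck"  # assumed path; confirm for your deployment
CHECK_INTERVAL_SECONDS = 60  # assumed interval; tune to your needs

def check_health():
    try:
        response = requests.get(HEALTH_URL, timeout=10)
        if response.status_code != 200:
            print(f"ALERT: health check returned HTTP {response.status_code}")
    except requests.RequestException as exc:
        print(f"ALERT: health check request failed: {exc}")

if __name__ == "__main__":
    check_health()  # schedule with cron, or loop with time.sleep(CHECK_INTERVAL_SECONDS)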
Application-specific monitoring
Hyperscience processes submissions asynchronously through background jobs. Failures may occasionally occur, requiring monitoring and manual intervention.
Monitoring failures
Monitor for log entries indicating job or flow failures:
Job failures:
Set up alerts for the following log messages:
WORKER_FAIL
WORKER_JOB_FAIL
These indicate job failures requiring manual intervention.
Flow failures:
Most halted submissions now produce a log line containing "Workflow failed".
Recommendation
Monitor for this log line to detect issues early.
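If you do not already have a log-alerting tool, the sketch below scans an application log file for the messages above and prints an alert for each match; in practice you would wire these same patterns into your log aggregator. The log path is a placeholder, so point it at wherever your deployment writes application logs.

FAILURE_PATTERNS = ("WORKER_FAIL", "WORKER_JOB_FAIL", "Workflow failed")
LOG_PATH = "/var/log/hyperscience/app.log"  # placeholder; use your deployment's log location

def scan_log(path=LOG_PATH):
    # Collect any log lines that contain a known failure message
    alerts = []
    with open(path, "r", errors="replace") as log_file:
        for line in log_file:
            if any(pattern in line for pattern in FAILURE_PATTERNS):
                alerts.append(line.rstrip())
    return alerts

if __name__ == "__main__":
    for entry in scan_log():
        print(f"ALERT: {entry}")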
Actions for handling failures
When failures are detected, follow these steps to resolve them:
Handle halted jobs
Go to the list of halted jobs:
<application_URL>/administration/jobs?state=HALTED
Review halted jobs and address the issue causing the failure.
Handle failed flows
Go to the list of failed flows:
<application_URL>/administration/flows?state=FAILED
After addressing the root cause, use the Actions drop-down menu to select Retry failed flow runs in filter.
Gather failure details
For jobs:
In the list of halted jobs, use the job’s menu to view more details (View Jobs), and review the State Description field for insights into the issue.
Include these details when reaching out to Hyperscience Support for assistance.
Best practices
Configure log triggers for:
WORKER_FAIL
WORKER_JOB_FAIL
Workflow failed
Monitor for container restarts, as they may indicate underlying system issues (see the sketch after this list).
Ensure sufficient resources (RAM, storage, etc.) are allocated to avoid preventable failures.
Regularly review and retry failed jobs and flows via the provided URLs.
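For the container-restart check in particular, a sketch is shown below that reads restart counts from Docker. It assumes the Hyperscience containers run under Docker and that the docker CLI is available to the monitoring user; adapt it if your deployment uses a different container runtime.

import json
import subprocess

def container_restart_counts():
    # List running container names, then read each container's RestartCount.
    # RestartCount increments when Docker restarts a container under its restart policy.
    names = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    counts = {}
    for name in names:
        inspect_output = subprocess.run(
            ["docker", "inspect", name],
            capture_output=True, text=True, check=True,
        ).stdout
        counts[name] = json.loads(inspect_output)[0]["RestartCount"]
    return counts

if __name__ == "__main__":
    for name, restarts in container_restart_counts().items():
        if restarts > 0:
            print(f"ALERT: container {name} has restarted {restarts} time(s)")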