Tweaks
Using a shared volume mount file store
Our recommended approach for Kubernetes installations is to use the S3 storage mode. For users that do not have access to an S3-compatible object storage such as AWS S3 or MinIO, we also support a file storage mode. This file store is only supported with volumes that allow the ReadWriteMany access mode (for example, NFS, GlusterFS, or CephFS). Full matrix here.
IMPORTANT: Migrating from FILE to S3 storage mode, or vice versa, is not supported. Choosing a storage mode is supported only on clean installations.
Prerequisites
helm chart version >= 8.8.0
hyperoperator version >= 5.6.0
To enable file storage mode, supply the following helm chart values:
app:
  storage_mode:
    file:
      # Only PersistentVolumeClaims that allow the ReadWriteMany access mode are supported
      persistentVolumeClaim: nfs-pvc # Name of the PVC created previously
      mountPath: /var/www/forms/forms/media
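For reference, a ReadWriteMany PVC that the persistentVolumeClaim value could point to might look like the sketch below. The PVC name (nfs-pvc), namespace, storage class, and size are assumptions; use the values that match your cluster's storage provisioner.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc            # must match app.storage_mode.file.persistentVolumeClaim
  namespace: hyperscience  # assumed installation namespace
spec:
  accessModes:
    - ReadWriteMany        # required for the shared file store
  storageClassName: nfs-client # assumed RWX-capable storage class
  resources:
    requests:
      storage: 100Gi       # assumed size; adjust to your needs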
Using HTTP Proxy
If you have a proxy setup for your network, you need to pass the proxy settings to the application through the environment variables HTTP_PROXY, HTTPS_PROXY and NO_PROXY. These variables are used by the containers to communicate with the external services and resources.
To configure the proxy settings, edit the values.yaml file of the helm chart and add the following sections:
app:
  dotenv:
    - name: HTTP_PROXY
      value: http://proxy.example.com:8080 # replace with your proxy URL
    - name: HTTPS_PROXY
      value: http://proxy.example.com:8080 # replace with your proxy URL
    - name: NO_PROXY
      value: localhost,cluster.local,127.0.0.1,.example.com # replace with your no proxy domains
operator:
  env:
    - name: HTTP_PROXY
      value: http://proxy.example.com:8080 # replace with your proxy URL
    - name: HTTPS_PROXY
      value: http://proxy.example.com:8080 # replace with your proxy URL
    - name: NO_PROXY
      value: localhost,cluster.local,127.0.0.1,.example.com # replace with your no proxy domains
trainer:
  env:
    - name: HTTP_PROXY
      value: http://proxy.example.com:8080 # replace with your proxy URL
    - name: HTTPS_PROXY
      value: http://proxy.example.com:8080 # replace with your proxy URL
    - name: NO_PROXY
      value: localhost,cluster.local,127.0.0.1,.example.com # replace with your no proxy domains
Blocks do not need to be configured with those variables as they inherit them from the backend. Save the file and deploy the helm chart with the updated values. The application will use the proxy settings from the environment variables.
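For example, assuming the release and chart variables used later in this article, applying the updated values might look like:
helm upgrade $HS_HELM_RELEASE -f values.yaml $HS_HELM_CHART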
Using SDM blocks with separate docker repositories
By default, the hyperoperator will look for the block images in a single repository with the following format:
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks:vpc...36.0.2
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks:segmentation...36.0.2
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks:python_code...36.0.2
...
If you can't upload all of the images in the format above and want to store each block in a separate repository, you need to instruct the hyperoperator accordingly by exporting the environment variable HS_SEPARATE_BLOCK_REPOS=True to the hyperoperator pod. This tells the hyperoperator to look for the images in separate repositories instead of a single one.
The format that the operator will then expect is:
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/vpc:36.0.2
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/segmentation:36.0.2
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/python_code:36.0.2
...
To set this environment variable for the hyperoperator, you can add the following snippet to your values.yaml file:
operator:
  env:
    HS_SEPARATE_BLOCK_REPOS: True
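If you already have images in the single-repository layout and need to move them to per-block repositories, a retag-and-push sketch with standard docker commands might look like the following; the "..." in the source tag stands for the rest of the tag exactly as it appears in your registry, and the target repository (sdm_blocks/vpc) must already exist in your registry:
# pull the image from the single sdm_blocks repository
docker pull 0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks:vpc...36.0.2
# retag it into a per-block repository
docker tag 0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks:vpc...36.0.2 0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/vpc:36.0.2
# push to the new repository
docker push 0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/vpc:36.0.2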
Using Universal Folder Listener block
NOTE: In Kubernetes clusters, a Universal Folder Listener is only supported with volumes that allow the ReadWriteMany access mode (for example, NFS, GlusterFS, or CephFS). Full matrix here.
Prerequisites
helm chart version >= 8.7.0
hyperoperator version >= 5.5.0
Expected downtime from this change
restart of hyperoperator
update and restart of the backend deployment
update and restart of the frontend deployment
Configuration
Modify your values.yaml to add the following (example) configuration. This will add the required volume to the Universal Folder Listener block.
blocks:
  volumes:
    - name: my-nfs-name
      nfs:
        server: fs-test.efs.us-east-1.amazonaws.com
        path: /media
  volumeMounts:
    - mountPath: /var/www/forms/forms/input
      name: my-nfs-name
IMPORTANT: If a mountPath different from /var/www/forms/forms/input is specified, then the FS_INPUT_PATH environment variable has to be passed to app.dotenv with the same path value.
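For example, if the volume were mounted at a custom path instead, the matching app.dotenv entry might look like the sketch below; the path /mnt/ufl-input is an illustrative assumption:
blocks:
  volumeMounts:
    - mountPath: /mnt/ufl-input # custom mount path
      name: my-nfs-name
app:
  dotenv:
    - name: FS_INPUT_PATH
      value: /mnt/ufl-input # must match the custom mountPath above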
Troubleshooting
Running preflight (pre-installation) checks
Preflight checks run several queries against the Kubernetes APIs to determine whether your cluster has met the minimum requirements for installing Hyperscience.
Run preflight checks using the following command:
kubectl config set-context $(kubectl config current-context) --namespace=""
hsk8s preflight values.yaml
Generating diagnostics bundle
By default, executing the hsk8s support-bundle ... command will collect both application-related and cluster-related data. The result is packed into a diagnostics.tar.gz archive in the directory where the command is executed.
Generating a bundle will:
run several queries against the Kubernetes APIs to gather diagnostics information about your installation;
fetch some of the pod logs used by the Hyperscience deployments;
spin up a pod in the hyperscience namespace and execute some queries against the HS database to collect application information.
Generate the bundle using the following command:
kubectl config set-context $(kubectl config current-context) --namespace=""
hsk8s support-bundle -z --token
The --token flag is necessary to fetch the latest version of the diagnostics.py script from the Cloudsmith repository. The script is stored in the default $HOME/.hsk8s folder. A path to the script can also be provided via the --diag-path flag. If the token is configured in the hsk8s $HOME/config file, the flag is not necessary.
The -z (or --prepare-zd-support) flag splits diagnostics.tar.gz into chunks no larger than Zendesk's file-transfer limit. The split happens only if the tarball is larger than the limit (49 MB). If splitting occurs, you should see files named diagnostics.tar.gz1, diagnostics.tar.gz2, and so on.
Send the resulting files to your Hyperscience support representative.
To extract the bundle, execute the following commands:
cat diagnostics_.tar.gz* > diagnostics_.tar.gz
tar -xzf diagnostics.tar.gz .
Collecting application troubleshooting data
Collecting application data is performed via the diagnostics.py script, kept in the $HOME/.hsk8s folder. The script is executed against the hyperscience database from within a pod in the Kubernetes cluster. After the information is collected, it is copied from the pod to your local machine, and the pod is cleaned up.
If you want to collect only application-related data, use the exclude-support-bundle flag, --xsb:
hsk8s support-bundle -z --token --xsb
To control the behavior of diagnostics.py, ".env" variables can be passed to the pod. These environment variables can be passed to the script via the -e flag in hsk8s:
hsk8s support-bundle -z --token -e MAX_WORKFLOW_DOWNLOAD_SECONDS=200 -e WORKFLOW_EXPORT_PERIOD_IN_SECS=300
Env var options:
DATA_EXPORT_DEFAULT_PERIOD_IN_DAYS (default: 45)
CAPTURE_PII (default: false)
DISTINCT_CORRELATION_IDS_LIMIT (default: 50000)
MAX_WORKFLOW_DOWNLOAD_SECONDS (default: 900)
Less common env var options:
Grab only the most recent hour's worth of workflows:
WORKFLOW_EXPORT_PERIOD_IN_SECS=3600
Grab one hour of workflows in the interval (WORKFLOW_EXPORT_LATEST_DT - WORKFLOW_EXPORT_PERIOD_IN_SECS, WORKFLOW_EXPORT_LATEST_DT):
WORKFLOW_EXPORT_LATEST_DT=2022-01-11T18:00:00-00:00 WORKFLOW_EXPORT_PERIOD_IN_SECS=3600
WORKFLOW_EXPORT_LATEST_DT=2022-01-11T20:00:00-08:00 WORKFLOW_EXPORT_PERIOD_IN_SECS=3600
To collect the application-diagnostics data, a Kubernetes Job is deployed into the cluster. If the pod created by the Job cannot create a container within 5 minutes, the Job times out with an error. This period can be configured via the --diag-run-timeout=300 flag; if 5 minutes are not enough to initialize the pod, increase that value. After the data is collected in the pod's ephemeral storage, it is copied from the pod to your local machine. There is a 50-second timeout for the collector container to start up; if that needs to be modified, use the --diag-collect-timeout=50 flag.
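For example, a run that allows more time for both the diagnostics Job and the collector container might look like this (the timeout values are illustrative):
hsk8s support-bundle -z --token --diag-run-timeout=600 --diag-collect-timeout=120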
By default, the following data is collected by the diagnostics.py script:
capture_system_settings_export
capture_layout_release
capture_usage_report
capture_trained_models_metadata
capture_workflows_definitions
capture_threshold_audits
capture_machine_audit_logs
capture_attached_trainers
capture_latest_trainer_runs
capture_latest_jobs
capture_init_entries
capture_health_records
capture_health_statistics_records
capture_workflow_instances
capture_workflow_dsls
capture_task_counts
capture_workflow_counts
capture_db_info
To explicitly exclude one of the actions listed above, use the --actions flag combined with -b/--blacklist-actions:
hsk8s support-bundle -z --token --actions=capture_workflow_counts,capture_workflow_instances -b
To include only specific actions:
hsk8s support-bundle -z --token --actions=capture_workflow_counts,capture_workflow_instances
Collecting support-bundle
If you want to collect only the support bundle (cluster-related data and pod logs), exclude the application data with the --xappdata flag. A Cloudsmith token is not necessary to collect only the support bundle.
hsk8s support-bundle -z --xappdata
To limit the number of log lines (or the age of the logs) collected per pod, use one of the following flags. By default, the maximum number of lines that can be collected per pod is 10,000. Valid time units for --max-age are 'ns', 'us' (or 'µs'), 'ms', 's', 'm', and 'h'.
hsk8s support-bundle -z --max-age=12h
hsk8s support-bundle -z --max-lines=100
For very big deployments (more than 500 pods), a flag must be used to select which pods' logs are collected, either by label or by limiting collection to the essential pods only. If neither flag is specified, hsk8s results in an error.
hsk8s support-bundle -z --label app.kubernetes.io/component=
hsk8s support-bundle -z --essential-only
Pod does not start
Pod stuck in Init:ImagePullBackOff status
A pod can get stuck in Init for several reasons; however, the ImagePullBackOff error tells us a lot without further investigation: Kubernetes is having issues pulling the image required to start the container in the Pod. Here we demonstrate some example steps to find the exact reason.
Get the list of pods
$ kubectl get pod
The following is a sample output:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hyperscience-backend-8659c86b7b-zcn4t 0/7 Init:ImagePullBackOff 0 7m58s
hyperscience-frontend-878f45f6f-wjf8p 0/2 Pending 0 7m58s
hyperscience-hyperoperator-5f7d6f7d95-489r7 0/1 CreateContainerConfigError 0 7m58s
Describe a pod stuck in Init:ImagePullBackOff.
$ kubectl describe pod hyperscience-backend-8659c86b7b-zcn4t
The following is a sample output:
Name: hyperscience-backend-8659c86b7b-zcn4t
Namespace: hyperscience
Priority: 0
Node: ip-10-10-93-158.ec2.internal/10.10.93.158
Labels: app.kubernetes.io/component=backend
app.kubernetes.io/instance=hyperscience
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=hyperscience
app.kubernetes.io/part-of=hyperscience
app.kubernetes.io/version=34.0.4
helm.sh/chart=hyperscience-5.0.0
pod-template-hash=8659c86b7b
...
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m17s default-scheduler Successfully assigned hyperscience/hyperscience-backend-8659c86b7b-zcn4t to ip-10-10-93-158.ec2.internal
Normal Pulling 5m50s (x4 over 7m16s) kubelet, ip-10-10-93-158.ec2.internal Pulling image "1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.4"
Warning Failed 5m50s (x4 over 7m16s) kubelet, ip-10-10-93-158.ec2.internal Failed to pull image "1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.4": rpc error: code = Unknown desc = Error response from daemon: manifest for 1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.4 not found: manifest unknown: Requested image not found
Warning Failed 5m50s (x4 over 7m16s) kubelet, ip-10-10-93-158.ec2.internal Error: ErrImagePull
Warning Failed 5m28s (x6 over 7m16s) kubelet, ip-10-10-93-158.ec2.internal Error: ImagePullBackOff
Normal BackOff 2m7s (x20 over 7m16s) kubelet, ip-10-10-93-158.ec2.internal Back-off pulling image "1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.4"
The pod description indicates that the kubelet cannot pull the specified forms image from the container registry. Check manually whether the image can actually be found in the docker registry; if not, use our hsk8s guide to stream the required image to your registry.
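A quick manual check, using the image reference from the sample output above, might look like the following; the docker pull assumes the machine is authenticated to the registry, and the aws ecr command applies only if your registry is Amazon ECR:
$ docker pull 1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.4
$ aws ecr list-images --repository-name forms-w2whd9mv --region us-east-1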
Pod stuck in Init Status
Another reason for a pod being stuck in Init could be a failing init container. Let's look at such an example.
Get the list of pods
$ kubectl get pod
The following is a sample output:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hyperscience-backend-6f694486f-l5bmg 0/7 Init:0/1 0 12m
hyperscience-frontend-546f8f74dc-mq4fr 0/2 Init:0/1 0 12m
hyperscience-hyperoperator-846688864d-vf4mg 1/1 Running 0 12m
Describe a pod stuck in Init
kubectl describe pod hyperscience-backend-6f694486f-l5bmg
The following is a sample output:
Name: hyperscience-backend-6f694486f-l5bmg
Namespace: hyperscience
Priority: 0
Node: ip-10-10-93-158.ec2.internal/10.10.93.158
Labels: app.kubernetes.io/component=backend
app.kubernetes.io/instance=hyperscience
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=hyperscience
app.kubernetes.io/part-of=hyperscience
app.kubernetes.io/version=34.0.3
helm.sh/chart=hyperscience-5.0.0
...
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20m (x2 over 20m) default-scheduler 0/2 nodes are available: 1 Insufficient memory, 1 node(s) didn't match Pod's node affinity/selector.
Normal Scheduled 19m default-scheduler Successfully assigned hyperscience/hyperscience-backend-6f694486f-l5bmg to ip-10-10-93-158.ec2.internal
Normal Pulled 19m kubelet, ip-10-10-93-158.ec2.internal Container image "1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.3" already present on machine
Normal Created 19m kubelet, ip-10-10-93-158.ec2.internal Created container init
Normal Started 19m kubelet, ip-10-10-93-158.ec2.internal Started container init
From the description of this pod, we can see that the pod was successfully assigned to a node, ip-10-10-93-158.ec2.internal. The image was already pulled and the pod was started, with the init container starting first.
Take a look at the logs of the init container.
$ kubectl logs -f hyperscience-backend-6f694486f-l5bmg -c init
...
...
CommandError: Failed connecting to db. (connection to server at "hyperscience.x1x1x1x1.us-east-1.rds.amazonaws.com" (10.10.61.115), port 5432 failed: FATAL: password authentication failed for user "my-postgres-role"
connection to server at "hyperscience.x1x1x1x1.us-east-1.rds.amazonaws.com" (10.10.61.115), port 5432 failed: FATAL: password authentication failed for user "my-postgres-role"
)
The logs indicate that the init container's attempts to authenticate to the RDS database are unsuccessful. Investigate whether the secret you created while following the Helm Chart guide contains the correct credentials and endpoint.
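One way to check is to inspect the secret referenced in your Helm values; the secret name and namespace below are assumptions based on the examples in this article, so substitute the ones from your installation:
# values are base64-encoded; decode them to verify the credentials and endpoint
$ kubectl -n hyperscience get secret hyperscience-platform -o yaml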
Pod stuck in Pending status
Get the list of pods
$ kubectl get pod
The following is a sample output:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hyperscience-backend-8659c86b7b-zcn4t 0/7 Init:ImagePullBackOff 0 7m58s
hyperscience-frontend-878f45f6f-wjf8p 0/2 Pending 0 7m58s
hyperscience-hyperoperator-5f7d6f7d95-489r7 0/1 CreateContainerConfigError 0 7m58s
Describe the pod stuck in Pending status
kubectl describe pod hyperscience-frontend-878f45f6f-wjf8p
The following is a sample output:
Name: hyperscience-frontend-878f45f6f-wjf8p
Namespace: hyperscience
Priority: 0
Node:
Labels: app.kubernetes.io/component=frontend
app.kubernetes.io/instance=hyperscience
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=hyperscience
app.kubernetes.io/part-of=hyperscience
app.kubernetes.io/version=34.0.4
helm.sh/chart=hyperscience-5.0.0
...
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 0s (x23 over 21m) default-scheduler 0/2 nodes are available: 1 Insufficient memory, 1 node(s) didn't match Pod's node affinity/selector.
The pod description indicates that the Kubernetes scheduler could not find a suitable node to assign the pod to: one node doesn't have enough available resources to satisfy the pod's memory requests, and the other doesn't have the required node selector labels.
The following is example output for a node that has the required label to match the Pod's node affinity. The important part here is the hs-component=platform label.
$ kubectl describe node ip-10-10-93-158.ec2.internal
Name: ip-10-10-93-158.ec2.internal
Roles:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m5.2xlarge
beta.kubernetes.io/os=linux
eks.amazonaws.com/capacityType=ON_DEMAND
eks.amazonaws.com/nodegroup=platform
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1b
hs-component=platform
...
If this is the reason your pod is not being scheduled, inspect your nodes and add the required label as needed, per our Infrastructure Requirements.
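If a node is missing the label, it can be added directly; the node name below comes from the sample output above:
$ kubectl label node ip-10-10-93-158.ec2.internal hs-component=platform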
Operator stuck in CreateContainerConfigError status
Get the list of pods
$ kubectl get pod
The following is a sample output:
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hyperscience-backend-8659c86b7b-zcn4t 0/7 Init:ImagePullBackOff 0 7m58s
hyperscience-frontend-878f45f6f-wjf8p 0/2 Pending 0 7m58s
hyperscience-hyperoperator-5f7d6f7d95-489r7 0/1 CreateContainerConfigError 0 7m58s
Describe the hyperoperator pod stuck in CreateContainerConfigError status
$ kubectl describe pod hyperscience-hyperoperator-5f7d6f7d95-489r7
The following is a sample output:
Name: hyperscience-hyperoperator-5f7d6f7d95-489r7
Namespace: hyperscience
Priority: 0
Node: ip-10-10-93-158.ec2.internal/10.10.93.158
Labels: app.kubernetes.io/component=hyperoperator
app.kubernetes.io/instance=hyperscience
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=hyperscience
app.kubernetes.io/part-of=hyperscience
app.kubernetes.io/version=34.0.4
helm.sh/chart=hyperscience-5.0.0
...
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 59m default-scheduler Successfully assigned hyperscience/hyperscience-hyperoperator-5f7d6f7d95-489r7 to ip-10-10-93-158.ec2.internal
Normal Pulling 59m kubelet, ip-10-10-93-158.ec2.internal Pulling image "1234567890.dkr.ecr.us-east-1.amazonaws.com/hyperoperator:3.3.4"
Normal Pulled 59m kubelet, ip-10-10-93-158.ec2.internal Successfully pulled image "1234567890.dkr.ecr.us-east-1.amazonaws.com/hyperoperator:3.3.4" in 171.682343ms
Warning Failed 56m (x12 over 59m) kubelet, ip-10-10-93-158.ec2.internal Error: secret "hyperscience-platform" not found
Normal Pulled 4m2s (x258 over 59m) kubelet, ip-10-10-93-158.ec2.internal Container image "1234567890.dkr.ecr.us-east-1.amazonaws.com/hyperoperator:3.3.4" already present on machine
The pod description indicates that a required Kubernetes secret is missing. Refer to the Helm Chart article for more information about the required secret.
Submissions are stuck
No block pods are present
Submissions are processed by worker containers, which we call blocks. If helm install and helm uninstall are called multiple times in quick succession, the block pods might disappear. If they are missing, the deployment state will likely look like the following.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hyperscience-backend-5b5dcd984f-f6p8j 6/6 Running 0 18m
hyperscience-frontend-56756b55f6-t7rfw 2/2 Running 0 18m
hyperscience-hyperflow-engine-7bb598699f-tvb52 1/1 Running 0 18m
hyperscience-hyperoperator-666f8fc5db-vgvrv 2/2 Running 0 18m
hyperscience-idp-sync-manager-5b59847777-sdbrf 1/1 Running 0 18m
First, check whether the HyperBlockManager resource is present. If it is, force-remove its finalizers so that Kubernetes can delete it.
$ kubectl get hyperblockmanagers
NAME AGE
hyperscience 20m
$ kubectl patch hyperblockmanager hyperscience -p '{"metadata":{"finalizers":[]}}' --type=merge
hyperblockmanager.hyperscience.net/hyperscience patched
Then, recreate it with the following command, which will in turn recreate the blocks.
$ helm upgrade $HS_HELM_RELEASE -f values.yaml $HS_HELM_CHART
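Once the upgrade finishes, the block pods should reappear alongside the core deployments. A quick way to confirm (block pod names will vary with the blocks configured in your installation):
$ kubectl get hyperblockmanagers
$ kubectl get pods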