Kubernetes Troubleshooting and Tweaks

Tweaks

Using a shared volume mount file store

Our recommended approach for Kubernetes installations is to use the S3 storage mode. For users that do not have access to an S3-compatible object store such as AWS S3 or MinIO, we also support file storage mode. The file store is only supported with volumes that allow the ReadWriteMany access mode, such as NFS, GlusterFS, or CephFS; see the full access-mode matrix in the Kubernetes documentation.

IMPORTANT: Migrating from FILE to S3 storage mode, or vice versa, is not supported. The storage mode can be chosen only on clean installations.

Prerequisites

  • helm chart version >= 8.8.0

  • hyperoperator version >= 5.6.0

To enable file storage mode, supply the following Helm chart values:

app:
  storage_mode:
    file:
      # Only persistentVolumeClaim that allow ReadWriteMany access mode are supported
      persistentVolumeClaim: nfs-pvc  # Name of the PVC created previously
      mountPath: /var/www/forms/forms/media
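
For reference, a ReadWriteMany PVC backing this configuration might look like the minimal sketch below; the StorageClass name (nfs-client), namespace, and requested size are assumptions and must be adapted to your environment.

# Example ReadWriteMany PVC (StorageClass, namespace, and size are assumptions)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
  namespace: hyperscience
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 100Gi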

Using HTTP Proxy

If you have a proxy set up for your network, you need to pass the proxy settings to the application through the HTTP_PROXY, HTTPS_PROXY, and NO_PROXY environment variables. The containers use these variables to communicate with external services and resources.

To configure the proxy settings, edit the values.yaml file of the Helm chart and add the following sections under the app.dotenv, operator.env, and trainer.env keys:

app:
  dotenv:
  - name: HTTP_PROXY
    value: http://proxy.example.com:8080 # replace with your proxy URL
  - name: HTTPS_PROXY
    value: http://proxy.example.com:8080 # replace with your proxy URL
  - name: NO_PROXY
    value: localhost,cluster.local,127.0.0.1,.example.com # replace with your no proxy domains

operator:
  env:
  - name: HTTP_PROXY
    value: http://proxy.example.com:8080 # replace with your proxy URL
  - name: HTTPS_PROXY
    value: http://proxy.example.com:8080 # replace with your proxy URL
  - name: NO_PROXY
    value: localhost,cluster.local,127.0.0.1,.example.com # replace with your no proxy domains

trainer:
  env:
  - name: HTTP_PROXY
    value: http://proxy.example.com:8080 # replace with your proxy URL
  - name: HTTPS_PROXY
    value: http://proxy.example.com:8080 # replace with your proxy URL
  - name: NO_PROXY
    value: localhost,cluster.local,127.0.0.1,.example.com # replace with your no proxy domains

Blocks do not need to be configured with these variables, as they inherit them from the backend. Save the file and deploy the Helm chart with the updated values. The application will use the proxy settings from the environment variables.
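
For example, assuming the release name and chart location variables used later in this article, the updated values can be applied with:

helm upgrade $HS_HELM_RELEASE -f values.yaml $HS_HELM_CHART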

Using SDM blocks with separate Docker repositories

By default, the hyperoperator looks for the block images in a single repository, using tags in the following format:

0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks:vpc...36.0.2
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks:segmentation...36.0.2
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks:python_code...36.0.2
...

If for some reason you can't upload all of the images in the format above and you want to store each block in a separate repository, you need to instruct the hyperoperator accordingly by exporting the environment variable HS_SEPARATE_BLOCK_REPOS=True to the hyperoperator pod. This tells the hyperoperator to look for the images in separate repositories instead of a single one.

The format that the operator will then expect is:

0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/vpc:36.0.2
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/segmentation:36.0.2
0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/python_code:36.0.2
...
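
As an illustration, a block image could be moved into this per-repository layout with standard Docker commands; the source tag below is hypothetical, and the registry host is taken from the examples above.

# Re-tag and push a locally available vpc block image (source tag is hypothetical)
docker tag sdm_blocks:vpc-36.0.2 0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/vpc:36.0.2
docker push 0123456789.dkr.ecr.us-east-1.amazonaws.com/sdm_blocks/vpc:36.0.2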

To set this environment variable for the hyperoperator, you can add the following snippet to your values.yaml file:

operator:
  env:
    HS_SEPARATE_BLOCK_REPOS: "True"

Using Universal Folder Listener block

NOTE: In Kubernetes clusters, a Universal Folder Listener is only supported with volumes that allow the ReadWriteMany access mode, such as NFS, GlusterFS, or CephFS; see the full access-mode matrix in the Kubernetes documentation.

Prerequisites

  • helm chart version >= 8.7.0

  • hyperoperator version >= 5.5.0

Expected downtime from this change

  • restart of hyperoperator

  • update and restart of the backend deployment

  • update and restart of the frontend deployment

Configuration

Modify your values.yaml to add the following example configuration. It adds the required volume to the Universal Folder Listener block.

blocks:
  volumes:
  - name: my-nfs-name
    nfs:
      server: fs-test.efs.us-east-1.amazonaws.com
      path: /media
  volumeMounts:
  - mountPath: /var/www/forms/forms/input
    name: my-nfs-name

IMPORTANT: If a mountPath different from /var/www/forms/forms/input is specified, then the FS_INPUT_PATH environment variable has to be passed to app.dotenv with the same path value.
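
For example, a custom mount path could be wired up as in the sketch below; the path /mnt/ufl-input is an assumption used only for illustration.

blocks:
  volumes:
  - name: my-nfs-name
    nfs:
      server: fs-test.efs.us-east-1.amazonaws.com
      path: /media
  volumeMounts:
  - mountPath: /mnt/ufl-input  # custom mount path (example)
    name: my-nfs-name

app:
  dotenv:
  - name: FS_INPUT_PATH
    value: /mnt/ufl-input  # must match the custom mountPath above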

Troubleshooting

Running preflight (pre-installation) checks

Preflight checks run several queries against the Kubernetes APIs to determine whether your cluster has met the minimum requirements for installing Hyperscience.

Run preflight checks using the following commands, replacing hyperscience with your installation's namespace if it differs:

kubectl config set-context $(kubectl config current-context) --namespace=hyperscience
hsk8s preflight values.yaml

Generating diagnostics bundle

By default, executing the hsk8s support-bundle ... command will collect both application-related and cluster-related data. The result is packed into a diagnostics.tar.gz archive in the directory where the command is executed.

Generating a bundle will:

  • run several queries against the Kubernetes APIs to gather diagnostics information about your installation;

  • fetch some of the pod logs used by the Hyperscience deployments;

  • spin up a pod in the hyperscience namespace and execute some queries against the HS database to collect application information.

Generate the bundle using the following commands, again setting the context to your installation's namespace:

kubectl config set-context $(kubectl config current-context) --namespace=hyperscience
hsk8s support-bundle  -z --token 

The --token flag is necessary to fetch the latest version of the diagnostics.py script from the Cloudsmith repository. The script is stored in the default $HOME/.hsk8s folder. Alternatively, a path to the script can be provided via the --diag-path flag.

If the token is configured in the hsk8s $HOME/config file, the flag is not necessary.

The -z (or --prepare-zd-support) flag splits diagnostics.tar.gz into chunks that fit within Zendesk's maximum file-transfer size. Splitting happens only if the tarball is larger than that limit (49 MB). If the archive is split, you should see files named diagnostics.tar.gz1, diagnostics.tar.gz2, and so on.

Send the resulting files to your Hyperscience support representative.

To extract the bundle, execute the following commands:

cat diagnostics.tar.gz* > diagnostics.tar.gz
tar -xzf diagnostics.tar.gz

Collecting application troubleshooting data

Application data is collected via the diagnostics.py script, which is kept in the $HOME/.hsk8s folder. The script is executed against the Hyperscience database from within a pod in the Kubernetes cluster. After the information is collected, it is copied from the pod to your local machine, and the pod is cleaned up.

If you want to collect only application-related data, use the exclude-support-bundle flag, --xsb.

hsk8s support-bundle  -z --token  --xsb

To control diagnostics.py, ".env" variables can be passed to the pod. These environment variables are passed to the script via the -e flag of hsk8s:

hsk8s support-bundle  -z --token  -e MAX_WORKFLOW_DOWNLOAD_SECONDS=200 -e WORKFLOW_EXPORT_PERIOD_IN_SECS=300

Environment variable options:

DATA_EXPORT_DEFAULT_PERIOD_IN_DAYS (default: 45)
CAPTURE_PII (default: false)
DISTINCT_CORRELATION_IDS_LIMIT (default: 50000)
MAX_WORKFLOW_DOWNLOAD_SECONDS (default: 900)

Less common environment variable options:

Grab only the most recent 1 hour worth of workflows:

WORKFLOW_EXPORT_PERIOD_IN_SECS=3600

Grab one hour of workflows between:

(WORKFLOW_EXPORT_LATEST_DT - WORKFLOW_EXPORT_PERIOD_IN_SECS, WORKFLOW_EXPORT_LATEST_DT)
WORKFLOW_EXPORT_LATEST_DT=2022-01-11T18:00:00-00:00 WORKFLOW_EXPORT_PERIOD_IN_SECS=3600
WORKFLOW_EXPORT_LATEST_DT=2022-01-11T20:00:00-08:00 WORKFLOW_EXPORT_PERIOD_IN_SECS=3600

To collect the application-diagnostics data, a Kubernetes Job is deployed into the cluster. If the pod created by the Job cannot start a container within 5 minutes, the Job times out with an error. This period can be configured via the --diag-run-timeout=300 flag; if 5 minutes are not enough to initialize the pod, increase that value. After the data is collected in the pod's ephemeral storage, it needs to be copied from the pod to your local machine. There is a timeout of 50 seconds for the collector container to start up; to modify it, use the --diag-collect-timeout=50 flag.
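
For example, to give the diagnostics pod more time to initialize and the collector container more time to start, you could combine the two flags as follows; the values are illustrative, and the Cloudsmith token is assumed to be configured in the hsk8s config file as described above.

hsk8s support-bundle -z --diag-run-timeout=600 --diag-collect-timeout=120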

By default the following data is collected by the diagnostics.py script:

  • capture_system_settings_export

  • capture_layout_release

  • capture_usage_report

  • capture_trained_models_metadata

  • capture_workflows_definitions

  • capture_threshold_audits

  • capture_machine_audit_logs

  • capture_attached_trainers

  • capture_latest_trainer_runs

  • capture_latest_jobs

  • capture_init_entries

  • capture_health_records

  • capture_health_statistics_records

  • capture_workflow_instances

  • capture_workflow_dsls

  • capture_task_counts

  • capture_workflow_counts

  • capture_db_info

To explicitly exclude one or more of the actions listed above, use the --actions flag combined with -b/--blacklist-actions:

hsk8s support-bundle  -z --token  --actions=capture_workflow_counts,capture_workflow_instances -b

To include only specific actions:

hsk8s support-bundle  -z --token  --actions=capture_workflow_counts,capture_workflow_instances

Collecting support-bundle

If you want to collect only the support bundle (cluster-related data and pod logs), exclude the application data with the --xappdata flag. A Cloudsmith token is not necessary to collect the support bundle.

hsk8s support-bundle  -z --xappdata

To limit the number of log lines (or the age of the logs) collected per pod, use one of the following flags. By default, a maximum of 10,000 lines is collected per pod. Valid time units for --max-age are 'ns', 'us' (or 'µs'), 'ms', 's', 'm', and 'h'.

hsk8s support-bundle  -z --max-age=12h
hsk8s support-bundle  -z --max-lines=100

For very large deployments (more than 500 pods), a flag must be used to select the pods whose logs are collected, either by label or by limiting collection to the essential pods. If neither is specified, hsk8s exits with an error.

hsk8s support-bundle  -z --label app.kubernetes.io/component=
hsk8s support-bundle  -z --essential-only

Pod does not start

Pod stuck in Init:ImagePullBackOff status

A pod can get stuck in Init for several reasons; however, the ImagePullBackOff status tells us a lot without further investigation: Kubernetes is having trouble pulling the image required to start a container in the pod. Below are some example steps to find the exact reason.

Get the list of pods

$ kubectl get pod

The following is a sample output:

$ kubectl get pod
NAME                                          READY   STATUS                       RESTARTS   AGE
hyperscience-backend-8659c86b7b-zcn4t         0/7     Init:ImagePullBackOff        0          7m58s
hyperscience-frontend-878f45f6f-wjf8p         0/2     Pending                      0          7m58s
hyperscience-hyperoperator-5f7d6f7d95-489r7   0/1     CreateContainerConfigError   0          7m58s

Describe a pod stuck in Init:ImagePullBackOff.

$ kubectl describe pod hyperscience-backend-8659c86b7b-zcn4t

The following is a sample output:

Name:         hyperscience-backend-8659c86b7b-zcn4t
Namespace:    hyperscience
Priority:     0
Node:         ip-10-10-93-158.ec2.internal/10.10.93.158
Labels:       app.kubernetes.io/component=backend
              app.kubernetes.io/instance=hyperscience
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=hyperscience
              app.kubernetes.io/part-of=hyperscience
              app.kubernetes.io/version=34.0.4
              helm.sh/chart=hyperscience-5.0.0
              pod-template-hash=8659c86b7b
...
...
Events:
  Type     Reason     Age                    From                                   Message
  ----     ------     ----                   ----                                   -------
  Normal   Scheduled  7m17s                  default-scheduler                      Successfully assigned hyperscience/hyperscience-backend-8659c86b7b-zcn4t to ip-10-10-93-158.ec2.internal
  Normal   Pulling    5m50s (x4 over 7m16s)  kubelet, ip-10-10-93-158.ec2.internal  Pulling image "1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.4"
  Warning  Failed     5m50s (x4 over 7m16s)  kubelet, ip-10-10-93-158.ec2.internal  Failed to pull image "1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.4": rpc error: code = Unknown desc = Error response from daemon: manifest for 1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.4 not found: manifest unknown: Requested image not found
  Warning  Failed     5m50s (x4 over 7m16s)  kubelet, ip-10-10-93-158.ec2.internal  Error: ErrImagePull
  Warning  Failed     5m28s (x6 over 7m16s)  kubelet, ip-10-10-93-158.ec2.internal  Error: ImagePullBackOff
  Normal   BackOff    2m7s (x20 over 7m16s)  kubelet, ip-10-10-93-158.ec2.internal  Back-off pulling image "1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.4"

The pod description indicates that the kubelet cannot pull the specified forms image from the container registry. Check manually whether the image can actually be found in the registry; if not, use our hsk8s guide to stream the required image to your registry.
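
For example, with the ECR registry from the sample output above, you could check for the image tag with the AWS CLI; the repository name and tag are taken from the events, and your AWS credentials are assumed to have access to the registry account.

aws ecr describe-images --repository-name forms-w2whd9mv --image-ids imageTag=34.0.4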

Pod stuck in Init status

Another reason for a pod being stuck in Init could be a failing init container. Let's look at such an example.

Get the list of pods

$ kubectl get pod

The following is a sample output:

$ kubectl get pod
NAME                                          READY   STATUS             RESTARTS   AGE
hyperscience-backend-6f694486f-l5bmg          0/7     Init:0/1           0          12m
hyperscience-frontend-546f8f74dc-mq4fr        0/2     Init:0/1           0          12m
hyperscience-hyperoperator-846688864d-vf4mg   1/1     Running            0          12m

Describe a pod stuck in Init

kubectl describe pod hyperscience-backend-6f694486f-l5bmg

The following is a sample output:

Name:         hyperscience-backend-6f694486f-l5bmg
Namespace:    hyperscience
Priority:     0
Node:         ip-10-10-93-158.ec2.internal/10.10.93.158
Labels:       app.kubernetes.io/component=backend
              app.kubernetes.io/instance=hyperscience
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=hyperscience
              app.kubernetes.io/part-of=hyperscience
              app.kubernetes.io/version=34.0.3
              helm.sh/chart=hyperscience-5.0.0
...
...
Events:
  Type     Reason            Age                From                                   Message
  ----     ------            ----               ----                                   -------
  Warning  FailedScheduling  20m (x2 over 20m)  default-scheduler                      0/2 nodes are available: 1 Insufficient memory, 1 node(s) didn't match Pod's node affinity/selector.
  Normal   Scheduled         19m                default-scheduler                      Successfully assigned hyperscience/hyperscience-backend-6f694486f-l5bmg to ip-10-10-93-158.ec2.internal
  Normal   Pulled            19m                kubelet, ip-10-10-93-158.ec2.internal  Container image "1234567890.dkr.ecr.us-east-1.amazonaws.com/forms-w2whd9mv:34.0.3" already present on machine
  Normal   Created           19m                kubelet, ip-10-10-93-158.ec2.internal  Created container init
  Normal   Started           19m                kubelet, ip-10-10-93-158.ec2.internal  Started container init

The description of this pod shows that it was successfully assigned to the node ip-10-10-93-158.ec2.internal. The image was already present, and the pod was started, with the init container running first.

Take a look at the logs of the init container.

$ kubectl logs -f hyperscience-backend-6f694486f-l5bmg -c init
...
...
CommandError: Failed connecting to db. (connection to server at "hyperscience.x1x1x1x1.us-east-1.rds.amazonaws.com" (10.10.61.115), port 5432 failed: FATAL:  password authentication failed for user "my-postgres-role"
connection to server at "hyperscience.x1x1x1x1.us-east-1.rds.amazonaws.com" (10.10.61.115), port 5432 failed: FATAL:  password authentication failed for user "my-postgres-role"
)

The logs indicate that the init container's attempts to authenticate to the RDS database are unsuccessful. Verify that the secret you created while following the Helm Chart guide contains the correct credentials and endpoint.
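
As a quick check, you can list the keys stored in that secret and decode individual values; the secret name hyperscience-platform is taken from the CreateContainerConfigError example later in this article and is assumed to be the same secret, and <key> is a placeholder for the key that holds the database settings.

$ kubectl -n hyperscience get secret hyperscience-platform -o jsonpath='{.data}'
$ kubectl -n hyperscience get secret hyperscience-platform -o jsonpath='{.data.<key>}' | base64 -d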

Pod stuck in Pending status

Get the list of pods

$ kubectl get pod

The following is a sample output:

$ kubectl get pod
NAME                                          READY   STATUS                       RESTARTS   AGE
hyperscience-backend-8659c86b7b-zcn4t         0/7     Init:ImagePullBackOff        0          7m58s
hyperscience-frontend-878f45f6f-wjf8p         0/2     Pending                      0          7m58s
hyperscience-hyperoperator-5f7d6f7d95-489r7   0/1     CreateContainerConfigError   0          7m58s

Describe the pod stuck in Pending status

kubectl describe pod hyperscience-frontend-878f45f6f-wjf8p

The following is a sample output:

Name:           hyperscience-frontend-878f45f6f-wjf8p
Namespace:      hyperscience
Priority:       0
Node:           
Labels:         app.kubernetes.io/component=frontend
                app.kubernetes.io/instance=hyperscience
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=hyperscience
                app.kubernetes.io/part-of=hyperscience
                app.kubernetes.io/version=34.0.4
                helm.sh/chart=hyperscience-5.0.0
...
...

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  0s (x23 over 21m)  default-scheduler  0/2 nodes are available: 1 Insufficient memory, 1 node(s) didn't match Pod's node affinity/selector.

The pod description indicates that the kubernetes scheduler could not find a suitable node to assign the pod to. One of them doesn't have enough available resources to match the pod memory requests and the other one doesn't have the required node selector labels.

The following is example output for a node that has the required label to match the pod's node affinity. The important part here is the hs-component=platform label.

$ kubectl describe node ip-10-10-93-158.ec2.internal
Name:               ip-10-10-93-158.ec2.internal
Roles:              
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m5.2xlarge
                    beta.kubernetes.io/os=linux
                    eks.amazonaws.com/capacityType=ON_DEMAND
                    eks.amazonaws.com/nodegroup=platform
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1b
                    hs-component=platform
                    ...

If this is the reason your pod is not being scheduled, inspect your nodes and add the required label as needed, per our Infrastructure Requirements.
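
For example, using the node from the sample output above, the missing label could be added as follows; the node name comes from this article's examples and should be replaced with your own.

$ kubectl label node ip-10-10-93-158.ec2.internal hs-component=platform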

Operator stuck in CreateContainerConfigError status

Get the list of pods

$ kubectl get pod

The following is a sample output:

$ kubectl get pod
NAME                                          READY   STATUS                       RESTARTS   AGE
hyperscience-backend-8659c86b7b-zcn4t         0/7     Init:ImagePullBackOff        0          7m58s
hyperscience-frontend-878f45f6f-wjf8p         0/2     Pending                      0          7m58s
hyperscience-hyperoperator-5f7d6f7d95-489r7   0/1     CreateContainerConfigError   0          7m58s

Describe the hyperoperator pod stuck in CreateContainerConfigError status

$ kubectl describe pod hyperscience-hyperoperator-5f7d6f7d95-489r7

The following is a sample output:

Name:         hyperscience-hyperoperator-5f7d6f7d95-489r7
Namespace:    hyperscience
Priority:     0
Node:         ip-10-10-93-158.ec2.internal/10.10.93.158
Labels:       app.kubernetes.io/component=hyperoperator
              app.kubernetes.io/instance=hyperscience
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=hyperscience
              app.kubernetes.io/part-of=hyperscience
              app.kubernetes.io/version=34.0.4
              helm.sh/chart=hyperscience-5.0.0
...
...
Events:
  Type     Reason     Age                   From                                   Message
  ----     ------     ----                  ----                                   -------
  Normal   Scheduled  59m                   default-scheduler                      Successfully assigned hyperscience/hyperscience-hyperoperator-5f7d6f7d95-489r7 to ip-10-10-93-158.ec2.internal
  Normal   Pulling    59m                   kubelet, ip-10-10-93-158.ec2.internal  Pulling image "1234567890.dkr.ecr.us-east-1.amazonaws.com/hyperoperator:3.3.4"
  Normal   Pulled     59m                   kubelet, ip-10-10-93-158.ec2.internal  Successfully pulled image "1234567890.dkr.ecr.us-east-1.amazonaws.com/hyperoperator:3.3.4" in 171.682343ms
  Warning  Failed     56m (x12 over 59m)    kubelet, ip-10-10-93-158.ec2.internal  Error: secret "hyperscience-platform" not found
  Normal   Pulled     4m2s (x258 over 59m)  kubelet, ip-10-10-93-158.ec2.internal  Container image "1234567890.dkr.ecr.us-east-1.amazonaws.com/hyperoperator:3.3.4" already present on machine

The pod description indicates that a required Kubernetes secret is missing. Refer to the Helm Chart article for more information about the required secret.
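
A quick way to confirm whether the secret exists in the release namespace (the namespace and secret name are taken from the sample output above):

$ kubectl -n hyperscience get secret hyperscience-platform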

Submissions are stuck

No block pods are present

Submissions are processed by worker containers, which we call blocks. If helm install and helm uninstall are called multiple times in quick succession, the block pods might disappear. If they are missing, the deployment state likely looks like the following:

$ kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
hyperscience-backend-5b5dcd984f-f6p8j                  6/6     Running   0          18m
hyperscience-frontend-56756b55f6-t7rfw                 2/2     Running   0          18m
hyperscience-hyperflow-engine-7bb598699f-tvb52         1/1     Running   0          18m
hyperscience-hyperoperator-666f8fc5db-vgvrv            2/2     Running   0          18m
hyperscience-idp-sync-manager-5b59847777-sdbrf         1/1     Running   0          18m

First, check whether the HyperBlockManager resource is present. If it is, remove its finalizers so that it can be deleted by Kubernetes.

$ kubectl get hyperblockmanagers
NAME           AGE
hyperscience   20m

$ kubectl patch hyperblockmanager hyperscience -p '{"metadata":{"finalizers":[]}}' --type=merge
hyperblockmanager.hyperscience.net/hyperscience patched

Then, recreate it with the following command, which will in turn recreate the blocks.

$ helm upgrade $HS_HELM_RELEASE -f values.yaml $HS_HELM_CHART
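
After the upgrade completes, you can confirm that the HyperBlockManager resource and the block pods have been recreated:

$ kubectl get hyperblockmanagers
$ kubectl get pods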