Enabling Trainers with GPUs in On-Premise Podman Deployments

In recent versions of Hyperscience, we’ve made it possible to classify and extract data from Unstructured documents. However, automating these processes requires more computing resources than our automation capabilities for Structured and Semi-structured documents do.

To process Unstructured documents in on-premise deployments of Hyperscience, you need to add a trainer that has both a GPU (graphics processing unit) and a CPU (central processing unit). GPUs have specialized cores that allow the system to perform multiple computations in parallel, which shortens the complex operations needed to train models for Unstructured documents. When you attach a trainer whose machine has a GPU, you can maximize the benefits of Unstructured Extraction. To learn more about this feature, see Field Identification.

Support for GPUs in Hyperscience deployments

Trainers with GPUs are supported only for the following deployments:

  • Deployments of v37 or later that run on Docker or Podman

    • If RHEL is used, it must be RHEL 8.4 or later.

  • Deployments of v38 or later that run on Kubernetes

Deployments of v36 or earlier are not supported, nor are any deployments that run on RHEL 7.x.

You cannot train models for Structured or Semi-structured documents with a GPU. However, you can train them with a CPU on a machine that has both a GPU and a CPU.

Machines with GPUs are not supported for the Hyperscience application.

This article describes how to enable a trainer with both a GPU and a CPU in an on-premise Podman deployment of Hyperscience. Steps 1-3 must be completed before untarring the Hyperscience bundle on the trainer machine. For more information on setting up the trainer, see Trainer Installation (Production).

1. Make sure your GPU hardware meets the requirements.

To process Unstructured documents in Hyperscience, your trainer needs an NVIDIA GPU. We used the NVIDIA Tesla T4 for our benchmarking. Other NVIDIA GPUs may perform slightly better or worse, depending on cores, RAM, and other factors.

For more information on the T4’s specifications, see TechPowerUp’s NVIDIA Tesla T4. To learn more about other NVIDIA GPUs, see TechPowerUp’s GPU Specs Database.

Machine sizing

We've completed benchmark tests for AWS's g4dn.4xlarge machine, and we recommend that machine or one of comparable size. For more details on this machine, see AWS's Amazon EC2 G4 Instances.

2. Make sure your trainer meets the software compatibility requirements.

There are several software-compatibility considerations to keep in mind when setting up your trainer.

a. Verify that the lspci command is enabled.

To do so:

  1. Install the pciutils package by running the following command on RHEL:

    yum -y install pciutils

  2. Run lspci to make sure the command has been enabled.
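As a minimal sketch, the check above can be scripted so that it prints a clear message instead of a "command not found" error. The helper name have_cmd is hypothetical and used only for illustration; it is not part of pciutils or Hyperscience.

```shell
# have_cmd is a hypothetical helper (not part of pciutils or Hyperscience).
# It succeeds when the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd lspci; then
    echo "lspci is available"
else
    echo "lspci not found; install the pciutils package first" >&2
fi
```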

b. Verify that your GPU supports CUDA.

CUDA is a parallel computing platform and programming model created by NVIDIA. Machine learning often uses CUDA-based libraries, SDKs, and other tools.

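To find out whether your GPU supports CUDA, first list the machine’s NVIDIA devices by running the following command:

lspci | grep -i nvidia

If the command prints nothing, no NVIDIA device was detected. Otherwise, look up the reported model in NVIDIA’s CUDA GPUs - Compute Capability. For more information, see NVIDIA CUDA Installation Guide for Linux.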
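The same lookup can be wrapped in a small script that distinguishes "no NVIDIA device" from "lspci is missing". This is only a sketch: list_nvidia_devices is a hypothetical helper name, and the script does not decide CUDA support by itself. It only prints the device lines you would then compare against NVIDIA's list.

```shell
# list_nvidia_devices is a hypothetical helper: it reads lspci-style output
# on stdin and prints only the lines that mention NVIDIA (case-insensitive).
list_nvidia_devices() {
    grep -i 'nvidia' || true   # "|| true": no match is not treated as an error
}

if command -v lspci >/dev/null 2>&1; then
    devices=$(lspci | list_nvidia_devices)
    if [ -n "$devices" ]; then
        printf '%s\n' "$devices"
    else
        echo "No NVIDIA device reported by lspci" >&2
    fi
else
    echo "lspci not found; install pciutils first" >&2
fi
```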

c. Verify that you have a supported version of Linux.

Follow the instructions in NVIDIA’s NVIDIA CUDA Installation Guide for Linux to check your version of Linux. Then, make sure your Linux version is supported by the latest CUDA Toolkit by reviewing NVIDIA’s NVIDIA CUDA Toolkit Release Notes.
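For a quick local check, the sketch below prints the machine architecture and distribution, and, on RHEL, compares the version against the 8.4 minimum stated earlier in this article. It assumes the distribution provides /etc/os-release; rhel_supported is a hypothetical helper name. It does not replace checking NVIDIA's release notes for CUDA Toolkit support.

```shell
# rhel_supported is a hypothetical helper. It succeeds when a MAJOR.MINOR
# version string is 8.4 or later, the minimum RHEL version stated above.
rhel_supported() {
    major=${1%%.*}
    case "$1" in
        *.*) minor=${1#*.}; minor=${minor%%.*} ;;
        *)   minor=0 ;;
    esac
    [ "$major" -gt 8 ] || { [ "$major" -eq 8 ] && [ "$minor" -ge 4 ]; }
}

uname -m    # machine architecture

if [ -r /etc/os-release ]; then
    # Source in a subshell so the file's variables do not leak into this shell.
    (
        . /etc/os-release
        echo "Distribution: ${NAME:-unknown} ${VERSION_ID:-unknown}"
        if [ "${ID:-}" = "rhel" ]; then
            if rhel_supported "${VERSION_ID:-0}"; then
                echo "RHEL version meets the 8.4 minimum"
            else
                echo "RHEL version is older than 8.4" >&2
            fi
        fi
    )
fi
```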

d. Verify that the system has gcc installed.

The gcc compiler is required for development using the CUDA Toolkit. To make sure it is installed, follow the instructions in NVIDIA’s NVIDIA CUDA Installation Guide for Linux.
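One simple way to confirm that gcc is present is to ask it for its version. The sketch below does that and prints a hint if the compiler is missing; report_compiler is a hypothetical helper name used only for illustration.

```shell
# report_compiler is a hypothetical helper: it prints the first line of
# "<compiler> --version", or fails if the compiler is not on PATH.
report_compiler() {
    if command -v "$1" >/dev/null 2>&1; then
        "$1" --version | head -n 1
    else
        echo "$1 not found" >&2
        return 1
    fi
}

report_compiler gcc || echo "Install gcc before installing the CUDA Toolkit" >&2
```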

e. Verify that the system has the current Kernel headers and development packages installed.

Kernel headers are header files that specify the interface between the Linux kernel and userspace libraries and programs. The CUDA driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well as whenever the driver is rebuilt. For example, if your system is running kernel version 3.17.4-301, the 3.17.4-301 kernel headers and development packages must also be installed.

To install the kernel headers and development packages that match the running kernel, run the following command:

sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

For more information and commands for various Linux distributions, see NVIDIA’s NVIDIA CUDA Installation Guide for Linux.
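To check, without installing anything, whether the matching packages are already present on an RPM-based system such as RHEL, a sketch like the following can help. It assumes the rpm command is available; required_kernel_packages is a hypothetical helper that only builds the package names from a kernel release string.

```shell
# required_kernel_packages is a hypothetical helper: given a kernel release
# string, it prints the package names that must match that kernel.
required_kernel_packages() {
    printf 'kernel-devel-%s\nkernel-headers-%s\n' "$1" "$1"
}

if command -v rpm >/dev/null 2>&1; then
    # uname -r prints the release of the running kernel.
    for pkg in $(required_kernel_packages "$(uname -r)"); do
        if rpm -q "$pkg" >/dev/null 2>&1; then
            echo "installed: $pkg"
        else
            echo "missing:   $pkg" >&2
        fi
    done
else
    echo "rpm not found; this check applies to RPM-based distributions" >&2
fi
```

A "missing" line means the dnf install command shown above still needs to be run for that package.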

3. Install the CUDA Driver and NVIDIA Container Toolkit.

  1. Ensure your system meets the prerequisites for the driver installation:

  2. Install the driver by running the following command on RHEL:

    sudo dnf module install nvidia-driver:latest-dkms

  3. Install the container toolkit by completing the steps in the "Installing with Yum or Dnf" section of NVIDIA's Installing the NVIDIA Container Toolkit.

  4. Configure Podman to use NVIDIA devices in the container:

    1. Complete the steps in the "Procedure" section of NVIDIA's Support for Container Device Interface.

    2. Edit /usr/share/containers/containers.conf:

      1. Set NVIDIA as the runtime (i.e., runtime = "nvidia"). If there is any other runtime set, comment it out.

      2. Add nvidia = ["/usr/bin/nvidia-container-runtime"] to [engine.runtimes].

        After completing these steps, your file should look similar to the one shown below.

        runtime = "nvidia"
        ...
        [engine.runtimes]
        ...
        nvidia = [
          "/usr/bin/nvidia-container-runtime",
        ]

  5. (Optional) Reboot the system:

    sudo reboot
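Once the driver is installed (and the system rebooted, if you chose to), a quick sanity check is to ask the driver which GPUs it can see. The sketch below assumes nvidia-smi, the utility that ships with the NVIDIA driver, is on PATH. count_gpus is a hypothetical helper that counts the "GPU N: ..." lines printed by nvidia-smi -L. It checks the host only, not GPU access from inside a Podman container.

```shell
# count_gpus is a hypothetical helper: it reads "nvidia-smi -L" style output
# on stdin and prints how many lines describe a GPU ("GPU 0: ...").
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true   # grep -c still prints 0 on no match
}

if command -v nvidia-smi >/dev/null 2>&1; then
    n=$(nvidia-smi -L 2>/dev/null | count_gpus)
    echo "GPUs visible to the driver: $n"
else
    echo "nvidia-smi not found; the NVIDIA driver may not be installed" >&2
fi
```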

4. Change the model type for the layout you want to use with Unstructured Extraction.

Before you can use your GPU-based trainer, decide which layout you’d like to use with Unstructured Extraction, and then change the model type for that layout to UNSTRUCTURED_EXTRACTION. The system will use the GPU for Unstructured Extraction of data from that layout’s documents.

To do so:

  1. Go to /admin/form_extraction/template/.

  2. Find the record for the layout you’d like to use Unstructured Extraction with, and click on its UUID.

  3. In the Flex engine type for training drop-down list, select UNSTRUCTURED_EXTRACTION.

  4. Click Save.

Note that there is no indication on the Trainers page (Administration > Trainers) that the trainer you just enabled has a GPU. However, you can run run.sh check-gpu on a trainer machine to determine whether it has an NVIDIA GPU available.