Infrastructure Requirements

Specifications

Internet browsers

  • Hyperscience v31 and earlier: Internet Explorer 11 and the latest version of Google Chrome

  • Hyperscience v32 to v35: Internet Explorer 11 and the latest versions of Google Chrome and Microsoft Edge

  • Hyperscience v36 and later: the latest versions of Google Chrome and Microsoft Edge

For the best possible user experience, we recommend browser dimensions of at least 1280 x 720 pixels.

Servers

Operating System

  • Ubuntu:

    • Hyperscience v39.2 and earlier: Ubuntu 16.04 and later

    • Hyperscience v40: Ubuntu 18 and later

  • RHEL:

    • Hyperscience v28 and earlier: RHEL 7, 7.5, 7.7, and 7.8

    • Hyperscience v30-v32: RHEL 7, 7.5, 7.7, 7.8, and 7.9

    • Hyperscience v33-v34: RHEL 7, 7.5, 7.7, 7.8, 7.9, 8.4, and 8.5

    • Hyperscience v35-v39.0.8: RHEL 7, 7.5, 7.7, 7.8, 7.9, 8.4, 8.5, and later 8.x versions

    • Hyperscience v39.0.9-v39.2: RHEL 7, 7.5, 7.7, 7.8, 7.9, 8.4, 8.5, later 8.x versions, and 9

    • Hyperscience v40: RHEL 8.4, 8.5, later 8.x versions, and 9

The supported container environments for each operating system are listed below.

  • RHEL 7.9 and earlier:

    • Hyperscience v37-v39.2 with trainers with GPUs: Docker 19.03 or later

    • All other configurations of Hyperscience v37-v39.2: Docker 1.13 or later

  • RHEL 8.4 and later:

    • Podman 3.3.1 or later

  • Ubuntu 16.04 (LTS) and later 16.x versions:

    • Hyperscience v37-v39.2 with trainers with GPUs: Docker 19.03 or later

    • All other configurations of Hyperscience v37-v39.2: Docker 1.13 or later

  • Ubuntu 18 and later:

    • Hyperscience v37 and later with trainers with GPUs: Docker 19.03 or later

    • All other configurations: Docker 1.13 or later

Note the following:

  • The container environment should be installed on all machines.

  • The container environment may be named:

    • “docker-latest” or “docker” if you are using Docker.

    • “podman” if you are using Podman.

  • The container environment’s preferred storage driver is overlay2.
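To confirm that a machine meets these requirements, you can query the runtime directly. The short Python sketch below is illustrative only (it is not part of the product); it assumes the docker CLI is on the PATH, prints the Docker version and storage driver, and warns if the driver is not overlay2. The field names differ slightly for podman info.

    # Illustrative check only; assumes the `docker` CLI is on the PATH.
    import json
    import subprocess

    def check_docker() -> None:
        # `docker info` can emit its full report as JSON.
        out = subprocess.check_output(
            ["docker", "info", "--format", "{{json .}}"], text=True
        )
        info = json.loads(out)
        print("Version:", info.get("ServerVersion"))
        print("Storage driver:", info.get("Driver"))
        if info.get("Driver") != "overlay2":
            print("Warning: overlay2 is the preferred storage driver.")

    check_docker()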

We support any Docker distribution that meets the requirements above. Examples include:

  • Docker Community Edition (CE):

    • In v30 and later, Hyperscience does not support Docker installed via the Snap application package or the Snap Store. Hyperscience does not restrict where you run the installation command or unpack the bundle, but Snap-installed Docker has tighter security permissions and only allows mounting Docker containers if the install path is under /home. To learn more about the Snap application package and Snap Store, see Snapcraft’s Install Docker on Ubuntu and Snap Store.

  • Docker Enterprise Edition (EE):

    • Docker EE can be purchased from Docker.

Hyperscience requires Docker Container Runtime, which we refer to as “Docker” in the documentation. Note that Docker offers other products, such as Docker Desktop and premium support, that are licensed by Docker but are not required for using Hyperscience. For any Docker licensing and support arrangements, contact Docker’s support team.

Local storage

  • Each VM needs to come with at least 150 GB of local storage. Additional space is recommended to accommodate future expansion and operational flexibility.

  • At least 90 GB of storage on the volume designated for downloading, extracting, and deploying the application, typically the root (/) volume.

  • At least 60 GB on the volume that Docker or Podman is configured to use for the application image, typically /var.
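As a quick sanity check, free space on these volumes can be verified with the Python standard library. The sketch below is illustrative only; the mount points are the typical defaults named above, so adjust them to your layout.

    # Illustrative free-space check of the typical default volumes.
    import shutil

    for path, needed_gb in [("/", 90), ("/var", 60)]:
        free_gb = shutil.disk_usage(path).free / 1e9
        status = "OK" if free_gb >= needed_gb else "INSUFFICIENT"
        print(f"{path}: {free_gb:.0f} GB free (need {needed_gb} GB) -> {status}")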

CPU

  • Intel x86_64 is a requirement.

  • ARM is not supported.

VM CPU cores

  • The system requires a minimum of 8 CPU cores in each VM.

Note that in this article, we use the term “CPU cores” for:

  • threads on Intel processors with enabled Hyper-Threading, and

  • virtual CPUs (vCPUs) on compute instances in cloud providers (e.g., AWS).

For example, a VM using 4 physical cores on an Intel processor with enabled Hyper-Threading would be using 8 logical cores (i.e., 8 threads) and is considered to have 8 cores. An AWS EC2 instance with 8 vCPUs is considered to have 8 cores.
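Because logical cores are what count here, a quick way to check a VM against the 8-core minimum is os.cpu_count(), which reports logical cores (threads or vCPUs). A minimal sketch:

    # os.cpu_count() reports *logical* cores, matching this article's usage.
    import os

    cores = os.cpu_count() or 0
    print(f"Logical cores: {cores}")
    if cores < 8:
        print("Below the 8-core minimum for a Hyperscience VM.")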

RAM

  • The system requires a minimum of 32 GB per VM.

Permissions

The Hyperscience application does not support the use of fapolicyd in Podman-based deployments. It does support SELinux, which provides similar security measures. 

Trainer

The Hyperscience Trainer runs separately from the main application and communicates with it via the API. The trainer handles select long-running tasks and very large file uploads and downloads that might otherwise negatively impact document-processing time.

To learn more, see Trainer Installation.

Storage

Use local storage with the trainer. Do not use shared storage, especially if you have multiple trainers of the same version. Using shared storage may cause data to be overwritten and training jobs to fail.

VM CPU cores

We require 16 CPU cores in each trainer VM if you are processing Semi-structured documents. With only 8 cores, you can expect 60-70% longer training times and an increased risk of out-of-memory errors during training, particularly on datasets with longer, denser documents.

RAM

The trainer requires 64 GB of RAM in each trainer VM, which maximizes the performance of the 16-core CPUs described above.

Database

Supported Options

  • PostgreSQL (community edition or enterprise edition (EnterpriseDB Postgres)):

    • Hyperscience v28: PostgreSQL 9.5, 10.x, and 12.x

    • Hyperscience v30-v33.1.8: PostgreSQL 10.x and 12.x

    • Hyperscience v33.1.9-v34: PostgreSQL 10.x, 12.x, and 13.x

    • Hyperscience v35-v37: PostgreSQL 10.x, 12.x, 13.x, and 14.x

    • Hyperscience v38-v39.2: PostgreSQL 12.x, 13.x, and 14.x

    • Hyperscience v40 and later: PostgreSQL 13.x and 14.x

  • Amazon RDS for PostgreSQL

  • Oracle:

    • Hyperscience v28 or earlier: Oracle 12 with DBMS_ALERT privileges

    • Hyperscience v30-v31: Oracle 12 and 19c, both requiring DBMS_ALERT privileges

    • Hyperscience v32-v33: Oracle 12.2 and 19c, both requiring DBMS_ALERT privileges

    • Hyperscience v34 or later: Oracle 19c with DBMS_ALERT privileges

  • Amazon RDS for Oracle

  • Microsoft SQL Server (MSSQL):

    • Hyperscience v30.0.5 or earlier: MSSQL 2016 and 2017

    • Hyperscience v30.0.6 or later: MSSQL 2016, 2017, and 2019

    • Service Broker must be enabled.

  • Amazon RDS for SQL Server

  • Azure SQL Managed Instance

    • Supported in Hyperscience v28 and later

    • Because Azure SQL Database does not support Service Broker, Hyperscience does not support Azure SQL Database.

Note that PostgreSQL and MSSQL are the recommended database options. While we do support the use of Oracle, it is the least used option, and we may remove support for it in the future if usage decreases.

Privileges required

  • The database user requires DDL privileges for table and index creation and modification. These DDL privileges must be retained even after the Hyperscience application has been installed.

HA/DR

  • Architectures for the database are subject to our customers’ policies and are managed by them. Note that our application supports connecting to only a single database host at a time.

Note that migrating existing data between different database types is not supported.

File storage

Supported options

  • An S3 bucket

  • Azure Blob Storage

  • Google Cloud Storage

  • A networked file store (like NFS or CIFS)

HA/DR

  • Architectures for the file store are subject to our customers’ policies and are managed by them.

Load balancer

To achieve HA/DR goals for the application, we encourage customers to deploy the application on multiple VMs and to use a load balancer for web requests. To learn more about using a load balancer to distribute web requests, see Load Balancer.

Note: The application uses the HTTP_HOST value from the request to generate some of its links. Bear this in mind if you are configuring a load balancer.

Load balancer health check URLs

Hyperscience offers a Health Check Status API endpoint, which is designed to help you monitor the health of your system’s components. If any component tested by the Health Check Status API is in an error state, the endpoint will return an error code. If you enter the Health Check Status endpoint as your load balancer's health check URL, an issue in one server will cause all servers to return an error code to the load balancer. This response will prevent traffic from being routed to your entire system, even if healthy servers are available. For this reason, we do not recommend using this endpoint as your load balancer's test of overall system health.

For more information on the Health Check Status endpoint, see our API documentation.
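If you poll the Health Check Status endpoint from a monitoring tool instead of the load balancer, a minimal polling sketch follows. The URL below is a placeholder, not the real path; take the actual endpoint from the API documentation.

    # Monitoring sketch only; HEALTH_URL is a placeholder, not the real path.
    import urllib.error
    import urllib.request

    HEALTH_URL = "https://hyperscience.example.com/healthcheck"  # placeholder

    def system_healthy(url: str = HEALTH_URL) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except urllib.error.URLError:
            return False

    print("healthy" if system_healthy() else "unhealthy")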

Sizing the system

Exact capacity planning requires details of anticipated document flow (see below) and is usually discussed with our Deployments team.

Server capacity

The exact number of VMs required depends on the following factors:

  • Peak hourly throughput

  • Number of fields per page to be collected

  • The split of structured (forms) vs. semi-structured (invoices, checks, paystubs, bills, etc.) documents

Our system scales horizontally: doubling the number of machines doubles its processing capabilities. Hyperscience can also leverage a larger number of cores in a single machine: the system will perform twice as fast on a 16-core 64 GB RAM machine as it will on an 8-core 32 GB RAM machine.

Requirements for RAM and VM CPU cores

To ensure optimal performance, the system requires a 1:4 ratio between the number of CPU cores in a VM and the number of gigabytes of RAM in that VM. For example, if a VM has 16 CPU cores, that VM should have 64 GB of RAM.
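The rule is simple enough to express as a one-line check; the following sketch is illustrative:

    # The 1:4 core-to-RAM rule.
    def required_ram_gb(cpu_cores: int) -> int:
        return 4 * cpu_cores  # 1 core : 4 GB of RAM

    print(required_ram_gb(8))   # 32
    print(required_ram_gb(16))  # 64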

Note that burstable-performance machines are not supported. The Hyperscience application is designed to consume 100% of its CPUs’ resources, and burstable-performance machines cannot sustain 100% CPU utilization, which results in system slowness.

Directionally, a single 8-core 32GB RAM machine will process ~15,000 structured pages, with 20 fields per page, in 12 hours. Similarly, a single 8-core 32GB machine will process ~7,000 semi-structured pages, with 20 fields per page, in 12 hours. Large production set-ups might have 4-8 machines, across 2 data centers for high availability.

The following table gives an example of the recommended number of machines based on a few different peak hourly volumes and ratios of Structured vs. Semi-structured documents. The table assumes an 8-core 32GB RAM machine and approximately 20 extracted fields per page.

 

  Document mix                            1,000 pages (peak hourly)   2,000 pages (peak hourly)   4,000 pages (peak hourly)
  100% structured                         1 VM + 1 (Trainer)          2 VMs + 1 (Trainer)         3 VMs + 1 (Trainer)
  50% structured / 50% semi-structured    2 VMs + 1 (Trainer)         3 VMs + 1 (Trainer)         5 VMs + 1 (Trainer)
  100% semi-structured                    2 VMs + 1 (Trainer)         4 VMs + 1 (Trainer)         7 VMs + 1 (Trainer)

Table 1: Number of 8-core 32GB RAM VMs required
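For a back-of-the-envelope estimate, the throughput figures above (~15,000 structured or ~7,000 semi-structured pages per 8-core 32GB RAM VM per 12 hours, at ~20 fields per page) can be turned into a rough VM-count calculator. The sketch below approximately reproduces Table 1; treat exact sizing as a conversation with our Deployments team.

    # Rough estimator derived from the figures above; not an official tool.
    import math

    STRUCTURED_PER_HOUR = 15_000 / 12  # ~1,250 pages/hour per VM
    SEMI_PER_HOUR = 7_000 / 12         # ~583 pages/hour per VM

    def vms_needed(peak_pages_per_hour: float, structured_share: float) -> int:
        structured = peak_pages_per_hour * structured_share
        semi = peak_pages_per_hour * (1 - structured_share)
        return math.ceil(structured / STRUCTURED_PER_HOUR + semi / SEMI_PER_HOUR)

    # 100% semi-structured at 4,000 pages/hour -> 7 VMs (+ 1 trainer), per Table 1.
    print(vms_needed(4_000, structured_share=0.0))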

Trainer capacity

The trainer uses the directory referenced by the HS_PATH environment variable for storage. Ensure that the directory is located on a partition with at least 100GB of storage.

If you plan on enabling Trainer Resiliency, which creates checkpoints for training data and model training, you will need to ensure that there is an additional 6GB of storage available. To learn more about this feature, see Trainer Resiliency.
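A quick way to verify that the partition backing HS_PATH has enough room is sketched below (illustrative only; it falls back to the current directory if HS_PATH is unset):

    # Illustrative free-space check for the trainer's storage partition.
    import os
    import shutil

    hs_path = os.environ.get("HS_PATH", ".")  # "." only as a fallback
    needed_gb = 100 + 6  # +6 GB assumes Trainer Resiliency is enabled
    free_gb = shutil.disk_usage(hs_path).free / 1e9
    print(f"{hs_path}: {free_gb:.0f} GB free (need {needed_gb} GB)")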

Database capacity

The size of the database store depends on:

  • Daily volume of pages

  • Number of fields per page to be collected

  • Retention period before record deletion

    • Configurable by the user. Anywhere between 3 days and 60 days is common.

A typical set-up (15,000 TIFFs per day, deleted after 30 days) requires about 30 GB of database storage. As storage is inexpensive, teams usually provision this with a buffer (100-200 GB).
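Scaling that baseline linearly gives a first approximation for other volumes and retention periods. The estimator below is a sketch under that linear assumption, not an official sizing tool; actual growth also depends on the number of fields per page.

    # Linear scaling from the cited baseline (15,000 pages/day, 30 days ~= 30 GB).
    def estimate_db_gb(pages_per_day: int, retention_days: int) -> float:
        gb_per_page_day = 30 / (15_000 * 30)
        return pages_per_day * retention_days * gb_per_page_day

    print(estimate_db_gb(15_000, 30))  # 30.0 -- provision 100-200 GB for buffer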

File storage capacity

The size of the file store depends on:

  • Daily volume of pages

  • Mix of file sizes and formats

  • Retention period before record deletion

Directionally, a set-up of 15,000 TIFFs per day, deleted after 30 days, requires ~1 TB of file storage.
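The same linear scaling can be applied to the file store, with the caveat that actual usage varies with file sizes and formats; this is a sketch under that assumption.

    # Linear scaling from the cited baseline (15,000 TIFFs/day, 30 days ~= 1 TB).
    def estimate_file_store_tb(pages_per_day: int, retention_days: int) -> float:
        tb_per_page_day = 1 / (15_000 * 30)
        return pages_per_day * retention_days * tb_per_page_day

    print(estimate_file_store_tb(30_000, 30))  # ~2.0 TB at double the volume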