Infrastructure Requirements

Specifications

Internet browsers

  • Hyperscience v31 and earlier: Internet Explorer 11 and the latest version of Google Chrome

  • Hyperscience v32 to v35: Internet Explorer 11 and the latest versions of Google Chrome and Microsoft Edge

  • Hyperscience v36 and later: the latest versions of Google Chrome and Microsoft Edge

For the best possible user experience, we recommend browser dimensions of at least 1280 x 720 pixels.

Servers

Operating System

  • Ubuntu:

    • Hyperscience v39.2 and earlier: Ubuntu 16.04 and later

    • Hyperscience v40: Ubuntu 18 and later

  • RHEL:

    • Hyperscience v28 and earlier: RHEL 7, 7.5, 7.7, and 7.8

    • Hyperscience v30-v32: RHEL 7, 7.5, 7.7, 7.8, and 7.9

    • Hyperscience v33-v34: RHEL 7, 7.5, 7.7, 7.8, 7.9, 8.4, and 8.5

    • Hyperscience v35-v39.0.8: RHEL 7, 7.5, 7.7, 7.8, 7.9, 8.4, 8.5, and later 8.x versions

    • Hyperscience v39.0.9-v39.2: RHEL 7, 7.5, 7.7, 7.8, 7.9, 8.4, 8.5, later 8.x versions, and 9

    • Hyperscience v40: RHEL 8.4, 8.5, later 8.x versions, and 9

The supported container environments for each operating system are listed below.

  • RHEL 7.9 and earlier:

    • Hyperscience v37-v39.2 with trainers with GPUs: Docker 19.03 or later

    • All other configurations of Hyperscience v37-v39.2: Docker 1.13 or later

  • RHEL 8.4 and later:

    • Podman 3.3.1 or later

  • Ubuntu 16.04 (LTS) and later 16.x versions:

    • Hyperscience v37-v39.2 with trainers with GPUs: Docker 19.03 or later

    • All other configurations of Hyperscience v37-v39.2: Docker 1.13 or later

  • Ubuntu 18 and later:

    • Hyperscience v37 and later with trainers with GPUs: Docker 19.03 or later

    • All other configurations: Docker 1.13 or later

Note the following:

  • The container environment should be installed on all machines.

  • The container environment may be named:

    • “docker-latest” or “docker” if you are using Docker.

    • “podman” if you are using Podman.

  • The container environment’s preferred storage driver is overlay2.
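To confirm that a machine meets these requirements, you can query the runtime directly. The short Python sketch below is illustrative only (it is not part of the product); it assumes the docker CLI is on the PATH, prints the Docker version and storage driver, and warns if the driver is not overlay2. The field names differ slightly for podman info.

    # Illustrative check only; assumes the `docker` CLI is on the PATH.
    import json
    import subprocess

    def check_docker() -> None:
        # `docker info` can emit its full report as JSON.
        out = subprocess.check_output(
            ["docker", "info", "--format", "{{json .}}"], text=True
        )
        info = json.loads(out)
        print("Version:", info.get("ServerVersion"))
        print("Storage driver:", info.get("Driver"))
        if info.get("Driver") != "overlay2":
            print("Warning: overlay2 is the preferred storage driver.")

    check_docker()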

We support any Docker distribution that meets the requirements above. Examples include:

  • Docker Community Edition (CE):

    • In v30 and later, Hyperscience does not support Docker installed via the Snap application package or the Snap Store. Hyperscience does not restrict where you run the installation command or unpack the bundle, but Snap-installed Docker has tighter security permissions and only allows mounting Docker containers if the install path is under /home. To learn more about the Snap application package and Snap Store, see Snapcraft’s Install Docker on Ubuntu and Snap Store.

  • Docker Enterprise Edition (EE):

    • Docker EE can be purchased from Docker.

Hyperscience requires Docker Container Runtime, which we refer to as “Docker” in the documentation. Note that Docker offers other products, such as Docker Desktop and premium support, that are licensed by Docker but are not required for using Hyperscience. For any Docker licensing and support arrangements, contact Docker’s support team.

Local storage

  • Each VM needs to come with at least 150 GB of local storage. Additional space is recommended to accommodate future expansion and operational flexibility.

  • At least 90 GB of storage on the volume designated for downloading, extracting, and deploying the application, typically the root (/) volume.

  • At least 60 GB on the volume that Docker or Podman is configured to use for the application image, typically /var.
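As a quick sanity check, free space on these volumes can be verified with the Python standard library. The sketch below is illustrative only; the mount points are the typical defaults named above, so adjust them to your layout.

    # Illustrative free-space check of the typical default volumes.
    import shutil

    for path, needed_gb in [("/", 90), ("/var", 60)]:
        free_gb = shutil.disk_usage(path).free / 1e9
        status = "OK" if free_gb >= needed_gb else "INSUFFICIENT"
        print(f"{path}: {free_gb:.0f} GB free (need {needed_gb} GB) -> {status}")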

CPU

  • Intel x86_64 is a requirement.

  • ARM is not supported.

VM CPU cores

  • The system requires a minimum of 8 CPU cores in each VM.

Note that in this article, we use the term “CPU cores” for:

  • threads on Intel processors with enabled Hyper-Threading, and

  • virtual CPUs (vCPUs) on compute instances in cloud providers (e.g., AWS).

For example, a VM using 4 physical cores on an Intel processor with enabled Hyper-Threading would be using 8 logical cores (i.e., 8 threads) and is considered to have 8 cores. An AWS EC2 instance with 8 vCPUs is considered to have 8 cores.
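Because logical cores are what count here, a quick way to check a VM against the 8-core minimum is os.cpu_count(), which reports logical cores (threads or vCPUs). A minimal sketch:

    # os.cpu_count() reports *logical* cores, matching this article's usage.
    import os

    cores = os.cpu_count() or 0
    print(f"Logical cores: {cores}")
    if cores < 8:
        print("Below the 8-core minimum for a Hyperscience VM.")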

RAM

  • The system requires a minimum of 32 GB per VM.

Permissions

The Hyperscience application does not support the use of fapolicyd in Podman-based deployments. It does support SELinux, which provides similar security measures. 

Trainer

The Hyperscience Trainer runs separately from the main application and communicates with it via the API. The trainer handles select long-running tasks and very large file uploads and downloads that might otherwise negatively impact document-processing time.

To learn more, see Trainer Installation.

Storage

Use local storage with the trainer. Do not use shared storage, especially if you have multiple trainers of the same version. Using shared storage may cause data to be overwritten and training jobs to fail.

VM CPU cores

We require 16 CPU cores in each trainer VM if you are processing Semi-structured documents. With only 8 cores, you can expect 60-70% longer training times and an increased risk of out-of-memory errors during training, particularly on datasets with longer, denser documents.

RAM

The trainer requires 64 GB of RAM in each trainer VM, which maximizes the performance of the 16-core CPUs described above.

Database

Supported Options

  • PostgreSQL (community edition or enterprise edition (EnterpriseDB Postgres)):

    • Hyperscience v28: PostgreSQL 9.5, 10.x, and 12.x

    • Hyperscience v30-v33.1.8: PostgreSQL 10.x and 12.x

    • Hyperscience v33.1.9-v34: PostgreSQL 10.x, 12.x, and 13.x

    • Hyperscience v35-v37: PostgreSQL 10.x, 12.x, 13.x, and 14.x

    • Hyperscience v38-v39.2: PostgreSQL 12.x, 13.x, and 14.x

    • Hyperscience v40 and later: PostgreSQL 13.x and 14.x

  • Amazon RDS for PostgreSQL

  • Oracle:

    • Hyperscience v28 or earlier: Oracle 12 with DBMS_ALERT privileges

    • Hyperscience v30-v31: Oracle 12 and 19c, both requiring DBMS_ALERT privileges

    • Hyperscience v32-v33: Oracle 12.2 and 19c, both requiring DBMS_ALERT privileges

    • Hyperscience v34 or later: Oracle 19c with DBMS_ALERT privileges

  • Amazon RDS for Oracle

  • Microsoft SQL Server (MSSQL):

    • Hyperscience v30.0.5 or earlier: MSSQL 2016 and 2017

    • Hyperscience v30.0.6 or later: MSSQL 2016, 2017, and 2019

    • Service Broker must be enabled.

  • Amazon RDS for SQL Server

  • Azure SQL Managed Instance

    • Supported in Hyperscience v28 and later

    • Because Azure SQL Database does not support Service Broker, Hyperscience does not support Azure SQL Database.

Note that PostgreSQL and MSSQL are the recommended database options. While we do support the use of Oracle, it is the least used option, and we may remove support for it in the future if usage decreases.

Privileges required

  • The database user requires DDL privileges for table and index creation and modification. These DDL privileges must be retained even after the Hyperscience application has been installed.

HA/DR

  • Architectures for the database are subject to our customers’ policies and are managed by them. Note that our application supports connecting to only a single database host at a time.

Note that migrating existing data between different database types is not supported.

File storage

Supported options

  • An S3 bucket

  • Azure Blob Storage

  • Google Cloud Storage

  • A networked file store (like NFS or CIFS)

HA/DR

  • Architectures for the file store are subject to our customers’ policies and are managed by them.

Load balancer

To achieve HA/DR goals for the application, we encourage customers to deploy the application on multiple VMs and to use a load balancer for web requests. To learn more about using a load balancer to distribute web requests, see Load Balancer.

Note: The application uses the HTTP_HOST value from the request to generate some of its links. Bear this in mind if you are configuring a load balancer.

Load balancer health check URLs

Hyperscience offers a Health Check Status API endpoint, which is designed to help you monitor the health of your system’s components. If any component tested by the Health Check Status API is in an error state, the endpoint will return an error code. If you enter the Health Check Status endpoint as your load balancer's health check URL, an issue in one server will cause all servers to return an error code to the load balancer. This response will prevent traffic from being routed to your entire system, even if healthy servers are available. For this reason, we do not recommend using this endpoint as your load balancer's test of overall system health.

For more information on the Health Check Status endpoint, see our API documentation.
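If you poll the Health Check Status endpoint from a monitoring tool instead of the load balancer, a minimal polling sketch follows. The URL below is a placeholder, not the real path; take the actual endpoint from the API documentation.

    # Monitoring sketch only; HEALTH_URL is a placeholder, not the real path.
    import urllib.error
    import urllib.request

    HEALTH_URL = "https://hyperscience.example.com/healthcheck"  # placeholder

    def system_healthy(url: str = HEALTH_URL) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status == 200
        except urllib.error.URLError:
            return False

    print("healthy" if system_healthy() else "unhealthy")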

Sizing the system

Exact capacity planning requires details of anticipated document flow (see below) and is usually discussed with our Deployments team.

Server capacity

The exact number of VMs required depends on the following factors:

  • Peak hourly throughput

  • Number of fields per page to be collected

  • The split of structured (forms) vs. semi-structured (invoices, checks, paystubs, bills, etc.) documents

Our system scales horizontally: doubling the number of machines doubles its processing capabilities. Hyperscience can also leverage a larger number of cores in a single machine: the system will perform twice as fast on a 16-core 64 GB RAM machine as it will on an 8-core 32 GB RAM machine.

Requirements for RAM and VM CPU cores

To ensure optimal performance, the system requires a 1:4 ratio between the number of CPU cores in a VM and the number of gigabytes of RAM in that VM. For example, if a VM has 16 CPU cores, that VM should have 64 GB of RAM.
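The rule is simple enough to express as a one-line check; the following sketch is illustrative:

    # The 1:4 core-to-RAM rule.
    def required_ram_gb(cpu_cores: int) -> int:
        return 4 * cpu_cores  # 1 core : 4 GB of RAM

    print(required_ram_gb(8))   # 32
    print(required_ram_gb(16))  # 64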

Note that burstable-performance machines are not supported. The Hyperscience application is designed to consume 100% of its CPUs’ resources, and burstable-performance machines cannot sustain 100% CPU utilization, which results in system slowness.

Directionally, a single 8-core 32GB RAM machine will process ~15,000 structured pages, with 20 fields per page, in 12 hours. Similarly, a single 8-core 32GB machine will process ~7,000 semi-structured pages, with 20 fields per page, in 12 hours. Large production set-ups might have 4-8 machines, across 2 data centers for high availability.

The following table gives an example of the recommended number of machines based on a few different peak hourly volumes and ratios of Structured vs. Semi-structured documents. The table assumes an 8-core 32GB RAM machine and approximately 20 extracted fields per page.

 

  Document mix                            1,000 pages (peak hourly)   2,000 pages (peak hourly)   4,000 pages (peak hourly)
  100% structured                         1 VM + 1 (Trainer)          2 VMs + 1 (Trainer)         3 VMs + 1 (Trainer)
  50% structured / 50% semi-structured    2 VMs + 1 (Trainer)         3 VMs + 1 (Trainer)         5 VMs + 1 (Trainer)
  100% semi-structured                    2 VMs + 1 (Trainer)         4 VMs + 1 (Trainer)         7 VMs + 1 (Trainer)

Table 1: Number of 8-core 32GB RAM VMs required
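For a back-of-the-envelope estimate, the throughput figures above (~15,000 structured or ~7,000 semi-structured pages per 8-core 32GB RAM VM per 12 hours, at ~20 fields per page) can be turned into a rough VM-count calculator. The sketch below approximately reproduces Table 1; treat exact sizing as a conversation with our Deployments team.

    # Rough estimator derived from the figures above; not an official tool.
    import math

    STRUCTURED_PER_HOUR = 15_000 / 12  # ~1,250 pages/hour per VM
    SEMI_PER_HOUR = 7_000 / 12         # ~583 pages/hour per VM

    def vms_needed(peak_pages_per_hour: float, structured_share: float) -> int:
        structured = peak_pages_per_hour * structured_share
        semi = peak_pages_per_hour * (1 - structured_share)
        return math.ceil(structured / STRUCTURED_PER_HOUR + semi / SEMI_PER_HOUR)

    # 100% semi-structured at 4,000 pages/hour -> 7 VMs (+ 1 trainer), per Table 1.
    print(vms_needed(4_000, structured_share=0.0))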

Trainer capacity

The trainer uses the directory referenced by the HS_PATH environment variable for storage. Ensure that the directory is located on a partition with at least 100GB of storage.

If you plan on enabling Trainer Resiliency, which creates checkpoints for training data and model training, you will need to ensure that there is an additional 6GB of storage available. To learn more about this feature, see Trainer Resiliency.
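A quick way to verify that the partition backing HS_PATH has enough room is sketched below (illustrative only; it falls back to the current directory if HS_PATH is unset):

    # Illustrative free-space check for the trainer's storage partition.
    import os
    import shutil

    hs_path = os.environ.get("HS_PATH", ".")  # "." only as a fallback
    needed_gb = 100 + 6  # +6 GB assumes Trainer Resiliency is enabled
    free_gb = shutil.disk_usage(hs_path).free / 1e9
    print(f"{hs_path}: {free_gb:.0f} GB free (need {needed_gb} GB)")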

Database capacity

The size of the database store depends on:

  • Daily volume of pages

  • Number of fields per page to be collected

  • Retention period before record deletion

    • Configurable by the user. Anywhere between 3 days and 60 days is common.

A typical set-up (15,000 TIFFs per day, deleted after 30 days) requires about 30 GB of database storage. As storage is inexpensive, teams usually provision this with a buffer (100-200 GB).
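Scaling that baseline linearly gives a first approximation for other volumes and retention periods. The estimator below is a sketch under that linear assumption, not an official sizing tool; actual growth also depends on the number of fields per page.

    # Linear scaling from the cited baseline (15,000 pages/day, 30 days ~= 30 GB).
    def estimate_db_gb(pages_per_day: int, retention_days: int) -> float:
        gb_per_page_day = 30 / (15_000 * 30)
        return pages_per_day * retention_days * gb_per_page_day

    print(estimate_db_gb(15_000, 30))  # 30.0 -- provision 100-200 GB for buffer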

File storage capacity

The size of the file store depends on:

  • Daily volume of pages

  • Mix of file sizes and formats

  • Retention period before record deletion

Directionally, a set-up of 15,000 TIFFs per day, deleted after 30 days, requires ~1 TB of file storage.
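The same linear scaling can be applied to the file store, with the caveat that actual usage varies with file sizes and formats; this is a sketch under that assumption.

    # Linear scaling from the cited baseline (15,000 TIFFs/day, 30 days ~= 1 TB).
    def estimate_file_store_tb(pages_per_day: int, retention_days: int) -> float:
        tb_per_page_day = 1 / (15_000 * 30)
        return pages_per_day * retention_days * tb_per_page_day

    print(estimate_file_store_tb(30_000, 30))  # ~2.0 TB at double the volume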