Specifications
Internet browsers
Hyperscience v31 and earlier: Internet Explorer 11 and the latest version of Google Chrome
Hyperscience v32 to v35: Internet Explorer 11 and the latest versions of Google Chrome and Microsoft Edge
Hyperscience v36 and later: the latest versions of Google Chrome and Microsoft Edge
For the best possible user experience, we recommend browser dimensions of at least 1280 x 720 pixels.
Servers
Operating System
Ubuntu:
Hyperscience v39.2 and earlier: Ubuntu 16.04 and later
Hyperscience v40: Ubuntu 18 and later
RHEL:
Hyperscience v28 and earlier: RHEL 7, 7.5, 7.7, and 7.8
Hyperscience v30-v32: RHEL 7, 7.5, 7.7, 7.8, and 7.9
Hyperscience v33-v34: RHEL 7, 7.5, 7.7, 7.8, 7.9, 8.4, and 8.5
Hyperscience v35-v39.0.8: RHEL 7, 7.5, 7.7, 7.8, 7.9, 8.4, 8.5, and later 8.x versions
Hyperscience v39.0.9-v39.2: RHEL 7, 7.5, 7.7, 7.8, 7.9, 8.4, 8.5, later 8.x versions, and 9
Hyperscience v40: RHEL 8.4, 8.5, later 8.x versions, and 9
Below, you can find a table with the supported container environments for each operating system.
Operating System | Supported container environments |
---|---|
RHEL 7.9 and earlier | |
RHEL 8.4 and later | |
Ubuntu 16.04 (LTS) and later 16.x versions | |
Ubuntu 18 and later | |
Note the following:
The container environment should be installed on all machines.
The container environment can be called:
“docker-latest” or “docker” if you are using Docker.
“podman” if you are using Podman.
The container environment’s preferred storage driver is overlay2.
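As a quick sanity check before installing, the sketch below detects which runtime is present and confirms its storage driver. It is a minimal example that assumes the docker or podman CLI is on the PATH:

```python
import shutil
import subprocess

# Go-template queries for the storage driver differ between the two runtimes.
FORMATS = {"docker": "{{.Driver}}", "podman": "{{.Store.GraphDriverName}}"}

runtime = next((r for r in FORMATS if shutil.which(r)), None)
if runtime is None:
    raise SystemExit("No container runtime found; install Docker or Podman first.")

driver = subprocess.run(
    [runtime, "info", "--format", FORMATS[runtime]],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"{runtime} storage driver: {driver}")
# Podman typically reports its overlayfs driver as "overlay"; Docker as "overlay2".
if driver not in ("overlay2", "overlay"):
    print("Warning: overlay2 is the preferred storage driver.")
```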
We support any Docker distribution that meets the requirements above. Examples include:
Docker Community Edition (CE):
In v30 and later, Hyperscience does not support Docker installed via the Snap application package or the Snap Store. Snap-installed Docker has tighter security permissions and only allows mounting Docker containers if the install path is under /home, whereas Hyperscience does not restrict where users run the installation command and unpack the bundle. To learn more about the Snap application package and Snap Store, see Snapcraft’s Install Docker on Ubuntu and Snap Store.
RHEL 7 – Red Hat's Getting Docker in RHEL 7
Ubuntu – Docker's Install Docker Engine on Ubuntu
Docker CE can be obtained from either OS distribution repository packages or Docker’s download.docker.com repository.
Docker Enterprise Edition (EE):
Docker EE can be purchased from Docker.
Hyperscience requires the Docker Container Runtime, which we refer to as “Docker” in the documentation. Note that Docker offers other products, such as Docker Desktop and premium support, that are licensed by Docker but are not required for using Hyperscience. For any licensing and support arrangements for Docker, contact Docker’s support team.
Local storage
Each VM needs to come with at least 150 GB of local storage; additional space is recommended to accommodate future expansion and operational flexibility. Specifically, each VM needs:
At least 90 GB of storage on the volume designated for downloading, extracting, and deploying the application, typically the root ( / ) volume.
At least 60 GB on the volume Docker or Podman uses to store the application image, typically mounted at /var.
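A minimal pre-flight sketch for these storage requirements, assuming the default volume layout described above (adjust the paths if your layout differs):

```python
import shutil

# Free-space targets per volume, per the guidance above. The paths are
# assumptions based on the typical layout; adjust them to your environment.
REQUIREMENTS_GB = {"/": 90, "/var": 60}

for path, needed in REQUIREMENTS_GB.items():
    free_gb = shutil.disk_usage(path).free / 1024**3
    status = "OK" if free_gb >= needed else "INSUFFICIENT"
    print(f"{path}: {free_gb:.0f} GB free (need {needed} GB) -> {status}")
```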
CPU
Intel x86_64 is a requirement.
ARM is not supported.
VM CPU cores
The system requires a minimum of 8 cores per CPU in each VM.
Note that in this article, we use the term “CPU cores” for:
threads on Intel processors with enabled Hyper-Threading, and
virtual CPUs (vCPUs) on compute instances in cloud providers (e.g., AWS).
For example, a VM using 4 physical cores on an Intel processor with enabled Hyper-Threading would be using 8 logical cores (i.e., 8 threads) and is considered to have 8 cores. An AWS EC2 instance with 8 vCPUs is considered to have 8 cores.
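To illustrate, Python’s os.cpu_count() reports exactly this number — logical processors — so a quick check against the 8-core minimum might look like:

```python
import os

# os.cpu_count() returns logical processors: threads on Hyper-Threading
# Intel CPUs, or vCPUs on cloud instances, matching the definition above.
logical_cores = os.cpu_count()
print(f"Logical cores (threads/vCPUs): {logical_cores}")
if logical_cores is not None and logical_cores < 8:
    print("Warning: below the 8-core minimum per VM.")
```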
RAM
The system requires a minimum of 32 GB per VM.
Permissions
The Hyperscience application does not support the use of fapolicyd in Podman-based deployments. It does support SELinux, which provides similar security measures.
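For Podman-based deployments, a hedged pre-flight sketch along these lines can flag an active fapolicyd service and report the SELinux mode. It assumes a systemd-based host with the getenforce utility installed:

```python
import subprocess

def cmd(args):
    """Run a command and return its stripped stdout ("" if unavailable)."""
    try:
        return subprocess.run(args, capture_output=True, text=True).stdout.strip()
    except FileNotFoundError:
        return ""

# fapolicyd is not supported with Podman-based deployments.
if cmd(["systemctl", "is-active", "fapolicyd"]) == "active":
    print("Warning: fapolicyd is active; not supported with Podman.")

# SELinux, by contrast, is supported.
print(f"SELinux mode: {cmd(['getenforce']) or 'unavailable'}")
```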
Trainer
The Hyperscience Trainer runs separately from the main application and communicates with it via the API. The trainer handles select long-running tasks and very large file downloads and uploads that might otherwise negatively impact document-processing time.
To learn more, see Trainer Installation.
Storage
Use local storage with the trainer. Do not use shared storage, especially if you have multiple trainers of the same version. Using shared storage may cause data to be overwritten and training jobs to fail.
VM CPU cores
We require 16 CPU cores in each trainer VM if you are processing semi-structured documents. With only 8 cores, you can expect 60-70% longer training times and an increased risk of out-of-memory errors during training, particularly on datasets with longer, denser documents.
RAM
The trainer requires 64GB of RAM in each trainer VM, which maximizes the performance of the 16 CPU cores described above.
Database
Supported Options
PostgreSQL (community or enterprise (EnterpriseDB Postgres) editions):
Hyperscience v28: PostgreSQL 9.5, 10.x, and 12.x
Hyperscience v30-v33.1.8: PostgreSQL 10.x and 12.x
Hyperscience v33.1.9-v34: PostgreSQL 10.x, 12.x, and 13.x
Hyperscience v35-v37: PostgreSQL 10.x, 12.x, 13.x, and 14.x
Hyperscience v38-v39.2: PostgreSQL 12.x, 13.x, and 14.x
Hyperscience v40 and later: PostgreSQL 13.x and 14.x
Amazon RDS for PostgreSQL
Oracle:
Hyperscience v28 or earlier: Oracle 12 with DBMS_ALERT privileges
Hyperscience v30-v31: Oracle 12 and 19c, both requiring DBMS_ALERT privileges
Hyperscience v32-v33: Oracle 12.2 and 19c, both requiring DBMS_ALERT privileges
Hyperscience v34 or later: Oracle 19c with DBMS_ALERT privileges
Amazon RDS for Oracle
Microsoft SQL Server (MSSQL):
Hyperscience v30.0.5 or earlier: MSSQL 2016 and 2017
Hyperscience v30.0.6 or later: MSSQL 2016, 2017, and 2019
Service Broker must be enabled (a query to verify this is sketched after this list).
Amazon RDS for SQL Server
Azure SQL Managed Instance
Supported in Hyperscience v28 and later
Because Azure SQL Database does not support Service Broker, Hyperscience does not support Azure SQL Database.
Note that PostgreSQL and MSSQL are the recommended database options. While we do support the use of Oracle, it is the least used option, and we may remove support for it in the future if usage decreases.
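For MSSQL, one way to confirm the Service Broker requirement mentioned above is to query sys.databases, whose is_broker_enabled column reflects the setting. The sketch below uses the pyodbc package with placeholder connection details and a hypothetical database name; substitute your own:

```python
import pyodbc

# Placeholder connection string; replace the driver version, server, and
# credentials with your own.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=db.example.com;DATABASE=master;UID=hs_user;PWD=..."
)
cursor = conn.cursor()
cursor.execute(
    "SELECT is_broker_enabled FROM sys.databases WHERE name = ?",
    "hyperscience",  # hypothetical database name
)
row = cursor.fetchone()
if row is None:
    print("Database not found.")
else:
    print("Service Broker enabled:", bool(row.is_broker_enabled))
```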
Privileges required
The database user must have DDL privileges for table and index creation and modification. These DDL privileges must be retained even after the Hyperscience application has been installed.
HA/DR
Architectures for the database are subject to our customers’ policies and are managed by them — our application supports connecting to a single database host at a time.
Note that migrating existing data between different database types is not supported.
File storage
Supported options
An S3 bucket
Azure Blob Storage
Google Cloud Storage
A networked file store (like NFS or CIFS)
HA/DR
Architectures for the file store are subject to our customers’ policies and are managed by them.
Load balancer
To achieve HA/DR goals for the application, we encourage customers to deploy the application on multiple VMs and to use a load balancer for web requests. To learn more about using a load balancer to distribute web requests, see Load Balancer.
Note: the application uses the HTTP_HOST value from the request to generate some of its links. Bear this in mind when configuring a load balancer.
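One hedged way to verify this is to send a request through the load balancer with the expected Host header and confirm that links in the response use that host rather than the load balancer’s address. The hostnames below are placeholders, and the sketch assumes the requests package:

```python
import requests

resp = requests.get(
    "https://lb.example.com/",  # load balancer address (placeholder)
    headers={"Host": "hyperscience.example.com"},  # expected application host
    timeout=10,
)
# Generated links should reference the application host, not the LB address.
print("Host preserved:", "hyperscience.example.com" in resp.text)
```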
Load balancer health check URLs
Hyperscience offers a Health Check Status API endpoint, which is designed to help you monitor the health of your system’s components. If any component tested by the Health Check Status API is in an error state, the endpoint will return an error code. If you enter the Health Check Status endpoint as your load balancer's health check URL, an issue in one server will cause all servers to return an error code to the load balancer. This response will prevent traffic from being routed to your entire system, even if healthy servers are available. For this reason, we do not recommend using this endpoint as your load balancer's test of overall system health.
For more information on the Health Check Status endpoint, see our API documentation.
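Instead of wiring the endpoint into the load balancer, you can poll it per server from a monitoring job, which surfaces component errors without pulling healthy servers out of rotation. The server names and endpoint path below are placeholders; confirm the exact URL in the API documentation:

```python
import requests

SERVERS = ["https://app-1.example.com", "https://app-2.example.com"]  # placeholders
HEALTH_PATH = "/api/health_check"  # placeholder; see the API documentation

for server in SERVERS:
    try:
        status = requests.get(server + HEALTH_PATH, timeout=5).status_code
    except requests.RequestException as exc:
        status = f"unreachable ({exc})"
    print(f"{server}: {status}")
```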
Sizing the system
Exact capacity planning requires details of anticipated document flow (see below) and is usually discussed with our Deployments team.
Server capacity
The exact number of VMs required depends on the following factors:
Peak hourly throughput
Number of fields per page to be collected
The split of structured (forms) vs. semi-structured (invoices, checks, paystubs, bills, etc.) documents
Our system scales horizontally: doubling the number of machines doubles processing capacity. Hyperscience can also leverage a larger number of cores; the system will perform twice as fast on a 16-core 64GB RAM machine as it will on an 8-core 32GB RAM machine.
Requirements for RAM and VM CPU cores
To ensure optimal performance, the system requires a 1:4 ratio of CPU cores to gigabytes of RAM in each VM. For example, a VM with 16 CPU cores should have 64GB of RAM.
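A minimal sketch for checking this ratio on a Linux VM, reading MemTotal from /proc/meminfo (adjust for other operating systems):

```python
import os

cores = os.cpu_count() or 0
with open("/proc/meminfo") as f:
    # MemTotal is reported in kB, e.g. "MemTotal:  32819648 kB".
    mem_kb = next(int(line.split()[1]) for line in f if line.startswith("MemTotal"))
ram_gb = mem_kb / 1024**2
print(f"{cores} cores, {ram_gb:.0f} GB RAM (1:4 target: {cores * 4} GB)")
```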
Note that burstable-performance machines are not supported. Such machines are:
AWS: T-series. To learn more, see Amazon’s Burstable performance instances.
Azure: B-series. To learn more, see Microsoft’s B-series burstable virtual machine sizes.
Google Cloud: shared core. To learn more, see Google’s General-purpose machine family.
The Hyperscience application is designed to consume 100% of its CPUs’ resources. The nature of burstable-performance machines does not allow them to constantly utilize 100% of the CPUs’ resources, which results in system slowness.
Directionally, a single 8-core 32GB RAM machine will process ~15,000 structured pages, with 20 fields per page, in 12 hours. Similarly, a single 8-core 32GB machine will process ~7,000 semi-structured pages, with 20 fields per page, in 12 hours. Large production set-ups might have 4-8 machines across two data centers for high availability.
The following table gives an example of the recommended number of machines based on a few different peak hourly volumes and ratios of structured vs. semi-structured documents. The table assumes an 8-core 32GB RAM machine and approximately 20 extracted fields per page.
| 1,000 pages (peak hourly) | 2,000 pages (peak hourly) | 4,000 pages (peak hourly) |
---|---|---|---|
100% structured | 1 VM + 1 (Trainer) | 2 VMs + 1 (Trainer) | 3 VMs + 1 (Trainer) |
50% structured, 50% semi-structured | 2 VMs + 1 (Trainer) | 3 VMs + 1 (Trainer) | 5 VMs + 1 (Trainer) |
100% semi-structured | 2 VMs + 1 (Trainer) | 4 VMs + 1 (Trainer) | 7 VMs + 1 (Trainer) |
Table 1: Number of 8-core 32GB RAM VMs required
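For rough planning, the figures above can be turned into a back-of-the-envelope estimator. The sketch below extrapolates linearly from the ~15,000 structured and ~7,000 semi-structured pages per 8-core 32GB VM per 12 hours quoted earlier; because it rounds up, it can be slightly more conservative than Table 1 at the larger volumes, and the table and our Deployments team remain authoritative:

```python
import math

# Per-VM hourly throughput, derived from the 12-hour figures above.
STRUCTURED_PAGES_PER_VM_HOUR = 15_000 / 12
SEMI_PAGES_PER_VM_HOUR = 7_000 / 12

def vms_needed(peak_hourly_pages: float, structured_share: float) -> int:
    """Estimate application VMs (excluding the trainer) for a peak hour."""
    structured = peak_hourly_pages * structured_share
    semi = peak_hourly_pages * (1 - structured_share)
    load = structured / STRUCTURED_PAGES_PER_VM_HOUR + semi / SEMI_PAGES_PER_VM_HOUR
    return max(1, math.ceil(load))

for pages in (1000, 2000, 4000):
    for share in (1.0, 0.5, 0.0):
        print(f"{pages} pages/hr, {share:.0%} structured: "
              f"{vms_needed(pages, share)} VMs + 1 (Trainer)")
```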
Trainer capacity
The trainer uses the directory referenced by the HS_PATH environment variable for storage. Ensure that the directory is located on a partition with at least 100GB of storage.
If you plan on enabling Trainer Resiliency, which creates checkpoints for training data and model training, you will need to ensure that there is an additional 6GB of storage available. To learn more about this feature, see Trainer Resiliency.
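A minimal sketch for checking the trainer’s storage, assuming HS_PATH is set in the environment and interpreting the requirement as free space on the partition:

```python
import os
import shutil

resiliency_enabled = False  # assumption; set to match your configuration
needed_gb = 100 + (6 if resiliency_enabled else 0)

free_gb = shutil.disk_usage(os.environ["HS_PATH"]).free / 1024**3
status = "OK" if free_gb >= needed_gb else "INSUFFICIENT"
print(f"HS_PATH free space: {free_gb:.0f} GB (need {needed_gb} GB) -> {status}")
```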
Database capacity
The size of the database store depends on:
Daily volume of pages
Number of fields per page to be collected
Retention period before record deletion (configurable by the user; anywhere between 3 and 60 days is common)
A typical set-up (15,000 TIFFs per day, deleted after 30 days) requires about 30GB of database storage. As storage is inexpensive, teams usually provision it with a buffer (100-200GB).
File storage capacity
The size of the file store depends on:
Daily volume of pages
Mix of file sizes and formats
Retention period before record deletion
Directionally, a set-up of 15,000 TIFFs per day, deleted after 30 days, requires ~1TB of file storage.
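Combining the database and file-store rules of thumb above gives a rough capacity estimator. The per-page figures are inferred from the two worked examples (about 70 KB of database and ~2.3 MB of files per page) and are assumptions; actual sizes vary with field counts and file formats:

```python
PAGES_PER_DAY = 15_000   # adjust to your volume
RETENTION_DAYS = 30      # adjust to your retention policy
DB_KB_PER_PAGE = 70      # inferred from the 30GB database example above
FILE_MB_PER_PAGE = 2.3   # inferred from the ~1TB file-storage example above

pages_retained = PAGES_PER_DAY * RETENTION_DAYS
db_gb = pages_retained * DB_KB_PER_PAGE / 1024**2
file_tb = pages_retained * FILE_MB_PER_PAGE / 1024**2
print(f"Database: ~{db_gb:.0f} GB (teams usually provision 100-200 GB)")
print(f"File store: ~{file_tb:.1f} TB")
```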