Requirements for Training a New Model

Overview

If you create a new Semi-structured layout version, no models are immediately available for it. For optimal layout performance, train a model on the newest layout version. Recall that identification models are trained at the layout level.

Increase Field ID model performance (accuracy and automation) by retraining with additional Field ID Supervision and QA documents.
Increase Table ID model performance (accuracy and automation) by retraining with additional documents that have gone through Table ID Supervision and Table ID QA.

The application can be trained on-premises and deployed to a separate machine, which keeps model training resources apart from document processing resources. For more information, see What is the Trainer?

Training requirements

The document requirements for training Field ID and Table ID models are the same. Field ID uses Field ID Supervision and Field ID QA data, while Table ID uses Table ID Supervision and Table ID QA data for training.

Guidelines for training both types of models:

A layout can only have one live model for Field ID and one for Table ID.
To train a model, the system uses training data from all flows where the model’s layout is live.
The following requirements and limitations around training data for each model ensure a smooth training process and prevent crashes:
- A minimum of 400 qualified documents.
  - With accurate training data annotations, effective Field ID models can be trained on ~120 documents. For better performance, we recommend using as many documents as possible.
- A maximum of 1000 documents.
  - If you have more than 1000 documents, the documents with the most recent completion dates are used unless you specify otherwise in Keyer Data Management.
- A maximum of 5000 pages in total.
- A maximum of 10 pages per document.
- A maximum of 5000 text segments per document.
- A maximum of 2000 text segments per page.
- A maximum of 500000 text segments in total.

If you would like to change these requirements and limitations, reach out to your Hyperscience representative.
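
The limits above can be checked ahead of time with a short script. The following is a minimal sketch in Python, not part of the product: the Doc record, select_latest, and check_limits are hypothetical names introduced for illustration, and the sketch assumes you can export each document's completion date and per-page text segment counts. It only reports violations; how the application itself handles a document that exceeds a limit is not described here.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Default limits quoted from the list above.
MIN_DOCS = 400
MAX_DOCS = 1000
MAX_PAGES_TOTAL = 5000
MAX_PAGES_PER_DOC = 10
MAX_SEGMENTS_PER_DOC = 5000
MAX_SEGMENTS_PER_PAGE = 2000
MAX_SEGMENTS_TOTAL = 500_000


@dataclass
class Doc:
    """Hypothetical record for illustration; not a product data structure."""
    doc_id: str
    completed_on: date
    segments_per_page: List[int]  # one entry per page


def select_latest(docs: List[Doc], limit: int = MAX_DOCS) -> List[Doc]:
    """Keep the `limit` documents with the most recent completion dates."""
    return sorted(docs, key=lambda d: d.completed_on, reverse=True)[:limit]


def check_limits(docs: List[Doc]) -> List[str]:
    """Return human-readable violations; an empty list means none were found."""
    problems = []
    if len(docs) < MIN_DOCS:
        problems.append(f"{len(docs)} documents; minimum is {MIN_DOCS}")
    if len(docs) > MAX_DOCS:
        problems.append(f"{len(docs)} documents; maximum is {MAX_DOCS}")

    total_pages = sum(len(d.segments_per_page) for d in docs)
    total_segments = sum(sum(d.segments_per_page) for d in docs)
    if total_pages > MAX_PAGES_TOTAL:
        problems.append(f"{total_pages} pages in total; maximum is {MAX_PAGES_TOTAL}")
    if total_segments > MAX_SEGMENTS_TOTAL:
        problems.append(f"{total_segments} segments in total; maximum is {MAX_SEGMENTS_TOTAL}")

    for d in docs:
        if len(d.segments_per_page) > MAX_PAGES_PER_DOC:
            problems.append(f"{d.doc_id}: more than {MAX_PAGES_PER_DOC} pages")
        if sum(d.segments_per_page) > MAX_SEGMENTS_PER_DOC:
            problems.append(f"{d.doc_id}: more than {MAX_SEGMENTS_PER_DOC} segments")
        if any(n > MAX_SEGMENTS_PER_PAGE for n in d.segments_per_page):
            problems.append(f"{d.doc_id}: a page has more than {MAX_SEGMENTS_PER_PAGE} segments")
    return problems
```

Running select_latest before check_limits mirrors the default behavior described above, where only the most recently completed documents are used once the set exceeds 1000.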

Note that training jobs can take anywhere from three hours to several days, depending on how many documents are available.

For more information about model training, see Training a New Field Identification Model and Training a New Table Identification Model.

Document qualifications for training

A diverse, representative training set is crucial for a high-quality Identification model. To create a more generalized model, we advise you to select representatives from each type of document you want to include.

The more diverse the data, the better the model’s performance will be.

When selecting training documents:

Review all documents and become familiar with the fields you want to extract (e.g., how they’re formatted, and where they appear on the pages).
Remove any redundant or unnecessary documents from the dataset (e.g., documents that contain unrelated information, are highly distorted, or don’t contain valuable information).

Avoid including documents with duplicate pages, poorly scanned pages, or photographed pages that are noisy, skewed, or pixelated.

A document is eligible for model training if it meets all of the following criteria:

Field ID models

PII is not wiped.
Field ID Supervision or QA is complete and no fields have been left blank.
The layout version for the respective document has a superset of the fields in the active layout version.
- If you delete a field in the layout editor and then re-add it later, it will be considered a new field.
The document does not have fields with overlapping bounding boxes.

Table ID models

PII is not wiped.
Table ID Supervision is complete and no fields have been left blank.
The layout version for the respective document has a superset of the table column fields in the active layout version.
- If you delete a table column field in the layout editor and then re-add it later, it will be considered a new field.
The document does not have fields with overlapping bounding boxes.
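
The eligibility criteria above follow the same pattern for both model types, so they can be sketched as a single check. This is an illustrative Python sketch under stated assumptions, not the application's implementation: AnnotatedDoc, is_eligible, and boxes_overlap are hypothetical names; a blank field is modeled as None; fields are compared by ID, so a deleted and re-added field counts as a different field; and boxes that only touch along an edge are assumed not to overlap.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share interior area.

    Assumption: boxes that merely touch along an edge do not overlap.
    """
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


@dataclass
class AnnotatedDoc:
    """Hypothetical record for illustration; not a product data structure."""
    pii_wiped: bool
    supervision_complete: bool
    layout_field_ids: Set[str]              # fields in the document's layout version
    annotations: Dict[str, Optional[Box]]   # field ID -> box, or None if left blank


def is_eligible(doc: AnnotatedDoc, active_field_ids: Set[str]) -> bool:
    """Apply the four criteria listed above to one document."""
    if doc.pii_wiped or not doc.supervision_complete:
        return False
    if any(box is None for box in doc.annotations.values()):
        return False  # a field was left blank
    if not doc.layout_field_ids >= active_field_ids:
        return False  # not a superset of the active layout version's fields
    boxes = list(doc.annotations.values())
    return not any(
        boxes_overlap(boxes[i], boxes[j])
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )
```

For a Table ID model, active_field_ids would hold the table column fields of the active layout version instead of its regular fields.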

Data requirements for Table ID models

When training Table ID models, note the additional requirements listed below.

For each column on a page:
- all of the cells should have the same width, and
- the left edge of each cell should be the same distance away from the left edge of the page.
For each row on the page, the bottom boundary of the row should be the same as the top boundary of the row directly below it (i.e., the rows should be contiguous).

Technically, a document where the table is marked "Not Present" satisfies these requirements and counts as a training document. However, a model trained only on such documents will perform poorly, because it never sees an annotated table.

In documents where the table is present, completing the Table ID Supervision tasks will ensure that your training data meets these requirements.
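
If you prepare or inspect table annotations outside the application, the column and row requirements above can be expressed as two geometric checks. The Python sketch below is an illustration only: the Cell tuple, columns_aligned, rows_contiguous, and the tol parameter are hypothetical, coordinates are assumed to be measured from the top-left corner of the page with y increasing downward, and a row's boundaries are taken as the topmost and bottommost edges of its cells.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)
Cell = Tuple[int, str, Box]              # (row index, column name, box)


def columns_aligned(cells: List[Cell], tol: float = 0.0) -> bool:
    """Every cell in a column has the same left edge and the same width."""
    by_column: Dict[str, List[Tuple[float, float]]] = {}
    for _, column, (left, _, right, _) in cells:
        by_column.setdefault(column, []).append((left, right - left))
    for measurements in by_column.values():
        left0, width0 = measurements[0]
        for left, width in measurements:
            if abs(left - left0) > tol or abs(width - width0) > tol:
                return False
    return True


def rows_contiguous(cells: List[Cell], tol: float = 0.0) -> bool:
    """Each row's bottom boundary equals the top boundary of the row below."""
    bounds: Dict[int, Tuple[float, float]] = {}
    for row, _, (_, top, _, bottom) in cells:
        current_top, current_bottom = bounds.get(row, (top, bottom))
        bounds[row] = (min(current_top, top), max(current_bottom, bottom))
    ordered = [bounds[row] for row in sorted(bounds)]
    return all(
        abs(upper[1] - lower[0]) <= tol
        for upper, lower in zip(ordered, ordered[1:])
    )
```

A nonzero tol allows for small rounding differences in exported coordinates; with the default of 0.0 the edges must match exactly.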

Trainer performance metrics

The Model Details page (accessible by going to Library > Models and selecting a specific model) provides detailed metrics on the performance of each layout. These metrics show the manual, machine, and overall field identification accuracy, as well as automation rates, for a specific layout. As you create new models and add documents to the training data set, these metrics let you track how performance changes.