Training a New Field Identification Model

There are two ways to train a Field ID model.

  1. To manually train and deploy models, go to the Model Details page, and follow the instructions in this article.

  2. To automatically train and deploy Field ID models, you can enable the Continuous Field Locator model improvement setting.

    • For optimal performance, we recommend that you train models manually and disable the Continuous Field Locator model improvement setting. Only enable this setting if instructed to do so by a Hyperscience representative.

    • See Identification Settings for more information about Continuous Field Locator model improvement.

Manual training and deployment both start from the Model Details page. Once you have chosen the Semi-structured layout you want to train a model for, you can reach the Model Details page in either of two ways:

  1. Go to Library > Models, select Identification Models from the drop-down list at the top of the page, and then click on the name of the model.

  2. Go to Layouts, click on the name of the layout, and then click on the name of the Identification Model on the Layout Details page.

To understand the requirements to train a model, see Requirements for Training a New Model.

Multiple Occurrences Field ID model

The Multiple Occurrences (MOs) feature helps you identify multiple instances of a field. Learn more about fields with multiple occurrences in Field Identification.

Multiple Occurrences checkbox

The default Field ID model can predict multiple occurrences of fields. To indicate that a field needs multiple instances annotated, select the Multiple Occurrences checkbox for that field in the Layout Editor.

When you create a layout, the checkbox is deselected by default.

  • When this checkbox is selected, Multiple Occurrences annotations will be enabled, and the model will look for multiple instances of a specific field. 

  • When this checkbox is deselected, you'll be able to annotate only a single instance of the field. The model will return only one result. 

If you select the Multiple Occurrences checkbox for a field, annotate your dataset, and then deselect the checkbox, the annotations won’t be invalidated. However, Training Data Analysis will display anomalies for documents that have multiple instances of that field. Learn more about anomalies in Detecting and Correcting Anomalies in Field Annotations.

Using existing layouts

After upgrading, existing layouts will have the following behaviors, depending on the engine type you used in the previous version:

  • If, in a previous version of the application, you used the MULTIPLE_OCCURRENCES engine type to train a model, all existing fields will have the Multiple Occurrences checkbox selected in the Layout Editor. Make sure to deselect the checkbox for fields with a single instance.

  • If a generic engine type for training a model was used in the previous version of the application, then none of the existing fields will have the Multiple Occurrences checkbox selected.

If you already have a trained model and then deselect the Multiple Occurrences checkbox in the Layout Editor, the current live model does not change until it is re-trained.

To initiate model training, follow the steps in the Initiating Model Training section below.

Unstructured Extraction Field ID model

GPU trainer required

A GPU trainer is required in order to use Unstructured Extraction. Contact your Hyperscience representative for more information.

If you have an on-premises deployment of Hyperscience, you can also learn more by reviewing the "Enabling Trainers with GPUs" articles for Docker, Podman, and Kubernetes.

The default Field ID model cannot extract data points from documents with unstructured text. If you want to extract data points from unstructured documents through a specific layout, you need to select the Unstructured Extraction Field ID model for this layout before model training. To do so, follow the steps below:

  1. Go to the admin page by adding "/admin/form_extraction/template/" to the end of the application URL (e.g., example.production.hyperscience.com/admin/form_extraction/template/).

  2. Click the UUID of the layout you’d like to edit.

  3. In the Flex engine type for training setting, select UNSTRUCTURED_EXTRACTION from the drop-down list.

  4. Click Save.
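
The admin URL from step 1 can be assembled as in the minimal sketch below. The helper name layout_admin_url is hypothetical and is not part of Hyperscience; the sketch also assumes the application is served over HTTPS when no scheme is given. It only appends the admin path quoted in step 1 to the application URL.

```python
# Path quoted in step 1 of the instructions above.
ADMIN_LAYOUT_PATH = "/admin/form_extraction/template/"


def layout_admin_url(application_url: str) -> str:
    """Return the layout admin page URL for a given application URL.

    Hypothetical helper for illustration only.
    """
    base = application_url.strip()
    if "://" not in base:
        # Assumption: the application is reachable over HTTPS.
        base = "https://" + base
    # Drop any trailing slash so the path is not doubled up.
    return base.rstrip("/") + ADMIN_LAYOUT_PATH


print(layout_admin_url("example.production.hyperscience.com"))
```

Open the resulting URL in a browser where you are already signed in to the application, and then continue with step 2.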

When training a model for Unstructured Extraction, the following limits apply:

  • 2,000 text segments per page

  • 200 pages per document

  • 200,000 text segments per document

  • 10,000,000 text segments total
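
To check a training set against these limits before you start training, a pre-flight script along the following lines may help. This is a sketch under stated assumptions: the input structure (a mapping from document name to per-page text segment counts) and the function name check_training_limits are hypothetical, the limits are treated as inclusive, and the "total" limit is assumed to apply to the whole training set. How you obtain segment counts depends on your own tooling.

```python
from typing import Dict, List

# Limits listed above for Unstructured Extraction training.
MAX_SEGMENTS_PER_PAGE = 2_000
MAX_PAGES_PER_DOCUMENT = 200
MAX_SEGMENTS_PER_DOCUMENT = 200_000
MAX_SEGMENTS_TOTAL = 10_000_000


def check_training_limits(documents: Dict[str, List[int]]) -> List[str]:
    """Return a list of human-readable limit violations.

    `documents` maps a document name to a list of text segment counts,
    one entry per page (hypothetical structure, for illustration only).
    """
    problems: List[str] = []
    total_segments = 0

    for name, pages in documents.items():
        if len(pages) > MAX_PAGES_PER_DOCUMENT:
            problems.append(
                f"{name}: {len(pages)} pages (limit {MAX_PAGES_PER_DOCUMENT})"
            )

        for page_number, count in enumerate(pages, start=1):
            if count > MAX_SEGMENTS_PER_PAGE:
                problems.append(
                    f"{name}: page {page_number} has {count} text segments "
                    f"(limit {MAX_SEGMENTS_PER_PAGE})"
                )

        document_segments = sum(pages)
        if document_segments > MAX_SEGMENTS_PER_DOCUMENT:
            problems.append(
                f"{name}: {document_segments} text segments "
                f"(limit {MAX_SEGMENTS_PER_DOCUMENT})"
            )
        total_segments += document_segments

    # Assumption: the overall limit covers all documents in the training set.
    if total_segments > MAX_SEGMENTS_TOTAL:
        problems.append(
            f"training set: {total_segments} text segments "
            f"(limit {MAX_SEGMENTS_TOTAL})"
        )
    return problems


dataset = {"invoice_001": [350, 1200], "contract_017": [2500]}
for problem in check_training_limits(dataset):
    print(problem)
```

An empty result means none of the four limits is exceeded under the assumptions above.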

To initiate model training, follow the steps in the Initiating Model Training section below.

Initiating Model Training

On the Model Details page, you can see if you've completed enough QA or Field ID Supervision to initiate training. If you have not yet reached the minimum, you'll see the number of additional documents required to reach the minimum.

  • Once you've reached the minimum, train a model by clicking the Run Training button (if there is no previous model) or Actions > Run Training (if there is an existing model).

  • After initiating training, the system will show that the model is pending.

The training process takes approximately 8 minutes per document on an 8-core machine with 32 GB of memory. Monitor the Notifications in the top left of the application to keep track of model training jobs.
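
For planning purposes, the per-document figure above can be turned into a rough estimate. The sketch below assumes that training time scales linearly with the number of documents and that your trainer is comparable to the 8-core, 32 GB machine mentioned above; both are assumptions, and the function names are hypothetical. Treat the result as an order-of-magnitude guide only.

```python
# Approximate figure quoted above for an 8-core machine with 32 GB of memory.
MINUTES_PER_DOCUMENT = 8


def estimate_training_minutes(document_count: int) -> int:
    """Rough training time in minutes, assuming linear scaling."""
    if document_count < 0:
        raise ValueError("document_count must not be negative")
    return document_count * MINUTES_PER_DOCUMENT


def format_duration(minutes: int) -> str:
    """Format a number of minutes as hours and minutes."""
    hours, remaining_minutes = divmod(minutes, 60)
    return f"{hours} h {remaining_minutes} min"


# 100 documents * 8 minutes = 800 minutes
print(format_duration(estimate_training_minutes(100)))  # prints "13 h 20 min"
```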

To cancel a model training job, see Canceling or Retrying a Training Job.

Anomaly Detection

With the Anomaly Detection feature, the system analyzes your training data and flags potential anomalies in the annotations for you to review. When you review a flagged annotation, you can either mark it as correct or edit it. Re-training a model after you have reviewed the anomalies can improve automation. You can manually initiate model training at any point, even if you haven’t reviewed all of the flagged anomalies.

For more information, see Detecting and Correcting Anomalies in Field Annotations and Detecting and Correcting Anomalies in Table Annotations.

Additional Notes

  • If your deployment does not have a dedicated machine for training, document processing times will be severely delayed while the model trains. Without a dedicated machine, it is best to avoid processing documents while training models. 

  • Initiating training on subsequent models for a Semi-structured layout is identical to initiating the first model. However, when you view the Model Details page, you'll see data associated with the live model in the Current Model section. 

  • If you have PII deletion enabled on your system, or if you have imported a model from another instance, you may not have enough documents to run training even if you have a live model. In that case, you'll need to wait until enough documents have been through QA or Field ID Supervision (increasing the sampling rate can reduce the wait time).

    • Once enough documents are available, you can train a new model by clicking Run Training, as described in Initiating Model Training.