Clause Extraction

Clause Extraction is Hyperscience’s solution for documents that contain long passages of text, including paragraphs that span multiple pages. It builds on the Unstructured Extraction model. To learn more, see Unstructured Extraction Field ID model.

Clause Extraction is available for:

  • Flexible Extraction

  • Field ID & Field ID QA

To make the extraction of longer text possible, we implemented a new data type called “Clause.”

In the Layout Editor, the Clause data type can be added only to fields, not to tables.

Prerequisites

GPU trainer required

A GPU trainer is required to use Clause Extraction. Contact your Hyperscience representative for more information.

If you have an on-premises deployment of Hyperscience, you can also learn more by reviewing the "Enabling Trainers with GPUs" articles for Docker, Podman, and Kubernetes.

Before using Clause Extraction, make sure to set the data type of the field with longer text in your layout to Clause. Learn more in Creating Semi-Structured Layouts.

Custom data types also work with Clause Extraction, as long as their ML Configuration is set to Entry - Clause.

Learn more about ML Configurations in Creating Data Types with ML Configurations.

Once you set the data type to Clause, the system will automatically switch the default Field ID model engine for training to UNSTRUCTURED_EXTRACTION.

Using Clause Extraction

  1. Annotate your documents. For fields that span multiple pages, use the Add another text segment option. Learn more in Multiple bounding boxes for fields.

  2. Preview the transcribed values on the right-hand side of the Document Viewer.
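As a rough mental model of the steps above, a field annotated with several text segments can be thought of as an ordered list of pieces that together form one value. The sketch below is purely illustrative: the TextSegment class, its attributes, and the preview_value function are hypothetical names, and ordering by page and vertical position and joining with a single space are assumptions. How Hyperscience actually orders and joins segments is not described in this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextSegment:
    """Hypothetical stand-in for one annotated text segment of a field."""
    page: int    # 1-based page number the segment was drawn on
    top: float   # vertical position on the page (0.0 = top, 1.0 = bottom)
    text: str    # transcribed text inside the segment


def preview_value(segments: List[TextSegment]) -> str:
    """Join segments in reading order (assumed: by page, then top to bottom)."""
    ordered = sorted(segments, key=lambda s: (s.page, s.top))
    # Skip empty segments and separate the rest with a single space.
    return " ".join(s.text.strip() for s in ordered if s.text.strip())


# A clause that starts at the bottom of page 1 and ends at the top of page 2.
clause = [
    TextSegment(page=2, top=0.05, text="terminate this agreement."),
    TextSegment(page=1, top=0.90, text="Either party may "),
]
print(preview_value(clause))
```

The segments are deliberately listed out of order to show that, under these assumptions, the combined value depends on where each segment sits in the document rather than on the order in which it was drawn.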

Unstructured Extraction does NOT support Multiple Occurrences. If the Multiple Occurrences checkbox is selected in the Layout Editor when you run a model training, the training will fail because the engine type does not support it. Learn more about Unstructured Extraction in Training a New Field Identification Model.
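The rules in this article can be summarized as a simple pre-training check. This is a minimal sketch, not a Hyperscience API: the LayoutField class and the training_engine and training_blockers functions are hypothetical, as is the "Text" data type name used in the example. It assumes that any field with the Clause data type switches the model to UNSTRUCTURED_EXTRACTION and that any field with Multiple Occurrences selected would then cause training to fail.

```python
from dataclasses import dataclass
from typing import List, Optional

UNSTRUCTURED_EXTRACTION = "UNSTRUCTURED_EXTRACTION"


@dataclass
class LayoutField:
    """Hypothetical stand-in for a field as configured in the Layout Editor."""
    name: str
    data_type: str                      # e.g. "Clause"
    multiple_occurrences: bool = False  # state of the Multiple Occurrences checkbox


def training_engine(fields: List[LayoutField]) -> Optional[str]:
    """Return the engine implied by the layout, or None if no Clause field exists.

    Assumption: a single Clause field is enough to switch the default engine.
    """
    if any(f.data_type == "Clause" for f in fields):
        return UNSTRUCTURED_EXTRACTION
    return None


def training_blockers(fields: List[LayoutField]) -> List[str]:
    """List fields whose Multiple Occurrences setting would make training fail."""
    if training_engine(fields) != UNSTRUCTURED_EXTRACTION:
        return []
    return [f.name for f in fields if f.multiple_occurrences]


layout = [
    LayoutField("Termination Clause", "Clause"),
    LayoutField("Party Name", "Text", multiple_occurrences=True),
]
print(training_engine(layout))    # UNSTRUCTURED_EXTRACTION
print(training_blockers(layout))  # ['Party Name']
```

In practice, this corresponds to reviewing the layout and clearing the Multiple Occurrences checkbox before starting a training that uses Clause fields.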