Long-form Extraction

Long-form Extraction is Hyperscience’s solution for documents that contain longer paragraphs of text spanning multiple pages. It builds upon the Long-form Extraction model. To learn more, see Long-form Extraction Field ID model.

Long-form Extraction is available for:

  • Flexible Extraction

  • Field ID & Field ID QA

To make the extraction of longer text possible, we implemented a new data type called “Clause.”

The Clause data type can only be added to fields, not tables, in the Layout Editor.

Prerequisites

GPU trainer required

A GPU trainer is required in order to use Long-form Extraction. Contact your Hyperscience representative for more information.

If you have an on-premise deployment of Hyperscience, you can also learn more by reviewing the "Enabling Trainers with GPUs" articles for Docker, Podman, and Kubernetes.

Before using Long-form Extraction, make sure to set the data type of the field with longer text in your layout to Clause. Learn more in Creating Semi-Structured Layouts.

Custom data types also work with Clause, as long as their ML Configuration is set to Entry - Clause.

Learn more about ML Configurations in Creating Data Types with ML Configurations.

Long-form Extraction Engine type

The default Field Identification models use a generic engine type that cannot extract data points from documents with unstructured text. To learn more, see Training a New Field Identification Model.

To extract data points from unstructured documents using a specific layout, you need to select the Long Form Extraction engine type for that layout before training the model.

Follow the steps below to change the engine type of your layout:

  1. Go to Library > Layouts and open your layout.

  2. On the Configuration card, click the Edit button under the Engine Type setting.

  3. Select Long Form Extraction from the drop-down menu.

  4. Click Change Type.

  5. Retrain your model after you’ve changed the engine type of the layout.

Using Long-form Extraction

  1. Annotate your documents by using the Add another text segment option for fields that span multiple pages. Learn more in Multiple bounding boxes for fields.

  2. Preview the transcribed values on the right-hand side of the Document Viewer.

Long-form Extraction limits

Below is a list of the limits that apply to training a Long-form Extraction model.

  • 2,000 text segments per page

  • 200 pages per document

  • 200,000 text segments per document

  • 100,000,000 text segments total

To learn more about segments, see Text Segmentation.
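
To make the limits above concrete, here is a minimal Python sketch that checks a set of training documents against them before you start training. It is not part of the Hyperscience product or API. The function name check_training_set and the input format (a dictionary mapping each document name to a list of per-page text segment counts) are hypothetical, and the sketch assumes that the "total" limit applies to all documents in the training set combined.

```python
# Limits quoted from the list above.
MAX_SEGMENTS_PER_PAGE = 2_000
MAX_PAGES_PER_DOCUMENT = 200
MAX_SEGMENTS_PER_DOCUMENT = 200_000
MAX_SEGMENTS_TOTAL = 100_000_000  # assumed to apply across the whole training set


def check_training_set(documents):
    """Return a list of limit violations (an empty list means none were found).

    `documents` is a hypothetical input format: a dict mapping a document
    name to a list with one text segment count per page.
    """
    problems = []
    total_segments = 0

    for name, pages in documents.items():
        if len(pages) > MAX_PAGES_PER_DOCUMENT:
            problems.append(f"{name}: {len(pages)} pages exceeds {MAX_PAGES_PER_DOCUMENT}")

        for page_number, count in enumerate(pages, start=1):
            if count > MAX_SEGMENTS_PER_PAGE:
                problems.append(
                    f"{name}, page {page_number}: {count} segments exceeds {MAX_SEGMENTS_PER_PAGE}"
                )

        document_segments = sum(pages)
        if document_segments > MAX_SEGMENTS_PER_DOCUMENT:
            problems.append(
                f"{name}: {document_segments} segments exceeds {MAX_SEGMENTS_PER_DOCUMENT}"
            )
        total_segments += document_segments

    if total_segments > MAX_SEGMENTS_TOTAL:
        problems.append(
            f"training set: {total_segments} segments exceeds {MAX_SEGMENTS_TOTAL}"
        )
    return problems
```

For example, calling check_training_set({"contract": [1500] * 150}) reports one violation: each page and the page count are within their limits, but the document contains 225,000 text segments, which is over the per-document limit of 200,000.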