Text Segmentation

Using features for Semi-structured documents
This article mentions features used in the processing of Semi-structured documents. Your access to those features depends on your license package and pricing plan.
To learn which features are available to your organization and how to add more, contact your Hyperscience representative.

Text Segmentation is the process of partitioning the text-containing regions of an image into meaningful and distinct blocks of text. It is the first step of downstream processing tasks such as classification, text transcription, and field and table extraction.

In this article, you will learn how to leverage segmentation to improve the performance of your models.

Segmentation in Hyperscience

Segmentation is a crucial first step in handling Semi-structured documents. It detects regions in a page image that contain text. These regions (also known as “text segments”) are identified by the model and returned as coordinates that define their position. To visualize these regions, the system creates bounding boxes around the text segments. The segments can then be processed using a Transcription model.

Segment properties
After the Segmentation model identifies the text segments on a page, the Transcription model extracts the text within each segment. As a result, each text segment has the following properties:
Location — The coordinates of the bounding box containing the text.
Text — The extracted text within the bounding box.
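
To make the two properties concrete, a segment could be modeled as in the sketch below. This is an illustration only, not the Hyperscience data model: the Segment class, its field names, and the pixel-coordinate convention are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical text segment: a bounding box plus its transcribed text."""

    x_min: float  # left edge of the bounding box (assumed pixel coordinates)
    y_min: float  # top edge
    x_max: float  # right edge
    y_max: float  # bottom edge
    text: str     # text extracted from inside the bounding box

    @property
    def location(self) -> Tuple[float, float, float, float]:
        """Return the bounding box as (x_min, y_min, x_max, y_max)."""
        return (self.x_min, self.y_min, self.x_max, self.y_max)


segment = Segment(120, 40, 260, 62, "ACME LTD")
print(segment.location, segment.text)
```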

The Segmentation model may sometimes create segments in areas without text or fail to create a segment where text is present. This can happen with content such as the following:

Watermarks
Vertical text
Text that is faded, has a background, or has low contrast with its surrounding area

When the system does not create a segment for a specific piece of text, our training pipelines and downstream processing are not aware of that text when evaluating the model’s predictions. As a result, the model cannot extract any information from it during either training or processing, and key data in that area goes unrecognized.

Model training and segmentation

Each model uses specific information. The table below describes the data used by each type of model.

Model	Segmentation properties used
Classification	Uses only text.
Field ID	Uses both text and location information.
Table ID	Uses both text and location information.
Long-form Extraction	Uses both text and location information.
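
The table can also be read as a rule for which segment properties each model sees. The sketch below restates it in code, assuming segments are plain dictionaries with "text" and "location" keys. The MODEL_INPUTS mapping and the model_view helper are hypothetical names used for illustration, not part of any Hyperscience API.

```python
# Hypothetical restatement of the table above: model type -> properties used.
MODEL_INPUTS = {
    "Classification": ("text",),
    "Field ID": ("text", "location"),
    "Table ID": ("text", "location"),
    "Long-form Extraction": ("text", "location"),
}


def model_view(segments, model_type):
    """Keep only the segment properties that the given model type uses."""
    properties = MODEL_INPUTS[model_type]
    return [{name: segment[name] for name in properties} for segment in segments]


segments = [
    {"text": "Invoice", "location": (10, 10, 90, 30)},
    {"text": "ACME LTD", "location": (120, 40, 260, 62)},
]

print(model_view(segments, "Classification"))
print(model_view(segments, "Field ID"))
```

One consequence of this view is that a Classification model is unaffected by where a segment sits on the page, while the other models depend on both the text and its position.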

Segmentation and signatures
Signature segmentation is trained as a standalone model, separate from text and checkbox segmentation, allowing it to focus exclusively on identifying signatures in documents. That way, the model produces more reliable and complete bounding boxes around signatures.
The bounding boxes from text segmentation are used to refine the segmentation of signatures, ensuring that fragmented or partial signatures are consolidated for better accuracy.

Segmentation and annotations

Annotations are automatically mapped to the identified segments in the selected region. We recommend using these locations to ensure accuracy. During this process, bounding boxes appear automatically around the segments, as shown below:

These annotations are later sent to the Trainer for model training. Learn more in our What is the Trainer? article.

The system expects full segments when evaluating annotations. As a result, adjusting the bounding box to capture only part of a field does not mean that only that part will be sent to the Trainer. For example, if we adjust the bounding box to omit the “D” in “LTD,” the system will still send the entire original segment for model training, as most of the text is within the bounding box.
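
The "LTD" example can be sketched as a simple overlap rule. The code below is an assumption about how such a rule might look, not the system's actual logic: the 0.5 threshold, the function names, and the (x_min, y_min, x_max, y_max) box format are all hypothetical.

```python
def overlap_fraction(segment_box, annotation_box):
    """Fraction of the segment's area that lies inside the annotation box."""
    sx0, sy0, sx1, sy1 = segment_box
    ax0, ay0, ax1, ay1 = annotation_box
    width = max(0.0, min(sx1, ax1) - max(sx0, ax0))
    height = max(0.0, min(sy1, ay1) - max(sy0, ay0))
    area = (sx1 - sx0) * (sy1 - sy0)
    return (width * height) / area if area > 0 else 0.0


def segments_sent_to_training(segments, annotation_box, threshold=0.5):
    """Return the full text of every segment mostly covered by the annotation.

    The threshold is an assumed value for illustration only.
    """
    return [
        text
        for box, text in segments
        if overlap_fraction(box, annotation_box) > threshold
    ]


segments = [((20, 50, 90, 70), "ACME"), ((100, 50, 160, 70), "LTD")]

# Annotation box narrowed to leave out the final "D" of "LTD".
trimmed_box = (100, 50, 140, 70)
print(segments_sent_to_training(segments, trimmed_box))
```

Under these assumptions, the narrowed box still covers two thirds of the "LTD" segment, so the whole segment text is returned rather than "LT".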

If you want to capture only part of a segment, you need to handle this in post-processing. Partial segments are not supported.
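
As one way to picture this kind of post-processing, the sketch below trims an extracted segment down to the part of interest with a regular expression. The extract_part helper, the sample value, and the pattern are hypothetical, and where such code would run in your own workflow is left as an assumption.

```python
import re
from typing import Optional


def extract_part(segment_text: str, pattern: str) -> Optional[str]:
    """Return the first capture group of the pattern, or the whole match if
    the pattern has no groups. Returns None when nothing matches."""
    match = re.search(pattern, segment_text)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


# The full segment is extracted first; only then is it reduced to the digits.
full_segment = "INV-2024-00731"
print(extract_part(full_segment, r"INV-\d{4}-(\d+)"))
```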

Best practices

Make sure to capture the entire segment in the bounding box.

Adjust the bounding box only if two segments overlap, so that the overlap does not disrupt the values you want to extract. Alternatively, use multiple bounding boxes. Learn more in Document Eligibility Filtering.

If you want to annotate parts of a multiline field, use our Multiple Bounding Boxes feature. See the example below:
If you want to annotate the entire field, you can use a single bounding box:

To learn more about the model-training process, see Training a Semi-structured Model.