Matching pages to layouts is part of the Document Classification task. The machine tries to match each page to its layout. If the machine’s confidence is high enough, the page will be matched to a layout automatically. If the machine’s confidence is low, you will be asked to match the page manually.
The process of matching pages to layouts differs by layout type. To explain the differences, we will go through:
Matching pages to Structured layouts
Structured layout match threshold
Layout identifiers
Matching pages to Semi-structured and Additional layouts
Specify a submission’s layout
Differences between matching pages to Structured and Semi-structured layouts
Submissions with Structured and Semi-structured pages
Troubleshooting
Matching pages to Structured layouts
Hyperscience uses an engine that matches incoming page images to a predefined set of layout images. The predefined set of layout images includes all Structured layouts that are part of a live release. The engine returns a list of possible matches ordered by confidence. If the confidence is high enough, the engine automatically matches a page with a layout. To understand how to set a confidence threshold and achieve high automation rates, see the sections below.
Structured layout match threshold
The Structured layout match threshold lets you set a confidence threshold for Structured page matching. Structured pages whose match confidence is above the threshold will be automatically matched to a Structured layout.
The layout match threshold is a flow setting whose default value is 0.6, but it can be changed to meet your requirements. Lower thresholds increase the chance that pages are matched to the wrong layout. During implementation, your Hyperscience representative will help you determine the optimal threshold to ensure high-quality matches.
Matching confidence correlates directly with extraction accuracy. The machine always matches a page to the layout it can best extract data from, so the more confident the machine is in matching a page, the more confident it will be when extracting that page’s fields.
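To illustrate the threshold check described in this section, here is a minimal Python sketch. It is not the Hyperscience implementation: the `LayoutCandidate` class, the `auto_match` function, and the layout names are hypothetical, and the strict "greater than" comparison is an assumption based on the phrase "above the threshold". The only value taken from this article is the default threshold of 0.6.

```python
from dataclasses import dataclass
from typing import List, Optional

# Default value of the layout match threshold flow setting, as stated in this article.
DEFAULT_LAYOUT_MATCH_THRESHOLD = 0.6


@dataclass
class LayoutCandidate:
    """Hypothetical record for one possible match returned by the matching engine."""
    layout_name: str
    confidence: float


def auto_match(candidates: List[LayoutCandidate],
               threshold: float = DEFAULT_LAYOUT_MATCH_THRESHOLD) -> Optional[str]:
    """Return the layout to match automatically, or None if the page needs manual matching."""
    if not candidates:
        return None
    # Pick the candidate with the highest confidence.
    best = max(candidates, key=lambda c: c.confidence)
    # Assumption: "above the threshold" is treated as a strict comparison.
    return best.layout_name if best.confidence > threshold else None


candidates = [LayoutCandidate("Form A", 0.55), LayoutCandidate("Form B", 0.82)]
result = auto_match(candidates)                    # best candidate clears the default threshold
strict_result = auto_match(candidates, threshold=0.9)  # no candidate clears a stricter threshold
```

With the default threshold, the page would be matched to the hypothetical "Form B" layout. With the threshold raised to 0.9, the function returns `None`, which corresponds to the page being sent for manual matching.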
Layout identifiers
Layout identifiers are unique identifying landmarks that can be defined on each layout page to help the machine achieve the best layout match for submitted pages. The machine uses layout identifiers to distinguish among similar layouts. For example, if you have 3 similar layouts that are part of a live release and you submit a document that is similar to these 3 layouts, using a unique layout identifier will help the machine match the document to the correct layout.
When using layout identifiers, we recommend following the best practices from our Best Practices for Using Layout Identifiers article to achieve high accuracy in document classification.
Matching pages to Semi-structured and Additional layouts
Hyperscience uses trained Classification models to match pages to Semi-structured and Additional layouts. You need to train a Classification model for each Semi-structured and Additional layout in a live release to help the machine recognize individual layouts. There are three types of documents used for training:
Submitted documents that are already matched to a layout - it does not matter whether the pages were classified by a user or by the machine.
Manually uploaded training documents - documents that are added from the Classification Model detail page.
“Other” documents - manually uploaded documents that represent pages you do not want to be classified. Training on them teaches the Classification model to ignore such documents, which reduces unnecessary manual work by preventing documents you don’t care about from being processed.
To learn more about matching pages to Semi-structured layouts, see:
Specify a submission’s layout
You can skip machine classification for Semi-structured documents if you specify the layout upon submission. We recommend specifying the layout in cases where you know the submission’s layout in advance. There are two ways to specify the submission’s layout:
In the Hyperscience application - choose a layout from the drop-down menu upon document submission.
Through the API - use either the layout_uuid or the fallback_layout_uuid property:
layout_uuid - The UUID of a Semi-structured layout that's part of an active release can be specified. When specified, all submitted pages in the API request will be matched to it and machine classification will be skipped.
fallback_layout_uuid - The UUID of a Semi-structured layout that's part of an active release can be specified. When specified, only the pages in the API request that are not matched to a live layout will be matched to the specified Semi-structured layout.
Note that the layout_uuid and fallback_layout_uuid properties cannot be specified in the same request.
By specifying the submission’s layout, you eliminate the possibility of matching a page incorrectly, and you save time by skipping machine classification.
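As a rough illustration of the two API properties, the sketch below builds the layout-related fields of a submission request and enforces the rule that the two properties cannot be combined. The helper name `build_layout_params` and the placeholder UUID are hypothetical. How these fields are attached to an actual submission request (endpoint, authentication, file upload) is not covered here, so refer to the Hyperscience API documentation for those details.

```python
from typing import Dict, Optional


def build_layout_params(layout_uuid: Optional[str] = None,
                        fallback_layout_uuid: Optional[str] = None) -> Dict[str, str]:
    """Build the layout-related request fields (hypothetical helper, not part of any SDK)."""
    # The two properties cannot be specified in the same request.
    if layout_uuid and fallback_layout_uuid:
        raise ValueError(
            "layout_uuid and fallback_layout_uuid cannot be specified in the same request"
        )
    params: Dict[str, str] = {}
    if layout_uuid:
        # All submitted pages are matched to this layout; machine classification is skipped.
        params["layout_uuid"] = layout_uuid
    if fallback_layout_uuid:
        # Only pages not matched to a live layout are matched to this layout.
        params["fallback_layout_uuid"] = fallback_layout_uuid
    return params


# Placeholder value for illustration only; use the UUID of a layout from your own release.
PLACEHOLDER_UUID = "00000000-0000-0000-0000-000000000000"

forced = build_layout_params(layout_uuid=PLACEHOLDER_UUID)
fallback = build_layout_params(fallback_layout_uuid=PLACEHOLDER_UUID)
```

Validating the combination on the client side, before the request is sent, gives an immediate error message and avoids waiting for the API to reject the request.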
Differences between matching pages to Structured and Semi-structured layouts
The tables below outline the differences between matching pages to Structured and Semi-structured layouts.
Features
Feature | Structured | Semi-structured |
Automatic Classification | Yes | Yes |
Manual Classification | Yes | Yes |
Layout identifiers | Yes | No |
Classification model | No | Yes |
Settings
Setting | Structured | Semi-structured |
Layout match threshold | Yes | No |
Semi-structured target accuracy | No | Yes |
Semi-structured grouping logic | No | Yes |
To learn more about Classification and Classification settings, see Document Classification and Flow Settings.
Submissions with Structured and Semi-structured pages
If you have submissions that contain both Structured and Semi-structured pages, here’s how the system will try to match the pages to layouts:
Structured pages will be identified and matched to Structured layouts.
The Classification model will try to match the remaining pages as Semi-structured or Additional. Note that the Classification model can only match pages to layouts that are part of a live release.
You will be asked to perform a Manual Classification task – you have to classify all pages that could not be matched to a layout.
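The order of the steps above can be sketched in Python as follows. This is a simplified model rather than the actual system: `route_pages` and the two matcher callables are hypothetical stand-ins for the Structured matching engine and the Classification model, and each is assumed to return a layout name or `None` when it cannot match a page.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A matcher takes a page identifier and returns a layout name, or None if there is no match.
Matcher = Callable[[str], Optional[str]]


def route_pages(pages: List[str],
                match_structured: Matcher,
                classify_with_model: Matcher) -> Tuple[Dict[str, str], List[str]]:
    """Return (automatically matched pages, pages left for Manual Classification)."""
    matched: Dict[str, str] = {}

    # Step 1: identify Structured pages and match them to Structured layouts.
    remaining: List[str] = []
    for page in pages:
        layout = match_structured(page)
        if layout is not None:
            matched[page] = layout
        else:
            remaining.append(page)

    # Step 2: the Classification model tries to match the remaining pages.
    manual: List[str] = []
    for page in remaining:
        layout = classify_with_model(page)
        if layout is not None:
            matched[page] = layout
        else:
            manual.append(page)

    # Step 3: anything still unmatched goes to a Manual Classification task.
    return matched, manual


# Stub matchers backed by dictionaries; the page and layout names are made up.
structured_results = {"page-1": "Structured Form"}
model_results = {"page-2": "Semi-structured Invoice"}

matched, manual = route_pages(
    ["page-1", "page-2", "page-3"],
    structured_results.get,
    model_results.get,
)
```

In this example, `page-1` is matched in the first step, `page-2` in the second, and `page-3` is left for Manual Classification.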
Troubleshooting
When your submission’s pages go through Classification, issues may occur. To understand how to resolve these issues, we recommend going through the following articles: