Automatic Document Classification

Overview

Automatic Document Classification enables Semi-structured and Additional documents to be automatically matched to their respective layout. For more information on document types, see Understanding Document Types.

Automatic Document Classification is performed by a machine learning model, called a Classification Model, that must be trained to recognize different Semi-structured and Additional layouts. This model will continue to improve over time as more submissions are processed.

A Classification Model is deployed with a release and must be trained on all the Semi-structured and Additional layouts contained in that release. The group of Semi-structured and Additional layouts used to train a Classification Model is called its ‘layout family.’ Although a Classification Model may be generated for a particular release, it exists independently and can be used for any other release that has the same or fewer Semi-structured or Additional layouts.

If a newly created release does not have a compatible Classification Model, a new untrained model will automatically be created for the release. If a new release contains the same or fewer Semi-structured or Additional layouts than the layout family of an existing Classification Model, that model is considered compatible and will be used for the release.
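The compatibility rule above can be read as a subset check. The following is a minimal sketch, not the product's implementation: it assumes "the same or fewer layouts" means that every Semi-structured or Additional layout in the release already belongs to the model's layout family. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class ClassificationModel:
    """Hypothetical stand-in for a Classification Model record."""
    name: str
    layout_family: FrozenSet[str]  # IDs of the Semi-structured and Additional layouts


def is_compatible(model: ClassificationModel, release_layouts: Iterable[str]) -> bool:
    """Assumed rule: every layout in the release is in the model's layout family."""
    return set(release_layouts) <= model.layout_family


def find_compatible_model(
    models: Iterable[ClassificationModel], release_layouts: Iterable[str]
) -> Optional[ClassificationModel]:
    """Return the first compatible model, or None.

    None corresponds to the case where a new untrained model would be created.
    """
    layouts = list(release_layouts)
    for model in models:
        if is_compatible(model, layouts):
            return model
    return None
```

For example, a model whose layout family is {A, B, C} would be reused for a release containing only A and B, while a release containing A and D would get a new untrained model.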

Whether or not Automatic Document Classification is enabled, Hyperscience continues to match structured documents to their respective layouts. With the feature enabled, an additional classification step is introduced when processing begins. Any structured documents in a submission are first matched to structured layouts; the remaining unclassified pages are then processed by the Classification Model, which matches them to Semi-structured or Additional layouts. Any pages still not matched to a layout go to Document Organization to be manually matched. Any Semi-structured or Additional documents submitted with their layout ID skip all classification.
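The order of these steps can be sketched as a small routing function. This is an illustration only: the function name, the two matcher callbacks, and the returned labels are hypothetical, and passing None for the model callback stands in for the feature being disabled.

```python
from typing import Callable, Optional, Tuple

Matcher = Callable[[str], Optional[str]]  # takes page text, returns a layout ID or None


def route_page(
    page_text: str,
    supplied_layout_id: Optional[str],
    match_structured: Matcher,
    classify_with_model: Optional[Matcher],
) -> Tuple[Optional[str], str]:
    """Return (layout_id, how_it_was_matched) for one page."""
    # 1. Documents submitted with their layout ID skip all classification.
    if supplied_layout_id is not None:
        return supplied_layout_id, "supplied_layout_id"

    # 2. Structured matching runs whether or not the feature is enabled.
    layout = match_structured(page_text)
    if layout is not None:
        return layout, "structured_matching"

    # 3. With the feature enabled, the Classification Model tries the
    #    remaining pages against Semi-structured and Additional layouts.
    if classify_with_model is not None:
        layout = classify_with_model(page_text)
        if layout is not None:
            return layout, "classification_model"

    # 4. Anything still unmatched goes to manual Document Organization.
    return None, "document_organization"
```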

Getting Started

To use Automatic Document Classification, both the Semi-structured Classification setting and the Manual Classification Supervision setting must be enabled in your flow's settings. With the feature enabled, users will likely first encounter a Classification Model on the release detail page.

An untrained Classification Model will automatically be created for any release that has at least one Semi-structured or Additional layout and does not already have a compatible Classification Model. The model will be created with a layout family consisting of all the Semi-structured and Additional layouts in the release. A new model cannot classify pages until it has been trained. To train a model, the system uses training data from all flows where the model’s layout is live. To learn more about classification model training, see Training a Classification Model.

How the Model Classifies Document Pages and Model Performance

The classification model distinguishes between layouts based on a variety of text-based parameters. These can include term or word frequency, combinations of words, and types of text. The model does not look at visual features.
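To make these kinds of text-based parameters concrete, here is a minimal sketch of how word frequencies, word combinations, and a simple "type of text" signal could be counted for a page. It is not the feature set used by the Classification Model, which this article does not specify; the function name and feature labels are hypothetical.

```python
import re
from collections import Counter


def text_features(page_text: str) -> Counter:
    """Count simple text-only features for one page (illustrative only)."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())

    # Term or word frequency.
    features = Counter(f"word:{w}" for w in words)

    # Combinations of words: adjacent word pairs.
    features.update(f"pair:{a} {b}" for a, b in zip(words, words[1:]))

    # Types of text: here, just how many tokens are purely numeric.
    features.update("type:number" for w in words if w.isdigit())

    return features
```

Two pages with similar wording would produce similar counts regardless of how they look, which is consistent with the model ignoring visual features.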

Model Performance with Document Page Variability

If submitted documents contain pages that are expected to match layouts in the layout family but differ significantly from the pages used to train the model, model performance may decrease, causing a drop in automation.

Over time, the pages in submissions expected to match to layouts in the layout family may change (e.g., a user begins processing a new invoice that looks significantly different from existing invoices, but expects it to match to an existing Semi-structured layout for invoices). If this is anticipated, we encourage users to upload the new pages as additional training pages to the layouts in the layout family and run training for the model immediately afterward.

If changes in these pages are not anticipated, continuous model training will help maintain optimal performance. A new model will be trained and deployed once training data is generated from the submissions with new pages. The new model may have slightly lower automation than the preceding model, but will perform better on the new document pages being submitted compared to the preceding model.

Classification Supervision, QA, Continuous Model Training, and Settings

Enabling Automatic Document Classification can create additional Supervision and QA tasks. Predictions from the classification model below the target accuracy will generate a document organization supervision task. You can specify the target accuracy in the Semi-structured Target Accuracy setting in the "Classification" section of your flow's settings.
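The effect of the target accuracy setting can be sketched as a simple split. This article does not describe how the target accuracy is applied to individual predictions, so the sketch assumes that each prediction carries a score that can be compared directly with the target. The tuple shape and function name are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical prediction record: (page_id, predicted_layout, score between 0 and 1)
Prediction = Tuple[str, str, float]


def split_by_target(
    predictions: Iterable[Prediction], target_accuracy: float
) -> Tuple[List[Prediction], List[Prediction]]:
    """Return (automated, needs_supervision) under the assumed comparison."""
    automated: List[Prediction] = []
    needs_supervision: List[Prediction] = []
    for prediction in predictions:
        score = prediction[2]
        if score < target_accuracy:
            # Below the target: a supervision task would be generated.
            needs_supervision.append(prediction)
        else:
            automated.append(prediction)
    return automated, needs_supervision
```

Under this assumption, raising the target moves more predictions into supervision, and lowering it increases automation at the cost of accepting less certain matches.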

After submissions are complete, a percentage of documents will be sampled for Classification QA. These QA tasks prompt users to confirm that the model classifications are correct. Completing Classification QA is necessary to enable accuracy and automation reporting for Automatic Document Classification. The QA sample rate can also be adjusted in the same group of settings.
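As a rough illustration of what a QA sample rate means, the sketch below draws a random fraction of completed documents. The actual sampling method is not described in this article; uniform random sampling, rounding up, and the function name are all assumptions.

```python
import math
import random
from typing import Iterable, List, Optional


def sample_for_qa(
    document_ids: Iterable[str], sample_rate: float, seed: Optional[int] = None
) -> List[str]:
    """Pick a fraction of documents for Classification QA (illustrative only)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    ids = list(document_ids)
    # Round up so that a small non-zero rate still samples at least one document.
    count = math.ceil(len(ids) * sample_rate)
    # A seeded generator makes the selection repeatable for testing.
    return random.Random(seed).sample(ids, count)
```

A higher sample rate produces more QA tasks and therefore more data for accuracy and automation reporting, at the cost of more manual work.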

Model Performance and Manual Working Time

Adequate model training will reduce the manual working time around Semi-structured and Additional layouts. If a Classification Model is not sufficiently trained, additional manual tasks can be created for users. When model performance is low, more page match predictions from the model will be below the target accuracy, generating Classification Supervision tasks. It is also possible that the Classification Model will make incorrect predictions with high confidence. These pages can be marked as an incorrect layout during Field ID Supervision. If model performance is low, we recommend uploading additional training pages to the layouts in the layout family.

Continuous Model Training and When New Training Data is Generated

If the Continuous Classification model improvement setting is enabled in the "Document Classification" section of Administration > Settings, new models will be trained automatically and deployed as soon as training completes. On the Classification Model detail page, users can see a history of model training and deployments in the Model Activity card.

New training pages for a Classification Model are generated in a number of ways. During Classification Supervision, pages matched to layouts in the layout family of a Classification Model will become training pages. Training pages can also be generated as Classification QA tasks are completed. If a Field ID Supervision task is completed without marking the layout incorrect, the pages in that document become training pages for the Classification Model. Classification QA and model training can also generate Model Validation Tasks, which provide the model with valuable training data.

Enabling continuous training

Note that Continuous Classification model improvement is disabled by default. For optimal performance, we recommend that you train models manually and keep the Continuous Classification model improvement setting disabled. Only enable Continuous Classification model improvement if instructed to do so by a Hyperscience representative.

If you enable it and you import a model from another instance, your automation rates may be reduced.
Models only use training data from their current environment. The only exception to this rule is models that were downloaded and imported along with their training pages, as described in Model Management.
If you do not import training data with the model and don’t have enough training data in your new environment, the model you imported will be overwritten by a lower-performing one.
You cannot enable Continuous Classification model improvement for individual models. The setting is either enabled or disabled for all Classification models in your instance.

Disabling Automatic Document Classification

Disabling the Semi-structured Classification setting prevents the model from classifying pages during processing. Users can still train Classification Models in locked releases, access model detail pages, download and upload models, and manage models in all other ways. Documents matching to structured layouts will still be classified.

Classification Reporting

Reporting of Automatic Document Classification performance can be found in the following charts:

Reporting → Overview → “Automation” Chart → select Classification in the filter
Reporting → Accuracy → “Manual vs Machine Accuracy” Chart → select Classification in the filter

For more information, see Automation and Manual Accuracy vs. Machine Accuracy.