Classification models are a crucial part of document processing as they help the system determine which layout should be used to process each page you upload. Training Data Management for Classification allows you to add, remove, and update training pages for Semi-structured Classification (also known as NLC) models to achieve more accurate classification results. In this way, TDM helps you maximize the performance of your Classification models.
NLC (Non-structured layout classifier) finds the correct Semi-structured or Additional layout for a given set of submission pages, based on the words in the submitted documents. Note that NLC works on a page level. Learn more in Automatic Document Classification.
Each release contains a set of layouts. The creation of a release generates a single Classification model. For example, if you create a release with two layouts, then one Classification model will be generated. It will be trained to identify the document pages submitted through the release’s flow.
If you create a new layout and create a new release for it, then a new Classification model will be created. Note that you need to add your new layout to a new release or a copy of an existing release for document pages to be matched to that layout. Learn more in Adding a New Release.
TDM for Classification logic
TDM for Classification operates on a document level. However, the Training Data tab displays the number of uploaded documents and the required and recommended number of pages per layout. Learn more in the Training Data Tab section of this article.
TDM for Classification allows you to manage example documents that should be included in or excluded from your model’s training:
Layouts eligible to train - These are the layouts that meet the minimum number of pages required for training. To ensure this requirement is met, upload documents that:
have pages that match your layout and
are diverse but still represent your layout.
Our recommendation for a robust model is 120 page examples per layout.
You need at least 10 page examples to meet the minimum requirements for model training.
Do not upload the same document multiple times.
Excluded documents - TDM uses these as examples of documents that you expect to process but don't want to match. They serve as counter-examples of the documents that your model should not classify.
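The guidance above not to upload the same document multiple times can be checked locally before you upload anything. The following is a minimal sketch, not a Hyperscience feature: the function name `find_duplicate_files` is hypothetical, and it only detects files that are byte-for-byte identical, so a rescanned copy of the same document would not be caught.

```python
import hashlib
from pathlib import Path


def find_duplicate_files(paths):
    """Group files by SHA-256 digest and return the groups with more than one file.

    Hypothetical local helper: it only finds byte-identical copies.
    """
    by_digest = {}
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        by_digest.setdefault(digest, []).append(str(path))
    # Keep only digests shared by two or more files.
    return [group for group in by_digest.values() if len(group) > 1]
```

Running this over a folder of candidate training files before upload gives you a list of groups to trim down to one file each.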
Access TDM for Classification
To access TDM for Classification, go to Library > Models and click on Classification Models in the drop-down menu at the top of the page.
A table with all Classification models appears:
You can filter the Classification Models table by release with the Filter by release drop-down list, which is located on the right-hand side of the page.
You can also access the model management page for a particular model by clicking on its name in the table.
You can import models by clicking the Import Model button in the upper-right corner of the page.
The Classification models table contains the following columns:
Model shows the name of your Classification model.
Compatible Releases indicates the number of releases the Classification model can predict.
Status shows the model's current state (e.g., Needs Training or Live).
Training Status indicates the current state of the model training (e.g., Pending, In Progress, Failed, Canceled, or Last trained on [date]).
To access TDM features for your Classification model:
Go to Library > Releases
Click on the release's name for the model you want to manage training data for.
Click View Model in the Automatic Document Classification card.
Using TDM for Classification
Overview tab
Projected Automation Chart
The Projected Automation chart displays the performance of the model that’s currently live. Learn more about these metrics in our Accuracy article.
Expand it by clicking the arrow button.
The chart displays how the target accuracy affects the automation. The lower the accuracy, the higher the automation, and vice versa. To learn more, see Automation.
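The trade-off described above can be illustrated with a generic confidence-threshold sketch. This is an assumption made for illustration only, not the calculation behind the chart: the function `projected_automation` and the sample records are hypothetical. It ranks reviewed pages by model confidence and finds the largest share that could be classified automatically while accuracy on that share stays at or above the target.

```python
def projected_automation(records, target_accuracy):
    """Return the largest share of pages that can be automated at the target accuracy.

    records: list of (confidence, is_correct) pairs from reviewed pages.
    Illustrative only; ties in confidence are not treated specially.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    best = 0
    correct = 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        # Accuracy if the k most confident pages were automated.
        if correct / k >= target_accuracy:
            best = k
    return best / len(ranked) if ranked else 0.0


# Hypothetical reviewed pages: (model confidence, was the prediction correct?)
records = [(0.99, True), (0.95, True), (0.90, True),
           (0.80, False), (0.70, True), (0.60, False)]

for target in (1.0, 0.8, 0.6):
    print(target, projected_automation(records, target))
```

With this sample data, lowering the target from 1.0 to 0.8 to 0.6 raises the automated share each time, which mirrors the relationship the chart shows.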
Note that projected model performance (i.e., accuracy and automation) can increase by adding more QA records. You can also see the margin of error (MoE) for this model.
Margin of Error (MoE)
The Margin of Error (MoE) indicates the allowable range of inaccuracy in the system’s results. It shows you how much the output can differ from the true value while still being acceptable. A smaller margin of error means the system is more accurate.
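This article does not state how the MoE is calculated. As a rough intuition for why more QA records help, the sketch below uses the standard normal-approximation formula for a proportion, MoE = z * sqrt(p * (1 - p) / n). The function name, the 95% z-value of 1.96, and the assumption that QA records form a random sample are all illustrative choices, not confirmed product behavior.

```python
import math


def margin_of_error(accuracy, sample_size, z=1.96):
    """Normal-approximation margin of error for an observed accuracy.

    accuracy: observed share of correct results, between 0 and 1.
    sample_size: number of reviewed records (assumed to be a random sample).
    z: z-score for the confidence level (1.96 is roughly 95%).
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(accuracy * (1 - accuracy) / sample_size)


# Quadrupling the number of reviewed records halves the margin of error.
small_sample = margin_of_error(0.95, 100)
large_sample = margin_of_error(0.95, 400)
print(round(small_sample, 4), round(large_sample, 4))
```

Under this formula the margin shrinks with the square root of the sample size, so each additional batch of QA records narrows it by less than the previous one.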
Model History table
The Model History table provides a comprehensive overview of your model’s lifecycle. It displays the following columns:
Name — The name of the last available model for this layout.
Date Created — Date and time the model was created. Helps in tracking the model’s version history and ensures you’re working with the most recent model version.
Version — The trainer version that the model was trained on.
Source — Indicates where the model was trained—either within the current instance or externally and then uploaded to this instance.
Proj auto — Displays the predicted automation based on the Test Target Accuracy. Learn more in our Evaluating Model Training Results article.
Docs Trained — The total number of documents used for training the model.
Last Deploy — The date and time the model was last deployed.
Actions — The options in this menu allow you to take the following actions on a model version:
Deploy
Undeploy
Download
In v40.2 and later, you can find specific records in the table in the following ways:
Filtering — Filter the contents of the Model History table by creation date, last-deploy date, source, and trainer version. Click Filter and select the criteria that match what you’re looking for.
Searching — Enter the name of a model version in the search box.
Sorting — Sort the table's contents by clicking on the names of the following columns:
Name
Date created
Version
Source
Last deploy
Additionally, you can choose which columns are included in the table by clicking the menu next to the Filter drop-down list and clicking the Manage columns… option.
You can also adjust the target accuracy by clicking the up and down arrows located next to the Manage Columns option.
Model Compatibility table
The Model Compatibility table indicates the releases that your model is compatible with and contains the columns described below.
Release name displays the names of the releases to which your layouts are added. Click on the column header to sort its contents by name.
Created On shows the creation date of the release. Sort its contents chronologically by clicking on the column header.
Status indicates whether your release is live or locked. Learn more in What is a Release?. Sort this column’s contents by clicking its header.
You can use the pagination options at the bottom of the table to display all releases that are compatible with your model.
Training Data tab
Training Data Summary card
The Training Data Summary card displays insights on the status of your training dataset.
Training Data Status indicates your training data's health based on the number of pages uploaded for each layout:
Requirements Not Met - The minimum number of required pages uploaded for each layout is 10. If you’ve added fewer than 10 pages for a layout, you won’t be able to run a training.
Not Optimized - Hyperscience recommends uploading at least 120 pages per layout to build a robust classification model. If you upload at least 10 but fewer than 120 pages, the status will indicate that your training data can be optimized. However, you’ll still be able to proceed with training.
Ready To Train - This status will be displayed after you’ve reached the minimum required and the recommended number of uploaded pages to start a model training.
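The three statuses above can be summarized as a small decision rule. The sketch below is illustrative only: the function `training_data_status` is hypothetical, and it assumes the overall status is driven by the layout with the fewest uploaded pages, which the article implies but does not state outright.

```python
MIN_PAGES = 10            # minimum pages per layout required to train
RECOMMENDED_PAGES = 120   # pages per layout recommended for a robust model


def training_data_status(pages_per_layout):
    """Return the expected status label for a {layout name: page count} mapping.

    Assumption: the layout with the fewest pages determines the status.
    """
    if not pages_per_layout:
        return "Requirements Not Met"
    fewest = min(pages_per_layout.values())
    if fewest < MIN_PAGES:
        return "Requirements Not Met"
    if fewest < RECOMMENDED_PAGES:
        return "Not Optimized"
    return "Ready To Train"
```

For example, a dataset with 9 pages for one layout and 200 for another would still be reported as Requirements Not Met under this rule, because one layout is below the minimum.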
Number of examples
You need to upload at least 10 examples per layout for the model to learn what documents should be considered as part of your training set. Note that you can run a training without the Excluded examples if your release contains more than one layout. However, we recommend adding documents in the Excluded section as well, as they serve as counter-examples.
Layouts Eligible to train - indicates the number of layouts that meet the minimum requirements for training.
Pages / Req./ Recomm - displays the number of uploaded pages, the number of required pages, and the number of recommended pages for your model.
Documents - shows the number of documents uploaded for this model.
Excluded Pages / Req. - number of pages used as counter-examples. The excluded pages train the model on what should not be matched. Note that they are recommended but not required.
Excluded Documents - the number of excluded documents.
Training Data Health card
The Training Data Health card displays a breakdown of your dataset. It shows all layouts included in the Classification model, as well as bars next to each layout indicating the number of uploaded pages. Note that the required and recommended numbers of pages are shown for each layout.
Follow the steps below to add training data to your model:
Click the Add Training Data button.
Select the layout you want to add data for from the Upload To Layout drop-down.
Drag and drop your files into the dialog box or click Browse.
Once you’ve uploaded your files, click Continue.
Training Data table
The Training Data table displays all documents that can be used as training data for the model.
It contains the following columns:
Document ID shows the unique ID number of the document.
Hover over the preview icon to see the pages of the uploaded document.
Freeze the preview by clicking on the preview icon.
Page through the document using the arrow keys on your keyboard or the arrows in the preview dialog.
Click anywhere on the page to hide the preview.
TDM for Classification works on a document level
When working with the training data in TDM, you will classify the whole document and not a single page to a specific layout. However, QA operates at the page level.
Pages displays the number of pages in the document.
Layout shows the layout this example corresponds to.
Usage Rule indicates the way the system will use the specific document for training:
Always - The document will always be used in future model trainings. It will never be deleted, regardless of the system’s data-deletion settings.
Auto - The document may be used in future model trainings until it is automatically deleted according to the system’s data-deletion settings.
Never - The document will never be used in future model trainings and will be automatically deleted according to the system’s data-deletion settings. Documents processed through Supervision or QA will always display a status of 'Never' in TDM.
Loading - The document has just been uploaded and is going through pre-processing. Once it loads, the status will change to 'Auto'.
Anomaly - The model was not confident enough for a document, and an anomaly was generated after the model training. Review the anomaly and continue with the process.
High-confidence errors
The machine might classify a page with high confidence yet still be incorrect. This type of mistake is known as a high-confidence error. To confirm and correct such errors, users must complete Model Validation Tasks (MVTs), which are shown as anomalies in TDM. Learn more in Document Classification Model Validation Tasks.
QA - Indicates the legacy documents that were processed through QA in v37 or earlier.
Excluded Training Data table
The Excluded Training Data table displays the documents used as counter-examples for your classification model. The columns are the same as those described above for the Training Data table.
Excluded Documents Required vs Recommended
The excluded documents help train the model on what should not be matched. They serve as counter-examples and are only required if your release contains a single layout.
If you have multiple layouts in your release, we recommend uploading excluded documents to achieve a higher model performance.
You can change the displayed columns by clicking on Manage Columns… in the drop-down menu.
You can filter the tables by:
Layout
Usage Rule
Source
Scheduled Deletion (Scheduled Del.)
Has Anomalies - Shows only the documents that were incorrectly classified for this layout.
The Actions drop-down menu provides options to bulk-delete, edit, or download training data, as well as to download the entire training dataset.
Training a Classification Model
Follow the steps described below to learn how to train a classification model using TDM.
Upload your documents
To upload documents to TDM for Classification:
Click the Add Training Data button on the right-hand side of the Training Data Health card.
Choose Upload Files or Import Training Data from the dialog box.
Importing training data
You can import the following training data:
A Hyperscience export
A .ZIP file containing sub-folders, where the name of each folder is the name of an existing layout. We recommend naming each layout differently to avoid any confusion when importing training data. Do not include special characters in the names of the layouts, as doing so could lead to unexpected behavior during import.
You can also add examples directly to a layout by clicking add documents next to that layout. The same dialog box will appear, but the options for importing training data and uploading to a layout will be grayed out. Alternatively, add documents directly from a layout’s details page by clicking Upload Documents.
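The ZIP structure described above for importing training data (one sub-folder per existing layout) can be assembled with Python's standard `zipfile` module. This is a sketch under stated assumptions: the helper `build_training_zip` is hypothetical, and because the article does not define "special characters", the check below conservatively allows only letters, digits, spaces, underscores, and hyphens in layout names.

```python
import re
import zipfile
from pathlib import Path

# Assumption: anything outside this set counts as a "special character".
SAFE_NAME = re.compile(r"^[A-Za-z0-9 _-]+$")


def build_training_zip(files_by_layout, zip_path):
    """Write a ZIP with one sub-folder per layout name.

    files_by_layout: {layout name: [paths of example files]}
    zip_path: where to write the archive.
    """
    for layout in files_by_layout:
        if not SAFE_NAME.match(layout):
            raise ValueError(f"Layout name has special characters: {layout!r}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for layout, files in files_by_layout.items():
            for file in files:
                # Store each file under a folder named after its layout.
                archive.write(file, arcname=f"{layout}/{Path(file).name}")
    return zip_path
```

The folder names must match the names of layouts that already exist in your instance. The sketch cannot verify that, so check the names against your layout list before importing.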
Review your documents
Review your documents using the Training Document View. It helps you match each document to a specific layout.
TDM for Classification works on a document level
You will classify the whole document and not a single page to a specific layout.
Assign a layout to the training document from the Layout drop-down menu on the right-hand side of the page.
Change the status of your document from the Training Status section.
Click Save Changes after you’ve classified your document.
Training a Classification Model
Classification models data
Different Classification models across flows share the same data in TDM. Any changes applied to the training data (e.g., updating or removing documents) are also applied to all releases and flows.
After you’ve reached the requirements and recommendations, you’ll see a message indicating that your model is ready to be trained for the first time in the Overview tab.
You can run training from either tab by clicking the Run Training button in the upper-right corner of the page.
You can cancel your training at any time from the drop-down menu in the upper-right corner of the page.
You can also download your model from the same drop-down menu.
Deploying a Classification model
Classification models are automatically deployed after training.
Importing a Classification Model
In v41.1 and above, you can import your classification models through the Overview tab of your model. Follow the steps below to learn how to import classification models.
To import a model:
Open your Classification model.
On the Overview tab, click the Actions menu next to the Run Training button.
Click Upload Model from the drop-down.
Drag and drop your ZIP file directly into the dialog box, or click Browse to find the file on your machine and upload it.
Click Submit.
Importing Classification Models
When importing Classification models, make sure they are trained for the same release version as the currently opened classification model.
If the model is from a different release, an error message will appear, indicating the mismatch.
Note that you can import classification models only by themselves; you cannot import them together with their training data.
Learn more about migrating data in Migrating Artifacts and System Assets.