Classification models are a crucial part of document processing, as they help the system determine which layout should be used to process each page you upload. Training Data Management (TDM) for Classification allows you to add, remove, and update training pages for Classification (also known as NLC) models to achieve more accurate classification results. In this way, TDM helps you maximize the performance of your Classification models.
NLC (Non-structured layout classifier) finds the correct Semi-structured or Additional layout for a given set of submission pages, based on the words in the submitted documents. Note that NLC works on a page level. Learn more in Automatic Document Classification.
Each release contains a set of layouts. The creation of a release generates a single Classification model. For example, if you create a release with two layouts, then one Classification model will be generated. It will be trained to identify the document pages submitted through the release’s flow.
If you create a new layout and create a new release for it, then a new Classification model will be created. Note that you need to add your new layout to a new release or a copy of an existing release for document pages to be matched to that layout. Learn more in Adding a New Release.
TDM for Classification logic
TDM for Classification operates on a document level. However, the Training Data tab displays the number of uploaded documents and the required and recommended number of pages per layout. Learn more in the Training Data Tab section of this article.
TDM for Classification allows you to manage example documents that should be included in or excluded from your model’s training:
Layouts eligible to train - These are the layouts that meet the minimum number of pages required for training. To ensure this requirement is met, upload documents that:
have pages that match your layout and
are diverse but still represent your layout.
Our recommendation for a robust model is 120 page examples per layout.
You need at least 10 page examples to meet the minimum requirements for model training.
Do not upload the same document multiple times.
Excluded documents - TDM uses these as examples of documents that you expect to process, but don't want to match. They serve as counter-examples of the documents that your model should not classify.
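Because the same document should not be uploaded more than once, it can help to check a local folder for exact copies before uploading. The following is a minimal sketch and not a Hyperscience feature: the helper name find_duplicate_files and the idea of keeping training files in a local folder are assumptions made for illustration. It compares file contents byte for byte, so a rescan of the same paper page would not be detected.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(folder):
    """Group files under `folder` that have identical contents.

    Hypothetical helper: compares SHA-256 digests of the raw bytes,
    so only exact copies are reported.
    """
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the digests that are shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists files with identical contents. Keeping one file per group before uploading avoids duplicate training examples.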
Using TDM for Classification
Learn how to navigate through TDM for Classification and how to use it in the tabs below.
Classification Models table
Access TDM for Classification
To access TDM for Classification, go to Library > Models, and click on Classification Models in the drop-down menu at the top of the page.
A table with all Classification models appears:
You can filter the Classification models table by release with the Filter by release drop-down list, which is located on the right-hand side of the page.
You can also access the model management page for a particular model by clicking on its name in the table.
You can import models by clicking the Import Model button in the upper-right corner of the page.
The Classification models table contains the following columns:
Model shows the name of your Classification model.
Compatible Releases indicates the number of releases that the Classification model can generate predictions for.
Status shows the model's current state (e.g., Needs Training or Live).
Training Status indicates the current state of the model training (e.g., Pending, In Progress, Failed, Canceled, or Last trained on [date]).
To access TDM features for your Classification model:
Go to Library > Releases
Click on the release's name for the model you want to manage training data for.
Click View Model in the Automatic Document Classification card.
Click on the Overview or Training Data tab to learn more about the model and optimize its performance.
Overview Tab
Classification Overview tab
The Classification Overview tab contains the chart and tables described below.
Projected Automation chart
The Projected Automation chart displays the predicted automation rate of your model based on the target accuracy. Learn more about these metrics in our Accuracy article.
Model Activity table
The Model Activity table shows the training history for your classification models and has the following columns:
Training Started indicates the start date and hour of the training process.
Status displays the status of your model (e.g., Training in Progress, Training Failed, Training Canceled, Scheduled Training, or Needs Training).
Actions allows you to download the last trained version of your model.
You can also download the current version of your model from the drop-down menu next to the Run Training button.
If you download the training data for the model, it may contain personally identifiable information. Learn more about managing your data in PII Data Deletion.
The System Version is the Hyperscience version the model was trained in. Change it from the System Version drop-down menu.
Use the pagination options at the bottom of the table to display all activities for your Classification model.
Model Compatibility table
The Model Compatibility table indicates the releases that your model is compatible with and contains the columns described below.
Release name displays the names of the releases in which your layouts are added. Click on the column header to sort its contents by name.
Created On shows the creation date of the release. Sort its contents chronologically by clicking on the column header.
Status indicates whether your release is live or locked. Learn more in What is a Release? Sort this column’s contents by clicking its header.
You can use the pagination options at the bottom of the table to display all releases that are compatible with your model.
Training Data tab
The Training Data tab contains the sections described below.
Summary
This section shows your training data stats, as well as the date and hour of the last model training.
Training Data Status indicates your training data's health, based on the number of pages uploaded for each layout:
Requirements Not Met - Each layout requires a minimum of 10 uploaded pages. If you’ve added fewer than 10 pages for a layout, you won’t be able to run a training.
You need to upload at least 10 examples per layout for the model to learn what documents should be considered as a part of your training set. Note that you can run a training without Excluded examples if you have more than two layouts. However, we recommend adding documents in the Excluded section, as well, as they serve as counter-examples.
Not Optimized - Hyperscience recommends uploading at least 120 pages per layout to build a robust classification model. If you upload at least 10 but fewer than 120 pages, the status will indicate that your training data can be optimized. However, you’ll still be able to proceed with training.
Ready To Train - This status will be displayed after you’ve reached the minimum required and the recommended number of uploaded pages to start a model training.
Layouts Eligible to train indicates the number of layouts that meet the minimum requirements for training.
Excluded Documents shows the number of documents used as counter-examples. The excluded documents train the model on what should not be matched. Note that they are recommended but not required.
Last Training displays the date and hour of the last model training.
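The thresholds in this section can be sketched in a few lines of Python. This is an illustration only, not the product's implementation: the function names are hypothetical, and the sketch assumes that the layout with the fewest pages decides the overall status. The 10-page minimum and the 120-page recommendation come from this article. Because documents are managed as a whole but counted in pages, the sketch first sums page counts per layout.

```python
MIN_PAGES = 10           # minimum pages per layout (from this article)
RECOMMENDED_PAGES = 120  # recommended pages per layout (from this article)


def pages_per_layout(documents):
    """Sum page counts per layout from (layout, page_count) pairs."""
    totals = {}
    for layout, page_count in documents:
        totals[layout] = totals.get(layout, 0) + page_count
    return totals


def training_data_status(totals):
    """Return an illustrative status for a {layout: pages} mapping.

    Assumption: the layout with the fewest pages decides the status.
    """
    if not totals:
        return "Requirements Not Met"
    fewest = min(totals.values())
    if fewest < MIN_PAGES:
        return "Requirements Not Met"
    if fewest < RECOMMENDED_PAGES:
        return "Not Optimized"
    return "Ready To Train"


def eligible_layouts(totals):
    """Return the layouts that meet the minimum page requirement."""
    return sorted(name for name, pages in totals.items() if pages >= MIN_PAGES)


# Example: two Invoice documents (4 and 8 pages) and one 3-page Receipt.
totals = pages_per_layout([("Invoice", 4), ("Invoice", 8), ("Receipt", 3)])
status = training_data_status(totals)   # "Requirements Not Met"
eligible = eligible_layouts(totals)     # ["Invoice"]
```

In the example, Invoice reaches 12 pages and is eligible, while Receipt has only 3 pages, so the overall status in this sketch remains Requirements Not Met.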
Training Data Health
The Training Data Health card displays a breakdown of your dataset. It shows all layouts included in the Classification model, as well as bars next to each layout indicating the number of uploaded pages (not documents, as suggested in the application). Note that the required and recommended numbers of pages are shown for each layout.
Training Data
The Training Data table shows all documents available for use as training data for the model. It contains the following columns:
Document ID shows the unique ID number of the document.
Hover over the preview icon to see the pages of the uploaded document. Freeze the preview by clicking on the preview icon. Page through the document using the arrow keys on your keyboard or the arrows in the preview dialog. Click anywhere on the page to hide the preview.
Note that TDM for Classification works on a document level (i.e., when you edit the classification in TDM, you classify the whole document, not a single page, to a specific layout), whereas QA operates on a page level. For example, if you classify 3 pages into 2 different layouts in QA, those 3 pages will be combined into a single document in TDM. That document will keep the machine's prediction for the layout.
Pages displays the number of pages in the document.
Layout shows the layout this example corresponds to.
Usage Rule indicates the way the system will use the specific document for training:
Always - The document will always be used in future model trainings. It will never be deleted, regardless of the system’s data-deletion settings.
Auto - The document may be used in future model trainings until it is automatically deleted according to the system’s data-deletion settings.
Never - The document will never be used in future model trainings and will be automatically deleted according to the system’s data-deletion settings. Documents processed through Supervision or QA will always display a status of 'Never' in TDM.
Loading - The document has just been uploaded and is going through pre-processing. Once it loads, the status will change to 'Auto'.
Source indicates how the document was added to Training Data Management:
Upload - The document was uploaded manually through TDM.
Processing - The document was uploaded through Submissions.
Anomaly - The user changed the layout for the document during a Model Validation Task, generated after the model training.
The machine might classify a page with high confidence yet still be incorrect. This type of mistake is known as a high-confidence error. To confirm and correct such errors, users must complete Model Validation Tasks (MVTs), which are shown as anomalies in TDM. Learn more in Document Classification Model Validation Tasks.
QA - The document is a legacy document that was processed through QA in v37 or earlier.
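The Usage Rule values described above can be summarized as a small lookup. This sketch only restates the rules in this article. The class and function names are hypothetical, and the treatment of 'Loading' (not yet usable for training, later handled like 'Auto') is an assumption based on the note that the status changes to 'Auto' after pre-processing.

```python
from enum import Enum


class UsageRule(Enum):
    ALWAYS = "Always"
    AUTO = "Auto"
    NEVER = "Never"
    LOADING = "Loading"


def can_be_used_for_training(rule):
    """True if a document with this rule may be used in future trainings."""
    # Assumption: a document that is still loading is not used yet.
    return rule in (UsageRule.ALWAYS, UsageRule.AUTO)


def is_subject_to_auto_deletion(rule):
    """True if the system's data-deletion settings apply to the document."""
    # 'Always' documents are never deleted; 'Auto' and 'Never' are.
    # 'Loading' is treated like 'Auto' here (assumption, see lead-in).
    return rule is not UsageRule.ALWAYS
```

For example, a document marked 'Never' is excluded from training but is still removed according to the data-deletion settings, while a document marked 'Always' is kept and used regardless of those settings.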
Excluded Training Data
The Excluded Training Data table displays the documents used as counter-examples for your classification model. The columns are the same as those described above for the Training Data table.
You can change the displayed columns by clicking on Manage Columns… in the drop-down menu.
You can filter the tables by:
Layout
Usage Rule
Source
Scheduled Deletion
Has Anomalies - Shows only the documents that were incorrectly classified for this layout.
You can also search by Document ID. The Actions drop-down menu provides options to bulk-delete, edit, or download training data, as well as to download the entire training dataset.
Upload Documents to TDM
Uploading documents
To upload documents to TDM for Classification:
Click the Add Training Data button on the right-hand side of the Training Data Health card.
Choose Upload Files or Import Training Data from the dialog box.
You can import the following training data:
A Hyperscience export
A .ZIP file containing sub-folders, where the name of each folder is the name of an existing layout. We recommend naming each layout differently to avoid any confusion when importing training data. Do not include special characters in the names of the layouts, as doing so could lead to unexpected behavior during import.
You can also add examples directly to a layout by clicking add documents next to that layout. The same dialog box will appear, but the options for importing training data and uploading to a layout will be grayed out.
Alternatively, add documents directly from a layout’s details page by clicking Upload Documents.
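The .ZIP structure described above (one sub-folder per existing layout name) can be prepared with Python's standard zipfile module. This is a minimal sketch under stated assumptions: the helper name build_training_zip is hypothetical, the sub-folders are placed at the top level of the archive, and "special characters" is interpreted as anything other than letters, digits, spaces, hyphens, and underscores, which this article does not define.

```python
import re
import zipfile
from pathlib import Path

# Assumption: "special characters" means anything other than letters,
# digits, spaces, hyphens, and underscores. The article does not define it.
SAFE_LAYOUT_NAME = re.compile(r"[A-Za-z0-9 _-]+")


def build_training_zip(files_by_layout, zip_path):
    """Write a ZIP with one sub-folder per layout name.

    `files_by_layout` maps a layout name to a list of file paths.
    """
    # Validate all names first so no partial archive is written.
    for layout in files_by_layout:
        if not SAFE_LAYOUT_NAME.fullmatch(layout):
            raise ValueError(f"Layout name has special characters: {layout!r}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for layout, paths in files_by_layout.items():
            for path in paths:
                path = Path(path)
                # Archive member name is "<layout>/<file name>".
                archive.write(path, arcname=f"{layout}/{path.name}")
    return zip_path
```

Calling build_training_zip({"Invoice": ["inv1.pdf"], "Receipt": ["rec1.pdf"]}, "training.zip") would produce an archive containing Invoice/inv1.pdf and Receipt/rec1.pdf. The folder names must match the names of existing layouts for the import to assign the documents correctly.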
Review Documents
Training Document View
The training document view helps you match each document to a specific layout.
Note that TDM for Classification works on a document level (i.e., you classify the whole document, not a single page, to a specific layout).
Choose the layout you want to assign to the training document from the Layout drop-down list on the right-hand side of the page.
Change the status of your document from the Training Status section.
Click Save Changes after you've classified your document.
Classification Model Training
Different Classification models across flows share the same data in TDM. Any changes applied to the training data (e.g., updating or removing documents) are also applied to all releases and flows.
After you’ve met the requirements and recommendations, a message in the Overview tab indicates that your model is ready to be trained for the first time.
You can run training from either tab by clicking the Run Training button in the upper-right corner of the page.
Classification models are automatically deployed after training.
You can cancel your training at any time from the drop-down menu in the upper-right corner of the page.
Learn more about Classification in Document Classification.