To achieve better automation rates for document classification, a classification model must be trained for each Semi-structured and Additional layout.
Training a Classification Model
How to Initially Train the Classification Model
To train a new model, open the Classification Model detail page by clicking View Model. The Model Training card shows the layout family that the model will be trained to recognize. To upload pages, use Actions in the Model Training card.
A minimum of 10 document pages for each layout is required.
120 document pages for each layout are recommended.
A maximum of 5000 document pages for each layout can be uploaded.
A maximum of 18000 document pages for a classification model can be uploaded.
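The limits above can be checked before uploading. The following is a minimal sketch, not part of the product: the function name, the input format (a mapping of layout name to page count), and the message wording are all hypothetical, and it assumes the 18000-page model limit applies to the sum of pages across all layouts.

```python
# Limits taken from the list above; everything else here is illustrative.
MIN_PAGES_PER_LAYOUT = 10
RECOMMENDED_PAGES_PER_LAYOUT = 120
MAX_PAGES_PER_LAYOUT = 5000
MAX_PAGES_PER_MODEL = 18000


def check_page_counts(pages_per_layout):
    """Return a list of messages describing limit violations (hypothetical helper)."""
    messages = []
    for layout, count in pages_per_layout.items():
        if count < MIN_PAGES_PER_LAYOUT:
            messages.append(
                f"ERROR: {layout} has {count} pages; "
                f"at least {MIN_PAGES_PER_LAYOUT} are required."
            )
        elif count > MAX_PAGES_PER_LAYOUT:
            messages.append(
                f"ERROR: {layout} has {count} pages; "
                f"at most {MAX_PAGES_PER_LAYOUT} can be uploaded."
            )
        elif count < RECOMMENDED_PAGES_PER_LAYOUT:
            messages.append(
                f"WARNING: {layout} has {count} pages; "
                f"{RECOMMENDED_PAGES_PER_LAYOUT} are recommended."
            )

    # Assumption: the model-wide limit is the total across all layouts.
    total = sum(pages_per_layout.values())
    if total > MAX_PAGES_PER_MODEL:
        messages.append(
            f"ERROR: the model has {total} pages in total; "
            f"at most {MAX_PAGES_PER_MODEL} can be uploaded."
        )
    return messages


if __name__ == "__main__":
    for message in check_page_counts({"Invoice": 5, "Receipt": 150}):
        print(message)
```

An empty list means the planned upload is within every limit listed above.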
Once the pages are uploaded, train the model by clicking Run Training at the top of the Classification Model detail page. When training is complete, the Model Validation Tasks (MVTs) generated by model training will appear at the top of the same page. Complete these MVTs. To get the highest initial model performance, we encourage users to complete two or three rounds of MVTs and model training before setting the release live. Additional rounds of model training before the release is live yield negligible improvement and are unnecessary. To learn more about Automatic Document Classification MVTs, see Automatic Document Classification Model Validation Tasks.
When to Initially Train the Classification Model
Until a model is trained, the Classification Model will not be deployed. We encourage users to train a new Classification Model while a release is locked so that the Classification Model is effective as soon as the release is live. However, a Classification Model can be trained at any time.
The Classification Model is Trained with Pages and Makes Predictions at the Page Level
A Classification Model is trained at the page level, meaning every page in a document is matched to a layout individually during classification. If your semi-structured documents are longer than one page, we recommend using all the pages of each document as training pages for a single semi-structured layout, as opposed to creating a semi-structured layout for each page. If pages differ significantly within a semi-structured document, additional training may be required to increase model performance.
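As an illustration of that recommendation, the sketch below expands one multi-page document into page-level training examples that all carry the same layout. The TrainingPage type, the pages_for_training function, and the identifiers are hypothetical and exist only for this example; they are not part of the product.

```python
from typing import NamedTuple


class TrainingPage(NamedTuple):
    """One page-level training example (hypothetical structure)."""
    document_id: str
    page_number: int
    layout: str


def pages_for_training(document_id, page_count, layout):
    """Label every page of a document with the same single layout."""
    return [
        TrainingPage(document_id, page_number, layout)
        for page_number in range(1, page_count + 1)
    ]


if __name__ == "__main__":
    # A three-page invoice contributes three training pages to one layout,
    # rather than one page to each of three separate layouts.
    for page in pages_for_training("doc-1", 3, "Invoice"):
        print(page)
```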
Classification Model Requirements and “Other” Documents
At Least Two Semi-structured or Additional Layouts in the Layout Family, or One Semi-structured or Additional Layout in the Layout Family and “Other” Document Pages
A Classification Model cannot be trained to classify a single Semi-structured or Additional layout. Consider a model trained to classify invoices. The model fundamentally makes comparisons: it must predict whether a page is an invoice or something else. To do so, the model must also be trained with pages of another layout, for example receipts, so it can predict whether a page is an invoice or a receipt. This is why at least two Semi-structured or Additional layouts are required in the layout family.
Alternatively, a model trained to classify invoices could also be trained with another class of pages that is “not invoice”, which could include, for example, pages of handwritten notes, blank pages, and email scans. This class of different pages is like the Other class in a Classification Model.
The important distinction between Other and a second Semi-structured or Additional layout in the layout family is that the Other class contains anything in a submission you want to ignore, while any layout in the layout family requires a classification for processing. In our example, with a receipt class, we want our model to classify invoices and receipts, while with a “not invoice” class, we want our model to classify invoices and ignore anything that is not an invoice.
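The requirement described in this section reduces to a simple rule. The following is a minimal sketch of that rule only; the function name and parameters are hypothetical and do not correspond to any product API.

```python
def can_train(layout_count, has_other_pages):
    """Return True if the model has at least two classes to compare.

    layout_count: number of Semi-structured or Additional layouts
        in the layout family.
    has_other_pages: whether "Other" document pages were uploaded.
    """
    if layout_count >= 2:
        return True
    # A single layout is only trainable alongside the Other class.
    return layout_count == 1 and has_other_pages


if __name__ == "__main__":
    print(can_train(1, False))  # invoices only
    print(can_train(1, True))   # invoices plus "not invoice" pages
    print(can_train(2, False))  # invoices and receipts
```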
When to Train With “Other” Document Pages
Training a Classification Model with Other document pages is required only when there is a single Semi-structured or Additional layout in the layout family. For models with more than one layout in the layout family, however, training with Other documents can still improve performance in some cases. For example, suppose a layout family contains Layouts A and B, and submissions contain documents A, B, and C. A user wants to ignore C documents during processing, but they are similar enough to document A that they are occasionally matched to Layout A. Training the model with C document pages as Other pages can improve the model’s performance by making it more likely that C documents will match to Other instead of Layout A.
Continuous Model Training
If the Continuous Classification model improvement setting is enabled in the "Document Classification" section of Administration > Settings, new models will be trained and deployed automatically. On the Classification Model detail page, users can see a history of model training and deployments in the Model Activity card.
New training pages for a Classification Model are generated in a number of ways. During Classification Supervision, pages matched to layouts in the layout family of a Classification Model will become training pages. Training pages can also be generated as Classification QA tasks are completed. If a Field ID Supervision task is completed without marking the layout incorrect, the pages in that document become training pages for the Classification Model. Classification QA and model training can also generate Model Validation Tasks, which provide the model with valuable training data.