Monitoring Model Performance


Using features for Semi-structured documents

This article mentions features used in the processing of Semi-structured documents. Your access to those features depends on your license package and pricing plan.

To learn which features are available to your organization and how to add more, contact your Hyperscience representative.

After your model has been deployed in production, the next step is to monitor its performance. Machine learning models can degrade over time due to changes in input data, document structure, or annotation quality. That’s why monitoring the health of your models is crucial for your business case over time. Learn more about evaluating models in Evaluating Model Training Results.

In this article, you’ll learn how to:

  • Recognize common indicators of performance issues.

  • Use reporting tools to monitor automation and accuracy trends.

  • Identify when model retraining may be needed.

  • Monitor performance across Identification and Transcription models.

  • Validate your training data and annotations.

Whether your documents’ visual formats remain stable or change over time, proactive monitoring helps your models maintain high performance and avoids downstream impact on your workflows.

Common indicators of performance issues

Your model’s performance may not remain consistent over time. As document formats evolve or as new data enters production, the model may become less accurate or require more manual input. Monitoring several key metrics can help you identify whether or not you need to retrain your model. In this section, you’ll learn more about these key metrics and how to monitor them using our reporting tools.

Using testing documents

We recommend setting aside 50-100 representative documents for testing your model’s performance. Doing so allows you to evaluate how the model performs on realistic data.

  • These documents should reflect the variety of inputs you expect in production.

  • They should not have been seen by the model (i.e., they should not be included in the training documents).

  • Run your testing documents through the system with a 100% QA Sample Rate to assess accuracy and automation trends. To learn how to configure the QA settings for Identification and Transcription models in your flow, see Document Processing Subflow Settings.

This approach helps you establish a reliable performance baseline.
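
If you track your document set outside of Hyperscience (for example, as a list of file names), the sketch below shows one way to reserve a held-out test set. It is illustrative only; the file names and split size are assumptions, and the product itself does not require this step.

```python
import random

# Illustrative only: reserve a held-out test set from a list of document names.
# Assumes `all_documents` is a list of file names you manage outside Hyperscience.
all_documents = [f"invoice_{i:04d}.pdf" for i in range(600)]

random.seed(42)                      # make the split reproducible
random.shuffle(all_documents)

test_documents = all_documents[:75]  # 50-100 representative documents for testing
training_candidates = all_documents[75:]

# Keep the two sets separate so test documents are never used for training.
assert not set(test_documents) & set(training_candidates)
```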

Decrease in Machine Accuracy

Accuracy

Accuracy indicates how often the system produces correct outputs. It measures how effective the model is at making correct predictions relative to the total number of predictions made, and it is calculated by comparing model predictions to the values that reached consensus during QA.

Accuracy can be influenced by factors like imbalanced datasets or inconsistent annotations. To learn more, see our Accuracy article.
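
As a rough illustration of that comparison, you can think of accuracy as the share of predicted values that match the QA consensus. The snippet below uses made-up field values and is only a sketch of the idea, not the platform’s exact calculation.

```python
# Illustrative sketch: accuracy as the share of predictions that match QA consensus.
predictions = {"invoice_number": "INV-1042", "total": "315.00", "date": "2024-03-02"}
qa_consensus = {"invoice_number": "INV-1042", "total": "315.00", "date": "2024-03-01"}

correct = sum(predictions[f] == qa_consensus[f] for f in qa_consensus)
accuracy = correct / len(qa_consensus)
print(f"Accuracy: {accuracy:.0%}")  # 2 of 3 predictions match -> 67%
```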

A drop in accuracy is one of the most common indicators that your model may not perform as expected. Monitoring accuracy helps you assess whether predictions from the model or manual input are aligned with QA outcomes. Learn more about QA in our What is Quality Assurance? article.

Target Accuracy

Target Accuracy is a manual setting that defines the minimum accuracy level required for a model to be considered successful - for example, 95% of fields must be correct.

It is specified at the field level - if even one occurrence within a field is incorrect or missing, the entire field is considered incorrect.

In contrast, the reported accuracy is measured at the occurrence level, where each value is evaluated separately. As a result, reported accuracy is often higher than the target accuracy. Learn how to set your target accuracy in Document Processing Subflow Settings.
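
A small made-up example can show why the two measurements differ: at the field level, one wrong occurrence makes the whole field incorrect, while at the occurrence level the correct values still count. The sketch below is illustrative only and is not how the platform computes the report.

```python
# Illustrative sketch: field-level vs. occurrence-level accuracy.
# Each field maps to a list of (predicted, consensus) occurrence pairs.
fields = {
    "line_item_amount": [("10.00", "10.00"), ("25.50", "25.50"), ("7.99", "7.90")],
    "invoice_number": [("INV-1042", "INV-1042")],
}

occurrences = [pair for pairs in fields.values() for pair in pairs]
occurrence_accuracy = sum(p == c for p, c in occurrences) / len(occurrences)

# A field counts as correct only if every one of its occurrences is correct.
field_accuracy = sum(
    all(p == c for p, c in pairs) for pairs in fields.values()
) / len(fields)

print(f"Occurrence-level accuracy: {occurrence_accuracy:.0%}")  # 3 of 4 -> 75%
print(f"Field-level accuracy: {field_accuracy:.0%}")            # 1 of 2 -> 50%
```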

Manual Accuracy vs Machine Accuracy Report

You can monitor accuracy trends using the Manual Accuracy vs Machine Accuracy report, available on the Accuracy page (Reporting > Accuracy).

This report shows two key metrics:

  • Manual Accuracy: Output that involved human review vs. QA.

  • Machine Accuracy (blue line): Model-only output vs. QA.

Machine Accuracy

Machine Accuracy refers to the ratio of correct model predictions excluding the cases where human review was involved. It represents the model’s standalone performance on tasks that were not manually reviewed.
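
To picture the distinction, only predictions that were completed without human review count toward Machine Accuracy. The sketch below uses hypothetical records and is not the platform’s implementation of the report.

```python
# Illustrative sketch: Machine Accuracy counts only predictions with no human review.
records = [
    # (model prediction, QA consensus, was human review involved?)
    ("ACME Corp", "ACME Corp", False),
    ("2024-05-01", "2024-05-01", False),
    ("149.99", "194.99", True),   # human-reviewed; excluded from Machine Accuracy
    ("PO-7731", "PO-7713", False),
]

machine_only = [(pred, truth) for pred, truth, reviewed in records if not reviewed]
machine_accuracy = sum(p == t for p, t in machine_only) / len(machine_only)
print(f"Machine Accuracy: {machine_accuracy:.0%}")  # 2 of 3 unreviewed predictions correct
```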

Monitoring model types

Use this report to identify which model types—Field Identification, Table Identification, Classification, or Transcription—are underperforming.


Machine Accuracy shows how well the model performs on its own, without any human intervention. It is the primary indicator of model performance.

A drop in Machine Accuracy can indicate:

  • The model is seeing unfamiliar documents (e.g., documents from new vendors with different structures or updated visual formats).

  • There are insufficient examples of these documents or fields in the training data.

  • Fields have changed behavior (e.g., from single-line to multiline), but those changes aren’t reflected in the current layout configuration.

A drop in Machine Accuracy does not always mean the model has degraded. It may indicate the model was never exposed to certain types of documents or fields during training.

Monitoring Machine Accuracy

Follow the steps below to identify decreases in Machine Accuracy:

  1. Go to the Accuracy page (Reporting > Accuracy), and scroll to Manual Accuracy vs Machine Accuracy.

    1. Set a date range you want to observe accuracy for.

      • Track accuracy across time ranges to see if the accuracy drop is a temporary or persistent trend.

    2. Select the model type (e.g., Field Identification, Table Identification, or Transcription).

    3. Select the relevant layout and flow.

    4. Apply filters:

      • Choose All Layout Fields for a general accuracy overview, or

      • select a specific field to identify isolated issues.

  2. Observe the Machine Accuracy (blue lines) for that field.

    • Hover your cursor over the data point of interest. A tooltip will display the accuracy percentage and number of calculation points (human and machine) for a chosen date.

    Calculation points

    Calculation points are the number of fields evaluated during QA (e.g., 1 QA task with 5 fields = 5 calculation points).

    • Look for consistent downward patterns or sudden drops, and confirm that these patterns are not limited to a single field or date range (one way to spot both is sketched after these steps).

    • Repeat the process for each field you want to review.

  3. Review your training data in TDM.

    • Investigate whether the affected fields are sufficiently represented in your training documents.

    • Ensure that the fields are annotated consistently across examples.  

    • Consider whether new examples of documents have been introduced but not yet captured in training (e.g., updated visual templates from specific vendors).

  4. Retrain your model.
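
If you note down the Machine Accuracy values you read from the report (for example, in a spreadsheet you maintain yourself), a small script like the sketch below can help separate a one-off dip from a persistent downward trend. The dates, values, and 3-percentage-point cutoff are assumptions for illustration.

```python
# Illustrative sketch: flag sudden drops and persistent downward trends in
# Machine Accuracy values collected by hand from the report.
accuracy_by_date = [
    ("2024-04-01", 0.96), ("2024-04-08", 0.95), ("2024-04-15", 0.93),
    ("2024-04-22", 0.90), ("2024-04-29", 0.86),
]

values = [acc for _, acc in accuracy_by_date]

# A sudden drop: any week-over-week decrease larger than 3 percentage points.
sudden_drops = [
    (accuracy_by_date[i][0], round(values[i - 1] - values[i], 2))
    for i in range(1, len(values))
    if values[i - 1] - values[i] > 0.03
]

# A persistent trend: accuracy decreases on every consecutive data point.
persistent_decline = all(later < earlier for earlier, later in zip(values, values[1:]))

print("Sudden drops:", sudden_drops)          # [('2024-04-29', 0.04)]
print("Persistent decline:", persistent_decline)  # True
```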

Using your test set

After retraining the model, we recommend re-running the evaluation using the same test set, as described in the Using testing documents section of this article. Doing so allows you to compare results and understand whether performance has changed or if the input data has shifted.

Next steps

Based on the results:

  • correct your annotations in TDM and retrain your model, or

  • enrich your training data by adding more representative examples, annotating them, and retraining your model. Learn more in Training a Semi-structured Model.

Decrease in Automation

Automation

Automation is the processing of data without human intervention. We measure it at the field level—each individual field’s automation is evaluated independently. This approach enables more granular insights into model performance and helps identify exactly where manual review is needed.

A decrease in Automation means that more tasks are being routed to Supervision instead of being completed automatically by the model. This happens when the system encounters new, unfamiliar, or ambiguous documents that it’s not confident enough to process without human review.

Straight-through processing

Automation can be associated with straight-through processing (STP), where an entire document is processed end-to-end without human intervention. However, note that STP is not a standard practice in real-life use cases due to the complexity of documents and evolving business needs.

A drop in Automation may indicate any of the following:

  • New document formats or vendors were introduced in production, but are not present in the training data.

  • Inconsistent formatting is lowering the model’s confidence.

  • The layout uses field settings that require a Supervision task to be generated (e.g., Transcription Supervision or Identification Supervision is set to Always).

Indicators of a decrease in Automation

A proportional increase in Supervision tasks is usually the first indicator that Automation is decreasing.

For example, if you process 100 documents per day and Supervision tasks increase from 5 to 15, that’s a proportional rise.

Automation Rate

The Automation Rate represents the proportion of extracted data with confidence scores exceeding a specified threshold. This threshold is determined by the level of accuracy you want your extracted data to have. To learn more, see Automation.
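
As a rough illustration, the Automation Rate can be thought of as the fraction of extracted fields whose confidence scores clear the threshold. The confidence scores and threshold below are made up, and the snippet is only a sketch of the concept, not the product’s calculation.

```python
# Illustrative sketch: Automation Rate as the share of fields whose confidence
# scores meet or exceed a chosen threshold.
confidence_scores = [0.99, 0.97, 0.92, 0.88, 0.81, 0.76, 0.64]
threshold = 0.90  # hypothetical threshold derived from your accuracy target

automated = [score for score in confidence_scores if score >= threshold]
automation_rate = len(automated) / len(confidence_scores)
print(f"Automation Rate: {automation_rate:.0%}")  # 3 of 7 fields -> 43%
```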

Automation Report

Track changes for a specific layout and flow with the Automation report on the Processing Time page (Reporting > Processing Time). With this report, you can track automation trends for Identification and Transcription models over time.

  • Hover over a blue point to view automation data for Field Identification.

  • Hover over a purple point to view automation data for Table Identification.

  • You can also click Document Fields or Table Cells to display information in the chart for one of the two models.

Accuracy and Automation

Accuracy and automation are a trade-off: when accuracy increases, automation typically decreases. To reach a higher accuracy target, the system requires greater certainty before relying entirely on machine input, so more fields, even some with relatively high confidence, are routed for human checking. However, a drop in automation can also happen while accuracy remains high. For more information, contact your Hyperscience representative.
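
One way to see the trade-off is to sweep the confidence threshold over the same made-up scores used above: requiring more certainty leaves fewer fields automated even though the scores themselves do not change. Illustrative only.

```python
# Illustrative sketch: a stricter confidence threshold (higher required certainty)
# leaves fewer fields automated, even though the scores themselves do not change.
confidence_scores = [0.99, 0.97, 0.92, 0.88, 0.81, 0.76, 0.64]

for threshold in (0.80, 0.90, 0.95):
    rate = sum(score >= threshold for score in confidence_scores) / len(confidence_scores)
    print(f"threshold {threshold:.2f} -> automation rate {rate:.0%}")
# threshold 0.80 -> automation rate 71%
# threshold 0.90 -> automation rate 43%
# threshold 0.95 -> automation rate 29%
```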

Next steps

  1. Check new submissions for unexpected documents.

    • Look for visually different documents, new vendors, or formats that may not be represented in your training data.

  2. Confirm that these documents exist in TDM.

    • If you already have examples of these documents, make sure they are well represented (typically 15-20 samples per vendor) and properly annotated. A quick way to check per-vendor counts is sketched after these steps.

    • If they are not included, add representative examples to your training documents and annotate them.

  3. Retrain your model and test it again on documents not seen by the model. To learn more about model retraining, see Model Validation Tasks and Training a Semi-structured Model.
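
If you track which vendor each training document comes from (for example, in a simple list you maintain outside the platform), a quick count can reveal under-represented vendors. The sketch below is illustrative; the document names and vendor labels are made up, and the 15-document cutoff mirrors the guideline above.

```python
from collections import Counter

# Illustrative sketch: check whether each vendor is represented by enough
# training documents (the guideline above suggests roughly 15-20 per vendor).
training_documents = [
    ("invoice_0001.pdf", "ACME Corp"),
    ("invoice_0002.pdf", "ACME Corp"),
    ("invoice_0003.pdf", "Globex"),
    # ... the rest of your training set, tracked outside the platform
]

counts = Counter(vendor for _, vendor in training_documents)
under_represented = {vendor: n for vendor, n in counts.items() if n < 15}
print("Vendors needing more examples:", under_represented)
```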

Using your test set

We recommend repeating the steps described in Using testing documents to compare results and understand whether performance has changed or if the input data has shifted.