Having a diverse, representative training set is crucial for a high-quality identification model.
In v37, the Hyperscience application lets you train a model with fewer annotations while keeping the impact on performance minimal.
How data is curated
The Training Data Curator labels each training document as having high or low importance. The importance is calculated by determining which data would best contribute to the model’s performance. For each group of documents, the system labels the most impactful documents as having high importance.
Before using the Training Data Curator, make sure that you've:
uploaded at least the recommended number of documents. The default minimum value is 100.
uploaded a diverse and representative sample of documents to achieve robust model performance.
The goal is to make the annotation process more efficient by requesting annotations for only an optimal subset of documents, one that reflects the variety of documents you expect to automate with the model.
Using the Training Data Curator
Analyze the data.
The Training Data Curator depends on the results from the document grouping. That's why you should first analyze your data.
For more information about data analysis and groups, see Training Data Analysis and Guided Data Labeling.
Before analyzing the data, the importance of each document is N/A.
You should reanalyze your data each time you need to recalculate the importance of your documents, as the system doesn’t re-run the analysis automatically.
After data analysis is finished, the Training Data Health card shows your training set's quality.
Required documents — number of additional documents needed to run model training. This number is lower than the number of recommended documents because the system needs fewer documents to run the model-training process than it needs to produce a high-performance model.
Ineligible/Eligible documents — number of documents excluded from or included in the training set. For more information, see Document Eligibility Filtering.
Recommended documents — number of documents we recommend using for high-quality model training. The default minimum value is 100, but to ensure a robust model, provide a representative and diverse training set, which may require more than 100 documents.
The importance of each document appears in the Training Data Table. Importance is partially determined by grouping or document similarity—similar documents are grouped together and assigned high or low importance, helping you avoid redundant annotations.
High importance — the documents you should annotate first. They are representative of your data and will help you achieve better performance.
Low importance — the documents that are redundant in your training set.
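The application does not document how it groups or scores documents, so the following Python sketch only illustrates the general idea of similarity-based redundancy; it is not Hyperscience's implementation. It assumes each document is reduced to a set of tokens, compares documents with Jaccard similarity, and uses a hypothetical threshold of 0.8. The function names and sample data are invented for this example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def label_importance(docs: dict, threshold: float = 0.8) -> dict:
    """Greedily label documents as 'High' or 'Low' (illustrative only).

    A document is 'High' unless it is at least `threshold`-similar to a
    document already labeled 'High', in which case it is redundant ('Low').
    """
    representatives = []  # IDs of documents labeled 'High' so far
    labels = {}
    for doc_id, tokens in docs.items():
        redundant = any(
            jaccard(tokens, docs[rep]) >= threshold for rep in representatives
        )
        if redundant:
            labels[doc_id] = "Low"
        else:
            labels[doc_id] = "High"
            representatives.append(doc_id)
    return labels


# Hypothetical sample data: two identical layouts and two distinct ones.
docs = {
    "invoice_a1": {"invoice", "total", "date", "vendor", "tax"},
    "invoice_a2": {"invoice", "total", "date", "vendor", "tax"},
    "invoice_b1": {"invoice", "amount", "due", "supplier"},
    "receipt_c1": {"receipt", "cash", "change", "store"},
}
labels = label_importance(docs)
```

In this sketch, invoice_a2 duplicates invoice_a1 and is labeled Low, while the other three documents are labeled High. A set made up mostly of near-duplicates would therefore produce few High labels.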
Click Filters to filter the documents by group and by importance.
After you’ve selected a group and importance, click Apply Filters.
Annotate the documents with High importance first.
The more representative your training documents are of real-life documents, the better your model will perform with fewer annotations.
We recommend double-checking the documents with Low importance.
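If you track the contents of the Training Data Table outside the application (for example, in a spreadsheet), the ordering described above can be sketched in Python as follows. The TrainingDoc record, its field names, and the annotation_queue function are hypothetical and are not part of any Hyperscience API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrainingDoc:
    doc_id: str
    group: str
    importance: str  # "High", "Low", or "N/A" (before data analysis)
    annotated: bool = False


def annotation_queue(docs: List[TrainingDoc], group: Optional[str] = None) -> List[TrainingDoc]:
    """Return unannotated documents, High importance first.

    If `group` is given, only documents in that group are returned.
    The sort is stable, so the original order is kept within each level.
    """
    rank = {"High": 0, "Low": 1, "N/A": 2}
    pending = [
        d for d in docs
        if not d.annotated and (group is None or d.group == group)
    ]
    return sorted(pending, key=lambda d: rank.get(d.importance, 2))


# Hypothetical table contents.
table = [
    TrainingDoc("d1", "A", "Low"),
    TrainingDoc("d2", "A", "High"),
    TrainingDoc("d3", "B", "High"),
    TrainingDoc("d4", "A", "High", annotated=True),
    TrainingDoc("d5", "A", "N/A"),
]
queue_ids = [d.doc_id for d in annotation_queue(table)]
group_a_ids = [d.doc_id for d in annotation_queue(table, group="A")]
```

Documents that are already annotated are left out of the queue, and Low importance documents stay at the end so that they can still be double-checked.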
The Training Data Curator automatically decides how many documents are of High importance. If the training set has low diversity and high redundancy, you will have a small number of High importance documents, and vice versa. We recommend uploading more documents, reanalyzing your data, and annotating more High importance documents to improve your model.
After a document has been annotated, it’s marked as High importance, and similar documents may be marked as Low importance. These updates help keyers annotate the documents that are most valuable to the model. If you import documents that have already been annotated, they have High importance by default.
Next steps
Review your training set for inconsistencies to determine whether you need more annotations or corrections. To learn more, see Detecting and Correcting Anomalies in Field Annotations.
As you annotate documents, their importance becomes High, and unannotated documents may be marked as Low importance. Reanalyze your data as you annotate to see the updated information.
Based on the data analysis, the system might determine that a document with a Training Status of Never (excluded from the training) is of High importance. In such cases, we recommend double-checking the document.
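As a rough illustration of that check, the Python sketch below picks out documents that are excluded from training but rated High importance. It assumes you keep a copy of the table rows as dictionaries; the keys doc_id, training_status, and importance are hypothetical names chosen for this example, not fields of a Hyperscience API.

```python
def needs_review(rows):
    """Return IDs of documents excluded from training but rated High."""
    return [
        row["doc_id"]
        for row in rows
        if row["training_status"] == "Never" and row["importance"] == "High"
    ]


# Hypothetical rows mirroring the Training Data Table.
rows = [
    {"doc_id": "d1", "training_status": "Never", "importance": "High"},
    {"doc_id": "d2", "training_status": "Never", "importance": "Low"},
    {"doc_id": "d3", "training_status": "Always", "importance": "High"},
]
flagged = needs_review(rows)
```

Here only d1 is flagged. The status value "Always" is a placeholder for any status other than Never.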