Hyperscience extracts data from documents and converts them into a machine-readable format. We support Structured, Semi-Structured, and Additional documents. To learn how to differentiate between the document types, see Understand Document Types.
Use Semi-Structured Layouts for documents where field positions might vary. They’re suitable for documents like invoices, pay stubs, and checks.
In this article, you’ll learn how to build a robust Semi-Structured model that meets your business needs by using our Training Data Management tools.
Step 1 - Sampling documents
Review your documents
Having a diverse, representative training set is crucial for a high-quality identification model. Selecting the appropriate documents for training will optimize your Semi-structured model’s performance.
Determine the common types of documents you'll be processing, and ensure you have at least 20 of each of them.
To create a more generalized model that can handle a wide range of documents, you need a diverse dataset. For a robust model, we recommend that the documents within each type share a similar visual layout and that you provide at least 20 examples of each type you want to include.
Become familiar with the edge cases (i.e., documents that differ significantly from your main document types) and determine how varied they are. Exclude them if they are not suitable for your use case.
Remove documents that would reduce model performance (e.g., documents containing unrelated information; highly distorted pages that are noisy, skewed, or pixelated; and duplicates).
Set aside 50-100 documents for testing purposes; these documents should be representative of the data you expect in production (see the sketch below for one way to organize the split).
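If you organize your sample documents on disk before uploading them, a short script can help you keep the per-type minimums and the held-out test set straight. The sketch below is purely illustrative and assumes a hypothetical folder layout in which each document type has its own subfolder; the folder name, counts, and seed are placeholders you can adjust.

```python
import random
from collections import defaultdict
from pathlib import Path

# Hypothetical folder layout: each document type has its own subfolder,
# e.g., ./corpus/invoices/scan_001.pdf.
CORPUS_DIR = Path("corpus")
MIN_TRAINING_PER_TYPE = 20   # at least 20 training examples per document type
TEST_SET_SIZE = 75           # held-out documents for testing (50-100 recommended)

random.seed(42)

# Group documents by type based on their parent folder name.
docs_by_type = defaultdict(list)
for path in CORPUS_DIR.glob("*/*"):
    docs_by_type[path.parent.name].append(path)

# Reserve a proportional share of each type for the held-out test set and
# keep the rest for training. Warn if a type falls below the minimum.
total_docs = sum(len(paths) for paths in docs_by_type.values())
training_set, test_set = [], []
for doc_type, paths in docs_by_type.items():
    random.shuffle(paths)
    n_test = max(1, round(TEST_SET_SIZE * len(paths) / total_docs))
    test_set.extend(paths[:n_test])
    training_set.extend(paths[n_test:])
    if len(paths) - n_test < MIN_TRAINING_PER_TYPE:
        print(f"Warning: only {len(paths) - n_test} training documents for '{doc_type}'")

print(f"{len(training_set)} training documents, {len(test_set)} held-out test documents")
```

Whatever tooling you use, the key points are the ones above: at least 20 training examples per document type and a representative set of 50-100 documents reserved for testing.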
Review your fields and columns
Your fields and columns should be representative of the information you want to extract.
Ensure they are present in your documents to achieve a high-performance model.
Make sure to review any interchangeable fields or columns, as leaving them ambiguous might result in poor model performance.
For optimal results, the data in your training set should be representative of the data expected in production.
Step 2 - Build your layout and add it to a release
Build your layout
Once you’ve determined the information you want to extract, you need to build your Semi-structured layout. To learn more about Semi-Structured Layouts, see Determining Layout Type.
Create your layout by following the steps in Creating Semi-Structured Layouts.
Use unique names for your fields or columns to avoid model training failure and simplify the annotation process.
Make sure to set the proper data type for each field or column you create to obtain a high-performance model. Learn more about data types and how to choose them in What is a Data Type? and Choosing a Data Type.
Ensure your configurations are suitable for the fields and columns for extraction:
Check Multiple Occurrences if your fields have more than one occurrence. Learn more about the Multiple Occurrences (MO) checkbox in Training a New Field Identification Model.
Enable the Multiline setting if required.
Set Identification Supervision to Always for each field you want to guarantee a manual review for. Enabling this setting will always generate Field ID tasks, regardless of the machine’s confidence. To learn more, see Scoring Field Output Accuracy.
Set Transcription Supervision to Always if there are issues in the document that could prevent the machine from reading the field or the column. That way, the system will always send it to Manual Transcription, ensuring review from your keyers. Learn more about accuracy in Transcription Accuracy and Automation.
Find more field configurations in the Defining field metadata section of Creating Semi-Structured Layouts.
Assign to a release
Add your layout to a release by following the steps described in Adding a New Release.
Follow the steps in Assigning a Release to a Flow to match your release to the flow you are using. Learn more about releases in What is a Release?
Step 3 - Training Data Management
Using Training Data Management
Ground truth is manually annotated data used to train our machine-learning models. We use a subset of this data to assess the performance of your models.
Use the tools in TDM to control, manage, and adjust the ground truth of your training sets for Identification and Classification models. In this section, you will learn how to upload your data using TDM. Before you start:
See the training requirements in Requirements for Training a New Model.
Make sure to keep 50-100 documents for testing purposes. Note that they should be representative of the data expected in production. You’ll upload them after the model training is completed.
Upload your documents
Go to the Model Management page for your layout (Library > Models).
Click Upload Training Documents and upload each document as its own file.
Click Upload in the dialog box.
All uploaded documents will appear on the Training Documents card. Switch between the Field Identification and Table Identification tabs, depending on the type of model you want to train.
Note that the status of your documents will be Ready to annotate. Learn more about statuses in Training Data Management.
Step 4 - Analyze your data
Training Data Analysis allows you to group your training documents and receive recommendations to improve the quality of your dataset. For more information, see Training Data Analysis and Guided Data Labeling.
Analyze your data
Receive insights for improving your training data by clicking the Analyze Data button, located in the Training Data Health card.
Do NOT edit or upload documents while the analysis is taking place, as they’ll be excluded from the analysis.
The results will appear in the Training Data Health card. Learn more in Training Data Analysis and Guided Data Labeling.
Analysis results
The results show you the eligibility and importance of each document. Learn more in Training Data Management Features.
Groups - Training data analysis groups your training set by visual similarity. For best data representation, we recommend having at least 10 groups of each document type.
Having a group with Excess Documents (i.e., more than 15 samples for Field ID and 20 samples for Table ID) does not necessarily mean that you need to remove the excess data. Depending on the specific use case and the performance of your model, you may want to enrich the annotations by adding more annotated examples from a particular group. Learn more in Training Data Analysis and Guided Data Labeling. A simple illustration of these thresholds follows this list.
Importance - The Training Data Curator labels each training document as having high or low importance.
The importance is calculated by determining which data would best contribute to the model’s performance. For each group of documents, the system labels the most impactful ones as having high importance. The goal is to improve the efficiency of the annotation process by requesting an optimal subset that reflects the variety of documents whose data you expect to identify with the model. Learn more about how data is curated in Training Data Curator.
Eligibility - With Document Eligibility Filtering, you can see which documents are incompatible with training and why, allowing you to address any issues accordingly and achieve better model performance. Learn more in Document Eligibility Filtering.
Detect anomalies - Re-analyze your data and find inconsistencies across your annotations with Labeling Anomaly Detection. For more information, see our Labeling Anomaly Detection article.
You should reanalyze your data each time you need to review the updated results for your documents, as the system doesn’t re-run the analysis automatically.
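To make the Excess Documents thresholds above concrete, here is a minimal sketch that checks a hypothetical assignment of training documents to groups against the 15-sample (Field ID) and 20-sample (Table ID) figures. TDM performs this analysis for you in the Training Data Health card; the code only restates the arithmetic with placeholder data.

```python
from collections import Counter

# Placeholder data: the group label that Training Data Analysis assigned to
# each training document.
doc_groups = ["group_a"] * 18 + ["group_b"] * 9 + ["group_c"] * 4

# Thresholds above which a group contains "excess" documents.
EXCESS_THRESHOLDS = {"Field ID": 15, "Table ID": 20}

group_sizes = Counter(doc_groups)
for model_type, threshold in EXCESS_THRESHOLDS.items():
    excess = {group: count for group, count in group_sizes.items() if count > threshold}
    print(f"{model_type}: groups with excess documents -> {excess or 'none'}")
```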
Step 5 - Annotate your documents
Consistent annotations are crucial for a high-performance locator model.
Learn how to annotate fields in Field Identification and how to annotate tables in Table Identification.
Best practices
Learn how to annotate your documents more easily and quickly by following the best practices listed below:
General guidelines
Once you analyze the data, you’ll be able to annotate by group. Doing so provides you with more control over the dataset. Annotating by group and by priority helps you determine which groups have more documents and which groups are underrepresented.
After annotating 2-3 documents per group, you’ll be able to use guided data labeling. This feature provides machine-generated suggestions that help you annotate more quickly.
Follow the general rule for annotating: left to right, top to bottom.
Make sure to maintain consistent annotations for your fields or columns. When a single value of a field or a column appears in different sections of the document, annotate it strictly in one location to avoid confusing the model.
Always use the machine predictions when drawing the bounding box. Avoid drawing it manually.
Adjust the machine predictions ONLY if the bounding boxes are overlapping and preventing the proper extraction of the data.
Do NOT interchange fields or columns, as doing so may lead to uncertainty for the model.
If a field or a cell is not present, do not replace it with a similar value.
If you don’t see a box made of dashed lines around a value, do NOT annotate it. If there is no such box, it means that our internal ML models are not reading any values for that field or cell.
The annotations serve as Ground truth labels that guide the model through the training process. Aligning the annotations with the machine’s predictions will ensure that the model learns from accurate and consistent information. Inconsistencies, such as annotating the same information in different locations within a document, can affect the model’s ability to learn patterns accurately, which may result in lower performance or incorrect predictions.
Field Identification
Annotate fields with Multiple Occurrences only when multiple instances of a field are present. Learn more about Multiple Occurrences in Field Identification.
Use multiple bounding boxes when logically connected text is split across separate locations. Learn more in the Multiple bounding boxes for fields section of Field Identification.
If you don’t see a value for a field (i.e., the field is blank), do NOT annotate it.
Table Identification
When annotating a table, make sure to select a row where all data is present. The row you select is your template row, or the row in your table that is most representative of the table’s content.
The template row doesn't need to be the first row in the table. Hyperscience uses the copycat tool to populate the annotation from the template row to the rest of the rows. The copycat is not always accurate, so make sure to double-check the annotations before you submit.
Always find the first and the last rows of your table and make sure they are properly annotated.
Always press the ESC key before submitting a table to ensure the annotations are correct.
Draw one large bounding box capturing all rows of your table, and press the S key on your keyboard. That way, you’ll activate the Split tool, and you’ll be able to define or correct the rows of your table faster. Make sure to double-check the annotations.
Learn more tips and tricks on how to annotate tables in the Table ID Supervision tab of Table Identification.
Once you’ve finished your annotations, re-analyze the data and use Anomaly Detection to ensure that your annotations are correct and consistent. Learn more in Detecting and Correcting Anomalies in Field Annotations and Detecting and Correcting Anomalies in Table Annotations. You can reanalyze your data after each iteration to maximize the quality of the training set.
Next steps
Check if all training documents are eligible for training.
The number next to Eligible for training on the model details page is the number of documents that will be used in your training set. This number may change as documents are annotated, and each time you analyze your training data.
Ensure you have the required number of training documents.
The number of Required documents shown on the model details page is the number of additional documents you need to upload to run a model training.
Step 6 - Review your flow’s settings and train your model
Once you’ve reviewed your annotations and addressed any potential anomalies, you’re ready to initiate model training.
Before you run a model training:
Review the flow’s configurations:
Your system might consist of several workflows, called flows. Each flow contains blocks, representing important stages of the data-extraction process. Learn more in Flows Overview.
For more precise control over the process, you can configure your flow’s settings.
Set your Target Accuracy to achieve better performance.
The system uses QA data and the Field Identification Target Accuracy or Table Identification Target Accuracy values to calculate the optimal confidence threshold, which allows the system to reach the target accuracy with the minimum amount of manual effort (see the sketch after this list). We recommend using the default values (95% for Field ID and 96% for Table ID) for the initial training so that you can compare the results with later iterations and adjust accordingly. After the first iteration, change the target accuracy as follows:
If you want to achieve high automation, set a lower percentage.
If you need high accuracy, set a higher value.
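Hyperscience does not expose its internal calculation, but the idea of deriving a confidence threshold from QA data can be sketched in simplified form. The example below is an assumption-laden illustration, not the product’s algorithm: it scans candidate thresholds and picks the lowest one at which the automated predictions (those at or above the threshold) still meet the target accuracy, which maximizes automation for that accuracy. The QA results are made-up placeholder data.

```python
# Simplified illustration (not Hyperscience's internal algorithm): pick the
# lowest confidence threshold at which automated predictions still meet the
# target accuracy, which maximizes automation for that accuracy.
# Each tuple is (model confidence, whether QA judged the prediction correct).
qa_results = [
    (0.99, True), (0.97, True), (0.95, True), (0.93, True),
    (0.91, False), (0.88, True), (0.85, True), (0.80, False),
]

TARGET_ACCURACY = 0.95  # default Field ID target accuracy

def pick_threshold(results, target):
    """Return (threshold, automation rate) for the lowest passing threshold."""
    for threshold in sorted({conf for conf, _ in results}):
        automated = [correct for conf, correct in results if conf >= threshold]
        if automated and sum(automated) / len(automated) >= target:
            return threshold, len(automated) / len(results)
    return None, 0.0  # no threshold meets the target; everything needs manual review

threshold, automation = pick_threshold(qa_results, TARGET_ACCURACY)
print(f"threshold={threshold}, projected automation={automation:.0%}")
```

In this sketch, lowering the target accuracy can only keep or lower the selected threshold, so more predictions are automated; raising it does the opposite. That is the trade-off described in the two points above.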
Run Training
Initiate a model training by clicking the Run Training button.
The button will be grayed out if you don’t have the minimum number of required documents.
You’ll receive a notification in the Notifications section, located in the upper-right corner of the application, once the training is completed.
A single trainer attached to your instance will train one model at a time. For example, if you run a model training for Field ID, and then start a model training for Table ID, the one that you’ve started first will be running, and the second one will be queued. To learn more, see our What is the Trainer? article.
Monitor the training jobs in the Running and Queued cards on the Trainer page (Administration > Trainer).
Step 7 - Evaluate the training results
Deploy your model
Once the model training is complete, you’ll find the candidate model on the model details page. To deploy it, click on your candidate model, then click Deploy Model.
The model is now live and ready for document processing. You can see insights into automation and accuracy on the model details page. Learn more in Evaluating Model Training Results.
Evaluate the performance
Use the documents you’ve chosen for testing purposes to evaluate the performance of your model. Note that, to measure the performance accurately, these documents should not be ones that were used for the training.
Before you start, make sure to set your flow’s configurations to match the ones you expect in production.
Enable Manual Identification Supervision if you have fields that you want to review manually. Doing so will generate Manual Identification tasks which should be performed by a keyer.
If required, enable any combination of Field Identification Quality Assurance, Table Identification Quality Assurance, and Transcription Quality Assurance, and set the QA sample rate for each type of quality assurance you enable (Field Identification QA Sample Rate, Table Identification QA Sample Rate, or Transcription QA Sample Rate).
The QA Sample Rate values represent the percentage of documents selected for Field ID, Table ID, or Transcription QA tasks (see the sketch after this list). Learn more in the Quality Assurance Tasks section of this user guide.
Learn more about flow-level configurations in Flow Settings.
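The sample-rate arithmetic itself is straightforward: a 10% QA Sample Rate means that roughly one in ten documents generates a QA task of that type. The sketch below shows one simple way such sampling could be implemented (independent random selection at a fixed rate); it is an illustration only, not a description of how Hyperscience selects documents internally.

```python
import random

random.seed(7)

QA_SAMPLE_RATE = 0.10  # e.g., a 10% Transcription QA Sample Rate

# Placeholder identifiers standing in for processed documents.
documents = [f"doc_{i:03d}" for i in range(100)]

# Select each document for QA independently, with probability equal to the rate.
qa_sample = [doc for doc in documents if random.random() < QA_SAMPLE_RATE]

print(f"{len(qa_sample)} of {len(documents)} documents selected for QA review")
```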
Upload your testing documents
Upload your documents as submissions by following the steps below.
Go to Submissions.
Click Create Submission.
Upload the testing documents. If you’re uploading multiple documents at once, select One Submission per file to evaluate the performance for each individual document.
Click Next.
Choose the flow you’re using for the model from the Flow drop-down list.
Choose the layout used for the model from the Layout drop-down list.
Click Upload.
Results
Observe the results based on your flow settings on the Document Output page. Learn more in Document Output Page.
If the model is performing poorly, we suggest going over the training documents to check for potential annotation errors and inconsistencies and fixing them. You can use Anomaly Detection for more accurate results. Based on the results, you can also decide to enrich the training set by adding more documents.
If the model is performing well and the projected automation meets the target automation, we do not recommend retraining the model unless
changes are made to the layout, or
the data distribution of incoming documents has changed (e.g., new variations). Learn more in Retraining Existing Models.