Training Data Analysis and Guided Data Labeling

The Training Data Analysis feature helps you reduce annotation errors and makes training Field Locator and Table Locator models faster and simpler.

The Guided Data Labeling feature provides an enhanced experience for gathering and annotating training data for Field ID and Table ID, using the Training Data Management (TDM) tools.

You can upload and annotate training documents directly in TDM, with additional guidance on creating a representative dataset and automated annotation suggestions that speed up the annotation process. These features become available after you annotate as few as 2-3 similar documents.

Prerequisites

Before you follow the step-by-step instructions below, see Model Management to learn more about the redesigned Model Details page.

Step 1: Upload Training Documents

To upload training documents for a given Semi-structured layout, follow the steps below:

  1. Go to Library > Models.

  2. Click on a model name from the table to view its Model Details page.

  3. Click the Upload Training Documents button.

  4. Upload a set of documents for annotation.

  5. Click the Upload button.

To learn more about the different options you have when uploading training documents, see Importing and Exporting Training Data.

Step 2: Run Training Data Analysis

Once the upload process from Step 1 finishes, all uploaded documents appear on the Training Documents card with the Ready to Annotate training status.

You can then run training data analysis. The system automatically groups the training data and recommends adding or removing documents to improve the dataset's diversity and representation.

  1. Go to Library > Models.

  2. Click on a model name from the table to view its Model Details page.

  3. In the Training Data Analysis card, click the Analyze Data button.

Once the training data analysis finishes, you will see a list of groups with training documents. Groups perform best with 15 example documents for Field ID and 20 example documents for Table ID. 

  • If a group has more than 15 training documents for Field ID or more than 20 training documents for Table ID, the system recommends removing some documents.

  • If a group has fewer than 15 training documents for Field ID or fewer than 20 training documents for Table ID, the system recommends uploading more documents.
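
The two recommendation rules above can be expressed as a short sketch. This is an illustration only, not the product's actual logic: the function name, the model-type keys, and the sample group sizes are hypothetical, and reporting the exact number of documents to add or remove is an assumption (the text above only says that the system recommends removing some documents or uploading more).

```python
# Target group sizes taken from the text above:
# 15 documents for Field ID, 20 documents for Table ID.
TARGET_DOCS = {"field_id": 15, "table_id": 20}


def recommend(model_type: str, doc_count: int) -> str:
    """Return an illustrative recommendation for one group.

    Hypothetical helper; the exact counts in the messages are an
    assumption, not documented system behavior.
    """
    target = TARGET_DOCS[model_type]
    if doc_count > target:
        return f"remove {doc_count - target} document(s)"
    if doc_count < target:
        return f"upload {target - doc_count} more document(s)"
    return "no change needed"


if __name__ == "__main__":
    # Hypothetical group sizes for a Field ID model.
    groups = {"Group 1": 22, "Group 2": 9, "Group 3": 15}
    for name, count in groups.items():
        print(f"{name}: {recommend('field_id', count)}")
```

A group that already holds exactly the target number of documents needs no change, which is the case the two rules leave implicit.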

If you upload additional training documents, you can reanalyze all training data by clicking the Reanalyze Data button.

Step 3: Annotate Training Documents with Guidance

Once the training data analysis is complete and you have built a representative dataset based on its grouping recommendations, you can filter training documents by group for easier annotation. To filter training documents by group, follow these steps:

  1. Click the Filter button above the Training Documents card.

  2. Select a group from the Groups drop-down list.

Once you’ve selected a group, you can start annotating the group’s training documents by clicking the Annotate link for each document in the table. 

Once you annotate 2-3 documents from a group, the system starts using the guided data labeling feature to generate annotation suggestions for this group's documents. To speed up the annotation process, these suggestions predict where a given field or table column might be located.

The guided data labeling feature also generates annotation suggestions for fields with multiple bounding boxes.

Annotation suggestions are disabled by default. To enable these suggestions:

  • select the Display Suggestions option while manually identifying fields with the Training Data Management tools, or

  • click the Display available suggestions button while manually identifying table cells with the Training Data Management tools.

We recommend using the Guided Data Labeling feature with a single keyer at a time. When multiple keyers work at the same time, the increased resource usage can cause suggestions to be slower or missing.
