Training Data Management Features

Training Data Management (formerly Keyer Data Management) includes tools for controlling and managing Identification model performance.

The performance of your models depends on the quality of the pages, the diversity of the documents, and consistent annotations.

For more information on model-training results, see Evaluating Model Training Results.

Training Data Management allows users to improve and supervise models by working directly with the training data (“ground truth”) obtained from each document in the training set. Users can group their documents, identify incompatible ones, annotate representative parts of them, and detect potential inconsistencies.

Available features

  • TDM for Classification - in v39 and later, users can work with the ground truth of Classification models in Training Data Management. Learn more in Training Data Management for Classification.

  • Document Eligibility Filtering — indicates whether a document is eligible for training, based on internal checks in the application and our machine learning logic. It provides additional information about documents that were excluded from the training set. 

  • Training Data Curator — labels each training document as having high or low importance. The importance is calculated by determining which data would best contribute to the model’s performance. 

  • Labeling Anomaly Detection for Fields and Tables — identifies potential discrepancies in the training datasets before running model training. Once the annotations are ready, the user can analyze the data to find inconsistencies and ensure a top-performance locator model. 

Using Training Data Management features 

The recommended steps for using the Training Data Management features together are given below.

  1. Go to Library > Models, and then click the name of the model whose training data you want to manage.

  2. Upload your documents by following the instructions in Importing and Exporting Training Data.

    Loading the documents may take several minutes, depending on the number of uploaded pages.

  3. Analyze the training documents by clicking Analyze Data.

    The analysis groups similar documents and suggests improvements for your training data.

    For more information, see Training Data Analysis and Guided Data Labeling.

     

After the data analysis is ready, more information regarding your training set appears on the Field Identification or Table Identification model card:

  • Required to train - number of additional documents needed to run model training.

  • Eligible for training - number of documents used in your training set.

This number changes as documents are annotated and each time the data is analyzed.

To see why documents are ineligible, click See Ineligibility details >>. A sidebar with information about the documents appears. Learn more about document eligibility in Document Eligibility Filtering.

All documents with the Training Status Ready To Annotate or Never will appear as ineligible for training until you change their Training Status and reanalyze the data.
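The status rule above can be pictured with a minimal Python sketch. It is only an illustration, not the application's implementation: the TrainingDocument class, the is_ineligible_by_status helper, and the "Annotated" status value are hypothetical, and the sketch covers only the Training Status rule, not the other eligibility checks the application performs.

```python
from dataclasses import dataclass

# Status labels taken from the text above. How the application stores
# them internally is an assumption.
INELIGIBLE_STATUSES = {"Ready To Annotate", "Never"}


@dataclass
class TrainingDocument:
    """Hypothetical stand-in for one row of the Training Data table."""
    doc_id: int
    training_status: str


def is_ineligible_by_status(doc: TrainingDocument) -> bool:
    """Return True if the Training Status alone makes the document ineligible."""
    return doc.training_status in INELIGIBLE_STATUSES


docs = [
    TrainingDocument(1, "Ready To Annotate"),
    TrainingDocument(2, "Never"),
    TrainingDocument(3, "Annotated"),  # hypothetical status value for illustration
]

# IDs of documents that stay ineligible until their status changes
# and the data is reanalyzed.
ineligible_ids = [d.doc_id for d in docs if is_ineligible_by_status(d)]
```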

The Training Data Health card shows your training set’s health, including:

  • a bar indicating how many documents you need to meet the minimum required for training, 

  • the number of required, ineligible/eligible, and recommended documents, and

  • the document grouping and recommended number of training documents. 
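One way to read the relationship between the eligible count and the Required to train count is as a simple shortfall against a minimum. The sketch below assumes that interpretation. The function name and the minimum_required parameter are hypothetical, and the actual minimum is defined by the application, not by this example.

```python
def documents_required_to_train(eligible: int, minimum_required: int) -> int:
    """Additional eligible documents needed before training can run.

    Assumes the count is the shortfall between a minimum and the number
    of eligible documents, and that it never drops below zero.
    """
    if eligible < 0 or minimum_required < 0:
        raise ValueError("counts must be non-negative")
    return max(0, minimum_required - eligible)


# Placeholder numbers for illustration only; they are not product defaults.
shortfall = documents_required_to_train(eligible=12, minimum_required=20)
```

Under this reading, annotating more documents and reanalyzing the data raises the eligible count and lowers the shortfall until it reaches zero.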

For each document, the Training Data table displays the following:

  • A yellow indicator if the document is ineligible for training.

  • The ID of the document’s group.

  • The importance of the document. The importance can be high or low, depending on the variety of documents in your training set. To learn more about the importance, see Training Data Curator.

  4. Annotate your documents and reanalyze your data to refresh the results.

Reanalyze your data each time you need to update the information about your training set’s health, as the system doesn’t re-run the analysis automatically.

  5. Run the model training after you improve the health of your training data by using Document Eligibility Filtering, Training Data Curator, and Labeling Anomaly Detection.

For details on how to train a high-performance model and evaluate your results, see Requirements for Training a New Model and Evaluating Model Training Results.