The following best practices for Field Identification Supervision and Quality Assurance will ensure the highest levels of accuracy and automation for your Semi-structured documents. They will also maximize the quality of the training data that is collected through these tasks.
For more information about Supervision and QA, see Field Identification and Field Identification Quality Assurance.
Providing realistic training data
Provide training documents that are as similar as possible to the documents being processed.
If your submitted documents change gradually over time, automation will begin to decrease. If you see automation or accuracy decreasing, or if your keyers notice slight changes in your submissions, consider retraining your models with updated training documents.
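One way to notice gradual drift is to compare recent accuracy against an earlier baseline. The following is a minimal sketch, not part of the product: the function name, the weekly accuracy list, and the thresholds (`baseline_weeks`, `recent_weeks`, `max_drop`) are all assumptions chosen for illustration, and the accuracy figures would come from your own reporting.

```python
from statistics import mean


def needs_retraining(weekly_accuracy, baseline_weeks=4, recent_weeks=2, max_drop=0.05):
    """Return True when recent accuracy has fallen noticeably below the baseline.

    weekly_accuracy: accuracy values (0.0-1.0), oldest first.
    All thresholds are hypothetical defaults, not recommended values.
    """
    if len(weekly_accuracy) < baseline_weeks + recent_weeks:
        return False  # not enough history to compare
    baseline = mean(weekly_accuracy[:baseline_weeks])
    recent = mean(weekly_accuracy[-recent_weeks:])
    return baseline - recent > max_drop


# Illustrative numbers only: accuracy slips as submissions slowly change.
history = [0.96, 0.95, 0.96, 0.95, 0.93, 0.90, 0.88]
print(needs_retraining(history))
```

A check like this only signals when to look more closely. Whether to retrain still depends on whether the submitted documents have actually changed.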
Selecting values to identify
ID what you see.
If a value is blank on a document, mark the field as “Not Present.” Do not draw a bounding box where the value should have been.
In the following examples, the Tax field should be marked as not present.
ID text rather than logos.
If a company's name appears both inside a logo and as plain text elsewhere, draw the bounding box around the plain-text version rather than the one within the logo.
ID values on white backgrounds.
If the same value is repeated on a white background and on a colored or textured background, draw a bounding box around the version on the white background.
ID standalone values.
When selecting between a repeated value that is by itself or one that has surrounding text, choose the standalone value.
ID consistently, choosing the top-most and left-most value whenever possible.
If a single value is present multiple times on a page or document, it is best to pick one location and ID it consistently across documents. Ideally, that location will be the top-most and left-most value of that field, if that value meets the other criteria described in this article.
In the example below, the team on the left has decided to always select the address from the top location, whereas the team on the right has not come up with clear guidelines. When different workers select different address locations, the quality of their training data is reduced.
…but be more lenient when performing QA tasks.
If the predicted answer is in the correct location, but that location isn't the top-most or left-most location for that field, mark it as “Correct.” Otherwise, the reported accuracy will decrease unnecessarily.
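The selection guidelines in this section can be summarized as a ranking over the places where a value appears. The sketch below is illustrative only: the `Candidate` class, its attributes, and both function names are hypothetical, and the relative priority of the logo, background, and standalone criteria is an assumption, since this article does not rank them against one another. It also assumes that, for QA, any location where the correct value appears counts as correct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One place where a field's value appears (hypothetical structure)."""
    page: int
    top: float    # distance from the top of the page
    left: float   # distance from the left edge of the page
    in_logo: bool = False
    white_background: bool = True
    standalone: bool = True


def preferred_location(candidates):
    """Pick the occurrence to annotate during Supervision."""
    if not candidates:
        return None  # value is absent: mark the field as "Not Present"

    def rank(c):
        # False sorts before True, so text, white-background, and standalone
        # occurrences win first; ties go to the top-most, then left-most.
        return (c.in_logo, not c.white_background, not c.standalone,
                c.page, c.top, c.left)

    return min(candidates, key=rank)


def qa_is_correct(predicted, candidates):
    """During QA, accept any valid occurrence, not only the preferred one."""
    return predicted in candidates


logo = Candidate(page=1, top=40, left=300, in_logo=True)
body = Candidate(page=1, top=120, left=50)
footer = Candidate(page=1, top=900, left=50)

print(preferred_location([footer, logo, body]) == body)  # annotate the body text
print(qa_is_correct(footer, [logo, body, footer]))       # footer still passes QA
```

Writing the rule down once, in whatever form suits your team, helps every keyer select the same location.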
Drawing bounding boxes
Ensure that bounding boxes don't split text within a word.
The machine trains models at the word level, where a “word” is a string of consecutive characters without spaces. If you include only part of a word in a bounding box, you will provide the model with training data that lowers its accuracy instead of improving it.
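The word-level rule can be expressed as a geometric check: every word a bounding box touches should lie entirely inside it. The sketch below assumes axis-aligned boxes in page coordinates and that per-word boxes are available. The `Box` type and the function names are hypothetical and do not reflect the product's internals.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def overlaps(a, b):
    """True when the two boxes share any area."""
    return a.left < b.right and b.left < a.right and a.top < b.bottom and b.top < a.bottom


def contains(outer, inner):
    """True when inner lies entirely within outer."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and outer.right >= inner.right and outer.bottom >= inner.bottom)


def split_words(annotation, word_boxes):
    """Return the words that the annotation cuts through."""
    return [w for w in word_boxes if overlaps(annotation, w) and not contains(annotation, w)]


# Two words on one line, for example a label followed by a number.
label = Box(10, 10, 60, 20)
number = Box(65, 10, 100, 20)

clean = Box(63, 8, 102, 22)   # wraps the number only
sloppy = Box(50, 8, 102, 22)  # also clips the end of the label

print(split_words(clean, [label, number]))   # no words are split
print(split_words(sloppy, [label, number]))  # the label is split
```

An empty result means the box respects word boundaries. Any word returned is one the box should either fully include or fully exclude.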
Making the most of the provided tools
Use keyboard shortcuts to navigate between fields.
While you can choose to identify any field at any time by clicking on its name in the right-hand sidebar, we recommend using our keyboard shortcuts to move from one field to the next.
For a list of shortcuts, see Field Identification and Field Identification Quality Assurance.
Use our predictions for checkboxes.
When identifying checkbox fields, we recommend using our predicted bounding boxes rather than drawing the bounding boxes by hand. Doing so ensures that the proper amount of padding is added between the checkbox and the bounding box, which increases the accuracy of checkbox transcriptions.
Performing sufficient Supervision and QA tasks
When training models, be aware of “diminishing returns.”
Unless the layouts within a dataset vary greatly, there is negligible benefit in performing Supervision or QA on more than 1000 documents in that dataset. For example, if your dataset contains checks, your keyers do not need to spend time annotating more than 1000 documents.