Table Identification

Overview

A table is a data structure that organizes information into rows and columns so that values are easy to read.

A table consists of the following elements:

  • Rows

  • Columns

  • Cells
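As an illustration of how these elements relate, here is a minimal Python sketch. The class names (`Table`, `Row`, `Cell`) and the sample invoice data are hypothetical and are not part of any Hyperscience API; they only model the idea that a cell sits where a row meets a column.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    """One value at the intersection of a row and a column."""
    column: str
    text: str


@dataclass
class Row:
    """A horizontal group of cells, at most one per column."""
    cells: List[Cell] = field(default_factory=list)

    def value(self, column: str) -> Optional[str]:
        # Return the text for the given column, or None if the cell is missing.
        for cell in self.cells:
            if cell.column == column:
                return cell.text
        return None


@dataclass
class Table:
    """A named table: an ordered list of column names plus its rows."""
    name: str
    columns: List[str]
    rows: List[Row] = field(default_factory=list)


# Hypothetical line-item table, used only for illustration.
line_items = Table(
    name="Line Items",
    columns=["Description", "Quantity", "Amount"],
    rows=[
        Row([Cell("Description", "Widget"), Cell("Quantity", "2"), Cell("Amount", "10.00")]),
        Row([Cell("Description", "Gadget"), Cell("Amount", "4.50")]),  # no Quantity cell
    ],
)
```

In this sketch, `line_items.rows[1].value("Quantity")` returns `None`, which mirrors a row where a cell was not identified.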

Hyperscience provides a solution for extracting the data from tables by using Table Identification models in Semi-Structured layouts.

Table Extraction is available only for Semi-Structured Layouts.

Learn more in Creating Semi-Structured Layouts.

We support extraction for the following tables:

  • Regular Tables - Tables with a simple structure: a single table with columns.

  • Nested Tables - Learn more in What is a Nested Table?

  • Multiple Tables - Hyperscience supports more than one table in a document in v38. These tables may be regular or nested tables. We do not support side-by-side tables.

In this article, you will learn how to annotate and train a Table ID model.

Table Identification Task

Table ID tasks are specific to Semi-structured documents with table columns. To see these tasks, you must define table columns on a Semi-structured layout.

Prerequisites

Follow the steps below to define a table and start the annotation process:

Extract your tables by using the Table Identification task. It is available for Supervision, QA, and Training Data Management.

Annotating Tables

The Table Identification task is available in Training Data Management, where annotating is similar to annotating in Supervision.

Follow the steps described in the Table ID Supervision tab to annotate your documents.

Navigating the Table ID task in TDM

  • All tables available for annotation will be displayed on the right-hand side of your screen.

  • You can select which table you want to annotate from the right-hand side of your screen.

  • You can also see the type of table you’re currently annotating, as well as the number of columns available.

  • If you add a new column to your layout, the New Column label is displayed next to its name.

Be sure to click the Continue to Review (CMD+ENTER) button for each available table.

Follow the steps below to annotate your table in Supervision:

1. Select a row from the table to be a Template row

A Template Row is the lead row in your table. It does not have to be the first row. Hyperscience uses the copycat tool to propagate the template row's annotations to the rest of the rows in Step II. The copycat tool is not always accurate, so make sure to double-check the annotations.

In this step, you will define the template row of your table.

  1. Select a row from the table to be a template row.

  • We recommend selecting a row where all data is present.

  • Use rows with longer values.

  • Annotate a row with multi-line values.
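The tips above can be read as a simple ranking of candidate rows. The sketch below is only an illustration of that reasoning, assuming each row is a dictionary mapping column names to cell text; the function name `pick_template_row` is hypothetical and nothing here reflects how Hyperscience itself selects or processes rows.

```python
from typing import Dict, List, Tuple


def pick_template_row(rows: List[Dict[str, str]], columns: List[str]) -> int:
    """Return the index of the row that best matches the tips above.

    Rows are ranked by the number of columns that have a value, then by
    the number of multi-line values, then by total text length.
    """
    if not rows:
        raise ValueError("the table has no rows")

    def score(row: Dict[str, str]) -> Tuple[int, int, int]:
        values = [row.get(column, "").strip() for column in columns]
        filled = sum(1 for value in values if value)
        multi_line = sum(1 for value in values if "\n" in value)
        total_length = sum(len(value) for value in values)
        return (filled, multi_line, total_length)

    # max() returns the first best-scoring row when several rows tie.
    return max(range(len(rows)), key=lambda index: score(rows[index]))


# Hypothetical rows, used only for illustration.
columns = ["Description", "Quantity", "Amount"]
rows = [
    {"Description": "Widget", "Quantity": "", "Amount": "10.00"},
    {"Description": "Gadget with\nextended warranty", "Quantity": "1", "Amount": "4.50"},
    {"Description": "Bolt", "Quantity": "12", "Amount": "0.60"},
]
best = pick_template_row(rows, columns)
```

Here `best` is `1`: the second row has a value in every column and contains a multi-line value.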

  • The right-hand side of the screen displays the number of tables you have to annotate. You can see the name and the type of the table you are currently annotating.

  • Select a column from the right-hand side of your screen, or use the W and E keys to navigate between columns.

2. Make sure to capture the cell and follow the tips below if necessary

If: The bounding box includes all of the cell's content.
Then: Move on to the next step.

If: The box is in the right place but doesn't include all of the cell's content (e.g., parts of letters fall outside of the box).
Then: Click and drag the box's corners until it contains all of the content that should be transcribed.

If: Neighboring text segments should also be included in the cell's transcription.
Then: With a click-and-drag motion, draw a bounding box that includes all of the cell's content.

If: The box doesn’t include any of the cell’s content OR no bounding box appears around the cell’s content when hovering over it.
Then: Press the spacebar, and with a click-and-drag motion, draw a bounding box that includes all of the cell's content.
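The conditions above come down to whether one rectangle fully contains the text it should capture. The following sketch shows that geometry under stated assumptions: boxes are given as pixel coordinates with the origin at the top-left of the page, and the names `Box`, `contains`, and `enclosing_box` are hypothetical helpers, not part of the product.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle; the origin is assumed to be the page's top-left."""
    left: float
    top: float
    right: float
    bottom: float


def contains(outer: Box, inner: Box) -> bool:
    # True when every edge of `inner` lies inside or on the edge of `outer`.
    return (
        outer.left <= inner.left
        and outer.top <= inner.top
        and outer.right >= inner.right
        and outer.bottom >= inner.bottom
    )


def enclosing_box(segments: List[Box]) -> Box:
    # Smallest box that covers all of the given text segments.
    if not segments:
        raise ValueError("at least one segment is required")
    return Box(
        left=min(s.left for s in segments),
        top=min(s.top for s in segments),
        right=max(s.right for s in segments),
        bottom=max(s.bottom for s in segments),
    )


# Hypothetical cell box and two neighboring text segments.
cell_box = Box(100, 50, 180, 70)
segments = [Box(105, 52, 150, 68), Box(155, 52, 195, 68)]

first_inside = contains(cell_box, segments[0])   # the first segment fits
second_inside = contains(cell_box, segments[1])  # the second one spills past the right edge
combined = enclosing_box(segments)               # a box that would cover both segments
```

In this example the second segment extends past the right edge of `cell_box`, which corresponds to the case where the box needs to be enlarged or redrawn to include neighboring text.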

3. Review your annotations

  1. De-select all columns by pressing the ESC key to get a better preview of the annotations. The labels indicate, with a different color for each column, all cells annotated in the respective column. Hide all labels by pressing CMD+I.

If you have more rows in the table, use the Split button to identify them faster:

  • Extend the boundary of the table to the bottom of your document

  • Press the S key on your keyboard and separate the table into individual rows by clicking on each place where a row boundary should exist.

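Conceptually, splitting a table turns a set of clicked boundary positions into a list of row regions. The sketch below illustrates that idea only; it assumes vertical positions measured in pixels from the top of the page, and the function name `split_into_rows` is hypothetical rather than anything the product exposes.

```python
from typing import List, Tuple


def split_into_rows(
    table_top: float, table_bottom: float, split_points: List[float]
) -> List[Tuple[float, float]]:
    """Turn clicked boundary positions into (top, bottom) row regions."""
    if table_top >= table_bottom:
        raise ValueError("table_top must be above table_bottom")
    # Ignore clicks outside the table, drop duplicates, and order top to bottom.
    inside = sorted({y for y in split_points if table_top < y < table_bottom})
    edges = [table_top] + inside + [table_bottom]
    # Each pair of neighboring edges bounds one row.
    return list(zip(edges[:-1], edges[1:]))


# Hypothetical table spanning y=100..400 with clicks at 300, 200, and 450.
row_regions = split_into_rows(100, 400, [300, 200, 450])
```

The click at 450 falls outside the table boundary and is ignored, so `row_regions` holds three rows: `(100, 200)`, `(200, 300)`, and `(300, 400)`. This is also why the table boundary is extended to the bottom of the document first: boundaries outside the table cannot create rows.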

2. Use the action buttons on the labels to adjust the annotations of your cells:

a. Find Missing Cells:

The target button on the column label auto-annotates cells that may have been left unidentified in that column. Use this button in the following cases:

  1. You manually created rows that the machine had failed to identify. Click the target button for each column to auto-identify missing cells in the newly created rows.

  2. You manually deleted all rows, created new rows from scratch, and annotated a single row. Click the target button for each column to auto-identify missing cells in the rest of the rows.

b. Select all column cells on page:

This button selects all cells in the column you’re currently working with.

Use it when you see unidentified cells to adjust all bounding boxes at once.

c. Delete all column cells on page:

This button allows you to delete all column cells on the page.

Other actions

If your document has multiple pages, you can use the Scroll freeze button located at the top of the page. Clicking it improves the performance of the system by rendering the images on each page faster. You can also extend a row to the next page by using the Extend row [row’s number] to next page button, located between the pages.

Insert a row by clicking the button on the left side of the page or the button in the middle of the page if you don’t have any rows.

Right-click on the row for the following options:

  • Insert 1 above

  • Insert 1 below

  • Delete selected row

  • Delete all rows above

  • Delete all rows below

Click Manually Re-identify Table if you need to start over.

If you have a nested table, follow Steps 1 and 2 described above. Learn more about nested tables in What is a Nested Table?

The Table Identification QA task

The Table ID QA task is similar to the Supervision task. Note that once all tables reach consensus, the document will also reach consensus. Learn more about consensus in Transcription Supervision Consensus.
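The relationship between table-level and document-level consensus can be written as a one-line rule. This is a sketch of the statement above only; the function name and the dictionary of per-table flags are hypothetical, and treating a document with no tracked tables as not yet at consensus is an assumption made for this example.

```python
from typing import Dict


def document_reached_consensus(table_consensus: Dict[str, bool]) -> bool:
    """A document reaches consensus once every one of its tables has."""
    # Assumption: with no tables tracked, consensus is reported as not reached.
    return bool(table_consensus) and all(table_consensus.values())


# Hypothetical per-table status for one document.
status = {"Line Items": True, "Charges": False}
before = document_reached_consensus(status)

status["Charges"] = True
after = document_reached_consensus(status)
```

Here `before` is `False` because one table is still pending, and `after` is `True` once the last table reaches consensus.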

Limitations

  • We do not support side-by-side tables.

  • Anomalies and suggestions appear only in the first table.

  • The Table Identification tasks are not ordered in the same way as the tables are ordered in the Layout.

  • You can’t use the same column name more than once in a nested table.

  • Do NOT add tables with the same name.
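The two naming limitations above can be checked before you start annotating. The sketch below assumes you keep your own list of table definitions outside the product; `TableSpec` and `find_naming_conflicts` are hypothetical names, and the code does not read from or write to a Hyperscience layout.

```python
from collections import Counter
from typing import List, NamedTuple


class TableSpec(NamedTuple):
    """Hypothetical description of one table in a layout."""
    name: str
    columns: List[str]
    nested: bool


def find_naming_conflicts(tables: List[TableSpec]) -> List[str]:
    """Report duplicate table names and repeated column names in nested tables."""
    problems = []

    # Limitation: tables must not share a name.
    for name, count in Counter(table.name for table in tables).items():
        if count > 1:
            problems.append(f"table name used {count} times: {name}")

    # Limitation: a nested table cannot repeat a column name.
    for table in tables:
        if not table.nested:
            continue
        for column, count in Counter(table.columns).items():
            if count > 1:
                problems.append(
                    f"column name repeated in nested table {table.name}: {column}"
                )
    return problems


# Hypothetical layout with one duplicate table name and one repeated column.
layout = [
    TableSpec("Line Items", ["Description", "Amount"], nested=False),
    TableSpec("Charges", ["Code", "Code", "Amount"], nested=True),
    TableSpec("Line Items", ["Item"], nested=False),
]
conflicts = find_naming_conflicts(layout)
```

For this sample layout, `conflicts` contains one entry for the duplicated "Line Items" table name and one for the repeated "Code" column in the nested "Charges" table.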

Table ID Models

After completing your annotations, you will be able to train a Table Identification model.

The model enables cell-level predictions and automatic table processing. A Table ID model can be trained to automatically identify regular and nested tables.

Table ID models look at the transcribed text to improve table identification. This feature is called Table Detector and supports the following scenarios:

  • If there are multiple similar tables on the page, you can train the model to identify only a specific table that has a predefined header name.

  • Using the transcribed text from the page, the model can filter out unnecessary rows from a table.

Learn more about Table Transcription in Table Transcription task.

To train and deploy a model, go to the Model Details page. Once you have chosen the Semi-structured layout for which you want to train a model, there are two ways to get to the Model Details page:

  1. Go to Library > Models, select Identification Models from the drop-down list at the top of the page, and then click on the name of the model.

  2. Go to Layouts, click on the name of the layout, and then click on the name of the Identification Model on the Layout Details page.

You can process multiple tables within a document. Note that the models for these tables are trained one after another when you initiate model training.

The model for each table is available in the Table Identification Models card. To initiate model training, follow the steps described in Training a New Table Identification Model.

Learn more about navigating Training Data Management and using its features in Training Data Management Features and Training Data Management.