Creating Semi-structured Layouts

After you add a Semi-structured layout to Hyperscience, you will use the Layout Editor to define what information the system should extract from the processed pages that match the specific layout.

Unlike fields in Structured documents, fields in Semi-structured documents do not require bounding boxes because their location on the page may vary.

However, metadata is required for fields defined in Semi-structured layouts. As with Structured fields, metadata includes the names used to label the information to be extracted from the fields, as well as any specific settings to help the system read and process the field value.

Create a Semi-structured layout

  1. Go to Library > Layouts.

  2. Click Add Layout.

  3. Click Semi-structured Layout, and then click Next.

  4. In the Layout Name text box, enter a name for your layout.

  5. Under Language, click on the drop-down list, and click on the language you expect people to use when entering information in this layout’s documents.

    • If you know that specific fields or table columns will contain text in different languages than the document’s main language, you can select languages specific to those fields or columns in the Layout Editor. In the Layout Editor, selecting the Not in option allows you to select a language for each of those fields or columns. You can choose from any of the languages we support. You should only select this option if you know the text in a field or column will be in a different language than the language assigned to the layout.

  6. Click Create.

Following these steps creates a layout with a single variation. You cannot add variations to the layout, as variations can only be added to Structured layouts.

To make edits to your layout, you’ll need to go to the Layout Editor, as described in the next section.

Navigating the Layout Editor

You can specify the information that should be extracted from submitted pages that match the layout by defining the field metadata in the Layout Editor.

Access the Layout Editor

To access the Layout Editor:

  1. Go to Library > Layouts, and click on the name of the layout you want to edit.

  2. Click Layout Variations, and then click on the name of the variation.

When in the Layout Editor, you’ll be making changes to the working version of the layout. To commit your changes, or to save without committing them, follow the steps in Saving and exiting below.

Defining field metadata

Add a new field by clicking the Add Field button at the bottom of the "Fields" card. Then, fill out the following metadata for each field:

  • Name - labels the field throughout the product and is intended to be human-readable. This name is also provided in the output for submitted pages matched to the layout.

  • Data Type - designates the type of data that is expected for the given field. Data types specify the kinds of characters and any formatting expected.

    • Note that for Semi-structured documents, you now define signature fields with the "Signature" data type. Signatures are compatible with all Semi-structured models. Signature fields can be trained and are eligible for Field ID automation. To learn more, see Checkboxes and Signatures.

  • Output Name - allows users to provide a programmatic name for each field, in addition to the human-readable display name. This name is included in the output for submitted pages matched to the layout.

  • Transcription Supervision - allows you to specify the field’s transcription handling. The possible values and their meanings are as follows:

    • Autotranscribe - a field with this setting will not generate manual transcription tasks and will instead output the machine’s transcription value. This setting is appropriate for scenarios where the accuracy of a given field’s data is not sufficiently critical to warrant human review, but the machine’s best guess would be helpful to output.

      • If the machine’s confidence in its transcription is below the set threshold, an illegible field exception will be output instead of the machine’s transcription value.

    • Default - a field with this setting will undergo machine transcription. If the machine’s confidence level on its transcription is above the set threshold, the machine’s transcription value will be output. If the confidence level is below the threshold, the field will be sent to Supervision as a manual transcription task.

      • On a new field, the Transcription Supervision setting will initially be set to Default, and can subsequently be changed by the user.

    • Always - when the Transcription Supervision setting is set to Always, the system will always send the field to Supervision as a manual transcription task, regardless of the machine’s confidence in its transcription. This is used to guarantee a manual review of the field.

    • Consensus - when the Transcription Supervision setting is set to Consensus, the system does not record a value for the field until it receives the same post-normalization transcription value twice. This means that at least one manual transcription of the field will be required, regardless of the machine’s confidence in its transcription. This is used to indicate when accurate transcription of a given field is particularly important.

  • Identification Supervision - allows you to specify the field’s identification handling. The possible values are:

    • Auto-identify - a field with this setting will not generate Field ID tasks and instead will be identified automatically by the machine. This setting is appropriate for scenarios where the accuracy of a given field’s location is not sufficiently critical to warrant human review, but the machine’s best guess would be helpful.

    • Default - a field with this setting will undergo Field ID Supervision only if the machine’s confidence falls below the thresholds specified in the flow's settings.

      • On a new field, the Identification Supervision setting will initially be set to Default and can subsequently be changed by the user.

    • Always - a field with this setting will always generate Field ID tasks, regardless of the machine’s confidence in identifying this field. This setting is appropriate for scenarios where you want to guarantee a manual review of a field.

  • Notes (optional) - allows you to add notes to a field’s definition.
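The two supervision settings above amount to a small routing decision per field. The sketch below is purely illustrative and is not Hyperscience code: the enum names, function names, outcome labels, and the single numeric threshold are all assumptions made for this example. It also assumes that a confidence exactly equal to the threshold counts as confident, which the settings above do not specify.

```python
from enum import Enum


class TranscriptionSupervision(Enum):
    AUTOTRANSCRIBE = "Autotranscribe"
    DEFAULT = "Default"
    ALWAYS = "Always"
    CONSENSUS = "Consensus"


class IdentificationSupervision(Enum):
    AUTO_IDENTIFY = "Auto-identify"
    DEFAULT = "Default"
    ALWAYS = "Always"


def route_transcription(setting, confidence, threshold):
    """Return an illustrative outcome label for one field's transcription."""
    # Assumption: a confidence exactly at the threshold counts as confident.
    confident = confidence >= threshold
    if setting is TranscriptionSupervision.AUTOTRANSCRIBE:
        # Never creates a manual task; low confidence becomes an exception.
        return "machine_value" if confident else "illegible_field_exception"
    if setting is TranscriptionSupervision.DEFAULT:
        return "machine_value" if confident else "manual_transcription_task"
    # Always and Consensus both involve at least one manual transcription.
    return "manual_transcription_task"


def consensus_value(normalized_transcriptions):
    """Return the first value received twice, or None if nothing repeats yet."""
    seen = set()
    for value in normalized_transcriptions:
        if value in seen:
            return value
        seen.add(value)
    return None


def route_identification(setting, confidence, threshold):
    """Return an illustrative outcome label for one field's identification."""
    if setting is IdentificationSupervision.AUTO_IDENTIFY:
        return "machine_identified"
    if setting is IdentificationSupervision.ALWAYS:
        return "field_id_task"
    # Default: a task is created only when confidence is below the threshold.
    return "machine_identified" if confidence >= threshold else "field_id_task"
```

For example, under these assumptions a Default field transcribed with confidence 0.5 against a threshold of 0.9 would become a manual transcription task, while the same field set to Autotranscribe would produce an illegible field exception instead.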

Adding new fields to a live layout causes a warning message to appear on the layout’s details page. This warning message prompts you to go to the Release Library (Library > Releases) and create a new release with your updated layout variation.

If you add new fields to a layout that has a live Field Locator model, a warning message informs you that the ground-truth data does not include the newly-added fields.

Additional settings

There are additional settings and features you can enable to improve machine performance, namely the following:

Multiline - should be enabled for any field where more than one line of text is expected and will improve the machine’s processing on these fields.

Required - should be marked when you need to know whether or not a field exists on the document. When a submitted page is matched to the layout, the system generates an exception stating that the required field was missing in the following cases:

  • if a field marked as required is not found on the page

  • if the transcription of a required field is determined to be blank

  • if the field is marked illegible
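The three cases above can be read as a single check on a field's processing result. The following is a minimal sketch under stated assumptions: `FieldResult` and `required_field_missing` are hypothetical names invented for this example, and treating an empty or whitespace-only string as "blank" is an assumption, since how the system determines blankness is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldResult:
    """Hypothetical processing result for one required field."""
    found: bool                  # was the field located on the page?
    text: Optional[str] = None   # transcription value, if any
    illegible: bool = False      # was the field marked illegible?


def required_field_missing(result: FieldResult) -> bool:
    """Return True when a required field would raise a 'missing' exception."""
    if not result.found:
        return True
    if result.illegible:
        return True
    # Assumption: None, empty, or whitespace-only text counts as blank.
    return result.text is None or result.text.strip() == ""
```

With this sketch, a field that was found and transcribed as "ACME Corp" would not trigger the exception, while a field that was found but transcribed as an empty string would.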

For each field, a field name and a data type must be defined. If either is missing, an error message will be shown, and the field will be highlighted in red in the Layout Editor.

Not in - if you know this field will contain text in a language that differs from the one selected during the layout-creation process, select the Not in option, and then select the language from the Language drop-down list that appears. For example, if you selected Korean as your layout’s language but know that a certain field will contain English text, you would select the Not in Korean option, and then you would select English from the Language drop-down list.

  • You can choose from any of our supported languages, regardless of their language family. To learn more about language families, see Supported Languages.

  • You can select only one language for each field.

  • If the field’s language may differ across documents, do not select the Not in option.
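One way to picture the Not in option is as an optional per-field override of the layout's language. The sketch below is illustrative only: `FieldLanguage` and its attribute names are hypothetical and do not correspond to any Hyperscience data structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldLanguage:
    """Hypothetical model of a field's language settings."""
    layout_language: str
    # Set only when the Not in option is selected; one language per field.
    not_in_language: Optional[str] = None

    def effective(self) -> str:
        """Return the override if one is set, else the layout's language."""
        return self.not_in_language or self.layout_language


# The Korean/English example from the text above.
default_field = FieldLanguage(layout_language="Korean")
english_field = FieldLanguage(layout_language="Korean", not_in_language="English")
```

Here `default_field.effective()` returns "Korean" and `english_field.effective()` returns "English". A field whose language varies from document to document would simply leave the override unset.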

Defining table column metadata

Add a new table by clicking the Add Table button at the bottom of the "Tables" card.

If you want to define a nested table, you also need to add a child table by clicking the Add Child Table button. Nested tables allow you to extract data from tables with nested, complicated structures where child row data points inherit data points from parent rows.
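To make the inheritance idea concrete, the sketch below flattens a nested table so that each child row carries its parent row's data points. The dictionary shape, the `children` key, and the column names are hypothetical, chosen only for illustration; they are not the product's output format.

```python
def flatten_nested_rows(parent_rows):
    """Combine each child row with the data points of its parent row."""
    flat = []
    for parent in parent_rows:
        # Everything on the parent row except its list of child rows.
        inherited = {k: v for k, v in parent.items() if k != "children"}
        for child in parent.get("children", []):
            # Child values are added on top of the inherited parent values.
            flat.append({**inherited, **child})
    return flat


# Hypothetical example: orders (parent rows) with line items (child rows).
orders = [
    {"order_id": "A-100", "children": [
        {"item": "Widget", "qty": "2"},
        {"item": "Bolt", "qty": "10"},
    ]},
    {"order_id": "B-200", "children": [
        {"item": "Gear", "qty": "1"},
    ]},
]
rows = flatten_nested_rows(orders)
```

In this example, `rows` contains three entries, and the two line items under order A-100 both carry `order_id` "A-100".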

To add a new column, click the Add Column button. Then, fill out the following metadata for each table column:

  • Name - labels the table column throughout the product and is intended to be human-readable. This name is also provided in the output for submitted pages matched to the layout.

  • Data Type - designates the type of data that is expected for the given column. Data types specify the kinds of characters and any formatting expected.

  • Output Name - allows users to provide a programmatic name for each column, in addition to the human-readable display name. This name is included in the output for submitted pages matched to the layout.

  • Notes (optional) - allows you to add notes to a column's definition. These notes will be shown to Supervision data clerks working on Table Identification tasks to assist them in locating columns on submitted pages matched to Semi-structured layouts. For example, a note may provide possible labels for a given column or hint that a column often appears in a certain area of the page.

For each table column, a name and a data type must be defined. If either is missing, an error message will be shown, and the column will be highlighted in red in the Layout Editor.
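The required-metadata rule above (which also applies to fields) can be sketched as a simple validation pass. `ColumnDefinition` and `validation_errors` are hypothetical names used only for this illustration, and the error strings are invented; the Layout Editor's actual messages may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnDefinition:
    """Hypothetical model of a table column's metadata."""
    name: str = ""
    data_type: str = ""
    output_name: str = ""
    notes: str = ""   # optional


def validation_errors(column: ColumnDefinition) -> List[str]:
    """List the problems that would block this column definition."""
    errors = []
    if not column.name.strip():
        errors.append("name is required")
    if not column.data_type.strip():
        errors.append("data type is required")
    return errors
```

A column with both a name and a data type yields an empty list, while a column missing either one yields the corresponding message, mirroring the red highlight in the Layout Editor.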

Adding new table columns to a live layout causes a warning message to appear on the layout’s details page. This warning message prompts you to go to the Release Library (Library > Releases) and create a new release with your updated layout variation.

If you add new table columns to a layout that has a live Table Locator model, a warning message informs you that the ground-truth data does not include the newly-added columns.

Additional settings

There are additional settings and features you can enable to improve machine performance, namely the following:

  • Multiline - should be enabled for any column where more than one line of text is expected and will improve the machine’s processing on these columns.

  • Not in - if you know this column will contain text in a language that differs from the one selected during the layout-creation process, select the Not in option, and then select the language from the Language drop-down list that appears. For example, if you selected Korean as your layout’s language but know that a certain column will contain English text, you would select the Not in Korean option, and then you would select English from the Language drop-down list.

    • You can choose from any of our supported languages, regardless of their language family. To learn more about language families, see Supported Languages.

    • You can select only one language for each column.

    • If the column’s language may differ across documents, do not select the Not in option.

Note that table cells do not support Required, Transcription Consensus, or Dropout settings.

Reordering and deleting fields

The order of fields in the layout corresponds to the order in which fields will be shown to users during Field ID and Table ID. As such, it can be useful to reorder fields so that Supervision tasks present them in a particular order. Fields can be reordered in the Layout Editor by clicking and dragging the handle tool on the right side of each row in the list.

Note that you can also delete a field by clicking the trash can icon found on the right side of each row.

Saving and exiting

All working versions auto-save, so there is no need to manually save changes that have been made. If you’re finished making changes to your layout and want to commit your changes, you can do so before leaving the Layout Editor.

  • To commit your changes to the working version of the layout variation, click Commit Changes at the top of the page.

  • To leave the Layout Editor without committing your changes, click the X in the upper-right corner of the page.

    • You will still be able to commit your changes later by clicking Commit Changes on the layout’s details page.

  • To match submitted pages to a layout, the layout must have at least one committed version, and the layout must be deployed in a live release. For more information about layout versions and creating releases, see the articles What is a Layout Version?, Editing and Finalizing a Layout Version, and What is a Release?.
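The matching requirement in the last point combines two conditions, both of which must hold. As a minimal sketch (the `LayoutState` class and its attributes are hypothetical and exist only to illustrate the rule):

```python
from dataclasses import dataclass


@dataclass
class LayoutState:
    """Hypothetical summary of a layout's version and release status."""
    committed_versions: int   # number of committed layout versions
    in_live_release: bool     # is the layout deployed in a live release?


def can_match_pages(layout: LayoutState) -> bool:
    """Both conditions must be true for submitted pages to match the layout."""
    return layout.committed_versions >= 1 and layout.in_live_release
```

A layout with only an auto-saved working version, or one that is committed but not yet part of a live release, would return False here.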