Auto-splitting allows you to define specific rules for splitting submissions into semi-structured documents. This feature enables different rules for each layout, determining how a sequence of pages from the same layout should be grouped into documents. Previously, splitting rules were applied at the flow level, affecting all layouts in that flow. Now, layout-specific rules are configured on the Layout Management page, where each layout can have a single, dedicated splitting rule. If no specific rule is set, the system defaults to the existing flow-level settings. In this article, you’ll learn how to use Auto-splitting and how to define your own rules for splitting pages in semi-structured layouts.
Flow-level Splitting Logic
Auto-splitting applies predefined rules at the layout level, allowing you to customize how documents should be split during processing. This approach provides more flexibility than the traditional flow-level semi-structured grouping logic, which has the following options:
Consecutive pages as a document - All consecutive pages classified to the same layout are grouped as a single document.
Consecutive pages as separate documents - Each consecutive page is treated as a separate document.
Manual review of consecutive pages - Requires manual validation of document boundaries.
Learn more about grouping logic in our Semi-structured Grouping Logic article.
Since these settings apply to all semi-structured documents within a flow, the introduction of layout-level auto-splitting allows you to define specific splitting rules for each document type individually.
Layout-level Splitting Logic Configuration
Auto-splitting rules can be defined on the Layout Management page under the Grouping Logic tab and are divided into two main categories:
Use Flow Settings - Select this option if you want to use the flow-level semi-structured grouping logic.
By default, Auto-Splitting uses the flow-level grouping settings to ensure backward compatibility. If no specific Auto-Splitting rule is configured at the layout level, the system will apply the existing flow settings.
Basic Split (from Flow Settings)
These options replicate the behavior of traditional flow-level grouping logic but can now be configured by layout:
Manual Review - The pages are sent for manual review in Manual Classification Supervision. Learn more in Document Classification.
Only multiple consecutive pages from the same layout will be sent to manual review. If a document contains only one page, the system will automatically create a document without sending it for manual review.
Consecutive Pages as Document - This setting groups all consecutive pages classified to the same layout into a single document. However, if pages of the same document are not consecutive or have other layout pages in between, they will not be merged into one document.
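The Consecutive Pages as Document behavior can be illustrated with a short Python sketch. This is not the product's implementation; it assumes each page is represented only by the name of the layout it was classified to, and the function name group_consecutive_pages is hypothetical.

```python
from itertools import groupby


def group_consecutive_pages(page_layouts):
    """Group consecutive pages that share a layout into documents.

    page_layouts: list of layout names, one per page, in submission order
    (an assumed representation for illustration only).
    Returns a list of (layout, [page_indices]) tuples.
    """
    documents = []
    # groupby only merges adjacent items, which mirrors the rule that
    # non-consecutive pages of the same layout are not merged.
    for layout, run in groupby(enumerate(page_layouts), key=lambda item: item[1]):
        documents.append((layout, [index for index, _ in run]))
    return documents


pages = ["Invoice", "Invoice", "Check", "Invoice"]
print(group_consecutive_pages(pages))
# [('Invoice', [0, 1]), ('Check', [2]), ('Invoice', [3])]
```

The last Invoice page becomes its own document because a Check page sits between it and the earlier Invoice pages.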
Auto-Split (Rule-based splitting)
The rule-based splitting enables more advanced document grouping:
Number Of Pages
Number of Pages splits a sequence of pages into documents of a defined page count. If the total number of pages is evenly divisible by that count, every resulting document contains exactly that many pages. If it isn't, the remaining pages are grouped into a separate, shorter document. This rule is useful when dealing with documents that always follow a fixed page structure, such as invoices, forms, or contracts.
For example:
If you set the rule to 3 pages per document and submit 9 pages, the system creates three documents with 3 pages each.
If you submit 8 pages, the system creates two documents with 3 pages each and one document with 2 pages.
To handle cases where the number of pages doesn’t match the expected rule, you can enable Manual Review to check and adjust any inconsistencies.
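The page-count arithmetic above can be sketched in Python as follows. This is a minimal illustration under the assumption that pages arrive as an ordered list; the names split_by_page_count and has_short_document are hypothetical and are not part of the product.

```python
def split_by_page_count(pages, pages_per_document):
    """Split an ordered list of pages into fixed-size documents.

    Any leftover pages form a final, shorter document.
    """
    if pages_per_document < 1:
        raise ValueError("pages_per_document must be at least 1")
    return [
        pages[start:start + pages_per_document]
        for start in range(0, len(pages), pages_per_document)
    ]


def has_short_document(documents, pages_per_document):
    """Return True if any document has fewer pages than expected.

    Illustrative check for the kind of mismatch a reviewer might inspect.
    """
    return any(len(document) != pages_per_document for document in documents)


nine = split_by_page_count(list(range(1, 10)), 3)
eight = split_by_page_count(list(range(1, 9)), 3)
print([len(d) for d in nine])   # [3, 3, 3]
print([len(d) for d in eight])  # [3, 3, 2]
print(has_short_document(eight, 3))  # True
```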
Regular Expression (Regex) Matching
A Regular Expression (Regex) is a sequence of characters that defines a search pattern, commonly used to match and extract specific text patterns within documents. It allows you to automate how documents are split by identifying consistent text markers, such as titles or page numbers, across pages.
Regex in Hyperscience
Regex in Hyperscience is Python-compatible for flexible and precise document processing.
Understanding Regex Matching
When a document is being processed, Segmentation divides the text into chunks and arranges them into rows from left to right. To learn more, see our Segmentation article. The regex is then applied at the row level, meaning:
The regex must match text on a single line (multi-line regex is not supported).
The system scans from left to right, so right-to-left languages (e.g., Arabic) are not supported.
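To see what the single-line constraint means in practice, here is a small Python sketch. It assumes a page is available as a list of row strings and that the pattern is tested against each row independently with re.search; both are assumptions made for illustration, and the helper name page_matches is hypothetical.

```python
import re


def page_matches(rows, pattern):
    """Return True if the pattern matches at least one row on the page.

    Each row is tested on its own, so a pattern can never span two rows.
    """
    compiled = re.compile(pattern)
    return any(compiled.search(row) for row in rows)


pattern = r"^Invoice Number: \d{6}$"

# The label and the number share one row, so the pattern matches.
same_row = ["ACME Corp", "Invoice Number: 123456"]
# The label and the number sit on separate rows, so no single row matches.
split_rows = ["ACME Corp", "Invoice Number:", "123456"]

print(page_matches(same_row, pattern))    # True
print(page_matches(split_rows, pattern))  # False
```

If the text you want to match may wrap onto a second row, consider writing the pattern against only the part that reliably stays on one row.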
Splitting Page Types
First Page Regex: Splits documents based on a unique text pattern that appears only on the first page of each document. This is useful for cases where the first page contains specific identifiers like 'Page 1' or a document title.
Last Page Regex: Splits documents based on a text pattern that appears only on the last page of each document, such as 'Total' or 'End of Report'. This ensures the system correctly detects the end of each document.
Same/All Page Regex: Captures a unique identifier that appears on each page of a document and changes between different documents. A split occurs when the identifier changes.
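The three splitting page types can be compared with a rough Python sketch. It is not the product's algorithm: it assumes each page is a list of row strings, and the handling of pages where the pattern is not found (they simply stay with the current document) is an assumption. All function names and the "Claim ID" field are hypothetical.

```python
import re


def find_match(rows, pattern):
    """Return the first row-level match text on a page, or None."""
    for row in rows:
        match = re.search(pattern, row)
        if match:
            return match.group(0)
    return None


def split_first_page(pages, pattern):
    """Start a new document whenever a page matches the first-page pattern."""
    documents = []
    for page in pages:
        if not documents or find_match(page, pattern) is not None:
            documents.append([page])
        else:
            documents[-1].append(page)
    return documents


def split_last_page(pages, pattern):
    """Close the current document after a page matching the last-page pattern."""
    documents, current = [], []
    for page in pages:
        current.append(page)
        if find_match(page, pattern) is not None:
            documents.append(current)
            current = []
    if current:  # trailing pages with no closing match (assumed behavior)
        documents.append(current)
    return documents


def split_same_page(pages, pattern):
    """Start a new document whenever the captured identifier changes."""
    documents, previous = [], None
    for page in pages:
        identifier = find_match(page, pattern)
        if documents and identifier == previous:
            documents[-1].append(page)
        else:
            documents.append([page])
        previous = identifier
    return documents


invoices = [
    ["Invoice Number: 111111", "Item A"],
    ["Item B", "Total Amount: $10.00"],
    ["Invoice Number: 222222", "Total Amount: $5.50"],
]
claims = [["Claim ID: A1"], ["Claim ID: A1"], ["Claim ID: B2"]]

print([len(d) for d in split_first_page(invoices, r"^Invoice Number: \d{6}$")])      # [2, 1]
print([len(d) for d in split_last_page(invoices, r"^Total Amount: \$\d+\.\d{2}$")])  # [2, 1]
print([len(d) for d in split_same_page(claims, r"^Claim ID: \w+$")])                 # [2, 1]
```

First Page and Last Page react to the presence of a marker, while Same/All Page reacts to a change in the matched text, which is why its pattern needs to match on every page.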
Example Regex Patterns
First Page:
^Invoice Number: \d{6}$
Last Page:
^Total Amount: \$\d+\.\d{2}$
Consistent Field Across Pages:
^Page \d+ of \d+$
For more details on Regex syntax, refer to the Python Regex Documentation.
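Before saving a rule, it can help to try a pattern against sample rows in a local Python session. The sketch below uses the standard re module with the three example patterns, written with escaped digit classes (\d) and an escaped dollar sign (\$) so that the literal "$" in an amount is not read as an end-of-line anchor. The sample rows are invented for illustration.

```python
import re

FIRST_PAGE = r"^Invoice Number: \d{6}$"
LAST_PAGE = r"^Total Amount: \$\d+\.\d{2}$"
PAGE_FIELD = r"^Page \d+ of \d+$"

# (pattern, sample row) pairs; the rows are made-up test data.
checks = [
    (FIRST_PAGE, "Invoice Number: 123456"),  # six digits: matches
    (FIRST_PAGE, "Invoice Number: 12345"),   # five digits: no match
    (LAST_PAGE, "Total Amount: $1250.00"),   # literal dollar sign: matches
    (LAST_PAGE, "Total Amount: 1250.00"),    # missing dollar sign: no match
    (PAGE_FIELD, "Page 2 of 5"),             # matches
    (PAGE_FIELD, "Page two of five"),        # words instead of digits: no match
]

for pattern, row in checks:
    result = "match" if re.search(pattern, row) else "no match"
    print(f"{row!r}: {result}")
```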
Manual Review for Rule Failures
If a document does not meet the criteria defined in the splitting rules, users can enable manual review to inspect and correct the split before finalizing document classification. This is useful in cases where:
The expected number of pages is not matched. For example, the system detected an 8-page document, but the auto-splitting logic expected documents to be only 3 pages long.
The regex pattern is not found.
A document is partially misclassified (e.g., mixed invoice and check pages).
Navigating the Grouping Logic Configurations page
To access the Grouping Logic configurations:
Navigate to Library > Layouts
Open the layout you want to configure by clicking on its name
On the Configuration card, click the Edit button under Grouping Logic
On the Grouping Logic page, select the grouping logic configuration you want for your layout
If you’re using Regex, choose the splitting page type from the drop-down menu
Define a Regex rule following the syntax in the Python Regex Documentation
Click Save to apply your changes.
Compatibility and Known Limitations
Auto-splitting is available for flows created in v40.2 and later. If your flow is on a version lower than v40.2, auto-splitting rules will not apply, and documents will not be split.
Users upgrading from previous versions will retain existing flow-level configurations until they switch to layout-based splitting.
Regex matching works only for single-line text and scans left to right, meaning right-to-left languages are not supported.
For example, if a person’s first and last name appear on separate lines on a page, Regex won’t recognize the full name because it only scans one line at a time.
Right-to-left languages like Arabic have a different text flow, which is not supported by our current Regex implementation.
Blank or sparse separator pages are not currently supported.
If a single layout is used across multiple flows with different splitting rules, users must duplicate the layout to configure different rules.