Text Classification (Preview)

With Hyperscience’s Text Classification feature, you can train a model to classify freeform text in documents, emails, and more. This feature allows you to analyze and organize unstructured text by user intent, sentiment, topic, or any custom labels based on your own business rules.

For example, you can use the Text Classification feature to:

  • Classify the medical disorders listed in a patient intake form.

  • Determine the sentiment expressed in a customer comment and apply a label to it (e.g., "positive," "negative").

  • Categorize and prioritize emails by customer intent (e.g., inquiries, complaints, requests to change account information).

With Text Classification, Hyperscience provides you with a block that you can use to classify the text extracted from a specific field, document, or email. The preview version is designed to demonstrate the core functionality of Text Classification. We will work closely with preview customers and partners to incorporate additional functionality and changes in the GA (general availability) version.

In this article, we outline the steps you need to take to classify text. As an example, we’ll focus on classifying text extracted from a specific field.

1.  Obtain the text classification flow from Hyperscience.

If you are interested in using the preview version of Text Classification, reach out to your Hyperscience representative to discuss your intended use of the feature, or send an email to [email protected]. If the feature in its current state can meet your needs, your representative will share a file containing the code for the text classification flow.

2.  Create a CSV file with training data.

To train the text classification model, you need to provide a CSV file that contains your training data. The model learns from the labeled text samples in this file how to assign labels to text.

The CSV file needs to have two columns:

  • text, which contains the text samples you want to train with, and

  • label, which contains a single label for each text sample. 
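As a minimal sketch of the two-column layout, the Python snippet below builds CSV content from (text, label) pairs. It assumes that text and label appear as column headers in the first row; the function name and the sample rows are hypothetical and only illustrate the shape of the file.

```python
import csv
import io


def training_csv_text(rows):
    """Return CSV content with a text,label header followed by the given rows.

    Assumption: the two required columns are identified by a header row.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)  # quotes commas and line breaks inside samples
    writer.writerow(["text", "label"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical samples; replace them with your own text and labels.
    samples = [
        ("The agent resolved my issue quickly.", "positive"),
        ("I waited two weeks, and nobody replied.", "negative"),
    ]
    with open("training_data.csv", "w", newline="", encoding="utf-8") as f:
        f.write(training_csv_text(samples))
```

Using the csv module rather than joining strings by hand keeps samples that contain commas or quotation marks in a single text cell.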

When adding content to your CSV file, keep the following guidelines in mind:

  • Each text sample/label pair should appear only once in the file.

  • Each text sample should only have one label. Do not include the same text sample multiple times with different labels.

  • Labels need to be consistent across samples. For instance, “positive” and “positive sentiment” will be considered two distinct labels.

  • There should be a minimum of 100 text samples per label. We recommend that you provide more samples than the minimum if you want better performance from the model.

  • If you have more than 50,000 text samples in your CSV file, reach out to your Hyperscience representative for additional guidance before training the model.

  • As is the case for other Hyperscience machine learning models, the quality of the training data dictates the performance of the model. Ground truth errors should be avoided as much as possible.
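Several of these guidelines can be checked automatically before you upload the file. The sketch below is one possible pre-upload check, not a Hyperscience tool: the function names are hypothetical, and it assumes the CSV has a header row with text and label columns. It flags repeated pairs, samples with conflicting labels, labels with fewer than 100 samples, and files with more than 50,000 samples.

```python
import csv
from collections import Counter, defaultdict

MIN_SAMPLES_PER_LABEL = 100    # minimum stated in the guidelines
LARGE_FILE_THRESHOLD = 50_000  # above this, ask your representative first


def check_training_rows(rows):
    """Return a list of problems found in a list of (text, label) tuples."""
    problems = []
    pair_counts = Counter(rows)
    labels_by_text = defaultdict(set)
    for text, label in rows:
        labels_by_text[text].add(label)

    # Each text sample/label pair should appear only once.
    for (text, label), count in pair_counts.items():
        if count > 1:
            problems.append(f"duplicate pair ({count}x): {text!r} / {label!r}")

    # Each text sample should have exactly one label.
    for text, labels in labels_by_text.items():
        if len(labels) > 1:
            problems.append(f"conflicting labels for {text!r}: {sorted(labels)}")

    # Count unique samples per label and compare with the minimum.
    label_counts = Counter(label for _, label in pair_counts)
    for label, count in sorted(label_counts.items()):
        if count < MIN_SAMPLES_PER_LABEL:
            problems.append(f"label {label!r} has only {count} samples")

    if len(rows) > LARGE_FILE_THRESHOLD:
        problems.append(f"{len(rows)} samples: contact your representative")
    return problems


def load_rows(path):
    """Read (text, label) tuples from a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return [(row["text"], row["label"]) for row in csv.DictReader(f)]
```

Calling check_training_rows(load_rows("training_data.csv")) returns an empty list when none of these checks fail. The per-label counts also make near-duplicate labels such as "positive" and "positive sentiment" easier to spot, although the script cannot decide on its own that two labels were meant to be the same.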

3.  Create a release for text classification and obtain its UUID.

  1. Choose a layout variation whose extracted text you would like to classify. You can choose any Structured or Semi-structured layout variation.

  2. Find the layout variation in the Layout Library (Library > Layouts), and note the name of the field whose extracted text you would like to classify.

    • You can classify text from multiple fields, but the same model and training data will be used to classify all of those fields.

  3. Create a release that contains the layout variation.

    • The release can contain other layout variations as well, and the model will classify any fields that have the names you noted in the Layout Library.

    • To learn how to create a release, see Adding a New Release.

  4. Export the release.

  5. Open the release's ZIP file, and in its release subfolder, find the release's metadata file. Its name is in the format _.json.

  6. Open the file, copy the value for uuid, and paste it in a text file.
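If you prefer not to unpack the ZIP file by hand, the last two substeps can be sketched in Python. This is a hedged example rather than an official utility: it assumes the subfolder is literally named release and that the metadata file stores the identifier under a top-level uuid key. The function name is hypothetical.

```python
import json
import zipfile


def find_release_uuids(zip_path, subfolder="release"):
    """Return {file name: uuid} for JSON files inside the given subfolder.

    Assumptions: the subfolder name and the top-level "uuid" key.
    """
    found = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            folders = name.split("/")[:-1]
            if subfolder not in folders or not name.lower().endswith(".json"):
                continue
            data = json.loads(archive.read(name))
            if isinstance(data, dict) and "uuid" in data:
                found[name] = data["uuid"]
    return found
```

The function returns every match instead of a single value, so if the subfolder holds more than one JSON file with a uuid key, you can confirm by file name which one is the release's metadata file.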

4.  Edit the flow provided by Hyperscience to include the field name and release UUID.

  1. Open the file for the flow that your Hyperscience representative provided for text classification (text-classification-flow.py).

  2. Search the file for idp_config[Settings.LayoutReleaseUuid], and enter the release's UUID as its value. Put a single quotation mark ( ' ) before and after the UUID.

    IDP_CONFIG_releaseUUID.png

  3. In the same file, search for extract_text_from_output, and find its fields_to_extract property. In its array, enter the name of the field whose extracted text you want to classify. Put a single quotation mark ( ' ) before and after the field name.

    extract_text_from_output.png

  4. In the same file, search for extract_for_custom_supervision, and find its fields property. In its array, enter the name of the field whose extracted text you want to classify. Put a single quotation mark ( ' ) before and after the field name.

    extract_for_custom_supervision_block.png

  5. Save your changes, then generate a JSON file for the flow.
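Copy-and-paste slips, such as a stray space or a field name that differs between the two arrays, are easy to make in these edits. The sketch below checks the values before you paste them into the flow file. It is a standalone helper with a hypothetical name, it does not read or modify text-classification-flow.py, and it assumes the release UUID is in the standard UUID format.

```python
import uuid


def check_flow_values(release_uuid, fields_to_extract, supervision_fields):
    """Return a list of problems with the values destined for the flow file."""
    problems = []

    # uuid.UUID rejects malformed strings, including ones with stray spaces.
    try:
        uuid.UUID(release_uuid)
    except (ValueError, AttributeError, TypeError):
        problems.append(f"release UUID is not well formed: {release_uuid!r}")

    for name in list(fields_to_extract) + list(supervision_fields):
        if not name or name != name.strip():
            problems.append(f"field name is empty or has stray spaces: {name!r}")

    # Both arrays should name the same field(s) to classify.
    if set(fields_to_extract) != set(supervision_fields):
        problems.append("fields_to_extract and fields list different names")
    return problems
```

For example, check_flow_values('123e4567-e89b-12d3-a456-426614174000', ['comment'], ['comment']) returns an empty list; the UUID and the field name here are placeholders.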

5.  Import the edited flow and obtain its UUID.

  1. In the application, import the JSON file for the flow. 

    • For more information on importing flows, see "Import a new flow to your instance" in Managing Flows.

      Flow Studio opens to show the blocks in your imported flow.

  2. At the bottom of the left-hand sidebar, find Flow UUID, and copy its value into a text file.

6.  Train the flow's text classification model and obtain the model's UUID.

  1. Go to <instance URL>/admin/form_extraction/textclassificationtrainertask/import-file/.

  2. In Choose a Flow, select the UUID for the text classification flow.

    ModelTrainingChooseAFlow.png

  3. In Upload file, click Choose file, find the training CSV file you created in step 2, and click Upload File.

    The training process begins automatically. Depending on how much training data you included in the CSV file, training can take anywhere from 1 hour to 24 hours.

    After the training is completed, a TRAINER_TASK_COMPLETED notification appears in your Notification Center.

  4. Open the Notification Center, and in the TRAINER_TASK_COMPLETED notification, find the model_id value for the task, and copy it into a text file.

7.  Enter the model's UUID in the flow's Text Classification block.

  1. On the Flows page, find the text classification flow, and click its name.

  2. Scroll to the right to find the flow's Text Classification block, and click on it.

  3. In the Text Classification Identifier field, enter the model's UUID, and click Save.

8.  Process submissions with the text classification flow and view their labels.

You can now use your text classification flow to process submissions. To learn how to create submissions with a specific flow, see How a File Becomes a Submission.

For each submission, the application uses the model to apply a label to the field you specified in the flow's code. If it cannot apply a label with high confidence, the system creates a Custom Supervision task so a keyer can apply the correct label to the field.

TextClassificationCustomSupervision.png

To view the label applied to a field:

  1. On the Submissions page, find a submission that was processed through your text classification flow, and click on its Submission ID.

  2. Click Actions, and then click View Transformed Output.

  3. Search for the prediction element of the output, and find the label applied to each classified field. In the example below, multiple fields with the same name were classified in the submission.

    TextClassificationOutputLabels.png
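If you save the transformed output to a file, you can also locate the prediction elements programmatically. The sketch below assumes only that the output is JSON and that labels sit under keys named prediction; it makes no other assumption about the structure, and the function names are hypothetical.

```python
import json


def find_predictions(node, path="$"):
    """Yield (path, value) for every key named "prediction" in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            child_path = f"{path}.{key}"
            if key == "prediction":
                yield child_path, value
            # Keep searching below this key, whatever its name is.
            yield from find_predictions(value, child_path)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_predictions(item, f"{path}[{index}]")


def predictions_from_file(json_path):
    """Load a saved transformed output file and list its prediction elements."""
    with open(json_path, encoding="utf-8") as f:
        return list(find_predictions(json.load(f)))
```

Each result pairs a path with the value found there, so when multiple fields with the same name were classified, the paths show which prediction belongs to which occurrence.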