Text Classification

With Hyperscience’s Text Classification feature, you can train a model to classify freeform text in documents, emails, and more. This feature allows you to analyze and organize unstructured text by user intent, sentiment, topic, or any custom labels based on your own business rules.

For example, you can use the Text Classification feature to:

Classify the medical disorders listed in a patient intake form based on symptoms and severity.
Determine the sentiment expressed in a customer comment and apply a label to it (e.g., "positive," "negative").
Categorize and prioritize emails by customer intent (e.g., inquiries, complaints, requests to change account information).

Text Classification in v37

We’ve made significant changes to Text Classification since it was first introduced in v34. This section gives an overview of how the feature works in v37 and the limitations that still exist.

How it works

To use Text Classification in v37, you need a flow provided by Hyperscience that classifies text from a specific type of document. This flow contains the blocks needed to make Text Classification work:

A Full Page Transcription (Submission) Block that transcribes the content of the submission into a single block of text
A Custom Code Block that formats the transcribed text for processing in the Text Classification Block
A Text Classification Block that classifies text into one or more classes included in the training data. When it classifies a submission, it assigns a class or classes to it, depending on whether the block uses a single-class or multi-class model.
A second Custom Code Block that formats the submission’s JSON output and adds information about the class or classes assigned to it.

You can also request that a Custom Supervision Block be included in your flow, which allows your keyers to add or edit classifications manually.

As mentioned previously, each submission can have more than one class from a single dataset assigned to it. In earlier versions, multiple Text Classification Blocks were required to assign multiple classes from a single dataset.

Limitations in v37

There is no pre-built Supervision block for Text Classification; you need to add a Custom Supervision Block to your flow in order for your keyers to perform Text Classification Supervision tasks.
A model can have multiple datasets, but only one dataset can be active at a time. As a result, each Text Classification Block can add classes from only one dataset to each document. If you would like to add classes from multiple datasets to a flow’s documents (e.g., one for product mentioned, another for sentiment), you can add a Text Classification Block for each dataset you would like to add classes from.
There are no Text Classification QA tasks in v37. To increase model performance, edit classifications in the model's Training Data table as needed.

Setting up and using Text Classification

1. Obtain the Text Classification flow from Hyperscience.

If you are interested in using Text Classification, reach out to your Hyperscience representative to discuss your intended use of the feature. If the feature meets your needs, your representative will share a file containing the code for a Text Classification flow.

2. Create a CSV or ZIP file with training data, if you haven’t already.

To train the Text Classification model, you need to provide a CSV or ZIP file that contains your training data. The model uses the labeled text samples in this training data to learn how to classify new text.

You can download a sample CSV or ZIP file to use as a template for formatting your training data. To do so:

Go to Library > Models, and click on Text Classification Models in the drop-down list at the top of the page.
Click Create New Dataset.
Enter text in the Training Dataset Name text box.
Click on the .CSV or .ZIP tab, depending on how you would like to upload your training data.
Click on Download Sample .csv or Download Sample .ZIP.

CSV file

The CSV file needs to have two columns:

text, which contains the text samples you want to train with, and
classes, which contains one or more labels (classes) for each text sample. If you want to assign multiple classes to a text sample, enter the classes as a semicolon-separated list (e.g., toxic;obscene).
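As an illustration of the two-column layout described above, here is a minimal Python sketch that writes a training CSV. The sample texts and class names are hypothetical, and the sketch assumes the header row uses the column names text and classes exactly as listed; compare the result against the sample file downloaded from the application before importing it.

```python
import csv
import io

# Hypothetical training samples: (text, list of classes for that text).
samples = [
    ("The agent resolved my issue quickly.", ["positive"]),
    ("You people are useless and rude.", ["toxic", "negative"]),
]


def write_training_csv(rows, out):
    """Write rows to `out` with a `text` column and a `classes` column."""
    writer = csv.writer(out)
    writer.writerow(["text", "classes"])
    for text, classes in rows:
        # Multiple classes go into one cell as a semicolon-separated list.
        writer.writerow([text, ";".join(classes)])


buffer = io.StringIO()
write_training_csv(samples, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file instead of an in-memory string, pass a file opened with open("training.csv", "w", newline="", encoding="utf-8") as the out argument. The csv module quotes any text sample that contains commas or line breaks.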

ZIP file

The ZIP file needs to have:

One .txt file for each training document.
A classes.csv file with two columns:
- document, which contains the file names of the training documents in the ZIP file, and
- classes, which contains one or more labels (classes) for each training document. If you want to assign multiple classes to a training document, enter the classes as a semicolon-separated list (e.g., toxic;obscene).
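The ZIP layout described above can be assembled with Python's standard library. The following is a sketch only: the file names, texts, and class names are hypothetical, and placing all files at the root of the archive is an assumption that should be checked against the sample ZIP downloaded from the application.

```python
import csv
import io
import zipfile

# Hypothetical training documents: file name -> (document text, classes).
documents = {
    "doc_001.txt": ("I was charged twice for the same order.", ["complaint"]),
    "doc_002.txt": ("Please update my address. Also, when does my plan renew?",
                    ["account_change", "inquiry"]),
}


def build_training_zip(docs):
    """Return the bytes of a ZIP with one .txt per document plus classes.csv."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        index = io.StringIO()
        writer = csv.writer(index)
        writer.writerow(["document", "classes"])
        for name, (text, classes) in docs.items():
            zf.writestr(name, text)  # one .txt file per training document
            writer.writerow([name, ";".join(classes)])
        zf.writestr("classes.csv", index.getvalue())
    return archive.getvalue()


zip_bytes = build_training_zip(documents)
with open("training_data.zip", "wb") as handle:
    handle.write(zip_bytes)
```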

Guidelines

When adding content to your CSV or ZIP file, keep the following guidelines in mind:

Each combination of text sample and class list should appear only once.
Each text sample should have only one class list. Do not include the same text sample multiple times with different class lists.
Classes need to be consistent across samples. For instance, “positive” and “positive sentiment” will be considered two distinct classes.
There should be a minimum of 10 text samples per class. We recommend that you provide more samples than the minimum if you want better performance from the model.
As is the case for other Hyperscience machine learning models, the quality of the training data dictates the performance of the model. Ground truth errors should be avoided as much as possible.
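Several of these guidelines can be checked before import. The sketch below is not part of the Hyperscience product; it is a hypothetical pre-import check that flags duplicate rows, text samples with conflicting class lists, and classes with fewer than 10 samples. The function names and sample rows are invented for illustration.

```python
from collections import Counter, defaultdict

MIN_SAMPLES_PER_CLASS = 10  # minimum stated in the guidelines above


def parse_classes(cell):
    """Split a semicolon-separated classes cell into a sorted tuple of labels."""
    return tuple(sorted({label.strip() for label in cell.split(";") if label.strip()}))


def check_training_rows(rows):
    """Return a list of human-readable problems found in (text, classes) rows."""
    pair_counts = Counter()
    class_lists_by_text = defaultdict(set)
    class_counts = Counter()
    for text, cell in rows:
        labels = parse_classes(cell)
        pair_counts[(text, labels)] += 1
        class_lists_by_text[text].add(labels)
        class_counts.update(labels)

    problems = []
    for (text, _labels), count in pair_counts.items():
        if count > 1:
            problems.append(f"duplicate row ({count} copies): {text!r}")
    for text, class_lists in class_lists_by_text.items():
        if len(class_lists) > 1:
            problems.append(f"conflicting class lists: {text!r}")
    for name, count in sorted(class_counts.items()):
        if count < MIN_SAMPLES_PER_CLASS:
            problems.append(
                f"class {name!r} has {count} sample(s), minimum is {MIN_SAMPLES_PER_CLASS}"
            )
    return problems


# Hypothetical rows: ten valid samples, one duplicate, one conflicting sample.
rows = [(f"sample {i}", "positive") for i in range(10)]
rows += [("sample 0", "positive"), ("sample 1", "negative")]
problems = check_training_rows(rows)
for problem in problems:
    print(problem)
```

A script like this cannot tell whether two differently named classes, such as "positive" and "positive sentiment", were meant to be the same class, so review the list of class names manually as well.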

3. Import the training dataset and obtain its source UUID.

Go to Library > Models, and click on Text Classification Models in the drop-down list at the top of the page.
Click Import Dataset.
Do one of the following:
- Drag and drop the CSV or ZIP file containing your training data into the box provided.
- Click Choose File, find the CSV or ZIP file on your machine, and open it.
Click Import.
After you import the dataset, the system creates a Text Classification model. You can view the details page for this model by clicking on the name of the dataset in the Text Classification Models view of the Models page.
- If each text sample (in CSV files) or training document (in ZIP files) in your dataset has only one class assigned to it, the model’s Model Type is Single-class.
- If at least one text sample or training document has multiple classes assigned to it, the model’s Model Type is Multi-class.
Copy the dataset’s Source UUID value in the Text Classification Models view of the Models page. Save this value, as you will need to enter it in the Text Classification Block’s settings in the next step.
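To anticipate which Model Type a dataset will produce, the rule described in this step can be sketched in a few lines of Python. This is an illustration of the rule as stated here, not Hyperscience's implementation; the function name is hypothetical.

```python
def infer_model_type(class_cells):
    """Apply the rule above to the values of a dataset's classes column."""
    for cell in class_cells:
        labels = [label for label in cell.split(";") if label.strip()]
        if len(labels) > 1:
            # At least one sample has several classes assigned to it.
            return "Multi-class"
    return "Single-class"


print(infer_model_type(["toxic", "obscene", "toxic"]))
print(infer_model_type(["toxic;obscene", "toxic"]))
```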

4. Import the provided flow and enter the source UUID of the dataset.

In the application, import the JSON file for the flow.
- For more information on importing flows, see "Import a new flow to your instance" in Managing Flows.
  Flow Studio opens to show the blocks in your imported flow.
Click on the Text Classification Block, and enter the source UUID you obtained in the previous step in the Text Classification Source UUID field.
Click Save.

After you’ve finished this step, the model is connected to the flow. You can deploy the flow and send submissions to it.

5. Process submissions with the flow and view their classifications.

You can now use your Text Classification flow to process submissions. To learn how to create submissions with a specific flow, see How a File Becomes a Submission.

For each submission, the application uses the model to apply a class or classes to it. If it cannot classify the document with high confidence, the system can create a Custom Supervision task so a keyer can apply the correct classification to the text.

To view the classification applied by the machine or a keyer:

On the Submissions page, find a submission that was processed through your Text Classification flow, and click on its Submission ID.
Click Actions, and then click View Transformed Output.
Search for the predictions element of the output, and find the class or classes applied to the submission.
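If you save the transformed output to a file, the search for the predictions element can be scripted. The sketch below makes no assumption about where predictions sits in the output; it walks the whole JSON structure and collects every value stored under that key. The sample payload is hypothetical and far simpler than real output.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Hypothetical stand-in for a submission's transformed output.
sample_output = json.loads(
    '{"submission": {"id": 1, "documents": [{"predictions": ["complaint"]}]}}'
)
found = list(find_key(sample_output, "predictions"))
print(found)
```

For real output, replace the sample payload with json.load applied to the saved file.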

You can also find the class or classes applied by the machine in the output of the Text Classification Block on the submission’s Flow Run page.

Managing training data

You cannot add or delete training documents from a dataset. If you need to change the training documents in a dataset, create a new dataset with the training documents you would like to include.

Editing the classes assigned to a document

You can change the classes assigned to training documents. To do so:

Go to Library > Models, and click on Text Classification Models in the drop-down list at the top of the page.
Find the dataset containing the classes you want to edit, and click on its name.
In the Training Documents card at the bottom of the page, click Edit Annotations for the training document whose class or classes you want to edit.
Select a new class or classes for the training document in the drop-down list in the right-hand sidebar, and click Complete Task.

After updating the training data, we recommend re-training the model by clicking Run Training on the details page for the dataset.

Managing a dataset’s Text Classification models

You can have multiple models for a single dataset, but only one can be deployed for each dataset at a time.

If you have another model for a dataset (e.g., one created in another environment), you can import it by clicking Import Model on the details page for the dataset.