Trainer Resiliency

This feature is available in v40 and later.

Trainer Resiliency allows you to automatically create checkpoints of training data and model-training progress. The system uses these checkpoints to resume model-training tasks if training is interrupted.

With this feature, you can create checkpoints for the following types of models:

  • Field Identification

  • Table Identification

  • Long-form Extraction

Enabling Trainer Resiliency

You can enable Trainer Resiliency for any combination of the supported model types by adding their respective variables to your application’s “.env” file:

# Field Identification
DEEP_FLEX_MODEL_CHECKPOINTING_ENABLED=true

# Table Identification
TABLE_MODEL_CHECKPOINTING_ENABLED=true

# Long-form Extraction
UNLP_MODEL_CHECKPOINTING_ENABLED=true

If you don’t have access to your “.env” file, your Hyperscience representative can enable Trainer Resiliency in the /admin section of the application instead. Trainer Resiliency settings made in /admin override any settings for the feature specified in the “.env” file.

Configuring checkpoint intervals

By default, checkpoints of training data and model-training progress are created every 30 minutes. You can change that interval by adding the following variable to your trainer’s “.env” file:

MODEL_CHECKPOINT_CREATION_INTERVAL_MIN=<interval_in_minutes>

The optimal interval depends on several factors, including the types of models being trained, the amount of training data, and the available hardware resources. If your training jobs typically take days to complete, consider increasing the interval to 3 to 4 hours.
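
For example, to create checkpoints every 3 hours, set the variable to 180 minutes:

MODEL_CHECKPOINT_CREATION_INTERVAL_MIN=180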

Storing checkpoint data

Checkpoint data is stored in /var/www/forms/forms/media/temp/trainer/<trainer_id>/<task_id>/npcache/artifacts. This location cannot be changed.

Checkpoint data requires up to 6 GB of additional server space. Ensure that this capacity is available on the volume that contains the checkpoint location.
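
Before training, you can check how much space is available on that volume. The command below is a minimal example that assumes shell access to the trainer host and that the trainer directory already exists; it reports the free space on the filesystem containing that path:

df -h /var/www/forms/forms/media/temp/trainer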

Potential effects on model performance

Interruptions to training may cause training data to be restored in an order that differs from the order in which it was originally recorded. As a result, you may notice increases or decreases in automation rates in models whose training data is restored with Trainer Resiliency.