↔️ How to select thresholds

What are thresholds and why do I need to use them?

A threshold is a cut-off value used to convert model scores into binary decisions for a category.

If you are using Unitary's product, you will receive outputs in the form of scores for every class that each product classifies. The score is a number between 0 and 1 that represents how present the item or characteristic is in the content (where 1 is the most present), but the distribution is not normalised across classes: for example, a score of 0.5 may mean "20% confidence" for category A but "70% confidence" for category B.

To use these scores to build automated moderation rules or to classify content, you will need to select thresholds that determine which range of scores counts as a positive detection for each characteristic.

For example, if you want to block toxic content, you will need to decide the minimum toxicity score at which you will treat content as toxic. Depending on your use case, you may want multiple thresholds, for example one for ‘definitely toxic - auto-block’ and a lower threshold for ‘maybe toxic - send to review’.
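A minimal sketch of what such a two-tier rule could look like in code (the function name and threshold values here are illustrative assumptions, not Unitary defaults):

def moderate_toxicity(score: float, block_threshold: float = 0.9, review_threshold: float = 0.5) -> str:
    """Turn a toxicity score into a moderation decision using two thresholds."""
    if score >= block_threshold:
        return "auto-block"      # definitely toxic
    if score >= review_threshold:
        return "send to review"  # maybe toxic
    return "allow"

print(moderate_toxicity(0.95))  # auto-block
print(moderate_toxicity(0.60))  # send to review
print(moderate_toxicity(0.10))  # allow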

The optimum threshold for each class will differ based on your use case, data and risk appetite

The ideal threshold (for a given class) is the value that gives the best classification performance, on your own data, for the metrics that are important to your use case.

This ideal threshold will depend on:

  • The nature of your data

  • Your use case and policy - what you are looking to classify

  • What you are optimising for and your risk appetite

    • The overall ‘optimum’ threshold is the value that gives the smallest combined number of false positives and false negatives

    • However, you may want to optimise for different metrics depending on your use case and risk appetite. For example, you may want to catch as much toxic content as possible, even if that means over-flagging and catching some safe content too. In this case, your ideal threshold will be lower than the overall optimum, because minimising false negatives matters more to you than minimising false positives

We recommend selecting your own thresholds using a data-driven approach that optimises performance. We provide guidance on this below (option 1), along with some alternative options.

Thresholds should be selected per class

Currently, Unitary scores are uncalibrated, meaning the score distribution is different for each class. You will therefore need to set a threshold separately for each class (e.g. a 0.7 for pepe frog won't necessarily mean the same level of confidence as a 0.7 for confederate flag).
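A minimal sketch of per-class thresholding (the scores and threshold values here are illustrative; the class names come from the tables later in this document):

# One threshold per class; a single global cut-off would not work for uncalibrated scores
thresholds = {
    "pepe_frog": 0.99,
    "confederate_flag": 0.99,
    "toxic": 0.90,
}

scores = {"pepe_frog": 0.72, "confederate_flag": 0.995, "toxic": 0.40}

flagged = [cls for cls, score in scores.items() if score >= thresholds[cls]]
print(flagged)  # ['confederate_flag']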

Option 1: Data-driven thresholds

If you have ground-truth labels for a test dataset that is representative of your production use case, you can use a data-driven approach to select the thresholds that give you the highest classification performance. This is our recommended approach, and we are happy to help you out.

This is a fairly straightforward exercise that most data scientists will be comfortable with. You can find a number of resources and guides online. A simple approach is outlined below.

A simple method for data-driven thresholding

First, prepare a labelled dataset that is representative of the content you want to classify with Unitary's product.

A random sample is adequate. Labels should be provided for all the classes you want to use.

Second, decide on the performance metric you would like to optimise for. If you aren’t sure, F1 will provide you with a balance of precision and recall.

Third, optimise the thresholds; a simple iterative approach is often sufficient. We can loop over thresholds between 0 and 1. In the example below we try every threshold in steps of 0.01, and each time a threshold beats the best F1 score found so far, we store it. Below is a basic example for optimising two thresholds on some dummy data.

from typing import Tuple

import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

def create_random_data(seed: int = 134617456) -> pd.DataFrame:
    """
    Function to create a DataFrame of randomly generated data with two characteristics and corresponding labels.
    """
    np.random.seed(seed)
    characteristic_A = np.random.rand(10)
    characteristic_B = np.random.rand(10)
    true_label_A = np.random.randint(0, 2, 10)
    true_label_B = np.random.randint(0, 2, 10)

    df = pd.DataFrame(
        {
            "Characteristic A": characteristic_A,
            "True Label A": true_label_A,
            "Characteristic B": characteristic_B,
            "True Label B": true_label_B,
        }
    )
    return df

def find_best_threshold(
    data: pd.DataFrame, thresholds: np.ndarray, characteristic: str, true_label: str
) -> Tuple[float, float]:
    """
    Function to find the best threshold for a given characteristic based on its F1 score.

    Parameters:
        data (pd.DataFrame): A DataFrame containing the characteristic and labels.
        thresholds (np.ndarray): An array of threshold values to test.
        characteristic (str): The name of the characteristic column.
        true_label (str): The name of the corresponding ground truth label column.

    Returns:
        best_threshold (float): The best threshold for the characteristic.
        best_f1 (float): The corresponding F1 score.
    """
    best_threshold = None
    best_f1 = 0

    for threshold in thresholds:
        predicted_labels = (data[characteristic] > threshold).astype(int)
        f1 = f1_score(data[true_label], predicted_labels)

        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold

    return best_threshold, best_f1

if __name__ == "__main__":
    # Generate some example data
    data = create_random_data()

    # Specify a range of thresholds to test
    thresholds = np.arange(start=0.01, stop=1.01, step=0.01)

    # For each characteristic, find the best threshold
    threshold_results = {}
    threshold_results["Characteristic A"] = find_best_threshold(
        data, thresholds, characteristic="Characteristic A", true_label="True Label A"
    )
    threshold_results["Characteristic B"] = find_best_threshold(
        data, thresholds, characteristic="Characteristic B", true_label="True Label B"
    )

    # Print out the best threshold and corresponding F1 score for each characteristic
    for characteristic, (threshold, f1) in threshold_results.items():
        print(
            f"Best threshold for {characteristic}: {threshold}, Best F1 score: {f1:.2f}"
        )

This is the simplest approach. Other options include a finer grid search, plotting the performance curve against the threshold and reading off the point that maximises your chosen metric, or gradient-based optimisation where your metric is smooth enough.
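As an alternative to a fixed step size, you can also evaluate only the thresholds implied by the data itself, for example with scikit-learn's precision_recall_curve. A minimal sketch, reusing the create_random_data helper from the script above:

from typing import Tuple

import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold_from_pr_curve(labels: np.ndarray, scores: np.ndarray) -> Tuple[float, float]:
    """Pick the threshold that maximises F1, testing only thresholds that occur in the scores."""
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    # precision and recall have one more entry than thresholds; drop the final point
    denominator = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    f1_scores = 2 * precision[:-1] * recall[:-1] / denominator
    best = int(np.argmax(f1_scores))
    return float(thresholds[best]), float(f1_scores[best])

# Example, reusing create_random_data() defined in the script above
data = create_random_data()
print(best_threshold_from_pr_curve(data["True Label A"].to_numpy(), data["Characteristic A"].to_numpy()))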

Pros

  • Best performance

Cons

  • Requires some technical overhead

  • Requires labelled data

Option 2: Manual thresholds

In the absence of ground-truth labels or technical resources, one approach you can take is to “eyeball” the data and choose a threshold that gives broadly good performance for your use cases.

For each class, we recommend:

  • Sorting your dataset from high to low by that class's score (see the sketch after this list)

  • Reviewing your content, starting with the highest scores

  • Finding a suitable score threshold. For example, if you are optimising for overall performance, this will be the score above which most of the content contains the harm you are looking to classify
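A minimal sketch of the sorting step, assuming your scores sit in a pandas DataFrame with one (hypothetical) column per class:

import pandas as pd

# Hypothetical scores for one class, one row per piece of content
df = pd.DataFrame(
    {
        "content_id": ["c1", "c2", "c3", "c4"],
        "toxic": [0.92, 0.41, 0.77, 0.08],
    }
)

# Sort from high to low by the class score, then review from the top down,
# noting the score at which genuinely harmful content stops appearing
for _, row in df.sort_values("toxic", ascending=False).iterrows():
    print(f"{row['content_id']}: toxic = {row['toxic']:.2f}")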

Pros

  • No technical resource or labelled data required

Cons

  • Requires some manual effort

  • Performance is not likely to be as good as option 1

Option 3: Unitary default thresholds

The lowest-lift approach to setting thresholds is to use our recommended default thresholds, which are optimised on internal data sets. Please note that these data sets may not be representative of your own specific moderation use case. For the most accurate results, fine-tuned to your use case, we recommend option 1 or 2 (above).

The performance these give on your data is highly dependent on your dataset, use case and risk appetite, so we suggest evaluating performance - either with data or manually.

You may want to use a combination of approaches, for example starting with Unitary default thresholds and “eyeballing” adjustments to them based on your dataset.
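A minimal sketch of that combined approach (the default values are taken from the table below; the override value is an illustrative assumption):

# Start from Unitary default thresholds, then override individual classes
# after manually reviewing your own data
default_thresholds = {
    "toxic": 0.900,
    "firearm": 0.900,
    "meme": 0.722,
}

# Illustrative adjustment based on "eyeballing" your own content
overrides = {"toxic": 0.800}

thresholds = {**default_thresholds, **overrides}
print(thresholds)  # {'toxic': 0.8, 'firearm': 0.9, 'meme': 0.722}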

Thresholds for Items and Characteristics

| Unitary Class Name | Threshold Optimized for F1 |
| --- | --- |
| adult_content | 0.355 |
| suggestive | 0.345 |
| medical | 0.430 |
| violence | 0.512 |
| over_18 | 0.318 |
| adult_toys | 0.587 |
| firearm | 0.900 |
| knife | 0.810 |
| violent_knife | 0.990 |
| alcohol | 0.900 |
| drink | 0.900 |
| smoking_and_tobacco | 0.950 |
| marijuana | 0.935 |
| pills | 0.900 |
| recreational_pills | 0.984 |
| confederate_flag | 0.990 |
| pepe_frog | 0.990 |
| nazi_swastika | 0.990 |
| artistic | 0.769 |
| comic | 0.931 |
| meme | 0.722 |
| photo | 0.626 |
| screenshot | 0.756 |
| map | 0.933 |
| poster_cover | 0.798 |
| game_screenshot | 0.826 |
| face_filter | 0.579 |
| promo_info_graphic | n/a |
| toxic | 0.900 |
| severe_toxic | 0.900 |
| obscene | 0.600 |
| insult | 0.900 |
| identity_hate | 0.900 |
| threat | 0.900 |
| sexual_explicit | 0.66 |
| middle_finger_gesture | coming soon! |
| child | 0.912 |
| toy | 0.934 |
| gambling_machine | coming soon! |

Thresholds for Standard

You may use the high-recall threshold to flag all of the content that you want to double-check for potential harm, where high recall means letting as few harms as possible go unnoticed.

You may use the high-precision threshold to make automated decisions based on whether the content is harmful, where high precision means raising as few false alarms as possible.

| Modality | Unitary Class | High recall threshold | High precision threshold |
| --- | --- | --- | --- |
| Video | Adult & Sexual | 0.0587 | 0.8439 |
| Video | Non-medical Drugs | 0.0575 | 0.5644 |
| Video | Violence and Injury | 0.0329 | 0.4048 |
| Video | Weapons and Firearms | 0.1929 | 0.9170 |
| Video | Hate Speech and Hate Symbols | 0.0110 | 0.2758 |
| Image | Adult & Sexual | 0.3393 | 0.5500 |
| Image | Non-medical Drugs | 0.0110 | 0.4875 |
| Image | Violence and Injury | 0.2508 | 0.8319 |
| Image | Weapons and Firearms | 0.6204 | 0.8296 |
| Image | Hate Speech and Hate Symbols | 0.0989 | 0.5758 |
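A minimal sketch of how the two thresholds might be combined into a routing rule (the function and label names are illustrative, not part of Unitary's API; the example values are the image "Weapons and Firearms" thresholds from the table above):

def route(score: float, high_recall: float, high_precision: float) -> str:
    """Map a class score to a moderation action using the two thresholds."""
    if score >= high_precision:
        return "auto-action"    # confident enough to act automatically
    if score >= high_recall:
        return "human-review"   # flag for double-checking
    return "allow"

print(route(0.91, high_recall=0.6204, high_precision=0.8296))  # auto-action
print(route(0.70, high_recall=0.6204, high_precision=0.8296))  # human-review
print(route(0.10, high_recall=0.6204, high_precision=0.8296))  # allow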

Pros

  • No technical resource, manual effort or labelled data required

Cons

  • Performance is not likely to be as good as option 1 and is highly dependent on data

Glossary

F1 Score: This is a measure of a model's accuracy on a dataset. It is used to balance the precision and recall of a model and is the harmonic mean of precision and recall. The F1 score ranges from 0 to 1, where 1 represents perfect precision and recall, and 0 is the worst possible score.

Precision: This is the fraction of correctly identified positive results out of all predicted positive results. For example, if your model predicted that 100 messages are spam and 90 of them were actually spam, then the precision of your model is 0.90.

Recall: Also known as sensitivity or true positive rate, this is the fraction of the total amount of relevant instances that were actually retrieved. For instance, if there were actually 200 spam messages and your model identified 150, then the recall of your model would be 0.75.

False Positives: These are instances where the model incorrectly predicted a positive result. In the context of spam detection, a false positive would occur if the model incorrectly flagged a legitimate message as spam.

False Negatives: These are instances where the model incorrectly predicted a negative result. In a spam detection scenario, a false negative would occur if a spam message was incorrectly identified as legitimate by the model.
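A minimal sketch tying these definitions together, using the spam-detection numbers from the definitions above (the final line treats the precision and recall figures as if they came from the same model):

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# Precision example: 100 messages predicted as spam, 90 of them actually spam
print(precision(tp=90, fp=10))   # 0.90
# Recall example: 200 actual spam messages, 150 of them identified
print(recall(tp=150, fn=50))     # 0.75
# F1 combining the two figures
print(round(f1(0.90, 0.75), 2))  # 0.82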
