How to select thresholds
What are thresholds and why do I need to use them?
A threshold is a cut-off value used to convert model scores into binary decisions for a category.
If you are using Unitary's product, you will receive outputs in the form of scores for every class each product classifies. The score is a number between 0 and 1 that represents how present the item or characteristic is in the content (where 1 is the most present). The score distribution is not normalised: for example, 0.5 may mean "20% confidence" for category A but "70% confidence" for category B.
To use these scores to build automated moderation rules or to classify content, you will need to select thresholds to determine what range of scores represents significance for that characteristic.
For example, if you want to block toxic content, you will need to decide the minimum toxicity score at which you will treat the content as toxic. Depending on your use case, you may want multiple thresholds, for example one for ‘definitely toxic - auto-block’ and a lower threshold for ‘maybe toxic - send to review’.
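As an illustration, a two-threshold rule like the one above might look like the sketch below. The cut-off values (0.9 and 0.6) and the single toxicity score input are placeholder assumptions, not recommended defaults.

```python
# A minimal sketch of a two-threshold moderation rule.
# The thresholds below are illustrative assumptions, not recommended values.
AUTO_BLOCK_THRESHOLD = 0.9   # 'definitely toxic - auto-block'
REVIEW_THRESHOLD = 0.6       # 'maybe toxic - send to review'

def route_content(toxicity_score: float) -> str:
    """Return a moderation action for a single toxicity score."""
    if toxicity_score >= AUTO_BLOCK_THRESHOLD:
        return "auto-block"
    if toxicity_score >= REVIEW_THRESHOLD:
        return "send-to-review"
    return "allow"

print(route_content(0.95))  # auto-block
print(route_content(0.70))  # send-to-review
print(route_content(0.20))  # allow
```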
The optimum threshold for each class will differ based on your use case, data and risk appetite
The ideal threshold for a given class is the value that gives the best classification performance, on your data, for the metrics that matter to your use case.
This ideal threshold will depend on:
The nature of your data
Your use case and policy - what you are looking to classify
What you are optimising for and your risk appetite
The overall ‘optimum’ threshold is the value that gives the smallest combined number of false positives and false negatives.
However, you may want to optimise for different metrics depending on your use case and risk appetite. For example, you may want to catch as much toxic content as possible, even if that means overflagging and also catching some safe content. In this case, your ideal threshold will be lower than the overall optimum, because minimising false negatives matters more to you than minimising false positives.
The recommendation is to select your own thresholds using a data-driven approach that optimises performance. We provide guidance on this below, along with some alternative options.
Thresholds should be selected per class
Currently, Unitary scores are uncalibrated, meaning the score distribution is different for each class. Therefore you will need to set a threshold separately for each class (i.e. a 0.7 for Pepe the Frog won't necessarily indicate the same prevalence as a 0.7 for the Confederate flag).
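For illustration, per-class thresholding might look like the sketch below. The threshold values and the shape of the scores dictionary are placeholder assumptions, chosen only to show that each class gets its own cut-off.

```python
# Illustrative per-class thresholds (values are assumptions, not defaults).
# The same raw score can mean different things for different classes,
# so each class is compared against its own cut-off.
class_thresholds = {
    "pepe_frog": 0.45,
    "confederate_flag": 0.70,
}

def flagged_classes(scores: dict) -> list:
    """Return the classes whose score meets that class's own threshold."""
    return [
        class_name
        for class_name, threshold in class_thresholds.items()
        if scores.get(class_name, 0.0) >= threshold
    ]

# A 0.5 clears the pepe_frog threshold but not the confederate_flag one.
print(flagged_classes({"pepe_frog": 0.5, "confederate_flag": 0.5}))
```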
Option 1 [recommended]: data-driven custom thresholds
If you have ground-truth labels for a test dataset that is representative of your production use case, you can use a data-driven approach to select the thresholds that will give you the highest classification performance. This is our recommended approach and we are happy to help you out.
This is a fairly straightforward exercise that most data scientists will be comfortable with. You can find a number of resources and guides online. A simple approach is outlined below.
A simple method for data-driven thresholding
First, prepare a labelled dataset that is representative of the content you want to classify with Unitary's product.
A random sample is adequate. Labels should be provided for all the classes you want to use.
Second, decide on the performance metric you would like to optimise for. If you aren’t sure, F1 will provide you with a balance of precision and recall.
Third, begin optimising the thresholds; a simple iterative approach is often sufficient. We can loop over thresholds between 0 and 1. In the example below we try every threshold in steps of 0.01, and each time we beat the F1 score of the previous best we store the new value. Below is a basic example for optimising two thresholds on some dummy data.
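A minimal sketch of that loop is shown below, assuming scikit-learn is available and using randomly generated dummy scores and labels for two illustrative classes; with real data you would substitute your own scores and ground-truth labels.

```python
# Iteratively search for the threshold that maximises F1, per class.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Dummy data: model scores in [0, 1] and binary ground-truth labels
# for two illustrative classes.
dummy_data = {
    "class_a": {"scores": rng.random(1000), "labels": rng.integers(0, 2, 1000)},
    "class_b": {"scores": rng.random(1000), "labels": rng.integers(0, 2, 1000)},
}

best_thresholds = {}
for class_name, data in dummy_data.items():
    best_f1, best_threshold = 0.0, 0.0
    # Try every threshold from 0 to 1 in steps of 0.01.
    for threshold in np.arange(0.0, 1.01, 0.01):
        predictions = (data["scores"] >= threshold).astype(int)
        score = f1_score(data["labels"], predictions, zero_division=0)
        # Keep the threshold whenever it beats the previous best F1.
        if score > best_f1:
            best_f1, best_threshold = score, threshold
    best_thresholds[class_name] = best_threshold
    print(f"{class_name}: best threshold = {best_threshold:.2f}, F1 = {best_f1:.2f}")
```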
This is the simplest approach. Other options include a finer grid search, or plotting the performance curve across the full threshold range and selecting the point where your chosen metric peaks.
Pros
Best performance
Cons
Requires some technical overhead
Requires labelled data
Option 2: Manual thresholds
In the absence of ground-truth labels or technical resources, one approach you can take is to “eyeball” the data and choose a threshold that gives broadly good performance for your use cases.
For each class, we recommend:
Sort your dataset from high to low by that class's score (see the sketch after this list)
Look at your content, starting with the highest scores
Find a suitable score threshold. For example, if you are optimising for overall performance, you will find a score above which most content contains the harm you are looking to classify
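If your scored content is in a table, the sorting step might look like the sketch below, assuming pandas is available; the toxicity column name and the data are illustrative.

```python
# Sort scored content from high to low for one class, ready for manual review.
import pandas as pd

df = pd.DataFrame({
    "text": ["example 1", "example 2", "example 3"],
    "toxicity": [0.91, 0.12, 0.55],
})

# Review content from the top down to find a sensible cut-off score.
sorted_df = df.sort_values("toxicity", ascending=False)
print(sorted_df[["text", "toxicity"]])
```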
Pros
No technical resource or labelled data required
Cons
Requires some manual effort
Performance not likely to be as good as option 1
Option 3: Unitary default thresholds
The lowest-lift approach to setting thresholds is to use our recommended default thresholds, which are optimised on internal datasets. Please note that these datasets may not be representative of your own specific moderation use case. For the most accurate results, fine-tuned to your use case, we recommend option 1 or 2 (above).
The performance the default thresholds give is highly dependent on your dataset, use case and risk appetite, so we suggest evaluating performance either with labelled data or manually.
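For example, if you have even a small labelled sample, a quick sanity check of a default threshold might look like the sketch below, assuming scikit-learn is available; the 0.5 threshold and the data are illustrative.

```python
# Check precision and recall of one class at a candidate threshold.
from sklearn.metrics import precision_score, recall_score

labels = [1, 0, 1, 1, 0, 0, 1, 0]                    # ground truth for one class
scores = [0.8, 0.4, 0.6, 0.3, 0.7, 0.2, 0.9, 0.1]    # model scores for that class

threshold = 0.5  # candidate default threshold to evaluate
predictions = [int(score >= threshold) for score in scores]

print("precision:", precision_score(labels, predictions))
print("recall:", recall_score(labels, predictions))
```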
You may want to use a combination of approaches, for example starting with Unitary default thresholds and “eyeballing” adjustments to them based on your dataset.
Thresholds for Items and Characteristics
Thresholds for Standard
You may use the high-recall threshold to flag all of the content that you want to double-check for potential harm, where high recall means letting as few harms as possible go unnoticed.
You may use the high-precision threshold to make automated decisions based on whether the content is harmful, where high-precision means raising as few false alarms as possible.
Pros
No technical resource, manual effort or labelled data required
Cons
Performance is not likely to be as good as option 1 and is highly dependent on your data.
Glossary
F1 Score: This is a measure of a model's accuracy on a dataset. It is used to balance the precision and recall of a model and is the harmonic mean of precision and recall. The F1 score ranges from 0 to 1, where 1 represents perfect precision and recall, and 0 is the worst possible score.
Precision: This is the fraction of correctly identified positive results out of all predicted positive results. For example, if your model predicted that 100 messages are spam and 90 of them were actually spam, then the precision of your model is 0.90.
Recall: Also known as sensitivity or true positive rate, this is the fraction of the total amount of relevant instances that were actually retrieved. For instance, if there were actually 200 spam messages and your model identified 150, then the recall of your model would be 0.75.
False Positives: These are instances where the model incorrectly predicted a positive result. In the context of spam detection, a false positive would occur if the model incorrectly flagged a legitimate message as spam.
False Negatives: These are instances where the model incorrectly predicted a negative result. In a spam detection scenario, a false negative would occur if a spam message was incorrectly identified as legitimate by the model.
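As a worked example combining the figures above: a precision of 0.90 and a recall of 0.75 give F1 = 2 × (0.90 × 0.75) / (0.90 + 0.75) = 1.35 / 1.65 ≈ 0.82.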