Evaluating Generated Text Quality using BLEU Score in Natural Language Processing
The BLEU (Bilingual Evaluation Understudy) score is a metric used in Natural Language Processing (NLP) to evaluate the quality of text generated by machines, such as translations. It operates as a precision-based metric, examining the overlap of words or n-grams (contiguous sequences of n items from a text) between the generated text and a reference text.
Here’s a step-by-step breakdown of how BLEU is calculated, followed by a worked example for better understanding:
N-gram Precision:
- For each n-gram in the generated text, count the number of times it appears in both the generated and reference texts.
- The precision for each n-gram level $n$ is calculated as:
  $$p_n = \frac{\sum_{\text{n-gram} \in \text{generated}} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{generated}} \text{Count}(\text{n-gram})}$$
  where the clipped counts are defined in the next step.
Clipping:
- To prevent over-counting, the count of each n-gram in the generated text is clipped to the maximum number of times it appears in the reference text.
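To make the precision and clipping steps concrete, here is a minimal Python sketch written for this article; the helper names `ngrams` and `clipped_ngram_precision` are hypothetical rather than part of any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(candidate, reference, n):
    """Modified (clipped) n-gram precision for one candidate/reference pair."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clip each candidate n-gram count to its count in the reference.
    clipped = sum(min(count, ref_counts[ng]) for ng, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total > 0 else 0.0

# "the" occurs 6 times in the candidate but is clipped to 2 (its reference count).
print(clipped_ngram_precision("the the the the the the",
                              "the cat is on the mat", n=1))  # 2/6 ≈ 0.33
```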
Brevity Penalty (BP):
- To penalize overly short generated texts, a brevity penalty is applied:
  $$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$
  where $c$ is the length of the generated (candidate) text and $r$ is the length of the reference text.
Calculating BLEU:
- The final BLEU score is the geometric mean of the n-gram precisions, multiplied by the brevity penalty:
  $$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
  where $N$ is typically 4 and the weights are usually uniform, $w_n = 1/N$.
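Putting the steps together, the following is a minimal sketch of an unsmoothed BLEU computation with uniform weights and a single reference; the function `simple_bleu` is a hypothetical illustration written for this article, not the official sacrebleu implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU with uniform weights and a single reference."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()

    # Clipped n-gram precisions p_1 ... p_max_n.
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand_tokens, n))
        ref_counts = Counter(ngrams(ref_tokens, n))
        clipped = sum(min(c, ref_counts[ng]) for ng, c in cand_counts.items())
        total = sum(cand_counts.values())
        precisions.append(clipped / total if total > 0 else 0.0)

    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c).
    c, r = len(cand_tokens), len(ref_tokens)
    bp = 1.0 if c > r else math.exp(1 - r / c)

    # Geometric mean of the precisions; any zero precision makes BLEU 0.
    if min(precisions) == 0:
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(simple_bleu("the the the the the the", "the cat is on the mat"))  # 0.0
```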
Example:
Let’s consider an example with the reference sentence "the cat is on the mat" and the generated sentence "the the the the the the".
1-gram Precision:
- Generated Text: "the" appears 6 times.
- Reference Text: "the" appears 2 times.
- Clipped Count: Min(6, 2) = 2.
- Precision $p_1 = 2/6 \approx 0.33$.
2-gram Precision:
- There are no 2-gram overlaps.
- Precision $p_2 = 0/5 = 0$.
3-gram and 4-gram Precision:
- Since there are no matching 3-grams or 4-grams between the generated text and the reference text, $p_3$ and $p_4$ are also 0.
Brevity Penalty:
- Length of Generated Text: $c = 6$.
- Length of Reference Text: $r = 6$.
- Brevity Penalty: since $c \le r$, $BP = e^{\,1 - 6/6} = 1$.
Calculating BLEU:
- BLEU-4 score: $\text{BLEU} = BP \cdot \exp\left(\frac{1}{4}\sum_{n=1}^{4} \log p_n\right) = 0$, because any n-gram precision of 0 drives the geometric mean, and hence the whole score, to 0.
This example illustrates how the BLEU score calculation accounts for both the precision of n-gram overlaps and the adequacy of translation length through the brevity penalty. By evaluating the n-gram precisions and applying a brevity penalty, BLEU provides a balanced measure of text generation quality in NLP tasks.
In addition to the above explanation, a practical example of calculating the BLEU score is available in the book Natural Language Processing with Transformers. The book provides an in-depth explanation along with Python code to compute the BLEU score. Here is the code snippet from the book, with the metric-loading step included so it runs on its own:
import pandas as pd
import numpy as np
from datasets import load_metric

# The book loads the SacreBLEU metric earlier in the chapter; loading it here
# keeps the snippet self-contained.
bleu_metric = load_metric("sacrebleu")
bleu_metric.add(
    prediction="the the the the the the", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])
| | Value |
|---|---|
| score | 0.0 |
| counts | [2, 0, 0, 0] |
| totals | [6, 5, 4, 3] |
| precisions | [33.33, 0.0, 0.0, 0.0] |
| bp | 1.0 |
| sys_len | 6 |
| ref_len | 6 |
In this table, various metrics associated with the BLEU score calculation are showcased. The `counts` row represents the number of n-gram matches for n = 1 to 4, and `totals` denotes the total number of n-grams in the candidate sentence for each n. `precisions` shows the n-gram precision percentages, `bp` is the brevity penalty, and finally `sys_len` and `ref_len` display the lengths of the system (candidate) and reference sentences, respectively.
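As a quick usage check, continuing from the snippet above (this variation is not quoted from the book), the same `bleu_metric` object can be reused with a candidate that actually shares higher-order n-grams with the reference, which yields a non-zero score:

```python
# Reuse the metric object from the snippet above with a more sensible candidate.
bleu_metric.add(
    prediction="the cat is on mat", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
print(results["score"])  # non-zero, since every n-gram order has at least one match
```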
For those interested in a hands-on example, a Jupyter notebook from this book is available on GitHub. The notebook demonstrates how to calculate the BLEU score for text summarization tasks, among other things. You can find the notebook here.