Evaluating Generated Text Quality using BLEU Score in Natural Language Processing

The BLEU (Bilingual Evaluation Understudy) score is a metric used in Natural Language Processing (NLP) to evaluate the quality of text generated by machines, such as translations. It operates as a precision-based metric, examining the overlap of words or n-grams (contiguous sequences of n items from a text) between the generated text and a reference text.
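
To make the idea of n-gram overlap concrete, here is a minimal sketch (illustrative, not tied to any particular library) that extracts and counts the n-grams of a tokenized sentence:

from collections import Counter

def ngram_counts(tokens, n):
    # Count every contiguous sequence of n tokens in the sentence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

print(ngram_counts("the cat is on the mat".split(), 2))
# Counter({('the', 'cat'): 1, ('cat', 'is'): 1, ('is', 'on'): 1,
#          ('on', 'the'): 1, ('the', 'mat'): 1})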

Here’s a step-by-step breakdown of how BLEU is calculated, followed by a worked example:

  1. N-gram Precision:

    • For each n-gram in the generated text, count the number of times it appears in both the generated and reference texts.
    • The precision for each n-gram level n is calculated over the n-grams of the generated text: p_n = \frac{\sum_{\text{n-gram} \in s_{\text{gen}}} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in s_{\text{gen}}} \text{Count}(\text{n-gram})}, where \text{Count}_{\text{clip}} is the clipped count described in the next step.
  2. Clipping:

    • To prevent over-counting, the count of each n-gram in the generated text is clipped to the maximum number of times it appears in the reference text.
  3. Brevity Penalty (BP):

    • To penalize generated texts that are shorter than the reference, a brevity penalty is applied: \text{BP} = \min\left(1, e^{1 - \frac{l_{\text{ref}}}{l_{\text{gen}}}}\right)
  4. Calculating BLEU:

    • The final BLEU score is the geometric mean of the n-gram precisions, multiplied by the brevity penalty: \text{BLEU-}N = \text{BP} \times \left( \prod_{n=1}^{N} p_n \right)^{\frac{1}{N}}
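
Putting the four steps together, here is a minimal from-scratch sketch for a single candidate/reference pair with uniform n-gram weights (function names are illustrative, and edge cases such as empty inputs are not handled):

from collections import Counter
import math

def ngram_counts(tokens, n):
    # Count every contiguous sequence of n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    # Clipped n-gram precision: each candidate n-gram is counted at most as
    # often as it occurs in the reference (steps 1 and 2).
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=4):
    # Brevity penalty (step 3): min(1, e^(1 - l_ref / l_gen)).
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    # Geometric mean of the precisions times the brevity penalty (step 4).
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

Calling bleu("the the the the the the".split(), "the cat is on the mat".split()) reproduces the worked example below and returns 0.0.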

Example:

Let’s consider an example with the reference sentence "the cat is on the mat" and the generated sentence "the the the the the the".

  1. 1-gram Precision:

    • Generated Text: "the" appears 6 times.
    • Reference Text: "the" appears 2 times.
    • Clipped Count: Min(6, 2) = 2.
    • Precision p_1 = \frac{2}{6} \approx 0.3333.
  2. 2-gram Precision:

    • There are no 2-gram overlaps.
    • Precision p_2 = 0.
  3. 3-gram and 4-gram Precision:

    • Since there are no matching 3-grams or 4-grams between the generated and reference texts, p_3 = 0 and p_4 = 0.
  4. Brevity Penalty:

    • Length of Generated Text: l_{\text{gen}} = 6.
    • Length of Reference Text: l_{\text{ref}} = 6.
    • Brevity Penalty: \text{BP} = \min\left(1, e^{1 - \frac{6}{6}}\right) = 1.
  5. Calculating BLEU:

    • BLEU-4 score:
\text{BLEU-4} = 1 \times \left(0.3333 \times 0 \times 0 \times 0\right)^{1/4} = 0

This example illustrates how the BLEU score calculation accounts for both the precision of n-gram overlaps and the adequacy of translation length through the brevity penalty. By evaluating the n-gram precisions and applying a brevity penalty, BLEU provides a balanced measure of text generation quality in NLP tasks.
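
The unigram precision from this example can be checked in a few lines of Python; this sketch covers only the clipping step, with illustrative variable names:

from collections import Counter

reference = "the cat is on the mat".split()
candidate = "the the the the the the".split()

# Clip each candidate word count at its count in the reference.
ref_counts, cand_counts = Counter(reference), Counter(candidate)
clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
print(clipped, len(candidate), clipped / len(candidate))  # 2 6 0.333...

# The bigram "the the" never occurs in the reference, so p_2 (and p_3, p_4)
# are 0, and BLEU-4 is 0 even though the brevity penalty is 1.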

A practical, runnable example of calculating the BLEU score is also given in the book Natural Language Processing with Transformers, which pairs an in-depth explanation with Python code. Here is the code snippet from the book:

import pandas as pd
import numpy as np
# bleu_metric is assumed to be a SacreBLEU metric object created beforehand;
# in the Hugging Face Datasets library used by the book it can be loaded as:
from datasets import load_metric

bleu_metric = load_metric("sacrebleu")

# Score the degenerate candidate against the reference, without smoothing.
bleu_metric.add(
    prediction="the the the the the the", reference=["the cat is on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results["precisions"] = [np.round(p, 2) for p in results["precisions"]]
pd.DataFrame.from_dict(results, orient="index", columns=["Value"])
            Value
score       0.0
counts      [2, 0, 0, 0]
totals      [6, 5, 4, 3]
precisions  [33.33, 0.0, 0.0, 0.0]
bp          1.0
sys_len     6
ref_len     6

This table breaks down the BLEU calculation. The counts row gives the number of clipped n-gram matches for n = 1 to 4, and totals gives the total number of n-grams in the candidate sentence for each n. precisions shows the n-gram precisions as percentages, and bp is the brevity penalty. Finally, sys_len and ref_len are the lengths of the system (candidate) and reference sentences, respectively.
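
Note that load_metric has since been moved out of the Datasets library; with a current Hugging Face setup, the same numbers can typically be reproduced with the evaluate package (a sketch, assuming evaluate and sacrebleu are installed):

import evaluate

bleu_metric = evaluate.load("sacrebleu")
results = bleu_metric.compute(
    predictions=["the the the the the the"],
    references=[["the cat is on the mat"]],
    smooth_method="floor",
    smooth_value=0,
)
print(results["score"], results["precisions"], results["bp"])
# roughly: 0.0 [33.33, 0.0, 0.0, 0.0] 1.0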

For those interested in a hands-on example, the Jupyter notebooks accompanying the book are available on GitHub; among other things, they demonstrate how to calculate the BLEU score for text summarization tasks.