These metrics are typically used to compare the similarity between two texts. In the context of LLMs, this usually means comparing the model's output against the ground truth in tasks such as translation, question answering, summarisation, etc.
Prerequisites
Things we need to know before we tackle the metrics
An n-gram is any sequence of n consecutive words from a sentence. For example, in the sentence "I saw a cat", the 1-grams (also called unigrams) are "I", "saw", "a", "cat"; the 2-grams (also called bigrams) are "I saw", "saw a", "a cat"; and so on, up to n = number of words in the sentence.
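In code, extracting the n-grams of a fixed size is just sliding a window over the word list. A throwaway sketch (the get_n_gram_counts helper used later does the same thing for every size up to n):
sentence = "I saw a cat".split()

# All 2-grams: slide a window of length 2 over the words.
bigrams = [tuple(sentence[i:i+2]) for i in range(len(sentence) - 1)]
print(bigrams)  # [('I', 'saw'), ('saw', 'a'), ('a', 'cat')]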
While accuracy measures what percentage of predictions we got correct, precision and recall answer slightly more nuanced questions.
Precision: Among all the positive predictions that we made, how many are actually positive, i.e. True Positives / (True Positives + False Positives)
Recall: Among all the positive datapoints, how many did we predict as positive, i.e. True Positives / (True Positives + False Negatives)
F1 Score: combines both precision and recall to produce a balanced score, i.e. 2 * Precision * Recall / (Precision + Recall)
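As a quick illustration with made-up counts (a throwaway helper, just to make the formulas concrete):
def precision_recall_f1(true_positives, false_positives, false_negatives):
    # Compute precision, recall and F1 directly from the raw counts.
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 8 true positives, 2 false positives, 4 false negatives
print(precision_recall_f1(8, 2, 4))  # (0.8, 0.666..., 0.727...)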
Metrics
BLEU
It answers a very simple question: how many words from the predicted sentence (also called the candidate) are also present in the ground-truth sentence (also called the target or reference)? Referring to the definitions above, you can see that this is precision. BLEU extends this to 2-grams, 3-grams, etc. (typically up to 4-grams) and combines them by taking the geometric mean of them all (which can be seen as averaging in log scale).
A brevity penalty is introduced to penalise models that get high precision from short answers. For example, if the prediction is "The" and the target is "The kite is flying", the precision is still 1. If the prediction is longer than the target, the multiplicative factor becomes 1 and there is no penalty; if it is shorter, the factor becomes < 1 and reduces the overall score.
import math
from collections import defaultdict

def get_n_gram_counts(words, n=4):
    # Count every n-gram of size 1..n in the given list of words.
    n_gram_counts = {i: defaultdict(int) for i in range(1, n+1)}
    for curr_n in range(1, n+1):
        for i in range(len(words)-curr_n+1):
            current_gram = tuple(words[i:i+curr_n])
            n_gram_counts[curr_n][current_gram] += 1
    return n_gram_counts

def get_bleu_score(predicted_text, target_text):
    predicted_words, target_words = predicted_text.split(), target_text.split()
    num_predicted_words = len(predicted_words)
    num_target_words = len(target_words)
    # Brevity penalty: < 1 only when the prediction is shorter than the target.
    brevity_penalty = min(1, math.exp(1 - (num_target_words/num_predicted_words)))
    predicted_n_gram_counts = get_n_gram_counts(predicted_words)
    target_n_gram_counts = get_n_gram_counts(target_words)
    total_precision = 1
    for n in range(1, 5):
        # Clipped n-gram precision: a predicted n-gram counts at most as many
        # times as it appears in the target.
        num_present = 0
        for n_gram in predicted_n_gram_counts[n]:
            num_present += min(predicted_n_gram_counts[n][n_gram], target_n_gram_counts[n][n_gram])
        precision = num_present/(num_predicted_words-n+1)
        # Geometric mean of the four precisions (equal weights of 1/4).
        total_precision *= precision**0.25
    bleu_score = brevity_penalty*total_precision
    return bleu_score
get_bleu_score("A NASA rover is fighting a massive storm on Mars .", "The NASA Opportunity rover is battling a massive dust storm on Mars .")
# 0.27
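As a sanity check, you can compare against NLTK's sentence-level BLEU (assuming nltk is installed; tokenisation and smoothing options can move the number slightly, so treat this only as a rough cross-check):
from nltk.translate.bleu_score import sentence_bleu

prediction = "A NASA rover is fighting a massive storm on Mars ."
reference = "The NASA Opportunity rover is battling a massive dust storm on Mars ."

# sentence_bleu expects a list of tokenised references and a tokenised hypothesis.
print(sentence_bleu([reference.split()], prediction.split()))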
ROUGE
ROUGE extends BLEU by including both precision and recall terms and combining them into an F1 score. Remember, in this case precision is (number of words present in both prediction and target) / (number of words in the prediction), and recall is (number of words present in both prediction and target) / (number of words in the target).
The common variants are: ROUGE-1 is the F1 score computed with unigrams, ROUGE-2 is the bigram F1 score, and ROUGE-L is the F1 score using the longest common subsequence (LCS), where the numerator in both precision and recall is the number of words in the LCS of prediction and target. For example, if the candidate is "I would love to have pizza now" and the target is "I would like to eat pizza", then the LCS is "I would to pizza".
def get_rouge_score(predicted_text, target_text, n):
    predicted_words, target_words = predicted_text.split(), target_text.split()
    num_predicted_words = len(predicted_words)
    num_target_words = len(target_words)
    predicted_n_gram_counts = get_n_gram_counts(predicted_words)
    target_n_gram_counts = get_n_gram_counts(target_words)
    # Clipped count of n-grams that appear in both prediction and target.
    num_present = 0
    for n_gram in predicted_n_gram_counts[n]:
        num_present += min(predicted_n_gram_counts[n][n_gram], target_n_gram_counts[n][n_gram])
    precision = num_present/(num_predicted_words-n+1)
    recall = num_present/(num_target_words-n+1)
    f1 = 2*precision*recall/(precision + recall)
    return f1
get_rogue_score("A NASA rover is fighting a massive storm on Mars .", "The NASA Opportunity rover is battling a massive dust storm on Mars .", 1)
# 0.75
get_rogue_score("A NASA rover is fighting a massive storm on Mars .", "The NASA Opportunity rover is battling a massive dust storm on Mars .", 2)
# 0.45
def get_longest_common_subsequence(words1, words2, i, j):
    # Plain recursion for clarity; memoise it (classic dynamic-programming LCS)
    # for longer texts, since as written it is exponential.
    if i == len(words1): return []
    if j == len(words2): return []
    if words1[i] == words2[j]:
        return [words1[i]] + get_longest_common_subsequence(words1, words2, i+1, j+1)
    candidate1 = get_longest_common_subsequence(words1, words2, i+1, j)
    candidate2 = get_longest_common_subsequence(words1, words2, i, j+1)
    return candidate1 if len(candidate1) > len(candidate2) else candidate2
def get_rouge_lcs_score(predicted_text, target_text):
    predicted_words, target_words = predicted_text.split(), target_text.split()
    num_predicted_words = len(predicted_words)
    num_target_words = len(target_words)
    lcs = get_longest_common_subsequence(predicted_words, target_words, 0, 0)
    print("the longest common subsequence is: ", lcs)
    precision = len(lcs)/num_predicted_words
    recall = len(lcs)/num_target_words
    f1 = 2*precision*recall/(precision + recall)
    return f1
get_rogue_lcs_score("A NASA rover is fighting a massive storm on Mars .", "The NASA Opportunity rover is battling a massive dust storm on Mars .")
# the longest common subsequence is: ['NASA', 'rover', 'is', 'a', 'massive', 'storm', 'on', 'Mars', '.']
# 0.75
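In practice you would usually lean on an existing implementation such as the rouge-score package (assuming it is installed; it applies its own tokenisation, e.g. lower-casing and stripping punctuation, so the numbers can differ slightly from the from-scratch version above):
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score(
    "The NASA Opportunity rover is battling a massive dust storm on Mars .",  # target
    "A NASA rover is fighting a massive storm on Mars .",                     # prediction
)
print(scores)  # each entry has precision, recall and fmeasure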
METEOR
While higher-order n-grams are used to compute BLEU and ROUGE, the effect of word continuity is not accounted for explicitly. They also require exactly the same words in both target and candidate to count positively towards the score. METEOR accounts for both of these by computing a chunk penalty and by treating similar words as the same.
An alignment step is performed first, in which every word that is present in both prediction and target is mapped such that, if we were to draw lines for these mappings, the number of crossings would be minimal. Any set of consecutive words in the prediction mapped to a set of consecutive words in the target is then called a chunk, and the chunk penalty is defined as 0.5 * (number of chunks / number of mapped words)^3.
In addition, we can use BERT or word2vec embeddings to compute the similarity between words and map them if it crosses a certain threshold, even when they are not exactly the same word.
The final METEOR computation is a weighted F score that gives recall nine times the weight of precision, i.e. F_mean = 10 * precision * recall / (recall + 9 * precision), and the score is F_mean * (1 - chunk penalty). A minimal sketch is given below.
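Putting the pieces together, here is a minimal sketch (a hypothetical get_simple_meteor_score helper, not the reference implementation). It assumes exact word matches only, with no stemming, synonyms or embeddings, and uses a simple greedy left-to-right alignment instead of the crossing-minimising alignment described above:
def get_simple_meteor_score(predicted_text, target_text):
    predicted_words, target_words = predicted_text.split(), target_text.split()

    # Map each prediction word to the first unused identical target word.
    used = [False] * len(target_words)
    alignment = []  # (prediction index, target index) pairs, in prediction order
    for i, word in enumerate(predicted_words):
        for j, target_word in enumerate(target_words):
            if not used[j] and word == target_word:
                used[j] = True
                alignment.append((i, j))
                break

    if not alignment:
        return 0.0

    num_matches = len(alignment)
    precision = num_matches / len(predicted_words)
    recall = num_matches / len(target_words)
    # Weighted harmonic mean: recall gets nine times the weight of precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)

    # A chunk is a maximal run of matches that is contiguous in both sentences.
    num_chunks = 1
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if not (i2 == i1 + 1 and j2 == j1 + 1):
            num_chunks += 1

    chunk_penalty = 0.5 * (num_chunks / num_matches) ** 3
    return f_mean * (1 - chunk_penalty)

get_simple_meteor_score("A NASA rover is fighting a massive storm on Mars .", "The NASA Opportunity rover is battling a massive dust storm on Mars .")
# ~0.67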