Questions tagged [bleu]

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.

55 questions
2 votes • 1 answer

Why does sacrebleu return a zero BLEU score for short sentences?

Why does sacrebleu require sentences to end with a dot? If I remove the dots, the value is zero. import sacrebleu, nltk sys = ["This is cat."] refs = [["This is a cat."], ["This is a bad cat."]] b3 = sacrebleu.corpus_bleu(sys, refs) print("b3",…
Ahmad • 8,811
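A dependency-free sketch of why this happens (not sacrebleu's implementation): BLEU-4 is a geometric mean over clipped 1- to 4-gram precisions, so a hypothesis shorter than four tokens contributes no 4-grams at all, and the unsmoothed score collapses to zero.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision from the BLEU paper; returns (clipped, total)."""
    hyp_counts = Counter(ngrams(hyp, n))
    max_ref = Counter()
    for ref in refs:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped, sum(hyp_counts.values())

hyp = "This is cat".split()                      # only 3 tokens -> no 4-grams
refs = ["This is a cat".split(), "This is a bad cat".split()]
precisions = [modified_precision(hyp, refs, n) for n in range(1, 5)]
# precisions[-1] is (0, 0): with no 4-grams the p4 denominator is zero, so the
# geometric mean over p1..p4 -- and with it unsmoothed BLEU-4 -- is 0.
```

Removing the final dot shortens the tokenized hypothesis below four tokens, which is why the score flips to zero; the dot itself is not special.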
2 votes • 1 answer

Average of BLEU scores on two subsets of data is not the same as overall score

For evaluating a sequence generation model, I'm using BLEU-1 through BLEU-4. I split the test set into two subsets and calculated the scores on each subset separately, as well as on the whole test set. Surprisingly, the result I get from the whole test set is…
forough • 47
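This mismatch is expected: corpus-level BLEU pools n-gram counts over all sentences before dividing, while averaging two subset scores weights each subset equally regardless of its length. A minimal sketch with unigram precision only (not a full BLEU implementation):

```python
from collections import Counter

def clipped_counts(hyp, ref):
    """Clipped unigram matches for one (hypothesis, reference) pair."""
    hyp_c, ref_c = Counter(hyp), Counter(ref)
    matched = sum(min(c, ref_c[w]) for w, c in hyp_c.items())
    return matched, sum(hyp_c.values())

def corpus_precision(pairs):
    """Pool counts over the whole corpus *before* dividing."""
    num = den = 0
    for hyp, ref in pairs:
        m, t = clipped_counts(hyp, ref)
        num += m
        den += t
    return num / den

subset_a = [("the cat sat".split(), "the cat sat".split())]          # 3 of 3 match
subset_b = [("a b c d e f g h".split(), "p q r s t u v w".split())]  # 0 of 8 match

p_a = corpus_precision(subset_a)               # 1.0
p_b = corpus_precision(subset_b)               # 0.0
p_all = corpus_precision(subset_a + subset_b)  # 3/11, not (1.0 + 0.0) / 2
```

Because the second subset contributes more tokens to the pooled denominator, the whole-corpus number sits far from the simple average of the two subset scores.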
2 votes • 1 answer

BLEU error: n-gram overlaps of lower order

I ran the code below: a = ['dog', 'in', 'plants', 'crouches', 'to', 'look', 'at', 'camera'] b = ['a', 'brown', 'dog', 'in', 'the', 'grass', ' ', ' '] from nltk.translate.bleu_score import corpus_bleu bleu1 = corpus_bleu(a, b, weights=(1.0, 0, 0,…
Dung Dao • 51
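The warning in this question usually comes from the argument nesting, not the n-grams themselves: `corpus_bleu` wants a list of reference *lists* per hypothesis and a list of hypothesis token lists. A shape-check sketch (pure Python, not NLTK itself):

```python
# Nesting corpus_bleu expects:
#   list_of_references -- for each hypothesis, a list of reference token lists
#   hypotheses         -- a list of hypothesis token lists
hypothesis = ['dog', 'in', 'plants', 'crouches', 'to', 'look', 'at', 'camera']
reference = ['a', 'brown', 'dog', 'in', 'the', 'grass']

list_of_references = [[reference]]   # 1 hypothesis -> its list of references
hypotheses = [hypothesis]

def shapes_ok(list_of_references, hypotheses):
    """True when the nesting matches what corpus_bleu wants."""
    if len(list_of_references) != len(hypotheses):
        return False
    return all(
        isinstance(hyp, list) and all(isinstance(tok, str) for tok in hyp)
        and all(isinstance(ref, list) for ref in refs)
        for refs, hyp in zip(list_of_references, hypotheses)
    )

# corpus_bleu(a, b, ...) with the flat lists from the question makes NLTK
# treat each *token string* as its own sentence of characters, so higher-order
# n-grams find no overlap and the "lower order" warning fires.
```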
1 vote • 0 answers

ROUGE score averaged across documents or per question?

I am evaluating a model on the NarrativeQA story task, and the given metrics are ROUGE, BLEU-1/4, and METEOR. What is the standard practice for evaluating on datasets? Do I average the ROUGE score across the documents or per question? evaluator…
1 vote • 1 answer

Early stopping based on BLEU in FairSeq

My goal is to use BLEU as early stopping metric while training a translation model in FairSeq. Following the documentation, I am adding the following arguments to my training script: --eval-bleu --eval-bleu-args --eval-bleu-detok…
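One common way these flags are wired together, adapted from fairseq's translation example; the data path, architecture, and patience value below are placeholders, not prescriptions:

```shell
fairseq-train data-bin/my-corpus \
    --arch transformer --task translation \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric \
    --patience 10
```

Note that `--eval-bleu-args` takes a JSON string of generation options, and `--patience` is what actually stops training: here, after 10 consecutive validations without a new best BLEU.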
1 vote • 0 answers

Why would combining two feature vectors degrade machine learning performance?

I'm working on a video captioning project and have two feature vectors extracted for each video as NumPy files. One has shape (15, 2048), extracted by a 3D CNN, and the other has shape (15, 1536), extracted by…
A_B_Y • 332
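For the shapes in the question, frame-wise concatenation is the usual fusion step; a dependency-free sketch (stand-in constant features, since the real extractors aren't shown):

```python
# Fuse (15, 2048) 3D-CNN features with a (15, 1536) second stream per frame.
frames = 15
feat_a = [[0.1] * 2048 for _ in range(frames)]   # stand-in 3D-CNN features
feat_b = [[0.9] * 1536 for _ in range(frames)]   # stand-in second stream

fused = [a + b for a, b in zip(feat_a, feat_b)]  # per-frame concat -> (15, 3584)

# Mismatched scales between streams are a classic reason fusion hurts: the
# higher-magnitude stream dominates. Normalizing each stream (e.g. to unit
# L2 norm per frame) before concatenating is a common first fix.
```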
1 vote • 0 answers

What are the differences between the pycocoevalcap and NLTK methods for BLEU score calculation?

For image captioning, the BLEU score can be calculated using the pycocoevalcap repo or the NLTK Python library, e.g.: `from utils.coco_caption.pycocoevalcap.bleu.bleu import Bleu # // by pycocoevalcap . . . from nltk.translate.bleu_score import…
ENG_AI • 23
1 vote • 1 answer

What are the differences between BLEU score and METEOR?

I am trying to understand how machine translation evaluation scores work. I understand what the BLEU score is trying to achieve: it looks at different n-grams (BLEU-1, BLEU-2, BLEU-3, BLEU-4) and tries to match them with the human…
Exploring • 2,493
1 vote • 1 answer

Why is the BLEU score zero for this pair even though the sentences are similar?

Why am I getting a score of 0 even though the sentences are similar? Please refer to the code below: from nltk.translate.bleu_score import sentence_bleu score = sentence_bleu(['where', 'are', 'economic', 'networks'], ['where', 'are', 'the', 'economic',…
Hagout Fan • 13
1 vote • 1 answer

I compare two identical sentences with NLTK's BLEU and don't get 1.0. Why?

I'm trying to use the BLEU score from NLTK for quality evaluation of machine translation. I wanted to check this code with two identical sentences; here I'm using method1 as a smoothing function because I'm comparing two sentences and not…
slow_war • 25
1 vote • 1 answer

NLTK sentence_bleu method 7 gives scores above 1

When using the NLTK sentence_bleu function in combination with SmoothingFunction method 7, the maximum score is 1.1167470964180197, even though the BLEU score is defined to be between 0 and 1. This score shows up for perfect matches with the reference.…
Rink Stiekema • 394
1 vote • 0 answers

Construct a heatmap matrix with a custom metric in Python

I want to construct a heatmap matrix with a custom metric: the BLEU score. I have 20 sentences that I want to compare this way. I tried to use sns.heatmap or sns.clustermap and then add sentence_bleu as a metric function, but…
ASUCB • 51
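One workable approach: `sns.heatmap` plots a precomputed matrix rather than calling a metric function for you, so build the pairwise matrix first and pass it in. A sketch with a simple Jaccard token overlap standing in for sentence_bleu, to keep it dependency-free:

```python
def overlap(a, b):
    """Jaccard token overlap -- a stand-in for any pairwise sentence metric."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

sentences = ["the cat sat", "the cat ran", "dogs bark loudly"]
# Pairwise score matrix: rows and columns both index the sentence list.
matrix = [[overlap(s, t) for t in sentences] for s in sentences]

# seaborn usage (assumed installed):
# import seaborn as sns
# sns.heatmap(matrix, xticklabels=sentences, yticklabels=sentences)
```

Swapping `overlap` for a wrapper around `sentence_bleu` gives the BLEU heatmap; note BLEU is asymmetric, so the matrix need not mirror across the diagonal.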
1 vote • 1 answer

Running NLTK sentence_bleu in Pandas

I am trying to apply sentence_bleu to a column in Pandas to rate the quality of some machine translation. But the scores it is outputting are incorrect. Can anyone see my error? import pandas as pd from nltk.translate.bleu_score import…
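The usual culprit in this setup is passing raw strings to sentence_bleu, which NLTK then iterates character by character; each cell should be tokenized first. A pure-Python stand-in for the row-wise call (unigram precision in place of sentence_bleu, so the sketch runs without pandas or NLTK):

```python
rows = [
    {"reference": "the cat sat on the mat", "candidate": "the cat sat on a mat"},
]

def score_row(row):
    ref_tokens = row["reference"].split()    # token lists, not raw strings
    cand_tokens = row["candidate"].split()
    # With NLTK this step would be:
    #   sentence_bleu([ref_tokens], cand_tokens)
    matched = sum(1 for tok in cand_tokens if tok in ref_tokens)
    return matched / len(cand_tokens)

scores = [score_row(r) for r in rows]
# pandas equivalent (assumed column names): df["bleu"] = df.apply(score_row, axis=1)
```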
1 vote • 0 answers

Corpus BLEU score calculation

I am trying to calculate the BLEU score for text summarization. Before that, I need to know how corpus BLEU calculates the score between given references and a candidate. I have 3 references and 2 candidates, and I am calling corpus bleu on the first two references…
0 votes • 0 answers

keras_nlp.metrics.Bleu ValueError: y_pred must be of rank 0, 1 or 2. Found rank: 3

I am fine-tuning the google/mT5-small model on Google Colab for a translation task. To assess translation quality, I am using the keras_nlp.metrics.Bleu metric to compute the BLEU score. However, I…
王冠侖 • 1