To evaluate a sequence generation model, I'm using BLEU1 through BLEU4. I split the test set into two subsets and calculated the scores on each subset separately, as well as on the whole test set. Surprisingly, the result I get on the whole test set is not the size-weighted average of the results I get on the two subsets. For example, consider the BLEU4 scores I get on the two subsets and on the full set:
set1, 866 elements: 0.0001529267908
set2, 1010 elements: 0.1625387989
<set1,set2>, 1876 elements: 0.3063472152
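For reference, here is a quick sketch of the size-weighted average I expected, using only the numbers listed above (the variable names are mine, just for illustration):

```python
# Sizes and BLEU4 scores reported above.
n1, bleu4_set1 = 866, 0.0001529267908
n2, bleu4_set2 = 1010, 0.1625387989
bleu4_full = 0.3063472152  # score on the full 1876-element test set

# Size-weighted average of the two subset scores.
weighted_avg = (n1 * bleu4_set1 + n2 * bleu4_set2) / (n1 + n2)

print(f"weighted average: {weighted_avg:.4f}")  # 0.0876
print(f"full test set:    {bleu4_full:.4f}")    # 0.3063
```

The weighted average comes out near 0.0876, far from the 0.3063 measured on the full set, and the full-set score is even higher than either subset score.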
How should I aggregate the results on the two subsets to get the overall result?
Note: I know that all the elements in set1 are shorter than 4 tokens, which is why BLEU4 is almost zero there.
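To illustrate the note: a sequence with fewer than 4 tokens contains no 4-grams at all, so there is nothing for the 4-gram precision to match. A minimal sketch, where `ngrams` is a hypothetical helper of my own and not a function from any BLEU library:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A made-up 3-token sequence, standing in for an element of set1.
short = ["a", "b", "c"]

print(len(ngrams(short, 3)))  # 1
print(len(ngrams(short, 4)))  # 0
```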