Reliability

When it comes to evaluating quality, the ability to measure alone is not sufficient; you also need to know how reliable the measurement is. How stable is your evaluation tool? Does it produce consistent results when repeated? And even if the answer is yes, a reliable measurement is not necessarily a valid one — validity has to be established separately.
When we talk about the reliability of a tool or a procedure, we need to know that different human evaluators can achieve very similar results when evaluating the same output. We call this close similarity inter-rater reliability. For example, human evaluation of quality is known to be prone to low inter-rater reliability if carried out without sufficient evaluator training, since untrained evaluators may arrive at very different results.

In contrast to human evaluation, automatic evaluation of quality quickly produces the same scores if repeated a number of times, which creates an illusion of precision. Unfortunately, the results of automatic quality measurement not only depend on the language pair in question, the machine learning system, and the tokenization scheme (a problem SacreBLEU solves in part), and the content type, but also vary from dataset to dataset. They can even vary within one and the same dataset, depending on the way the data have been cleaned and formatted. For this reason, it is usually recommended to use one and the same dataset to compare the results of different systems using automatic metrics. Given these factors, automatic measurement can be very fast and “reliable”, but it may be (and often is) invalid, as well as inconsistent, since it is dependent on the dataset and its formatting.
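To make the formatting sensitivity concrete, here is a toy sketch — not a real MT metric; actual metrics such as BLEU operate on n-grams, and SacreBLEU exists precisely to standardize this kind of preprocessing. The sentences and the `unigram_precision` helper are invented for illustration; the point is only that an identical translation scores differently depending on how punctuation is tokenized:

```python
def unigram_precision(hypothesis, reference):
    """Naive toy score: share of hypothesis tokens also found in the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    return sum(tok in ref for tok in hyp) / len(hyp)

reference = "The user clicks the Save button ."

# The same machine translation, differing only in punctuation tokenization:
hyp_a = "The user clicks the Save button ."   # period split off as a token
hyp_b = "The user clicks the Save button."    # period attached to the word

print(round(unigram_precision(hyp_a, reference), 2))  # 1.0
print(round(unigram_precision(hyp_b, reference), 2))  # 0.83
```

The translation is the same in both cases; only the surface formatting changed, yet the score moved from 1.0 to 0.83.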

Compared to automatic quality evaluation, human quality evaluation does not depend on text formatting details or the dataset per se, and correctly captures the human perception of quality. Human assessment of translation quality is generally performed by several different people, though in the vast majority of cases these people are involved at different workflow stages. It is therefore crucial to have an idea of how reliable and valid the human quality evaluation is, including evaluation by more than one person.

Studies show that quality evaluations performed by insufficiently trained evaluators yield low Inter-Rater Reliability (IRR). Inter-rater reliability is a measure of reliability used to assess the degree to which different judges or raters agree in their evaluation decisions. Interestingly, with the human analytic method of translation quality measurement, even when evaluators demonstrate low agreement on concrete error categories and severities, they show better agreement on the resulting quality evaluation [1].

The need to conduct proper training for quality evaluators is paramount, because untrained evaluators are unable to properly identify quality categories, assign proper severities, or even judge the accuracy of a translation, especially if their knowledge of the subject matter area or the source language is insufficient.

It is common knowledge in the industry, however, that even when the evaluators are properly trained, qualified, and knowledgeable, their evaluations still vary to a noticeable extent.
As noted, inter-rater reliability is a way to measure the level of agreement between multiple raters, or judges. The higher the IRR, the more consistently multiple judges will rate items with similar scores.

In general, a rule of thumb states that at least 75% inter-rater agreement is sufficient in most fields for a test to be considered reliable. In medical fields, inter-rater reliability of 95% may be required, for instance when determining whether or not a certain treatment should be used for a given patient; and of course, in unambiguous fields such as product manufacturing, IRR must be very high, within very narrow tolerances.
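As a minimal sketch of how such an agreement figure can be computed, the following compares two raters' verdicts on the same set of items; the rater data and the `percent_agreement` helper are invented for illustration:

```python
def percent_agreement(ratings_a, ratings_b):
    """Share of items (in %) on which two raters gave the same verdict."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100 * matches / len(ratings_a)

# Hypothetical PASS/FAIL verdicts from two evaluators on eight segments:
rater_1 = ["PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS", "PASS", "PASS"]
rater_2 = ["PASS", "FAIL", "FAIL", "PASS", "FAIL", "PASS", "PASS", "PASS"]

print(percent_agreement(rater_1, rater_2))  # 87.5 — above the 75% rule of thumb
```

Note that raw percent agreement does not correct for chance agreement; chance-corrected statistics such as Cohen's kappa are often preferred for categorical ratings.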

The problem, however, is that reliably estimating variance under a normal-distribution assumption requires at least 20–30 data points. In settings where the sample size is smaller than 30 and the standard deviation of the entire population is unknown, Student’s t-distribution can be used instead, allowing a confidence interval to be estimated from just a handful of measurements — even as few as two or three.

Example. Two evaluators arrived at quality scores of QS1=76.85 and QS2=81.99 (on a scale from 0 to 100). What is the reliability of this evaluation?

Solution. 81.99 is 6.7% greater than 76.85, and 76.85 is 6.3% less than 81.99; therefore QS2 agrees with QS1 to 93.3%, and QS1 agrees with QS2 to 93.7%. This is almost 95% agreement, which looks good for most cases. However, if the PASS/FAIL threshold is 80, the difference may be crucial for the translator. The following discussion examines the Confidence Interval for this kind of measurement.
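The agreement figures above can be reproduced in a few lines; `pairwise_agreement` is a hypothetical helper name:

```python
def pairwise_agreement(score, reference_score):
    """Agreement (in %) of one score with another taken as the reference:
    100 minus the relative difference between the two scores."""
    return 100 * (1 - abs(score - reference_score) / reference_score)

qs1, qs2 = 76.85, 81.99

print(round(pairwise_agreement(qs1, qs2), 1))  # 93.7 — QS1 relative to QS2
print(round(pairwise_agreement(qs2, qs1), 1))  # 93.3 — QS2 relative to QS1
```

The two directions differ because the relative difference is taken against different reference scores, which is why the text quotes both 93.3% and 93.7%.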

The Sample Mean is 79.42 for this sample of two.

The Sample Standard Deviation for this sample is 3.63. (The sample standard deviation uses the n − 1 denominator, so for two measurements it equals the half-range of 2.57 multiplied by √2.)

The Confidence Interval depends on the desired Confidence Level which, in turn, depends on the subject matter area of the content which was translated.

For most fields, the confidence level should be at least 80%. The critical value for that level (α/2 = 0.1, i.e., the remaining 20% split evenly between the two tails) and two measurements (one degree of freedom) is 3.078, as shown in Fig. 1 below.

Figure 1. Student’s distribution, critical values.

Therefore, the margin of error for these measurements is:

E = 3.078 * 3.63 / SQRT(2) ≈ 7.9

This means that the confidence interval for these two measurements is 15.8 wide, which indicates that with an 80% degree of confidence, it is safe to conclude that the true quality score lies within this interval:

[79.42 – 7.9, 79.42 + 7.9], or [71.5, 87.3]
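The whole calculation can be sketched in a few lines of Python using only the standard library; the t critical value is hard-coded from a t-table (as in Fig. 1) rather than computed:

```python
import math
import statistics

scores = [76.85, 81.99]

n = len(scores)
mean = statistics.mean(scores)    # 79.42
s = statistics.stdev(scores)      # sample SD (n - 1 denominator), about 3.63

# Critical value of Student's t for an 80% confidence level
# and n - 1 = 1 degree of freedom, taken from a t-table:
t_crit = 3.078

margin = t_crit * s / math.sqrt(n)  # about 7.91
print(f"80% CI: [{mean - margin:.2f}, {mean + margin:.2f}]")  # [71.51, 87.33]
```

Note that `statistics.stdev` applies the n − 1 denominator; using the population formula (`statistics.pstdev`) here would understate the uncertainty of such a small sample.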

We need a third measurement to narrow this interval, but the good news is that this is a fairly narrow interval for such a subtle, almost intangible factor as the human perception of quality.
In a production setting, however, additional data points are rarely obtained by repeated evaluation of the same sample, because of the cost and time such a process requires.

More often, additional data points are taken from evaluations of other samples in the course of translation. As shown above, the Confidence Interval for a single measurement is wide, so one evaluation cannot serve as the basis for process decisions; more evaluations are required. Subsequent evaluations effectively fall into two categories: (a) mostly PASS with only a rare occasional FAIL, or (b) all other cases (mostly FAIL, or more than two FAILs). This strategy stems from the desire to ensure that a system stays reliably above the PASS/FAIL threshold and thus delivers quality results. Multiple evaluations also confirm the validity of the quality measurements.
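That decision rule can be sketched as follows; the threshold of 80 and the tolerance of at most two FAILs are illustrative assumptions, not industry constants:

```python
def categorize(evaluations, threshold=80.0, max_fails=2):
    """Hypothetical decision rule: category 'a' if at most `max_fails`
    evaluation scores fall below the PASS/FAIL threshold, else 'b'."""
    fails = sum(score < threshold for score in evaluations)
    return "a" if fails <= max_fails else "b"

print(categorize([86.2, 91.0, 79.5, 88.3, 90.1]))  # a — one FAIL among five
print(categorize([76.1, 82.4, 77.0, 78.9]))        # b — three FAILs
```

In practice, of course, the threshold and the tolerated number of FAILs would be set per account or content type as part of the quality agreement.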

Sources:

[1] Charlampidou, Parthena & Serge Gladkoff. 2022. “Application of an industry practical human MT output Quality Evaluation Metric in the EMT classroom,” NeTTT Conference paper, July 2022.