For translation quality evaluation (TQE), the safest option is always to evaluate the entire translated content. For reasons of cost, however, this is often not feasible, and TQE is then carried out on a sample of the translated content. Sampling entails risks and challenges that implementers should be aware of, so that decisions on sample selection (sample size and sample chunk size) are taken in a way that mitigates and manages these risks as well as possible.
Different sampling approaches are possible. During the preparatory stage, when designing the evaluation system, implementers therefore need to decide which sampling approach best serves their evaluation purposes. The following is a non-exhaustive list of options that can be combined in numerous ways.
Full-text sampling (not all texts in a project are included in the sample, but all segments in the chosen texts are reviewed). Full-text sampling mitigates the risks linked to the fact that texts are not uniform (see the last paragraph below). Since full-text sampling leaves the text intact, another advantage is that it makes it possible to identify and annotate errors arising from problems in cohesion. Cohesion problems can of course also be satisfactorily addressed in other sampling approaches if (1) the whole text is accessible to the evaluator and (2) the designated error location for a cohesion problem is in one of the sampled segments.
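As an illustration of full-text sampling, the following minimal sketch picks whole texts at random until a target volume is reached. The function name `full_text_sample`, the `word_budget` parameter, and the use of whitespace-separated words as the unit of volume are assumptions made for this example, not part of any prescribed method.

```python
import random


def full_text_sample(texts, word_budget, seed=None):
    """Pick whole texts at random until the word budget is reached.

    `texts` maps a text identifier to the full content of that text.
    Each chosen text is kept intact, so every segment in it is reviewed.
    `word_budget` is a hypothetical target volume for the evaluation.
    """
    rng = random.Random(seed)
    ids = list(texts)
    rng.shuffle(ids)  # random order of whole texts

    chosen, total_words = [], 0
    for text_id in ids:
        if total_words >= word_budget:
            break
        chosen.append(text_id)
        total_words += len(texts[text_id].split())
    return chosen, total_words
```

Because whole texts are added until the budget is met, the total can exceed the budget by up to the length of the last text chosen.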
Partial-text sampling (some segments in individual texts are sampled and others not; in multi-text projects, each text would have some portion sampled). The evaluation would examine multiple chunks. It has to be decided what sample chunk size is most suitable (for example, 1500 characters or 300 words). For partial-text sampling, several options for sample selection exist:
- Representative samples based on random selection of chunks (samples that will give a statistically valid view of the whole text)
- Stratified sample selection (samples chosen to give special attention to a well-defined subset of the segments, by oversampling, undersampling, or excluding different subsets). Examples of subsets relevant for stratified sampling:
  - High-profile passages: titles and headers, front matter, forewords and summaries, introductory paragraphs of sections or chapters, legends and captions, etc.
  - Text fragments or non-running text: all segments in tables, captions, call-outs, sidebars, footnotes, endnotes, and appendices.
  - Running text: everything except the types listed in the preceding item.
  - Translation memory (TM) matches or new translation passages (when a TM was used in the translation): either all exact or fuzzy matches produced by the TM, to validate and update the TM, or all segments not matched by the TM, to focus on new translations and avoid spending time and money on segments that have already been validated several times.
  - Segments identified by translators as problematic: segments about which translators raised questions or where they noted source problems.
  - Strings processed using machine translation (MT): either only the segments produced by MT that were not edited by a human translator, or all segments produced using MT.
This list is not exhaustive. Implementers can define other ways of partitioning the segments that make sense in their implementation context.
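The two partial-text selection options above can be sketched as follows. This is a minimal illustration under stated assumptions: chunks are measured in whitespace-separated words, segments arrive already labelled with a stratum, and the names `chunk_words`, `random_chunk_sample`, `stratified_sample`, and the `rates` mapping are hypothetical helpers invented for this example.

```python
import math
import random
from collections import defaultdict


def chunk_words(text, chunk_size=300):
    """Split a text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def random_chunk_sample(chunks, n_chunks, seed=None):
    """Draw up to `n_chunks` chunks uniformly at random, without replacement."""
    rng = random.Random(seed)
    n = min(n_chunks, len(chunks))
    picked = rng.sample(range(len(chunks)), n)
    return [chunks[i] for i in sorted(picked)]  # keep document order


def stratified_sample(segments, rates, seed=None):
    """Sample segments stratum by stratum.

    `segments` is a list of (stratum_label, segment_text) pairs.
    `rates` maps a stratum label to a sampling fraction between 0 and 1.
    A high rate oversamples a stratum, a low rate undersamples it, and a
    stratum missing from `rates` is excluded (treated as rate 0).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for label, segment in segments:
        by_stratum[label].append(segment)

    sample = {}
    for label, stratum_segments in by_stratum.items():
        # Round up so that any non-zero rate yields at least one segment.
        k = math.ceil(len(stratum_segments) * rates.get(label, 0.0))
        sample[label] = rng.sample(stratum_segments, k)
    return sample
```

For instance, `stratified_sample(segments, {"heading": 1.0, "running": 0.5})` would review every segment labelled `heading`, half of those labelled `running`, and none from any other stratum. How segments are labelled (for example, by TM match status or MT origin) depends on the implementation context.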
A major challenge with sampling is that a statistical approach is applicable only if the material is statistically uniform, because uniformity is what makes well-developed statistical distributions and models applicable. Texts, however, are not uniform.
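A small worked example may help show why non-uniformity matters. The figures below are invented purely for illustration and do not come from any real project. They describe a hypothetical text whose two parts have very different error densities, so a sample that falls mostly in one part gives a misleading picture of the whole.

```python
def errors_per_1000_words(error_count, word_count):
    """Error density, expressed as errors per 1000 words."""
    return 1000 * error_count / word_count


# Hypothetical figures, invented for illustration only.
parts = {
    "running text": {"words": 9000, "errors": 18},
    "tables and captions": {"words": 1000, "errors": 12},
}

total_words = sum(p["words"] for p in parts.values())
total_errors = sum(p["errors"] for p in parts.values())

# Density over the whole text.
overall = errors_per_1000_words(total_errors, total_words)

# Density within each part, which is what a sample confined to that part
# would tend to report.
per_part = {
    name: errors_per_1000_words(p["errors"], p["words"])
    for name, p in parts.items()
}

print(f"overall: {overall:.1f} errors per 1000 words")
for name, density in per_part.items():
    print(f"{name}: {density:.1f} errors per 1000 words")
```

With these invented numbers, the overall density is 3.0 errors per 1000 words, while the two parts come out at 2.0 and 12.0 respectively. A sample drawn only from running text would understate the overall density, and one drawn only from tables and captions would overstate it fourfold.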
For further reading on this aspect, see document here.