Shared Task on Machine Translation Gender Bias Evaluation with Multilingual Holistic Bias


Demographic biases are relatively infrequent phenomena, but they present a very important problem. The development of datasets in this area has raised interest in evaluating Natural Language Processing (NLP) models beyond standard quality measures. In Machine Translation (MT), gender bias is observed when translations show errors in linguistic gender determination even though the source content contains sufficient gender clues for a system to infer the correct gendered forms. To illustrate, sentence (1) below does not contain enough linguistic clues for a translation system to decide which gendered form to use when translating into a language where the word for doctor is gendered. Sentence (2), however, includes a gendered pronoun which most likely has the word doctor as its antecedent. Example (3) shows two variants of the same sentence that differ only in gender inflection.

1. I didn’t feel well, so I made an appointment with my doctor. 

2. My doctor is very attentive to her patients’ needs. 

3. Mi amiga es una ama de casa / Mi amigo es un amo de casa. (in English, My (female/male) friend is a homemaker)

Gender bias is observed when the system produces the wrong gendered form when translating sentence (2) into a language that uses distinct gendered forms for the word doctor. A single error in the translation of an utterance like sentence (1) would not be sufficient to conclude that gender bias exists in the model; that conclusion would require consistently observing one linguistic gender being preferred over the other. Finally, a lack of robustness is shown if translation quality differs between the two variants in (3). It has previously been hypothesized that one possible source of gender bias is gender representation imbalance in large training and evaluation datasets, e.g. [Costa-jussà et al., 2022; Qian et al., 2022].


The goals of the shared translation task are:

Shared Task Description

We propose to evaluate three cases of gender bias: gender-specific, gender robustness, and unambiguous gender.

Description Task 1: Gender-specific

In the English-to-X translation direction, we evaluate the capacity of machine translation systems to generate gender-specific translations from gender-neutral English inputs (e.g. I didn’t feel well, so I made an appointment with my doctor.). This is motivated by the observation that MT models systematically translate neutral source sentences into masculine or feminine forms depending on the stereotypical usage of the word (e.g. “homemakers” into “amas de casa”, the feminine form in Spanish, and “doctors” into “médicos”, the masculine form in Spanish).
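One way to check for a systematic preference on neutral inputs is to tally which gendered forms a system produces. The sketch below is only an illustration, not the task's official evaluation: the lexicon `GENDERED_FORMS` and the function `tally_gender` are hypothetical names, and a real analysis would need a proper morphological resource for each target language.

```python
from collections import Counter

# Hypothetical toy lexicon of Spanish gendered forms (an assumption for
# illustration only; it is not part of the shared task materials).
GENDERED_FORMS = {
    "médico": "masculine",
    "médica": "feminine",
    "médicos": "masculine",
    "médicas": "feminine",
}


def tally_gender(translations, lexicon=GENDERED_FORMS):
    """Count masculine and feminine lexicon hits across system outputs."""
    counts = Counter()
    for sentence in translations:
        for token in sentence.lower().split():
            # Strip surrounding punctuation before the lexicon lookup.
            label = lexicon.get(token.strip(".,;:!?¿¡"))
            if label is not None:
                counts[label] += 1
    return counts


# Outputs a system might produce for gender-neutral English sources.
outputs = [
    "Pedí cita con mi médico.",
    "Mi médica es muy atenta.",
    "Fui al médico ayer.",
]
counts = tally_gender(outputs)
```

A strongly skewed tally over many neutral inputs would be consistent with the systematic preference described above, whereas a single masculine or feminine output would not be conclusive.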

Description Task 2: Gender Robustness

In the X-to-English translation direction, we evaluate the robustness of the model when source inputs differ only in gender (masculine or feminine), e.g. in Spanish: Mi amiga es una ama de casa / Mi amigo es un amo de casa.

Description Task 3: Unambiguous Gender

In the X-to-X translation direction, we evaluate the translation of unambiguous gender across languages without being English-centric, e.g. Spanish-to-Catalan: Mi amiga es una ama de casa is translated into La meva amiga és una mestressa de casa.

Submission details

X Languages. In addition to English, our challenge covers 26 languages: Modern Standard Arabic, Belarusian, Bulgarian, Catalan, Czech, Danish, German, French, Italian, Lithuanian, Standard Latvian, Marathi, Dutch, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Ukrainian, Urdu

Evaluation. The challenge will be evaluated using automatic metrics. Systems will be assessed on overall translation quality and on the difference in performance between the male and female sets. More details will be provided.
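As a rough illustration of the two criteria above, the following sketch computes an overall score and the gap between the two gendered sets. It makes several assumptions: the official metric has not been specified here, so `token_f1` is a stand-in whitespace-token F1, and `gender_gap` is a hypothetical helper name rather than part of any evaluation toolkit.

```python
from collections import Counter
from statistics import mean


def token_f1(hypothesis: str, reference: str) -> float:
    """Placeholder quality score: F1 over whitespace tokens.

    This is an assumption for illustration, not the official metric.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def gender_gap(masculine, feminine, score=token_f1):
    """Score two non-empty lists of (hypothesis, reference) pairs."""
    masc_scores = [score(h, r) for h, r in masculine]
    fem_scores = [score(h, r) for h, r in feminine]
    return {
        "overall": mean(masc_scores + fem_scores),
        "masculine": mean(masc_scores),
        "feminine": mean(fem_scores),
        # Positive gap: the masculine set is translated better on average.
        "gap": mean(masc_scores) - mean(fem_scores),
    }


# Toy X-to-English outputs paired with reference translations.
masculine_set = [("My friend is a homemaker", "My friend is a homemaker")]
feminine_set = [("My friend is a housewife", "My friend is a homemaker")]
result = gender_gap(masculine_set, feminine_set)
```

Any sentence-level metric with the same signature can be passed as `score`; a gap close to zero would indicate that the two sets are translated with similar quality.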

Submission platform. We will use the Dynabench platform for all tasks.

Important Dates.

From December: Fill in the interest form

Mar 20, 2024: Model Submission Opens

May 20, 2024: Model Submission Closes

May 24, 2024: System paper submission deadline

June 21, 2024: Notification of acceptance

July 5, 2024: Camera-Ready version

August 16, 2024: Workshop at ACL


Marta Costa-jussà, Pierre Andrews, Eric Smith, Prangthip Hansanti, Christophe Ropers, Elahe Kalbassi, Cynthia Gao, Daniel Licht, and Carleigh Wood. 2023. Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14141–14156, Singapore. Association for Computational Linguistics.