NLP models have been shown to exhibit gender bias, e.g., being more accurate when pronouns refer to gender-stereotypical entities, but results have so far been limited to English and a small set of tasks. For this shared task, we present a multi-language, multi-task challenge dataset that focuses instead on a single linguistic phenomenon: non-gendered possessive reflexive pronouns in gendered pronominal systems, found in Scandinavian, Slavic and Sino-Tibetan languages. The languages we include are Chinese, Danish, Russian, and Swedish; the tasks are machine translation (MT), natural language inference (NLI), coreference resolution, and language modeling. Using manually created templates, we have designed a challenge dataset to detect and quantify gender bias in existing models. Initial experiments showed strong, gender-related bias in several task-language combinations.

Specifically, we make use of the 60 occupations listed in Caliskan et al. (2017), which include gender percentages taken from the U.S. Bureau of Labor Statistics. The amount of data varies across task-language pairs. Every sentence mentions an occupation, a pronoun, and a description of an action. Our starting point is 4,560 data points for each task-language pair (76 sentences times 60 occupations). For NLI and coreference, we have three variations of each data point (masculine, feminine, and neutral), totalling 13,680 sentences per language. For machine translation, we consider masculine and feminine variations only.
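As a rough sketch of the dataset arithmetic above, the following Python snippet crosses 76 sentence templates with 60 occupations and then expands gender variants per task. The template and occupation strings here are placeholders, not the actual dataset contents; only the counts (76, 60, and the two and three gender variants) come from the description above.

```python
from itertools import product

# Placeholder stand-ins: the real dataset uses 76 manually created
# sentence templates and the 60 occupations from Caliskan et al. (2017).
TEMPLATES = [f"template_{i}" for i in range(76)]
OCCUPATIONS = [f"occupation_{j}" for j in range(60)]


def build_items(variants):
    """Cross every template with every occupation and gender variant."""
    return list(product(TEMPLATES, OCCUPATIONS, variants))


# NLI and coreference use three variants per data point;
# machine translation uses masculine and feminine only.
nli_coref = build_items(["masculine", "feminine", "neutral"])
mt = build_items(["masculine", "feminine"])

print(len(TEMPLATES) * len(OCCUPATIONS))  # 4560 base data points
print(len(nli_coref))                     # 13680 sentences per language
```

The 4,560 base data points and the 13,680 NLI/coreference sentences match the figures stated above; the MT count follows from the same cross-product with two variants instead of three.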
We will host the shared task on Kaggle, and instructions to participants will be published on the workshop website. To encourage inclusion, Kaggle will offer compute resources to all competing teams.