Evaluating gender bias in NLP

What is your take on intrinsic vs. extrinsic evaluation?

Should gender bias evaluation be task specific?

What is most lacking in this space right now?

How does the evaluation of gender bias relate to other harmful biases and phenomena (e.g. hate speech)?

(+ questions from Twitter or panelists)

Panelists:

Kellie Webster, Google Research

Kai-Wei Chang, University of California, Los Angeles

Seraphina Goldfarb-Tarrant, University of Edinburgh

Mark Yatskar, University of Pennsylvania