Triggering Models: Measuring and Mitigating Bias in German Language Generation
Master's Thesis by Angelie Kraft, August 2021
Abstract
Pre-training large language models on vast amounts of web-scraped text is a current trend in natural language processing. While the resulting models are capable of generating convincing text, they also reproduce harmful social biases. This thesis explores expressions of gender bias in German text generation. Analyses were performed on samples by the generative models GerPT-2 (GPT-2 [Radford et al., 2019] fine-tuned for German) and GPT-3 [Brown et al., 2020]. A German classifier for the concept of regard was developed after Sheng et al. [2019]. It captures the social perception of a person evoked by a description. For the development of the classifier, a dataset was crowd-sourced, cleaned, and independently annotated. GerPT-2 generated significantly more negative descriptions for male prompts than female prompts. Additional qualitative analyses grounded in the ambivalent sexism theory [Connor et al., 2017] revealed that both models reproduce different facets of sexism: A benevolent sexist caregiver bias and a hostile sexist sexualization bias towards females were found, as well as a perpetrator bias towards males. Bias mitigation triggers [Sheng et al., 2020] are debiasing tokens fitted through gradient-guided search. They reduce negative and increase positive and neutral regard. In this thesis, a trigger fitted on GerPT-2 mitigated negatively connotated sexist biases in both models. However, triggers also introduce unwanted contextualization, causing a content shift in the generated output. The trigger-based debiasing approach, hence, needs refinement to preserve domain independence. Finally, transferability to markedly higher-parameterized models, like GPT-3, is a valuable property that could facilitate low-threshold usage.