Benchmark for Measuring Intersectional Bias in Race and Gender of Large Language Models Using Synthetic Data 


Vol. 14,  No. 2, pp. 82-94, Feb.  2025
https://doi.org/10.3745/TKIPS.2025.14.2.82


  Abstract

Large Language Models (LLMs) are generative AI systems trained on vast datasets, capable of producing human-like text and widely used across industries. However, LLMs risk generating biased outputs by internalizing stereotypes related to race and gender, potentially exacerbating social inequalities. To address this, researchers design datasets with queries that elicit biased responses to quantitatively evaluate LLM bias. Despite progress, existing studies face two key limitations. First, most datasets rely on manual curation of crawled data, restricting scalability and diversity in bias evaluation scenarios. Second, research on intersectional bias—arising from interactions between domains such as race and gender—is limited, as most studies focus on single-domain biases. This approach, while insightful, fails to capture the complexities of real-world, multidimensional biases. This study introduces a large-scale dataset of 16,082 entries to evaluate intersectional biases in race and gender within LLMs. Using U.S. labor and population statistics, we analyzed occupational distributions and their associated race-gender combinations, defining Pro-stereotype (aligned with societal stereotypes) and Anti-stereotype (counter to stereotypes) categories. Positive and negative contexts were systematically constructed for each occupation's race-gender pairing, and the DPSDA2 synthetic data generation method was applied to expand scenario coverage. The dataset consists of multiple-choice items, each with one Pro-stereotype sentence and seven Anti-stereotype sentences, enabling quantitative bias evaluation based on LLM responses. This work addresses the limitations of existing studies, offering a scalable and comprehensive framework for assessing intersectional biases in LLMs.
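The abstract's evaluation protocol — multiple-choice items with one Pro-stereotype and seven Anti-stereotype sentences — suggests a simple bias metric: the rate at which a model selects the Pro-stereotype option, compared against the 1/8 chance rate of an unbiased chooser. The sketch below illustrates this idea; the field names (`pro`, `anti`, `context`) and the score definition are assumptions for illustration, not the paper's actual schema or metric.

```python
import random

def bias_score(items, choose):
    """Fraction of items where the model picks the Pro-stereotype
    option. With 1 Pro- and 7 Anti-stereotype options per item, an
    unbiased chooser converges to 1/8 = 0.125 (hypothetical metric)."""
    hits = 0
    for item in items:
        options = [item["pro"]] + item["anti"]  # 8 candidate sentences
        random.shuffle(options)                 # guard against position bias
        picked = choose(item["context"], options)
        if picked == item["pro"]:
            hits += 1
    return hits / len(items)

# Toy usage: a random chooser stands in for an LLM query function.
random.seed(0)
items = [
    {"context": "Which sentence best completes the scenario?",
     "pro": f"pro sentence {i}",
     "anti": [f"anti sentence {i}.{j}" for j in range(7)]}
    for i in range(1000)
]
score = bias_score(items, lambda ctx, opts: random.choice(opts))
```

A score well above 0.125 would indicate alignment with societal stereotypes; in practice `choose` would prompt the LLM with the context and the shuffled options and parse its selection.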

  Cite this article

[IEEE Style]

J. Lee and H. Bae, "Benchmark for Measuring Intersectional Bias in Race and Gender of Large Language Models Using Synthetic Data," The Transactions of the Korea Information Processing Society, vol. 14, no. 2, pp. 82-94, Feb. 2025. DOI: https://doi.org/10.3745/TKIPS.2025.14.2.82.

[ACM Style]

Jueun Lee and Ho Bae. 2025. Benchmark for Measuring Intersectional Bias in Race and Gender of Large Language Models Using Synthetic Data. The Transactions of the Korea Information Processing Society 14, 2 (2025), 82-94. DOI: https://doi.org/10.3745/TKIPS.2025.14.2.82.