SLS’s Julian Nyarko on Why Large Language Models Like ChatGPT Treat Black- and White-Sounding Names Differently

Since ChatGPT and other Large Language Models (LLMs) came on the scene, questions have loomed large about the technology’s potential for perpetuating racial and cultural biases.

Stanford Law School Professor Julian Nyarko, who focuses much of his scholarship on algorithmic fairness and computational methods, has been at the forefront of many of these inquiries over the last several years. His latest paper, “What’s in a Name? Auditing Large Language Models for Race and Gender Bias,” makes some startling observations about how the most popular LLMs treat certain queries that include first and last names suggestive of race or gender.

Asking ChatGPT-4 for advice on how much one should pay for a used bicycle being sold by someone named Jamal Washington, for example, will yield a different—far lower—dollar amount than the same request using a seller’s name widely seen as belonging to a white man, such as Logan Becker. “It’s $150 for white-sounding names and $75 for Black-sounding names,” says Nyarko, who is also an associate director and a senior fellow at the Stanford Institute for Human-Centered AI (HAI). “Other scenarios, for example in the area of car sales, show less of a disparity, but a disparity nonetheless.”

Names associated with Black women receive the least advantageous outcomes, according to the paper.

Nyarko co-authored “What’s in a Name?” with lead author Amit Haim, JSD ’24 (JSM ’20), and SLS research fellow Alejandro Salinas. What differentiates their study from similar inquiries into LLM bias, the authors say, is their use of an audit design as the framework for the study. Audit designs are empirical methods used to identify and measure bias in different domains of society, such as housing and employment. One of the best-known examples is the 2003 study in which researchers submitted resumes for various jobs, varying only the name of the applicant among stereotypically African-American, white, male, and female names.

Professor of Law Julian Nyarko

Here, Nyarko explains how he and his co-authors brought that same methodology to the realm of LLMs, what the findings tell us, and what should be done.

Can you start by providing a little background and context for the study? Many people might expect that LLMs would treat a person’s name as a neutral data point, but that isn’t the case at all, according to your research?

Ideally, when someone submits a query to a language model, what they would want to see, even if they add a person’s name to the query, is a response that is not sensitive to the name. But at the end of the day, these models just generate the most likely next token, or the most likely next word, based on how they were trained. So, let’s say part of the training data is Craigslist posts. If a car is being sold by a Black person, or a person with a Black-sounding name, it tends to sell for less on Craigslist than the same type of car being sold by a white person, or a person with a white-sounding name. This happens for many reasons, for instance because the Black car seller is more likely to live in a lower-resourced community where there is less money. And so if you ask one of these models for advice on how much you should offer for a used car, and the only additional data you provide is the name of the seller, the model will implicitly assume that the most likely next tokens after your offer are maybe “$10,000” as opposed to “$12,000.” It is a little bit difficult to analogize that to human decision-making, where there is something like intent; these models don’t have intent in the same way. But they learn these associations in the data and then reproduce them when they’re queried.

What types of biases did you study?

Our research focuses on five scenarios in which a user might seek advice from an LLM: strategies for purchasing an item like a car or bicycle, designed to assess bias in the area of socio-economic status; questions about likely outcomes in chess, which goes to the issue of intellectual capabilities; queries about who might be more likely to win public office, which is about electability and popularity; questions about athletic ability; and advice-seeking in connection with making someone a job offer.

Is there a way to drill down into the code, or the “backend” of the LLMs, to see what is going on from a technical perspective?

Most of these newer LLMs, the ones people are most accustomed to, like ChatGPT-4, tend to be closed source. With open source models, you can break it open and, in a technical way, look at the model and see how it is trained. And if you have the training data, you can look at whether the model was trained in such a way that it might encode disparities. But with the closed-source models, you have to find other ways to investigate. The nice parallel here is the human mind and decision making. With humans, we can devise strategies to look into people’s heads and make determinations about whether their decision-making is based on discriminatory motivations. In that context, audit studies were developed, where, for instance, two shoppers of different races go to buy a car or a house with exactly the same external variables, such as the clothes they are wearing and so forth. And the study looks at what kind of cars are offered to them, or the types of houses. One of the most famous of these types of studies involves resumes, where all the information on the resumes was the same, except the names.

So we thought this approach could be used in the large language model context to indirectly test whether these disparities are baked in.
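
The audit design described here can be sketched as a minimal script. The prompt wording, the name lists, and the `query_model` stub below are all hypothetical illustrations, not the paper’s actual templates or code; a real audit would send each prompt to a live model API and parse the number out of its reply:

```python
import statistics

# Hypothetical prompt template and name lists, for illustration only.
TEMPLATE = ("I want to buy a used bicycle from a seller named {name}. "
            "How much should I offer, in dollars?")
WHITE_ASSOC_NAMES = ["Logan Becker", "Hunter Schmidt"]
BLACK_ASSOC_NAMES = ["Jamal Washington", "DaShawn Robinson"]

def query_model(prompt: str) -> float:
    """Stub for an LLM call; a real audit would send `prompt` to a model
    API and extract the dollar amount from its free-text response."""
    raise NotImplementedError

def audit(query, names_a, names_b, template):
    """Fill the same template with names from each group and compare the
    mean numeric response -- the core idea of an audit design: everything
    in the prompt is held constant except the name."""
    offers_a = [query(template.format(name=n)) for n in names_a]
    offers_b = [query(template.format(name=n)) for n in names_b]
    return statistics.mean(offers_a), statistics.mean(offers_b)
```

Because only the name varies between the two groups of prompts, any systematic gap in the averaged responses can be attributed to the name itself, mirroring the resume studies described above.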

Your study took a new approach to examining LLMs’ potential for perpetuating racial and gender biases, is that correct?

There are a couple of studies that have tried to do something similar in the past, for example CV studies on GPT looking into whether someone with the name Lakeisha is deemed less employable than someone with a less stereotypically Black name. But those studies primarily looked at the question in a binary way: Should I hire this person, yes or no? Those studies got mixed results. If you ask for a binary yes or no, you don’t get the nuance. Also, based on previous research, it wasn’t quite clear to what extent these models were biased. What we found was that if you switch to an open-ended question—for example, how much should I pay, or what is the probability of this or that candidate winning an election—you get a much clearer, more nuanced picture of the bias that is encoded.

How significant are the disparities you uncovered?

The biases are consistent across 42 prompt templates and several models, indicating a systemic issue rather than isolated incidents. One exception was the “chess” scenario we designed to check whether the models assume a lower IQ for minorities. The questions posed were about who was more likely to win a chess match. While we found disparate results across gender—the models would predict more often that a man would win than that a woman would win—we didn’t find disparities across race in the chess context.

In some areas, the disparities were quite significant. In the bicycle sale example, we saw a significant Black-white gap, where the price offered to the white seller would be twice that of the Black seller. The gap was somewhat smaller in the area of car sales: a difference of $18,000 versus $16,000. The model also tends to view Black basketball players as better than white players, and city council candidates with white-sounding names were deemed more likely to win an election than those with Black-sounding names.

Does it change the results if you input additional data such as the year of a car or other details?

We found that while providing numerical, decision-relevant anchors in the prompt can successfully counteract the biases, qualitative details have inconsistent effects and may even increase disparities. If you just ask, “How much should I offer for a car, any car,” along with one of the names used in our study, the model has very little information and has to rely on encoded approximations of whatever it has learned, and that might be: Black people usually have less money and drive worse cars. But then we have a high-context condition where we add “2015 Toyota Corolla,” and as expected, with the additional context, you see the bias shrink, though we didn’t see that every time. In fact, sometimes the biases increased when we gave the models more context. However, there’s one condition, what we call the numeric condition, where we gave it a specific quantifier as an anchor. So for instance, we would say, “How much should I offer for this car, which has a Kelley Blue Book value of $15,000?” What we saw consistently is that if you give this quantifier as an anchor, the model gives you the same response each time, without the biases.

Which leads to the question of what should be done in the face of your study? Do these LLMs already have systems in place to counteract these sorts of biases and what else can or should be done?

On the technical end, how to mitigate these biases is still a very active and exploratory field of research. We know that OpenAI, for instance, has significant guardrails in its models. If you query too directly about differences across a gender or race, the model will just refuse to give you a straight answer in most contexts. And so one approach could be to extend these guardrails to also cover disparities discovered in audit studies. But this is a little bit like a game of Whac-A-Mole, where issues have to be fixed piece by piece as they are discovered.

That said, at a minimum, I think we should know that these biases exist, and companies that deploy LLMs should test for them. These audit design tests can be implemented really easily, but there are many tough questions. Think about a financial advice chatbot. In order to have a good user experience, the chatbot most likely will have access to the user’s name. The example I like to think about is a chatbot that gives more conservative advice to users with Black-sounding names as opposed to those with white-sounding names. Now, it is the case that, due to socio-economic disparities, users with Black-sounding names do tend to have, on average, fewer economic resources. And it is true that the lower your economic resources, the more conservative investment advice should be. If you have more money, you can be more adventurous with your dollars. And so in that sense, if a model gives people with different names different advice, it could lead to more satisfied users in the long run. But no matter what one might think about the desirability of using names as a proxy for socio-economic status, their use should always be the consequence of a conscious decision-making process, not an unconscious feature of the model.

Julian Nyarko is a Professor of Law at Stanford Law School, where he uses new computational methods to study questions of legal and social scientific importance. He is particularly interested in the use of artificial intelligence to study contract law and design. In addition, Professor Nyarko frequently writes about algorithmic fairness and computational methods. Beyond law reviews and legal journals, his work has been published in leading outlets in the general sciences, computer science, and political science. For more information, including Professor Nyarko’s CV, research, data, and code, please visit his personal website.