Let GPT Be the Judge
Not everything that counts can be counted, but maybe GPT can measure it anyway.
The modern world runs on metrics. Companies constantly try out changes to their product and keep the things that make the product better, as judged by a metric. Today the vast majority of such metrics are based on counting something, whether it’s news websites counting page views, Facebook counting user interactions, or politicians counting poll numbers.
The great thing about simple count metrics is that you can compute them quickly and cheaply. This allows them to be used as part of a rapid iteration cycle. And rapid iteration cycles are the secret behind all progress. The bad thing about simple metrics is that “not everything that counts can be counted”. Count metrics aren’t good at capturing things like “are you harming people” that are actually pretty important.
But what if we used GPT (or other similar LLMs) to compute our metrics? Rather than Facebook counting comments, GPT could count how often people helped each other or strengthened their social relationships. Rather than a website counting page views, GPT could count whether users are seeing content they gain real benefit from, thus boosting the website's long-term brand.
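To make this concrete, here is a minimal sketch of what an LLM-computed metric could look like, assuming the OpenAI chat completions API. The model name, the prompt wording, and the is_supportive helper are illustrative assumptions rather than a tested implementation; the point is just that the "counter" becomes a judgment call made by the model.

```python
from openai import OpenAI

client = OpenAI()

def is_supportive(comment_text: str) -> bool:
    """Ask the model whether a comment helps someone or strengthens a
    social relationship. Returns True if the model answers YES."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You judge social-media comments. Answer only YES or NO."},
            {"role": "user",
             "content": ("Does this comment help another person or strengthen "
                         f"a social relationship?\n\nComment: {comment_text}")},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def supportive_interaction_metric(comments: list[str]) -> float:
    """Instead of counting comments, report the fraction judged supportive."""
    judged = [is_supportive(c) for c in comments]
    return sum(judged) / len(judged) if judged else 0.0
```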
In the rest of this essay I explore this idea further, including both the ways it could fix problems society has today, and the ways it could create a class of creepy new problems.
This essay is part of a broader series of essays about AI and social organization.
In GPT is Rather Good at Feed Ranking, I showed that GPT is very good at judging whether content falls into categories like “political” or “disrespectful”, potentially opening the door to creating custom ranking by just saying what kind of content you want.
In The Power of High Speed Stupidity, I explained how all things that matter are created by an iterative process where we try out possible changes to a thing, and keep the changes that make it better. I argued that the secret to making this work well is to allow very rapid iteration.
In Simulate the CEO, I argued that large human organizations function by giving members the ability to simulate the CEO and make the decisions the CEO would have taken.
In this essay, I glue all these ideas together, and argue that the future of rapid iteration is using LLMs like GPT to guide the iteration process, by evaluating how well the thing we want to improve meets our vision of what we want.
Much has been written about the supposed social ills caused by online social platforms like Facebook and Twitter.
I spent some time in the Facebook Integrity group, trying to make things better. We had good intentions, and we definitely made some progress, but the problem was hard.
One of the things that made the problem hard was that, while it was very easy to see with our own eyes when something bad was happening on the platform, it was very hard to measure how often such things happened at scale. That made it hard to know whether the changes we were making to the product were actually making things better, or whether changes other people were making were undoing the progress we had made.
News content can be vital civic information or misleading alarmism. When people talk to each other they can be helping each other or harassing each other. When someone spends lots of time watching videos, they can be learning useful information, or rotting their brains with garbage. When someone provides factual information, it can be true or false.
None of these good/bad distinctions are easy to make with a simple count metric. None of them are easy to classify with the primitive machine learning models of the pre-LLM era (we tried when I was at Facebook). All of them are doable with GPT-4 - indeed I’ve built successful prompts for several of them.
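As a sketch of how the "how often does this happen at scale" problem could be attacked, one could sample content and have the model assign each item to a labeled category, then report prevalence. This assumes the same illustrative API setup as above; the category names, prompt, and sample size are hypothetical choices, not the prompts I actually built.

```python
import random
from collections import Counter
from openai import OpenAI

client = OpenAI()

CATEGORIES = ["civic information", "misleading alarmism", "neither"]

def classify_news_item(text: str) -> str:
    """Ask the model to pick exactly one category for a news post."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        temperature=0,
        messages=[{
            "role": "user",
            "content": (f"Classify this news post as exactly one of: "
                        f"{', '.join(CATEGORIES)}.\n\nPost: {text}\n\n"
                        "Answer with the category name only."),
        }],
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in CATEGORIES else "neither"

def prevalence(posts: list[str], sample_size: int = 500) -> dict[str, float]:
    """Estimate how often each category occurs from a random sample."""
    sample = random.sample(posts, min(sample_size, len(posts)))
    counts = Counter(classify_news_item(p) for p in sample)
    return {c: counts[c] / len(sample) for c in CATEGORIES}
```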
A big part of the reason why companies want their employees to be in the office rather than working from home is that it allows them to make sure that employees are putting at least a minimum amount of time into their job and aren't just slacking off. This comes at a huge cost - hours spent commuting, and people having to buy expensive houses in order to all live close to the same employers.
Some companies try to solve this problem by having people install apps on their laptops that check the employee is at their keyboard for some minimum amount of time. But some people like to think while walking around, or watch TV while jiggling their mouse, or meet with other people as part of their job, so such mechanisms have limited utility.
In theory you could solve the problem by instead having a manager look at the work people are doing and judge whether they are getting roughly as much done as would be expected if they were actually working the number of hours they signed up for. But it takes a lot of human judgment to determine how hard a particular task is, and to carefully evaluate the effort that is likely to have gone into everything an employee did. So just requiring people to come into the office ends up being the cheaper option.
But what if a large language model could look at the actual output you produced at the end of a week, evaluate how difficult that work is likely to have been, and judge whether this is a reasonable amount of work to have accomplished for someone at your pay grade and expected hours?
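A rough sketch of what that evaluation might look like, again under the same illustrative API assumptions: hand the model a list of the week's completed artifacts and the employee's role and expected hours, and ask for a judgment. The function name, model choice, and prompt are hypothetical.

```python
from openai import OpenAI

client = OpenAI()

def assess_weekly_output(artifacts: list[str], role: str, hours: int) -> str:
    """Ask the model for a rough judgment of whether a week's output is
    plausible for the stated role and hours. Returns free-text reasoning."""
    summary = "\n".join(f"- {a}" for a in artifacts)
    prompt = (
        f"An employee is a {role} expected to work about {hours} hours this week.\n"
        f"Here is the work they report having completed:\n{summary}\n\n"
        "Estimate how difficult this work was and say whether it is a "
        "reasonable amount for the role and hours, with a short justification."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```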
The idea of using AI to generate metrics doesn’t need to be limited to text. Recent AI models have shown an impressive ability to generate and understand images, as anyone who has used tools like Dall-E will know. One can thus imagine a future where core company metrics are calculated by using AI tools to understand what is going on in images or video.
If you run a grocery store then you want your store to be clean, your staff to be friendly, and your customers to find products easily. All of these are hard to measure with counters, but potentially very easy to measure using modern AI.
If you run a city then you want your city to feel welcoming, have little crime, and be filled with positive social interactions. A few video cameras and a bit of AI and this is totally possible.
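For image-based metrics, a vision-capable model could be queried in much the same way as for text. The sketch below assumes the OpenAI chat completions API's image-input message format; the model name, rating scale, and rate_store_cleanliness helper are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()

def rate_store_cleanliness(image_url: str) -> str:
    """Ask a vision-capable model to rate store cleanliness from a photo."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative vision-capable model
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Rate the cleanliness of this store aisle from 1 "
                          "(filthy) to 5 (spotless). Answer with the number only.")},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()
```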
The impact of GPT-metrics need not be limited to companies. One can imagine societies in general choosing to judge themselves by GPT metrics rather than the counter-based metrics they use today.
People sometimes complain that one of the evils of “late capitalism” is the way that everyone optimizes for simplistic metrics like profit, or academic tests, or consumption, rather than the things that really matter for human flourishing, like feeling your life has purpose, or being part of a supportive social group, or being intellectually stimulated.
It’s hard to measure human flourishing with simple counts, but maybe we could measure it with an LLM, either by observing people’s behavior, reading their email, or having them chat with an LLM to see how flourishing-ish they seem.
Maybe AI will finally give us the ability to have rapid progress, while making sure that progress is actually aligned with what is meaningful to us as human beings.
One common criticism of large language models is that it's hard to work out why they gave the results they did. If your metric is a simple count then it's easy to understand where the number came from, and to look back through logs to see how the count was computed. If your metric is GPT's judgment about whether your company is doing something immoral, then that's a much fuzzier concept, and you can't work backwards through logs in the same way to work out why it gave the answer it did.
On the other hand, the same is true of humans. If I ask you whether something you did was immoral, I can't look at your neurons to work out why you gave the answer you did, or look at a log of everything you observed. However, I can just ask you why you came to that conclusion, and ask follow-up questions to get you to justify your reasoning. In practice GPT is pretty good at that too, and it's likely to get better in the future. Critics will argue that the reasoning GPT gives may not be the real reason it gave the answer it did, but the same is true of humans. It is, however, likely to be good enough to help you know what to do differently in order for GPT to score you higher in the future, and that's what matters in practice.
Similarly, if an LLM has analyzed the large scale behavior of my company, it might be able to answer questions about why the company has the issues it does much more effectively than can be done by having an analyst slice count metrics. Count metrics can be misleading, and it’s often hard to know which ways to slice things to reveal the insights that are important. Maybe in the future we’ll be able to ask an LLM questions like “why are my users unhappy” and get well-informed, actionable answers.
The downside of all this is that it’s really really creepy.
The great thing about having a human evaluate how your life is going is that humans don’t have an easy way to permanently store everything they see and read. But if everything in my life is being studied by an AI then there is little to stop the AI from adding its observations to a permanent record, and potentially using that information for nefarious purposes like tracking government critics. Of course, tech companies are already collecting more data about people than we’d like, either to provide us with products (like email) or to serve us ads. But if gathering even more data about us allows companies and societies to compute better metrics then maybe they would collect even more.
A lot of people are concerned about the risk of super-intelligent AI taking control of humanity, turning us into its slaves. Cynics say “that won’t happen, humans would never give AI that much control”, but if we decided to have AI compute the metrics that determine both what companies do and what societies do, then we’d essentially be handing those super-intelligent AIs power over humanity - and it’s not clear that’s a good idea.
Large language models are created by humans, and those people are biased. They have particular ideological beliefs, and people’s earnestly held ideological beliefs have a natural tendency to align with their self interest. This inevitably leads to political and ideological bias in the outputs of the LLMs those humans create. If a small number of powerful people are creating the world's LLMs then giving those LLMs the power to compute the metrics that run the rest of society gives those people outsized power. It’s easy to imagine biased definitions of “misinformation” or “helpful behavior” or “welcoming environment” or “productive work”, and those biases can allow centralized control of the culture of a whole society.
A big part of the reason why medieval society was primarily feudal was that rulers lacked the technology to exert centralized control. When new centralizing technology arrived in the form of good roads, the printing press, and the telegraph, societies rapidly transitioned to the Nation State. Similarly, some argue that our current state of relatively decentralized liberal capitalism exists because current technology makes it hard for a ruler to exert central control over society. It's possible that AI-driven metrics could unleash a new era of centralized control, and that probably isn't good.
In the short term, I think it’s pretty much inevitable that we’ll see LLMs like GPT play a growing role in how organizations compute metrics, particularly as LLMs get cheaper and more effective. They are simply too good at doing this for it to not happen.
As this happens, I'd expect a ton of knock-on consequences, some of which will be good, some of which will be bad, and some of which will be totally unexpected.
We live in interesting times.
I cannot help thinking of Richard Brautigan's poem "All Watched Over by Machines of Loving Grace".