GPT is Rather Good at Feed Ranking
If ranking is as easy as saying what should rank highly, then lots of interesting things happen.
I’ve done ranking-adjacent work at three companies (Google, Quora, and Facebook), so when OpenAI released the ChatGPT API a few days ago, my natural first instinct was to ask myself “I wonder if this would be good at ranking social media feeds?”.
Based on my experiments so far, it looks like the answer is “yes”. Indeed GPT¹ makes content-based feed ranking so easy that it has the potential to completely change the way we think about ranking - potentially shifting ranking power from platforms to individual users and groups.
If you want to jump straight to a demo, go to https://rankmagic.org/ and have a play around.
This demo ranks a fixed snapshot of 1000 Mastodon posts using a user-customizable combination of engagement features and GPT-derived content features. Drag the sliders to say what you think is important. The GPT-based content features were implemented by simply asking GPT whether a post is political or disrespectful, using a prompt that explains each concept in more detail.
What’s notable about this is not how well it works (better classifiers exist), but how quickly I was able to put this together by using GPT. The whole thing took me about a day of work, of which around a couple of hours was prompt engineering.
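Under the hood, a ranking like the demo’s can be thought of as a weighted sum of per-post features, with the sliders setting the weights. Here is a minimal sketch of that idea (the feature names, values, and weights are illustrative, not the demo’s actual ones):

```python
# Minimal sketch: rank posts by a user-weighted sum of engagement features
# and GPT-derived content features. All names and numbers are made up.
posts = [
    {"id": 1, "likes": 120, "replies": 4,  "is_political": 1, "is_disrespectful": 0},
    {"id": 2, "likes": 8,   "replies": 30, "is_political": 0, "is_disrespectful": 0},
]

# What the demo's sliders would control: one weight per feature.
weights = {
    "likes": 1.0,
    "replies": 2.0,
    "is_political": -50.0,       # negative weight pushes political posts down
    "is_disrespectful": -100.0,  # ...and disrespectful posts down even further
}

def score(post):
    return sum(weight * post.get(name, 0) for name, weight in weights.items())

# Highest score first.
for post in sorted(posts, key=score, reverse=True):
    print(post["id"], score(post))
```

Negative weights on the content features let you demote political or disrespectful posts without removing them from the feed entirely.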
But first we should step back and talk about how feed ranking was done in the dark ages of last week.
Broadly speaking, social platforms like Facebook and Twitter rank posts based on a combination of engagement features and content features. Engagement features are things like how likely you are to click/like/share/comment-on a post. Content features are what the post actually contains (e.g. is it sexual, or does it have a clickbait-style headline).
Typically the bulk of feed ranking work is done using engagement features. One reason for this is that, by showing people the things they are most likely to click on, you can argue that your ranking is “neutral” and isn’t biased towards your own judgements about what is good or bad. The other reason is that content-based ranking has historically been a total pain to implement.
Let’s say I want to have less political content in user feeds, so I want to create a classifier that can tell whether content is political. I’ll likely go through a process like this:
1. Grab a random sample of content and manually label ~1000 of them with my own opinions of whether they are political.
2. Based on those labels, write rater guidelines that a human contractor I’ll never meet can use to decide whether to label something as political.
3. Get those rater guidelines approved by legal and PR, since they are almost certain to leak to the press.
4. Negotiate for rater budget, sign a contract with a rating company, send the guidelines off to human raters, and wait a while to get data.
5. Find out that the human-rated labels are terrible, because the guidelines weren’t clear enough and the raters misunderstood them.
6. Come up with guesses for why the raters gave the weird labels they did and revise the guidelines accordingly.
7. Jump back to step 3, and iterate until the data is high enough quality to train a machine learning model, eventually ending up at step 8.
8. Build a model that can predict the labels from the rater data.
9. Regularly repeat this process, since which topics count as “political” seemingly changes every few months.
10. Find out that the guidelines leaked to the press, and lots of people are very angry at us because they don’t like how we defined “political”.
11. Find out that the raters are getting traumatized by the horrible experience of having to look at lots of political posts, and another load of people are very angry at us.
12. Resolve to avoid creating content-based features ever again.
The whole process took months, cost millions of dollars, traumatized a few thousand raters, required many highly skilled employees, created PR risks, and was generally an unpleasant experience for everyone involved.
Now let me show you how easy it is with the ChatGPT API:
1. Open an interactive ChatGPT session, give it a brief definition of “political”, and then ask it to tell me which of a list of posts are political.
2. If ChatGPT labels some posts differently to how I expected, ask it why.
3. Based on GPT’s explanations for its “wrong” answers, update my definition of “political” and jump back to step 1. Repeat until it consistently labels posts the way I like.
4. Send ChatGPT all the posts I want to rank and use its “political” labels to adjust their ranking (a sketch of this step follows below).
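Here is a rough sketch of what step 4 could look like with the openai Python package. The prompt wording, model name, and response parsing are my own assumptions (and the library’s interface has changed across versions), so treat this as illustrative rather than the demo’s actual code:

```python
import openai

openai.api_key = "YOUR_API_KEY"

# Placeholder definition; in practice this is the text you iterate on in steps 1-3.
DEFINITION = (
    "A post is 'political' if it advocates for or against a party, politician, "
    "election, law, or government policy."
)

def label_political(post_text):
    """Ask the chat model whether a single post is political; return True/False."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": DEFINITION},
            {
                "role": "user",
                "content": f"Is the following post political? Answer YES or NO.\n\n{post_text}",
            },
        ],
        temperature=0,
    )
    answer = response["choices"][0]["message"]["content"].strip().upper()
    return answer.startswith("YES")

posts = ["Vote for the new transit bill!", "My cat learned to open doors."]
print([label_political(p) for p in posts])
```

In practice you would probably batch many posts into a single prompt to cut down on the number of calls; the one-post-per-call version just keeps the sketch simple.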
This whole process took one moderately skilled person about an hour of work, and the quality is pretty comparable to doing it the slow, hard, painful way (go to https://rankmagic.org/ to judge for yourself).
It’s hard to overstate how much easier this was than the previous way. It’s so easy that it has the potential to completely change the way that content-based ranking is done. It’s so easy that it makes it possible to imagine a world where individual users or groups could create their own custom rankings by just using human language to say what kind of content they want to rank high or low.
At this point you are probably going to raise the obvious objection of “isn’t GPT impractically expensive compared to running a normal ML model”, and the answer is of course yes. Using current ChatGPT API pricing, it cost me roughly $2 to rank the 1000 posts in my little demo. However, Twitter has 200 billion tweets a year. If I were to pay the same per-post cost as in my demo, it would cost $400 million a year just to run these models. That’s a problem.
Fortunately you don’t actually have to apply GPT to every post. Amazon recently showed that it’s practical to use a big expensive model like GPT to generate labels that can be used to train a small cheap model. In this world, GPT is essentially taking over the role previously occupied by human raters. I haven’t tried this myself yet, but I’d expect it to work pretty well. My back-of-the-envelope calculation suggested it would cost around $1000 to generate enough GPT labels to train a good student model.
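To make that concrete: at roughly $0.002 per post (the $2-per-1000-posts figure above), a ~$1000 labeling budget buys on the order of 500,000 GPT labels. Here is a minimal sketch of the teacher/student idea, assuming the GPT labels are already in hand and using an arbitrary cheap classifier (TF-IDF plus logistic regression) as the student:

```python
# Sketch: use GPT as the "teacher" to label posts, then train a small, cheap
# "student" model that can be run on every post at serving time.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Assumed to come from the GPT labeling step: (post_text, is_political) pairs.
gpt_labeled_posts = [
    ("Vote for the new transit bill!", 1),
    ("My cat learned to open doors.", 0),
    # ... hundreds of thousands more GPT-labeled examples
]

texts = [text for text, _ in gpt_labeled_posts]
labels = [label for _, label in gpt_labeled_posts]

# The student: cheap enough to run on billions of posts.
student = make_pipeline(TfidfVectorizer(), LogisticRegression())
student.fit(texts, labels)

# At serving time only the student runs; GPT is only needed to refresh labels.
print(student.predict(["The senator's speech was a disgrace."]))
```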
So what changes if it becomes super-easy to create content-based ranking models?
One thing that happens is that we can actually build feeds that rank the content we want to see. People have talked for ages about wanting their feeds to contain more content that is constructive, trustworthy, and important, and less content that is incendiary, polarizing, and lacking in context. We’ve struggled for years to find engagement signals that proxy these ideas, but all of them have an “I know it when I see it” aspect to them, so maybe we can just tell a large language model that this is the kind of content we want, and actually get what we want.
Another thing that happens is that it becomes practical to democratize ranking, allowing individuals and groups to decide what kind of content they want to rank high in their feeds. Historically, ranking has been so difficult to do that it has only been practical for large centralized platforms to do it, and individuals and groups just had to accept what they were given. But if content-based ranking becomes easy, then maybe every group can experiment with its own ideas of what kind of content is best.
One possible future is an “app store” of ranking models, where lots of individuals and groups create their own content-based feed ranking models, and individuals and groups can choose the models they like.
So what about bias?
One of the big reasons why products like Facebook have been reluctant to use content-based ranking is that they want to avoid accusations of bias. Engagement-based ranking creates its own issues, but it’s generally more resistant to accusations that you are favoring a particular political worldview, or the interests of your company.
However many of these problems go away if you outsource ranking to groups and individuals. While I might be skeptical of an “offensiveness” classifier created by a large corporation, I might feel more comfortable being able to choose between several such classifiers created by different organizations that I trust.
A related issue is explainability. People talk about deep learning models as being opaque compared to old-school models, but in some ways they are actually easier to understand. With an old-school model you could look at the internal weights to see how it reached its decision, but you had to really understand the tech to do it. With a GPT model you can’t understand the weights, but you can just ask it why it made the decision it did in plain English, and it will usually give a pretty good explanation. Maybe future models will even have GPT suggest improvements to its own prompt to avoid whatever disagreements you have with its judgements.
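For instance, after a classification call you can simply ask for the reasoning in a follow-up request. A rough sketch, using the same assumed openai setup as the earlier example:

```python
import openai

openai.api_key = "YOUR_API_KEY"

def explain_label(post_text, label):
    """Ask the model, in plain English, why a post got the label it did."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": (
                    f"This post was labeled as {label}:\n\n{post_text}\n\n"
                    "Explain in one or two sentences why that label might apply."
                ),
            },
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

print(explain_label("Vote for the new transit bill!", "political"))
```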
Perhaps the bigger bias issue is the bias of the underlying language model. If all content classifiers are being implemented on top of GPT then they might all bias towards the opinions of OpenAI (GPT’s creator). One potential solution to this is to have a variety of competing companies offering competing large language models, and indeed that seems to be the way things are heading.
Or maybe I’m wrong and there is some reason why none of this will happen. Let me know in the comments.
¹ Most of what I say about GPT and ChatGPT likely applies to other Large Language Models too, but “Large Language Models” is a pain to say, so for the purposes of this post I’m just going to say GPT most of the time.
This is super super interesting. While I’m skeptical that many people actually want to see the content they claim to value, I reckon there are definitely large (and monied) cohorts who would value this—especially sitting atop Google. I would instabuy access for e.g. $10/mo to a few of these if they seemed promising. Even just filtering all the SEO-gamed crap out of my SERP would be worth that imo, which presumably you could do by weighting heavily against common formatting techniques, keyword stuffing, etc.
I also worked on feed ranking at FB - it was interesting playing around with rankmagic. Aside from all of the larger & exciting use cases (which I have a lot of thoughts on), I think even just being able to build up better intuition around what type of content gets distributed when you change the weights of various signals is useful. Otherwise the feedback loop is very long - running an experiment and analyzing the results at an aggregate content level vs just tuning a knob and seeing what happens.