Using Codex Well (Experiences so Far)
The future of work is managing a team of AI agents.
My manager asked me to give a talk on how to use OpenAI’s new Codex coding agent. That somehow turned into this blog post.
OpenAI released Codex on May 16th. AI coding tools aren’t new, but this one feels like a step change. Previous tools worked best as a way to help me write code, but with Codex I’ve found myself mentoring a team of agents as they write code. Used well, this is a superpower. Used poorly, it wrecks your code.
It’s worth understanding Codex even if you don’t write code, because this model of “manage a team of AI agents who do work for you” is likely coming to every other field of knowledge work pretty soon.
I haven’t been using Codex for very long, but neither has anyone else (outside of OpenAI), so it’s possible some of the lessons I’ve learned will be useful to others. In particular:
Have a ton of sanity checks - Codex alternates between being absolutely brilliant and making dumb mistakes. If you don’t have checks to catch the dumb stuff, you are going to have a bad time. Fortunately Codex is great at setting up those sanity checks.
Get your code clean before getting Codex to work in it - Codex does a better job if your code is super-clean. Fortunately Codex is great at doing the kind of refactorings needed to get your code clean.
Document how things should be written - If you don’t tell Codex what you think it means to write good code, then it’s not going to write what you think is good code. So make sure you have a set of docs describing good style for the various parts of your code base and kinds of changes you might want. Fortunately Codex is great at helping write these docs.
Don’t do big things all at once - If you tell Codex to do something really big, it’s likely to either get stuck, or do something weird. Instead you should give it a clear high level plan, and break it down into a set of smaller well-defined tasks. Fortunately Codex is great at drafting such plans and sub-tasks.
Use Codex to help Codex - A lot of the art of using Codex well seems to be to use one Codex agent to get your system into a state that will let other Codex agents work well - whether it’s writing tests, cleaning up code, documenting things, or creating detailed work plans. This same principle likely applies elsewhere.
In the following sections I’ll go into each of those in more detail. I mostly work in JavaScript/TypeScript, so some of my recommendations will be specific to that context, but a lot should generalize, not just to coding but to knowledge work in general.
Take everything I say here with a moderate handful of salt. I (like most other people) am still fairly early in the journey of using Codex, and most of my work so far has been spent applying Codex to tech-debt tasks rather than creating entirely new features.
Sanity Checks
Codex can do great work, but it often makes weird mistakes humans wouldn’t make, like deleting an important line of code because it was hard to refactor. To use Codex well, you need to give it a way to recognize its own mistakes and fix them.
In particular:
Get your unit test coverage super-high - Good unit tests will catch cases where the agent broke your system.
Port JavaScript code to TypeScript - TypeScript types help catch subtle bugs Codex introduces.
Use ESLint aggressively - Don’t just use regular ESLint rules, but also add custom lint rules that enforce conventions you want followed.
If possible, get end-to-end tests running in the Codex container - This will catch errors not picked up by type checking, though it may be tricky to get working.
Put important tests in GitHub Actions - This guards against cases where Codex didn’t actually run the right tests.
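To make that last point concrete, here’s a sketch of a minimal GitHub Actions workflow that runs checks on every pull request. The `typecheck` and `lint` script names are assumptions - substitute whatever scripts your package.json actually defines:

```yaml
# .github/workflows/ci.yml - run type checks, lint, and tests on every PR
name: CI
on:
  pull_request:
  push:
    branches: [main]
jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run typecheck   # assumed script, e.g. "tsc --noEmit"
      - run: npm run lint        # assumed script, e.g. "eslint ."
      - run: npm test
```

Because this runs on GitHub’s infrastructure rather than inside the agent’s container, it catches the case where Codex claims the tests pass but never actually ran them.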
Setting up all this stuff is a lot of boring work, but it’s the kind of boring work that Codex excels at, since it’s well-defined and easy to evaluate whether it was done correctly. So use Codex to help you use Codex.
It’s an interesting property of intelligence that it’s much easier to evaluate whether something is good than to create something good; indeed, this is the fundamental principle behind machine learning. If you have a way of measuring how good something is, and a way of rapidly iterating on it, then you can build amazing things.
Clean Code
You’ve probably heard that AI is great at writing the first version of an app, but that the code rapidly becomes a mess as you ask the AI to extend what it wrote. This is partly because AI agents work best in a clean code base, and an empty project is the ultimate clean code base. If you want Codex to work well, first use Codex to make your code super-clean.
In particular:
Get rid of circular dependencies - They make it easy for you, and the agent, to run into subtle bugs.
Get rid of relative imports - They make it harder for you, and Codex, to know what is being linked to, and harder to move files around in a refactoring.
Enforce consistent naming - The same word should always mean the same thing, and you should never use two words to mean the same thing.
Follow concepts consistently - Your code should apply a set of well defined concepts in a consistent way.
Get rid of out-of-date APIs - Both their implementations and their uses. Port everything to the same, latest version.
Put code in the right places - Your repo probably has code in places that don’t fully make sense given the logical organization of your code.
Comment things well - Each file should have a short comment giving an overview of what it does, how it relates to the broader model, and anything tricky about how it works.
Fix stale documentation - Human-written docs tend to rapidly become out of sync with the way the code works, making it harder to understand.
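Some of these clean-ups can be enforced mechanically rather than by vigilance. As a sketch (assuming you use ESLint with the `eslint-plugin-import` package installed - adjust the patterns to your repo), rules like these flag circular dependencies and ban relative parent imports:

```javascript
// eslint.config.js - sketch of lint rules backing up the clean-ups above
// (assumes eslint-plugin-import is installed; patterns are illustrative)
import importPlugin from "eslint-plugin-import";

export default [
  {
    plugins: { import: importPlugin },
    rules: {
      // Flag circular dependencies between modules
      "import/no-cycle": "error",
      // Ban relative imports that reach up out of the current directory
      "no-restricted-imports": ["error", { patterns: ["../*"] }],
    },
  },
];
```

Once rules like these are in place, Codex gets immediate feedback when it reintroduces the problem, instead of you having to catch it in review.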
Fortunately, like writing sanity checks, these kinds of code clean-ups are the kind of well-defined, easily-evaluated tasks that Codex excels at. If Codex struggles to work well in a part of your code base, think about whether there are ways you could clean up that code so that Codex could work better.
Style Guides
Left to its own devices, Codex will write code in a style you don’t like and abuse your abstractions. You can make this a lot better by writing a set of docs describing how you want things to be done.
In particular:
Have a good AGENTS.md file - This is the standard file Codex always looks at. It should tell Codex what tests to run, the really important things to know, and what other docs to look at for more detail on specific areas.
Write area-specific doc files - If Codex does something you don’t like, don’t just give it feedback in that particular task. Put the principle you want followed in a doc committed to the repo, link to it from AGENTS.md, and mention it in tasks where it is relevant.
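To make this concrete, here’s a sketch of what a minimal AGENTS.md might contain. The script names and doc paths are placeholders for whatever your repo actually uses:

```markdown
# AGENTS.md

## Checks to run before finishing a task
- `npm run typecheck` - must pass with no errors
- `npm run lint` - must pass with no warnings
- `npm test` - run the full unit test suite

## Really important things
- Never delete a failing test to make the suite pass; fix the code instead.
- Use absolute imports only; relative imports are banned by lint.

## Where to look for more detail
- `docs/style-guide.md` - naming and code style conventions
- `docs/architecture.md` - how the main modules fit together
```

Keeping this file short and linking out to area-specific docs works better than one giant file, since Codex can follow the links only when they’re relevant to the task at hand.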
Fortunately Codex is great at writing these kinds of style guides. Start by asking it to look at your code, and document what the inferred principles are. Then work with it to get the guide exactly right. Then get Codex to refactor your code to follow those principles strictly.
Break up Tasks
If you ask Codex to do something big and vague all in one go, it’s probably going to get stuck or do something utterly different to what you wanted.
A better approach is to follow this sequence:
Explain the problem you want to solve, and ask for possible ways to solve it - Giving background context helps Codex “get” what’s needed, and it can be a good partner in helping think about the right solution - sometimes coming up with a better solution than what you’d been thinking of.
Pick one of Codex’s suggested approaches, and ask it to flesh out a more detailed plan - This gives deeper insight into what Codex might do, and helps catch anything wrong-headed. If the plan seems wrong, give feedback on how to improve it.
Ask Codex to break that plan down into a sequence of well specified tasks - Each task should be specified in enough detail that a Codex agent will solve it well.
Fire off multiple Codex agents to work on each of these sub-tasks separately - Some of them will go great. Some of them will require help. Some of them will go sideways so much that you need to take over and do them yourself.
It’s not that different to how one should go about solving a task oneself, but by breaking it into sub-tasks, you create clear points where Codex should get your input, and clear ways to break things down into smaller tasks so that if one thing goes sideways it doesn’t make everything go sideways.
Software Engineering is going to get weird
If you are an early career software developer used to writing code under the direction of a more senior engineer, then the near future might be tough. Codex may well already be better than you, and if it isn’t better than you yet, it will be soon.
On the other hand, principal-level engineers and managers are probably going to be fine in the near term. They already spend most of their time mentoring engineers, designing architectures, and crafting rules for keeping code bases clean - and I think those skills transfer pretty well to managing AI agents.
That doesn’t mean there is no place for early career developers. It just means that the skill they need to develop isn’t the ability to write code, but the ability to manage a team of agents who write code.
At least until AI gets better at those tasks too, which might well happen sooner than we think.
If you like this post, you might also want to read some of my other writings on AI, such as:
Virtue is a Vector - Treating things as good and bad is too simple
Let GPT be the Judge - Not everything that counts can be counted, but maybe GPT can measure it anyway.
How AI will Change Education - Education does lots of things for society. AI will change all of them.
Artificial General Horsiness - AI doesn't need to match all human intelligence to replace all human work. If it's faster and cheaper, we'll adapt the rest of the world around it.
Simulate the CEO - Do large organizations function like a single mind? What if we simulate that mind?
The Power of High Speed Stupidity - How rapid iteration allows us to create the incomprehensible
GPT is Rather Good at Feed Ranking - If ranking is as easy as saying what should rank highly, then lots of interesting things happen.
Chatbots make Search interesting again - They could disrupt all four of the pillars that Google's business rests on.
Or subscribe to my blog to get future stuff too - I don’t write that much so you won’t get spammed.
Thanks to Blaine Cook, who gave useful feedback on a previous draft.