
LLM Benchmarking: Surprising Task Complexity Gains


The main purpose of many large language models (LLMs) is to produce compelling text that’s as close as possible to being indistinguishable from human writing. And therein lies a major reason it’s so hard to gauge the relative performance of LLMs using traditional benchmarks: writing quality doesn’t necessarily correlate with the metrics traditionally used to measure processor performance, such as instruction execution rate.

But researchers at METR (Model Evaluation & Threat Research), a think tank in Berkeley, Calif., have come up with an ingenious idea. First, identify a series of tasks of varying complexity and record the average time it takes a group of humans to complete each one. Then have various versions of LLMs complete the same tasks, noting the cases in which a version of an LLM successfully completes a task with some level of reliability, say 50 percent of the time. Plots of the resulting data confirm that, as time goes on, successive generations of LLMs can reliably complete longer and longer (more and more complex) tasks.

No surprise there. But the shock was that this improvement in the ability of LLMs to reliably complete harder tasks has been exponential, with a doubling period of about seven months.
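To see how such a doubling period can be estimated, here is a minimal sketch of the trend-fitting idea. The data points below are hypothetical, not METR’s measurements: fit a straight line to the base-2 logarithm of the 50-percent-reliability task length against time, and read the doubling period off the slope.

    # Minimal sketch with made-up numbers, not METR's data.
    # Fit log2(task length) vs. time; the slope is doublings per year,
    # so the doubling period is 12 / slope months.
    import numpy as np

    years = np.array([2020.0, 2021.0, 2022.0, 2023.0, 2024.0])
    horizon_hours = np.array([0.05, 0.15, 0.5, 1.5, 5.0])  # 50%-reliability task length

    slope, intercept = np.polyfit(years, np.log2(horizon_hours), 1)
    print(f"Estimated doubling period: {12.0 / slope:.1f} months")

With these illustrative numbers the fit lands near a seven-month doubling period; the real estimate, of course, comes from METR’s measured task suite.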

IEEE Spectrum reached out to Megan Kinniment, one of the authors of an METR research paper describing this work and its surprising implications.

Evaluating LLM Performance Metrics

Did you suspect that you’d get these results?

Megan Kinniment: I, at least personally, didn’t expect us to have quite as clear an exponential as we did. Models have definitely been getting better quickly, though. So some fast rate of progress wasn’t entirely unexpected.

As you point out in the paper, it’s always dangerous to look into the future and extrapolate. However, you suggest that there is a likelihood of this continuing, which means that by 2030 we’ll be looking at monthlong tasks being within the capability of the most advanced large language models.

Kinniment: Let’s have a look at that. By one month, we mean around 167 working hours, so the number of [human] working hours in a month. And that’s at 50 percent reliability. But longer tasks typically seem to require higher reliability to actually be useful. So that’s something that could make the in-practice, real-world, economic impacts not be as intense as what is predicted.
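As a rough, purely illustrative back-of-the-envelope check (the starting horizon below is an assumption, not a METR figure), extrapolating a fixed seven-month doubling period shows how quickly a 50-percent-reliability horizon would reach roughly 167 working hours:

    # Hypothetical extrapolation, not a forecast from the paper.
    import math

    current_horizon_hours = 5.0   # assumed present-day 50%-reliability horizon
    target_hours = 167.0          # about one month of human working hours
    doubling_period_months = 7.0  # the trend reported by METR

    doublings = math.log2(target_hours / current_horizon_hours)
    print(f"{doublings:.1f} doublings, about {doublings * doubling_period_months:.0f} months out")

Whether such a horizon translates into real-world usefulness depends, as Kinniment notes, on the reliability bar that long tasks actually demand.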

There are a number of things that would have to continue for this prediction to come true. Hardware would have to continue improving at roughly the rate it’s improving; software would have to keep improving. You would have to have sufficient training data and availability of that training data to continue training at the breathtaking clip that’s been occurring in recent years.

Kinniment: The forecasts and the dates that we’ve found are just extrapolating the trend that we see on our task suite. [The trends are] not taking into account real-world factors or compute-scaling changes.

If a large language model could somehow achieve the ability to complete 167-hour type tasks with 50 percent reliability, what are the kinds of things that that now puts in the realm of capability for a large language model?

Kinniment: Well, the big one that we often think about is accelerating AI R&D research itself. To the extent that you can make models that accelerate your company’s ability to make better models, you could end up in a situation where AI capabilities develop really quite rapidly.

What Exponential Growth in AI Means for Humanity

What you are describing is reminiscent of the idea of the singularity, where you have AIs creating other AIs on their own, not assisted by human beings.

Kinniment: I think that you could get acceleration that is quite intense and does make things meaningfully more difficult to control without it necessarily resulting in this massively explosive growth. There are reasons to think that you might have various bottlenecks that slow things down in practice. Even if it were the case that we had very, very clever AIs, this pace of progress could still end up bottlenecked on things like hardware and robotics. But yeah, the singularity is for sure an idea that is relevant to this whole sector of things.

Things could go quite quickly, but it’s not like it’s the singularity or nothing. [AI-development rates] that were mild compared to a singularity could still be quite intense for how the world needs to adapt.

You indicated in the paper that some large language models seem to be improving in their ability to adapt and improve from mistakes.

Kinniment: I think it’s actually been a relatively gradual thing since ChatGPT, and potentially before that. They’re less likely to get stuck. They’re a bit better at changing strategies when things aren’t working, but that’s a bit hit or miss. And they’re definitely a lot better at doing things than they used to be and better at using tools. But it does seem like there are some fundamental aspects that haven’t changed a great deal.

One thing that I like to look at when I get a new model is, on each task, we give the model a number of tokens, a number of words that it can say. And if you could imagine giving them more and more time or more and more tokens to do a task, how does that affect how likely they are to succeed? And basically, what we see is they plateau quite strongly. There’s a point at which you give them more tokens and it doesn’t really help. And for each new model, that plateau gets a bit higher.

Megan Kinniment was on the team at METR that published the results of a study of LLM performance.

Humans, I imagine, also have diminishing returns. But if you give a human lots and lots of time to do something, they’ll probably do a better job, especially if you have multiple humans. And I think I’d be pretty impressed with a large language model that, even if its absolute score was lower, seemed like it could just keep doing things and improving. That could be a big deal.
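The token-budget analysis Kinniment describes can be pictured with a small sketch; the numbers here are invented purely to show the plateau shape, not measured results:

    # Hypothetical success rates at growing token budgets, illustrating the
    # plateau Kinniment describes; a real evaluation would run the model.
    token_budgets = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000]
    success_rates = [0.10, 0.22, 0.31, 0.35, 0.36, 0.36]

    prev = None
    for budget, rate in zip(token_budgets, success_rates):
        gain = 0.0 if prev is None else rate - prev
        print(f"{budget:>6} tokens: {rate:.0%} success (+{gain:.0%} over previous)")
        prev = rate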

You found that models performed worse on tasks that had higher “messiness” scores. Was there any signal that you got out of the data that this state of affairs might be changing? In other words, that models might be gaining greater ability to handle tasks that had higher messiness?

Kinniment: Messiness was a measure that I made to try and get a somewhat quantitative measure of how unrealistic our tasks were compared to the real world. And most of our tasks aren’t that messy. It’s a 16-point scale. The mean is about 3, and the most messy tasks are about 8 out of 16.

So what would a task rated 16 be in terms of messiness?

Kinniment: Something like espionage, where you have a lot of resource limitations. It’s very punishing. You have agents that are optimizing against you actively. It’s easy to mess up. It’s novel.

Are you all planning to follow up this study?

Kinniment: OpenAI published o3, and o3 was a little bit more capable than anticipated given the trend. So we are doing some amount of follow-up in terms of measuring other models. We do want to keep focused on informing the world about AI development and catastrophic risks from AI systems.

Catastrophic Risks from Advanced AI

What are the most likely catastrophic risks from AI? I mean, the ones that come to my mind are massive dislocations in employment if and when AI becomes supremely capable.

Kinniment: When we’re talking about catastrophic risks, we’re not just talking about mass unemployment. We’re talking about things that are more like this: if everybody became unemployed or you just didn’t need human workers for the vast majority of things, you might not need human workers to maintain your military, or you might need far fewer of them. That could make it easier for somebody to perform a coup, essentially. Or, if you have a vast quantity of geniuses in a data center, then that would make you a very powerful person. If you use that to produce military hardware, it’s possible we could get a concentration of power, and you might not have a democratic state anymore.

All this would happen, obviously, without any form of consciousness. These would be machines that would have the capability to scheme and plot and plan, but without the kind of consciousness that characterizes human ability to do this. Consciousness isn’t necessary for this.

Kinniment: Consciousness is a hard problem. I’m not sure if consciousness is necessary for any particular behavior. It feels a bit above my pay grade. I also think it’s not crazy that they could be conscious at this point. They would be very intelligent.

So you think it’s possible that they may be conscious at some point in the future?

Kinniment: I mean, if they’re as intelligent as you and I, then it doesn’t seem quite crazy. It doesn’t seem crazy for them to not be, and it doesn’t seem crazy for them to be.

