The thesis of McCoy et al.'s Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve is interesting. However, I spotted two errors early on which make me question the authors' attention to detail, as well as their claims in areas that I am less familiar with.
First, on page 1, Embers mischaracterizes the central claim of Bubeck et al.'s Sparks of Artificial General Intelligence and the reasoning behind it. According to Embers (emphasis added):
Virtually any task can be framed in the form of linguistic queries, so LLMs could be applied to virtually any task—from summarizing text to generating computer code. This flexibility is exciting: it led one recent paper to argue that LLMs display “sparks of artificial general intelligence” (Bubeck et al., 2023).
But the flexibility of LLMs is not what triggered the "sparks of AGI" paper. If that were the case, the paper could just as well have been written about GPT-2 or GPT-3.5, which are equally flexible in the sense of being able to perform tasks "framed in the form of linguistic queries". In fact, though, the paper was about GPT-4 and its impressive performance across a variety of problem domains. As the abstract of Sparks states:
Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
Then, on page 2, Embers mischaracterizes the term "autoregression" in an ML context (emphasis added):
The crucial question to ask, then, is: What problem(s) do LLMs need to solve, and how do these pressures influence them? Here we focus on perhaps the most salient pressure that defines any machine learning system, namely the task that it was trained to perform. For the LLMs that have been the focus of recent attention in AI, this task is autoregression—next-word prediction (Elman, 1991; Bengio, Ducharme, and Vincent, 2000; Radford et al., 2018)—performed over Internet text.
Recent large language models have indeed been "trained" to predict the next word or token, but "autoregression" is not synonymous with next-word prediction. Rather, it has to do with generating output based on previous outputs. In statistics, an "autoregressive model describes a system whose status (dependent variable) depends linearly on its own status in the past". In machine learning, "autoregression" refers to a process by which a model generates outputs in a loop, one output at a time, where each additional output is appended to the input and fed back into the process to generate the next output.
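The distinction is easy to see in code. Below is a minimal sketch of an autoregressive generation loop, with a hypothetical toy lookup table standing in for a real trained model; what makes the process autoregressive is not the next-token prediction itself but the loop that appends each output to the context and feeds it back in:

```python
def toy_next_token(context):
    """Hypothetical stand-in for a trained model: maps the
    context seen so far to a single next token."""
    table = {
        ("the",): "cat",
        ("the", "cat"): "sat",
        ("the", "cat", "sat"): "<eos>",
    }
    return table.get(tuple(context), "<eos>")

def generate(prompt, max_tokens=10):
    context = list(prompt)
    for _ in range(max_tokens):
        token = toy_next_token(context)  # predict one token...
        if token == "<eos>":
            break
        context.append(token)  # ...then append it and feed it back in
    return context

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

Training optimizes the model inside the loop (next-token prediction); autoregression is the loop itself.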