Tuesday, October 3, 2023

Errors in McCoy et al.'s "Embers of Autoregression"

The thesis of McCoy et al.'s Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve is interesting. However, I spotted a two errors early on which make me question the authors' attention to detail as well as their claims in areas that I am less familiar with.

First, on page 1, Embers mis-characterizes the central claim of Bubek et al.'s Sparks of Artificial General Intelligence and the reasoning behind it. According to Embers (emphasis added):

Virtually any task can be framed in the form of linguistic queries, so LLMs could be applied to virtually any task—from summarizing text to generating computer code. This flexibility is exciting: it led one recent paper to argue that LLMs display “sparks of artificial general intelligence” (Bubeck et al., 2023).

But the flexibility of LLMs is not what triggered the "sparks of AGI" paper. If that were the case, the paper could have just as well have been written about GPT-2 or GPT-3.5, which are equally flexible in the sense of being able to perform tasks "framed in the form of linguistic queries". In fact, the paper was about GPT-4 and its impressive performance across a variety of problem domains. As the abstract of Sparks states:

Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.

Then, on page 2, Embers mischaracterizes the term "autoregression" in an ML context (emphasis added):

The crucial question to ask, then, is: What problem(s) do LLMs need to solve, and how do these pressures influence them? Here we focus on perhaps the most salient pressure that defines any machine learning system, namely the task that it was trained to perform. For the LLMs that have been the focus of recent attention in AI, this task is autoregression—next-word prediction (Elman, 1991; Bengio, Ducharme, and Vincent, 2000; Radford et al., 2018)—performed over Internet text.

Recent large language models have indeed been "trained" to predict the next word or token, but that is not what "autoregression" means. Autoregession has to do with generating output based on previous outputs. In statistics, an "autoregressive model describes a system whose status (dependent variable) depends linearly on its own status in the past". In machine learning, "autoregression" refers to a process by which a model generates outputs in a loop, one output at a time, where each additional output is appended to the input and fed back into the process to generate the next output.

Monday, March 20, 2023

An undecidable Diophantine equation

As shown in Jones (1982), it is undecidable whether the equation below has a solution over the positive integers for given positive integers x and v:

$\begin{alignat*}{2} & &&(elg^2 + α - (b - xy)q^2)^2\\ &+ &&(q - b^{5^{60}})^2\\ &+ &&(λ + q^4 - 1 - λb^5)^2\\ &+ &&(θ + 2z - b^5)^2\\ &+ &&(l - u - tθ)^2\\ &+ &&(e - y - mθ)^2\\ &+ &&(n - q^{16})^2\\ &+ &&(\\ & && ~~~r\\ & &&~~~- [g + eq^3 + lq^5 + (2(e - zλ)(1 + xb^5 + g)^4 + λb^5 + λb^5q^4)q^4][n^2 - n]\\ & && ~~~- [q^3 - bl + l + θλq^3 + (b^5 - 2)q^5][n^2 - 1]\\ & &&)^2\\ &+ &&(p - 2ws^2r^2n^2)^2\\ &+ &&(p^2k^2 - k^2 + 1 - τ^2)^2\\ &+ &&(4(c - ksn^2)^2 + η - k^2)^2\\ &+ &&(k - r - 1 - hp + h)^2\\ &+ &&(a - (wn^2 + 1)rsn^2)^2\\ &+ &&(c - 2r - 1 - φ)^2\\ &+ &&(d - bw - ca + 2c - 4aγ + 5γ)^2\\ &+ &&(d^2 - (a^2 - 1)c^2 - 1)^2\\ &+ &&(f^2 - (a^2 - 1)i^2c^4 - 1)^2\\ &+ &&((d + of)^2 - ((a + f^2(d^2 - a))^2 - 1)(2r + 1 + jc)^2 - 1)^2\\ &+ &&(v - ((zuy)^2 + u)^2 - y)^2\\ &= &&~0 \end{alignat*}$

That equation corresponds to the 19-equation system from Jones' Theorem 3 (1982). Jones' system of equations

L₁ = R₁, L₂ = R₂, …, L₁₈ = R₁₈, Lᵥ = Rᵥ

is equivalent to:

(L₁ – R₁)² + (L₂ – R₂)² + … + (L₁₈ – R₁₈)² + (Lᵥ – Rᵥ)² = 0

The single equation above has a solution just when x is a member of the vth semidecidable set, with v representing the index of each semidecidable set listed in an order determined by the equation. Some semidecidable sets are not decidable. Therefore, an algorithm capable of deciding the solvability of the equation for any given x and v would be able to decide membership in undecidable sets.

Reference: Jones, James P. (1982). Universal Diophantine Equation. The Journal of Symbolic Logic, 47(3), 549-571. https://doi.org/10.2307/2273588.

Thursday, October 19, 2017

The high cost of "safe" estimates

A summary of an idea from Goldratt's Critical Chain

When talking about task duration estimates, I'll define a "true" estimate as an estimate in which we have 50% confidence, that is, an estimate that has a 50% probability of being greater than or equal to the actual duration. Similarly, a "safe" estimate is an estimate in which we have 90% confidence.

People want to be seen as reliable, so they tend to give safe estimates. But their stakeholders tend to try to whittle down the estimate, so the person often has to fight for their estimate. Now they might look dumb if they finish much sooner than the estimate for which they just fought so hard. Even if the stakeholder is initially pleased when this happens, the person knows that, in the future, their estimates may be seen as inflated.

If you were to draw a graph of the likelihood of a task finishing at various points in time, it would commonly look something like this:

Here, the X axis represents time (say, in weeks), and the area under the curve from x₁ to x₂ is the probability that the task will take between x₁ and x₂ weeks to complete. Since there a chance that the task will take a long time, the curve has a long tail on the right-hand side (not shown completely in this graph) that drifts down to 0 very slowly. The area under the entire curve is 1, reflecting an unrealistic assumption that the task will definitely be completed at some point in time.

The area under the curve from 0 to 1 is about 0.5 (as you can see noting that the green area of the graph is slightly less than a 1 x 0.6 rectangle), so the probability that this task will take less than 1 week is about 50%. In other words, the 50th percentile time for this task is 1 week. The probability that it will take between 1 and 2 weeks is about 25%. The probability that it will take between 2 and 4 weeks is about 15%. And the probability that it will take longer than 4 weeks is about 10%. Adding these numbers up, we see that this task has a 50% chance of finishing in less than a week, a 75% chance of taking less than 2 weeks, and a 90% chance of taking less than 4 weeks.

So, an estimate with 90% confidence—a safe estimate—will have to be many times longer than a true estimate.

For more on this topic, see Cutting Edge's excellent white paper on Managing Duration Uncertainty.

Monday, January 11, 2016

2011 Sprouts Clash of the Titans - Game 2 Analysis

Clash of the Titans, Game 1: Analysis by Josh Jordan

Clash of the Titans, Game 2

Analysis by Josh Jordan

For background, see my analysis of game 1.

22- (Khorkov - Jordan+Aunt Beast*) 1(23)1[2-10] 1(24)23[11-14] 11(25)24 2(26)2 2(27)26[3] 15(28)16 15(29)16[17-19] 15(30)17 16(31)20

11(32)11[12]

Aunt Beast doesn't know who wins this position, but she can see that Roman loses if he moves in the 0⁴ component.

20(33)31[21,22] N

Roman moves in the 0⁴ component, changing it to a 1⁵. In a way, Roman's hand is forced: he wants S(17) to have the nim-values 1^x, and this is the only move that does that.

29(34)30[28] P

Move 34 and every subsequent even-numbered move is P. Based on its nim-values, this seems to be another one of those moves which magically win, seemingly at random. As John Conway wrote in Winning Ways, "You mustn't expect any magic formula for dealing with such positions."

Situations like this demonstrate just how inadequate nim values can be in the analysis of misère sprouts. Understanding these positions will require a fundamental breakthrough in the mathematical theory of misère sprouts games.

17(35)17[18,19] 21(36)22 18(37)19 18(38)19 19(39@35)38 20(40)36

After move 40, S(3) still has nim-values 1¹ (the sphere hasn't changed since move 27), but S(40) has evolved into a component with nim-values 5⁵ — and the result is still a misère P-position! We've truly crossed over into the Twilight Zone.

18(41)37[39] 4(42)4[5-8] 4(43)42[5] 3(44)3 3(45)27 9(46)9 9(47@10)46 21(48)22 21(49@33)48 40(50)49[22] II

After a long endgame, Roman concedes.

2011 Sprouts Clash of the Titans - Game 1 Analysis

Clash of the Titans, Game 1: Analysis by Josh Jordan

Clash of the Titans, Game 1

Analysis by Josh Jordan

Game 1 of the World Game of Sprouts Association-sponsored “2011 Clash of the Titans” match between WGOSA World Champion Roman Khorkov and me took place over email in early February, 2011. I was assisted by my computer program Aunt Beast, while Roman used only his brain. After I won the game, Roman wrote:

“This was an amazing game. Very intriguing... It added my white hair. It will be good if you make a detailed analysis. We simply have to bring it to the public!”

As requested, here is my analysis. First, though, let's have a look at these two titans.


Roman Khorkov	Josh Jordan

27- (Khorkov*)

Sprouts is a two-player pencil-and-paper game invented in the 1960s by mathematicians John Conway and Michael Patterson. The game starts with a field of dots on an otherwise-empty piece of paper. The move is to add a new dot along with two lines joining it to existing dots subject to the conditions that (a) lines may not cross and (b) no dot may have more than 3 lines coming out of it. (Mathematically speaking, each move extends a plane multigraph of maximum degree 3 by adding a new vertex of degree 2.) The players alternate turns until it is no longer possible to move, at which point the game is over. Like any impartial combinatorial game, sprouts can be played with two different win conditions: (a) last player wins or (b) last player loses. The former is known as normal play, the latter as misère (pronounced something like "MEE-zair") play.

The top human players seem to be able to play normal sprouts quite accurately. For this reason, the emphasis in modern tournament sprouts has shifted to misère play. (According to the Wikipedia entry on misère games, "Perhaps the only misère combinatorial game played competitively in an organized forum is Sprouts.") However, skill in normal play does not go to waste: the main principle of human misère sprouts strategy is to to create a large component with firm nim-values (see below), in the presence of which the game's misère outcome is usually the same as its normal outcome. (A component is a minimal contiguous region of the game with the property that a move in that region does not affect the moves available in the rest of the game.)

In sprouts, one player serves, that is, determines the number of initial spots and the win condition (normal or misère), and the other player then decides who plays first.. The player who plays first is known as Left; the other is known as Right. Here Roman has served 27 spots, misère. Had he intended normal play, the serve would have been written as "27+".

27- (Jordan+Aunt Beast - Khorkov*) 1(28)2

The shape that results from connects two initial spots like this is known as an anago.

Even though 27- is thought to be a win for Right, I opt to move first, in part to show how a computer can help win a theoretically losing position. We're a long way from knowing for sure whether 27- is a win for Left or a win for Right, so in the opening phase I merely try to make the position more complex, in accordance with the Enough Rope Principle.

The diagrams here follow Winning Ways by coloring Left's move bLue and Right's move Red.

1(29)1[2,3]

Roman responds with the analogue of the correct reply in 15-. According to theory, adding a multiple of 6 spots to a region with a large number of spots doesn't change the outcome of the game, so this move should also work for 27-.

4(30)29

Continuing the standard opening for 15-, I make a *3 component in S(2). (The notation S(x) means the entire component containing spot x.) I will try to make more of these *2 and *3 components as the game continues. An identical pair of nim-heaps of size at least 2 make a component with firm nim-values 0⁰. In misère play, when the position contains a component with firm nim-values, further identical pairs of these nim-heaps can can usually be ignored as in normal play, but not always…

It may be useful to recall some definitions and notation. A nim-heap of size x is denoted by *a. The nim-value of a position x is the size of the nim-heap that must be adjoined to x to result in a P-position, that is, a loss for the player to move. A position that is a win for the player to move is known as an N-position.

The nim-values of a position are its normal nim-value a and its misère nim-value b, written a^b. A component with firm nim-values is a component with identical normal and misère nim-values. This is not the same as a firm component, which is a tame component with firm nim-values. A component x is tame whenever there exist nim-heaps a, b, …, c such that, for any impartial combinatorial game y, a+b+…+c+y has the same misère outcome as x+y. (Note that y doesn't necessarily have to follow the rules of sprouts or nim.) Most components with firm nim-values are not tame. A partially-correct table of nim-values for various common sprouts components can be found at Playing Sprouts with Misère Grundy Numbers (Peltier, 2008).

4(31)4[5-20]

Roman makes a group of 16 spots and a group of 7. If spots 30 and 31 were removed, both of these groups would be components with firm nim-values of 1¹.

5(32)5[6-9]

With Aunt Beast, I try to disrupt the human strategy by steering the game into positions in which the normal outcome differs from the misère outcome, even in the presence of components with firm nim-values. Playing those types of positions correctly seems to be very difficult for today's human sprouts players.

I have noticed that many of these tricky positions contain groups of four spots, so I try to create such a group here. This was almost certainly a mistake, since it allows Roman to effectively remove those four spots from the game on his next move.

5(33)32[6]

Roman simplifies. S(6) is now effectively dead (equivalent to *0).

Dead spots and components which I know to be effectively dead (equivalent to *0, either by themselves or in combination with other components) are colored gray in the diagrams. Further play in these dead components is pointless.

10(34)11

Trying again to add complexity, I fall back on the usual approach of playing an anago.

30(35)31[21,22]

I expect Roman to close off the anago with 10(35)10[11] or 10(35)10[11,12], but he surprises me by playing this move instead, splitting the rest of the position into two components with nim-values 3⁰ and 0⁰.

I think it's bad practice for a human player in a theoretically winning position to leave an anago exposed like this, since the opponent can use it to create complex positions that could otherwise be prevented from arising.

In the diagram, the nim-values a^b of various components have been noted in boxes. When a box contains *n, the component is equivalent to a nim-heap of size n.

23(36)24

Trying to make things even more complex, I play a third anago.

21(37)35

Roman is now facing two exposed anagos, a situation in which he does not usually find himself.

10(38)10[11]

This move looks to have been a mistake on my part. Even if the 0¹ component were equivalent to *0, the remaining components do not make a misère P-position. If instead the 3⁰ component is replaced by *3, the resulting position is also not a misère P-position.

My move does have a possible virtue: there are only three replies that make the normal-play nim-values sum to *0 by changing the 3⁰ component into a 0¹, namely 23(39)23, 25(39)36, and 23(39)27. Each of these are subtly different from a misère perspective, and each yields a different unusual configuration in S(23). These unusual configurations are unlikely to have been well-studied by human sprouts theorists.

Instead, though, I should have played 12(38)34 or 12(38)13, both of which better accord with the Enough Rope Principle.

12(39)12[13]

Roman returns to a position in which the normal-play nim-values sum to *0. Interestingly, though, he does so by raising the nim-values of S(11) from 0¹ to 3⁴.

23(40)23[25]

I have set a trap. Aunt Beast still can't figure out who wins here, but she can see that Roman loses if he moves in the 5¹ component. It's likely that he will, since human players can't abide positions with normal-play nim-values greater than 3.

Update (March 2011): In subsequent analysis, Aunt Beast determined that this move is not just a trap — it actually wins! How strange that *2 + *2 + 5¹ + 3⁴ would yield a misere P-position. If God doesn't play dice with the universe, he must not play sprouts either, because both seem equally random.

26(41)26 N

Roman moves in the 5¹ component, reducing it to *3. Now Aunt Beast can see that Roman has lost. Adopting a notation from combinatorial game theory, I have appended "N" to losing moves, which stands for "next player wins".

14(42)15 P

There are six winning options here: 34(42)38, 12(42)13, 14(42)14[15-16,38], 14(42)14[15-18], 14(42)14[15-17], and 14(42)15. I like to play an anago whenever possible. Aunt Beast didn't compute the exact normal-play nim-value of the anago component during the game, but in the course of writing this article, I used Glop to determine that it is at least 5.

Adopting another notation from combinatorial game theory, I have appended "P" to winning moves, which stands for "previous player wins". In this case, the previous player is the person who just played the move, that is, me.

14(43)14[16-20] N

Looping on a tail of the anago, Roman encloses a group of 5 spots.

According to the "component with firm nim-values" heuristic, this position's nim-values 3³ + *3 + *3 + *3 would seem to yield a clear 0⁰, that is, a P-position in normal and misère play. Actually, though, it's a misère N-position — a perfect example of a case in which the heuristic goes horribly wrong.

Even though it loses, this is still an excellent move: it leaves me with only one winning reply, a reply which would be impossible to find today (early 2011) without the aid of a computer.

16(44)16[43] P

This was the only winning move. The fact that this position is P cannot be predicted by a theory that looks only at the nim-values of its components.

17(45)17[18,19] N

Roman adds an additional level of nesting to the loops and pivots in S(11). This is another excellent move that leaves me only one winning reply.

In 2006, Dan Hoey pointed out that a related class of positions containing nested pivot spots “tend to have reasonably large game complexity, since Dawson's Kayles produces nim-values up to *9.” For more on this, see these messages in the “Dawson's Sprouts” thread on sprouts-theory.

18(46)19 P

Again, this was the only winning move. Now the game's outcome and its nim-values begin to correspond: the 1³ component has restive nim-values, so we can guess that the position is P if and only if *3 is equal to *3 + *2 + *2, which indeed it is.

26(47)41 N

Ordinarily, when you are winning, the answer to an anago in an otherwise-empty region is to convert it to a big loop, but doing that here via 18(47)19 allows four different winning replies, while Roman's move allows only one.

21(48)21[22] P

For the third time in a row, this was the only winning move. Game trees in which two moves in a row have unique winning replies tend to be difficult for humans to analyze, so I can only imagine what a tree with three in a row is like.

This position is clearly P on the basis of its nim-values alone.

2(49)2[3] N

This is another good move; it allows me only two winning replies.

12(50@13)39 P

This was one of only two winning moves; the other was 17(50)46, which leaves S(11) with nim-values 6¹.

Roman usually resigns in losing positions with such few remaining live spots. Either the normal-play nim-value of 5 is clouding his vision, or he is doing an excellent job of defending a losing position, by forcing his opponent to play very accurately.

18(51)19 N

Roman plays yet another excellent move which allows me only one winning reply.

17(52)18 P

This was the only winning move.

20(53)20[45] N

Another strong move that leaves me only two winning replies.

16(54@20)44 I P

After this move, S(17) is effectively dead and Roman concedes. This was the simpler of only two winning moves; the other was 19(54@45)46, with nim-values 5⁰.

Josh Jordan, February, 2011

Taking the Fun Out of Puzzles & Games

Tuesday, October 3, 2023

Errors in McCoy et al.'s "Embers of Autoregression"

Monday, March 20, 2023

An undecidable Diophantine equation

Thursday, October 19, 2017

The high cost of "safe" estimates

A summary of an idea from Goldratt's Critical Chain

Monday, January 11, 2016

2011 Sprouts Clash of the Titans - Game 2 Analysis

Clash of the Titans, Game 2

Analysis by Josh Jordan

2011 Sprouts Clash of the Titans - Game 1 Analysis

Clash of the Titans, Game 1

Analysis by Josh Jordan

Followers

Blog Archive

Contributors