Building a Rating System for Competitive NES Tetris

Over the last decade, a fantastic competitive scene has grown around NES Tetris. Every year, competitors travel the world to compete in the Classic Tetris World Championship.

Players go 1 vs 1, starting their games simultaneously. The player with the higher score wins. Typically, whoever wins the best-of-3 or best-of-5 set of games wins the match.

The Database

As an extension to the yearly tournament, there are qualifier events (played in real life), the Classic Tetris Monthly (played by streaming video of games played at home), and other smaller competitions. One member of the community took it upon themselves to document nearly every tournament match ever played.

This is a great data set, and some community members have used it to track Elo ratings. There was talk of switching to something more advanced such as Glicko-2, so I offered to help out.

Most of the games come from tournaments, which limits the size of the data set. Online games with automated record-keeping, or chess organizations with much larger player bases, can generate far more recorded games. More games result in more accurate ratings, so that is the first hurdle. The second hurdle is that we don't know the particular order of wins and losses within a match, only the totals. The last hurdle is the random nature of NES Tetris: minor mistakes can quickly change the course of a game, which results in more volatility than in most other games.

Benchmarking

I decided to try multiple rating systems to see which worked best. I benchmarked each system by first "priming" player ratings with half the data set. Then, before each match, I made a prediction by looking at the probability of one player winning and comparing it to the actual result. Only then did I update the ratings for that game. I repeated this process through the end of the data set.
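A simplified sketch of that process, with `predict_win_prob` and `update_ratings` standing in for whichever rating system is being tested, and `matches` assumed to be a chronologically ordered list of match records:

```python
# Simplified benchmark sketch: prime ratings on the first half of the
# data set, then predict-before-update on the second half.

def benchmark(matches, ratings, predict_win_prob, update_ratings):
    """matches: chronological list of (player_a, player_b, wins_a, wins_b)."""
    half = len(matches) // 2

    # Prime the ratings with the first half of the data set.
    for a, b, wins_a, wins_b in matches[:half]:
        update_ratings(ratings, a, b, wins_a, wins_b)

    # For the rest: predict first, score the prediction, then update.
    errors = []
    for a, b, wins_a, wins_b in matches[half:]:
        prediction = predict_win_prob(ratings, a, b)   # P(A beats B)
        result = wins_a / (wins_a + wins_b)            # observed share of games won
                                                       # (a stand-in for the scaled win margin below)
        errors.append(abs(result - prediction))
        update_ratings(ratings, a, b, wins_a, wins_b)

    return sum(errors) / len(errors)                   # mean absolute error
```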

I used the following formulas to make predictions. They result in a value between 0 and 1 indicating the probability of player A defeating player B. I've listed them here, since some are mostly absent from the usual sources related to these rating systems.


    (1) Probability of winning with Elo:

        $P(A \text{ beats } B) = \dfrac{1}{1 + 10^{(R_B - R_A)/400}}$

    (2) Probability of winning with Elo, alternate method:

        $P(A \text{ beats } B) = \dfrac{Q_A}{Q_A + Q_B}$, where $Q_A = 10^{R_A/400}$ and $Q_B = 10^{R_B/400}$

    (3) Probability of winning with Glicko-1 (given in Glickman's Boost paper):

        $P(A \text{ beats } B) = \dfrac{1}{1 + 10^{-g\left(\sqrt{RD_A^2 + RD_B^2}\right)(r_A - r_B)/400}}$, where $g(x) = \dfrac{1}{\sqrt{1 + 3q^2 x^2/\pi^2}}$ and $q = \dfrac{\ln 10}{400}$

    (4) Probability of winning with Glicko-2:

        $P(A \text{ beats } B) = \dfrac{1}{1 + e^{-g\left(\sqrt{\phi_A^2 + \phi_B^2}\right)(\mu_A - \mu_B)}}$, where $g(\phi) = \dfrac{1}{\sqrt{1 + 3\phi^2/\pi^2}}$
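A minimal sketch of formulas (1) and (3) in code (the helper names are just for illustration):

```python
import math

Q = math.log(10) / 400  # the constant q used by Glicko-1

def elo_win_prob(r_a, r_b):
    """Formula (1): probability that A beats B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def g(rd):
    """Glicko-1 attenuation factor for a rating deviation."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)

def glicko1_win_prob(r_a, rd_a, r_b, rd_b):
    """Formula (3): probability that A beats B under Glicko-1,
    combining both players' rating deviations."""
    combined = math.sqrt(rd_a ** 2 + rd_b ** 2)
    return 1.0 / (1.0 + 10 ** (-g(combined) * (r_a - r_b) / 400))
```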

They are all similar in structure. Elo can be thought of as a subset of Glicko. How do we use these to evaluate a rating system? One method is to find the accuracy:

    $\text{accuracy} = \dfrac{\text{number of correct predictions}}{\text{total number of predictions}}$

where a prediction counts as correct when it falls on the same side of 0.5 as the actual outcome.
Accuracy is okay, but it doesn't consider the following:
  1. How close was the prediction to the outcome? For example, when the outcome is a win (1.0), a prediction of 0.99 is better than 0.6. However, accuracy treats both predictions the same.
  2. What was the margin of victory? A 0.99 prediction is more representative of a match score of 3 - 0 when compared to a match score of 3 - 2. (This is important when we make predictions on sets of games instead of individual games.)
To address these concerns, let's start with what I'll call "scaled win margin" (SWM),



We get a value between 1 (an "ace") and 0 (all games lost).

We can use the negative binomial distribution to calculate the probability of winning a best of n games, given the probability of winning a single game p. With k = (n + 1)/2 games needed to win the match:

    $P(\text{win match}) = \displaystyle\sum_{j=0}^{k-1} \binom{k-1+j}{j}\, p^{k} (1-p)^{j}$
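A minimal sketch of that calculation in code:

```python
from math import comb

def match_win_prob(p, best_of):
    """Probability of winning a best-of-n match given per-game win
    probability p, via the negative binomial distribution."""
    k = best_of // 2 + 1  # games needed to win the match
    # Sum over the number of games lost before clinching the k-th win.
    return sum(comb(k - 1 + j, j) * p ** k * (1 - p) ** j for j in range(k))

print(match_win_prob(0.6, 5))  # ~0.683: a 60% per-game favorite in a best of 5
```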
We can then find the mean absolute error between our results (SWM, or y) and our predictions (y hat):

    $\text{MAE} = \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$
MAE helps when optimizing parameters, since it allows us to incorporate information about the number of wins and losses, despite not knowing their order.

Elo

Elo is predictive because it factors in expectations when updating ratings. The basic idea is this:

    $\text{new rating} = \text{old rating} + K \times (\text{outcome} - \text{expectation})$
Expectation in Elo is found with formula (1) in this article.

If a 3000-rated player defeats a 1000-rated player (outcome = 1), their ratings will not change much, as this is expected (expectation ~= 0.999). We update the 3000-rated player's rating by first calculating 1 - 0.999 = 0.001 and then adding 0.001 * K to their old rating. If the 3000-rated player loses, then their rating will drastically change, by over 99% of the value of K.
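As a sketch of that update, using formula (1) for the expectation and an arbitrary K of 32:

```python
def elo_update(r_a, r_b, outcome, k=32):
    """Update player A's rating after a game against B.
    outcome is 1 for a win, 0 for a loss."""
    expectation = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (outcome - expectation)

print(elo_update(3000, 1000, 1))  # ~3000.0003: the expected win barely moves the rating
print(elo_update(3000, 1000, 0))  # ~2968.0: the upset loss costs nearly the full K
```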

P-Factor 

I recreated something called the "P-factor" system. The P-factor is calculated from the match's margin of victory: it equals 1 if |wins - losses| is 0 or 1, 1.5 if it is 2, and otherwise (11 + |wins - losses|) / 8. It is then multiplied by K when updating Elo ratings. I do not recommend it, as it adds complexity without adding predictive value; I've included it only to benchmark the system that was already in place.
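A direct translation of that piecewise definition:

```python
def p_factor(wins, losses):
    """Multiplier applied to K under the P-factor system,
    based on the match's margin of victory."""
    margin = abs(wins - losses)
    if margin <= 1:
        return 1.0
    if margin == 2:
        return 1.5
    return (11 + margin) / 8
```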

Individual game simulation

The data set does not contain the order of wins within a match. However, since we do know the number of wins, we can estimate the game order. The last game is always a win for the winner of the match, and we can alternate the remaining wins at the tail end of the match. For example, if the win (W) - loss (L) record for a match was 3-2, then the sequence would be WLWLW. We can then update Elo on a per-game basis instead of per match. This resulted in an improvement in predictive value.
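A sketch of that reconstruction, from the match winner's perspective:

```python
def simulate_game_order(wins, losses):
    """Estimate a game sequence for the match winner: the final game is a
    win, and the losses alternate with wins at the tail end of the match.
    Assumes wins >= losses (the winner's record)."""
    sequence = ["W"] * (wins - losses)   # surplus wins come first
    sequence += ["L", "W"] * losses      # then alternate to finish on a win
    return "".join(sequence)

print(simulate_game_order(3, 2))  # WLWLW
print(simulate_game_order(3, 1))  # WWLW
print(simulate_game_order(3, 0))  # WWW
```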

Two-tier K with minimum game threshold

The next thing I tried was using a different K based on how many games a player has under their belt. Beginners use a higher K; after they've played a certain number of games, they switch to a lower K. This, combined with game simulation, improved predictive value over all previous methods.
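A sketch of that K selection (the parameter values here are only illustrative):

```python
def choose_k(games_played, min_games=25, k_beginner=40, k_established=16):
    """Two-tier K: use a higher K until a player has passed the minimum
    game threshold, then switch to a lower K for established players.
    Values are illustrative placeholders, not the tuned parameters."""
    return k_beginner if games_played < min_games else k_established
```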

Glicko-1

Glicko is an extension of Elo, created by Mark Glickman in 1995. I won't go into detail about Glicko's implementation, as its formulas are much more complex, but you can read an explanation here. Glicko's key features are:
  • Keep a second statistic of how certain we are of a player's rating, known as rating deviation (RD).
  • Base the weight with which we update a rating (the counterpart of Elo's K) on that certainty, RD.
  • Slowly increase RD over periods of inactivity.
  • Simultaneously update in batches of multiple games at a time, if desired.
Since Glicko doesn't care about win order when processing multiple games per rating update, the natural thing to do here is to process all of the games within a match as one update. Glicko was originally conceived to process batches of games played within a month, which is how often FIDE updates their chess rankings. It is becoming more common to update Glicko after every game rather than after a period of time, in order to give players immediate feedback. Another example is TrueSkill, which is based on ideas similar to Glicko and updates on a per-game basis. I experimented with monthly and daily batch updates and saw no significant improvement in the ratings compared to processing only the games within a match at one time.

As mentioned before, Glicko increases a player's RD whenever they don't play during a period, using a constant C. As a practical solution, I've found it easiest to do this before each match using the following:

    $RD' = \sqrt{RD^2 + C^2 t}$, where t is the time since the player's last match
I was skeptical, though. In my opinion, a player's skill doesn't change much over periods of inactivity; at least, not after warming up with some practice. To my surprise, however, this did increase predictive value. Whether that's worth the extra complexity for a particular application, I can't say. Another thing to consider is that it is easier to overfit the data when you have one more variable to tune. As a side note, the Glicko paper specifies that RD should never exceed the initial starting RD. For this particular data, I did not find that cap to influence predictive value.
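A sketch of that pre-match RD inflation, assuming the standard Glicko formula with an optional cap at the starting RD:

```python
import math

def inflate_rd(rd, time_inactive, c, rd_max=None):
    """Increase a player's rating deviation before a match to reflect
    time spent inactive. rd_max optionally caps RD at the starting value,
    as the Glicko paper suggests; time_inactive is the time since the
    player's last match (units are an assumption here)."""
    new_rd = math.sqrt(rd ** 2 + c ** 2 * time_inactive)
    return min(new_rd, rd_max) if rd_max is not None else new_rd
```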

Glicko-2

Glicko-2 was created in 2000 and further extends Glicko-1 by adding a volatility term, sigma, which adjusts RD based on how erratic a player's results are; this is constrained by a constant tau. It also scales values down when making calculations, where mu represents a scaled rating and phi represents a scaled RD. You can find details on its implementation here.
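For reference, Glicko-2's scale conversion (using the constant from Glickman's paper) looks like this:

```python
GLICKO2_SCALE = 173.7178  # conversion constant from the Glicko-2 paper

def to_glicko2_scale(rating, rd):
    """Convert a Glicko-1 style rating/RD to Glicko-2's mu/phi scale."""
    mu = (rating - 1500) / GLICKO2_SCALE
    phi = rd / GLICKO2_SCALE
    return mu, phi

def from_glicko2_scale(mu, phi):
    """Convert back to the familiar rating/RD scale."""
    return mu * GLICKO2_SCALE + 1500, phi * GLICKO2_SCALE
```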

Keeping track of volatility adds a good bit of complexity to Glicko-2. Glickman states that it works best when processing 10 to 15 games at a time. This data set contains an average of 3.4 games per match (processed in one batch), which probably explains why its predictive value did not exceed Glicko-1's. Multi-day rating periods would solve this, and I did try that, but the benchmark is not comparable to the rest: updating ratings less often automatically lowers accuracy when using those ratings to make per-game predictions. My gut feeling is that this extra complexity would probably not be worth the extra predictive value, if there is any. There is more discussion on that topic here.

The results


Rating system | Accuracy | Mean absolute error | Parameters
--- | --- | --- | ---
Elo, single K, no game sim | 0.6688 | 0.2701 | K = 160 (ACC); K = 27 (MAE)
Elo, P-factor method, no game sim | 0.6669 | 0.2700 | K = 108 (ACC); K = 17 (MAE)
Elo, single K, game sim | 0.6734 | 0.2614 | K = 43 (ACC); K = 19 (MAE)
Elo, 2-tier w/ min game threshold, game sim | 0.6771 | 0.2565 | K = 10, K2 = 46, min games = 27 (ACC); K = 15, K2 = 57, min games = 25 (MAE)
Glicko-1, no passive RD | 0.6763 | 0.2566 | RD = 550, C = 0 (ACC); RD = 200, C = 0 (MAE)
Glicko-1, passive RD | 0.6800 | 0.2527 | RD = 450, C = 0.465 (ACC); RD = 150, C = 0.288 (MAE)
Glicko-2 | 0.6626 | 0.2696 | phi = 250, sigma = 0.1, tau = 0.01 (ACC); phi = 150, sigma = 0.1, tau = 0.01 (MAE)

In the end, the community decided to continue using the existing system. The main reasons given were:
  1. When updating ratings on a per-game basis, a player can win a match but still lose points overall if they dropped individual games within it. This mainly happens when the players are unevenly matched, where losing even one game can significantly lower the higher-rated player's rating.
  2. Disagreements over where certain players would rank on the leaderboard.
  3. If a player saw their ranking fall, they tended to automatically distrust the new system.
There is an element of politics involved when a community switches from one system to another, and that is probably part of why some big chess organizations such as FIDE have stuck with Elo for so long.

Here are the top 15 Glicko-1 ratings as of 2019-12-10:
Rating | RD | Sets | Games | Player
--- | --- | --- | --- | ---
2082 | 84 | 44 | 142 | JOSEPH
1985 | 120 | 29 | 75 | GREENTEA
1903 | 67 | 79 | 218 | KORYAN
1773 | 106 | 24 | 59 | BUUUUCOOOO
1766 | 73 | 59 | 163 | JANI
1757 | 150 | 47 | 122 | JONAS
1754 | 106 | 20 | 53 | JOSH
1753 | 51 | 42 | 152 | BUBBLE
1745 | 140 | 47 | 126 | HARRY
1744 | 66 | 48 | 148 | BATFOY
1741 | 123 | 30 | 72 | QUAID
1726 | 90 | 7 | 24 | MATT
1715 | 129 | 20 | 53 | JEFF
1701 | 72 | 18 | 61 | DANQZ
1701 | 92 | 57 | 156 | BEN


Example probability calculation

At the 2019 CTWC, 7-time champion Jonas lost 1-2 to MegaRetro. What were the chances? Jonas' rating up until that point was 1936 with an RD of 150; MegaRetro's rating was 1548 with an RD of 68. Plugging those values into formula (3) results in a win probability of 88% for a single game. Using the negative binomial distribution for a best of 3, Jonas' chance of winning the match was 96%, making what occurred roughly a 1-in-25 event.
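A quick, self-contained check of those numbers using formula (3) and the negative binomial match probability:

```python
import math
from math import comb

Q = math.log(10) / 400

def g(rd):
    """Glicko-1 attenuation factor for a rating deviation."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)

def glicko1_win_prob(r_a, rd_a, r_b, rd_b):
    """Formula (3): single-game win probability under Glicko-1."""
    combined = math.sqrt(rd_a ** 2 + rd_b ** 2)
    return 1.0 / (1.0 + 10 ** (-g(combined) * (r_a - r_b) / 400))

def match_win_prob(p, best_of):
    """Best-of-n match win probability via the negative binomial."""
    k = best_of // 2 + 1
    return sum(comb(k - 1 + j, j) * p ** k * (1 - p) ** j for j in range(k))

p_game = glicko1_win_prob(1936, 150, 1548, 68)   # Jonas vs MegaRetro, single game
p_match = match_win_prob(p_game, 3)              # best of 3
print(round(p_game, 2), round(p_match, 2), round(1 / (1 - p_match)))  # 0.88 0.96 25
```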
