Building a Rating System for Competitive NES Tetris

Over the last decade, a fantastic competitive scene has grown around NES Tetris. This originated from and is centered around the Classic Tetris World Championship.

Though it is otherwise a single-player game, players now compete in 1 vs 1 matches, starting their games simultaneously. The player with the higher score wins the game, and matches are typically played as a best of 3 or a best of 5.

The Data

As an extension to the yearly tournament, there are qualifier events (played in real life), the Classic Tetris Monthly (played by streaming video of games played at home), and other smaller competitions. One member of the community took it upon themselves to document nearly every tournament match ever played.

This is a great data set, and the rankings built from it currently use a modified Elo rating system. There was talk of switching to something more advanced such as Glicko-2, so I offered to help out.

Most of the games come from tournaments, so there is a limited amount of data to start with compared to online games with automated record-keeping, or to chess organizations with much larger player bases. That is the first hurdle. The second hurdle is that the particular order of wins and losses within a match is unknown. The last hurdle is the stochastic nature of NES Tetris: minor mistakes can quickly change the course of a game, which results in more volatility than in most other games.

Benchmarking

I decided to try multiple rating systems to see which worked best. I benchmarked each system by first "priming" player ratings with half the data set. Then, before each match, I predicted the probability of one player winning and compared it to the actual result.

I used the following formulas to make predictions. They result in a value between 0 and 1 indicating the probability of player A defeating player B. I've listed them here, since they are mostly absent from the usual sources of information on these rating systems.


    (1) Probability of winning with Elo, normal distribution:

$$P(A \text{ beats } B) = \Phi\!\left(\frac{R_A - R_B}{200\sqrt{2}}\right)$$

    (2) Probability of winning with Elo, logistic distribution:

$$P(A \text{ beats } B) = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

    (3) Probability of winning with Glicko-1 (given in Glickman's Glicko-Boost paper):

$$P(A \text{ beats } B) = \frac{1}{1 + 10^{-g\left(\sqrt{RD_A^2 + RD_B^2}\right)(r_A - r_B)/400}}, \qquad g(RD) = \frac{1}{\sqrt{1 + 3q^2 RD^2/\pi^2}}, \quad q = \frac{\ln 10}{400}$$

    (4) Probability of winning with Glicko-2:

$$P(A \text{ beats } B) = \frac{1}{1 + e^{-g\left(\sqrt{\phi_A^2 + \phi_B^2}\right)(\mu_A - \mu_B)}}, \qquad g(\phi) = \frac{1}{\sqrt{1 + 3\phi^2/\pi^2}}$$
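
To make these concrete, here is a sketch of formulas (2) and (3) in Python (the function and variable names are my own):

```python
import math

Q = math.log(10) / 400  # Glicko's scaling constant q

def elo_win_prob(r_a: float, r_b: float) -> float:
    """Formula (2): probability that A beats B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def g(rd: float) -> float:
    """Glicko's g() term, which discounts a rating difference by its uncertainty."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)

def glicko1_win_prob(r_a: float, rd_a: float, r_b: float, rd_b: float) -> float:
    """Formula (3): probability that A beats B, combining both players' RDs."""
    combined_rd = math.sqrt(rd_a ** 2 + rd_b ** 2)
    return 1.0 / (1.0 + 10 ** (-g(combined_rd) * (r_a - r_b) / 400))
```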

They are all similar in structure. Elo can be thought of as a subset of Glicko. How do we use these to evaluate a rating system? One method is to find the accuracy, counting a prediction as correct when it assigns a probability greater than 0.5 to the player who actually won the match:

$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$

Accuracy is okay, but it doesn't consider the following:
  1. How near was a prediction to the outcome? For example, when the outcome is a win (1.0), a prediction of 0.99 is a better guess than 0.6. However, accuracy treats both predictions the same.
  2. What was the margin of victory? A 0.99 prediction is more representative of a 3 - 0 match score than of a 3 - 2 match score. (Predictions cannot be made on individual games, since the order of wins is unknown.)
To address these concerns, let's start with what I'll call "scaled win margin" (SWM):

$$\text{SWM} = \frac{\text{wins}}{\text{wins} + \text{losses}}$$

We get a value between 1 (an "ace") and 0 (all games lost).

We can use the binomial distribution to calculate the probability of winning a match of n games, given the probability p of winning a single game:

$$P(\text{win match}) = \sum_{x=\lceil n/2 \rceil}^{n} \binom{n}{x}\, p^{x} (1 - p)^{n - x}$$

For example, a player wins a best of 5 match when they win at least ceil(5/2) = 3 games. We then sum the exact probabilities for x = 3 to x = 5, each equal to C(5, x) * p^x * (1 - p)^(5 - x). This is our prediction for the match.
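
A short sketch of that calculation in Python (function names are mine):

```python
import math

def match_win_prob(p: float, n: int) -> float:
    """Probability of winning a best-of-n match, given per-game win probability p.
    Winning the match means winning at least ceil(n/2) of n games."""
    needed = math.ceil(n / 2)
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(needed, n + 1))

# Example: a 60% game favorite wins a best of 5 about 68% of the time.
print(round(match_win_prob(0.6, 5), 4))  # 0.6826
```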

We can then find the mean absolute error between our results (SWM, or y) and our predictions (y hat):

$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$$

MAE helps when optimizing parameters, since it allows us to incorporate information about the number of wins and losses, despite not knowing their order.
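
In code, the benchmark metric boils down to something like this sketch, where each match contributes a (SWM, predicted match-win probability) pair:

```python
def mean_absolute_error(results: list[tuple[float, float]]) -> float:
    """results holds (swm, predicted_match_win_prob) pairs, one per match."""
    return sum(abs(y - y_hat) for y, y_hat in results) / len(results)

print(mean_absolute_error([(1.0, 0.82), (0.6, 0.55), (0.0, 0.31)]))  # 0.18
```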

Elo

Elo is predictive because it factors in expectations when updating players' ratings. The basic idea is this:

$$R_{\text{new}} = R_{\text{old}} + K \cdot (\text{outcome} - \text{expectation})$$

Expectation in Elo is found with formula (1) in this article.

If a 3000-rated player beats a 1000-rated player (outcome = 1), their ratings will not change much, as this is expected (expectation ~= 0.999). We update the 3000-rated player's rating with 1 - 0.999 = 0.001, giving a new rating of old rating + 0.001 * K. If the 3000-rated player loses instead, their rating changes by over 99% of the value of K.
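
Here's a minimal sketch of that update rule, using the normal-model expectation from formula (1); the per-player standard deviation of 200 is the conventional choice, and the names are mine:

```python
import math

def expected_score(r_a: float, r_b: float, sigma: float = 200.0) -> float:
    """Formula (1): normal-model expectation, treating each rating as having s.d. sigma."""
    z = (r_a - r_b) / (sigma * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))  # standard normal CDF

def elo_update(r_a: float, r_b: float, outcome: float, k: float) -> float:
    """Return player A's new rating; outcome is 1 for a win, 0 for a loss."""
    return r_a + k * (outcome - expected_score(r_a, r_b))

# A 3000-rated player beating a 1000-rated player gains almost nothing.
print(round(elo_update(3000, 1000, outcome=1, k=32), 2))  # ~3000.0
```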

P-Factor 

I recreated something called the "P-factor" system. The P-factor is calculated from the match's margin of victory: it equals 1 if |wins - losses| is 0 or 1, 1.5 if it is 2, and (11 + |wins - losses|) / 8 otherwise. It is then multiplied by K when updating Elo ratings. I do not recommend it, as it adds complexity without adding predictive value; I've included it only to benchmark the system that was already in place.
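
For reference, the multiplier described above looks like this (a sketch; the original implementation may differ in details):

```python
def p_factor(wins: int, losses: int) -> float:
    """Margin-of-victory multiplier applied to K when updating Elo after a match."""
    margin = abs(wins - losses)
    if margin <= 1:
        return 1.0
    if margin == 2:
        return 1.5
    return (11 + margin) / 8

print(p_factor(3, 0), p_factor(3, 2))  # 1.75 1.0
```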

Individual game simulation

The data set does not contain the order of wins within a match. However, since the number of wins is known, we can create an estimate of the game order: the last game is always a win for the winner of the match, and the rest of the wins and losses are alternated at the tail end of the match. For example, if the win (W) - loss (L) record for a match was 3-2, the sequence would be WLWLW. We can then update Elo on a per-game basis instead of per match. This resulted in an improvement in predictive value.
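
Here's a sketch of one way to reconstruct that sequence from a match's win-loss record (names are mine):

```python
def simulated_game_order(wins: int, losses: int) -> list[str]:
    """Rebuild a plausible game order: the match winner takes the last game,
    the remaining wins and losses alternate at the tail end, and any surplus
    games from the longer side fill out the front of the match."""
    tail = ["W"]                      # last game is always a win for the match winner
    w, l = wins - 1, losses
    next_is_loss = True
    while w > 0 and l > 0:            # alternate backwards from the end
        if next_is_loss:
            tail.append("L")
            l -= 1
        else:
            tail.append("W")
            w -= 1
        next_is_loss = not next_is_loss
    head = ["W"] * w + ["L"] * l      # whatever is left goes at the start
    return head + list(reversed(tail))

print(simulated_game_order(3, 2))  # ['W', 'L', 'W', 'L', 'W']
```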

Two-tier K with minimum game threshold

The next thing I tried was using a different K based on how many games a player has under their belt. Beginners use a higher K; after they've played a certain number of games, they switch to a lower K. This, combined with game simulation, improved predictive value over all previous methods.
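
A sketch of the K selection (parameter names are mine; the optimized values appear in the results table below):

```python
def choose_k(games_played: int, k_beginner: float, k_established: float,
             min_games: int) -> float:
    """Two-tier K: players under the game threshold get the higher, faster-moving K."""
    return k_beginner if games_played < min_games else k_established
```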

Glicko-1

Glicko is an extension of Elo, created by Mark Glickman in 1995. I won't go into detail about Glicko's implementation, as its formulas are much more complex, but you can read an explanation here. Glicko's key features are as follows.
  • Keep a second statistic of how certain we are of a player's rating, known as rating deviation (RD).
  • Base the weight with which we update a rating (the analog of Elo's K) on that certainty, RD.
  • Slowly increase RD over periods of inactivity.
  • Be able to update in batches of multiple games at a time.
Since Glicko doesn't care about win order when processing multiple games per rating update, the natural thing to do here is to process all the games within a match as one update. Glicko was originally conceived to handle batches of games played within a month, which is how often FIDE updates its chess rankings. It is becoming more common to update Glicko after each game rather than after a period of time, in order to give players immediate feedback; TrueSkill, which is based on similar ideas to Glicko, also updates on a per-game basis. I've experimented with monthly and daily batch updates and saw no significant improvement in the ratings compared to processing only the games within a match at one time.

As mentioned before, Glicko increases a player's RD whenever they don't play for a period of time. It does so using a constant C. I've found that, as a practical solution, it's easiest to do this before each match using the following:

$$RD_{\text{new}} = \sqrt{RD_{\text{old}}^2 + C^2\, t}$$

where t is the time since the player's last match.

I was skeptical, though. In my opinion, a player's skill doesn't change much over periods of inactivity; at least, it doesn't after they warm up with some practice. To my surprise, this did, however, increase predictive value. Whether that's worth the extra complexity for a particular application, I can't say. As a side note, Glicko specifies that RD should never exceed the initial starting RD. For this particular data, I did not find that feature to influence predictive value.
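
A minimal sketch of that pre-match adjustment, assuming t measures the time since the player's last match and keeping Glicko's usual cap at the initial RD:

```python
import math

def inflate_rd(rd: float, c: float, t: float, rd_cap: float = 350.0) -> float:
    """Grow a player's RD for an inactive stretch of length t before their next match.
    The cap at the initial RD is Glicko's default; as noted above, it made little
    difference on this data set."""
    return min(math.sqrt(rd * rd + c * c * t), rd_cap)
```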

Glicko-2

Glicko-2 was created in 2000, and further extends Glicko-1 by adding a factor sigma that updates RD based on the volatility in a player's results. This is constrained by a constant tau. It also scales down values when making calculations, where mu represents a scaled rating and phi represents a scaled RD. You can find details on its implementation here.

Keeping track of volatility adds a good bit of complexity to Glicko-2. Glickman states that it works best when processing 10 to 15 games at a time. This data set contains an average of 3.4 games per match (processed in a batch), which probably explains why its predictive value did not exceed Glicko-1's. Multi-day rating periods would address this, and I did try that, but the benchmark is not comparable to the rest: updating ratings less often automatically lowers accuracy when using those ratings to make predictions. My gut feeling is that this extra complexity would probably not be worth the extra predictive value, if there is any. There is more discussion on that topic here.

The results


| Rating System | Accuracy | Mean Absolute Error | Parameters |
|---|---|---|---|
| Elo, single K, no game sim | 0.6688 | 0.2701 | K = 160 (ACC); K = 27 (MAE) |
| Elo, P-factor method, no game sim | 0.6669 | 0.2700 | K = 108 (ACC); K = 17 (MAE) |
| Elo, single K, game sim | 0.6734 | 0.2614 | K = 43 (ACC); K = 19 (MAE) |
| Elo, 2-tier w/ min game threshold, game sim | 0.6771 | 0.2565 | K = 10, K2 = 46, min games = 27 (ACC); K = 15, K2 = 57, min games = 25 (MAE) |
| Glicko-1, no passive RD | 0.6762 | 0.2566 | RD = 550, C = 0 (ACC); RD = 200, C = 0 (MAE) |
| Glicko-1, passive RD | 0.6790 | 0.2527 | RD = 350, C = 0.467 (ACC); RD = 150, C = 0.288 (MAE) |
| Glicko-2 | 0.6626 | 0.2696 | phi = 250, sigma = 0.1, tau = 0.01 (ACC); phi = 150, sigma = 0.1, tau = 0.01 (MAE) |

In the end, the community decided to continue using the system already in place. The main reasons given were:

  1. When updating ratings on a per-game basis, a player can win a match but lose points overall if they lost individual games within it. This mainly happens when the players are unevenly matched, where losing even one game can bring down the higher-rated player's rating significantly.
  2. Disagreements over where certain players would rank on the leaderboard. 
  3. If a player saw their ranking fall, they automatically distrusted the new system.
There is an element of politics involved when a community switches from one system to another, and that is probably part of why some big chess organizations such as FIDE still haven't moved on from Elo.

Here are the top 15 Glicko-1 ratings as of 2019-12-10:
| Player | Rating | RD | Sets | Games |
|---|---|---|---|---|
| JOSEPH | 2082 | 84 | 44 | 142 |
| GREENTEA | 1985 | 120 | 29 | 75 |
| KORYAN | 1903 | 67 | 79 | 218 |
| BUUUUCOOOO | 1773 | 106 | 24 | 59 |
| JANI | 1766 | 73 | 59 | 163 |
| JONAS | 1757 | 150 | 47 | 122 |
| JOSH | 1754 | 106 | 20 | 53 |
| BUBBLE | 1753 | 51 | 42 | 152 |
| HARRY | 1745 | 140 | 47 | 126 |
| BATFOY | 1744 | 66 | 48 | 148 |
| QUAID | 1741 | 123 | 30 | 72 |
| MATT | 1726 | 90 | 7 | 24 |
| JEFF | 1715 | 129 | 20 | 53 |
| DANQZ | 1701 | 72 | 18 | 61 |
| BEN | 1701 | 92 | 57 | 156 |

Example probability calculation

At the 2019 CTWC, 7-time champion Jonas lost 1-2 to MegaRetro. What were the chances? Jonas' rating up until that point was 1936 with an RD of 150. MegaRetro's rating was 1548 with an RD of 68. Plugging those values into formula (3) gives a single-game win probability of 88% for Jonas. Using the binomial distribution for a best of 3, Jonas' chance of winning the match was 96%. The odds of what occurred were 1 in 25.
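
For anyone who wants to check the arithmetic, here's the calculation in Python, reusing formula (3) and the binomial match probability from earlier (names are mine):

```python
import math

Q = math.log(10) / 400

def g(rd: float) -> float:
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)

def glicko1_win_prob(r_a, rd_a, r_b, rd_b):
    """Formula (3): single-game probability that A beats B."""
    return 1.0 / (1.0 + 10 ** (-g(math.hypot(rd_a, rd_b)) * (r_a - r_b) / 400))

def match_win_prob(p, n):
    """Probability of winning a best-of-n match given per-game win probability p."""
    need = math.ceil(n / 2)
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(need, n + 1))

p_game = glicko1_win_prob(1936, 150, 1548, 68)   # Jonas vs. MegaRetro, single game
p_match = match_win_prob(p_game, 3)              # best of 3

print(f"game: {p_game:.0%}, match: {p_match:.0%}, upset odds: 1 in {1 / (1 - p_match):.0f}")
# game: 88%, match: 96%, upset odds: 1 in 25
```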