How to Measure Mahjong Luck and Skill

Every Game of Mahjong Is A Dice Roll

Imagine you are playing a 4 player game of mahjong against 3 clones of yourself. Your probability of finishing in each of 1st, 2nd, 3rd, and 4th place is 25% - you're all equal in skill and playing the same strategy, so there's no reason any clone would have an advantage. The average placement you would achieve is 2.50. If you were playing against weaker players and had an average placement of 2.40, we would say that you have some edge, or advantage.

Now suppose you were playing perfect mahjong against competent opponents. How much edge is possible? One estimate of the maximum possible edge would be the results of the AI LuckyJ, who has the best performance of any person or AI in the Tokujou room on Tenhou. Over 1145 games, LuckyJ's spread of placements is 31.5% 1st, 27.5% 2nd, 24.1% 3rd, and 16.7% 4th. Its average placement is 2.26. There is some selection bias in estimating the edge of perfect play using the best performance we've seen - maybe LuckyJ got lucky. At the same time, it may be possible to achieve a greater edge with a stronger AI. With these two factors counterbalancing each other, I think LuckyJ's performance is a reasonable estimate of perfect performance.
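As a quick check of the numbers above, here is a minimal sketch that computes average placement from a spread of placement rates. The function name is my own, and the rates are normalized first because the quoted LuckyJ percentages are rounded and sum to 99.8%.

```python
def average_placement(spread):
    """Expected placement for a spread given as (p1, p2, p3, p4)."""
    total = sum(spread)
    # Normalize so rounded percentages that do not sum to exactly 1 still work.
    return sum(place * p / total for place, p in enumerate(spread, start=1))

clones = (0.25, 0.25, 0.25, 0.25)
luckyj = (0.315, 0.275, 0.241, 0.167)  # rates quoted above (rounded)

print(round(average_placement(clones), 2))  # 2.5
print(round(average_placement(luckyj), 2))  # 2.26
```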

The Stable Rank

We first looked at LuckyJ's performance using placements and average placement, but these stats are not the most relevant to playing online mahjong. The rewards that the online mahjong client gives you for 1st/2nd/3rd/4th will determine how good a spread of placements is.

If I had to pick one statistic to represent a player's skill, it would be the "stable rank", which takes into account the reward system. In this post, we'll assume that you're playing on one of the popular online riichi mahjong clients - Mahjong Soul, or Tenhou. We'll assume that you're playing in Jade Room for Mahjong Soul, or Tokujou / Houou for Tenhou. We won't discuss other reward systems, but we could repeat a similar analysis for any arbitrary ruleset.

For a given room, your "stable rank" is a measure of what rank you will end up at if you repeat a set of results infinitely. The room will determine what rewards you get for 1st / 2nd / 3rd / 4th. As your rank goes up, the penalty of 4th place will go up, and you will have to perform better to avoid deranking.

For reference, LuckyJ achieves a stable rank of 10.68 dan in Tokujou. The same spread of placements in Jade Room would correspond to a stable rank of roughly Saint 9. Jade Room is generally considered a weaker room than Tokujou, so this would be an underestimate of an equivalent performance in Jade Room.
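The definition of stable rank can be sketched for Tenhou-style rewards as the dan at which your expected rank-point change per game is zero. This is only an illustration under assumptions: I assume Tokujou pays +75 / +30 / 0 for 1st / 2nd / 3rd (not stated in this post) and that the 4th penalty grows by 15 per dan, passing through -105 at 5 dan and -120 at 6 dan (the two values quoted later in the post). The function name and parameters are my own.

```python
def stable_dan(p1, p2, p4, first=75, second=30, step=15, penalty_at_5dan=105):
    """Dan at which the expected rank-point change per game is zero.

    Assumes the 4th-place penalty is linear in dan:
    penalty(d) = penalty_at_5dan + step * (d - 5).
    """
    if p4 == 0:
        # With no 4ths, no penalty is large enough to stop the climb.
        return float("inf")
    break_even_penalty = (first * p1 + second * p2) / p4
    return 5 + (break_even_penalty - penalty_at_5dan) / step

print(stable_dan(0.25, 0.25, 0.25))     # 5.0
print(stable_dan(0.315, 0.275, 0.167))  # LuckyJ's rounded Tokujou rates
```

With LuckyJ's rounded rates this lands at about 10.7 dan, close to the 10.68 quoted above; the small gap is plausibly from rounding in the published percentages. Jade Room would need its own reward values.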

As an example of how to interpret the stable rank of someone's online performance, take a look at Futoshi's Jade Room stats on Mahjong Soul. Futoshi is well known for being a professional mahjong player, one of 20 people to hit the highest rank on Tenhou, and one of the human players that the Mahjong AI Naga is trained on. Futoshi achieves a pretty impressive stable rank of Saint 6 over 1200 games, which is enough to achieve the highest rank on Majsoul. However, we can't be sure that his true skill is Saint 6. What if in those 1200 games, he got unlucky, and his true stable rank is actually Saint 9? Or what if he got lucky, and his true skill is Saint 3?

The Sampling Distribution

We decided that the "stable rank" is a good single statistic for capturing a player's skill, but realized that it is subject to variance. Let's assume that a player has some true level of skill. If they played infinitely in an online room, they would achieve some spread of 1st/2nd/3rd/4th rate and a stable rank of X. A player can get lucky and achieve a stable rank value that is higher than X, or unlucky and achieve one that is lower than X.

If a player plays n games, the observed stable rank will be randomly distributed according to some sampling distribution. From the example above, there's no way we can figure out what Futoshi's true Jade Room stable rank is. All we can observe is the stable rank from that sample of 1200 games, and then use our understanding of sampling distributions to determine a range that likely captures his stable rank. How do we decide what that range is?

Stable Rank Confidence Intervals

To analyze stable rank variance, I'm going to follow the procedure outlined in this post which analyzed stable rank sampling distributions in the Houou room on Tenhou. At the end of this blog post, I'll share some technical details on why I chose this approach instead of other possible ones.

Let's first look at a sampling distribution I generated for Jade Room.
For the Ms1 rank, I sampled 100, 400, 1000, and 4000 games with a spread of 25% each placement. I repeat this 100,000 times for each game count, and plot a histogram over 100 bins to show the distribution of realized stable rank for a true stable rank of Ms1. The sampling distribution looks like a normal distribution that skews slightly right. As you play more games, the width of the distribution shrinks.
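The sampling procedure described above can be sketched as follows. This is not the code behind the chart: to keep the assumptions small it uses the Tenhou-style break-even formula (assumed Tokujou rewards of +75 / +30 / 0 and a 4th penalty that is -105 at 5 dan and grows by 15 per dan) rather than Jade Room rewards, and only 1,000 trials per game count instead of 100,000. All names are hypothetical.

```python
import random

def stable_dan(n1, n2, n4, first=75, second=30, step=15, penalty_at_5dan=105):
    """Break-even dan for the given placement counts (assumed Tokujou-style rewards)."""
    if n4 == 0:
        return float("inf")
    return 5 + ((first * n1 + second * n2) / n4 - penalty_at_5dan) / step

def sample_stable_dans(spread, n_games, n_trials, rng):
    """Simulate n_trials records of n_games each; return the sorted stable dans."""
    results = []
    for _ in range(n_trials):
        games = rng.choices((1, 2, 3, 4), weights=spread, k=n_games)
        results.append(stable_dan(games.count(1), games.count(2), games.count(4)))
    return sorted(results)

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    idx = min(len(sorted_values) - 1, int(pct / 100 * len(sorted_values)))
    return sorted_values[idx]

rng = random.Random(0)
widths = {}
for n_games in (100, 400, 1000):
    dans = sample_stable_dans((0.25, 0.25, 0.25, 0.25), n_games, 1000, rng)
    p16, p50, p84 = (percentile(dans, p) for p in (16, 50, 84))
    widths[n_games] = p84 - p16  # rough "2 std" width of the distribution
    print(n_games, round(p16, 2), round(p50, 2), round(p84, 2))
```

The 16th to 84th percentile width should shrink as the game count grows, which is the narrowing described above.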

As a side note, the Mahjong Soul ranking system awards a flat uma for placement, as well as rank pts for raw score. To simplify the process, I ignore raw score and use a larger uma of [35, 10, -10, -35] to perform the analysis, verifying that the results match up reasonably with the stable rank shown on the amae-koromo website.

Now let's take a look at a chart that I generated for Jade Room, which simplifies each sampling distribution into some key numbers.
Let's interpret the first row of the chart. We start with the assumption that our true stable rank is Ms1. If we play 100 games, our average stable rank is Ms1.04 (50th percentile). What if we get lucky, achieving an 84th percentile result? Our stable rank for these 100 games would be St 2.06. If we received the bottom 16% of luck, our stable rank would be Ms -1.96.

I chose the percentiles based on the 68-95-99.7 rule. Under a normal distribution, a standard deviation from the mean will get you to the 16th and 84th percentiles. 2 standard deviations will get you to roughly the 2.5 and 97.5 percentiles. Because the sampling distribution is close to normal, you can think of these columns as representing 1 or 2 standard deviations of good / bad luck.
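For reference, the percentile values behind the 68-95-99.7 rule can be checked with Python's standard library. This is just a worked example of the rule for an exactly normal distribution, not part of the chart generation.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, std 1
for k in (1, 2):
    lo, hi = z.cdf(-k), z.cdf(k)
    print(f"+/-{k} std: percentiles {100 * lo:.1f} to {100 * hi:.1f}, "
          f"coverage {100 * (hi - lo):.1f}%")

# The exact 97.5th percentile sits at about 1.96 std rather than 2.
print(round(z.inv_cdf(0.975), 2))  # 1.96
```

So 1 std covers about 68.3% and 2 std about 95.4%, which is why the 2.5 / 97.5 columns are a close but not exact match for 2 standard deviations.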

Using this chart, let's build some rough confidence intervals for Futoshi's stable rank. Remember that Futoshi achieved a stable rank of Saint 6 over 1200 games. Let's first build a 95% confidence interval. For the lower end of the interval, we consider the possibility that Futoshi high rolled a top 2.5% (+2std) outcome. For 1K games, the St2 +2std stable rank is St5.17, and the St3 +2std rank is St6.35. We can roughly estimate the bottom of this interval at St2.5. Now let's estimate the upper end of the interval. I don't show the higher stable ranks in the chart, but the St9 -2std stable rank is St 5.16. Our rough 95% confidence interval for Futoshi's skill is [St2.5, St9+]. This means that there is roughly a 95% probability that Futoshi's true skill level is within this range based on his Jade Room results.

To repeat the above analysis for the 68% confidence interval, we just need to use the 16% and 84% columns. The end result is a 68% confidence interval of [St4.5, St8].
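The lower bound of the 95% interval above comes from reading between two chart rows. Here is a minimal sketch of that step as a linear interpolation, using only the two 1K-game values quoted in the text; the function name is hypothetical and linear interpolation is my assumption about how to read between rows.

```python
def true_rank_for_observed(observed, rank_lo, shifted_lo, rank_hi, shifted_hi):
    """Interpolate the true rank whose shifted (e.g. +2std) result equals `observed`."""
    frac = (observed - shifted_lo) / (shifted_hi - shifted_lo)
    return rank_lo + frac * (rank_hi - rank_lo)

# 1K-game values quoted above: St2 +2std -> St5.17, St3 +2std -> St6.35.
lower = true_rank_for_observed(6.0, 2, 5.17, 3, 6.35)
print(round(lower, 2))  # 2.7
```

The chart row is for 1000 games while the sample is 1200 games, so this is only approximate; the text rounds the result to roughly St2.5.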

Here is the chart for Tokujou and Houou:
From these tables, we observe that it takes on the order of 400 games in a room to narrow down the stable rank 95% confidence interval to around +-2 dan, or +-4 ranks on MJS.

You may notice that the variance on the Tenhou charts seems smaller than on the MJS ones. This is because the increase in difficulty between ranks on MJS is smaller than on Tenhou. Ignoring MJS uma / raw score, going from zero sum Ms1 to Ms2 is a 9.1% increase in 4th penalty from -165 to -180, whereas going from zero sum 5 dan to 6 dan is a 14.3% increase in 4th penalty from -105 to -120.
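The two percentages above are simple relative increases in the 4th place penalty, which can be verified directly (the helper name is my own):

```python
def pct_increase(old_penalty, new_penalty):
    """Relative growth in penalty size, in percent."""
    return 100 * (abs(new_penalty) - abs(old_penalty)) / abs(old_penalty)

print(round(pct_increase(-165, -180), 1))  # 9.1  (Ms1 -> Ms2)
print(round(pct_increase(-105, -120), 1))  # 14.3 (5 dan -> 6 dan)
```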

I would also like to highlight a few observations from the original blog post that generated these tables for Houou. The first is that the sampling distribution is right skewed - you can high roll larger stable ranks than you can low roll lower ones. If you got no 4ths in a sample your stable rank is infinite, so very low 4th rates can blow up the observed stable rank. The second is that the higher your rank gets, the higher your variance.

Demotion and Promotion Probabilities

Stable rank does not take into account the fact that your rates change when you promote or demote. For example, let's say you're playing in the Tokujou room as a true stable 5 dan, with a rank of 5 dan. You play a few hundred games, and you high roll your way to 6 dan with some good luck. At this point, your 4th penalty goes up by 14.3%, and you will require more luck to avoid demoting back to 5 dan, your true stable rank. The same applies for demotions - if you demote, the reward system gets easier, and you would need to keep getting unlucky to avoid promoting back to 5 dan.

To model this effect, I ran 1K simulations that model the probability of promoting, demoting, and staying at the same rank after 400 games with a given amount of edge. With a current rank of 5 dan, an edge of .5 dan means you're playing with a true stable rank of 5.5 dan at the 5 dan starting score of 1000/2000.
The demote and promote game count columns are the average number of games it took to demote / promote.

For the stable 5 dan player that high rolled to 6 dan, their probability of demoting within 400 games is 71.9%. For the stable 5 dan player that low rolled to 4 dan, their probability of promoting within 400 games is 79.6%.
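A simulation of this kind can be sketched as a random walk on rank points. This is an illustration and not the code behind the table: I assume Tokujou rewards of +75 / +30 / 0, a 6 dan 4th penalty of -120 (from this post), a 6 dan starting score of 1200 with promotion at 2400, and demotion when points drop below zero. A 25% spread is break-even at 5 dan under these rewards, so it stands in for the stable 5 dan player who high rolled to 6 dan.

```python
import random

def play_until_rank_change(spread, rewards, start, promote_at, max_games, rng):
    """Random-walk the rank points; return (outcome, games played)."""
    pts = start
    for game in range(1, max_games + 1):
        pts += rng.choices(rewards, weights=spread)[0]
        if pts >= promote_at:
            return "promote", game
        if pts < 0:  # assumption: demotion happens when points drop below zero
            return "demote", game
    return "stay", max_games

def outcome_rates(spread, rewards, start, promote_at,
                  max_games=400, trials=1000, seed=0):
    """Fraction of simulated runs that promote, demote, or stay."""
    rng = random.Random(seed)
    tally = {"promote": 0, "demote": 0, "stay": 0}
    for _ in range(trials):
        outcome, _games = play_until_rank_change(
            spread, rewards, start, promote_at, max_games, rng)
        tally[outcome] += 1
    return {k: v / trials for k, v in tally.items()}

# Stable 5 dan player (25% spread) currently sitting at 6 dan.
rates = outcome_rates((0.25, 0.25, 0.25, 0.25),
                      rewards=(75, 30, 0, -120), start=1200, promote_at=2400)
print(rates)
```

Under these assumptions demotion should be by far the most common outcome, in line with the direction of the figures above; the exact rates depend on the reward values and the number of trials.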

Here are the tables for MJS Jade Room and Houou. For the St3 rank, I used 1000 games instead of 400 games, because most St3 players stayed at the same rank after 400 games.
To summarize these tables, on Tenhou, we can estimate the probability that a player ranked 1 dan above or below their true skill moves toward their stable rank at 70-80%. For MJS, since the ranks represent smaller differences in skill, this same probability range applies to players ranked 2 MJS ranks above or below their true skill. However, it is less likely that a MJS player is 2 ranks off of their true skill versus a Tokujou player being 1 rank off of their true skill, since the MJS player would have had to move away from their true skill for two ranks instead of one.

Analyzing the Stable Rank of People on Discord

I made a Google Colab notebook that can perform a more precise version of the stable rank confidence interval analysis for a player's results in Jade Room, Tokujou, or Houou. I asked a few people on Discord for permission to analyze their ladder stats using the charts from this notebook. The link for the notebook is here if you want to generate charts for your Tenhou or MJS ladder results, or take a look at the code.

Let's start by taking a look at some stable rank plots for this Jade Room player. Looking at the bottom graph, the player's running stable rank improved from roughly Ms3 (Ms1 <=> 1 on the chart) at 400 games, to St1 at 1000 games. In general, we expect players to improve at the game as they gain experience, and the blue line trending up is a good sign. The 400 game line also trends up. This line leaves out games from 400+ games ago, and those games could potentially underrepresent an improving player's skill.
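The windowed lines in these plots (last 100 games, last 400 games) can be thought of as a stable rank recomputed over a trailing window. Below is a minimal sketch of that idea, not the notebook's code. It uses the Tenhou-style break-even formula under assumed Tokujou rewards (+75 / +30 / 0, 4th penalty -105 at 5 dan and 15 more per dan); a Jade Room version would need Jade Room rewards. The names and the tiny example record are hypothetical.

```python
def stable_dan(n1, n2, n4, first=75, second=30, step=15, penalty_at_5dan=105):
    """Break-even dan for the given placement counts (assumed Tokujou-style rewards)."""
    if n4 == 0:
        return float("inf")
    return 5 + ((first * n1 + second * n2) / n4 - penalty_at_5dan) / step

def windowed_stable_dan(placements, window):
    """Stable dan over each trailing window of `window` games (placements are 1..4)."""
    out = []
    for end in range(window, len(placements) + 1):
        recent = placements[end - window:end]
        out.append(stable_dan(recent.count(1), recent.count(2), recent.count(4)))
    return out

# Hypothetical record: one of each placement, repeated three times.
record = [1, 2, 3, 4] * 3
print(windowed_stable_dan(record, window=4))  # every window is break-even at 5.0
```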

Depending on which line we look at, this upward trend of 1 or 2 stable ranks represents a probable increase in skill. Using the 400 game variance, the player performed slightly below 1std better in the last 400 games compared to the first 400 games. The probability of a static player performing 1std higher for 2 independent samples of 400 games is roughly 30%, so there's around a 70% chance that this player has improved from the first 400 to the last 400 games, looking only at the performance. With that said, 30% is a pretty significant number, so I wouldn't worry too much about seeing the lines go up. While the analysis of this hypothesis is valid, we want to be careful of 'p-hacking', or testing too many hypotheses and using confirmation bias to pick out the stories that fit what we want to see.

The spike in the yellow line around games 600-800 is pretty insane, reaching a running 100 game stable rank of St20+. This is above a 100 game +2std percentile result for a St3 player, but perhaps it's not as uncommon as it may initially seem. 1000 games includes 10 independent samples of 100 games, and it's much more likely for one of those 10 samples to be very lucky or unlucky. This is not even considering the overlapping samples of 100 games. It is also true that the higher the true stable rank, the higher the possible variance - better players (relative to the zero-sum rank of the room) are more prone to relatively good or bad streaks. It may be tempting to remove an obvious high roll from the dataset, but to avoid bias in either direction, I would recommend against excluding extreme good or bad results.
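To put a number on "one of those 10 samples", here is the chance that at least one of 10 independent 100-game samples lands in a given tail. It treats the samples as independent and takes the 2.5% tail as exact, both simplifications.

```python
def prob_at_least_one(tail_prob, n_samples):
    """Chance that at least one of n independent samples falls in the tail."""
    return 1 - (1 - tail_prob) ** n_samples

print(round(prob_at_least_one(0.025, 10), 3))  # 0.224: at least one top 2.5% sample
print(round(prob_at_least_one(0.05, 10), 3))   # 0.401: at least one top or bottom 2.5% sample
```

So over 1000 games, seeing at least one +2std stretch of 100 games is roughly a 1 in 4 or 5 event rather than a 1 in 40 event, before even counting overlapping windows.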

Finally, I want to bring back our estimate of "perfect play" from the beginning of this post to highlight that despite high stable rank variances, there is also a high skill ceiling. Looking at the bottom chart, this player's running stable rank is St1 and their bottom 2.5% luck stable rank is St4. Our estimate of the stable rank of perfect play for Jade Room was St9. Even if we assume this player was very unlucky, they are making mistakes compared to optimal play, leaving behind 5 stable ranks of performance as a low estimate. I'm not saying this to disparage the player - none of us are perfect. I just want to emphasize that although variance is high, for this St1 sample stable rank player, the potential for optimization is even higher.

Now let's interpret the results of this Tokujou player. I'll keep it shorter since we'll be looking at similar things to the above analysis. This player increased their running stable rank from 4 dan at 400 games to 6 dan at 1200 games. Remember that Tenhou rankings cover wider gaps in skill than MJS ones, so this is a very impressive improvement. In the bottom graph, the running stable rank improvement is almost 3 std, and we can confidently say that this player got much stronger from game 400 to game 1200. Since the player has improved a lot, we may want to focus on the running 400 game green line, which goes from 4 dan to around 7 dan.

Again, we see some spikes in the running 100 stable dan line, three of them hitting stable 12 dan. This is a +2std result for a stable 6 dan player, and between a +1std and a +2std result for a stable 7 dan player.

Finally, let's compare the performance to our estimated 10.68 stable dan for perfect mahjong. If we use the green line as our estimate of this player's stable rank, that's roughly a 3.7 dan gap with perfect play. Even strong players have room to improve their play.

Some Comments About Luck

I want to end this post with a few comments about the nature of luck, and how people interpret their luck. 65% of people believe that they are smarter than average. I think that self-evaluating whether your mahjong luck is above average is harder than self-evaluating whether your intelligence is above average, so I'll propose that 75% of mahjong players think they are less lucky than average.

I think complaining has a purpose, but at the same time many complaints about mahjong luck are rather short sighted. For example, people often complain about events that happen in an individual game. "My 3-sided wait lost to a hell wait", or "I dealt into someone who overpushed their hand", or "I won no hands in that hanchan". Look at the yellow 100 game line above, or the green 400 game line. The yellow line is incredibly spiky, and the green line can also move up and down by a few ranks. One fourth place is nothing in the context of 100 or 400 games. For a reasonable estimate of skill, we should be thinking of performances on the order of hundreds of games. In a ladder system, 150-200 games is roughly how long it takes to rank up or down in Tokujou or Jade Room, so hitting 400 games over time should be expected for regular online players.

Even if you get 5 4ths in a row, or you run bad for 20 games, if you get average luck for the rest of the games in a 400 game sample, you're not even close to -1std of bad luck. Remember, only 16% of players hit that level of bad luck, so you'd have to be less lucky than 84 out of 100 people.

I make these clarifications because I don't want people to look at these tables or graphs and automatically assume that they're below average in luck. Variance goes both ways - you could be lucky or unlucky, and people will be biased towards thinking they are unluckier than they are. I'll conclude this section with a recommendation for any serious online player to read The Mental Game Of Poker, which is a great resource on how to think about luck and improvement, and how to manage tilt.

Summary

Using statistics, we can build confidence intervals around a person's stable rank to estimate a range of their skill level. It takes on the order of 400 games in a room to narrow down the stable rank 95% confidence interval to around +-2 dan, or +-4 ranks on MJS. The higher your true stable rank is compared to the zero sum rank of the room, the wider the interval.

However, stable rank does not tell the whole story, because underranked and overranked players will get pushed towards their stable rank. The probability that a player ranked 1 dan above or below their true skill moves toward their stable rank is around 70-80%. For MJS, this same probability range applies to players ranked 2 ranks above or below their true skill. However, it is less likely the MJS player gets 2 ranks away from their skill than it is for the Tenhou player to get 1 rank away from their skill.

Analyzing the stable ranks of a few players suggests that although stable rank variances are high, we have statistical evidence that most players are noticeably weaker than the stable ranks of top AI such as LuckyJ or Naga. I believe that players should acknowledge that due to luck, they may very reasonably be ranked higher or lower than their true stable ranks by +-1 dan on Tenhou (2 dan in extreme cases), or +-2 ranks on MJS (3 ranks in extreme cases). However, there is always room for growth - perfect play is 5+ ranks stronger than the majority of players. There are always more decisions to be optimized, and the edge you gain from those optimizations may be higher than you think.

If you want to plot the stable rank for your own ladder results, you can use this notebook.

Technical Notes on Confidence Interval Generation

For confidence interval estimation, I generated sampling distributions for specific ranks, and then interpolated the variances of those sampling distributions. The end result of this interpolation is a function that maps game count and stable rank to 4 different variance numbers (-2std, -1std, 1std, 2std). I then estimated lower and upper bounds for a player's stable rank to generate a confidence interval. This approach is correct under the assumption that the sampling distributions are normal with a known fixed variance, which is not true. The sampling distributions are close to normal, but have a right tail. The variances also increase with higher stable ranks. However, these assumptions become more valid with higher samples, and this approach avoids some of the downsides of other approaches.

Another possible approach is bootstrapping, which resamples results from the observed distribution, and was used in this analysis of AI stable ranks. I think the approach I chose is superior to bootstrapping. One major flaw of the bootstrapping approach is that it generates a confidence interval with a right tail. This is true because the resampling distribution is the same as sampling numbers from a spread of possible placements and then computing stable rank, which will follow the same patterns as the sampling distributions that we generated. Based on the sampling distributions we created, we know that the opposite should be true. Because you generally high roll larger stable ranks than you low roll lower ones, the lower bound of the confidence interval should be further away from the sample stable rank than the upper bound. Bootstrapping fails to utilize the known information about the sampling distribution, producing an output with the opposite effect of what we know to be true analytically.
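For readers unfamiliar with the technique, here is a minimal sketch of a percentile bootstrap for stable rank. It is a generic illustration and not the code from the analysis mentioned above. It uses the Tenhou-style break-even formula under assumed Tokujou rewards (+75 / +30 / 0, 4th penalty -105 at 5 dan and 15 more per dan), and the 400-game record is hypothetical.

```python
import random

def stable_dan(placements, first=75, second=30, step=15, penalty_at_5dan=105):
    """Break-even dan for a list of placements 1..4 (assumed Tokujou-style rewards)."""
    n1, n2, n4 = placements.count(1), placements.count(2), placements.count(4)
    if n4 == 0:
        return float("inf")
    return 5 + ((first * n1 + second * n2) / n4 - penalty_at_5dan) / step

def bootstrap_interval(placements, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap: resample the record with replacement, recompute the stat."""
    rng = random.Random(seed)
    n = len(placements)
    stats = sorted(stable_dan(rng.choices(placements, k=n))
                   for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]

# Hypothetical 400-game record with exactly 25% of each placement.
record = [1] * 100 + [2] * 100 + [3] * 100 + [4] * 100
lo, hi = bootstrap_interval(record)
print(round(lo, 2), round(hi, 2))
```

Because each resample is just another draw from a fixed spread, the resampled stable ranks inherit the same right skew as the sampling distributions shown earlier, which is the property the paragraph above objects to.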

There was also a post that used a Bayesian approach to estimate stable rank, and a corresponding website. The goal of my post is to analyze the inherent variance in mahjong (which comes from rolling a weighted 4 sided die many times). The inherent variance is hidden by the prior assumption that your stable rank is within a relatively small range. For example, the Bayesian approach uses a prior where the 95% confidence interval of Houou players' stable ranks is between 5.58 and 8.72. This prior rules out the possibility that there is a stable 10 dan low rolling, or a stable 4 dan high rolling, which is possible if we only think about the properties of rolling a weighted die.

The best possible approach that I can think of for this problem would be to use a version of parametric bootstrap, which would use a likelihood function to weigh varying 4 sided dice and their probability of generating the sampling distribution, and then sample from a weighted combination of those 4 sided dice. I believe this would improve upon my current approach by better handling the sampling distribution variance changing with stable rank, while still utilizing the information we have about how the sampling distributions are generated. With that said, I felt like this approach was overkill for this post, and the current results should be reasonably accurate.
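One possible reading of that idea, as a rough sketch only: weight a handful of candidate dice by their multinomial likelihood of producing the observed placement counts, then simulate records from the weighted mix. The candidate spreads, the observed counts, and all function names below are hypothetical, and a real version would use a much finer grid of candidates.

```python
import math
import random

def log_likelihood(counts, spread):
    """Multinomial log-likelihood of the observed counts, up to a constant."""
    return sum(c * math.log(p) for c, p in zip(counts, spread))

def candidate_weights(counts, candidates):
    """Normalized likelihood weight for each candidate die."""
    logs = [log_likelihood(counts, s) for s in candidates]
    peak = max(logs)  # subtract the max before exponentiating for stability
    raw = [math.exp(value - peak) for value in logs]
    total = sum(raw)
    return [r / total for r in raw]

def sample_records(counts, candidates, weights, n_draws, rng):
    """Pick a die by weight, then simulate a record of the same length."""
    n = sum(counts)
    draws = []
    for _ in range(n_draws):
        spread = rng.choices(candidates, weights=weights)[0]
        games = rng.choices((1, 2, 3, 4), weights=spread, k=n)
        draws.append([games.count(p) for p in (1, 2, 3, 4)])
    return draws

counts = (110, 105, 100, 85)  # hypothetical 400-game record
candidates = [
    (0.25, 0.25, 0.25, 0.25),
    (0.27, 0.26, 0.25, 0.22),
    (0.30, 0.27, 0.24, 0.19),
]
weights = candidate_weights(counts, candidates)
print([round(w, 3) for w in weights])
simulated = sample_records(counts, candidates, weights, 5, random.Random(0))
print(simulated)
```

Each simulated record could then be converted to a stable rank to build an interval that accounts for several plausible true spreads instead of a single one.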
