Chess Rating (for Laymen)

Paper by Dr Vladica Andrejić, University of Belgrade, Faculty of Mathematics, Serbia

Introduction

In 1970 the World Chess Federation (FIDE) adopted the rating system devised by Arpad Elo, and it has functioned in much the same way ever since. The first rating lists included fewer than 600 players, and the rating floor was 2200 points, which corresponded to a decent understanding of chess, roughly at candidate master level.

In the meantime, a lot has changed, but the rating system has remained the same. Probably for commercial reasons, FIDE decided to assign a rating to every chess player and to lower the rating floor to 1200, a level at which one knows how the pieces move, but not much more.

With a pool that today exceeds 119,000 rated players and enormous rating differences among them, it is only logical that the original rating system can no longer hold up on its own. FIDE tried to make some corrections, but eventually it all boiled down to an inappropriate artificial increase of the opponent’s rating, introduced to avoid some apparent contradictions: a difference in rating of more than 400 points shall be counted for rating purposes as though it were a difference of 400 points.

Leaving the mathematics aside, in this paper I try to present my standpoint and suggestions that could remedy certain shortcomings. All the details and calculations that support the theoretical propositions below can be found in [7].

The nature of the chess rating

The main problem of a chess rating system is to determine the function f, which assigns a probability, i.e. an expected result, to a rating difference. Over time the original principle behind that calculation was lost and only the tables used by FIDE have remained. Arpad Elo is no longer with us, and only a few people can explain why the figures in these tables are what they are. That was the first problem I examined.

One possible hypothesis is that the nature of the rating system preserves established ratios. To make this easier to understand I will give an example, and those keen on mental gymnastics may consult the original text [7] at any time. Let’s assume that player 1 and player 2 have played a match of 10 games and that player 1 won by 6:4. Let’s also assume that player 2 was not discouraged by the score and played another match of 10 games against player 3, in which he was more successful and won by 8:2 (i.e. the ratio was 4:1).

If we want to preserve these ratios, then multiplying 6:4 by 4:1 gives 6:1, i.e. under our assumption the expected outcome of a match between player 1 and player 3 corresponds to the ratio 6:1 (meaning that over 10 games our best estimate of the result would be a win for player 1 by about 8.5:1.5).
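The arithmetic of this example can be sketched in a few lines. This is only an illustration of the ratio preservation idea; the function names `compose_ratios` and `expected_points` are made up for this sketch and do not come from [7].

```python
from fractions import Fraction


def compose_ratios(a_vs_b: Fraction, b_vs_c: Fraction) -> Fraction:
    """Score ratio of A against C if ratios are preserved (multiply them)."""
    return a_vs_b * b_vs_c


def expected_points(ratio: Fraction, games: int) -> float:
    """Points the first player is expected to score over `games` games."""
    return float(games * ratio / (ratio + 1))


# Player 1 beat player 2 by 6:4, player 2 beat player 3 by 8:2 (4:1).
ratio_1_vs_3 = compose_ratios(Fraction(6, 4), Fraction(4, 1))
print(ratio_1_vs_3)                                 # 6, i.e. the ratio 6:1
print(round(expected_points(ratio_1_vs_3, 10), 2))  # 8.57
```

The exact value is 60/7, or about 8.57 points out of 10, which the text rounds to 8.5:1.5.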

If we apply the ratio preservation hypothesis to our function f, we get (for details see [7]) a solution which, according to the FIDE Handbook, is a close approximation of the real (FIDE) values of f [4], although the Handbook also states that only the FIDE tables are to be used for calculations [3]. The tables were created from calculations based on different formulas, but the discrepancies between corresponding values are rather small and amount to about 1%.

Does the ratio preservation law from our hypothesis adequately reflect the real state of things? I first questioned it after reading the article by Sonas [5], in which he proposed a linear scale as an alternative. For that reason I carried out several experiments using data from my website Perpetual Check [6], where I have covered almost all significant tournaments (from classical time control through rapid to blitz) organized in Serbia and Montenegro since 2006.

The results of the experiments have convinced me beyond doubt that the ratio preservation hypothesis is a natural law that has to be inherent in any chess rating system. That is why in my calculations I prefer to use the direct formula from [4] (without rounding to two decimal places) instead of the tables from [3].
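The article does not spell the formula out, so the sketch below assumes the familiar logistic curve f(x) = 1 / (1 + 10^(-x/400)), which has the ratio preservation property and agrees with the constant 695 quoted later in the text. The helper names are illustrative only.

```python
def expected_score(diff: float) -> float:
    """Expected score for a player rated `diff` points above the opponent.

    Assumed logistic form; not quoted from the article or from [7].
    """
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def score_ratio(diff: float) -> float:
    """Ratio of expected score to the opponent's expected score."""
    p = expected_score(diff)
    return p / (1.0 - p)


# Ratio preservation: the ratio over a combined rating gap equals the
# product of the ratios over the two partial gaps.
gap_12, gap_23 = 70.0, 240.0
combined = score_ratio(gap_12 + gap_23)
product = score_ratio(gap_12) * score_ratio(gap_23)
print(abs(combined - product) < 1e-9)  # True
```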

A previously unrated player

We will not be able to go any further without formulas, but do not be discouraged by that! If we apply higher mathematics to the case of a small rating difference x between the players (namely |x| under 100), it turns out that the formula f(x) ~ 0.5 + x/695 works very well. This merely confirms a fact well known among chess players: each percent over 50% is worth 7 (or more precisely 6.95) rating points. To illustrate, we will use the match between player 1 and player 2, in which the score 6:4 corresponds to 60% and 40% respectively, i.e. to a difference of 10% measured from 50%.

This difference of 10% corresponds to 7 x 10 = 70 rating points, which means that in a match between players with such a rating difference the expected score should be in the ratio 6:4, and vice versa: if the score of a match is in this proportion, we can conclude that the most probable difference between the two players is about 70 rating points, regardless of whether they play at grandmaster or amateur level.

Since a player’s results as a rule deviate from expectations, it is necessary to establish how many rating points he gains or loses in such situations. Even the least experienced chess player is familiar with the method: the expected score, given as a percentage, is converted into points, and the difference between the actual score and this expected score is calculated. This difference is then multiplied by the so-called “development coefficient” to give the number of rating points gained or lost.
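The quality of the linear rule can be checked numerically. The sketch assumes the logistic form f(x) = 1 / (1 + 10^(-x/400)) for the exact curve (an assumption, not quoted from the article); its slope at zero is ln(10)/1600, whose reciprocal is about 694.9, which is where a constant close to 695 would come from under that assumption.

```python
import math


def expected_score(diff: float) -> float:
    """Assumed logistic expected score for a rating advantage `diff`."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def linear_approx(diff: float) -> float:
    """Rule of thumb from the text: about 6.95 rating points per percent."""
    return 0.5 + diff / 695.0


# Reciprocal slope of the assumed logistic curve at diff = 0.
print(round(1600.0 / math.log(10.0), 1))  # 694.9

for diff in (0, 35, 70, 100, 200):
    exact = expected_score(diff)
    approx = linear_approx(diff)
    print(diff, round(exact, 3), round(approx, 3))
```

For a 70-point difference both versions give very nearly 60%, matching the 6:4 example, while at 200 points the straight line already visibly overshoots the curve.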

FIDE currently uses three development coefficients for rating calculations: K=25 for a player new to the rating list until he has completed events with at least 30 games; K=10 for a player who at some point in his career has exceeded 2400 rating points; and K=15 for all other players [3].

Put simply, the changes of the development coefficient over a player’s career represent FIDE’s attempt to bring the player quickly, at the early stage (by assigning a greater coefficient), to a rating that approximately reflects his real strength, and to ensure that after this “point of stabilization” further changes are less turbulent. In the following paragraphs we will examine how well this works.

Some mathematics is needed here, but the important conclusion is this: when K stands for the player’s development coefficient and N for the ordinal number of a game, I managed to demonstrate ([7]) that KN ~ 695, and moreover that KN > 695. In simple terms, the development coefficient and the ordinal number of a game against rated opponents are in a simple relationship (their product is approximately 695), and therefore the influence of the last played game is the same as if it were one among a total of 695/K games played.
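The update rule and the three coefficients can be put together in a short sketch. The logistic expected score is an assumption (FIDE prescribes its tables), the ordering of the K rules is simplified, and the 2100 and 2030 ratings in the example are hypothetical.

```python
def expected_score(diff: float, cap: float = 400.0) -> float:
    """Assumed logistic expected score; the gap is clamped to +/- cap,
    mirroring the 400-point rule mentioned in the introduction."""
    diff = max(-cap, min(cap, diff))
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def fide_k(games_played: int, ever_over_2400: bool) -> int:
    """Development coefficient as described in the text (simplified)."""
    if games_played < 30:
        return 25
    return 10 if ever_over_2400 else 15


def rating_change(k: float, actual: float, expected: float) -> float:
    """Rating points gained (positive) or lost (negative)."""
    return k * (actual - expected)


# Hypothetical established player rated 2100 scoring 6/10 against
# opposition rated 2030, i.e. exactly the 6:4 ratio at a 70-point gap.
k = fide_k(games_played=200, ever_over_2400=False)
expected = 10 * expected_score(2100 - 2030)
print(k, round(expected, 2), round(rating_change(k, 6.0, expected), 2))
```

Because a 6:4 score is almost exactly what a 70-point advantage predicts, the resulting change is close to zero.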
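As a quick illustration of the relationship KN ~ 695, the snippet below computes the number of games that each of FIDE’s three coefficients implicitly assumes. The function name `equivalent_games` is made up for this sketch.

```python
def equivalent_games(k: float) -> float:
    """Number of games among which the last game is weighted, per KN ~ 695."""
    return 695.0 / k


for k in (25, 15, 10):
    print(k, round(equivalent_games(k), 1))
# 25 27.8
# 15 46.3
# 10 69.5
```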

Now, let’s turn to the previously unrated player. According to the FIDE Handbook, once he has played at least 9 games against rated players, his initial rating is calculated in one of two ways, depending on whether he scored up to or more than 50%. For results up to 50% the player is assigned a performance rating of sorts as his initial rating, and if he scores more than 50% he gets the average rating of all opponents plus 12.5 rating points for each half point scored over 50% [3].

Now let’s observe a previously unrated player who has been assigned K=25 after a 9-round tournament. As long as he has that coefficient (his first 30 games), his last game will be weighted as one of 695/K ~ 28 games, though the actual number of games played is much smaller than that. This may be easier to understand if the games and their respective development coefficients are presented in the following table:

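| N | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| K | 77 | 77 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 | 25 |
| 695/N | 87 | 77 | 70 | 63 | 58 | 53 | 50 | 46 | 43 | 41 | 39 | 37 | 35 |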

It becomes even clearer when we realize that the first nine games are valued as if they were played under the coefficient 77, only to be followed by a dramatic drop, by a factor of more than three, to 25. An even greater injustice is done to players who initially scored more than 50% and gained only 12.5 rating points for each half point over that, because they earned their bonuses according to the coefficient 25 even though those games called for a coefficient three times greater.

It is exceptionally important to choose well the first rated tournament in which a player is going to compete. For example, if you take part in a 9-round tournament against players whose average rating is 2000 and you win almost all the games, your rating will not be greater than 2100. On the other hand, a player who faces an average of 2500 and scores 25% (e.g. loses one game, draws the next, and so on) will end up with a rating of about 2300 and immediately become a FIDE master.

We should also point out that this does not harm those 2500 players, as their draws have no influence on their rating. This is where different kinds of abuse may emerge: for example, a draw (like the ones just mentioned) between the unrated player and a player rated 2500 can be part of a series of arranged draws; such draws can be the result of the team captain’s orders at a team championship where, cunningly enough, the unrated player is placed on the first board so as not to harm anyone else.

Consequently, we may conclude that FIDE’s solution for previously unrated players is not the best one, and this may have far-reaching consequences. If a player gets a rather unreasonable initial rating, it will take him a lot of time and games to reach the level that corresponds to his strength, while in the meantime he will positively or negatively influence the ratings of his opponents.

I will try to make it clearer: FIDE has tried (and mostly succeeded) to prevent the “catapulting” of new players onto the rating list with high ratings, but at too steep a price: almost all players (especially if they “trick themselves into” playing their first tournament against lower-rated opposition) appear on the rating list rated below their real playing strength, and then they need too much time to reach the level they actually deserve.

With the growing popularity of chess, the number of newcomers on the rating list keeps increasing, and as a rule these are young players who improve swiftly. The rising number of such players with irrationally low ratings increasingly undermines the reliability of the rating list, and each of their games further disrupts the objectivity of the system.

Because of that, it is completely clear to every objective observer that well-conceived changes should be introduced and implemented as soon as possible. By no means should a rating depend on a player’s ability to choose a tournament with strong enough opponents, or on his ability to estimate his own strength and choose accordingly.
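The two cases can be sketched as follows. The branch for scores above 50% follows the 12.5-points rule quoted from [3]; for scores up to 50% FIDE uses a table-based performance, which this sketch replaces with the logistic formula as an explicit assumption. The function name `initial_rating` is illustrative.

```python
import math


def initial_rating(avg_opponent: float, score: float, games: int) -> float:
    """Initial rating of a previously unrated player (illustrative sketch)."""
    if games < 9:
        raise ValueError("at least 9 games against rated players are needed")
    half_points_over = 2 * (score - games / 2)
    if half_points_over > 0:
        # More than 50%: 12.5 rating points per half point over 50%.
        return avg_opponent + 12.5 * half_points_over
    if score <= 0:
        raise ValueError("not defined for a zero score in this sketch")
    # Up to 50%: assumed logistic stand-in for FIDE's table value.
    p = score / games
    return avg_opponent + 400.0 * math.log10(p / (1.0 - p))


print(initial_rating(2000, 8, 9))           # 2087.5
print(initial_rating(2000, 9, 9))           # 2112.5
print(round(initial_rating(2500, 3, 12)))   # 2309
```

Eight wins out of nine against a 2000 average yield 2087.5, while a 25% score against a 2500 average yields a rating a little above 2300, in line with the contrast drawn in the text.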

What I would like to suggest is to give a player a sliding coefficient K = 695/N as soon as he gets his initial rating, where N represents the ordinal (chronological) number of a game that he has played against rated opponents. According to the previous results it is a good approximation, while according to KN > 695 it is also the minimum that should be set.

The sliding coefficient should be valid until it falls below the value K0 intended for players with a stable rating, i.e. as long as the number of played games is under 695/K0. Naturally, reaching a certain rating for the purpose of obtaining an international title should count only after that point, i.e. once the rating has become stable, so that results before that moment should not be taken into consideration.

A more radical solution is also possible: at the beginning of the event a so-called initial rating is assigned to every new player, who then enters the procedure described above, possibly with a reduced coefficient K for extremely small values of N (primarily N=1 and N=2), for example as in the following table:

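| N | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| K | 232 | 232 | 232 | 174 | 139 | 116 | 99 | 87 | 77 | 70 | 63 | 58 |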
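The table values can be reproduced with a small function. The cap of 232 is read off the table (it is the rounded value of 695/3), and K0 = 15 is only an assumed value for the stable coefficient; the function name `sliding_k` is illustrative.

```python
import math


def sliding_k(n: int, k0: int = 15, cap: int = 232) -> int:
    """Sliding coefficient for the n-th game against rated opponents.

    K = 695/n rounded to the nearest integer, limited above by `cap`
    for very small n and below by the stable coefficient `k0`.
    """
    if n < 1:
        raise ValueError("games are numbered from 1")
    k = math.floor(695 / n + 0.5)  # round half up
    return max(k0, min(cap, k))


print([sliding_k(n) for n in range(1, 13)])
# [232, 232, 232, 174, 139, 116, 99, 87, 77, 70, 63, 58]
```

With K0 = 15 the sliding coefficient reaches the stable value at game 46, since 695/46 is just over 15.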

In that case, because the ratings of players with a small number of games are unreliable, the opponent’s coefficient should also be modified (especially if he has a stable K0), so that he is not unfairly affected either. I believe that when the opponents (according to the procedure explained above) have coefficients K and K1, where K1 is less than K, the second one (K1) should be replaced with K1 x (K1/K).

Regarding the best value of K0, a number of experiments that I have carried out showed that in a well-conceived rating system the coefficient K should be as small as possible. In my opinion, the main problem is the badly conceived introduction of previously unrated players, and I think that values between 10 and 16 are quite fair.

However, I can hardly endorse the introduction of greater development coefficients for players with a stable rating, as it would completely destroy the very essence of the rating system. Additionally, I believe that there should be no discrimination based on whether a player has ever exceeded the 2400-point mark in his career and thus acquired a different coefficient.
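A minimal sketch of this adjustment, with an illustrative function name and hypothetical example values (a newcomer in his ninth game with K=77 against an established player with K1=15):

```python
def adjusted_opponent_k(k1: float, k: float) -> float:
    """Reduced coefficient for the player with the smaller coefficient k1
    when facing an opponent whose coefficient k is larger."""
    if k1 >= k:
        return k1  # no reduction unless the opponent's coefficient is larger
    return k1 * (k1 / k)


print(round(adjusted_opponent_k(15, 77), 2))  # 2.92
```

The established player’s rating therefore moves by less than a fifth of what it normally would in a game against a newcomer whose own rating is still unreliable.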

Rating performance

In recent years, a player’s rating performance has become an extremely important issue, as it determines grandmaster and international master norms as well as the medals awarded for individual performance at the Chess Olympiad. Nevertheless, the way this performance is calculated is quite controversial: it is the sum of the opponents’ average rating and a table value that depends on the player’s percentage score at the tournament.

For many years this way of calculation was acceptable, but with the lowering of the rating floor it now requires certain changes. I would kindly ask for your patience in the following paragraphs, as I hope that the accompanying examples will help to clear up any doubts.

It is a common occurrence at many tournaments that certain opponents have ratings so low that beating them actually decreases one’s performance; hence, for the purposes of norm calculation, the rating of the lowest-rated opponent is raised so that the performance can potentially be improved. A recent example involving a grandmaster norm is IM Miša Pap’s performance at the Paleochora Open 2010.

In the first three rounds he won all his games (against lower-rated opponents), and in the following six rounds he scored 3.5 points against opposition with an average rating of 2576. The result from the last six rounds is equivalent to a FIDE performance of 2633, but with his three initial wins included the performance (with the additional rating increase of the weakest opponent) fell to exactly 2600 (without the additional rating increase it would be 2536). Luckily enough, last year the requirement for the grandmaster norm was lowered from 2601 to 2600, and thus Pap managed to get it.

I would suggest defining the performance as the rating for which the actual result is at the same time the expected result. It is in fact the rating that would remain unchanged after the rating calculation for the tournament in question. It can easily be calculated iteratively, and the good thing is that the iterative calculation does not depend on the (small) development coefficient used in it.

For these calculations I would not use the rounded values from the table, but the real values of f (i.e. according to the theoretical considerations elaborated earlier), and most certainly without any rating limits such as the 400-point limit that FIDE artificially imposes today.

In any case, winning a game then certainly cannot do any harm (as it actually did in the previous paradoxical example), and some other things are improved as well. For example, let’s assume that a player scores 1.5 points in two games against opposition with an average rating of 2400. FIDE’s calculation of such a performance would be 2591 (or 2593 according to the tables); my calculation gives the same value, but only if both opponents are rated 2400. However, rating is not linear, so if the opponents are rated 2300 and 2500 instead, the performance is 2605, and if they are rated 2200 and 2600 the performance is 2648!

An even better illustration of the illogical FIDE calculation comes from a similar situation: a score of 1.5 points from two games against a given rating average. Suppose the stronger opponent’s rating is higher than the performance FIDE calculates, while the weaker opponent’s rating is such that the rating average stays the same (a good example is the abovementioned average of 2400 with opponents rated 2200 and 2600). If the 1.5 points come from a draw against the stronger opponent and a win against the weaker one, then even if we disregard the win over the weaker opponent, the performance logically cannot be lower than the rating of the stronger opponent; yet according to the FIDE calculation it is (2591 compared to 2600)! According to my calculation, Pap’s performance at the tournament mentioned above is 2648.
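The proposed definition can be tried out with a short sketch. It assumes the logistic form of f and finds the rating by bisection, which is one simple way to solve the equation and not necessarily the iteration used in [7]; the function names are illustrative. The average-based formula is included for comparison.

```python
import math


def expected_score(diff: float) -> float:
    """Assumed logistic expected score, with no 400-point limit."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def performance(opponents: list[float], score: float) -> float:
    """Rating at which the expected total equals the actual score."""
    if not 0 < score < len(opponents):
        raise ValueError("not defined for all wins or all losses")
    lo, hi = 0.0, 4000.0
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        # The expected total grows with the rating, so bisection applies.
        if sum(expected_score(mid - r) for r in opponents) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def average_based_performance(opponents: list[float], score: float) -> float:
    """Opponents' average plus the formula value for the percentage score."""
    p = score / len(opponents)
    return sum(opponents) / len(opponents) + 400.0 * math.log10(p / (1.0 - p))


for pair in ([2400, 2400], [2300, 2500], [2200, 2600]):
    print(pair, round(average_based_performance(pair, 1.5)),
          round(performance(pair, 1.5)))
```

The average-based value is the same for all three pairs, while the proposed performance grows as the opponents’ ratings spread apart; small differences from the figures quoted in the text can arise from rounding.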

There has recently been a discussion on the ChessBase website about rating performance [1,2], in which it was proposed to calculate the performance with an additional virtual draw against oneself. However, this concept breaches the fundamental principle that a rating performance does not depend on the rating of the player whose performance we are calculating. The aim is not to get more realistic numbers (closer to the players’ ratings), but to accept that in the case of all wins or all losses a rating performance should simply not be calculated.

As for Navara’s fantastic result of 8.5/9 at the Czech championship (FIDE performance 2963), in my opinion the appropriate performance is 2982. It simply means that if Navara had entered the tournament with this rating (2982) and played it the way he did, he would have neither gained nor lost any rating points.

I hope that with this paper I have managed to explain some of my considerations (and accompanying suggestions) to readers who do not want to pore over mathematical explanations of the problem (and I apologize if I did not always manage to strike the right balance).

The issue discussed here concerns all chess players who take part in competitions, and they should be given an opportunity to understand the rating system. Even more important is that the system be well conceived, so that it can meet the needs of contemporary chess tournaments and thereby contribute to the popularity of chess.

It goes without saying that this is only one of many possible solutions, and I look forward to receiving feedback both from readers with the relevant mathematical knowledge and from laymen. I will be truly happy if this is a step towards a better and fundamentally fair rating system.

“Chess Rating (for Laymen)” by Dr Vladica Andrejic in PDF format. The author is deeply grateful to IM Ivan Markovic who translated the original text into English.

Dr Vladica Andrejic is a chess player (highest Elo rating 2275), chess arbiter and owner of the Perpetual Check website. He holds a Ph.D. in Mathematics (Differential Geometry) and is a professor (docent) at the University of Belgrade, Faculty of Mathematics.

References

* [1] ChessBase, Navara with a 3241 performance at the Czech Championship, 2010.

* [2] ChessBase, Navara wins Czech Championship with 8.5/9 points, 2010.

* [3] FIDE Handbook, The working of the FIDE Rating System, 2010.

* [4] FIDE Handbook, Some comments on the Rating system, 2010.

* [5] Jeff Sonas, The Sonas Rating Formula – Better than Elo?, 2002.

* [6] Vladica Andrejic, Perpetual Check, 2010.

* [7] Vladica Andrejic, The Truth about Chess Rating, 2010.