You are playing a decent unit game. A solid pair arrives at your table and promptly bids and makes a cold minor suit slam. Alas there was no alternate slam in a major or notrump that they missed. Figure yourself for a 20% board. It could have been worse. From the Slam Statistics article, we know your expectation would be only 12% in a 299er event. But even a successful major suit slam made against you in your field, though easier to bid, leaves you with only a 25% expectation.
If only we had had a chance to shine on this deal, you think. Or failing that, maybe the boards we played against this pair could have been boring, yielding average results. And if not that, we could at least be playing in a stronger field, one where lots of other pairs seated in the opposite direction would also bid the slam, a field with real field protection.
If you play in strong fields where there are many results per board, perhaps open pairs at a regional, you may observe greater consistency in the results. For example, 46 out of 65 pairs bid the percentage 6♥ slam on the board featured in The Bridge Goddess Thumbs Her Nose, and all of them went down one.
But exactly how much does field protection increase with increasing field strength? Answering this will require mathematical definitions for both field protection and field strength. For the latter it will suffice to examine the average masterpoint holding of the field, preferably the geometric mean rather than the arithmetic mean. For any one player, masterpoints correlate poorly with skill, but averaged over a large field they are a quite reasonable metric.
Quantifying field protection is trickier. One approach would be to have a pro go through each hand, much as is done for Instant Matchpoint games, rank the most logical outcomes, and then create a metric based on weighting each result, e.g. 100% for the most logical outcome, 50% and 25% for the next most logical outcomes, etc. Then, for example, if these outcomes occurred four, three, and one time respectively, we could calculate the field protection as a weighted mean: (4 × 100% + 3 × 50% + 1 × 25%) / 8 ≈ 72%. We could then average this field protection metric over many boards played in multiple events, e.g. open, 749er (“Gold Rush”), and 299er pairs. But the logical outcomes on a board are subject to many factors, e.g. bidding system and level of bidding aggressiveness, and the assignment of weights isn’t obvious either. And now that human time is vastly more valuable than computer time, it would be ideal to find a purely computational basis on which to evaluate field protection.
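As a sanity check on the arithmetic, the weighted mean for this hypothetical metric is a one-liner. The weights and counts below are the illustrative values from the text, not real data:

```python
# Hypothetical weighted "field protection" metric for one board:
# weights rank the most logical outcomes, counts tally the 8 results.
weights = [1.00, 0.50, 0.25]  # most logical outcome first
counts = [4, 3, 1]            # times each ranked outcome occurred
protection = sum(w * c for w, c in zip(weights, counts)) / sum(counts)
print(f"{protection:.0%}")
```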
A diversity index
Let’s return to the idea that strong fields typically have more consistent results on a board. A search through the scientific literature reveals that most scientific disciplines frame this oppositely, in terms of diversity rather than consistency. For example, ecology researchers measure the species diversity in different localities or under different circumstances and seek to compare them in a way that yields a meaningful statistical confidence. There are numerous diversity indexes but the one that stands on the best mathematical footing is called the Shannon index or Shannon entropy, named for Claude Shannon, the father of information theory. It is defined as follows:

$$H = -\sum_i p_i \log_2 p_i$$
Here the $p_i$ are the proportions of each species. For us, each species corresponds to a distinct raw score on a board. Suppose a board with eight results has four +420 results and four -50 results. The proportion of each result is 0.5 = 50% and we can easily calculate the diversity as:

$$H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}$$
Suppose instead the board had six +420 results and two -50 results. Now the diversity computes as:

$$H = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) \approx 0.811 \text{ bits}$$
The resulting diversity is less than the previous value of 1 bit. This fits our intuitive notion that even though there are still two distinct results on the board, there is less diversity when one result is dominant.
When frequencies $f_i$ rather than proportions $p_i$ are our starting point, we can save ourselves the trouble of converting to proportions by recasting Shannon’s diversity index into this form:

$$H = \log_2 N - \frac{1}{N}\sum_i f_i \log_2 f_i$$
where N is the total number of observations, here the number of results on the board. For our second example, this formula gives:

$$H = \log_2 8 - \tfrac{1}{8}\,(6 \log_2 6 + 2 \log_2 2) \approx 0.811 \text{ bits}$$
The result is the same but this form produces cleaner program code and runs slightly faster.
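Both forms can be sketched in a few lines. The article's actual code is in Perl; this is a minimal Python equivalent with illustrative function names of my own:

```python
import math

def entropy_from_proportions(props):
    """Shannon entropy in bits from proportions that sum to 1."""
    return -sum(p * math.log2(p) for p in props if p > 0)

def entropy_from_frequencies(freqs):
    """Shannon entropy in bits directly from raw counts (the recast form),
    skipping the conversion to proportions."""
    n = sum(freqs)
    return math.log2(n) - sum(f * math.log2(f) for f in freqs if f > 0) / n

# Six +420s and two -50s give the same ~0.811 bits either way.
print(entropy_from_proportions([0.75, 0.25]))
print(entropy_from_frequencies([6, 2]))
```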
What does Shannon diversity actually measure?
In the previous section, Shannon diversity has been presented using log2, the logarithm to base 2, even though it is often presented with the natural log. The choice of base has no impact other than to scale the result by a constant; by choosing base 2, we obtain an answer in bits. The Shannon diversity is really a measure of entropy, a measure of how uncertain we are about what each observation is going to be. Suppose I showed you all the results on a completely flat board and then pulled a result at random and asked you to identify my pick through a game of Twenty Questions. For the flat board you wouldn’t even need to ask a single question because there isn’t any entropy in the results. For the case when the board is split evenly between +420 and -50 results, you would ask the single question, “Is the result +420?”, and my yes or no answer would reveal the result I pulled at random.
Our second scenario above, with the unequal breakdown of +420 and -50 results, suggests that you could ask about 80% of a yes/no question to figure out one randomly pulled result. If we play this game once, you are going to have to ask one full question. But if we play this game many times, you can average less than one question. Let’s see how this works. Suppose there are two outcomes, A and B, which I tell you are 92% and 8% likely respectively each time. The Shannon entropy suggests you might get away with less than half a question per result:

$$H = -(0.92 \log_2 0.92 + 0.08 \log_2 0.08) \approx 0.4022 \text{ bits}$$
How might this work? Suppose we represent different sequences of A/B triplets by the following code:
| Triplet | Code | Probability |
|---------|-------|------------------------|
| AAA | 0 | 77.87% = 0.92³ |
| BAA | 100 | 6.77% = 0.92² × 0.08 |
| ABA | 101 | 6.77% = 0.92² × 0.08 |
| AAB | 110 | 6.77% = 0.92² × 0.08 |
| ABB | 11100 | 0.59% = 0.92 × 0.08² |
| BAB | 11101 | 0.59% = 0.92 × 0.08² |
| BBA | 11110 | 0.59% = 0.92 × 0.08² |
| BBB | 11111 | 0.05% = 0.08³ |
Then we can encode the following series of A/B observations as follows:
The questions for our game of Twenty Questions (really at most five) become more complicated. The first question is “Does B appear at all in the triplet?” If no (zero), we score big time by learning three observations with only one question. This will happen 0.92³ = 77.87% of the time. If yes (one), move on to the next question: “Is the triplet AAB, or does it contain B more than once?” If no (zero), our third question distinguishes between BAA and ABA. Otherwise our third question distinguishes between AAB and the remaining rare cases, the latter of which require two more questions to distinguish.
In the example above we determine 51 results (17 triplets) using only 27 yes/no questions (bits). We didn’t quite achieve half a bit per result in part because we were unlucky to have a rare ABB triplet that required five bits to encode in our short sequence. But in the long run we expect this code to require 1.479 bits / triplet or 0.493 bits / observation based on the probability of each triplet and the number of bits required to encode it.
Observe that 0.493 bits is not as good as the 0.4022 bits given by the Shannon entropy calculation above. To do better we would need a more sophisticated code, one that handled bigger chunks than triplets. But no matter how sophisticated our code we can never average better than the limit given by the Shannon entropy.
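The expected length of the triplet code, and the Shannon limit it is chasing, can be checked in a few lines. The code words and probabilities are exactly those from the table above:

```python
import math

# The triplet code from the table above: code word and probability.
p_a, p_b = 0.92, 0.08
code = {
    "AAA": ("0", p_a**3),
    "BAA": ("100", p_a**2 * p_b),
    "ABA": ("101", p_a**2 * p_b),
    "AAB": ("110", p_a**2 * p_b),
    "ABB": ("11100", p_a * p_b**2),
    "BAB": ("11101", p_a * p_b**2),
    "BBA": ("11110", p_a * p_b**2),
    "BBB": ("11111", p_b**3),
}

# Sanity check: no code word is a prefix of another, so the code is decodable.
words = [bits for bits, _ in code.values()]
assert not any(a != b and b.startswith(a) for a in words for b in words)

# Expected code length: sum of probability x code length over all triplets.
expected_bits = sum(prob * len(bits) for bits, prob in code.values())
print(expected_bits)      # ~1.479 bits per triplet
print(expected_bits / 3)  # ~0.493 bits per observation

# The Shannon entropy is the unbeatable lower bound per observation.
h = -(p_a * math.log2(p_a) + p_b * math.log2(p_b))
print(h)                  # ~0.4022 bits
```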
The code above is an example of a Huffman code, a well-studied topic in information theory and computer science. Huffman codes are often used for lossless compression, and even lossy compression often applies Huffman coding after the lossy steps have been performed; the widely used JPEG image format is an example of the latter.
Number of equivalent species / distinct results
Consider an ecologist who samples a region of the ocean and finds 1000 distinct species and based on sample frequencies computes the Shannon diversity index to be 8.37. After an oil spill, the ecologist samples the same region and finds 970 distinct species and computes the Shannon diversity index to be 7.64. The oil industry might conclude that only 3% of the species diversity has been lost. An inept ecologist might counter that 9% of the species diversity has been lost, calculating (8.37 − 7.64) / 8.37. The reality is much worse.
The Shannon diversity for N species present in equal amounts, i.e. a maximally diverse sample, is:

$$H = -\sum_{i=1}^{N} \frac{1}{N} \log_2 \frac{1}{N} = \log_2 N$$
This means that for any Shannon diversity H, we can compute the effective number of species as N = 2^H. Before the oil spill, the effective number of species was 2^8.37 ≈ 331. After the oil spill it is 2^7.64 ≈ 199. Species diversity has actually fallen by 40%, calculated as (331 − 199) / 331. The effective number of species is closely related to the Shannon equitability index, also called the Shannon evenness index, which normalizes H by its maximum possible value log2 N.
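The jump from entropies to effective species counts is a one-liner; these are the numbers from the oil-spill example:

```python
# Convert Shannon entropies (in bits) to effective numbers of species
# via 2**H, then compare on the linear scale where losses are honest.
h_before, h_after = 8.37, 7.64
n_before, n_after = 2 ** h_before, 2 ** h_after
print(round(n_before), round(n_after))           # ~331 and ~199 species
print(f"{(n_before - n_after) / n_before:.0%}")  # ~40% of diversity lost
```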
In a similar manner it is useful to talk about an effective number of distinct results on each board as if each distinct result had occurred an equal number of times.
First look at the bridge data / subsampling
Using a program based on the Payoff Matrix code, I am able to compute the effective number of distinct results on each board in an event by directly reading ACBLscore game files. The following plots are based on data from several D22 regionals where large open and Gold Rush (750 MP limit) events played the same boards in each session.
The histogram above shows the effective number of results per board for boards played in large D22 regional open pairs events where there were at least 52 results per board. The tallest bin spans 4.5–5.5 distinct results but there is a long tail such that the average number of effective distinct results is 7.24.
But already there is a problem. Not all boards are played the same number of times because the number of pairs varies from event to event. The movement matters too; for example, a web movement improves on the standard Mitchell movement by arranging for more pairs in each section to play each board. Directly comparing the effective number of results for boards played a different number of times is inaccurate. For example, imagine comparing the results above to those for the same boards played in a small club game where there are only five results per board. The highest possible effective number of distinct results (equivalent species) for the club game is five, occurring when there are five completely different results on the board. But it would be a mistake to compare the histograms directly and conclude that the playing style in regional open pairs events is generating more diverse results than the club game.
There doesn’t appear to be a closed-form mathematical solution to this problem. However, modern computers are so fast that it presents no difficulty at all. When we wish to compare a board with N results to a board with M results, where N > M, we can simply pretend that the bigger set of results has only M results by randomly choosing M of the N results, deriving the Shannon entropy from this subsample, and using that entropy to calculate an effective number of distinct results. In fact we can perform this subsampling repeatedly, say 100 times, and compute an average effective number of distinct results over all the subsamples. This has already been done to generate the histogram above.
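The subsampling procedure can be sketched as follows. The function names and the example board are mine for illustration, not taken from the actual Perl program:

```python
import math
import random
from collections import Counter

def effective_results(results):
    """Effective number of distinct results on a board: 2**H, with the
    Shannon entropy H computed in bits from the raw result frequencies."""
    n = len(results)
    h = math.log2(n) - sum(
        f * math.log2(f) for f in Counter(results).values()) / n
    return 2 ** h

def subsampled_effective_results(results, m, iterations=100, seed=0):
    """Average the effective number of distinct results over random
    subsamples of size m, so boards played different numbers of times
    can be compared on an equal footing."""
    rng = random.Random(seed)
    return sum(
        effective_results(rng.sample(results, m))
        for _ in range(iterations)) / iterations

# Hypothetical board with 52 results dominated by +420.
board = [420] * 30 + [-50] * 10 + [450] * 8 + [-100] * 4
print(effective_results(board))                   # full-sample estimate
print(subsampled_effective_results(board, m=16))  # comparable to a 16-result board
```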
Let’s see how the number of equivalent species varies as a function of the sample size, computed by subsampling for each of 4 to 48 results per board.
There is a fairly sharp rise until we reach ~20 results per board, such as might be realized at a healthy La Jolla unit game. At this point the majority of the boards have enough results. Only the boards with a high species equivalent slowly drive the curve higher as the number of results per board increases beyond ~20.
Small club games are clearly undersampled. This explains the feeling of playing a decent game that “didn’t matchpoint well.” Sometimes it really didn’t. Too many of your would-be 60% boards in a big field wound up being 30% boards in the small field.
Comparing regional open pairs vs. Gold Rush pairs
The popularity of Gold Rush events at regionals gives a convenient mechanism for comparing two fields, each with good statistics, such that the amount of subsampling is modest. D22 open pairs events run concurrently with Gold Rush events have a typical field strength of 3000–4000 MP (geometric mean is 2000–2500 MP). Gold Rush events are capped at 750 MP and have a typical field strength of 300 MP (geometric mean is ~250 MP). This should represent enough of a skill disparity to observe field protection if it is present in any significant amount.
Here is a scatter plot of the effective number of results per board for the open and Gold Rush events. Only boards which have at least 32 results in both events are included and the results are subsampled to 32 results per board using 100 iterations per board in each event.
The correlation is reassuring. As expected the lower left quadrant contains many non-competitive boards where there is a clear cut major suit or notrump game contract with slams present to a lesser extent. The upper right quadrant contains the more competitive boards.
The next plot shows a comparison of the histograms for the open and Gold Rush events, effectively a projection onto each axis of the data points in the previous plot.
The two distributions look similar. When there are few effective distinct results, the open field appears tighter. Glancing through the results on these boards, this appears to reflect more consistent skill at bidding and making routine games. But at the same time the tail is longer for the open field. My guess is that this results from a greater diversity of bidding systems, competitive bidding treatments and competitive bidding choices in the open field. The means for the open and Gold Rush fields are 6.294 and 6.619, suggesting the stronger field is just a bit less diverse, i.e. tighter, and therefore offers a bit more field protection.
It is possible that these two distributions are not really different, reflecting instead only random variation in results pulled from the same distribution. There are many statistical hypothesis tests for whether two sets of data are significantly different from each other; I’m not a good enough statistician to argue over them. For our purpose the Student’s t-test should suffice. This test indicates how likely it is that we would see a difference in means this large if the two data sets came from distributions with equal means, the so-called null hypothesis of the test. For the data above, the test yields a p-value of 11%, and hence roughly 89% confidence that we are observing a degree of increased field protection in the stronger open pairs.
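The usual tool here would be something like scipy.stats.ttest_ind. A dependency-free sketch using a Welch-style t statistic with a normal approximation to the two-sided p-value (adequate when each event contributes hundreds of boards, rough for small samples) looks like this; the per-board data is invented for illustration:

```python
from statistics import NormalDist, mean, variance

def approx_p_value(xs, ys):
    """Welch-style two-sample t statistic with a normal approximation to
    the two-sided p-value; fine for large samples, rough for small ones."""
    t = (mean(xs) - mean(ys)) / (
        variance(xs) / len(xs) + variance(ys) / len(ys)) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

# Invented per-board effective-result counts for two fields.
open_boards = [5.8, 6.1, 7.0, 5.2, 6.8, 6.4, 5.9, 6.6]
gold_boards = [6.3, 6.9, 7.2, 5.9, 7.1, 6.6, 6.4, 7.0]
print(f"p = {approx_p_value(open_boards, gold_boards):.3f}")
```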
An 89% confidence is fairly good if one can assume it is arrived at by honest means rather than fudged to barely meet FDA acceptance on a multi-billion dollar drug trial. Still our result falls short of the 95% confidence level that is arbitrary but commonly accepted as the minimum threshold for scientific publication. However, the result holds up for other field comparisons. For Gold Rush vs. 299er pairs events at the same D22 regionals, we can be 99.5% confident that the Gold Rush event offers greater field protection. For the Charlotte Bridge Studio open vs. 299er pairs events we can be extremely confident.
The bottom line is that we can be quite confident that field protection, measured as a decrease in the diversity of results, increases as a function of field strength. However, the degree of the increased field protection is very modest in terms of the decrease in the average number of distinct results per board. Meaningful matchpointing is far more a function of the number of results per board than the strength of the field.
Alpha, beta, and gamma entropy
Even if the results from a stronger field are not substantially less diverse, it might be argued that the results are noticeably different. For example, suppose all pairs in a strong field bid a 6♠ slam and half make it on the nose (+980) while half go down a trick (-50). Meanwhile in the weak field, all the pairs stop in game, with half making one overtrick (+450) and half making two overtricks (+480). The diversity in each field is 1 bit (two equally common results) and yet the results are entirely dissimilar. One way to quantify this is to examine the diversity of the combined results and compare that to the average of the individual field diversities. To keep our example simple, suppose the two fields have an equal number of pairs and use the same movement, such that each field has an equal number of results on each board. The combined set of results has four equally common results (+980, -50, +450, and +480) and therefore two bits of entropy.
The combined-set entropy is called the gamma entropy and the weighted average of the individual field entropies is called the alpha entropy. The difference between the two is called the beta entropy, i.e. Hβ = Hγ - Hα. The total entropy Hγ is thus decomposed into intra-field entropy (Hα) and inter-field entropy (Hβ). It is possible to compute Hβ directly but it is more common to infer it after computing Hα and Hγ, which is the approach I took.
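The decomposition is easy to verify on the extreme example above. This sketch assumes two equally sized fields, so the weighted average of the alpha entropies reduces to a simple mean:

```python
import math

def entropy_bits(results):
    """Shannon entropy in bits of a list of raw scores."""
    n = len(results)
    counts = {}
    for r in results:
        counts[r] = counts.get(r, 0) + 1
    return math.log2(n) - sum(f * math.log2(f) for f in counts.values()) / n

def beta_entropy(field_a, field_b):
    """H_beta = H_gamma - H_alpha for two equally sized fields."""
    h_gamma = entropy_bits(field_a + field_b)
    h_alpha = (entropy_bits(field_a) + entropy_bits(field_b)) / 2
    return h_gamma - h_alpha

# The extreme example from the text: two dissimilar 1-bit fields.
strong = [980] * 4 + [-50] * 4  # slam bid everywhere, makes half the time
weak = [450] * 4 + [480] * 4    # game everywhere, overtricks vary
print(beta_entropy(strong, weak))  # 1 full bit: the fields share no results
```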
Ecologists often examine the species diversity at many different locations, sometimes in literal fields, usually termed plots in their literature. Here we will only examine two fields, open and Gold Rush, because it involves only modest subsampling to 32 results per board. Including the 299er field would require subsampling to 8 results per board. With only two fields, the beta entropy can never exceed 1 bit as occurs in the extreme case outlined above. The following is a histogram of Hβ for each board in the open and Gold Rush pairs comparison.
The beta entropy is quite small. Remember that the typical 4–8 effective distinct results per board corresponds to 2–3 bits of entropy, such that the alpha entropy is much larger than the beta entropy.
Field protection is a meaningful concept and it does increase with field strength. However, the effect is quite small. Players worried about field protection would do better to seek out larger fields than stronger fields.
Field protection seems to increase with the field strength primarily for pedestrian boards, non-competitive hands where a major suit or notrump game exists.
Many club games are undersampled. Matchpointing robustness sets in around 20 results per board.
My instinct from the start was to take an approach based on Shannon entropy but when faced with the practical details I was grateful to find Lou Jost’s discussions of Diversity and Similarity measures and the Effective number of species. I read the following papers:
Jost, L. (2006). Entropy and Diversity. Oikos 113:2.
Jost, L. (preprint, Ecology). Partitioning Diversity into Independent Alpha and Beta Components.
Marcon, E., et al. (2012). The Decomposition of Shannon's Entropy and a Confidence Interval for Beta Diversity. Oikos 121:516-522.
Get the data
Download a zip file containing Excel and tab-delimited text versions of the board data used for the D22 regional open and Gold Rush pairs comparison. The first column bnum is the board number, which has no significance for the analysis. The MP-0 and MP-750 columns are the effective number of distinct results on the board for the open and Gold Rush pairs events respectively. The gamma and beta columns are the gamma and beta diversities, i.e. 2^Hγ and 2^Hβ respectively; taking log2 of these values recovers Hγ and Hβ in bits. The R-0 and R-750 columns give the raw results on each board for the open and Gold Rush events respectively. Results are ordered by decreasing frequency, where the frequency of a result occurring more than once is indicated in parentheses after the result.
If you are interested in the Perl code, contact me. The code is kludgy and still has some dead code floating around. It does not handle all scenarios; for example the events to be compared must be in the same ACBLscore game file, which is usually, but not always, the case. The dataset would be larger if this issue were addressed. The program has no documentation beyond the command line help.