2017-07-17

On Reporting Bust Rates in Contests

The usual metric for comparing the bust rates of different stations is very simple: if a station makes $N$ QSOs, of which $B$ are detected as busts (usually by comparing the station's log with the logs submitted by other competitors), then the bust rate $R$ is given simply by:
$$R = B / N$$
Admirably simple though this is, it suffers from at least two defects that seem to me to be fatal.

1. Allowing for the number of checkable QSOs


Suppose that we have two stations, 1 and 2, and we wish to compare their bust rates. Suppose that both stations make 100 QSOs, and suppose further that they both bust 5 QSOs, (as determined by an inspection of the logs of other competitors); then the usual inference is that the bust rate for the two stations is the same.

But one cannot conclude that, because the usual equation does not take into account the fact that, of the stations worked in the 100 QSOs made by each of the two stations, very different numbers might have submitted their logs to the contest sponsor in the two cases.

Suppose, for example, that all 100 of the QSOs made by the first station can be cross-checked with submitted logs, but only 50 of the QSOs made by the second station can be so checked. The second station has then busted 5 of 50 checkable QSOs against the first station's 5 of 100, so it is likely not to be as good at copying calls as the first. A better measure of the bust rate is obtained by using $N_V$, the number of verifiable QSOs, in place of the raw number $N$:
$$R = B / N_V$$
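For the two hypothetical stations above, both with $B = 5$, the revised formula gives:
$$R_1 = 5 / 100 = 5\%, \qquad R_2 = 5 / 50 = 10\%$$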
We can see whether this affects the ordering of, for example, the stations with the most busts in the CQ WW contest.

If we look at the stations with the most busts in the 2016 CQ WW CW contest, using the normal formula for calculating the percentage of busts, we find:

2016 CW -- Most Busts
Position Call QSOs Busts % Busts
1 PV8ADI 2,057 217 10.5
2 TK0C 12,521 171 1.4
3 LU2WA 2,048 169 8.3
4 TM1A 6,255 156 2.5
5 HG5F 3,062 147 4.8
6 HK1NA 14,037 145 1.0
7 CN2R 12,351 137 1.1
8 LZ9W 10,613 136 1.3
9 PI4CC 7,736 119 1.5
10 NP2P 3,842 116 3.0

Using the revised formula, this becomes:

2016 CW -- Most Busts
Position Call Verified QSOs Busts % Busts
1 PV8ADI 1,666 217 13.0
2 TK0C 9,664 171 1.8
3 LU2WA 1,596 169 10.6
4 TM1A 5,307 156 2.9
5 HG5F 2,638 147 5.6
6 HK1NA 10,108 145 1.4
7 CN2R 9,544 137 1.4
8 LZ9W 8,946 136 1.5
9 PI4CC 6,706 119 1.8
10 NP2P 3,162 116 3.7
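The two tables can be rechecked with a few lines of Python. This is only a sketch: the figures are copied by hand from the tables above, and the names `STATIONS` and `bust_rates` are mine, not part of any log-checking software.

```python
# (call, total QSOs N, verified QSOs N_V, busts B), copied from the tables above.
STATIONS = [
    ("PV8ADI", 2057, 1666, 217),
    ("TK0C", 12521, 9664, 171),
    ("LU2WA", 2048, 1596, 169),
    ("TM1A", 6255, 5307, 156),
    ("HG5F", 3062, 2638, 147),
    ("HK1NA", 14037, 10108, 145),
    ("CN2R", 12351, 9544, 137),
    ("LZ9W", 10613, 8946, 136),
    ("PI4CC", 7736, 6706, 119),
    ("NP2P", 3842, 3162, 116),
]


def bust_rates(stations):
    """Return {call: (B / N, B / N_V)} with both rates as percentages."""
    return {call: (100 * b / n, 100 * b / n_v) for call, n, n_v, b in stations}


RATES = bust_rates(STATIONS)
for call, (raw, verified) in RATES.items():
    print(f"{call:8s} raw {raw:5.1f}%   verified {verified:5.1f}%")
```

Because $N_V$ can never exceed $N$, the verified rate is never lower than the raw rate; how much it rises depends on what fraction of a station's QSOs could be checked.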

2. Allowing for sampling distortions


The second issue with the normal calculation is more subtle, and is perhaps most easily seen with an example. In order to make the problem clearer, we will use an extreme example, but it should readily be seen that the problem exists to a lesser degree when the situation is less extreme.

Suppose we have two stations, the first of which makes 100 verified QSOs and busts no calls. Obviously, his verified bust rate is 0%. Now suppose that the second station makes 1,000 verified QSOs and also busts no calls. The problem now should be obvious: both stations have the same bust rate, and yet it is clear that the second operator is almost certainly better at copying than the first. We need a way to deal with this kind of situation.
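To put a rough number on this, assume (purely for illustration) that both operators have the same underlying probability of 1% of busting any given call, and that QSOs are independent. The chance of a completely clean log is then:
$$(1 - 0.01)^{100} \approx 0.37, \qquad (1 - 0.01)^{1000} \approx 4.3 \times 10^{-5}$$
So a bust-free log of 100 QSOs is quite compatible with a 1% error rate, whereas a bust-free log of 1,000 QSOs is not.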

The solution is to realise what we are trying to calculate, and how it relates to the data available to us. The real goal of calculating the bust percentage is to produce a measurement (or, rather, an estimate) of the rate at which an operator makes mistakes in copying calls.

So we make the simplifying assumption that each operator has a fixed probability $p$ of busting a single call. If you think about it, this is not quite as silly as it might seem: although the value of $p$ will doubtless vary under different reception conditions even for a single operator, all those variations can be taken into account by treating $p$ as a kind of mean value that reflects the conditions prevailing at the operator's location. This adjustment will vary from location to location, but that does not really matter, since we aren't trying to estimate some kind of ideal bust rate for a perfect location, but, rather, the bust rate that actually prevails for each operator at that operator's location. Now we can easily analyse how to compare the accuracy of competing operators.

Suppose that an operator has a probability $p_B$ of busting a call (and, hence, a probability $q = (1 - p_B)$ of copying a call correctly). Now if the operator makes $N_V$ verified QSOs, then the probability of there being a particular number of busts $B$ is given by the binomial distribution:

$$ {N_V \choose B} (p_B)^B (q^{(N_V - B)}) $$

where:

$$ {N_V \choose B} = { {N_V}! \over {B!(N_V-B)!} }$$
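The binomial probability is easy to evaluate numerically, provided one works with logarithms: for logs with thousands of QSOs the binomial coefficient is far too large for a floating-point number. A minimal sketch, in which the function name `binom_pmf` and the example operator (bust probability 0.05, 100 verified QSOs) are my own assumptions:

```python
import math


def binom_pmf(b, n_v, p_b):
    """Probability of exactly b busts in n_v verified QSOs, for 0 < p_b < 1."""
    # log of (n_v choose b), via the log-gamma function to avoid huge factorials
    log_choose = math.lgamma(n_v + 1) - math.lgamma(b + 1) - math.lgamma(n_v - b + 1)
    log_prob = log_choose + b * math.log(p_b) + (n_v - b) * math.log1p(-p_b)
    return math.exp(log_prob)


# Hypothetical operator: bust probability 0.05, 100 verified QSOs.
PROBS = [binom_pmf(b, 100, 0.05) for b in range(101)]
print(f"P(exactly 5 busts) = {PROBS[5]:.3f}")
print(f"P(no busts at all) = {PROBS[0]:.4f}")
```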

Call the actual number of busts $B_V$; then we have:

$$ \mathrm{prob}(B_V) = {N_V \choose B_V} (p_B)^{B_V} (q^{(N_V - B_V)}) $$

Now, since we know that $B_V$ busts were measured out of a total of $N_V$ QSOs, we can determine what the distribution of $p_B$ looks like.

(From this point on, we will drop the ${}_V$ suffix, since we will take it as read that we are discussing verified numbers.)

Looking at the table above, and plotting the relative probability of obtaining the actual number of busts as a function of $p$ for the ten stations listed, we find:



We normalise each curve, so that the area under each is the same:



A couple of things are apparent from this plot:
  1. We can see that there is a non-negligible overlap between the curves for PV8ADI and LU2WA. This tells us that, despite the apparent substantial difference in the measured error rates in the logs of the two stations, in fact there is a not-completely-negligible probability that the base error rate for LU2WA is actually higher than that for PV8ADI. (We could calculate the actual probability, but at this point I just want to point out that it's obviously of the order of a few percent.)
  2. Although the original rates for PI4CC and TK0C are essentially identical, (and, hence, the peaks of the two curves occur at the same value of $p$) the curve for PI4CC is slightly broader than the curve for TK0C. This is a reflection of the fact that TK0C had more QSOs, and corresponds to the observation above about the two hypothetical stations that had no errors but who logged different numbers of QSOs.
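Both observations can be checked numerically. When the curve $p^B (1-p)^{N-B}$ is normalised to unit area it becomes a Beta density with parameters $B + 1$ and $N - B + 1$, so one simple approach is to draw random values of $p$ from that density. The sketch below does this with Python's standard `random.betavariate`; the sample size, the seed and the variable names are arbitrary choices of mine, and the result is a Monte Carlo estimate rather than an exact figure.

```python
import random
import statistics


def sample_rates(busts, verified, count, rng):
    """Draw values of p from the normalised curve, a Beta(B + 1, N - B + 1) density."""
    return [rng.betavariate(busts + 1, verified - busts + 1) for _ in range(count)]


rng = random.Random(2016)  # fixed seed so that the sketch is repeatable
COUNT = 50_000

# Point 1: how often is the underlying rate for LU2WA higher than that for PV8ADI?
PV8ADI = sample_rates(217, 1666, COUNT, rng)
LU2WA = sample_rates(169, 1596, COUNT, rng)
P_LU2WA_WORSE = sum(lu > pv for lu, pv in zip(LU2WA, PV8ADI)) / COUNT

# Point 2: the width (standard deviation) of the curves for TK0C and PI4CC.
WIDTH_TK0C = statistics.pstdev(sample_rates(171, 9664, COUNT, rng))
WIDTH_PI4CC = statistics.pstdev(sample_rates(119, 6706, COUNT, rng))

print(f"P(rate for LU2WA > rate for PV8ADI) is roughly {P_LU2WA_WORSE:.3f}")
print(f"width for TK0C {WIDTH_TK0C:.5f}, for PI4CC {WIDTH_PI4CC:.5f}")
```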
Let us take a brief diversion prior to taking the next step.

Suppose that we somehow knew that a station $S$ had a probability of 0.1 of busting each QSO. And let us suppose that $S$ makes a total of 1,000 QSOs. Then it is easy to plot the probability distribution of the number of QSOs that $S$ would actually bust over the course of the contest (this is just the binomial distribution given above):
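The curve for this hypothetical station can be tabulated directly from the binomial formula. A minimal sketch (the helper name `binom_pmf` is mine; logarithms are used so that the large binomial coefficients do not overflow):

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k busts in n QSOs when each is busted with probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + k * math.log(p) + (n - k) * math.log1p(-p))


N, P = 1000, 0.1
CURVE = [binom_pmf(k, N, P) for k in range(N + 1)]

# Summary of the curve: where it is centred and how wide it is.
MEAN = sum(k * w for k, w in enumerate(CURVE))
SD = math.sqrt(sum((k - MEAN) ** 2 * w for k, w in enumerate(CURVE)))
print(f"mean number of busts {MEAN:.1f}, standard deviation {SD:.2f}")
```

The standard deviation is $\sqrt{N p (1 - p)} = \sqrt{90}$, so even with the probability known exactly, the number of busts in a single contest would routinely differ from 100 by ten or so.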


Now let us change the situation slightly. Suppose that we know that the probability of $S$ busting a QSO is either 0.09 or 0.11, but we don't know which of these it is, and each is equally likely. We can see that the difference in the probability curve for the expected number of busts is significant (the black curve is for probability = 0.09, the red for probability = 0.11):



With this information to hand, what is our best guess for the actual probability curve -- where, by "best guess" we mean "minimising the error"? Now, we know that the actual curve will be either the black curve or the red one, but since we don't know which it will be, our best guess will be the mean of the two -- i.e., the green curve. (You might remember from secondary school statistics that this is called the "expectation" -- a name that can be a bit confusing, since we expect this curve to be wrong! Similarly, a statistician will tell you that the "expectation" or "expected value" when rolling a fair 6-sided die is 3.5.)

Now suppose that we know that the probability of a bust lies somewhere between 0.09 and 0.11, with a uniform distribution. Then we have:


where the white represents all the curves with a bust probability between 0.09 and 0.11, the black represents the expectation values taking into account just the two extreme curves, for bust probabilities of 0.09 and 0.11, and the green is the expectation curve for the entire range of probabilities between 0.09 and 0.11, uniformly distributed.
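Both expectation curves in this digression are simply averages of binomial curves. The sketch below computes them for 1,000 QSOs; the function names are mine, and the continuous uniform range is approximated (an assumption of the sketch) by 201 evenly spaced values of the bust probability.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k busts in n QSOs for bust probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + k * math.log(p) + (n - k) * math.log1p(-p))


def expectation_curve(n, probabilities):
    """Mean of the binomial curves for a list of equally likely bust probabilities."""
    curve = [0.0] * (n + 1)
    for p in probabilities:
        for k in range(n + 1):
            curve[k] += binom_pmf(k, n, p) / len(probabilities)
    return curve


def mean_and_variance(curve):
    mean = sum(k * w for k, w in enumerate(curve))
    return mean, sum((k - mean) ** 2 * w for k, w in enumerate(curve))


N = 1000
SINGLE = expectation_curve(N, [0.10])                # probability known exactly
TWO_POINT = expectation_curve(N, [0.09, 0.11])       # either 0.09 or 0.11
UNIFORM = expectation_curve(N, [0.09 + 0.0001 * i for i in range(201)])

for name, curve in (("single", SINGLE), ("two-point", TWO_POINT), ("uniform", UNIFORM)):
    mean, variance = mean_and_variance(curve)
    print(f"{name:9s} mean {mean:6.1f}  variance {variance:6.1f}")
```

All three curves are centred on the same number of busts, but any uncertainty in the bust probability makes the expectation curve wider than the curve for a single known probability.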

This is almost the situation that pertains in the case we are examining, with the exception that the curves we saw before we started this digression show that the values of $p$ are not distributed uniformly over a range. (To a good approximation, they are Gaussian, at least for the higher values of the ratio $B / N$, although we won't take advantage of that approximation.)

So, if we take the example of PV8ADI, we can create a plot that shows the relative probability of obtaining a particular number of busts, given the distribution of probabilities $p$ that lead to the observed number of busts, $B$, in a total of $N$ QSOs. (If you think about it, you might be able to see that what we are doing here is to account for a second order perturbation on the first-order results. The size of this perturbation increases as the ratio $B/N$ decreases, as does the asymmetry of the first-order curve. The general effect will be to smear the first-order results to become more spread out.)

For PV8ADI, we find:

where the black line is the basic binomial distribution for PV8ADI, with probability 217 / 1666, normalised to the number of busts for 1,000 QSOs. The green line is a similarly normalised line that takes into account all the binomial distributions for the various values of $p$, weighted by the probability of each value of $p$. As predicted, we see that the green line is more spread-out than the original black line.
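The weighting described here can be sketched numerically: evaluate the relative probability of each value of $p$ on a grid, scale the weights to sum to one, and add up the correspondingly weighted binomial curves for 1,000 QSOs. The grid (301 points spanning six standard deviations either side of $B/N$) and all the function names are assumptions of the sketch, and the grid is only sensible for a station with plenty of busts, such as PV8ADI.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k busts in n QSOs for bust probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + k * math.log(p) + (n - k) * math.log1p(-p))


def weights_for_p(busts, verified, ps):
    """Relative probability of each p, given the observed log, scaled to sum to 1."""
    logs = [busts * math.log(p) + (verified - busts) * math.log1p(-p) for p in ps]
    peak = max(logs)
    raw = [math.exp(v - peak) for v in logs]
    total = sum(raw)
    return [w / total for w in raw]


def smeared_curve(busts, verified, m=1000, points=301):
    """Binomial curves for m QSOs, averaged over the plausible values of p."""
    centre = busts / verified
    spread = math.sqrt(centre * (1 - centre) / verified)
    ps = [centre + spread * (-6 + 12 * i / (points - 1)) for i in range(points)]
    curve = [0.0] * (m + 1)
    for p, w in zip(ps, weights_for_p(busts, verified, ps)):
        for k in range(m + 1):
            curve[k] += w * binom_pmf(k, m, p)
    return curve


def variance(curve):
    mean = sum(k * w for k, w in enumerate(curve))
    return sum((k - mean) ** 2 * w for k, w in enumerate(curve))


BLACK = [binom_pmf(k, 1000, 217 / 1666) for k in range(1001)]  # basic binomial
GREEN = smeared_curve(217, 1666)                               # weighted over p
print(f"variance: basic {variance(BLACK):.1f}, smeared {variance(GREEN):.1f}")
```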

We can subject each of the stations to the same treatment, normalising all results to 1,000 QSOs:

So what can we deduce from all this?

To start with, rather than simply quoting some kind of "bust rate", a range should be quoted for each station, representing, say, the 99% confidence limits for the rate.

If we do that, and reorder the stations in order of decreasing upper limit (which seems like the most reasonable ordering: it means that we are 99.5% sure that the actual bust rate is less than this number), then we find:

2016 CW -- 99% confidence limits for $p_B$
Position Call lower limit upper limit
1 PV8ADI 0.110 0.153
2 LU2WA 0.088 0.127
3 HG5F 0.045 0.068
4 NP2P 0.029 0.046
5 TM1A 0.024 0.036
6 PI4CC 0.014 0.022
7 TK0C 0.015 0.022
8 LZ9W 0.012 0.019
9 HK1NA 0.012 0.018
10 CN2R 0.012 0.018
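The table does not spell out how the limits were computed. One plausible reading, sketched below, is to take the normalised curve for $p$ and find the two values that leave 0.5% of the area in each tail. The grid-based approach and the function names are my own assumptions, so expect agreement with the table to within rounding rather than to the last digit.

```python
import math


def normalised_weights(busts, verified, points=4001, half_width=8.0):
    """Grid of p values with weights proportional to p^B (1 - p)^(N - B), summing to 1."""
    centre = (busts + 1) / (verified + 2)
    spread = math.sqrt(centre * (1 - centre) / (verified + 3))
    low = max(centre - half_width * spread, 1e-9)
    high = min(centre + half_width * spread, 1 - 1e-9)
    ps = [low + (high - low) * i / (points - 1) for i in range(points)]
    logs = [busts * math.log(p) + (verified - busts) * math.log1p(-p) for p in ps]
    peak = max(logs)
    raw = [math.exp(v - peak) for v in logs]
    total = sum(raw)
    return ps, [w / total for w in raw]


def limits(busts, verified, level=0.99):
    """Equal-tailed limits: (1 - level) / 2 of the weight lies beyond each end."""
    ps, ws = normalised_weights(busts, verified)
    tail = (1 - level) / 2
    running, lower = 0.0, None
    for p, w in zip(ps, ws):
        running += w
        if lower is None and running >= tail:
            lower = p
        if running >= 1 - tail:
            return lower, p
    return lower, ps[-1]


PV8ADI_LIMITS = limits(217, 1666)
HG5F_LIMITS = limits(147, 2638)
print("PV8ADI: %.3f to %.3f" % PV8ADI_LIMITS)
print("HG5F:   %.3f to %.3f" % HG5F_LIMITS)
```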

We can also perform more sophisticated comparisons between or among stations. For example, we might ask the question: what is the probability that LU2WA will have more busts than PV8ADI if both make 1,000 QSOs? [FYI, the answer is a little under 9%]
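A comparison of this kind can be sketched as follows. When the binomial curves are weighted by the normalised curve for $p$, the result has a closed form (the beta-binomial distribution), which saves summing over a grid. The sketch assumes that the weighting is exactly the normalised likelihood, with no other prior information about $p$; the function names are mine.

```python
import math


def log_beta(x, y):
    """Logarithm of the Beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def predictive_pmf(k, m, busts, verified):
    """Probability of k busts in m future QSOs, given busts out of verified so far."""
    a, b = busts + 1, verified - busts + 1
    log_choose = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
    return math.exp(log_choose + log_beta(k + a, m - k + b) - log_beta(a, b))


def prob_more_busts(first, second, m=1000):
    """Probability that `first` logs strictly more busts than `second` in m QSOs each."""
    pmf_first = [predictive_pmf(k, m, *first) for k in range(m + 1)]
    pmf_second = [predictive_pmf(k, m, *second) for k in range(m + 1)]
    total = 0.0
    below = 0.0  # running probability that `second` logs fewer than k busts
    for k in range(m + 1):
        total += pmf_first[k] * below
        below += pmf_second[k]
    return total


# (busts, verified QSOs) from the table of verified QSOs above
P_LU2WA_MORE = prob_more_busts((169, 1596), (217, 1666))
print(f"P(LU2WA logs more busts than PV8ADI in 1,000 QSOs) = {P_LU2WA_MORE:.3f}")
```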

The important point here is that the single number that is usually quoted for a station's bust rate leaves much to be desired, and, in particular, may not be useful if one intends to use it to make comparisons to other stations' bust rates, unless both stations have a similar number of (verified) QSOs. A table, graph or chart is generally a much more useful guide when comparing bust rates between or amongst stations.

