## 2017-04-14

### A Grid-Based Scatter Metric for the RBN

I recently posted about an attempt to find a reasonable measure of the scatter of the stations that post reports to the Reverse Beacon Network (RBN). In that post, I looked at a metric based on the direct measurement of distances between posting stations. In this post, I look at a quite different approach based simply on counting the occupancy of cells defined over the 2-sphere that closely models the surface of the Earth.

If we look at two geographic representations of the RBN as it was in 2009 and then again in 2016...

...what strikes me most is not anything to do (directly) with the distribution of distances between points, nor is it to do with the colours of the points, but rather the obvious facts that: (i) there are a couple of areas where relatively many points are clustered; (ii) the vast majority of the area has no points at all; (iii) some spreading and filling-in has occurred over the time-span covered by the two figures.

That being so, it seems plausible that a better metric of the spreading of the RBN could be obtained by dividing the terrestrial 2-sphere into cells and counting the cells that are occupied (regardless of the number of posting stations within, or the number of posts emanating from, each cell). Such a metric has the additional merit of being relatively simple to calculate compared to the approach that derived a metric from the distances between posting stations.

#### Defining the Cells

Ideally, one would define the cells so that they would have two properties: (i) they would all be the same size; and (ii) they would all be as compact as possible.

The second desideratum may not be obvious: however, consider the case in which all the cells are defined such that each cell comprises all the longitudes between pairs of latitude lines (while defining the latitude lines in such a way that the area of each cell is the same, of course). One could easily divide the 2-sphere in such a manner, but this would mean that, for example, the cells for all the latitudes that cover Europe would appear as being populated, which would make a mockery of any attempt to interpret the number of populated cells as any kind of metric for the distribution of RBN posting stations.

To meet both desired criteria simultaneously essentially requires that we solve our friend from last time, the Tammes problem. Indeed, unless we allow non-compact cells, the problem of covering the surface of the Earth with cells of equal size results in a non-trivial distribution of cells (see, for example, the HEALPix projection and the documentation for the HEALPix code).

To make reasonable progress, we take a slightly different approach and relax our criteria, so that the cells are merely both reasonably similar in size and reasonably compact (after all, what we really want is a system that allows us quickly to decide in which cell a given RBN station is located, with the condition that reasonably separated RBN stations will be placed in different cells, rather than a system that meets any precise mathematical requirement).

The simplest approach seems to be (assuming latitudinal symmetry) to move from a pole towards the equator in bands defined by parallels of latitude, and to choose the (integral) number of cells around each band so that the size of the cells varies by only a relatively small amount.

For simplicity, we will consider just one half of the Earth's surface (the southern hemisphere); the pertinent results for the entire terrestrial 2-sphere will follow immediately by symmetry.

It's easy to show that the area $_1A_2$ of that portion of a sphere of radius $r$ between two co-latitudes $\lambda_1$ and $\lambda_2$ is:
$$_1A_2 = 2\pi r^2 (\cos \lambda_1 - \cos \lambda_2)$$
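
For completeness, the result follows in one line from the area element on a sphere. Here $\phi$ denotes longitude, a symbol introduced only for this step:

$$_1A_2 = \int_0^{2\pi} \int_{\lambda_1}^{\lambda_2} r^2 \sin\lambda \, d\lambda \, d\phi = 2\pi r^2 (\cos \lambda_1 - \cos \lambda_2)$$

Setting $\lambda_1 = 0°$ and $\lambda_2 = 90°$ gives $2\pi r^2$, the area of a hemisphere, so the fraction of a hemisphere that lies within any band is simply $\cos \lambda_1 - \cos \lambda_2$.
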
So, suppose that we divide the hemisphere into bands of latitude 10°, and wish to construct a grid with a total of 50 cells (i.e., 100 cells cover the entire 2-sphere) aligned along these bands. The resultant best-fit cells can be summarised as:

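Cells in 10° bands

| $\lambda_1$ (°) | $\lambda_2$ (°) | $_1A_2 / 2\pi r^2$ | $N_{100}$ |
|---|---|---|---|
| 0 | 10 | 0.0151922 | 1 |
| 10 | 20 | 0.0451151 | 2 |
| 20 | 30 | 0.0736672 | 4 |
| 30 | 40 | 0.0999810 | 5 |
| 40 | 50 | 0.1232568 | 6 |
| 50 | 60 | 0.1427876 | 7 |
| 60 | 70 | 0.1579799 | 8 |
| 70 | 80 | 0.1683720 | 8 |
| 80 | 90 | 0.1736482 | 9 |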

The corresponding table for 15° latitude bands is:

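Cells in 15° bands

| $\lambda_1$ (°) | $\lambda_2$ (°) | $_1A_2 / 2\pi r^2$ | $N_{100}$ |
|---|---|---|---|
| 0 | 15 | 0.0340742 | 2 |
| 15 | 30 | 0.0999004 | 5 |
| 30 | 45 | 0.1589186 | 8 |
| 45 | 60 | 0.2071068 | 10 |
| 60 | 75 | 0.2411810 | 12 |
| 75 | 90 | 0.2588190 | 13 |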

In these tables, we labelled the column that shows the number of cells in each latitude band with the subscript 100 rather than 50, because that is the number of cells over the entire 2-sphere.

Note that these are just two of many tables that could be created for various values of the widths of the latitude bands and the total number of cells desired. We can denote any particular table by using parameters, so that G(15, 100) denotes the second table above, defining a total of 100 grid-based cells over the 2-sphere, with 15° bands of latitude. G(15, 100) seems like a reasonable metric with which to begin.
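
As a rough sketch of how such a table could be generated, the following computes the fraction of a hemisphere in each band and converts it to a cell count. The function name `band_table` and the use of simple rounding to the nearest integer are my assumptions, not something specified above. Rounding happens to reproduce the counts in the 15° table, but in general it is not guaranteed to give counts that sum to exactly half the requested total.

```python
import math

def band_table(width_deg, total_cells=100):
    """Return (colat_start, colat_end, fraction, n_cells) rows for one hemisphere.

    fraction is the share of the hemisphere's area lying in the band,
    cos(colat_start) - cos(colat_end). n_cells is that share of the
    hemisphere's total_cells / 2 cells, rounded to the nearest integer
    (an assumed rule; the counts are not forced to sum to total_cells / 2).
    """
    if 90 % width_deg != 0:
        raise ValueError("band width must divide 90 degrees exactly")
    per_hemisphere = total_cells / 2
    rows = []
    for start in range(0, 90, width_deg):
        end = start + width_deg
        fraction = math.cos(math.radians(start)) - math.cos(math.radians(end))
        rows.append((start, end, fraction, round(fraction * per_hemisphere)))
    return rows

# Print the rows for 15-degree bands and 100 cells over the whole sphere.
for start, end, fraction, n in band_table(15):
    print(f"{start:2d} {end:2d} {fraction:.7f} {n:2d}")
```

Calling `band_table(10)` gives the analogous rows for 10° bands.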

The cells for any latitude band may, of course, start at any longitude, but it makes things easier if, for each latitude band, we place a boundary at the Greenwich meridian, and also require that cell boundaries lie exactly on a meridian that is an integral number of degrees.

This gives us the following table for the longitudes at which cell boundaries occur:

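Longitudes of Cell Boundaries (°)

| N Cells | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 360 | | | | | | | | | | | | |
| 2 | 180 | 360 | | | | | | | | | | | |
| 3 | 120 | 240 | 360 | | | | | | | | | | |
| 4 | 90 | 180 | 270 | 360 | | | | | | | | | |
| 5 | 72 | 144 | 216 | 288 | 360 | | | | | | | | |
| 6 | 60 | 120 | 180 | 240 | 300 | 360 | | | | | | | |
| 7 | 51 | 103 | 154 | 206 | 257 | 309 | 360 | | | | | | |
| 8 | 45 | 90 | 135 | 180 | 225 | 270 | 315 | 360 | | | | | |
| 9 | 40 | 80 | 120 | 160 | 200 | 240 | 280 | 320 | 360 | | | | |
| 10 | 36 | 72 | 108 | 144 | 180 | 216 | 252 | 288 | 324 | 360 | | | |
| 11 | 33 | 65 | 98 | 131 | 164 | 196 | 229 | 262 | 295 | 327 | 360 | | |
| 12 | 30 | 60 | 90 | 120 | 150 | 180 | 210 | 240 | 270 | 300 | 330 | 360 | |
| 13 | 28 | 55 | 83 | 111 | 138 | 166 | 194 | 222 | 249 | 277 | 304 | 332 | 360 |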
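
A minimal sketch of how boundaries of this kind might be computed is given below. The helper name `cell_boundaries` is hypothetical, and rounding each boundary to the nearest whole degree is my assumption about how the entries were obtained. Where a computed value differs from the table, the table should be taken as the definition.

```python
def cell_boundaries(n_cells):
    """Eastern boundary, in whole degrees, of each of n_cells cells in a band.

    The first cell starts at 0 (the Greenwich meridian) and the last one
    ends at 360. Interior boundaries are the equally spaced longitudes
    360 * k / n_cells, rounded half-up to the nearest degree (assumed rule).
    """
    if n_cells < 1:
        raise ValueError("a band needs at least one cell")
    return [int(360 * k / n_cells + 0.5) for k in range(1, n_cells + 1)]

# Boundaries for a band that is divided into 7 cells.
print(cell_boundaries(7))
```
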

With these constraints, G(15, 100) defines 100 cells as shown on this map:

These look like reasonable cells; therefore we will proceed with G(15, 100) as the basis for the grid-based scatter metric. The actual value of the scatter metric over some time-frame Δ will therefore be the number of cells in the above plot that contain at least one RBN poster within the time-frame.
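
To make the counting concrete, here is a sketch of how the metric could be evaluated from a list of station positions. The names `cell_of` and `scatter_metric`, the (hemisphere, band, index) labelling of cells, and the convention that a station lying exactly on a boundary belongs to the cell to its east are all assumptions made for illustration. The nearest-degree rounding of boundaries is likewise assumed.

```python
import bisect

# Cells per 15-degree band, from the pole to the equator, as in G(15, 100).
CELLS_PER_BAND = [2, 5, 8, 10, 12, 13]
BAND_WIDTH = 15

def cell_of(lat, lon):
    """Map a position in degrees (north and east positive) to a cell id.

    The id is (hemisphere, band, index): hemisphere is 'N' or 'S', band
    counts 15-degree bands away from the nearer pole, and index counts
    cells eastward from the Greenwich meridian.
    """
    hemisphere = "N" if lat >= 0 else "S"
    colat = 90.0 - abs(lat)  # co-latitude measured from the nearer pole
    band = min(int(colat // BAND_WIDTH), len(CELLS_PER_BAND) - 1)
    n = CELLS_PER_BAND[band]
    # Assumption: cell boundaries sit at the nearest whole degree.
    bounds = [int(360 * k / n + 0.5) for k in range(1, n + 1)]
    index = bisect.bisect_right(bounds, lon % 360.0)
    return hemisphere, band, min(index, n - 1)

def scatter_metric(positions):
    """Number of distinct cells occupied by the (lat, lon) positions."""
    return len({cell_of(lat, lon) for lat, lon in positions})
```

Because only the set of occupied cells matters, the number of stations in a cell and the number of posts from each station have no effect on the result.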

#### Digression - How Many Cells?

I have not addressed the issue of the number of cells we should use. There is no obvious criterion for choosing a particular number, although the number should be sufficiently large that it reflects the fact that large parts of the Earth's surface have no RBN posting stations. On the other hand, the number should not be so large that essentially every RBN node is in a distinct cell. The 100 cells on the above map seem a reasonable number to me. It also naturally leads to an index whose value, at least in theory, runs from 0 to 100.

#### Results

We can now easily compare the results of using G(15, 100) with those of the previously-defined scatter metric, which was calculated directly from the positions of the posting stations.

Simply plotting the values of the two metrics as a function of time, we obtain:

(Pearson correlation coefficient is 0.78; rate of increase is about 145 per year.)

(Pearson correlation coefficient is 0.91; rate of increase is about 2.2 per year.)

Plotting the values of the metrics as functions of the number of posting stations, we obtain:

(Pearson correlation coefficient is 0.75; rate of increase is about 4.8 per poster.)

(Pearson correlation coefficient is 0.94; rate of increase is about 0.078 per poster.)
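
For reference, the two numbers quoted for each plot can be obtained along the following lines. This is a sketch only: the function name `pearson_and_slope` is hypothetical, and I am assuming that the "rate of increase" is the slope of an ordinary least-squares straight line. The input values shown are purely illustrative and are not RBN data.

```python
import math

def pearson_and_slope(xs, ys):
    """Return (Pearson r, least-squares slope of y on x) for paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must contain some variation")
    return sxy / math.sqrt(sxx * syy), sxy / sxx

# Illustrative values only: x might be years, y a metric value.
r, slope = pearson_and_slope([0, 1, 2, 3], [1.0, 3.5, 4.5, 7.0])
print(r, slope)
```
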

From these plots, I conclude that there is likely no point in looking at the third category of possible scatter metric (described as "[m]etrics based indirectly on a distance metric" in the prior post on this subject). [By this phrase, I intended to indicate the use of more complicated measures of the distribution of the distance-based measurements, rather than just the mean separation: for example, looking at higher moments of the distribution, or seeing if anything useful could be gained by looking at the fractal dimensionality.] The G(15, 100) scatter metric is easy to calculate, it seems to correspond closely to the intuitive understanding of "scatter", and it appears to be at least as good a metric as a directly-calculated metric based on the mean separation of posting stations.

Thus, in future postings that require the use of a scatter metric, I will use the G(15, 100) metric unless there appears to be a good reason to do otherwise.