The complete set of RBN data for 2009 to the end of 2020, after uncompression, exceeds 100GB in size. As not all analyses need the complete dataset, I have constructed a summary file
(rbn-summary-data.xz) that contains an overview of the data and which is sufficient for many
kinds of analysis that do not depend on the details of individual posts
to the RBN. (The basic script used to generate this summary file may be found here;
the actual summary file is created by running this basic script for each
individual year from 2009 to 2020 and concatenating the results after
removing the header line from all except the first year.)
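The concatenation step described above can be sketched in a few lines of R. This is only an illustration of the idea, not the script actually used: the function name concatenate_summaries and the per-year file names in the usage note are hypothetical.

```r
# Sketch: concatenate per-year summary files, keeping only the first header line.
# The function name is hypothetical; it is not part of the original script.
concatenate_summaries <- function(input_files, output_file)
{ all_lines <- character(0)

  for (i in seq_along(input_files))
  { these_lines <- readLines(input_files[i])

    if (i > 1)
      these_lines <- these_lines[-1]    # drop the header line from every file except the first

    all_lines <- c(all_lines, these_lines)
  }

  writeLines(all_lines, output_file)

  invisible(length(all_lines))          # number of lines written
}
```

Assuming the yearly outputs were saved under names such as rbn-summary-2009 to rbn-summary-2020 (an assumed naming scheme), the call would be concatenate_summaries(paste0("rbn-summary-", 2009:2020), "rbn-summary-data").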
The
summary file, after being uncompressed, comprises a single large table
of values separated by white space. The name of each column (there are
twelve columns in all) is on the first row. The columns are:
- band:
a string that identifies the band pertaining to this row. Typical
values are "15m" or "160m"; if a row contains data that are not
distinguished by band, then the characters "NA" are used.
- mode: a string that identifies the mode pertaining to this row. Typical values
are "CW" or "RTTY"; if a row contains data that are not distinguished
by mode, then the characters "NA" are used.
- type:
a single character that identifies whether the data on this row are for
a period of a year ("A"), a month ("M") or a day ("D").
- year: the numeric four-digit value of the year to which the current row pertains.
- month: the numeric value of the month (January = 1, etc.) of the data in this row. If the data are of type A or D, then this element has the value "NA".
- doy: the numeric value of the day number of the year (January 1st = 1, etc.).
The maximum value in each year is 366 (even if the year is not a leap
year). In the event that the year is not a leap year, the data in
columns 7, 8 and 9 will be set to 0 when doy is 366. If the data are of type A or M, then this element has the value "NA".
- posts: the total number of posts recorded by the RBN for the band, mode and period identified by the first six columns.
- calls:
the total number of distinguishable calls recorded by the RBN for the
band, mode and period identified by the first six columns.
- posters: the total number of distinguishable posters recorded by the RBN for
the band, mode and period identified by the first six columns.
- scatter:
the value of a scatter metric that characterises the geography of the
RBN for the band, mode and period identified by the first six columns.
The scatter metric is the mean distance between good posters: the sum of the distances (measured in km) between all possible pairs of good posters, divided by the number of pairs.
- good_posters:
the total number of distinguishable posters recorded by the RBN for
the band, mode and period identified by the first six columns, and for which location data are available from the RBN.
- grid_metric: the total number of G(15, 100) grid cells that contain good posters.
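To make the definition of the scatter metric concrete, here is a minimal sketch in R of a mean pairwise distance. It rests on assumptions that are not taken from the summary-generation script: distances are great-circle distances computed with the haversine formula on a spherical Earth of radius 6371 km, and the function names are hypothetical. The values in the summary file may have been computed with a different distance formula.

```r
# great-circle distance in km between two points given in degrees (haversine formula);
# ASSUMPTION: spherical Earth of radius 6371 km
haversine_km <- function(lat1, lon1, lat2, lon2)
{ to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2

  return (2 * 6371 * asin(pmin(1, sqrt(a))))
}

# mean distance over all unordered pairs of posters;
# lat and lon are vectors of poster positions, in degrees
scatter_metric <- function(lat, lon)
{ n <- length(lat)

  if (n < 2)
    return (NA)                  # no pairs, so the metric is undefined

  pairs <- combn(n, 2)           # matrix with 2 rows and choose(n, 2) columns of index pairs
  d <- haversine_km(lat[pairs[1, ]], lon[pairs[1, ]], lat[pairs[2, ]], lon[pairs[2, ]])

  return (sum(d) / length(d))
}
```

Counting each pair once or twice (A to B as well as B to A) makes no difference to the result, because both the sum and the number of pairs double.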
For example, the first two lines of the summary file are (presented here
as a table, in order to make it easier to view on mobile devices):
band | NA
mode | NA
type | A
year | 2009
month | NA
doy | NA
posts | 5007040
calls | 143724
posters | 151
scatter | 5541
good_posters | 150
grid_metric | 22
This tells us that the first line of actual data in the file comprises
annual data for the year 2009, with no separation by band or mode. In 2009,
we see that there were 5,007,040 posts of 143,724 callsigns by 151 posters;
the scatter metric, which is a measure of the geographic dispersion of the
posters on the RBN, was 5,541; location data were available for 150 of the
posters (the good posters), spread across 22 distinct G(15, 100) grid cells.
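As a small illustration of how such a row can be picked out in R, the sketch below parses just the two lines quoted above (rather than the full file) and selects the annual rows that are not split by band or mode. Note that read.table() converts the characters "NA" into R's missing value, so the test is is.na(band), not a comparison with the string "NA".

```r
# the first two lines of the summary file, as quoted above
sample_text <- "band mode type year month doy posts calls posters scatter good_posters grid_metric
NA NA A 2009 NA NA 5007040 143724 151 5541 150 22"

summary_data <- read.table(text = sample_text, header = TRUE)

# annual rows that are not separated by band or mode
annual_totals <- subset(summary_data, type == 'A' & is.na(band) & is.na(mode))

print(annual_totals[, c("year", "posts", "calls", "posters")])
```

With the real file, the text = sample_text argument would simply be replaced by the path to the uncompressed summary file.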
The summary file allows rather rapid analysis of many RBN overview
statistics. For example, a plot of the daily number of posts covering
the period from the inception of the RBN to the end of 2020
can be generated on an ordinary desktop PC in a few seconds. From this plot, for example,
we can immediately see that the largest number of daily posts occurred
during the 2020 running of the CQ WW CW contest in late November (the second-highest cluster of peaks is for the CQ WPX contest, and the third is for the ARRL DX CW contest); also, the burst of activity that coincides with weekends is unmistakable.
For what it's worth, this is the code I used to generate the above plot (I apologise for the awful layout caused by the wrapping of long lines as they are forced into the narrow format used by blogger.com):
#!/usr/bin/Rscript
# generate a plot of the diurnal number of posts by the RBN, stacked by year
MIN_YEAR <- 2009
MAX_YEAR <- 2020
filename <- "/zd1/rbn/rbn-summary-data" # the local location of the RBN summary data file
# first two lines of the file:
#band mode type year month doy posts calls posters scatter good_posters grid_metric
#NA NA A 2009 NA NA 5007040 143724 151 5541 150 22
# rounding function
round_n <- function(x, n) { return ( ( as.integer( (x - 1) / n) +1 ) * n ) } # function to return next higher integral multiple of n, unless value is already such a multiple
data <- read.table(filename, header=TRUE)
# select diurnal data
diurnal_data <- subset(data, type=='D')
# drop the per-band and per-mode data
diurnal_all_bands_and_modes_data <- subset(diurnal_data, is.na(band) & is.na(mode))
# drop a bunch of columns that we don't want from the summary file
diurnal_all_bands_and_modes_data$band <- NULL
diurnal_all_bands_and_modes_data$mode <- NULL
diurnal_all_bands_and_modes_data$type <- NULL
diurnal_all_bands_and_modes_data$month <- NULL
diurnal_all_bands_and_modes_data$calls <- NULL
diurnal_all_bands_and_modes_data$posters <- NULL
diurnal_all_bands_and_modes_data$scatter <- NULL
diurnal_all_bands_and_modes_data$good_posters <- NULL
diurnal_all_bands_and_modes_data$grid_metric <- NULL
# get ready to start to plot
graphics.off()
png(filename=paste(sep="", "/tmp/rbn-posts-from-summary.png"), width=800, height=600)
x_lab <- 'DOY'
# 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
clrs <- c("black", "red", rgb(0.1, 0.1, 0.5), "yellow", "green", "blue", "violet", rgb(0.6, 0.2, 0.2), "white", "cornflowerblue", "gold1", "darkorange")
# create a frame to map between year and days in the year
days_in_year <- data.frame(seq(MIN_YEAR, MAX_YEAR), 365)
names(days_in_year) <- c("year", "days")
days_in_year$days[days_in_year$year %% 4 == 0] <- 366 # sufficient for 2009 to 2020; not a general leap-year test
# set boundaries
plot(0, 0, xlim = c(0.5, 366.5), ylim = c(0, round_n(max(diurnal_all_bands_and_modes_data$posts), 1000000)), xaxt = "n", yaxt = "n", xlab = x_lab, ylab = "", type = 'n', yaxs="i") # define the plotting region, but don't actually plot anything
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4], col = 'grey')
# now generate the plot, superimposing each year
for (this_year in seq(MIN_YEAR, MAX_YEAR))
{ this_years_data <- subset(diurnal_all_bands_and_modes_data, year==this_year)

  max_element <- days_in_year$days[this_year - MIN_YEAR + 1]

# set up so that this_years_data$id_365[365] = this_years_data$id_366[366] = 366, so either column can be used as a vector of abscissæ,
# depending on whether there are 365 or 366 ordinate values
  this_years_data$id_366<-seq.int(nrow(this_years_data))
  this_years_data$id_365<-((this_years_data$id_366 - 1) * 365.0 / 364.0) + 1

# remove a couple of columns that we no longer need
  this_years_data$doy <- NULL
  this_years_data$year <- NULL

# move the two new columns of days to the left of the frame: 365, then 366
  this_years_data <- (this_years_data[ c(ncol(this_years_data), ncol(this_years_data) - 1, 1:(ncol(this_years_data)-2))])

# column 1 (id_365) is used for 365-day years, column 2 (id_366) for leap years
  lines(this_years_data[,(max_element-364)][1:max_element], this_years_data$posts[1:max_element], type = 'l', col = clrs[this_year - MIN_YEAR + 1], lwd = 1)
}
title_str <- paste(sep="", 'RBN POSTS PER DAY')
title(title_str)
title(ylab = '# OF POSTS (m)', line = 2.1, cex.lab = 1.0)
x_ticks_at <- c(1, 31, 61, 91, 121, 151, 181, 211, 241, 271, 301, 331, 361)
x_labels_at <- x_ticks_at
x_tick_labels <- x_ticks_at
axis(side = 1, at = x_ticks_at, labels = FALSE ) # ticks on x axis
axis(side = 1, at = x_labels_at, labels = x_tick_labels, tick = FALSE )
y_ticks_at <- seq(0, round_n(max(diurnal_all_bands_and_modes_data$posts), 1000000), 100000)
y_labels_at <- seq(0, round_n(max(diurnal_all_bands_and_modes_data$posts), 1000000), 1000000)
y_tick_labels <- seq(0, round_n(max(diurnal_all_bands_and_modes_data$posts), 1000000) / 1000000, 1)
axis(side = 2, at = y_ticks_at, labels = FALSE )
axis(side = 2, at = y_labels_at, labels = y_tick_labels, tick = FALSE )
minx <- par("usr")[1]
maxx <- par("usr")[2]
miny <- par("usr")[3]
maxy <- par("usr")[4]
xrange <- maxx - minx
yrange <- maxy - miny
xpos <- minx + 0.025 * xrange
ypos <- miny + 0.975 * yrange
par(xpd=T, mar=c(0,0,4,0))
legend(x = xpos, y = ypos, legend = seq(MIN_YEAR, MAX_YEAR),
lty=c(1, 1), lwd=c(2,2), col = clrs,
bty = 'n', text.col = 'black')
graphics.off()
Of course, many other insights may be gleaned rather rapidly from the summary file.
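As one example of such a quick look, the sketch below lists the days with the most posts, converting each year and day-of-year pair into a calendar date. The function name busiest_days is hypothetical, and the function assumes a data frame with the column names described earlier, as returned by read.table(filename, header=TRUE).

```r
# Sketch: return the n days with the largest number of posts, all bands and modes combined.
# ASSUMPTION: summary_data has the columns band, mode, type, year, doy and posts described above.
busiest_days <- function(summary_data, n = 10)
{ daily <- subset(summary_data, type == 'D' & is.na(band) & is.na(mode))

  daily <- daily[order(daily$posts, decreasing = TRUE), ]    # sort by descending number of posts
  daily <- head(daily, n)

# convert year and day-of-year to a calendar date (doy 1 is January 1st)
  daily$date <- as.Date(sprintf("%d-01-01", daily$year)) + (daily$doy - 1)

  return (daily[, c("date", "posts")])
}
```

The padding rows for day 366 in non-leap years carry zero posts, so they sort to the bottom and do not appear in the result unless n is very large. A typical call would be busiest_days(read.table(filename, header=TRUE), 10), with filename set to the local location of the uncompressed summary file.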