2017-02-27

Versions of the LANL GPS Charged-Particle Dataset

The original dataset of GPS charged-particle measurements is quite cumbersome, comprising many thousand individual files. It is therefore useful to convert the data into a more-manageable schema, and also to remove what appears to be unnecessary and invalid information from the dataset.

For simplicity and ease of reference, I number each stage of the process so that others can more easily reproduce the various versions of the data. The initial complete dataset is stage 0.

Stage 1: All the data for each satellite in a single file

Because it is usually easier to operate on a single file for each satellite, we start by combining all the files for each satellite. Inside the directory for each satellite, ns<nn>, execute:

for file in *ascii; do grep -v "#" $file >> stage-1/ns<nn>; done


Satellite Records Start (YYYY-DOY) End (YYYY-DOY)
ns41 2,078,270 2001-007 2016-366
ns48 1,178,405 2008-083 2016-366
ns53 1,487,237 2005-275 2016-366
ns54 2,153,770 2001-049 2016-366
ns55 1,609,281 2007-308 2017-001
ns56 1,846,892 2003-040 2016-366
ns57 1,186,706 2008-013 2017-001
ns58 1,333,278 2006-337 2016-366
ns59 1,696,926 2004-088 2016-366
ns60 1,650,202 2004-193 2016-366
ns61 1,610,038 2004-319 2016-366
ns62 873,766 2010-164 2016-366
ns63 740,301 2011-198 2016-366
ns64 379,422 2014-054 2016-338
ns65 528,191 2012-288 2016-366
ns66 478,259 2013-153 2016-366
ns67 361,540 2014-138 2016-366
ns68 315,369 2014-222 2016-366
ns69 303,676 2014-320 2016-366
ns70 132,460 2016-038 2016-366
ns71 240,660 2015-088 2016-366
ns72 214,410 2015-200 2016-366
ns73 171,485 2015-305 2016-366


Stage 2: Remove all records marked as bad

For some reason, the dataset includes records that are known to be bad. A bad record is signalled by a value of unity for the dropped-data field (the documentation for this field says: if =1 it means something is wrong with the data record, do not use it). So we remove all records for which this field has the value one.

For ns41 and ns48:
awk 'BEGIN{FS=" "}; $25!="1" { print }' < stage-1/ns<nn> > stage-2/ns<nn>
For all other satellites:
awk 'BEGIN{FS=" "}; $26!="1" { print }' < stage-1/ns<nn> > stage-2/ns<nn>

Satellite Stage 2 Records % Good Stage 1 Records
ns41 1,991,249 95.8
ns48 1,105,991 93.8
ns53 1,331,687 89.5
ns54 1,939,151 90.0
ns55 1,055,328 65.6
ns56 1,680,887 91.0
ns57 1,082,626 91.2
ns58 1,175,519 88.2
ns59 1,516,528 89.4
ns60 1,495,541 90.6
ns61 1,470,445 91.3
ns62 775,535 88.8
ns63 652,110 88.1
ns64 344,702 90.8
ns65 480,935 91.0
ns66 446,801 93.4
ns67 327,994 90.7
ns68 306,513 97.2
ns69 262,992 86.6
ns70 110,971 83.8
ns71 221,336 92.0
ns72 182,332 85.0
ns73 145,858 85.0


Stage 3: Remove all records marked with invalid day of year

Ideally, the dataset would now contain only valid data; however, this turns out not to be the case. For example, the first value in each record is identified as the day of the year (decimal_day). The documentation for this field says: GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years. However, a simple sanity test shows that there are records that contain negative values for this field:

[HN:gps-stage-2] awk 'BEGIN{FS=" "}; $1<=0 { print }' < stage-2/ns41 | wc -l
177

[HN:gps-stage-2]

Accordingly, we can drop all records that have an invalid value for the day of year:

For ns41 and ns48: 

awk 'BEGIN{FS=" "}; ($1>=1 && (($22 == 2004 || $22 == 2008 || $22 == 2012 || \
$22 == 2016) ? $1<=367 : $1<=366)) { print }' < ns<nn> > ../stage-3/ns<nn>

For all other satellites (this appears to be a null operation; nevertheless it is worth executing the filter so as to be certain that the values are in the acceptable range):

awk 'BEGIN{FS=" "}; ($1>=1 && (($23== 2004 || $23 == 2008 || $23 == 2012 || \
$23 == 2016) ? $1<=367 : $1<=366)) { print }' < ns<nn> > ../stage-3/ns<nn>

Stage 4: Correct time information

Four fields pertain to time information:

decimal_day
double

1 GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years.
collection_interval double 1 dosimeter collection period (seconds)
year int 1 year (e.g. 2015)
decimal_year double 1 decimal year = year + (decimal_day-1.0)/(days in year)

Unfortunately, the actual values recorded in these fields are sometimes problematic:
  1. The decimal_day value is recorded with a precision of only 0.000683 days, or 0.0864 seconds. Since the value is recorded in terms of days rather than seconds, then even if the records are actually acquired according a clock whose period is an integral number of seconds, the precise value of the decimal_day field will jitter slightly around the correct time because of the quantisation error.
  2. The decimal_year value is not useful because it lacks sufficient precision. For example, the first two records of data from ns41 both contain the value 2.001016e+03 for this field, despite being acquired roughly 4 minutes apart. Therefore, this field cannot be used as an accurate representation of the time of the record. Since a monotonically increasing value that indicates time is undoubtedly useful, the values in the dataset should be replaced with corrected values with a larger number of significant figures.
  3. Inspection of the value of collection_interval suggests that it is intended to be an integer (despite being represented as a floating point number) taken from the set { 24, 120, 240, 4608 }. On one occasion, however, it has the value 367.0062. This I assume to be an error (the decimal_day value is consistent with a value of 240). Nevertheless, the presence of what seems to be an unreasonable number for this field suggests the need to filter the dataset further, so as to retain only those records with a value of collection_interval taken from the set of expected values.
The presence of occasional illegitimate values for fields such as decimal_day and collection_interval suggests that the data were not subjected to checking before being made available to the public (despite the presence of the dropped_data field; I have seen no documentation that describes what particular errors would cause this field to be set to 1). This is unfortunate, as it means that users have to process the data to remove errors before being able to move on to perform a defensible analysis based on the data.

Accordingly, our next step is to deal with the third item in the above list: to filter the data and retain only those records with values of collection_interval taken from the set { 24, 120, 240, 4608 }. Although I noticed the occurrence of only one record with an invalid value, it is worthwhile to process the complete dataset so as to be certain that only legitimate values remain.

For ns41 and ns48: 

awk 'BEGIN{FS=" "}; $21=="2.400000e+01" || $21=="1.200000e+02" || \
$21=="2.400000e+02" || $21=="4.608000e+03" { print }' < ns<nn> > ../stage-4/ns<nn>

And for the remaining spacecraft:

awk 'BEGIN{FS=" "}; $22=="2.400000e+01" || $22=="1.200000e+02" || \
 $22=="2.400000e+02" || $22=="4.608000e+03" { print }' < ns<nn> > ../stage-4/ns<nn>

As a kind of checkpoint, I have uploaded the stage 4 dataset. The MD5 checksum is 657e03a3fdbae517bd8eccf5f208b792.

For the record, here are the number of records for each satellite at the end of Stage 4 (the last couple of stages of filtering have affected only ns41):

Satellite Stage 4 Records
ns41 1,991,069
ns48 1,105,991
ns53 1,331,687
ns54 1,939,151
ns55 1,055,328
ns56 1,680,887
ns57 1,082,626
ns58 1,175,519
ns59 1,516,528
ns60 1,495,541
ns61 1,470,445
ns62 775,535
ns63 652,110
ns64 344,702
ns65 480,935
ns66 446,801
ns67 327,994
ns68 306,513
ns69 262,992
ns70 110,971
ns71 221,336
ns72 182,332
ns73 145,858


Further processing of the dataset will be described in subsequent posts.



2017-02-24

A Scatter Metric for the RBN

I recently posted a series of maps that showed the highly non-uniform geographic distribution of posting stations on the RBN as a function of time and frequency. While informative, these maps are very inconvenient, and do not lend themselves to any kind of quantitative analysis of the RBN's posting stations. What is needed is some kind of scatter metric, S, that in some way describes the non-uniform distribution of posting stations.

At least three basic approaches to creating such a metric suggest themselves (to me, anyway; there are probably others):
  1. Metrics based directly on a distance metric;
  2. Metrics based on grid occupancy;
  3. Metrics based indirectly on a distance metric.

Metrics based Directly on a Distance Metric


Several scatter metrics directly on a distance metric suggest themselves, perhaps the most obvious of which would be some number that compares the actual geographic dispersion of stations reporting at a particular frequency and over some period of time (such as a year or a month) to an "ideal" dispersion of the same number of stations equally spread around the world. This would allow calculation of an efficiency metric that could then be used to compare the geographic efficiency of the RBN as a function of time and frequency.

 

Definition of a Distance Metric


But before we go too far down this path, we need to consider what distance metric to us. For two points on the surface of a sphere (we'll assume that the Earth is a sphere -- and hence its surface is a 2-sphere -- which keeps things simple at little cost in accuracy), there are two obvious reasonable ways of defining the distance between them: the ordinary 3-space cartesian distance, and the length of the shortest path on the 2-sphere surface. We will denote these metrics by ℓC and ℓS respectively.

If we denote the radius of the Earth by RE, then it is easy to derive the following (equivalent) relationships between ℓC and ℓS in the domain 0 ≤ ℓC ≤ 2RE:

C2 = 2RE2(1 - cos (ℓS / RE))
S = RE × acos(1 - (ℓC2 / 2RE2))

We can plot this relationship with a trivial gnuplot program:
R = 6371
lc(ls) = sqrt(2) * R * sqrt(1 - cos(ls / R))

set xlabel "ℓ_S 2-sphere distance (km)"
set ylabel "ℓ_C 3-space distance (km)"
set title "ℓ_C as a Function of ℓ_S for Spherical Earth"

plot [ls=0:pi*R] lc(ls) notitle
which produces:

No surprise there, of course: ℓC is a strictly monotonically increasing non-linear function of ℓS. (Well, to be more precise, that is true throughout the domain except at the point ℓS = π × RE .)

Anyway, this tells us that if a scatter metric is to be based on distances between posting stations, it makes no fundamental difference whether we measure distances by ℓC or ℓS: the details of the calculations might (and in general would) change because of the non-linear relationship between the two ways of measuring distance, but there is no intrinsic reason to prefer one measurement over the other since there is a one-to-one mapping between the two measures.

 

Efficiency


Once a distance metric is defined, one can define some function based on that metric to express the amount of scatter of RBN posting stations across the globe. Ideally, as mentioned above, one could then compare that scatter to the scatter of the same number of stations uniformly spread over the surface of the Earth.

The most obvious scatter function defined along these lines is the mean separation between posting stations, using one of the two distance metrics described above. Since it makes no substantive difference which we distance metric choose, we will use ℓS, simply because that corresponds to the most common meaning of "distance" as used in amateur radio.

For N points (i.e., posting stations) P1, P2, ... PN we can define a plausible scatter function S by:

$$ S = {2 \over {N \times (N - 1) }}  \sum_{i=1}^{N-1}\sum_{j=i+1}^{N} {_i}{\Delta}_j$$
where  iΔj is the value of the distance metric ℓS for the points Pi and Pj.

We can then compare this to the ideal value obtained from points uniformly spread across the surface of the Earth.

Unfortunately, there is no known algebraic method for determining such an ideal "uniformly spread" dispersion, which is equivalent to finding a solution to what is known as the Tammes problem, which is a particular (unsolved) problem in the theory of spherical codes. There are various ways of computing solutions that are likely to be equal, or very close to, the ideal dispersion (for example: place N points on a sphere each the source of a repulsive force that decays with distance; add some friction, and then determine the location of all the points when they have all ceased to move). However, one could hardly regard this as a clean process, especially as it is not guaranteed to provide the true "most efficient" distribution.

Instead of a "uniformly spread" dispersion (which would maximise the minimum distance calculated across the set of points), though, we can compare the value of the metric S for the RBN to the value of the same metric for a random distribution of points. (By "a random distribution of points" here I mean that the probability of finding some number P points within any particular area A on the 2-sphere is independent of the position and shape of A.)

Symmetry arguments lead us to the (perhaps surprising) conclusion that the expectation value of the mean separation of such a random distribution of points must be independent of the number of points in the network, and will have a value equal to half the maximal value of the distance metric on the 2-sphere. (To put it in terms of the Earth: this means that the expectation value will be one quarter the length of the planet's circumference, or almost exactly 10,000 km.) Thus, we can immediately compare the scatter metric S of the RBN defined above to the "ideal" value of 10,000 -- or, since the ideal value is independent of the number of points, we can simply use the value of the scatter metric as-is, and mentally divide by 10,000 to obtain an "efficiency percentage".

Mensal values of the metric S show a gradual but sustained increase (the Pearson correlation coefficient is 0.78):
The red line is the best-fit linear regression to the data, which suggests that this scatter metric is increasing at the rate of about 145 per year.

Instead of plotting the metric as a function of time, it probably makes more sense to plot it as a function the number of posting stations (excluding those posting stations for which location information is unavailable from the RBN):
The correlation coefficient of the values on this plot is 0.75 -- almost identical to that of the plot based on time (the slope is about 4.8 per poster).

The story of both these plots is that the scatter metric has really changed very little since the RBN's inception: it has increased by only about 30%, even though the number of posters has increased by several hundred percent. This suggests that while the organisers of the RBN have been successful in persuading many additional stations to join the network, there has been but little improvement in the geographical diversity of those stations (as indeed is apparent if one looks at the maps; this in turn suggests that this metric is a not-unreasonable quantitative reflection of the amount of geographic scatter in the underlying network).

This post is now long enough... I'll move on to metrics based on grid occupancy in another post.







2017-02-21

HF Beacons

One of the purposes of the RBN is to allow stations to monitor the strength of their signals. Because of the moment-to-moment changes in propagation that normally accompany HF transmissions, as well as the RBN's unique definition of SNR, it makes sense to see how the RBN reports known stable transmissions over time.

In this post I take the first step, and determine which stations make reasonable candidates for further investigation.

The RBN can quickly tell us which beacons are posted most frequently, simply by counting posts of stations that transmit on a single frequency over a long period. If we look at the period from the RBN's inception in 2009 to the end of 2016 (using, for example, this file), we find the following "top twenty" beacons:

Position Station Frequency (kHz) Number of Posts Earliest Post Latest Post
1 I1MMR 7026 244,964 20090308 20161231
2 I1MMR 7027 182,465 20090308 20161231
3 4U1UN 14100 167,360 20140618 20161231
4 AA1K 1821 156,091 20090221 20161231
5 SK6RUD 10133 156,067 20100720 20160421
6 4X6TU 14100 146,961 20110215 20161231
7 DK0WCY 10144 145,174 20100601 20161231
8 W9ZN 7034144,399 20090225 20161231
9 W0ERE/B 10129 141,262 20100213 20161029
10 W0ERE 10129 135,741 20091105 20161029
11 EW7LO 7008 130,706 20090221 20161231
12 4X6TU 21150 118,790 20110303 20161231
13 IK1HGI/B 7039 111,146 20120516 20161231
14 YV5B 14100 107,564 20131228 20161231
15 DJ6UX 7037 101,942 20121210 20161229
16 DK0WCY 3579 97,002 20111027 20161231
17 DK7JI 7011 96,328 20100112 20161231
18 IK4VFD 7027 94,317 20091130 20161231
19 4X6TU 18110 92,867 20120127 20161231
20 CT1ZQ 7010 91,997 20090301 20161227

Notes:
  1. Frequencies are rounded to the nearest kHz;
  2. The first two positions are occupied by what is really the same station, which appears to transmit on 7026.5 kHz;
  3. Positions 9 and 10 appear to be occupied by a single station, using alternative versions of a single callsign;
  4. I am unsure how the U.S. stations in the list can be legal, since the FCC's regulations appear to limit [unattended] HF beacons to a portion of 10m;
  5. FCC regulations also appear to disallow the use of the "/B" indicator as used by station number 9, as the B series is allocated to China.
  6. It is my memory that the original HF beacons were all located on 28 MHz, so that listeners could be made aware of an opening. It is noticeable that not a single one of the stations on the list above is on 10m: the vast majority are on bands that can reasonably be expected to support some kind of non-local propagation at almost all times (which is probably the very reason that they are posted by the RBN so often -- but one does wonder what the putative purpose of such a beacon is);
  7. Of the twenty stations in the list, all but five were still active as of the end of 2016.

2017-02-13

Compressed Archive of LANL GPS Charged-Particle Data

Recently, the National Oceanic and Atmospheric Administration (NOAA) made available to the public a 16-year dataset of measurements of the charged-particle environment at 23 GPS satellites prepared  by the Combined X-Ray Dosimeter (CXD) team at Los Alamos National Laboratory. The dataset comprises a hierarchy of directories that total about 65 GB in size. As the dataset is uncompressed, I have made available a tarred, compressed mirror of the dataset, which is about 5.7 GB in size. This mirror has the MD5 checksum: 443c5787665a9ff7456ca037101b72e0. All the files in the original LANL release are contained in the mirror, without change.
The basic documentation for the dataset is contained in a README file; the following description of the data is taken from that file:

For all satellites except Space Vehicle Navstar (SVN) numbers 41 and 48, the data are:

Variable name type Dim. description
decimal_day
double


1 GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years.
Geographic_Latitude double 1 Latitude of satellite (deg)
Geographic_Longitude double 1 Longitude of satellite (deg)
Rad_Re double 1 (radius of satellite)/Rearth
rate_electron_measured double 11 Measured rate (Hz) in each of the 11 CXD electron channels
rate_proton_measured double 5 Measured rate (Hz) in each of the 5 CXD proton channels (P1-P5)
LEP_thresh double 1 LEP threshold in E1 channels (0 means low, 1 means high)
collection_interval double 1 dosimeter collection period (seconds)
year int 1 year (e.g. 2015)
decimal_year double 1 decimal year = year + (decimal_day-1.0)/(days in year)
SVN_number int 1 SVN number of satellite
dropped_data int 1 if =1 it means something is wrong with the data record, do not use it
b_coord_radius double 1 radius from earth's dipole axis (earth radii)
b_coord_height double 1 height above the earth's dipole equatorial plane (earth radii)
magnetic_longitude double 1 Magnetic longitude (degrees)
L_shell


double 1 L_shell (earth radii) – currently this is the same as L_LGM_T89IGRF but this is intended to be our suggested choice for the L shell calculation in the long run.
L_LGM_TS04IGRF double 1 LanlGeoMag L-shell McIlwain calculation, TS04 External Field, IGRF Internal Field.
L_LGM_OP77IGRF double 1 LanlGeoMag L-shell McIlwain calculation, OP77 External Field, IGRF Internal Field (not currently filled)
L_LGM_T89CDIP double 1 LanlGeoMag L-shell McIlwain calculation, T89 External Field, Centered Dipole Internal Field
L_LGM_T89IGRF double 1 LanlGeoMag L-shell McIlwain calculation, T89 External Field, IGRF Internal Field
bfield_ratio double 1 Bsatellite/Bequator
local_time double 1 magnetic local time (0-24 hours)
utc_lgm double 1 UTC (0-24 hours)
b_sattelite double 1 B field at satellite (gauss)
b_equator double 1 B field at equator (on this field line I think) (gauss)
electron_background double 11 estimated background in electron channels E1-E11 (Hz)
proton_background double 5 estimated background in proton channels P1-P5 (Hz)
proton_activity int 1 =1 if there is significant proton activity
proton_temperature_fit


double 1 characteristic momentum -- R0 in the expression given above (MeV/c)
proton_density_fit double 1 N0 parameter in fit to proton flux ((protons/(cm2 sec sr MeV))
electron_temperature_fit


double 1 electron temperature from a one Maxwellian fit (MeV)
electron_density_fit double 1 electron number density from a one Maxwellian fit (cm-3)
model_counts_electron_fit_pf double 11 E1-E11 rates due to proton background based on proton flux fit -- currently not filled (all -1's)
model_counts_proton_fit_pf double 5 P1-P5 rate from proton fit (using proton_temperature_fit, proton_density_fit)
model_counts_electron_fit double 11 E1-E11 rates from the 9-parameter electron flux model
model_counts_proton_fit double 5 P1-P5 rates from electron background -- currently not filled (all -1's)
proton_integrated_flux_fit double 6 integral of proton flux (based on fit) above 10, 15.85, 25.11, 30, 40, 79.43 MeV (proton kinetic energy)
proton_flux_fit


double 31 intended to be proton flux at 31 energies, not filled currently
proton_fluence_fit double 6
not filled currently


integral_flux_instrument


double 30 (based on 9 parameter fit) integral of electron flux above integral_flux_energy[i] particles/(cm2 sec)
integral_flux_energy double 30
energies for the integral of integral_flux_instrument (MeV)


electron_diff_flux_energy double 15 energies for the fluxes in electron_diff_flux_energy (MeV)
electron_diff_flux double 15 (based on 9 parameter fit) electron flux at energies electron_diff_flux[i] (particle/(cm2 sr MeV sec))
Efitpars double 9 fit parameters for 9 parameter electron fit
Pfitpars double 4 Fit parameters for 4 parameter proton fit. These are still a work in progress. The parameters are here as placeholders until we finalize the fit function and parameters.
 
 
For SVN numbers 41 and 48, the data are:


Variable name type Dim. Description
decimal_day
double


1 GPS time -- a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years
Geographic_Latitude double 1 Latitude of satellite (deg)
Geographic_Longitude double 1 Longitude of satellite (deg)
Rad_Re double 1 (radius of satellite)/Rearth
rate_electron_measured double 8 Measured rate (Hz) in each of the 8 BDD electron channels (E1-E8)
rate_proton_measured double 8 Measured rate (Hz) in each of the 8 BDD proton channels (P1-P8)
collection_interval double 1 dosimeter collection period (seconds)
year int 1 year (e.g. 2015)
decimal_year double 1 decimal year = year + (decimal_day-1.0)/(days in year)
svn_number int 1 SVN number of satellite
dropped_data int 1 if =1 it means something is wrong with the data record, do not use it
b_coord_radius double 1 radius from earth's dipole axis (earth radii)
b_coord_height double 1 height above the earth's dipole equatorial plane (earth radii)
magnetic_longitude double 1 Magnetic longitude (degrees)
L_shell


double 1 L_shell (earth radii) -- I do not clearly understand the origin of the calculation, but it seems to be a dipole field/T-89
bfield_ratio double 1 Bsatellite/Bequator
local_time double 1 magnetic local time (0-24 hours)
b_sattelite double 1 B field at satellite (gauss)
b_equator double 1 B field at equator (on this field line I think) (gauss)
Diffp double 1 No longer used
sigmap double 1 No longer used
electron_background double 8 estimated background in electron channels E1-E8 (Hz)
proton_background double 8 estimated background in proton channels P1-P8 (Hz)
proton_activity int 1 =1 if there is significant proton activity
electron_temperature double 1 electron temperature from a one Maxwellian fit (MeV)
electron_density_fit double 1 electron number density from a one Maxwellian fit (cm-3)
model_counts_electron_fit double 8 E1-E8 rates from the 2-parameter Maxwellian fit to the electron data
dtc_counts_electron double 8 Dead time corrected electron rates (from data, not fit)
integral_flux_instrument


double 30 (based on 2 parameter Maxwellian fit) integral of electron flux above integral_flux_energy[i] particles/(cm2 sec)
integral_flux_energy double 30 energies for the integral of integral_flux_instrument (MeV)
electron_diff_flux_energy double 15 energies for the fluxes in electron_diff_flux_energy (MeV)
electron_diff_flux double 15 (based on 2 parameter Maxwellian fit) electron flux at energies electron_diff_flux[i] (particle/(cm2 sr MeV sec))


Notes (confirmed with the CXD team):
  1. Some of the data are given in terms of Earth radii. For the purposes of those measurements, the Earth is assumed to be a sphere of radius 6371 km.
  2. A record's timestamp is the time of the centre of the collection interval.
  3. Magnetic local time is defined as the difference (in units of time, where 24 hours is equivalent to 360°) between the magnetic longitude of the record and the magnetic longitude of the anti-solar point at the time of the record.