2017-08-09

Further Sanity Checking of the LANL GPS Charged-Particle Dataset

Prior posts in this series may be found at:
In this post I continue looking at the values of fields for the CXD experiment carried on ns53 through ns73.

Sanity check for collection_interval

The values for collection_interval seem reasonable. (Recall that we processed these values earlier, at stage 4 and stage 10.)

Sanity check for year

The values for year seem reasonable.

Sanity check for decimal_year

The documentation for decimal_year states:

Column Variable name type Dim. description
24 decimal_year double 1 decimal year = year + (decimal_day-1.0)/(days in year)

But if we look at the values in the files, we see an immediate problem. The values are formatted as, for example:
  2.007879e+03
That is, the smallest increment is 0.000001e+03 years, which is 0.001 years, or somewhat less than nine hours. This is obviously useless.

The decimal_year field is redundant, as it can be derived, as stated in the documentation, from the year field and the decimal_day field. But if the decimal_year field is going to be present, it should at least have the same value as the result of the calculation in the documentation. There is obvious use for this field if it has an accurate value, so that brings us to define stage 26:

Stage 26: Recalculate decimal_year


The values of decimal_day are recorded to 10⁻⁶ day, so decimal_year should be recorded to (roughly) 10⁻⁹ year (10⁻⁶ day is about 2.7 × 10⁻⁹ year). We can accomplish the reformatting easily by applying the following script, do-stage-26.py, to each file:

#!/usr/bin/env python
# -*- coding: utf8 -*-

import calendar
import sys

filename = sys.argv[1]

with open(filename) as records:
  for line in records:
    fields = line.split()
    decimal_day_str = fields[0]
    year_str = fields[22]
    decimal_day_value = float(decimal_day_str)
    year_value = int(year_str)
   
    # Use calendar.isleap() so the leap-year test is correct for any year,
    # rather than relying on a hard-coded list of leap years.
    days_in_year = 366 if calendar.isleap(year_value) else 365
     
    decimal_year_value = year_value + (decimal_day_value - 1.0) / days_in_year
     
    decimal_year_str = '{:14.9f}'.format(decimal_year_value)
   
    fields[23] = decimal_year_str
    newline = " ".join(fields)
  
    print newline


Applying this to each file:

 for file in ns[567]*
 do 
   ./do-stage-26.py $file | tr -s " " | sed 's/ $//' > ../gps-stage-26/$file
 done

The resulting maximum and minimum values of decimal_year look like this:

decimal_year
Satellite Minimum Maximum
53 2005.769864123 2016.999996964
54 2001.169868310 2016.999995825
55 2007.879454718 2017.000000000
56 2003.145206589 2016.999995066
57 2008.051914464 2017.000000000
58 2006.939732433 2016.999995825
59 2004.256832087 2016.999999620
60 2004.562844536 2016.999993169
61 2004.887981525 2016.999998104
62 2010.465755041 2016.999998325
63 2011.578088408 2016.999999653
64 2014.164385844 2016.923493000
65 2012.784155790 2016.999996205
66 2013.435620655 2016.999996964
67 2014.394527493 2016.999996396
68 2014.624659342 2016.999995003
69 2014.912332318 2016.999999716
70 2016.120224178 2016.999992601
71 2015.276714230 2016.999999874
72 2015.583565671 2016.999995825
73 2015.871238110 2016.999997628
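
Incidentally, minimum/maximum tables like the one above (and the others in these posts) can be generated with a short script along the following lines. This is only a sketch, not the code actually used: it assumes whitespace-separated records, takes a 1-based column number on the command line (for example, 24 for decimal_year), and skips any "NA" values.

#!/usr/bin/env python
# Sketch: report the minimum and maximum of one column for each satellite file.

import glob
import sys

column = int(sys.argv[1])             # 1-based column number, e.g. 24 for decimal_year

for filename in sorted(glob.glob('ns[567]*')):
    values = []
    with open(filename) as records:
        for line in records:
            fields = line.split()
            if fields[column - 1] != 'NA':
                values.append(float(fields[column - 1]))
    if values:
        print('{} {} {}'.format(filename, min(values), max(values)))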

Sanity check for SVN_number


The documentation for SVN_number states:

Column Variable name type Dim. description
25 SVN_number int 1 SVN number of satellite

The values for SVN_number seem to be correct, except for the obvious silliness that the field name, when expanded, is Satellite_Vehicle_Number_number, and the description includes the nonsensical term SVN number. Consequently I simply redefine this field as:

Column Variable name type Dim. description
25 SVN int 1 Satellite Vehicle Number

Sanity check for b_coord_radius


The maximum and minimum values for b_coord_radius are:

b_coord_radius
Satellite Minimum Maximum
53 6.828054e-01 4.200857e+00
54 8.312735e-01 4.211072e+00
55 8.557934e-01 4.198010e+00
56 7.165368e-01 4.202026e+00
57 7.242072e-01 4.171993e+00
58 7.577640e-01 4.188587e+00
59 NA 4.214008e+00
60 NA 4.205339e+00
61 NA 4.223202e+00
62 1.304388e+00 4.193189e+00
63 6.851737e-01 4.209296e+00
64 1.461084e+00 4.209824e+00
65 1.361361e+00 4.202571e+00
66 1.049070e+00 4.182467e+00
67 7.584448e-01 4.211562e+00
68 7.835127e-01 4.211309e+00
69 7.074734e-01 4.169802e+00
70 8.317451e-01 4.207506e+00
71 1.880229e+00 4.171600e+00
72 1.366415e+00 4.170103e+00
73 7.233613e-01 4.213620e+00

There are two obvious problems with these values: some satellites contain invalid (NA) values; and the precision of the field varies as a function of the value of the field.

We can remove all the records that contain an invalid value easily enough, which gives us stage 27.

Stage 27: Remove records with invalid values of b_coord_radius


for file in ns[567]*
do
  awk '$26!="NA" { print $0 }' $file > ../gps-stage-27/$file
done

This gives us:

b_coord_radius
Satellite Minimum Maximum
53 6.828054e-01 4.200857e+00
54 8.312735e-01 4.211072e+00
55 8.557934e-01 4.198010e+00
56 7.165368e-01 4.202026e+00
57 7.242072e-01 4.171993e+00
58 7.577640e-01 4.188587e+00
59 6.926215e-01 4.214008e+00
60 8.069324e-01 4.205339e+00
61 7.764630e-01 4.223202e+00
62 1.304388e+00 4.193189e+00
63 6.851737e-01 4.209296e+00
64 1.461084e+00 4.209824e+00
65 1.361361e+00 4.202571e+00
66 1.049070e+00 4.182467e+00
67 7.584448e-01 4.211562e+00
68 7.835127e-01 4.211309e+00
69 7.074734e-01 4.169802e+00
70 8.317451e-01 4.207506e+00
71 1.880229e+00 4.171600e+00
72 1.366415e+00 4.170103e+00
73 7.233613e-01 4.213620e+00

Stage 28: Reformat values of b_coord_radius


As noted above, the precision of the reported values varies as a function of the value of the field. This is not a desirable situation. Since values greater than unity are reported with a precision of a millionth of a terrestrial radius, there seems no point in reporting smaller values to greater precision (i.e., better than about 6 m), especially since such precision surely vastly exceeds the accuracy of the magnetic field model on which the numbers are based.

Consequently, we can reformat the values so that they have a consistent precision of one millionth of the terrestrial radius. We define a script called do-stage-28.py:

#!/usr/bin/env python
# -*- coding: utf8 -*-

import re
import sys

filename = sys.argv[1]

with open(filename) as records:
  for line in records:
    fields = line.split()
    value_str = fields[25]      # b_coord_radius
    value = float(value_str)
     
    output_format = '{:8.6f}'
         
    new_value_str = output_format.format(value)
   
    fields[25] = new_value_str
    newline = " ".join(fields)
  
    print newline

And operate on the files:

for file in ns[567]*
do 
  ./do-stage-28.py $file | tr -s " " | sed 's/ $//' > ../gps-stage-28/$file
done

Now we have:

b_coord_radius
Satellite Minimum Maximum
53 0.682805 4.200857
54 0.831273 4.211072
55 0.855793 4.198010
56 0.716537 4.202026
57 0.724207 4.171993
58 0.757764 4.188587
59 0.692622 4.214008
60 0.806932 4.205339
61 0.776463 4.223202
62 1.304388 4.193189
63 0.685174 4.209296
64 1.461084 4.209824
65 1.361361 4.202571
66 1.049070 4.182467
67 0.758445 4.211562
68 0.783513 4.211309
69 0.707473 4.169802
70 0.831745 4.207506
71 1.880229 4.171600
72 1.366415 4.170103
73 0.723361 4.213620

Sanity check for b_coord_height


The maximum and minimum values for b_coord_height are:

b_coord_height
Satellite Minimum Maximum
53 -4.105137e+00 4.082754e+00
54 -4.011254e+00 4.119900e+00
55 -4.097133e+00 3.981823e+00
56 -4.091484e+00 4.110250e+00
57 -4.076309e+00 4.113386e+00
58 -4.051769e+00 4.112167e+00
59 -4.092374e+00 4.107748e+00
60 -3.965876e+00 4.114727e+00
61 -4.095934e+00 4.074463e+00
62 -3.918038e+00 3.950894e+00
63 -4.087862e+00 4.103668e+00
64 -3.801516e+00 3.899932e+00
65 -3.946078e+00 3.755339e+00
66 -3.898498e+00 4.027861e+00
67 -4.095983e+00 4.105099e+00
68 -4.039725e+00 4.085420e+00
69 -4.077435e+00 4.103233e+00
70 -4.077986e+00 4.069141e+00
71 -3.667803e+00 3.713735e+00
72 -3.727146e+00 3.938838e+00
73 -4.134635e+00 4.132903e+00

These require the same kind of massaging as b_coord_radius.

Stage 29: Reformat values of b_coord_height


We define do-stage-29.py:

#!/usr/bin/env python
# -*- coding: utf8 -*-

import re
import sys

filename = sys.argv[1]

with open(filename) as records:
  for line in records:
    fields = line.split()
    value_str = fields[26]      # b_coord_height
    value = float(value_str)
    
    output_format = '{:8.6f}'
    
    new_value_str = output_format.format(value)
  
    fields[26] = new_value_str
    newline = " ".join(fields)
 
    print newline


and apply it to the files:

for file in ns[567]*
do 
  ./do-stage-29.py $file | tr -s " " | sed 's/ $//' > ../gps-stage-29/$file
done

with the result:

b_coord_height
Satellite Minimum Maximum
53 -4.105137 4.082754
54 -4.011254 4.119900
55 -4.097133 3.981823
56 -4.091484 4.110250
57 -4.076309 4.113386
58 -4.051769 4.112167
59 -4.092374 4.107748
60 -3.965876 4.114727
61 -4.095934 4.074463
62 -3.918038 3.950894
63 -4.087862 4.103668
64 -3.801516 3.899932
65 -3.946078 3.755339
66 -3.898498 4.027861
67 -4.095983 4.105099
68 -4.039725 4.085420
69 -4.077435 4.103233
70 -4.077986 4.069141
71 -3.667803 3.713735
72 -3.727146 3.938838
73 -4.134635 4.132903

Sanity check for magnetic_longitude


The maximum and minimum values for magnetic_longitude are:

magnetic_longitude
Satellite Minimum Maximum
53 4.185175e-03 3.599997e+02
54 1.637927e-04 3.599992e+02
55 3.244386e-05 3.600000e+02
56 3.396840e-04 3.599996e+02
57 5.629913e-04 3.599997e+02
58 1.443975e-03 3.599997e+02
59 2.364732e-05 3.600000e+02
60 3.059084e-04 3.599989e+02
61 6.888084e-04 3.599998e+02
62 9.731220e-04 3.599994e+02
63 1.367992e-03 3.599995e+02
64 3.209574e-04 3.599982e+02
65 1.819354e-03 3.599992e+02
66 8.708465e-04 3.599990e+02
67 1.153824e-05 3.599956e+02
68 6.599287e-04 3.599986e+02
69 5.731681e-03 3.599997e+02
70 3.041434e-02 3.599747e+02
71 5.421115e-04 3.599966e+02
72 1.369661e-03 3.599986e+02
73 6.566200e-04 3.599996e+02


We apply changes similar to stage 23, as follows.

Stage 30: reformat values of magnetic_longitude

We define do-stage-30.py:

#!/usr/bin/env python
# -*- coding: utf8 -*-

import re
import sys

filename = sys.argv[1]

field_nr = 27    # magnetic_longitude

with open(filename) as records:
  for line in records:
    fields = line.split()
    value_str = fields[field_nr]      # magnetic_longitude
    value = float(value_str)
   
    if value < 0:
      value += 360
     
    if value >= 360:
      value -= 360
     
    new_value_str = '{:9.4f}'.format(value)
   
    fields[field_nr] = new_value_str
    newline = " ".join(fields)
  
    print newline

And apply it:

for file in ns[567]*
do 
  ./do-stage-30.py $file | tr -s " " | sed 's/ $//' > ../gps-stage-30/$file
done

With the result:


magnetic_longitude
Satellite Minimum Maximum
53 0.0042 359.9997
54 0.0002 359.9992
55 0.0000 359.9998
56 0.0003 359.9996
57 0.0006 359.9997
58 0.0014 359.9997
59 0.0000 359.9999
60 0.0003 359.9989
61 0.0007 359.9998
62 0.0010 359.9994
63 0.0014 359.9995
64 0.0003 359.9982
65 0.0018 359.9992
66 0.0009 359.9990
67 0.0000 359.9956
68 0.0007 359.9986
69 0.0057 359.9997
70 0.0304 359.9747
71 0.0005 359.9966
72 0.0014 359.9986
73 0.0007 359.9996


We checkpoint the stage 30 dataset. The checkpoint file has the MD5 checksum: eba369c61f7974a636358758e1035ae9.
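
Should anyone want to verify a downloaded checkpoint against the quoted checksum, a minimal sketch is shown below (the filename is hypothetical; any checkpoint file will do):

#!/usr/bin/env python
# Sketch: compute the MD5 checksum of a (possibly large) checkpoint file in chunks.
# The filename is hypothetical.

import hashlib

md5 = hashlib.md5()

with open('gps-stage-30-checkpoint.tar', 'rb') as checkpoint:
    for chunk in iter(lambda: checkpoint.read(1 << 20), b''):
        md5.update(chunk)

print(md5.hexdigest())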

The data table for ns41 and ns48 still looks like this:


Column Variable name type Dim. Description
1 decimal_day double 1 GPS time -- a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years
2 Geographic_Latitude double 1 Latitude of satellite (deg)
3 Geographic_Longitude double 1 Longitude of satellite (deg)
4 Rad_Re double 1 (radius of satellite)/Rearth
5-12 rate_electron_measured double 8 Measured rate (Hz) in each of the 8 BDD electron channels (E1-E8)
13-20 rate_proton_measured double 8 Measured rate (Hz) in each of the 8 BDD proton channels (P1-P8)
21 collection_interval int 1 dosimeter collection period (seconds)
22 year int 1 year (e.g. 2015)
23 decimal_year double 1 decimal year = year + (decimal_day-1.0)/(days in year)
24 svn_number int 1 SVN number of satellite
25 b_coord_radius double 1 radius from earth's dipole axis (earth radii)
26 b_coord_height double 1 height above the earth's dipole equatorial plane (earth radii)
27 magnetic_longitude double 1 Magnetic longitude (degrees)
28 L_shell double 1 L_shell (earth radii) -- I do not clearly understand the origin of the calculation, but it seems to be a dipole field/T-89
29 bfield_ratio double 1 Bsatellite/Bequator
30 local_time double 1 magnetic local time (0-24 hours)
31 b_sattelite double 1 B field at satellite (gauss)
32 b_equator double 1 B field at equator (on this field line I think) (gauss)
33-40 electron_background double 8 estimated background in electron channels E1-E8 (Hz)
41-48 proton_background double 8 estimated background in proton channels P1-P8 (Hz)
49 proton_activity int 1 =1 if there is significant proton activity
50 electron_temperature double 1 electron temperature from a one Maxwellian fit (MeV)
51 electron_density_fit double 1 electron number density from a one Maxwellian fit (cm-3)
52-59 model_counts_electron_fit double 8 E1-E8 rates from the 2-parameter Maxwellian fit to the electron data
60-67 dtc_counts_electron double 8 Dead time corrected electron rates (from data, not fit)
68-97 integral_flux_instrument double 30 (based on 2 parameter Maxwellian fit) integral of electron flux above integral_flux_energy[i] particles/(cm2sec)
98-127 integral_flux_energy double 30 energies for the integral of integral_flux_instrument (MeV)
128-142 electron_diff_flux_energy double 15 energies for the fluxes in electron_diff_flux_energy (MeV)
143-157 electron_diff_flux double 15 (based on 2 parameter Maxwellian fit) electron flux at energies electron_diff_flux[i] (particle/(cm2 sr MeV sec))

And for the remaining satellites we now have (I have clarified the descriptions of the b_coord_radius and b_coord_height fields):


Column Variable name type Dim. description
1 decimal_day double 1 GPS time: a decimal number in the range [1, 367) in leap years or [1, 366) otherwise, representing the day of the year (1-Jan 00:00 to 31-Dec 24:00).
2 Geographic_Latitude double 1 Latitude of satellite (°, N +ve)
3 Geographic_Longitude double 1 Longitude of satellite (°, E +ve, measured from Greenwich meridian)
4 Rad_Re double 1 Distance from centre of Earth, in units of Earth radii.
5-15 rate_electron_measured double 11 Measured rate (Hz) in each of the 11 CXD electron channels
16-20 rate_proton_measured double 5 Measured rate (Hz) in each of the 5 CXD proton channels (P1-P5)
21 LEP_thresh int 1 LEP threshold in E1 channels, in keV
22 collection_interval int 1 dosimeter collection period (seconds)
23 year int 1 year (e.g. 2015)
24 decimal_year double 1 decimal year = year + (decimal_day-1.0) / (days in year)
25 SVN int 1 Satellite Vehicle Number
26 b_coord_radius double 1 Distance from dipole axis, in units of Earth radii.
27 b_coord_height double 1 Distance from dipole equatorial plane, in units of Earth radii (N +ve).
28 magnetic_longitude double 1 Magnetic longitude (degrees)
29 L_shell double 1 L shell: McIlwain calculation according to model with T89 External Field, IGRF Internal Field.
30 L_LGM_TS04IGRF double 1 LanlGeoMag L-shell McIlwain calculation, TS04 External Field, IGRF Internal Field.
31 L_LGM_OP77IGRF double 1 LanlGeoMag L-shell McIlwain calculation, OP77 External Field, IGRF Internal Field (not currently filled)
32 L_LGM_T89CDIP double 1 LanlGeoMag L-shell McIlwain calculation, T89 External Field, Centered Dipole Internal Field
33 bfield_ratio double 1 Bsatellite/Bequator
34 local_time double 1 magnetic local time (0-24 hours)
35 utc_lgm double 1 UTC (0-24 hours)
36 b_sattelite double 1 B field at satellite (gauss)
37 b_equator double 1 B field at equator (on this field line I think) (gauss)
38-48 electron_background double 11 estimated background in electron channels E1-E11 (Hz)
49-53 proton_background double 5 estimated background in proton channels P1-P5 (Hz)
54 proton_activity int 1 =1 if there is significant proton activity
55 proton_temperature_fit double 1 characteristic momentum -- R0 in the expression given above (MeV/c)
56 proton_density_fit double 1 N0 parameter in fit to proton flux ((protons/(cm2 sec sr MeV))
57 electron_temperature_fit double 1 electron temperature from a one Maxwellian fit (MeV)
58 electron_density_fit double 1 electron number density from a one Maxwellian fit (cm-3)
59-63 model_counts_proton_fit_pf double 5 P1-P5 rate from proton fit (using proton_temperature_fit, proton_density_fit)
64-74 model_counts_electron_fit double 11 E1-E11 rates from the 9-parameter electron flux model
75-80 proton_integrated_flux_fit double 6 integral of proton flux (based on fit) above 10, 15.85, 25.11, 30, 40, 79.43 MeV (proton kinetic energy)
81-110 integral_flux_instrument double 30 (based on 9 parameter fit) integral of electron flux above integral_flux_energy[i] particles/(cm2 sec)
111-140 integral_flux_energy double 30 energies for the integral of integral_flux_instrument (MeV)
141-155 electron_diff_flux_energy double 15 energies for the fluxes in electron_diff_flux_energy (MeV)
156-170 electron_diff_flux double 15 (based on 9 parameter fit) electron flux at energies electron_diff_flux[i] (particle/(cm2 sr MeV sec))
171-179 Efitpars double 9 fit parameters for 9 parameter electron fit


Finally, here are the number of records for each satellite at the end of the processing for stage 30:


Satellite Stage 30 Records
ns41 1,990,340
ns48 1,105,300
ns53 1,331,017
ns54 1,938,603
ns55 1,054,569
ns56 1,680,369
ns57 1,082,028
ns58 1,174,629
ns59 1,513,323
ns60 1,492,497
ns61 1,467,283
ns62 774,976
ns63 651,078
ns64 343,643
ns65 480,017
ns66 446,139
ns67 327,174
ns68 304,964
ns69 262,021
ns70 110,260
ns71 220,694
ns72 181,519
ns73 145,292

2017-07-27

Switching From Nouveau to the Proprietary NVIDIA driver in Debian Jessie

A couple of weeks ago, I started to experience random crashes on my 64-bit jessie desktop machine. Generally, these took the form of either a frozen desktop or a sudden blank screen, along with a complete lack of responsiveness to either mouse or keyboard.

Usually there was no obvious associated entry in any system log, but at last a series of messages appeared in the syslog file, starting with this one:

Jul 17 13:55:05 homebrew kernel: [24064.296254] nouveau E[ PFIFO][0000:01:00.0] write fault at 0x000029d000 [PTE] from GR/GPC0/GPCCS on channel 0x003fbad000 [Xorg[2071]] 

This was followed by several more messages that appeared to be related, the last of which was:

Jul 17 13:58:23 homebrew kernel: [24262.075187] nouveau E[ DRM] GPU lockup - switching to software fbcon

(On this occasion, although the desktop was non-responsive after the first message, I could still ssh into the machine, and shut it down cleanly from the ssh session, which is why there is a period of several minutes between these two messages.)

This suggested that the cause lay in the nouveau video driver, so I decided to switch to the proprietary NVIDIA driver. This turned out not to be as easy as one might expect, since there didn't seem to be a single place that defined the complete procedure in detail. Hence this post.

Here are the steps that I followed:

1. Install the nvidia-driver package.

2. Install the nvidia-xconfig package.

3. Run nvidia-xconfig.

This complained about the lack of an xorg.conf file, but generated a default one with an nvidia entry for the driver.  There were several other errors, but rebooting at this point resulted in a system that booted and ran X.

So far so good, but during the boot sequence I noticed that the text on the system console was enormous. Similarly, if I switched to the console once the system had booted, the text appeared to be about 80x24, which is quite obnoxious on a 27-inch monitor.

Following the instructions at:

https://wiki.archlinux.org/index.php/GRUB/Tips_and_tricks#Setting_the_framebuffer_resolution

I added two lines to the file /etc/default/grub:

GRUB_GFXMODE=1280x1024x16,1024x768,auto 
GRUB_GFXPAYLOAD_LINUX=keep

DO NOT DO THIS.

After executing
  grub-mkconfig -o /boot/grub/grub.cfg
and rebooting, although the text on the console looked much better, I no longer had any X-based desktop. Switching to :0 merely gave me a blank screen. So I restored the grub.cfg file to the original version.

The above-named URL provides a deprecated mechanism for changing the console font, so that's what I ended up using. In particular, I changed one line of the /etc/default/grub file to read:

GRUB_CMDLINE_LINUX_DEFAULT="quiet vga=794"

and executed: 
  grub-mkconfig -o /boot/grub/grub.cfg

According to this documentation, this gives a 1280x1024 16-bit console, which is a somewhat lower resolution than I had with the nouveau driver, but is vastly better than the resolution without this line in the grub configuration file.

Now everything is working to my satisfaction. The only quirk I see is that at boot time, there is a LOT of disk activity for about 30 seconds after the desktop starts. I'm not sure what the reason for this might be, but at the end of it I have a fully-functioning system with a KDE desktop on :0 and i3 on :1, and can switch to a reasonable-looking console at will.

The best news is that, at least so far, I have experienced no system crashes since switching to the proprietary driver.



2017-07-24

Most-Logged Stations in CQ WW CW 2016

The public CQ WW CW logs allow us easily to tabulate the stations that appear in the largest number of entrants' logs. For 2016, the ten stations with the largest number of appearances were:

Callsign Appearances % logs
HK1NA 10,277 59
9A1A 9,921 63
TK0C 9,682 61
CN2R 9,675 63
PJ2T 9,569 53
CR3W 9,354 60
LZ9W 9,091 59
CN2AA 8,912 59
P33W 8,661 56
EF8R 8,409 57

The first column in the table is the callsign. The second column is the total number of times that the call appears in other stations' logs. That is, if a station worked HK1NA on six bands, that will increment the value in the second column of the HK1NA row by six. The third column is the percentage of logs that contain the callsign at least once.
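
A table of this kind can be built from the cleaned logs with a short script along the lines of the sketch below. The field positions are assumptions about the layout of the cleaned QSO: lines (fields[5] being the logging station's call and fields[8] the call it worked) and may need adjusting; this is not the code actually used.

#!/usr/bin/env python
# Sketch: count appearances of each callsign in other stations' logs, and the
# percentage of logs in which each callsign appears at least once.
# ASSUMED layout: fields[5] is the call sent by the logging station and
# fields[8] is the call it worked.

import sys
from collections import defaultdict

appearances = defaultdict(int)        # total appearances of each worked call
logs_containing = defaultdict(set)    # logging calls whose logs contain each worked call
all_logs = set()                      # all logging (entrant) calls seen

with open(sys.argv[1]) as cleaned:
    for line in cleaned:
        fields = line.split()
        logger, worked = fields[5], fields[8]
        all_logs.add(logger)
        appearances[worked] += 1
        logs_containing[worked].add(logger)

for call in sorted(appearances, key=appearances.get, reverse=True)[:10]:
    pct = 100 * len(logs_containing[call]) / float(len(all_logs))
    print('{} {} {:.0f}'.format(call, appearances[call], pct))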

Tables for prior years are available here.

We can also list the cumulative data for the ten-year span from 2007 to 2016:

Callsign Appearances % logs
LZ9W 91,382 67
PJ2T 84,001 58
9A1A 81,583 61
DF0HQ 80,298 63
PJ4A 72,262 55
D4C 71,571 49
W3LPL 69,447 52
LX7I 69,254 55
K3LR 68,908 53
CR3L 64,426 48

2017-07-17

Additional Information in Augmented Logs for CQ WW, 2005 to 2016

Now available are new augmented versions of the public logs for CQ WW CW and SSB for the period 2005 to 2016.

The cleaned logs are the result of processing the QSO: lines from the entrants' submitted Cabrillo files to ensure that all fields contain valid values and all the data match the format required in the rules. Any line containing illegal data in a field (for example, a zone number greater than 40, or a date/time stamp that is outside the contest period) has simply been removed. Also, only the QSO: lines are retained, so that each line in the file can be processed easily. The MD5 checksum for the file of cleaned logs is: 1b47059d1f2431b55d89a5eb954a05cc.
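
As an illustration only (not the code actually used to produce the cleaned logs), a filter in this spirit might look like the sketch below. The field positions and the contest period are assumptions: fields[3] and fields[4] are taken to be the date and time, fields[10] the received zone, and the period shown is that of CQ WW CW 2016.

#!/usr/bin/env python
# Sketch (illustration only): drop QSO: lines whose zone or date/time stamp is invalid.
# ASSUMED layout: fields[3] date, fields[4] time, fields[10] received zone.

import sys
from datetime import datetime

CONTEST_START = datetime(2016, 11, 26, 0, 0)
CONTEST_END = datetime(2016, 11, 28, 0, 0)      # exclusive

for line in sys.stdin:
    fields = line.split()
    if not line.startswith('QSO:') or len(fields) < 11:
        continue
    try:
        zone = int(fields[10])
        stamp = datetime.strptime(fields[3] + ' ' + fields[4], '%Y-%m-%d %H%M')
    except ValueError:
        continue
    if 1 <= zone <= 40 and CONTEST_START <= stamp < CONTEST_END:
        sys.stdout.write(line)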

The augmented logs contain the same information as the cleaned logs, with the addition of some useful information on each line. The MD5 checksum for the compressed (~800 MB) file of augmented logs is: 7a728987fb8637ab8c156df3fa27d582. The information added to each line now includes two new fields: the callsign copied by the second party in the case that the second party bust the call of the first party; and the correct callsign of the second party in the case that the first party bust the second party's call.

In all, the additional fields in the augmented file comprise:
  1. The letter "A" or "U" indicating "assisted" or "unassisted"
  2. A four-digit number representing the time of the contact in minutes measured from the start of the contest. (I realise that this can be calculated from the other information on the line, but it saves a lot of time to have the number readily available in the file without having to calculate it each time.)
  3. Band
  4. A set of eleven flags, each -- apart from column k -- encoded as T/F: 
    • a. QSO is confirmed by a log from the second party 
    • b. QSO is a reverse bust (i.e., the second party appears to have bust the call of the first party) 
    • c. QSO is an ordinary bust (i.e., the first party appears to have bust the call of the second party) 
    • d. the call of the second party is unique 
    • e. QSO appears to be a NIL 
    • f. QSO is with a station that did not send in a log, but who did make 20 or more QSOs in the contest 
    • g. QSO appears to be a country mult 
    • h. QSO appears to be a zone mult 
    • i. QSO is a zone bust (i.e., the received zone appears to be a bust)
    • j. QSO is a reverse zone bust (i.e. the second party appears to have bust the zone of the first party)
    • k. This entry has three possible values rather than just T/F:
      • T: QSO appears to be made during a run by the first party
      • F: QSO appears not to be made during a run by the first party
      • U: the run status is unknown because insufficient frequency information is available in the first party's log 
  5. If the QSO is a reverse bust, the call logged by the second party; otherwise, the placeholder "-"
  6. If the QSO is an ordinary bust, the correct call that should have been logged by the first party; otherwise, the placeholder "-"
  7. If the QSO is a reverse zone bust, the zone logged by the second party; otherwise, the placeholder "-"
  8.  If the QSO is an ordinary zone bust, the correct zone that should have been logged by the first party; otherwise, the placeholder "-"
Notes:
  • The encoding of some of the flags requires subjective decisions to be made as to whether the flag should be true or false; consequently, and because CQ has yet to understand the importance of making their scoring code public, the value of a flag for a specific QSO line in some circumstances might not match the value that CQ would assign. (Also, CQ has more data available in the form of check logs, which are not made public.)
  • I made no attempt to deduce the run status of a QSO in the second party's log (if such exists), regardless of the status in the first party's log. This allows one cleanly to perform correct statistical analyses anent the number of QSOs made by running stations merely by excluding QSOs marked with a U in column k.
  • No attempt is made to detect the case in which both participants of a QSO bust the other station's call. This is a problematic situation because of the relatively high probability of a false positive unless both stations log the frequency as opposed to the band. (Also, on bands on which split-frequency QSOs are common, the absence of both transmit and receive frequency is a problem.) Because of the likelihood of false positives, it seems better, given the presumed rarity of double-bust QSOs, that no attempt be made to mark them.
  • The entries for the zones in the case of zone or reverse zone busts are normalised to two-digit values.

On Reporting Bust Rates in Contests

The usual metric used for comparing the bust rates for different stations is very simple: if a station makes N QSOs, of which B are detected as busts (usually by comparing the station's log with the logs submitted by other competitors), then the bust rate R is given simply by:
$$R = B / N$$
Admirably simple though this is, it suffers from at least two defects that seem to me to be fatal.

1. Allowing for the number of checkable QSOs


Suppose that we have two stations, 1 and 2, and we wish to compare their bust rates. Suppose that both stations make 100 QSOs, and suppose further that they both bust 5 QSOs, (as determined by an inspection of the logs of other competitors); then the usual inference is that the bust rate for the two stations is the same.

But one cannot conclude that, because the usual equation does not take into account the fact that, of the 100 QSOs made by each of the two stations, a very different number might be with stations that submitted their logs to the contest sponsor in the two cases.

Suppose, for example, that all 100 of the QSOs made by the first station can be cross-checked with submitted logs, but only 50 of the QSOs made by the second station can be so checked. It is clear that the second station is likely not to be as good at copying calls as the first, and a better measure of the bust rate is obtained by using $N_V$, the number of verifiable QSOs, in place of the raw number $N$:
$$R = B / N_V$$
We can see whether this affects the ordering of, for example, the stations with the most busts in the CQ WW contest.

If we look at the stations with the most busts in the 2016 CQ WW CW contest, using the normal formula for calculating the percentage of busts, we find:

2016 CW -- Most Busts
Position Call QSOs Busts % Busts
1 PV8ADI 2,057 217 10.5
2 TK0C 12,521 171 1.4
3 LU2WA 2,048 169 8.3
4 TM1A 6,255 156 2.5
5 HG5F 3,062 147 4.8
6 HK1NA 14,037 145 1.0
7 CN2R 12,351 137 1.1
8 LZ9W 10,613 136 1.3
9 PI4CC 7,736 119 1.5
10 NP2P 3,842 116 3.0

Using the revised formula, this becomes:

2016 CW -- Most Busts
Position Call Verified QSOs Busts % Busts
1 PV8ADI 1,666 217 13.0
2 TK0C 9,664 171 1.8
3 LU2WA 1,596 169 10.6
4 TM1A 5,307 156 2.9
5 HG5F 2,638 147 5.6
6 HK1NA 10,108 145 1.4
7 CN2R 9,544 137 1.4
8 LZ9W 8,946 136 1.5
9 PI4CC 6,706 119 1.8
10 NP2P 3,162 116 3.7
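
For a single station, the difference between the two measures is a couple of lines of arithmetic. For example, using the PV8ADI figures from the tables above:

# Raw versus verified bust rate, using the PV8ADI figures from the tables above.
busts = 217
qsos = 2057             # all QSOs in the log
verified_qsos = 1666    # QSOs that can be cross-checked against a submitted log

print('raw rate:      {:.1%}'.format(busts / float(qsos)))            # 10.5%
print('verified rate: {:.1%}'.format(busts / float(verified_qsos)))   # 13.0%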

2. Allowing for sampling distortions


The second issue with the normal calculation is more subtle, and is perhaps most easily seen with an example. In order to make the problem clearer, we will use an extreme example, but it should readily be seen that the problem exists to a lesser degree when the situation is less extreme.

Suppose we have two stations, the first of which makes 100 verified QSOs and busts no calls. Obviously, his verified bust rate is 0%. Now suppose that the second station makes 1,000 verified QSOs and also busts no calls. The problem now should be obvious: both stations have the same bust rate, and yet it is clear that the second operator is almost certainly better at copying than the first. We need a way to deal with this kind of situation.

The solution is to realise what we are trying to calculate, and how it relates to the data available to us. The real goal of calculating the bust percentage is to produce a measurement (or, rather, an estimate) of the rate at which an operator makes mistakes in copying calls.

So we make the simplifying assumption that each operator has a probability $p$ of busting a single call (if you think about it, this is not quite as silly as it might seem, since although the value of $p$ will doubtless vary under different reception conditions even for a single operator, all those variations can be taken into account by changing the meaning of $p$ so that it is a kind of mean value that reflects the conditions prevailing at the operator's location; this adjustment will vary from location to location, but that does not really matter, since we aren't trying to estimate some kind of ideal bust rate for a perfect location, but, rather, the bust rate that actually prevails for each operator at that operator's location). Now we can easily analyse how to compare the accuracy of competing operators.

Suppose that an operator has a probability $p_B$ of busting a call (and, hence, a probability $q = (1 - p_B)$ of copying a call correctly). Now if the operator makes $N_V$ verified QSOs, then the probability of there being a particular number of busts $B$ is given by the binomial distribution:

$$ {N_V \choose B} (p_B)^B (q^{(N_V - B)}) $$

where:

$$ {N_V \choose B} = { {N_V}! \over {B!(N_V-B)!} }$$

Call the actual number of busts $B_V$; then we have:

$$ \rm prob(B_V) = {N_V \choose B_V} (p_{B_V})^{B_V} (q^{(N_V - B_V)}) $$

Now, since we know that $B_V$ busts were measured out of a total of $N_V$ QSOs, we can determine what the distribution of $p_{B_V}$ looks like. 

(From this point on, we will drop the ${}_V$ suffix, since we will take it as read that we are discussing verified numbers.)
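
The curves that follow can be generated with a few lines of Python. This is only a sketch of the calculation (not the code used to draw the published plots); the range and step of the $p$ grid are arbitrary choices:

#!/usr/bin/env python
# Sketch: relative likelihood of the per-QSO bust probability p, given B observed
# busts in N verified QSOs (binomial model). Values are scaled so that the most
# likely p (p = B/N) has relative likelihood 1.

from math import exp, lgamma, log

def log_binomial(p, B, N):
    log_choose = lgamma(N + 1) - lgamma(B + 1) - lgamma(N - B + 1)
    return log_choose + B * log(p) + (N - B) * log(1 - p)

B, N = 217, 1666                      # PV8ADI: busts, verified QSOs
peak = log_binomial(B / float(N), B, N)

for i in range(80, 181):              # p from 0.080 to 0.180 in steps of 0.001
    p = i / 1000.0
    print('{:.3f} {:.4f}'.format(p, exp(log_binomial(p, B, N) - peak)))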

Looking at the table above, and plotting the relative probability of obtaining the actual number of busts as a function of $p$ for the ten stations listed, we find:



 We normalise each curve, so that the area under each is the same:



A couple of things are apparent from this plot:
  1. We can see that there is a non-negligible overlap between the curves for PV8ADI and LU2WA. This tells us that, despite the apparent substantial difference in the measured error rates in the logs of the two stations, in fact there is a not-completely-negligible probability that the base error rate for LU2WA is actually higher than that for PV8ADI. (We could calculate the actual probability, but at this point I just want to point out that it's obviously of the order of a few percent.)
  2. Although the original rates for PI4CC and TK0C are essentially identical, (and, hence, the peaks of the two curves occur at the same value of $p$) the curve for PI4CC is slightly broader than the curve for TK0C. This is a reflection of the fact that TK0C had more QSOs, and corresponds to the observation above about the two hypothetical stations that had no errors but who logged different numbers of QSOs.
Let us take a brief diversion prior to taking the next step.

Suppose that we somehow knew that a station $S$ had a probability of 0.1 of busting each QSO. And let us suppose that $S$ makes a total of 1,000 QSOs. Then it is easy to plot the probability of the number of actual QSOs $S$ would bust over the course of the contest (this is just the situation in the first equation above):


Now let us change the situation slightly. Suppose that we know that the probability of $S$ busting a QSO is either 0.09 or 0.11, but we don't know which of these it is, and each is equally likely. We can see that the difference in the probability curve for the expected number of busts is significant (the black curve is for probability = 0.09, the red for probability = 0.11):



With this information to hand, what is our best guess for the actual probability curve -- where, by "best guess" we mean "minimising the error"? Now, we know that the actual curve will be either the black curve or the red one, but since we don't know which it will be, our best guess will be the mean of the two -- i.e., the green curve. (You might remember from secondary school statistics that this is called the "expectation" -- a name that can be a bit confusing, since we expect this curve to be wrong! Similarly, a statistician will tell you that the "expectation" or "expected value" when rolling a fair 6-sided die is 3.5.)
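
A minimal sketch of this two-curve example: with $N$ = 1,000 QSOs and $p$ equal to 0.09 or 0.11 (each equally likely), the best-guess curve is just the pointwise mean of the two binomial distributions.

#!/usr/bin/env python
# Sketch: distribution of the number of busts in 1,000 QSOs when the per-QSO bust
# probability is either 0.09 or 0.11 (equally likely); the best-guess curve is the
# mean of the two binomial distributions.

from math import exp, lgamma, log

def binomial_pmf(k, n, p):
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

N = 1000
for k in range(60, 151):              # plausible numbers of busts
    low = binomial_pmf(k, N, 0.09)
    high = binomial_pmf(k, N, 0.11)
    print('{} {:.6f} {:.6f} {:.6f}'.format(k, low, high, (low + high) / 2))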

Now suppose that we know that the probability of a bust lies somewhere between 0.09 and 0.11, with a uniform distribution. Then we have:


where the white represents all the curves with a bust probability between 0.09 and 0.11, the black represents the expectation values taking into account just the two extreme curves, for bust probabilities of 0.09 and 0.11, and the green is the expectation curve for the entire range of probabilities between 0.09 and 0.11, uniformly distributed.

This is almost the situation that pertains in the case we are examining, with the exception that the curves we saw before we started this digression show that the values of $p$ are not distributed uniformly over a range. (To a good approximation, they are gaussian, at least for the higher values of the ratio $B / N$, although we won't take advantage of that approximation.)

So, if we take the example of PV8ADI, we can create a plot that shows the relative probability of obtaining a particular number of busts, given the distribution of probabilities $p$ that lead to the observed number of busts, $B$, in a total of $N$ QSOs. (If you think about it, you might be able to see that what we are doing here is to account for a second order perturbation on the first-order results. The size of this perturbation increases as the ratio $B/N$ decreases, as does the asymmetry of the first-order curve. The general effect will be to smear the first-order results to become more spread out.)

 For PV8ADI, we find:

where the black line is the basic binomial distribution for PV8ADI, with probability 217 / 1666, normalised to the number of busts for 1,000 QSOs. The green line is a similarly normalised line that takes into account all the binomial distributions for the various values of $p$, weighted by the probability of each value of $p$. As predicted, we see that the green line is more spread-out than the original black line.
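
The smeared curve can be approximated numerically: take a grid of candidate values of $p$, weight each by its likelihood given the observed 217 busts in 1,666 verified QSOs, and average the binomial predictions for 1,000 QSOs using those weights. The sketch below does this with an arbitrary grid and an implicit flat prior on $p$, which may not match exactly how the published curves were produced:

#!/usr/bin/env python
# Sketch: predicted distribution of busts in 1,000 QSOs for PV8ADI, allowing for
# the uncertainty in p. Candidate values of p are weighted by their likelihood
# given the observed busts, and the binomial predictions for 1,000 QSOs are
# averaged with those weights.

from math import exp, lgamma, log

def binomial_pmf(k, n, p):
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

B, N = 217, 1666                      # observed busts, verified QSOs
N_PRED = 1000                         # normalise predictions to 1,000 QSOs

grid = [i / 10000.0 for i in range(800, 1801)]        # candidate values of p
weights = [binomial_pmf(B, N, p) for p in grid]
total = sum(weights)

for k in range(60, 201):
    plain = binomial_pmf(k, N_PRED, B / float(N))     # the un-smeared (black) curve
    smeared = sum(w * binomial_pmf(k, N_PRED, p)
                  for p, w in zip(grid, weights)) / total
    print('{} {:.6f} {:.6f}'.format(k, plain, smeared))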

We can subject each of the stations to the same treatment, normalising all results to 1,000 QSOs:

So what can we deduce from all this?

To start with, rather than simply quoting some kind of "bust rate", a range should be quoted for each station, representing, say, the 99% confidence limit for the rate.

If we do that, and reorder the stations in order of decreasing upper limit (which seems like the most reasonable ordering: it means that we are 99.5% sure that the actual bust rate is less than this number), then we find:

2016 CW -- 99% confidence limits for $p_B$
Position Call lower limit upper limit
1 PV8ADI 0.110 0.153
2 LU2WA 0.088 0.127
3 HG5F 0.045 0.068
4 NP2P 0.029 0.046
5 TM1A 0.024 0.036
6 PI4CC 0.014 0.022
7 TK0C 0.015 0.022
8 LZ9W 0.012 0.019
9 HK1NA 0.012 0.018
10 CN2R 0.012 0.018
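
Limits of this kind can be reproduced (approximately) by treating the likelihood curve for $p$ as a distribution and finding the values of $p$ that cut off 0.5% in each tail. The sketch below does this numerically under the assumption of a flat prior on $p$, which may not be exactly the method used for the table above:

#!/usr/bin/env python
# Sketch: approximate 99% limits for the per-QSO bust probability p, given B busts
# in N verified QSOs. The likelihood of p (flat prior assumed) is integrated
# numerically and the 0.5% and 99.5% points are located.

from math import exp, lgamma, log

def log_binomial(p, B, N):
    return (lgamma(N + 1) - lgamma(B + 1) - lgamma(N - B + 1)
            + B * log(p) + (N - B) * log(1 - p))

def limits(B, N, step=1e-4):
    grid = [i * step for i in range(1, int(1 / step))]
    weights = [exp(log_binomial(p, B, N)) for p in grid]
    total = sum(weights)
    cumulative, lower, upper = 0.0, None, None
    for p, w in zip(grid, weights):
        cumulative += w / total
        if lower is None and cumulative >= 0.005:
            lower = p
        if upper is None and cumulative >= 0.995:
            upper = p
    return lower, upper

print(limits(217, 1666))              # PV8ADI: roughly (0.110, 0.153)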

We can also perform more sophisticated comparisons between or among stations. For example, we might ask the question: what is the probability that LU2WA will have more busts than PV8ADI if both make 1,000 QSOs? [FYI, the answer is a little under 9%]
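
One way to estimate this kind of probability is by simulation: draw a plausible $p$ for each station from its likelihood curve, simulate 1,000 QSOs for each, and count how often LU2WA ends up with more busts than PV8ADI. A sketch (again assuming a flat prior on $p$, so each station's $p$ is drawn from a beta distribution) follows; it should give a figure in the same region as the one quoted above:

#!/usr/bin/env python
# Sketch: Monte Carlo estimate of the probability that LU2WA busts more QSOs than
# PV8ADI when both make 1,000 QSOs, allowing for the uncertainty in each station's
# per-QSO bust probability (flat prior assumed).

import random

def simulated_busts(B, N, n_qsos=1000):
    # Draw p from the posterior Beta(B + 1, N - B + 1), then simulate n_qsos QSOs.
    p = random.betavariate(B + 1, N - B + 1)
    return sum(1 for _ in range(n_qsos) if random.random() < p)

TRIALS = 10000
wins = sum(1 for _ in range(TRIALS)
           if simulated_busts(169, 1596) > simulated_busts(217, 1666))   # LU2WA vs PV8ADI
print('P(LU2WA busts more than PV8ADI) ~ {:.3f}'.format(wins / float(TRIALS)))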

The important point here is that the single number that is usually quoted for a station's bust rate leaves much to be desired, and, in particular, may not be useful if one intends to use it to make comparisons to other stations' bust rates, unless both stations have a similar number of (verified) QSOs. A table, graph or chart is generally a much more useful guide when comparing bust rates between or amongst stations.


2017-06-30

Continued Rationalisation and Sanity Checking of the LANL GPS Charged-Particle Dataset

 Earlier posts in this series:
We continue to rationalise the data by removing fields that are unused or provide no useful information, and marking invalid values.

Stage 21: mark invalid values of L-shell fields for the CXD experiment


I have been informed in a private communication that, for satellites ns53 through ns73, values of the L_shell (i.e., L_LGM_T89IGRF; see stage 14), L_LGM_TS04IGRF, L_LGM_OP77IGRF and L_LGM_T89CDIP fields that are set to 999 indicate a value that is "too large to calculate"; at the very least, L-shell values of more than ten or so would in any case seem to be too large to be useful. So we convert all instances of the value 999 to NA:

  for file in ns[567]*
  do
    awk '{for(i=29;i<=32;++i) if ($i=="9.990000e+02") $i="NA"; {print $0}}' $file > ../gps-stage-21/$file
  done

Sanity checks


At this point we have addressed what seem to me to be the most obvious problems with the datasets. However, the unfortunately large number of modifications that have been necessary suggests that it is worthwhile to perform basic sanity checks on the values of all the fields, in an attempt to uncover and correct other issues.

We start by considering, one at a time, the fields in the CXD datasets.

decimal_day


The documentation for decimal_day states:

Column Variable name type Dim. description
1 decimal_day double 1 GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years.

Reviewing the minimum and maximum values, we find:

decimal_day
Satellite Minimum Maximum
53 1.000394 366.998889
54 1.000405 366.999583
55 1.000000 366.998194
56 1.000972 366.998472
57 1.000000 366.997917
58 1.001250 366.999306
59 1.000532 366.999861
60 1.000278 366.998322
61 1.001782 366.999306
62 1.001424 366.999387
63 1.001157 366.999873
64 1.001215 365.999803
65 1.001389 366.998611
66 1.001667 366.998889
67 1.001458 366.998681
68 1.000660 366.998171
69 1.000868 366.999896
70 45.002049 366.997292
71 1.002720 366.999954
72 1.001470 366.998472
73 1.001910 366.999132

Pending more detailed review, these values look not unreasonable; there is therefore no need to change or mark any of the values.

However, we can edit the description of the field in order to make it a little clearer:

Column Variable name type Dim. description
1 decimal_day double 1 GPS time: a decimal number in the range [1, 367) in leap years or [1, 366) otherwise, representing the day of the year (1-Jan 00:00 to 31-Dec 24:00).

Geographic_Latitude


The documentation for Geographic_Latitude states:

Column Variable name type Dim. description
2 Geographic_Latitude double 1 Latitude of satellite (deg)

Reviewing the minimum and maximum values, we find:

Geographic_Latitude
Satellite Minimum Maximum
53 -5.607510e+01 5.607590e+01
54 -5.527530e+01 5.527490e+01
55 -5.488720e+01 5.488530e+01
56 -5.678880e+01 5.679100e+01
57 -5.613660e+01 5.613600e+01
58 -5.676120e+01 5.676080e+01
59 -5.591630e+01 5.591610e+01
60 -5.567280e+01 5.567290e+01
61 -5.500000e+01 5.486250e+01
62 -5.608950e+01 5.608750e+01
63 -5.537850e+01 5.537940e+01
64 -5.499830e+01 5.499820e+01
65 -5.497930e+01 5.497810e+01
66 -5.578240e+01 5.578390e+01
67 -5.536860e+01 5.536670e+01
68 -5.498410e+01 5.498420e+01
69 -5.500030e+01 5.500240e+01
70 -5.502090e+01 5.501900e+01
71 -5.505770e+01 5.505750e+01
72 -5.531860e+01 5.531810e+01
73 -5.501500e+01 5.501500e+01

These values look reasonable (i.e., the absolute values of the maxima and minima are roughly equal; and, of course, they correspond to the inclination of the satellites). However, the data could be more sensibly reported as numbers with four decimal places: this would be more compact and would more accurately represent the precision of the numbers. So that suggests:

Stage 22: reformat values of Geographic_Latitude


We can accomplish the reformatting easily, by applying the following script, do-stage-22.py,  to each file:

#!/usr/bin/env python
# -*- coding: utf8 -*-

import re
import sys

filename = sys.argv[1]

with open(filename) as records:
  for line in records:
    fields = line.split()
    value_str = fields[1]      # latitude
    value = float(value_str)
  
    new_value_str = '{:10.4f}'.format(value)
  
    fields[1] = new_value_str
    newline = " ".join(fields)
 
    print newline


Applying this to each file:

  for file in ns[567]*
  do
    ./do-stage-22.py $file | tr -s " " | sed 's/ $//' > ../gps-stage-22/$file
  done

The resulting maxima and minima now look like this:

Geographic_Latitude
Satellite Minimum Maximum
53 -56.0751 56.0759
54 -55.2753 55.2749
55 -54.8872 54.8853
56 -56.7888 56.7910
57 -56.1366 56.1360
58 -56.7612 56.7608
59 -55.9163 55.9161
60 -55.6728 55.6729
61 -55.0000 54.8625
62 -56.0895 56.0875
63 -55.3785 55.3794
64 -54.9983 54.9982
65 -54.9793 54.9781
66 -55.7824 55.7839
67 -55.3686 55.3667
68 -54.9841 54.9842
69 -55.0003 55.0024
70 -55.0209 55.0190
71 -55.0577 55.0575
72 -55.3186 55.3181
73 -55.0150 55.0150

We can also improve the clarity of the description:

Column Variable name type Dim. description
2 Geographic_Latitude double 1 Latitude of satellite (°, N +ve)

Geographic_Longitude


The documentation for Geographic_Longitude states

Column Variable name type Dim. description
3 Geographic_Longitude double 1 Longitude of satellite (deg)

Reviewing the minimum and maximum values, we find:

Geographic_Longitude
Satellite Minimum Maximum
53 -1.799853e+02 3.599998e+02
54 -1.799993e+02 3.599999e+02
55 6.000000e-04 3.599999e+02
56 -1.799998e+02 3.600000e+02
57 1.000000e-04 3.599983e+02
58 -1.798043e+02 3.599996e+02
59 -1.799988e+02 3.599999e+02
60 -1.800000e+02 3.600000e+02
61 -1.799970e+02 3.599998e+02
62 2.000000e-03 3.599986e+02
63 1.700000e-03 3.599990e+02
64 0.000000e+00 3.599998e+02
65 9.000000e-04 3.599995e+02
66 2.900000e-03 3.599995e+02
67 5.000000e-04 3.599987e+02
68 2.000000e-04 3.600000e+02
69 5.000000e-04 3.599997e+02
70 2.500000e-03 3.599793e+02
71 3.900000e-03 3.599997e+02
72 2.100000e-03 3.599992e+02
73 1.000000e-03 3.599954e+02

These numbers raise two issues:

1. Not all values are correctly mapped to the principal domain (i.e., [0, 360))
2. The accuracy to which a value is reported is a function of its value.

The second of these is not unreasonable when reporting some kinds of values (for example, counts), but it makes no sense when applied to geographic longitude. (If this isn't obvious, think about the difference in accuracy for the two values 0+ε and 0-ε; or, in the alternative, consider that the reported accuracy will change under a mere rotation transformation.)

Thus we are led to:

Stage 23: reformat values of Geographic_Longitude


We can accomplish the reformatting easily, by applying the following script, do-stage-23.py,  to each file:

#!/usr/bin/env python
# -*- coding: utf8 -*-

import re
import sys

filename = sys.argv[1]

with open(filename) as records:
  for line in records:
    fields = line.split()
    value_str = fields[2]      # longitude
    value = float(value_str)
   
    if value < 0:
      value += 360
     
    if value >= 360:
      value -= 360
     
    new_value_str = '{:9.4f}'.format(value)
   
    fields[2] = new_value_str
    newline = " ".join(fields)
  
    print newline


Applying this to each file:

  for file in ns[567]*
  do
    ./do-stage-23.py $file | tr -s " " | sed 's/ $//' > ../gps-stage-23/$file
  done

The resulting maxima and minima now look like this:

Geographic_Longitude
Satellite Minimum Maximum
53 0.0000 359.9998
54 0.0002 359.9999
55 0.0006 359.9999
56 0.0000 359.9999
57 0.0001 359.9983
58 0.0010 359.9996
59 0.0001 359.9999
60 0.0000 359.9999
61 0.0015 359.9998
62 0.0020 359.9986
63 0.0017 359.9990
64 0.0000 359.9998
65 0.0009 359.9995
66 0.0029 359.9995
67 0.0005 359.9987
68 0.0000 359.9965
69 0.0005 359.9997
70 0.0025 359.9793
71 0.0039 359.9997
72 0.0021 359.9992
73 0.0010 359.9954

We can also improve the clarity of the description:

Column Variable name type Dim. description
3 Geographic_Longitude double 1 Longitude of satellite (°, E +ve, measured from Greenwich meridian)

Rad_Re


The documentation for Rad_Re states:

Column Variable name type Dim. description
4 Rad_Re double 1 (radius of satellite)/Rearth

Reviewing the minimum and maximum values, we find:

Rad_Re
Satellite Minimum Maximum
53 4.120330e+00 4.217263e+00
54 4.096529e+00 4.241498e+00
55 4.131586e+00 4.206511e+00
56 4.131644e+00 4.206146e+00
57 4.152958e+00 4.184506e+00
58 4.142916e+00 4.194755e+00
59 4.120665e+00 4.218718e+00
60 4.122245e+00 4.215404e+00
61 4.100667e+00 4.237281e+00
62 4.144375e+00 4.209335e+00
63 4.143611e+00 4.213319e+00
64 4.158991e+00 4.213424e+00
65 4.146987e+00 4.206427e+00
66 4.151798e+00 4.186115e+00
67 4.166345e+00 4.215303e+00
68 4.165405e+00 4.214924e+00
69 4.164891e+00 4.173357e+00
70 4.166151e+00 4.211028e+00
71 4.162719e+00 4.175120e+00
72 4.159129e+00 4.178379e+00
73 4.160420e+00 4.217236e+00

While these values look fine, the use of scientific notation wastes space. Accordingly:

Stage 24: reformat values of Rad_Re


We can accomplish the reformatting easily, by applying the following script, do-stage-24.py,  to each file:

#!/usr/bin/env python
# -*- coding: utf8 -*-

import re
import sys

filename = sys.argv[1]

FIELD_NR = 3

with open(filename) as records:
  for line in records:
    fields = line.split()
    value_str = fields[FIELD_NR]      # Rad_Re
    value = float(value_str)
   
    new_value_str = '{:10.6f}'.format(value)
   
    fields[FIELD_NR] = new_value_str
    newline = " ".join(fields)
  
    print newline


Applying this to each file:

  for file in ns[567]*
  do
    ./do-stage-24.py $file | tr -s " " | sed 's/ $//' > ../gps-stage-24/$file
  done

The resulting maxima and minima look like this:

Rad_Re
Satellite Minimum Maximum
53 4.120330 4.217263
54 4.096529 4.241498
55 4.131586 4.206511
56 4.131644 4.206146
57 4.152958 4.184506
58 4.142916 4.194755
59 4.120665 4.218718
60 4.122245 4.215404
61 4.100667 4.237281
62 4.144375 4.209335
63 4.143611 4.213319
64 4.158991 4.213424
65 4.146987 4.206427
66 4.151798 4.186115
67 4.166345 4.215303
68 4.165405 4.214924
69 4.164891 4.173357
70 4.166151 4.211028
71 4.162719 4.175120
72 4.159129 4.178379
73 4.160420 4.217236

We can also improve the clarity of the description:

Column Variable name type Dim. description
4 Rad_Re double 1 Distance from centre of Earth, in units of Earth radii.

Stage 25: reformat values of LEP_thresh


The documentation for LEP_thresh states:

Column Variable name type Dim. description
21 LEP_thresh double 1 LEP threshold in E1 channels (0 means low, 1 means high)

Checking the actual values in the data files shows that the values are, as expected, always either zero or one. However, as the documentation states, the values are stored as doubles. It seems silly to store a simple integer value as a double, so we can reformat this field so that it explicitly contains an integer. Further, I have received via private communication an explanation of the meaning of zero and one in this field: zero means that the threshold is 75 keV, and one means that it is 120 keV. So we can write these values (as integers) directly into the data files:

  for file in ns[567]*
  do 
    awk '$21=="0.000000e+00" {$21="75"}; $21=="1.000000e+00" {$21="120"}; \
 {print $0};' $file > ../gps-stage-25/$file
  done

We can also amend the documentation for the field:

Column Variable name type Dim. description
21 LEP_thresh int 1 LEP threshold in E1 channels, in keV


Time to checkpoint the dataset again. The checkpoint file has the MD5 checksum: 00b632a911316fa9e44093450f56c3cb.

The data table for ns41 and ns48 still looks like this:


Column Variable name type Dim. Description
1 decimal_day double 1 GPS time -- a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years
2 Geographic_Latitude double 1 Latitude of satellite (deg)
3 Geographic_Longitude double 1 Longitude of satellite (deg)
4 Rad_Re double 1 (radius of satellite)/Rearth
5-12 rate_electron_measured double 8 Measured rate (Hz) in each of the 8 BDD electron channels (E1-E8)
13-20 rate_proton_measured double 8 Measured rate (Hz) in each of the 8 BDD proton channels (P1-P8)
21 collection_interval int 1 dosimeter collection period (seconds)
22 year int 1 year (e.g. 2015)
23 decimal_year double 1 decimal year = year + (decimal_day-1.0)/(days in year)
24 svn_number int 1 SVN number of satellite
25 b_coord_radius double 1 radius from earth's dipole axis (earth radii)
26 b_coord_height double 1 height above the earth's dipole equatorial plane (earth radii)
27 magnetic_longitude double 1 Magnetic longitude (degrees)
28 L_shell double 1 L_shell (earth radii) -- I do not clearly understand the origin of the calculation, but it seems to be a dipole field/T-89
29 bfield_ratio double 1 Bsatellite/Bequator
30 local_time double 1 magnetic local time (0-24 hours)
31 b_sattelite double 1 B field at satellite (gauss)
32 b_equator double 1 B field at equator (on this field line I think) (gauss)
33-40 electron_background double 8 estimated background in electron channels E1-E8 (Hz)
41-48 proton_background double 8 estimated background in proton channels P1-P8 (Hz)
49 proton_activity int 1 =1 if there is significant proton activity
50 electron_temperature double 1 electron temperature from a one Maxwellian fit (MeV)
51 electron_density_fit double 1 electron number density from a one Maxwellian fit (cm-3)
52-59 model_counts_electron_fit double 8 E1-E8 rates from the 2-parameter Maxwellian fit to the electron data
60-67 dtc_counts_electron double 8 Dead time corrected electron rates (from data, not fit)
68-97 integral_flux_instrument double 30 (based on 2 parameter Maxwellian fit) integral of electron flux above integral_flux_energy[i] particles/(cm2sec)
98-127 integral_flux_energy double 30 energies for the integral of integral_flux_instrument (MeV)
128-142 electron_diff_flux_energy double 15 energies for the fluxes in electron_diff_flux_energy (MeV)
143-157 electron_diff_flux double 15 (based on 2 parameter Maxwellian fit) electron flux at energies electron_diff_flux[i] (particle/(cm2 sr MeV sec))


And for the remaining satellites we now have:


Column Variable name type Dim. description
1 decimal_day double 1 GPS time: a decimal number in the range [1, 367) in leap years or [1, 366) otherwise, representing the day of the year (1-Jan 00:00 to 31-Dec 24:00).
2 Geographic_Latitude double 1 Latitude of satellite (°, N +ve)
3 Geographic_Longitude double 1 Longitude of satellite (°, E +ve, measured from Greenwich meridian)
4 Rad_Re double 1 Distance from centre of Earth, in units of Earth radii.
5-15 rate_electron_measured double 11 Measured rate (Hz) in each of the 11 CXD electron channels
16-20 rate_proton_measured double 5 Measured rate (Hz) in each of the 5 CXD proton channels (P1-P5)
21 LEP_thresh int 1 LEP threshold in E1 channels, in keV
22 collection_interval int 1 dosimeter collection period (seconds)
23 year int 1 year (e.g. 2015)
24 decimal_year double 1 decimal year = year + (decimal_day-1.0)/(days in year)
25 SVN_number int 1 SVN number of satellite
26 b_coord_radius double 1 radius from earth's dipole axis (earth radii)
27 b_coord_height double 1 height above the earth's dipole equatorial plane (earth radii)
28 magnetic_longitude double 1 Magnetic longitude (degrees)
29 L_shell double 1 L shell: McIlwain calculation according to model with T89 External Field, IGRF Internal Field.
30 L_LGM_TS04IGRF double 1 LanlGeoMag L-shell McIlwain calculation, TS04 External Field, IGRF Internal Field.
31 L_LGM_OP77IGRF double 1 LanlGeoMag L-shell McIlwain calculation, OP77 External Field, IGRF Internal Field (not currently filled)
32 L_LGM_T89CDIP double 1 LanlGeoMag L-shell McIlwain calculation, T89 External Field, Centered Dipole Internal Field
33 bfield_ratio double 1 Bsatellite/Bequator
34 local_time double 1 magnetic local time (0-24 hours)
35 utc_lgm double 1 UTC (0-24 hours)
36 b_sattelite double 1 B field at satellite (gauss)
37 b_equator double 1 B field at equator (on this field line I think) (gauss)
38-48 electron_background double 11 estimated background in electron channels E1-E11 (Hz)
49-53 proton_background double 5 estimated background in proton channels P1-P5 (Hz)
54 proton_activity int 1 =1 if there is significant proton activity
55 proton_temperature_fit double 1 characteristic momentum -- R0 in the expression given above (MeV/c)
56 proton_density_fit double 1 N0 parameter in fit to proton flux ((protons/(cm2 sec sr MeV))
57 electron_temperature_fit double 1 electron temperature from a one Maxwellian fit (MeV)
58 electron_density_fit double 1 electron number density from a one Maxwellian fit (cm-3)
59-63 model_counts_proton_fit_pf double 5 P1-P5 rate from proton fit (using proton_temperature_fit, proton_density_fit)
64-74 model_counts_electron_fit double 11 E1-E11 rates from the 9-parameter electron flux model
75-80 proton_integrated_flux_fit double 6 integral of proton flux (based on fit) above 10, 15.85, 25.11, 30, 40, 79.43 MeV (proton kinetic energy)
81-110 integral_flux_instrument double 30 (based on 9 parameter fit) integral of electron flux above integral_flux_energy[i] particles/(cm2 sec)
111-140 integral_flux_energy double 30 energies for the integral of integral_flux_instrument (MeV)
141-155 electron_diff_flux_energy double 15 energies for the fluxes in electron_diff_flux_energy (MeV)
156-170 electron_diff_flux double 15 (based on 9 parameter fit) electron flux at energies electron_diff_flux[i] (particle/(cm2 sr MeV sec))
171-179 Efitpars double 9 fit parameters for 9 parameter electron fit