## 2017-02-27

### Versions of the LANL GPS Charged-Particle Dataset

The original dataset of GPS charged-particle measurements is quite cumbersome, comprising many thousands of individual files. It is therefore useful to convert the data into a more manageable schema, and also to remove what appears to be unnecessary and invalid information from the dataset.

For simplicity and ease of reference, I number each stage of the process so that others can more easily reproduce the various versions of the data. The initial complete dataset is stage 0.

#### Stage 1: All the data for each satellite in a single file

Because it is usually easier to operate on a single file for each satellite, we start by combining all the files for each satellite. Inside the directory for each satellite, ns<nn>, execute:

```shell
for file in *ascii; do grep -v "#" $file >> stage-1/ns<nn>; done
```

| Satellite | Records | Start (YYYY-DOY) | End (YYYY-DOY) |
|-----------|-----------|----------|----------|
| ns41 | 2,078,270 | 2001-007 | 2016-366 |
| ns48 | 1,178,405 | 2008-083 | 2016-366 |
| ns53 | 1,487,237 | 2005-275 | 2016-366 |
| ns54 | 2,153,770 | 2001-049 | 2016-366 |
| ns55 | 1,609,281 | 2007-308 | 2017-001 |
| ns56 | 1,846,892 | 2003-040 | 2016-366 |
| ns57 | 1,186,706 | 2008-013 | 2017-001 |
| ns58 | 1,333,278 | 2006-337 | 2016-366 |
| ns59 | 1,696,926 | 2004-088 | 2016-366 |
| ns60 | 1,650,202 | 2004-193 | 2016-366 |
| ns61 | 1,610,038 | 2004-319 | 2016-366 |
| ns62 | 873,766 | 2010-164 | 2016-366 |
| ns63 | 740,301 | 2011-198 | 2016-366 |
| ns64 | 379,422 | 2014-054 | 2016-338 |
| ns65 | 528,191 | 2012-288 | 2016-366 |
| ns66 | 478,259 | 2013-153 | 2016-366 |
| ns67 | 361,540 | 2014-138 | 2016-366 |
| ns68 | 315,369 | 2014-222 | 2016-366 |
| ns69 | 303,676 | 2014-320 | 2016-366 |
| ns70 | 132,460 | 2016-038 | 2016-366 |
| ns71 | 240,660 | 2015-088 | 2016-366 |
| ns72 | 214,410 | 2015-200 | 2016-366 |
| ns73 | 171,485 | 2015-305 | 2016-366 |

#### Stage 2: Remove all records marked as bad

For some reason, the dataset includes records that are known to be bad. A bad record is signalled by a value of unity for the dropped_data field (the documentation for this field says: "if =1 it means something is wrong with the data record, do not use it"). So we remove all records for which this field has the value one.

For ns41 and ns48:

```shell
awk 'BEGIN{FS=" "}; $25!="1" { print }' < stage-1/ns<nn> > stage-2/ns<nn>
```
For all other satellites:
```shell
awk 'BEGIN{FS=" "}; $26!="1" { print }' < stage-1/ns<nn> > stage-2/ns<nn>
```

| Satellite | Stage 2 Records | % Good Stage 1 Records |
|-----------|-----------------|------------------------|
| ns41 | 1,991,249 | 95.8 |
| ns48 | 1,105,991 | 93.8 |
| ns53 | 1,331,687 | 89.5 |
| ns54 | 1,939,151 | 90.0 |
| ns55 | 1,055,328 | 65.6 |
| ns56 | 1,680,887 | 91.0 |
| ns57 | 1,082,626 | 91.2 |
| ns58 | 1,175,519 | 88.2 |
| ns59 | 1,516,528 | 89.4 |
| ns60 | 1,495,541 | 90.6 |
| ns61 | 1,470,445 | 91.3 |
| ns62 | 775,535 | 88.8 |
| ns63 | 652,110 | 88.1 |
| ns64 | 344,702 | 90.8 |
| ns65 | 480,935 | 91.0 |
| ns66 | 446,801 | 93.4 |
| ns67 | 327,994 | 90.7 |
| ns68 | 306,513 | 97.2 |
| ns69 | 262,992 | 86.6 |
| ns70 | 110,971 | 83.8 |
| ns71 | 221,336 | 92.0 |
| ns72 | 182,332 | 85.0 |
| ns73 | 145,858 | 85.0 |

#### Stage 3: Remove all records marked with invalid day of year

Ideally, the dataset would now contain only valid data; however, this turns out not to be the case. For example, the first value in each record is identified as the day of the year (decimal_day). The documentation for this field says: "GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years". However, a simple sanity test shows that there are records that contain negative values for this field:

```shell
[HN:gps-stage-2] awk 'BEGIN{FS=" "}; $1<=0 { print }' < stage-2/ns41 | wc -l
177

[HN:gps-stage-2]
```
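The same sanity check can be sketched in Python. The helper below and the sample records are illustrative inventions, not lines from the dataset, but the rule matches the awk one-liner: flag any record whose first field (decimal_day) is not positive.

```python
def count_invalid_day(lines):
    """Count records whose first field (decimal_day) is zero or negative,
    mirroring the awk test $1<=0."""
    count = 0
    for line in lines:
        fields = line.split()
        if fields and float(fields[0]) <= 0:
            count += 1
    return count

# Invented whitespace-separated records; only the first field matters here.
sample = [
    "12.345000 2.400000e+02 2001",   # plausible decimal_day
    "-0.991056 2.400000e+02 2001",   # negative decimal_day: invalid
    "366.50000 2.400000e+02 2004",
]
print(count_invalid_day(sample))  # → 1
```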

Accordingly, we can drop all records that have an invalid value for the day of year:

For ns41 and ns48:

```shell
awk 'BEGIN{FS=" "}; ($1>=1 && (($22 == 2004 || $22 == 2008 || $22 == 2012 || \
$22 == 2016) ? $1<=367 : $1<=366)) { print }' < ns<nn> > ../stage-3/ns<nn>
```

For all other satellites (this appears to be a null operation; nevertheless it is worth executing the filter so as to be certain that the values are in the acceptable range):

```shell
awk 'BEGIN{FS=" "}; ($1>=1 && (($23 == 2004 || $23 == 2008 || $23 == 2012 || \
$23 == 2016) ? $1<=367 : $1<=366)) { print }' < ns<nn> > ../stage-3/ns<nn>
```
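The awk filters enumerate only the leap years that occur in the data span (2004, 2008, 2012, 2016). As a cross-check, the same bounds can be expressed with the general Gregorian leap-year rule, which agrees for those years; the function names below are my own, not part of the dataset tooling.

```python
def max_decimal_day(year):
    """Upper bound for decimal_day: 367 in leap years, 366 otherwise
    (the general form of the ternary in the awk filters)."""
    leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    return 367 if leap else 366

def day_is_valid(decimal_day, year):
    """A record passes Stage 3 if 1 <= decimal_day <= max_decimal_day(year)."""
    return 1 <= decimal_day <= max_decimal_day(year)

print(day_is_valid(366.9, 2016))  # → True  (leap year, bound is 367)
print(day_is_valid(366.9, 2015))  # → False (bound is 366)
```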

#### Stage 4: Correct time information

Four fields pertain to time information:

```
decimal_day          double  1  GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years.
collection_interval  double  1  dosimeter collection period (seconds)
year                 int     1  year (e.g. 2015)
decimal_year         double  1  decimal year = year + (decimal_day-1.0)/(days in year)
```

Unfortunately, the actual values recorded in these fields are sometimes problematic:
1. The decimal_day value is recorded with a precision of only 0.000683 days, or 0.0864 seconds. Since the value is recorded in days rather than seconds, even if the records are actually acquired according to a clock whose period is an integral number of seconds, the precise value of the decimal_day field will jitter slightly around the correct time because of the quantisation error.
2. The decimal_year value is not useful because it lacks sufficient precision. For example, the first two records of data from ns41 both contain the value 2.001016e+03 for this field, despite being acquired roughly 4 minutes apart. Therefore, this field cannot be used as an accurate representation of the time of the record. Since a monotonically increasing value that indicates time is undoubtedly useful, the values in the dataset should be replaced with corrected values carrying more significant figures.
3. Inspection of the value of collection_interval suggests that it is intended to be an integer (despite being represented as a floating point number) taken from the set { 24, 120, 240, 4608 }. On one occasion, however, it has the value 367.0062. This I assume to be an error (the decimal_day value is consistent with a value of 240). Nevertheless, the presence of what seems to be an unreasonable number for this field suggests the need to filter the dataset further, so as to retain only those records with a value of collection_interval taken from the set of expected values.
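The decimal_year problem (item 2 above) is easy to reproduce numerically: at a six-digit mantissa, two records one 240-second collection interval apart format to the same value. The times below are invented for illustration, not taken from the dataset.

```python
SECONDS_PER_YEAR = 365 * 86400      # 2001 is not a leap year
t0 = 2001.0 + 5.5 / 365             # invented decimal_year, early in 2001
t1 = t0 + 240 / SECONDS_PER_YEAR    # one 240-second collection interval later

# At six digits after the decimal point of the mantissa, the two
# values are indistinguishable:
print("%e" % t0)  # → 2.001015e+03
print("%e" % t1)  # → 2.001015e+03
```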
The presence of occasional illegitimate values for fields such as decimal_day and collection_interval suggests that the data were not subjected to checking before being made available to the public (despite the presence of the dropped_data field; I have seen no documentation that describes what particular errors would cause this field to be set to 1). This is unfortunate, as it means that users have to process the data to remove errors before being able to move on to perform a defensible analysis based on the data.

Accordingly, our next step is to deal with the third item in the above list: to filter the data and retain only those records with values of collection_interval taken from the set { 24, 120, 240, 4608 }. Although I noticed the occurrence of only one record with an invalid value, it is worthwhile to process the complete dataset so as to be certain that only legitimate values remain.

For ns41 and ns48:

```shell
awk 'BEGIN{FS=" "}; $21=="2.400000e+01" || $21=="1.200000e+02" || \
$21=="2.400000e+02" || $21=="4.608000e+03" { print }' < ns<nn> > ../stage-4/ns<nn>
```

And for the remaining spacecraft:

```shell
awk 'BEGIN{FS=" "}; $22=="2.400000e+01" || $22=="1.200000e+02" || \
$22=="2.400000e+02" || $22=="4.608000e+03" { print }' < ns<nn> > ../stage-4/ns<nn>
```
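Note that the awk commands match the literal string representation of the field. A numeric membership test, sketched below in Python, is slightly more permissive: it assumes the field always parses as a float. The field index and the sample records are illustrative only.

```python
# The set of expected collection_interval values, as numbers.
VALID_INTERVALS = {24.0, 120.0, 240.0, 4608.0}

def has_valid_interval(record, field_index):
    """Return True if the whitespace-separated field at field_index
    (0-based: 20 for ns41/ns48, i.e. awk's $21; 21 for the others)
    is one of the expected collection intervals."""
    return float(record.split()[field_index]) in VALID_INTERVALS

# Invented records with collection_interval as the third field:
print(has_valid_interval("1.5 2001 2.400000e+02", 2))  # → True
print(has_valid_interval("1.5 2001 3.670062e+02", 2))  # → False (the stray 367.0062)
```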

As a kind of checkpoint, I have uploaded the stage 4 dataset. The MD5 checksum is 657e03a3fdbae517bd8eccf5f208b792.

For the record, here are the number of records for each satellite at the end of Stage 4 (the last couple of stages of filtering have affected only ns41):

| Satellite | Stage 4 Records |
|-----------|-----------------|
| ns41 | 1,991,069 |
| ns48 | 1,105,991 |
| ns53 | 1,331,687 |
| ns54 | 1,939,151 |
| ns55 | 1,055,328 |
| ns56 | 1,680,887 |
| ns57 | 1,082,626 |
| ns58 | 1,175,519 |
| ns59 | 1,516,528 |
| ns60 | 1,495,541 |
| ns61 | 1,470,445 |
| ns62 | 775,535 |
| ns63 | 652,110 |
| ns64 | 344,702 |
| ns65 | 480,935 |
| ns66 | 446,801 |
| ns67 | 327,994 |
| ns68 | 306,513 |
| ns69 | 262,992 |
| ns70 | 110,971 |
| ns71 | 221,336 |
| ns72 | 182,332 |
| ns73 | 145,858 |

Further processing of the dataset will be described in subsequent posts.