For simplicity and ease of reference, I number each stage of the process so that others can more easily reproduce the various versions of the data. The initial complete dataset is stage 0.
Stage 1: All the data for each satellite in a single file
Because it is usually easier to operate on a single file for each satellite, we start by combining all the files for each satellite. Inside the directory for each satellite, ns<nn>, execute:

for file in *ascii; do grep -v "#" $file >> stage-1/ns<nn>; done
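The whole of stage 1 can also be run in a single pass over all the satellites. This is a minimal sketch, assuming the satellite directories ns41…ns73 and the stage-1 output directory all sit in the current directory:

# Combine all per-day ASCII files for every satellite, stripping comment lines.
for sat in ns*; do
    for file in "$sat"/*ascii; do
        grep -v '#' "$file" >> stage-1/"$sat"
    done
done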
Satellite | Records | Start (YYYY-DOY) | End (YYYY-DOY) |
---|---|---|---|
ns41 | 2,078,270 | 2001-007 | 2016-366 |
ns48 | 1,178,405 | 2008-083 | 2016-366 |
ns53 | 1,487,237 | 2005-275 | 2016-366 |
ns54 | 2,153,770 | 2001-049 | 2016-366 |
ns55 | 1,609,281 | 2007-308 | 2017-001 |
ns56 | 1,846,892 | 2003-040 | 2016-366 |
ns57 | 1,186,706 | 2008-013 | 2017-001 |
ns58 | 1,333,278 | 2006-337 | 2016-366 |
ns59 | 1,696,926 | 2004-088 | 2016-366 |
ns60 | 1,650,202 | 2004-193 | 2016-366 |
ns61 | 1,610,038 | 2004-319 | 2016-366 |
ns62 | 873,766 | 2010-164 | 2016-366 |
ns63 | 740,301 | 2011-198 | 2016-366 |
ns64 | 379,422 | 2014-054 | 2016-338 |
ns65 | 528,191 | 2012-288 | 2016-366 |
ns66 | 478,259 | 2013-153 | 2016-366 |
ns67 | 361,540 | 2014-138 | 2016-366 |
ns68 | 315,369 | 2014-222 | 2016-366 |
ns69 | 303,676 | 2014-320 | 2016-366 |
ns70 | 132,460 | 2016-038 | 2016-366 |
ns71 | 240,660 | 2015-088 | 2016-366 |
ns72 | 214,410 | 2015-200 | 2016-366 |
ns73 | 171,485 | 2015-305 | 2016-366 |
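The record counts and date ranges in the table can be reproduced directly from the stage-1 files. A sketch, run from the directory containing stage-1, which assumes the concatenated records are in chronological order and that the year is in field 22 for ns41 and ns48 and field 23 for the other satellites (as in the stage 3 commands below):

# For each satellite: line count, plus year and (truncated) day of year
# of the first and last records.
for sat in stage-1/ns*; do
    name=$(basename "$sat")
    case $name in
        ns41|ns48) yf=22 ;;
        *)         yf=23 ;;
    esac
    records=$(wc -l < "$sat")
    start=$(head -n 1 "$sat" | awk -v y="$yf" '{ printf "%d-%03d", $y, int($1) }')
    end=$(tail -n 1 "$sat" | awk -v y="$yf" '{ printf "%d-%03d", $y, int($1) }')
    echo "$name $records $start $end"
done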
Stage 2: Remove all records marked as bad
For some reason, the dataset includes records that are known to be bad. A bad record is signalled by a value of unity for the dropped-data field (the documentation for this field says: if =1 it means something is wrong with the data record, do not use it). So we remove all records for which this field has the value one.

For ns41 and ns48:

awk 'BEGIN{FS=" "}; $25!="1" { print }' < stage-1/ns<nn> > stage-2/ns<nn>

For all other satellites:
awk 'BEGIN{FS=" "}; $26!="1" { print }' < stage-1/ns<nn> > stage-2/ns<nn>
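Since the only difference between the two commands is the column that holds the dropped-data flag, stage 2 can also be run as one loop. A sketch, run from the directory that contains both stage-1 and stage-2:

# Drop records flagged as bad; the dropped-data flag is field 25 for
# ns41 and ns48 and field 26 for the other satellites.
for sat in stage-1/ns*; do
    name=$(basename "$sat")
    case $name in
        ns41|ns48) dd=25 ;;
        *)         dd=26 ;;
    esac
    awk -v dd="$dd" 'BEGIN{FS=" "}; $dd!="1" { print }' < "$sat" > stage-2/"$name"
done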
Satellite | Stage 2 Records | % of Stage 1 Records Retained |
---|---|---|
ns41 | 1,991,249 | 95.8 |
ns48 | 1,105,991 | 93.8 |
ns53 | 1,331,687 | 89.5 |
ns54 | 1,939,151 | 90.0 |
ns55 | 1,055,328 | 65.6 |
ns56 | 1,680,887 | 91.0 |
ns57 | 1,082,626 | 91.2 |
ns58 | 1,175,519 | 88.2 |
ns59 | 1,516,528 | 89.4 |
ns60 | 1,495,541 | 90.6 |
ns61 | 1,470,445 | 91.3 |
ns62 | 775,535 | 88.8 |
ns63 | 652,110 | 88.1 |
ns64 | 344,702 | 90.8 |
ns65 | 480,935 | 91.0 |
ns66 | 446,801 | 93.4 |
ns67 | 327,994 | 90.7 |
ns68 | 306,513 | 97.2 |
ns69 | 262,992 | 86.6 |
ns70 | 110,971 | 83.8 |
ns71 | 221,336 | 92.0 |
ns72 | 182,332 | 85.0 |
ns73 | 145,858 | 85.0 |
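The percentages in the table are simply the stage 2 line count expressed as a fraction of the stage 1 line count; they can be reproduced with something along these lines, run from the directory containing the stage-1 and stage-2 directories:

# Report, for each satellite, the stage 2 record count and the
# percentage of stage 1 records that survived the filter.
for sat in stage-2/ns*; do
    name=$(basename "$sat")
    s1=$(wc -l < stage-1/"$name")
    s2=$(wc -l < "$sat")
    awk -v n="$name" -v s1="$s1" -v s2="$s2" \
        'BEGIN{ printf "%s %d %.1f\n", n, s2, 100*s2/s1 }'
done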
Stage 3: Remove all records marked with invalid day of year
Ideally, the dataset would now contain only valid data; however, this turns out not to be the case. For example, the first value in each record is identified as the day of the year (decimal_day). The documentation for this field says: GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years. However, a simple sanity test shows that there are records that contain non-positive, and therefore invalid, values for this field:

[HN:gps-stage-2] awk 'BEGIN{FS=" "}; $1<=0 { print }' < stage-2/ns41 | wc -l
177
[HN:gps-stage-2]
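The same sanity check can be applied to every satellite in one go; a sketch that counts the offending records in each stage-2 file, run from the directory containing stage-2:

# Count records whose decimal_day (field 1) is zero or negative.
for sat in stage-2/ns*; do
    echo "$(basename "$sat"): $(awk 'BEGIN{FS=" "}; $1<=0 { print }' < "$sat" | wc -l)"
done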
Accordingly, we can drop all records that have an invalid value for the day of year:
For ns41 and ns48:
awk 'BEGIN{FS=" "}; ($1>=1 && (($22 == 2004 || $22 == 2008 || $22 == 2012 || \
$22 == 2016) ? $1<=367 : $1<=366)) { print }' < ns<nn> > ../stage-3/ns<nn>
For all other satellites (this appears to be a null operation; nevertheless it is worth executing the filter so as to be certain that the values are in the acceptable range):
awk 'BEGIN{FS=" "}; ($1>=1 && (($23 == 2004 || $23 == 2008 || $23 == 2012 || \
$23 == 2016) ? $1<=367 : $1<=366)) { print }' < ns<nn> > ../stage-3/ns<nn>
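The tests above simply enumerate the leap years that occur in this dataset (2004, 2008, 2012 and 2016). Should the dataset ever extend beyond that range, the leap-year condition can be computed instead of enumerated; a sketch for the satellites whose year is in field 23 (use $22 for ns41 and ns48):

# Keep records whose day of year lies in [1, 366], or [1, 367] when the
# year in field 23 is a leap year under the Gregorian rule.
awk 'BEGIN{FS=" "}; ($1>=1 && \
    $1 <= (($23%4==0 && ($23%100!=0 || $23%400==0)) ? 367 : 366)) { print }' < ns<nn> > ../stage-3/ns<nn>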
Stage 4: Correct time information
Four fields pertain to time information:

Field | Type | Dimension | Description |
---|---|---|---|
decimal_day | double | 1 | GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years. |
collection_interval | double | 1 | dosimeter collection period (seconds) |
year | int | 1 | year (e.g. 2015) |
decimal_year | double | 1 | decimal year = year + (decimal_day-1.0)/(days in year) |
Unfortunately, the actual values recorded in these fields are sometimes problematic:
- The decimal_day value is recorded with a precision of only 0.000683 days, or 0.0864 seconds. Since the value is recorded in terms of days rather than seconds, even if the records are actually acquired according to a clock whose period is an integral number of seconds, the precise value of the decimal_day field will jitter slightly around the correct time because of the quantisation error.
- The decimal_year value is not useful because it lacks sufficient precision. For example, the first two records of data from ns41 both contain the value 2.001016e+03 for this field, despite being acquired roughly 4 minutes apart. Therefore, this field cannot be used as an accurate representation of the time of the record. Since a monotonically increasing value that indicates time is undoubtedly useful, the values in the dataset should be replaced with corrected values with a larger number of significant figures (see the sketch after this list).
- Inspection of the value of collection_interval suggests that it is intended to be an integer (despite being represented as a floating point number) taken from the set { 24, 120, 240, 4608 }. On one occasion, however, it has the value 367.0062. This I assume to be an error (the decimal_day value is consistent with a value of 240). Nevertheless, the presence of what seems to be an unreasonable number for this field suggests the need to filter the dataset further, so as to retain only those records with a value of collection_interval taken from the set of expected values.
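To illustrate the second item, the decimal_year formula from the table above can simply be re-evaluated at full double precision from the year and decimal_day fields. A minimal sketch for the satellites whose year is in field 23 ($22 for ns41 and ns48), run from inside the stage-3 directory, which prints the recomputed value rather than rewriting the record:

# Recompute decimal_year = year + (decimal_day - 1.0)/(days in year)
# with ten decimal places; print year, decimal_day and the result.
awk 'BEGIN{FS=" "}; { days = ($23%4==0 && ($23%100!=0 || $23%400==0)) ? 366 : 365; \
    printf "%d %s %.10f\n", $23, $1, $23 + ($1-1.0)/days }' < ns<nn> | head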
Accordingly, our next step is to deal with the third item in the above list: to filter the data and retain only those records with values of collection_interval taken from the set { 24, 120, 240, 4608 }. Although I noticed the occurrence of only one record with an invalid value, it is worthwhile to process the complete dataset so as to be certain that only legitimate values remain.
For ns41 and ns48:
awk 'BEGIN{FS=" "}; $21=="2.400000e+01" || $21=="1.200000e+02" || \
$21=="2.400000e+02" || $21=="4.608000e+03" { print }' < ns<nn> > ../stage-4/ns<nn>
And for the remaining spacecraft:
awk 'BEGIN{FS=" "}; $22=="2.400000e+01" || $22=="1.200000e+02" || \
$22=="2.400000e+02" || $22=="4.608000e+03" { print }' < ns<nn> > ../stage-4/ns<nn>
As a kind of checkpoint, I have uploaded the stage 4 dataset. The MD5 checksum is 657e03a3fdbae517bd8eccf5f208b792.
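After downloading, the archive can be checked against that value; <archive> below is a placeholder for whatever the downloaded file is called:

# Compare the archive's MD5 checksum with the published value.
md5sum <archive> | awk '{ print ($1 == "657e03a3fdbae517bd8eccf5f208b792" ? "checksum OK" : "checksum MISMATCH") }'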
For the record, here are the number of records for each satellite at the end of Stage 4 (the last couple of stages of filtering have affected only ns41):
Satellite | Stage 4 Records |
---|---|
ns41 | 1,991,069 |
ns48 | 1,105,991 |
ns53 | 1,331,687 |
ns54 | 1,939,151 |
ns55 | 1,055,328 |
ns56 | 1,680,887 |
ns57 | 1,082,626 |
ns58 | 1,175,519 |
ns59 | 1,516,528 |
ns60 | 1,495,541 |
ns61 | 1,470,445 |
ns62 | 775,535 |
ns63 | 652,110 |
ns64 | 344,702 |
ns65 | 480,935 |
ns66 | 446,801 |
ns67 | 327,994 |
ns68 | 306,513 |
ns69 | 262,992 |
ns70 | 110,971 |
ns71 | 221,336 |
ns72 | 182,332 |
ns73 | 145,858 |
Further processing of the dataset will be described in subsequent posts.