## 2017-02-27

### Versions of the LANL GPS Charged-Particle Dataset

The original dataset of GPS charged-particle measurements is quite cumbersome, comprising many thousands of individual files. It is therefore useful to convert the data into a more manageable schema, and also to remove what appears to be unnecessary and invalid information from the dataset.

For simplicity and ease of reference, I number each stage of the process so that others can more easily reproduce the various versions of the data. The initial complete dataset is stage 0.

#### Stage 1: All the data for each satellite in a single file

Because it is usually easier to operate on a single file for each satellite, we start by combining all the files for each satellite. Inside the directory for each satellite, ns<nn>, execute:

```shell
for file in *ascii; do grep -v "#" $file >> stage-1/ns<nn>; done
```

| Satellite | Records | Start (YYYY-DOY) | End (YYYY-DOY) |
|-----------|-----------|----------|----------|
| ns41 | 2,078,270 | 2001-007 | 2016-366 |
| ns48 | 1,178,405 | 2008-083 | 2016-366 |
| ns53 | 1,487,237 | 2005-275 | 2016-366 |
| ns54 | 2,153,770 | 2001-049 | 2016-366 |
| ns55 | 1,609,281 | 2007-308 | 2017-001 |
| ns56 | 1,846,892 | 2003-040 | 2016-366 |
| ns57 | 1,186,706 | 2008-013 | 2017-001 |
| ns58 | 1,333,278 | 2006-337 | 2016-366 |
| ns59 | 1,696,926 | 2004-088 | 2016-366 |
| ns60 | 1,650,202 | 2004-193 | 2016-366 |
| ns61 | 1,610,038 | 2004-319 | 2016-366 |
| ns62 | 873,766 | 2010-164 | 2016-366 |
| ns63 | 740,301 | 2011-198 | 2016-366 |
| ns64 | 379,422 | 2014-054 | 2016-338 |
| ns65 | 528,191 | 2012-288 | 2016-366 |
| ns66 | 478,259 | 2013-153 | 2016-366 |
| ns67 | 361,540 | 2014-138 | 2016-366 |
| ns68 | 315,369 | 2014-222 | 2016-366 |
| ns69 | 303,676 | 2014-320 | 2016-366 |
| ns70 | 132,460 | 2016-038 | 2016-366 |
| ns71 | 240,660 | 2015-088 | 2016-366 |
| ns72 | 214,410 | 2015-200 | 2016-366 |
| ns73 | 171,485 | 2015-305 | 2016-366 |

#### Stage 2: Remove all records marked as bad

For some reason, the dataset includes records that are known to be bad. A bad record is signalled by a value of unity for the dropped_data field (the documentation for this field says: "if =1 it means something is wrong with the data record, do not use it"). So we remove all records for which this field has the value one.

For ns41 and ns48:

```shell
awk 'BEGIN{FS=" "}; $25!="1" { print }' < stage-1/ns<nn> > stage-2/ns<nn>
```
For all other satellites:
```shell
awk 'BEGIN{FS=" "}; $26!="1" { print }' < stage-1/ns<nn> > stage-2/ns<nn>
```

| Satellite | Stage 2 Records | % Good Stage 1 Records |
|-----------|-----------------|------------------------|
| ns41 | 1,991,249 | 95.8 |
| ns48 | 1,105,991 | 93.8 |
| ns53 | 1,331,687 | 89.5 |
| ns54 | 1,939,151 | 90.0 |
| ns55 | 1,055,328 | 65.6 |
| ns56 | 1,680,887 | 91.0 |
| ns57 | 1,082,626 | 91.2 |
| ns58 | 1,175,519 | 88.2 |
| ns59 | 1,516,528 | 89.4 |
| ns60 | 1,495,541 | 90.6 |
| ns61 | 1,470,445 | 91.3 |
| ns62 | 775,535 | 88.8 |
| ns63 | 652,110 | 88.1 |
| ns64 | 344,702 | 90.8 |
| ns65 | 480,935 | 91.0 |
| ns66 | 446,801 | 93.4 |
| ns67 | 327,994 | 90.7 |
| ns68 | 306,513 | 97.2 |
| ns69 | 262,992 | 86.6 |
| ns70 | 110,971 | 83.8 |
| ns71 | 221,336 | 92.0 |
| ns72 | 182,332 | 85.0 |
| ns73 | 145,858 | 85.0 |

#### Stage 3: Remove all records marked with invalid day of year

Ideally, the dataset would now contain only valid data; however, this turns out not to be the case. For example, the first value in each record is identified as the day of the year (decimal_day). The documentation for this field says: "GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years". However, a simple sanity test shows that there are records that contain negative values for this field:

```shell
[HN:gps-stage-2] awk 'BEGIN{FS=" "}; $1<=0 { print }' < stage-2/ns41 | wc -l
177

[HN:gps-stage-2]
```
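The same sanity check can be sketched in Python. The helper below and the sample records are illustrative inventions, not lines from the dataset, but the rule matches the awk one-liner: flag any record whose first field (decimal_day) is not positive.

```python
def count_invalid_day(lines):
    """Count records whose first field (decimal_day) is zero or negative,
    mirroring the awk test $1<=0."""
    count = 0
    for line in lines:
        fields = line.split()
        if fields and float(fields[0]) <= 0:
            count += 1
    return count

# Invented whitespace-separated records; only the first field matters here.
sample = [
    "12.345000 2.400000e+02 2001",   # plausible decimal_day
    "-0.991056 2.400000e+02 2001",   # negative decimal_day: invalid
    "366.50000 2.400000e+02 2004",
]
print(count_invalid_day(sample))  # → 1
```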

Accordingly, we can drop all records that have an invalid value for the day of year:

For ns41 and ns48:

```shell
awk 'BEGIN{FS=" "}; ($1>=1 && (($22 == 2004 || $22 == 2008 || $22 == 2012 || \
$22 == 2016) ? $1<=367 : $1<=366)) { print }' < ns<nn> > ../stage-3/ns<nn>
```

For all other satellites (this appears to be a null operation; nevertheless it is worth executing the filter so as to be certain that the values are in the acceptable range):

```shell
awk 'BEGIN{FS=" "}; ($1>=1 && (($23 == 2004 || $23 == 2008 || $23 == 2012 || \
$23 == 2016) ? $1<=367 : $1<=366)) { print }' < ns<nn> > ../stage-3/ns<nn>
```
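The awk filters enumerate only the leap years that occur in the data span (2004, 2008, 2012, 2016). As a cross-check, the same bounds can be expressed with the general Gregorian leap-year rule, which agrees for those years; the function names below are my own, not part of the dataset tooling.

```python
def max_decimal_day(year):
    """Upper bound for decimal_day: 367 in leap years, 366 otherwise
    (the general form of the ternary in the awk filters)."""
    leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)
    return 367 if leap else 366

def day_is_valid(decimal_day, year):
    """A record passes Stage 3 if 1 <= decimal_day <= max_decimal_day(year)."""
    return 1 <= decimal_day <= max_decimal_day(year)

print(day_is_valid(366.9, 2016))  # → True  (leap year, bound is 367)
print(day_is_valid(366.9, 2015))  # → False (bound is 366)
```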

#### Stage 4: Correct time information

Four fields pertain to time information:

```
decimal_day          double  1  GPS time, a number from 1 (1-Jan 00:00) to 366 (31-Dec 24:00) or 367 in leap years.
collection_interval  double  1  dosimeter collection period (seconds)
year                 int     1  year (e.g. 2015)
decimal_year         double  1  decimal year = year + (decimal_day-1.0)/(days in year)
```

Unfortunately, the actual values recorded in these fields are sometimes problematic:
1. The decimal_day value is recorded with a precision of only 0.000683 days, or 0.0864 seconds. Since the value is recorded in days rather than seconds, even if the records are actually acquired according to a clock whose period is an integral number of seconds, the precise value of the decimal_day field will jitter slightly around the correct time because of the quantisation error.
2. The decimal_year value is not useful because it lacks sufficient precision. For example, the first two records of data from ns41 both contain the value 2.001016e+03 for this field, despite being acquired roughly 4 minutes apart. Therefore, this field cannot be used as an accurate representation of the time of the record. Since a monotonically increasing value that indicates time is undoubtedly useful, the values in the dataset should be replaced with corrected values carrying more significant figures.
3. Inspection of the value of collection_interval suggests that it is intended to be an integer (despite being represented as a floating point number) taken from the set { 24, 120, 240, 4608 }. On one occasion, however, it has the value 367.0062. This I assume to be an error (the decimal_day value is consistent with a value of 240). Nevertheless, the presence of what seems to be an unreasonable number for this field suggests the need to filter the dataset further, so as to retain only those records with a value of collection_interval taken from the set of expected values.
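The decimal_year problem (item 2 above) is easy to reproduce numerically: at a six-digit mantissa, two records one 240-second collection interval apart format to the same value. The times below are invented for illustration, not taken from the dataset.

```python
SECONDS_PER_YEAR = 365 * 86400      # 2001 is not a leap year
t0 = 2001.0 + 5.5 / 365             # invented decimal_year, early in 2001
t1 = t0 + 240 / SECONDS_PER_YEAR    # one 240-second collection interval later

# At six digits after the decimal point of the mantissa, the two
# values are indistinguishable:
print("%e" % t0)  # → 2.001015e+03
print("%e" % t1)  # → 2.001015e+03
```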
The presence of occasional illegitimate values for fields such as decimal_day and collection_interval suggests that the data were not subjected to checking before being made available to the public (despite the presence of the dropped_data field; I have seen no documentation that describes what particular errors would cause this field to be set to 1). This is unfortunate, as it means that users have to process the data to remove errors before being able to move on to perform a defensible analysis based on the data.

Accordingly, our next step is to deal with the third item in the above list: to filter the data and retain only those records with values of collection_interval taken from the set { 24, 120, 240, 4608 }. Although I noticed the occurrence of only one record with an invalid value, it is worthwhile to process the complete dataset so as to be certain that only legitimate values remain.

For ns41 and ns48:

```shell
awk 'BEGIN{FS=" "}; $21=="2.400000e+01" || $21=="1.200000e+02" || \
$21=="2.400000e+02" || $21=="4.608000e+03" { print }' < ns<nn> > ../stage-4/ns<nn>
```

And for the remaining spacecraft:

```shell
awk 'BEGIN{FS=" "}; $22=="2.400000e+01" || $22=="1.200000e+02" || \
$22=="2.400000e+02" || $22=="4.608000e+03" { print }' < ns<nn> > ../stage-4/ns<nn>
```
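Note that the awk commands match the literal string representation of the field. A numeric membership test, sketched below in Python, is slightly more permissive: it assumes the field always parses as a float. The field index and the sample records are illustrative only.

```python
# The set of expected collection_interval values, as numbers.
VALID_INTERVALS = {24.0, 120.0, 240.0, 4608.0}

def has_valid_interval(record, field_index):
    """Return True if the whitespace-separated field at field_index
    (0-based: 20 for ns41/ns48, i.e. awk's $21; 21 for the others)
    is one of the expected collection intervals."""
    return float(record.split()[field_index]) in VALID_INTERVALS

# Invented records with collection_interval as the third field:
print(has_valid_interval("1.5 2001 2.400000e+02", 2))  # → True
print(has_valid_interval("1.5 2001 3.670062e+02", 2))  # → False (the stray 367.0062)
```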

As a kind of checkpoint, I have uploaded the stage 4 dataset. The MD5 checksum is 657e03a3fdbae517bd8eccf5f208b792.

For the record, here are the number of records for each satellite at the end of Stage 4 (the last couple of stages of filtering have affected only ns41):

| Satellite | Stage 4 Records |
|-----------|-----------------|
| ns41 | 1,991,069 |
| ns48 | 1,105,991 |
| ns53 | 1,331,687 |
| ns54 | 1,939,151 |
| ns55 | 1,055,328 |
| ns56 | 1,680,887 |
| ns57 | 1,082,626 |
| ns58 | 1,175,519 |
| ns59 | 1,516,528 |
| ns60 | 1,495,541 |
| ns61 | 1,470,445 |
| ns62 | 775,535 |
| ns63 | 652,110 |
| ns64 | 344,702 |
| ns65 | 480,935 |
| ns66 | 446,801 |
| ns67 | 327,994 |
| ns68 | 306,513 |
| ns69 | 262,992 |
| ns70 | 110,971 |
| ns71 | 221,336 |
| ns72 | 182,332 |
| ns73 | 145,858 |

Further processing of the dataset will be described in subsequent posts.