ACS 2015 5-year PUMS for database/IT professionals
Continuing with the ACS PUMS database project, the next task is to import the 2015 5-year ACS PUMS product.
Comparing Census's 2015 Data Dictionary file (PUMS_Data_Dictionary_2011-2015.txt) to the 2014 one, we noticed, again, that Census added leading spaces to a lot of lines, possibly to make the file more readable for human users. Following the same step I used for the 2015 1-year file, I removed those leading spaces with sed before running the file through my definition processor.
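For readers who prefer not to use sed, the leading-space cleanup can be sketched in Python. This is only an illustration of the idea, not the exact command I ran: the function name `strip_leading_spaces` and the command-line usage are made up for this example, and it assumes the file can be read as Latin-1 so that every byte survives the round trip.

```python
import re
import sys


def strip_leading_spaces(text: str) -> str:
    """Remove leading spaces and tabs from every line; blank lines are kept."""
    # With re.MULTILINE, '^' matches at the start of each line, so this is
    # roughly what `sed 's/^[ \t]*//'` does (GNU sed syntax assumed).
    return re.sub(r"^[ \t]+", "", text, flags=re.MULTILINE)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python strip_spaces.py input.txt output.txt
    # Latin-1 maps every byte to one character, so nothing is lost on rewrite.
    with open(sys.argv[1], encoding="latin-1", newline="") as src:
        cleaned = strip_leading_spaces(src.read())
    with open(sys.argv[2], "w", encoding="latin-1", newline="") as dst:
        dst.write(cleaned)
```

Blank lines are left alone on purpose, since the parser relies on them to separate entries.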
Processing the Data Dictionary file with my program yields the following parsing errors. Some of them are clearly unintended mistakes; others may exist simply because Census has not spent the time and effort to establish clear syntax rules that would make its products machine friendly. Here are the parsing errors:
ADJINC
value 1001264: the blank after '1001264' is actually an A0h byte instead of 20h.
TEN
a two-line 'NOTE:' appears just before TEN
PERSON RECORD
no blank line after the 'PERSON RECORD' section marker
ADJINC
value 1001264: the blank after '1001264' is actually an A0h byte instead of 20h.
GCL
a two-line 'NOTE:' appears just before GCL
FPINCP
no empty line before FPINCP
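The A0h one is really interesting. For those interested, A0h is the non-breaking space (NBSP) in Latin-1 and Windows-1252, the character that HTML writes as &nbsp;. It looks exactly like a normal space (20h) in most editors, so without a good hex editor it took me a lot of effort to figure out what was going wrong.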
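If no hex editor is at hand, a few lines of Python can locate the stray bytes. This is a minimal sketch, assuming the file uses a single-byte encoding such as Latin-1 or Windows-1252 (in UTF-8 the same character would be the two bytes C2h A0h, and a lone A0h could be part of another character). The name `find_nbsp_bytes` is made up for this example.

```python
def find_nbsp_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return 1-based (line, column) positions of every 0xA0 byte in data."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        # Iterating over a bytes object yields integers, one per byte.
        for col, byte in enumerate(line, start=1):
            if byte == 0xA0:
                hits.append((line_no, col))
    return hits


# Small made-up sample: the byte after '1001264' is A0h, not 20h.
sample = b"ADJINC 7\n1001264\xa0.2011 factor\n"
positions = find_nbsp_bytes(sample)
print(positions)
```

Once the positions are known, `data.replace(b"\xa0", b" ")` turns each of them into an ordinary space before the file goes to the parser.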
Labels: ACS, Census, Csv, Data Dictionary, PUMS