Things about society.

Tuesday, August 23, 2016

2010-2014 ACS PUMS for Data/IT professionals #1

For the most part, social scientists and statisticians have handled large amounts of data without considering, using, or even knowing about relational databases. Survey data published by various government entities has been geared toward statistical software rather than relational databases. One important factor is that, even though a relational database is fully capable of handling variable and value labels, there is no standard practice for how best to handle them. Each developer is free to implement whatever approach they see fit.
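To make the point concrete, here is a minimal sketch of one way a relational database could hold variable and value labels. The table names, column names, and the `label_for` helper are my own assumptions, not a standard, and the sample rows are only illustrative.

```python
import sqlite3

# In-memory database, used here only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variable_label (
    variable TEXT PRIMARY KEY,   -- e.g. a column name from the csv file
    label    TEXT NOT NULL       -- human-readable description
);
CREATE TABLE value_label (
    variable TEXT NOT NULL REFERENCES variable_label(variable),
    value    TEXT NOT NULL,      -- the coded value as it appears in the data
    label    TEXT NOT NULL,      -- what the code means
    PRIMARY KEY (variable, value)
);
""")

# Illustrative rows only.
conn.execute("INSERT INTO variable_label VALUES (?, ?)", ("RT", "Record Type"))
conn.executemany(
    "INSERT INTO value_label VALUES (?, ?, ?)",
    [
        ("RT", "H", "Housing Record or Group Quarters Unit"),
        ("RT", "P", "Person Record"),
    ],
)


def label_for(conn, variable, value):
    """Return the label for a coded value, or None if it is not defined."""
    row = conn.execute(
        "SELECT label FROM value_label WHERE variable = ? AND value = ?",
        (variable, value),
    ).fetchone()
    return row[0] if row else None
```

Storing the coded value as text keeps zero-padded and non-numeric codes intact. Other designs, such as one lookup table per variable, are just as valid, which is exactly the lack of a standard practice described above.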

As an IT professional who works with social science data, I keep running into the same kind of problem: how to extract labels from sources that are friendly to statistical software but not to databases.

For the ACS PUMS data, the 2010-2014 5-year data published by the Census Bureau comes in two formats: the csv file format and the SAS file format. Without statistical software, the only useful format is csv. However, the csv files lack all the important data type and label information. To work around that, the possible sources for variable and value labels are the data dictionary published in pdf format, the data dictionary published in txt format, and the session file saved when working with Census Data Ferret. The pdf is not a good machine-readable format for extracting the dictionary information.

The session file from Census Data Ferret is in XML, which in theory is an ideal format for publishing the dictionary information. After working with the file for a while, I learned a few things.

The first problem I ran into with the session file is that it isn't well-formed XML. One obvious problem is that the symbol '&' isn't escaped. '&' is a special character in the XML standard and has to be written as '&amp;' when it appears in text or attribute values. The second and bigger problem is that the session file does not contain all the variables in the csv file, which is understandable since the session file was never intended to be a data dictionary for the ACS PUMS product. However, this undermines the case for using the session file to derive the label information, even though it could have been an easy route.
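A minimal sketch of one possible workaround for the unescaped '&' problem: escape any '&' that does not already start an entity, then hand the text to a normal XML parser. The element and attribute names in the sample are hypothetical and not taken from a real session file.

```python
import re
import xml.etree.ElementTree as ET

# Match '&' that is NOT already the start of an entity such as
# &amp;  &#38;  or  &#x26;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace every bare '&' with '&amp;', leaving existing entities alone."""
    return BARE_AMP.sub("&amp;", text)


def parse_lenient(text):
    """Parse XML text after repairing bare ampersands."""
    return ET.fromstring(escape_bare_ampersands(text))


# Hypothetical fragment: one bare '&' and one that is already escaped.
sample = '<variable name="X" label="Plumbing & kitchen &amp; more"/>'
root = parse_lenient(sample)
```

This only repairs bare ampersands. Text that looks like an entity but is not defined (for example `&foo;`) would still be rejected by the parser, and the repair does nothing about variables that are missing from the session file.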

The second approach was to write a program to parse the text version of the data dictionary file. I was able to pursue this route with relative success, though not without problems. First of all, the Census Bureau does not make clear whether the file is intended for human reading or for computer processing, even though at first sight it seems structured enough for computer processing.

The program was written with certain assumptions about the file format. The processing was relatively painless except for the following issues, which I believe were unintended errors on the Census Bureau's part:
    line 226: GASP 3
    line 778: WORKSTAT 2
    line 3439: ESP 1
    line 7265: RACAS 1 -> RACASN 1
The file processed was PUMS_Data_Dictionary_2010-2014.txt, downloaded from the Census Bureau with a release date of Jan. 14, 2016.
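To illustrate how a parser built on format assumptions ends up surfacing lines like the ones listed above, here is a minimal sketch. It is not the program I used. The layout it assumes is a simplification stated for this example only: a header line starting in the first column with a variable name and a width, then an indented description line, then indented `code .label` lines, with a blank line closing the entry. Anything that breaks those assumptions is collected with its line number for a human to check.

```python
import re

# Assumed header: variable name at column 0 followed by its width, e.g. "RT   1"
HEADER = re.compile(r"^([A-Z][A-Z0-9_]*)\s+(\d+)\s*$")
# Assumed value line: indented code, whitespace, then ".label"
VALUE = re.compile(r"^\s+(\S+)\s+\.(.*)$")


def parse_dictionary(lines):
    """Return (variables, problems) for an iterable of dictionary lines."""
    variables = {}
    problems = []          # (line number, text) pairs that break the assumptions
    current = None
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip()
        if not line:                       # blank line closes the current entry
            current = None
            continue
        if not line[0].isspace():          # unindented: should be a header
            m = HEADER.match(line)
            if m:
                current = {"width": int(m.group(2)), "label": None, "values": {}}
                variables[m.group(1)] = current
            else:
                current = None
                problems.append((lineno, line))
            continue
        if current is None:                # indented text outside any entry
            problems.append((lineno, line.strip()))
        elif current["label"] is None:     # first indented line: description
            current["label"] = line.strip()
        else:                              # later indented lines: value labels
            m = VALUE.match(line)
            if m:
                current["values"][m.group(1)] = m.group(2).strip()
            else:
                problems.append((lineno, line.strip()))
    return variables, problems


sample = """RT          1
    Record Type
           H .Housing Record or Group Quarters Unit
           P .Person Record

Note: this line does not fit the assumed layout
""".splitlines()

variables, problems = parse_dictionary(sample)
```

Reporting problem lines instead of stopping at the first one makes it possible to review every irregularity in a single pass and then decide, case by case, whether to fix the input or relax the assumption.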

A program was also written to derive the data type information from the data dictionary file.
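As a rough idea of what deriving data types can look like, here is a minimal sketch of one possible heuristic. It is not the rule my program uses. The function name, the choice of SQL types, and the reading of codes made up only of the letter 'b' as blank (not applicable) are all assumptions made for this example.

```python
import re

NUMERIC_CODE = re.compile(r"^-?\d+(\.\.-?\d+)?$")  # "3" or a range like "1..99"
BLANK_CODE = re.compile(r"^b+$")                   # assumed marker for blank / N/A


def guess_sql_type(width, codes):
    """Guess a column type from a variable's width and its listed value codes."""
    nullable = any(BLANK_CODE.match(c) for c in codes)
    rest = [c for c in codes if not BLANK_CODE.match(c)]
    numeric = bool(rest) and all(NUMERIC_CODE.match(c) for c in rest)
    # Zero-padded codes such as "01" would lose their padding as integers,
    # so this sketch keeps them as text.
    padded = any(len(c) > 1 and c.startswith("0") for c in rest)
    base = "INTEGER" if numeric and not padded else f"CHAR({width})"
    return base if nullable else base + " NOT NULL"
```

For example, `guess_sql_type(1, ["H", "P"])` gives a fixed-width text column, while `guess_sql_type(1, ["b", "1", "2"])` gives a nullable integer. A real rule set would need more cases, such as weights and dollar amounts.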
