Things about society.

Tuesday, August 23, 2016

2014-2010 ACS PUMS for Data/IT professionals #2

Continue with previous article.

Before you can parsing out the Data Dictionary file, you may like to see if the file is in *nix format (i.e. lines are terminated by new line only) or in Windows/DOS format (i.e. each line is terminated by carriage-return/line-feed pair). You can achieve this using various kind of text editors. For example, jEdit allows you to save using different coding and sed allows you to insert carriage-return character.
After parsing out the Data Dictionary file, one obtained the information on variables and labels. These include the names, the descriptions and possible values for each variable. With the information on possible values for each variable, programs were written to determine the appropriate data type to assign to each variable.

The data type can then be used to create database tables with the correct data type assigned. The last piece of the process is to import the csv file into the table created. Depend on the database used and the tools available, one may run into other problem with the csv file download from the Census.

In my process, one problem I ran into is the notion of 'no value'. As we know about survey data, a 'no response' is usually considered special for other valid value. From the database side, a Null value is similar to the 'no response' or 'no value' and should be different from a 0 or a blank/space character. In the scope of a csv file, a consecutive comma ',,' can better be interpreted as null while a space surround with two commas may better be considered as a space character. However, depend on the importing program/utilities you have, you may or may not run into problems.

For Census' ACS PUMS csv file, some 'no value' were coded as space between two commas. For programs/tools that require two consecutive commas to indicate 'no value', pre-process of csv file is required. With the size of these csv files, it is impractical to editing these csv files manually. A well know tool for this kind of task is the *nix sed program.

By processing the csv file with sed first, I was able to import the data into Microsoft SQL server using the bcp program of Microsoft.


Labels: , , , , , , ,


Post a Comment

Subscribe to Post Comments [Atom]

<< Home