NCES IPEDS data for Database/IT professionals #2
This is a continuation of my previous blog on importing the NCES IPEDS data into database for IT/Database professionals.
As described in my last blog, I was able to apply few IT know-how and managed to download all IPEDS data automatically to my desktop. The next mission is to import those data automatically into database.
One of major task to support social science research, as described in my previous articles about the ACS PUMS data, is the ability to import and provide the label information for both variables and values. In current case of NCES IPEDS data, these information can be obtained from various places.One possible place is to use the 'Dictionary zip file' download provided by the IPEDS. Browsing through few of the information provided by the dictionary zip file, you realized that is really not the idea place to get these kind of information. The zip file provided by IPEDS do not have consistent format. In some years and surveys, the information is provided in Excel format. In other years and surveys, these information were provided as web-page html formats.
After studying how IPEDS made their data available to statistics software, I, actually, think it is quite unique and workable - IPEDS provide a single, or 2, .csv file to support all statistics software. It then provide a 'program/script' for each statistics software. The program or the script, basically, provide instructions to import the .csv file, to assign variables labels to each column and to assign value labels to each values. So, basically, all the label information for both variables and values are embedded in the program/script file. The question, of cause, is how to extract those information - For programmers went through bachelor degree training, they understand that this is a topic under the compiler study.
Based on my work long time ago, I was able to modify the parser and parse the info in the .sps file into database. The code to pull data in the csv to database is largely the same as I did with the ACS PUMS excepting few reformatting and unzipping. Some of the IPEDS csv files using the dot '.' to mean NA and some of the quoting isn't very consistent. I decide to convert tabs in the file to space first, then transform the file to tab delimited and that seems to work fine.
Sam Barbett at IPEDS was also contact about the release of 'Final' data - IPEDS, in general, release data in two phases. The first one is called provisional while the second one is called final. According to Sam, when the final data released, the new/final csv file will have a '_rv' suffix added to the file name, even though the .sps file isn't updated.
The other problem I run into is line termination used in the downloaded .sps file. Most of my processes is run/executed under Windows environment which means that lines in a file are terminated by the Carriage-Return/Line-Feed pair instead of a single Line-Feed character used in most Unix systems. By modifying my code to accommodate that, I was currently be able to process couple years of data without run into errors.
Overall, the process is smooth and the next task would be how to use these data effectively and efficiently.