SocialPond

Things about society.

Monday, May 29, 2017

NCES IPEDS data for Database/IT professionals #2


This is a continuation of my previous blog on importing the NCES IPEDS data into database for IT/Database professionals.

As described in my last blog, I was able to apply few IT know-how and managed to download all IPEDS data automatically to my desktop. The next mission is to import those data automatically into database. 

One of major task to support social science research, as described in my previous articles about the ACS PUMS data, is the ability to import and provide the label information for both variables and values. In current case of NCES IPEDS data, these information can be obtained from various places.One possible place is to use the 'Dictionary zip file' download provided by the IPEDS. Browsing through few of the information provided by the dictionary zip file, you realized that is really not the idea place to get these kind of information. The zip file provided by IPEDS do not have consistent format. In some years and surveys, the information is provided in Excel format. In other years and surveys, these information were provided as web-page html formats.

After studying how IPEDS made their data available to statistics software, I, actually, think it is quite unique and workable - IPEDS provide a single, or 2, .csv file to support all statistics software. It then provide a 'program/script' for each statistics software. The program or the script, basically, provide instructions to import the .csv file, to assign variables labels to each column and to assign value labels to each values. So, basically, all the label information for both variables and values are embedded in the program/script file. The question, of cause, is how to extract those information - For programmers went through bachelor degree training, they understand that this is a topic under the compiler study.

Based on my work long time ago, I was able to modify the parser and parse the info in the .sps file into database. The code to pull data in the csv to database is largely the same as I did with the ACS PUMS excepting few reformatting and unzipping. Some of the IPEDS csv files using the dot '.' to mean NA and some of the quoting isn't very consistent. I decide to convert tabs in the file to space first, then transform the file to tab delimited and that seems to work fine.

Sam Barbett at IPEDS was also contact about the release of 'Final' data - IPEDS, in general, release data in two phases. The first one is called provisional while the second one is called final. According to Sam, when the final data released, the new/final csv file will have a '_rv' suffix added to the file name, even though the .sps file isn't updated.

The other problem I run into is line termination used in the downloaded .sps file. Most of my processes is run/executed under Windows environment which means that lines in a file are terminated by the Carriage-Return/Line-Feed pair instead of a single Line-Feed character used in most Unix systems. By modifying my code to accommodate that, I was currently be able to process couple years of data without run into errors.

Overall, the process is smooth and the next task would be how to use these data effectively and efficiently.


Labels: , , , , , ,

Wednesday, May 17, 2017

NCES IPEDS data for Database/IT professionals


Personally, I am an IT professional worked in an education agency. (Have) Been in this position long enough and dealing with social science researchers a lot. One of the interesting observation is that even though the social science researcher dealing with data all the time, without the IT background still limited their ability to handle large amount of data efficiently. A lot of time, these staffs relied on expensive commercial software and computer hardware to perform their tasks. When leading projects, often times, they are limited by their vision to provide and deliver efficient data products.

On the other hand, people with strong IT training can have better visions on how things work and know the real limits of things and set the goals that others can't - I love this Elon Musk story Simple math is why Elon Musk’s companies keep doing what others don’t even consider possible, where Physics is said to be applied first, which is the fundamental that dictated the limits. The value of a real STEM training is the vision and the know of limits. Apply to the data processing, the IT is the know.

The Integrated Postsecondary Education Data System (IPEDS) refers to a set of data collected from a large set of Postsecondary Education Institutions of United States. The survey is conducted by the National Center for Education Statistics. The data collected is available for anyone's use. For causal use, you can easily obtained the data you are interested in, manually. However, as we all know, the real power of data multiplied if you can have all the data in one place in a readily to be used state. Yes, most likely we are talking about a database.

Glance over the data retrieval option offered by NCES/IPEDS, the 'Complete data files' option seems to be the best way to retrieve the whole IPEDS data set. Practicing a bit manually, you soon realize that manually select and download will still take you a long time to even download the file let alone importing them.

With enough IT knowledge, a reasonable approach to this problem could be: 1) Save the download page; 2) Make minimal fixes to the page so that it conform to XML; 3) Device a short XSLT translation script; 4) Copy the translated page into database; 5) With the list of file to download on hand, wrote scripts to download files automatically.

In addition to the above implementation, to facilitate the continuity of time available for download, a scheduling mechanism is also implemented.
  


Labels: , , , , , , , , , , ,