SocialPond

Things about society.

Wednesday, October 18, 2017

NCES IPEDS data for Database/IT professionals #3

If you read my previous post about importing the NCES IPEDS data, you might be wondering what comes next.

Well, it turns out that nothing is easy. I have spent most of the past three months devising R functions to help me retrieve data from the few IPEDS data sets I have imported. I have finally reached a point where I have a reasonable approach that lets me search for variables of interest and check related information about each one. For example, if I am interested in the head count of American Indian students, I would like to know how that head count can be broken down: by student level, by full-time/part-time enrollment status, or by program enrolled. Besides seeing how the head count can be broken down, a user may also want to know whether that variable is available for other years and whether its definition has changed over the years.
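The lookup described above can be sketched in a few lines. This is a Python illustration rather than the R functions themselves, and it assumes the data dictionary has already been loaded as rows of (year, table, variable, title). The variable names, titles, and helper names below are all invented for the example.

```python
from collections import defaultdict

# Hypothetical data-dictionary rows: (survey year, table, variable, title).
# These names and titles are invented for illustration, not real IPEDS entries.
DICTIONARY = [
    (2008, "ENROLL_A", "AIAN_TOTAL", "American Indian or Alaska Native total"),
    (2009, "ENROLL_A", "AIAN_TOTAL", "American Indian or Alaska Native total"),
    (2010, "ENROLL_A", "AIAN_TOTAL",
     "American Indian or Alaska Native total (new race categories)"),
    (2010, "ENROLL_A", "ASIAN_TOTAL", "Asian total"),
]


def search_variables(rows, keyword):
    """Return the rows whose title contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [row for row in rows if needle in row[3].lower()]


def history(rows, variable):
    """Map each distinct title of a variable to the sorted years that used it."""
    years_by_title = defaultdict(list)
    for year, _table, name, title in rows:
        if name == variable:
            years_by_title[title].append(year)
    return {title: sorted(years) for title, years in years_by_title.items()}
```

Calling `search_variables(DICTIONARY, "american indian")` finds the candidate variables, and `history(DICTIONARY, "AIAN_TOTAL")` shows which years carry the variable. More than one title in the result hints that the definition changed at some point.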

The R program also allows advanced users to retrieve aggregated data instead of the raw data, which improves the efficiency of processing and analyzing the data.

After the prototype of the R program was done, I resumed importing more data. As all experienced data analysts should know, there are always show stoppers when you obtain data from an external source, no matter how reputable the source is. If you are in a data chain that receives clean data from your colleagues, you should appreciate the work they have done.

Just to demonstrate how things can go wrong even with a well-established operation like IPEDS, I will list a few, just a few, of the problems I ran into. It is true that if you could envision every possible error or situation and devise a solution for each, you could argue that everything should be automated and manual work would never be needed. But isn't that unrealistic? Most likely, to achieve automation at that level, you will first run into a whole set of unexpected situations and work your butt off to devise workarounds. And who is to say that the latest unexpected situation is the last one?

Here are a few issues that I ran into with the IPEDS data:
1.  Extra commas at the end of the data lines.
2.  Column names in the wrong order or in the wrong place.
3.  Variable names that do not match the column names.
4.  Leading and trailing blanks around the real data.
Besides the above, the use of a dot to indicate NA/NULL is also not the easiest thing to handle with some software tools. Speaking of software tools, while it is entirely possible to develop your own tools from beginning to end, that is usually not the fastest way to get things done. On the other hand, no matter what tools you use, they are not likely to meet all your needs, and you will be limited and constrained by the tools you can find and use.
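Issues 1 and 4, along with the dot-as-NULL convention, are mechanical enough to fix before the file ever reaches a picky import tool. Here is a minimal Python sketch of one way to do it. The function name `clean_csv` is hypothetical, and the sketch assumes the target tool reads an empty field as NULL. It does not attempt to fix misplaced or mismatched column names (issues 2 and 3), which need a human to look at them.

```python
import csv
import io


def clean_csv(text, na_token="."):
    """Strip blanks, drop extra trailing empty fields, and blank out the NA token."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")

    # Clean the header first; a trailing comma leaves an empty last name.
    header = [name.strip() for name in next(reader)]
    while header and header[-1] == "":
        header.pop()
    writer.writerow(header)

    for row in reader:
        if not row:  # skip completely blank lines
            continue
        fields = [field.strip() for field in row]
        # Only drop empty fields that exceed the header width, so a real
        # NULL in the last column is kept.
        while len(fields) > len(header) and fields[-1] == "":
            fields.pop()
        # Assumption: the import tool treats an empty field as NULL.
        writer.writerow(["" if f == na_token else f for f in fields])
    return out.getvalue()
```

The column names in any test input (for example `ID, TOTAL ,AIAN,`) are made up. The point is only that the cleaned output has the same number of fields on every line, no padding blanks, and no dots.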

In my case, my database import tool has very strict rules about the CSV files it will accept. For example, an extra comma after the data is a no-no, a dot will not be treated as NULL, and so on.

At this point, I can import about 10 years of data in less than 2 hours. But it is those few exceptions that take me days and days: figuring out why a file won't import and devising a solution around it. To this day, I am still fighting exceptions.
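Part of the time spent on exceptions goes to locating the offending lines in the first place. A rough Python sketch of a pre-import check is shown here. The function name `find_problem_lines` and the three checks are assumptions chosen to match the issues listed earlier, not the rules of any particular import tool.

```python
import csv
import io


def find_problem_lines(text, na_token="."):
    """Return (line number, reason) pairs for rows a strict importer might reject."""
    reader = csv.reader(io.StringIO(text))
    expected = len(next(reader))  # field count taken from the header line
    problems = []
    for row in reader:
        line = reader.line_num  # line number in the source text
        if len(row) != expected:
            problems.append((line, f"expected {expected} fields, found {len(row)}"))
        if any(field != field.strip() for field in row):
            problems.append((line, "leading or trailing blanks"))
        if any(field.strip() == na_token for field in row):
            problems.append((line, "dot used as NA/NULL"))
    return problems
```

Running this over a file before handing it to the import tool gives a short list of line numbers to inspect, instead of a failed import and a guess.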

Traditional data analysts and researchers are satisfied to clean up a data set by hand and proceed to analyze the data. They do not have the ambition and foresight of an engineer or scientist, who seeks generalization: the goal is to handle the many, many data sets that are still to come. What is sought is a system, not a special, single case.

People in general are ignorant of the work involved in achieving automation. It seems that automation is everywhere, and anything less than total automation is considered a failure or a defect, and unthinkable. They haven't realized that the beautiful automation they see is the work of a major software company, not the work of an individual or a less comprehensive team of software workers. The fact that it is commercial software should set them straight, but still, people are naive: they discount the efforts of a less comprehensive team and believe the less comprehensive solution should require no manual work. The especially sad thing is that these are people who receive the benefits while refusing to credit the work done by others.

No work or knowledge is worth sharing with people this ignorant.




