SocialPond

Things about society.

Wednesday, December 06, 2017

Adult College Enrollment - IPEDS data


As described in my previous articles, I have been working on importing all IPEDS (Integrated Postsecondary Education Data System) data into my database. After spending years on this project, I was able to import about 15 years' worth of IPEDS data into the database, with verification. There are still a few files that would be hard to handle properly without major effort; these basically contain long lines with embedded newline characters.

That being said, I was eager to run a test case with these IPEDS data. As fate would have it, my previous articles were about adult college enrollment, and it happens that college enrollment can be approximated from the IPEDS enrollment data.


For this article, the IPEDS fall (semester) enrollment data were first examined via my R interface code, which allows searching and checking definitions across data years. In this particular case, the R code reveals that, in 2009, the 'first professional' enrollment level disappeared from the level definition. The IPEDS documentation confirms that from 2009 on, first professional enrollment is to be reported under the graduate enrollment level. Since I am somewhat familiar with the IPEDS data collection, I knew that the enrollment age data are not mandatory in even-numbered years. Had I not known, a quick piece of R code checking the total for each year would have revealed it.
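The per-year total check mentioned above can be sketched as follows. This is a minimal illustration in Python rather than my R code; the rows are made-up numbers, and the "below half of the median" threshold is an assumption chosen only to make low-reporting years stand out.

```python
from collections import defaultdict
from statistics import median

def totals_by_year(rows):
    """Sum head counts per survey year from (year, age_category, count) rows."""
    totals = defaultdict(int)
    for year, _age_category, count in rows:
        totals[year] += count
    return dict(totals)

def suspicious_years(totals, fraction=0.5):
    """Return years whose total falls below `fraction` of the median yearly total."""
    threshold = fraction * median(totals.values())
    return sorted(year for year, total in totals.items() if total < threshold)

# Made-up rows: even years carry far fewer reported head counts.
rows = [
    (2007, "18-19", 120), (2007, "20-21", 95),
    (2008, "18-19", 40),
    (2009, "18-19", 130), (2009, "20-21", 101),
    (2010, "18-19", 30),
    (2011, "18-19", 125),
]

totals = totals_by_year(rows)
flagged = suspicious_years(totals)
print(totals)
print(flagged)
```

A year that shows up in the flagged list is a hint to go back to the documentation before trusting its age data.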

Since IPEDS data are tagged only with the college id (unitid), extra steps were needed to tag the data with attributes of the college. These attributes are made available through the so-called 'Institutional Characteristics' survey. With this survey, colleges can be tagged with control (Public/Private), level (4- or 2-year college), location (state/address...). For this project, it was found that, in 2011, 3 institutions did not report appropriate information for the 'Institutional Characteristics' survey. Luckily, information for two of them was available from other years. To preserve as much data as possible, we fixed those two with info from other years and coded the third one with a special code so that we can include it if we so desire. For this article, we include all institutions collected by IPEDS, and this includes institutions located in US territories and miscellaneous islands. To list a few, this includes AS (American Samoa), GU (Guam), PR (Puerto Rico), MH (Marshall Islands), etc.
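The tagging step described above, including the fallback to another year and the special code for the institution that could not be fixed, might look roughly like this. It is only a sketch: the field names, the unitid values, and the "UNKNOWN" code are hypothetical, not the actual layout of my database.

```python
# Special code for institutions with no usable characteristics in any year (hypothetical).
UNKNOWN = {"control": "UNKNOWN", "level": "UNKNOWN", "state": "UNKNOWN"}

def tag_enrollment(enrollment, characteristics):
    """Attach college attributes to enrollment rows.

    enrollment: list of dicts with 'year', 'unitid', 'count'
    characteristics: dict keyed by (year, unitid) -> attribute dict
    """
    # Fallback index by unitid alone; iterating in year order lets the latest year win.
    by_unitid = {}
    for (_year, unitid), attrs in sorted(characteristics.items()):
        by_unitid[unitid] = attrs

    tagged = []
    for row in enrollment:
        attrs = (
            characteristics.get((row["year"], row["unitid"]))  # same-year record
            or by_unitid.get(row["unitid"])                    # record from another year
            or UNKNOWN                                         # special code
        )
        tagged.append({**row, **attrs})
    return tagged

characteristics = {
    (2011, 1001): {"control": "Public", "level": "4-year", "state": "NE"},
    (2010, 1002): {"control": "Private", "level": "2-year", "state": "UT"},
}
enrollment = [
    {"year": 2011, "unitid": 1001, "count": 500},
    {"year": 2011, "unitid": 1002, "count": 200},  # missing in 2011, found in 2010
    {"year": 2011, "unitid": 1003, "count": 50},   # missing in every year
]
tagged = tag_enrollment(enrollment, characteristics)
```

Keeping the special code instead of dropping the row means the institution can still be included or excluded later with a simple filter.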


With the previous adult college enrollment article in mind, undergraduate enrollment from IPEDS was considered a better approximation to the enrollment from the ACS data.


Examining the IPEDS age data, I noticed that not all data were collected with equal age spans. For example, data are collected with age categories like 18 to 19, 22 to 24, 25 to 29, etc. Presenting age data directly with these age categories results in the following chart, which can trick the reader into thinking that there is a bump in the age distribution. That certainly does not look like the age distribution presented in my previous article.

Age distribution using IPEDS age categories

A better approach is to use the average head count per single year of age within each age category instead. Better yet, you can assign that average to each age in the category to give a better representation along the age axis.
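To make the averaging concrete, here is a small sketch with made-up head counts. Each category's count is divided by the number of ages it covers, and that average is assigned to every age in the category. Open-ended categories (such as an "under 18" or "65 and over" group) would need an assumed width and are left out here.

```python
def spread_over_ages(category_counts):
    """Assign each category's average head count to every age in the category.

    category_counts: list of (low_age, high_age, head_count), bounds inclusive.
    """
    per_age = {}
    for low, high, count in category_counts:
        width = high - low + 1      # number of single ages in the category
        average = count / width     # head count per single year of age
        for age in range(low, high + 1):
            per_age[age] = average
    return per_age

# Made-up counts: 22-24 and 25-29 have the same raw total but different widths.
categories = [(18, 19, 200), (20, 21, 180), (22, 24, 150), (25, 29, 150)]
per_age = spread_over_ages(categories)
```

With these numbers the raw totals for 22 to 24 and 25 to 29 are equal, but the per-age averages are 50 and 30, which is what removes the artificial bump.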

In this article, an average assigned to the category is used. To approximate the college enrollment data in my previous ACS-based article, we present the age distribution with the total enrollment, the sum of both full-time and part-time students. As shown in the following graph, the curve exhibits the familiar monotonic decrease after the primary peak around college graduation.

Age distribution for total fall enrollment using average for each age category

Since the IPEDS data also allow separating full-time and part-time enrollment, it is worth the effort to examine these characteristics too. The overall (sum over states) full-time distribution can be seen in the chart below.

Age distribution for full-time fall college enrollment using average for each age category

A typical age distribution for a state (NE) can be seen below. For most states, the only difference is whether the age group 18 to 19 or the age group 20 to 21 is the highest. The full-time fall enrollment age distribution for Utah, however, shows a very different distribution - see chart below. This may be related to the Mormon missionary program, but more evidence from other surveys or data elements may be needed.

A typical state (NE) age distribution for full-time enrollment using average for each category

Full-Time fall enrollment age distribution for the state of Utah

The IPEDS universal total for part-time fall enrollment can be seen in the chart below. Compared to full-time and total enrollment, it clearly shows a distinctive age distribution. For some states, the part-time fall enrollment is similar to that of the IPEDS universal total, as shown below (NE). There is, however, another set of states (e.g. CA, FL, GA, etc.) that shows quite a different age distribution pattern. Part-time students in these states seem to take a break from school (to work?) and come back to enroll later.

IPEDS universal age distribution for part-time fall enrollment

Part-Time fall enrollment for the State of Nebraska

Part-Time fall enrollment for the State of California

Examining the IPEDS universe part-time enrollment trend over the years, we noticed that there were more younger students in recent years. Presenting the same data in percentages shows that, proportionally, older adults were taking a smaller share of the part-time enrollment in recent years.

Age distribution of part-time students in percents
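Converting head counts to within-year percentages is simple arithmetic; a sketch with made-up numbers and hypothetical age groups follows.

```python
def shares_by_year(counts):
    """Convert {year: {age_category: head_count}} into within-year percentages."""
    shares = {}
    for year, by_age in counts.items():
        total = sum(by_age.values())
        if total == 0:
            continue  # nothing reported for this year, so no shares to compute
        shares[year] = {age: 100.0 * n / total for age, n in by_age.items()}
    return shares

# Made-up counts for two years and two broad age groups.
counts = {
    2005: {"under 25": 50, "25 and over": 50},
    2015: {"under 25": 75, "25 and over": 25},
}
shares = shares_by_year(counts)
```

Percentages make the years comparable even when the overall part-time enrollment grows or shrinks.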

Updated 20180525: A Kickstarter project has been created that will allow the average data user to obtain this kind of IPEDS data.


Saturday, November 04, 2017

Continued struggle with IPEDS data



As mentioned earlier, automating data import takes an enormous amount of dedication and effort. For those who do not appreciate that, there is really no need to share the knowledge with them.

Here is an example that demonstrates the kind of work and dedication needed to solve just one problem that I ran into while importing IPEDS data.

One problematic format I ran into in some IPEDS csv files is:
...,""some text quoted with two double quote"", ...

As humans, we know this line breaks the csv convention, and most likely any csv importing program is going to fail on it.

As a data user, I have a few resolutions to consider. If I am only dealing with this one file, the fastest way is to just open the file in a text editor and modify the line so that the csv file can be imported into my application. If you are thinking this way, most likely you are a data analyst and probably think this is how things should be handled. Since you are higher up in the data food chain, you likely have not appreciated the work and thought of IT professionals.

IT professionals are likely to view the situation from a much broader point of view and ask questions like: What if this error exists in an ACS csv file? If you know the size of a typical ACS csv file, you will realize that there are probably very few text editors that can effectively open the file, let alone locate the offending line and fix it.

IT professionals may also ask: What if other csv files also have this problem? How can I handle this automatically?

One tool a lot of IT professionals know about is the sed program. Using sed to fix this problem is straightforward:
  sed -r 's/,"("[^"]{2,}")"/,\1/g' InCsv

Unfortunately, if you want to invoke this from VBA, the command becomes much more complicated:
  Cmd.exe /c ^"sed ^-r ^-n ^'^{s^/^,^"^(^"^[^^^^^"^]^{2^,^}^"^)^"^/^,^\1^/g^; p^;^}^' InCsv ^"
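
For readers who would rather avoid the shell quoting altogether, the same substitution can be sketched in Python. This is an alternative illustration, not what I actually run; like the sed expression, it is a pattern-based patch rather than a full csv parser, and it only fixes a doubled-quote field that follows a comma.

```python
import re

# Same pattern as the sed expression: ,""text"" becomes ,"text"
DOUBLED_QUOTES = re.compile(r',"("[^"]{2,}")"')

def fix_doubled_quotes(line):
    """Collapse a field wrapped in two pairs of double quotes into one pair."""
    return DOUBLED_QUOTES.sub(r',\1', line)

def fix_stream(lines):
    """Fix lines one at a time, so a very large file never has to fit in an editor."""
    for line in lines:
        yield fix_doubled_quotes(line)

sample = [
    '100654,""some text quoted with two double quotes"",1\n',
    '100663,"normally quoted text",2\n',
]
fixed = list(fix_stream(sample))
```

Because the lines are processed as a stream, the same function can be fed an open file object and written out line by line.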

Let's just say this: if you have no clue what we are talking about here, you should appreciate the work of IT professionals.


Wednesday, October 18, 2017

NCES IPEDS data for Database/IT professionals #3

If you read my previous post about importing the NCES IPEDS data, you might be wondering what comes next.

Well, it turns out that nothing is easy. I have spent most of the past 3 months devising R functions that help me retrieve data from the few IPEDS data files I have imported. I finally reached a point where I have a reasonable approach that allows me to search for variables of interest and check related information concerning each variable. For example, if I am interested in the head count of American Indians, I would like to know how the head count can be broken down - for example, by student level, by full-time/part-time enrollment status, or by the program enrolled. In addition to seeing how the head count can be broken down, users may also be interested in whether that variable is available for other years or whether its definition has changed over the years.

The R program also allows advanced users to retrieve aggregated data instead of the raw data, which improves the efficiency of processing and analyzing the data.

After the prototype of the R program was done, I resumed the process of importing more data. As all experienced data analysts should know, there are always show stoppers when you obtain data from an external source - no matter how reputable the source is. If you are in a data chain that receives clean data from your colleagues, you should appreciate the work they have done.

Just to demonstrate how things can go wrong even with a well-established operation like IPEDS, I will list a few, just a few, of the problems I ran into. It is true that if you could envision all possible errors or situations and devise solutions for all of them, you could argue that everything should be automated and manual work would never be needed. But isn't that unrealistic? Most likely, to achieve automation at that level, you will first run into a whole set of unexpected situations and work your butt off to devise the workarounds. And who is to say that the latest unexpected situation is the last one?

Here are a few issues that I ran into with the IPEDS data:
1.  Extra commas at the end of the data lines.
2.  Column names in the wrong order, column, or place.
3.  Variable names that do not match the column names.
4.  Leading and trailing blanks around the real data.

Besides the above, the use of a dot to indicate NA/NULL is also not the easiest thing to handle with some software tools. Speaking of software tools, while it is totally possible to develop tools from beginning to end, it is usually not the fastest way to get things done. On the other hand, no matter what tools you use, they are not likely to meet all your needs, and you will be limited and restrained by the tools you can find and use.

In my case, my database importing tool has very strict rules on the csv files it will accept - for example, an extra comma after the data is a no-no, a dot will not be treated as NULL, etc.
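
A pre-cleaning pass for three of the issues above (trailing commas, blanks around values, and the dot used as NA/NULL) could be sketched as below. The column names and the sample line are made up, and passing the expected column count in by hand is an assumption made to keep the example short.

```python
import csv
import io

def clean_rows(text, n_columns):
    """Yield cleaned rows from csv text.

    - drops empty extra fields produced by trailing commas
    - strips leading and trailing blanks around each value
    - turns a lone dot into None (to be loaded as NULL)
    """
    for row in csv.reader(io.StringIO(text)):
        # Remove empty fields beyond the expected width (trailing commas).
        while len(row) > n_columns and row[-1].strip() == "":
            row.pop()
        cleaned = [cell.strip() for cell in row]
        yield [None if cell == "." else cell for cell in cleaned]

# Hypothetical three-column file with trailing commas, padded values and a dot.
sample = "unitid,age_group,count,\n100654, 18-19 ,.,\n"
rows = list(clean_rows(sample, 3))
```

Mismatched or misplaced column names are a different kind of problem and still need to be checked against the documentation.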

At this point, I can import about 10 years of data in less than 2 hours. But it is those few exceptions that take me days and days to figure out why a file won't import and to devise a solution around it. To this day, I am still fighting exceptions.

Traditional data analysts/researchers are satisfied to clean up a data set by hand and proceed to analyze the data. They do not have the ambition and foresight of an engineer or science worker, for whom generalization is to be sought and the goal is to handle the many, many data sets that are yet to come - a system is what is sought, not a special/single case.

People in general are ignorant of the work involved in achieving automation. It seems that automation is everywhere, and anything less than total automation is considered a failure or defect and unthinkable. They haven't realized that the beautiful automation they see is the work of a major software company - not the work of an individual or a less comprehensive team of software workers. The fact that it is commercial software should have made you think straight - but still, people are naive and discount the efforts of a less comprehensive team - they believe there should be no manual work in the less comprehensive solution. The especially sad thing is that these are people who receive the benefits while refusing to credit the work done by others.

No work or knowledge is worth sharing with the ignorant.






Monday, May 29, 2017

NCES IPEDS data for Database/IT professionals #2


This is a continuation of my previous blog post on importing the NCES IPEDS data into a database, for IT/Database professionals.

As described in my last post, I was able to apply a bit of IT know-how and managed to download all IPEDS data automatically to my desktop. The next mission is to import those data automatically into the database.

One of the major tasks in supporting social science research, as described in my previous articles about the ACS PUMS data, is the ability to import and provide the label information for both variables and values. In the current case of NCES IPEDS data, this information can be obtained from various places. One possible place is the 'Dictionary zip file' download provided by IPEDS. Browsing through some of the information provided in the dictionary zip files, you realize that this is really not the ideal place to get this kind of information. The zip files provided by IPEDS do not have a consistent format. In some years and surveys, the information is provided in Excel format. In other years and surveys, it is provided as HTML web pages.

After studying how IPEDS makes its data available to statistics software, I actually think the approach is quite unique and workable - IPEDS provides a single .csv file, or two, to support all statistics software. It then provides a 'program/script' for each statistics package. The program or script basically provides instructions to import the .csv file, to assign a variable label to each column, and to assign value labels to each value. So, basically, all the label information for both variables and values is embedded in the program/script file. The question, of course, is how to extract that information. Programmers who went through bachelor's degree training will recognize this as a topic in compiler studies.
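
To give a feel for the extraction problem, here is a deliberately simplified sketch that pulls variable labels out of SPSS-style syntax with a regular expression. It is not the parser I use; it assumes one name and one single-quoted label per line inside a 'variable labels' block that ends with a period, and the sample text is made up rather than copied from a real IPEDS .sps file.

```python
import re

# One variable per line: name, whitespace, single-quoted label, optional final period.
LABEL_LINE = re.compile(r"^\s*/?\s*(\w+)\s+'(.*)'\s*\.?\s*$")

def parse_variable_labels(sps_text):
    """Return {variable_name: label} from the first 'variable labels' block."""
    labels = {}
    in_block = False
    for line in sps_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("variable labels"):
            in_block = True
            continue
        if not in_block:
            continue
        if stripped == ".":            # a lone period ends the command
            break
        match = LABEL_LINE.match(line)
        if match:
            labels[match.group(1).lower()] = match.group(2)
            if stripped.endswith("'."):  # period right after the last label
                break
    return labels

# Made-up syntax text in the assumed layout.
sample_sps = """\
variable labels
unitid 'Unique identification number'
level 'Level of student'
.
value labels
"""
labels = parse_variable_labels(sample_sps)
```

Value labels are nested one level deeper (variable, then value and label pairs), which is where a real parser starts to pay off over regular expressions.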

Based on work I did a long time ago, I was able to modify the parser and parse the info in the .sps file into the database. The code to pull data from the csv into the database is largely the same as what I did with the ACS PUMS, except for a bit of reformatting and unzipping. Some of the IPEDS csv files use the dot '.' to mean NA, and some of the quoting isn't very consistent. I decided to convert tabs in the file to spaces first, then transform the file to tab-delimited, and that seems to work fine.
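
The tab-delimited conversion can be sketched in a few lines. This is an illustration of the idea only, using Python's csv module to do the quote handling; my actual conversion code is different.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Convert csv text to tab-delimited text.

    Existing tabs are replaced with spaces first, so that a tab in the output
    can only ever mean a field boundary.
    """
    no_tabs = csv_text.replace("\t", " ")
    out = io.StringIO()
    for row in csv.reader(io.StringIO(no_tabs, newline="")):
        out.write("\t".join(row) + "\n")
    return out.getvalue()

# A quoted field containing a comma, and a field containing a tab.
converted = csv_to_tsv('a,"b, with comma",c\td\n')
```

Once the quotes are resolved by the reader, commas inside a field are no longer a problem because the comma is not the delimiter any more.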

Sam Barbett at IPEDS was also contacted about the release of 'Final' data - IPEDS, in general, releases data in two phases. The first one is called provisional, while the second one is called final. According to Sam, when the final data are released, the new/final csv file will have a '_rv' suffix added to the file name, even though the .sps file isn't updated.

The other problem I ran into is the line termination used in the downloaded .sps files. Most of my processes run under a Windows environment, which means that lines in a file are terminated by the Carriage-Return/Line-Feed pair instead of the single Line-Feed character used in most Unix systems. After modifying my code to accommodate that, I am currently able to process a couple of years of data without running into errors.
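
One way to make code indifferent to the line termination is to normalize everything to a single Line-Feed before splitting. A minimal sketch of that idea, not my actual fix:

```python
def normalize_newlines(text):
    """Turn CR/LF pairs and stray CRs into single LF characters."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def split_lines(raw_text):
    """Split text into lines regardless of which convention the file used."""
    return normalize_newlines(raw_text).split("\n")

mixed = "variable labels\r\nunitid 'id'\nlevel 'lvl'\r"
lines = split_lines(mixed)
```

The order of the two replacements matters: the pair has to be handled first, otherwise each CR/LF would become two line breaks.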

Overall, the process is smooth, and the next task is to figure out how to use these data effectively and efficiently.



Wednesday, May 17, 2017

NCES IPEDS data for Database/IT professionals


Personally, I am an IT professional working in an education agency. I have been in this position long enough and have dealt with social science researchers a lot. One interesting observation is that even though social science researchers deal with data all the time, the lack of an IT background still limits their ability to handle large amounts of data efficiently. A lot of the time, these staff rely on expensive commercial software and computer hardware to perform their tasks. When leading projects, they are often limited in their vision for providing and delivering efficient data products.

On the other hand, people with strong IT training can have a better vision of how things work, know the real limits of things, and set goals that others can't - I love this Elon Musk story, "Simple math is why Elon Musk’s companies keep doing what others don’t even consider possible", where physics is said to be applied first, as the fundamental that dictates the limits. The value of real STEM training is the vision and the knowledge of limits. Applied to data processing, IT is that knowledge.

The Integrated Postsecondary Education Data System (IPEDS) refers to a set of data collected from a large set of postsecondary education institutions in the United States. The survey is conducted by the National Center for Education Statistics. The data collected are available for anyone's use. For casual use, you can easily obtain the data you are interested in manually. However, as we all know, the real power of data multiplies if you can have all the data in one place in a ready-to-use state. Yes, most likely we are talking about a database.

Glancing over the data retrieval options offered by NCES/IPEDS, the 'Complete data files' option seems to be the best way to retrieve the whole IPEDS data set. After practicing a bit manually, you soon realize that manually selecting and downloading will still take you a long time just to download the files, let alone import them.

With enough IT knowledge, a reasonable approach to this problem could be: 1) Save the download page; 2) Make minimal fixes to the page so that it conforms to XML; 3) Devise a short XSLT translation script; 4) Copy the translated page into the database; 5) With the list of files to download on hand, write scripts to download the files automatically.
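
The link-extraction part of these steps can also be sketched without XML and XSLT. The example below swaps in Python's built-in HTML parser, which tolerates pages that are not well-formed XML; it is an alternative illustration, not the XSLT route I took. The page fragment, file names, and base URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ZipLinkCollector(HTMLParser):
    """Collect absolute URLs of all links that point to .zip files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".zip"):
            self.links.append(urljoin(self.base_url, href))

def list_zip_links(page_html, base_url):
    collector = ZipLinkCollector(base_url)
    collector.feed(page_html)
    return collector.links

# Hypothetical fragment of a saved download page.
page = """
<table>
  <tr><td><a href="data/FILE_A.zip">FILE_A</a></td>
      <td><a href="data/FILE_A_Dict.zip">Dictionary</a></td></tr>
  <tr><td><a href="help.html">Help</a></td></tr>
</table>
"""
links = list_zip_links(page, "https://example.org/ipeds/")
```

The resulting list plays the role of the table of files to download; each URL can then be handed to whatever download script or scheduler is in use.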

In addition to the above, a scheduling mechanism is also implemented so that downloading can continue across the time available for it.
  

