Things about society.

Thursday, June 01, 2017

A thought on business openness

I have been looking at web hosting companies constantly, even though I am happy with my current host: SiteGround.

Things have changed a lot in the past 10 years. Back then, SiteGround ran a forum where people/users could ask questions, complain, etc. I remember reading through those posts before I signed up with SiteGround. I think what really drove me to sign up was the openness of the forum. Through the interactions there, you could see that SiteGround was genuinely running the business - not every customer can be satisfied, but every viewer forms a measure in their mind, and reasonable responses get recognized.

Just today, I happened to browse through the SiteGround website and noticed that the forum is gone, replaced by a 'Client Review' section. It misses the interactive nature of the forum and, to a point, can't really stand out from the reviews all over the internet that promote web hosting sites. To me, that is kind of sad for customers - lost among all the reviews, they can't really tell the true from the false.

To me, SiteGround has a lot to offer. Reading through the tutorials and webinars, you see all the features SiteGround offers. On the other hand, any hosting site can reproduce those tutorials while providing lousy support for what they describe.

The point is that people can advertise and promote products all they want, but what authenticates those claims? I have seen people go to sites and complain about certain hosting companies, only to be countered by posts complimenting the same company. Much of this is useless information unless the hosting company itself responds to the complainer's issue.

Monday, May 29, 2017

NCES IPEDS data for Database/IT professionals #2

This is a continuation of my previous blog post on importing the NCES IPEDS data into a database for IT/Database professionals.

As described in my last post, I was able to apply a few pieces of IT know-how and download all the IPEDS data automatically to my desktop. The next mission is to import those data into a database, again automatically.

One of the major tasks in supporting social science research, as described in my previous articles about the ACS PUMS data, is importing and providing the label information for both variables and values. In the case of the NCES IPEDS data, this information can be obtained from various places. One possibility is the 'Dictionary' zip file download provided by IPEDS. Browsing through a few of these, however, you realize they are really not the ideal source: the zip files do not have a consistent format. For some years and surveys, the information is provided in Excel format; for others, it comes as HTML web pages.

After studying how IPEDS makes its data available to statistics software, I actually think the approach is quite unique and workable: IPEDS provides a single .csv file (or two) to support all statistics packages, and then provides a program/script for each package. The program or script, basically, contains the instructions to import the .csv file, to assign a variable label to each column, and to assign value labels to each value. So all the label information for both variables and values is embedded in the program/script file. The question, of course, is how to extract that information - programmers who went through a bachelor's degree program will recognize this as a topic from compiler study.
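Extracting labels from such a script doesn't require a full compiler, though. For the SPSS (.sps) case, a small parser over the VARIABLE LABELS and VALUE LABELS blocks goes a long way. Below is a minimal Python sketch against a made-up, simplified .sps excerpt - real IPEDS scripts are far longer and the regexes would need hardening.

```python
import re

# A made-up, simplified excerpt of the kind of .sps script IPEDS ships.
sps = """
VARIABLE LABELS
 UNITID 'Unique identification number'
 STABBR 'State abbreviation' .
VALUE LABELS
 CONTROL 1 'Public' 2 'Private not-for-profit' 3 'Private for-profit' .
"""

def parse_variable_labels(text):
    """Return {variable: label} from the VARIABLE LABELS block."""
    block = re.search(r"VARIABLE LABELS(.*?)\.", text, re.S)
    if not block:
        return {}
    return dict(re.findall(r"(\w+)\s+'([^']*)'", block.group(1)))

def parse_value_labels(text):
    """Return {variable: {value: label}} from the VALUE LABELS block."""
    block = re.search(r"VALUE LABELS(.*?)\.", text, re.S)
    result = {}
    if block:
        for line in block.group(1).splitlines():
            m = re.match(r"(\w+)\s+(.*)", line.strip())
            if m:
                result[m.group(1)] = dict(
                    re.findall(r"(\S+)\s+'([^']*)'", m.group(2)))
    return result

print(parse_variable_labels(sps))
print(parse_value_labels(sps))
```

Once the labels sit in dictionaries like these, writing them into definition tables in the database is straightforward.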

Based on work I did a long time ago, I was able to modify my parser to extract the information in the .sps file into the database. The code to pull the csv data into the database is largely the same as what I did for the ACS PUMS, except for a bit of reformatting and unzipping. Some of the IPEDS csv files use a dot '.' to mean NA, and some of the quoting isn't very consistent. I decided to first convert tabs in the file to spaces, then transform the file to tab-delimited, and that seems to work fine.
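As an illustration of that cleanup step, here is a small Python sketch (the sample rows are made up): tabs inside fields become spaces, a bare '.' becomes an empty NA field, and the output is tab-delimited.

```python
import csv
import io

def csv_to_tsv(text):
    """Convert inconsistently quoted CSV text to tab-delimited output,
    treating a bare '.' as NA (emitted as an empty field).
    Tabs inside fields are converted to spaces first, so the
    tab delimiter stays unambiguous."""
    out = io.StringIO()
    for row in csv.reader(io.StringIO(text)):
        cells = ['' if c == '.' else c.replace('\t', ' ') for c in row]
        out.write('\t'.join(cells) + '\n')
    return out.getvalue()

# Toy input: a quoted field containing a comma, and a '.' NA cell.
sample = 'UNITID,TUITION\n100654,"9,366"\n100663,.\n'
print(csv_to_tsv(sample))
```

The csv module handles the quoting quirks, and the resulting file bulk-loads cleanly as tab-delimited.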

Sam Barbett at IPEDS was also contacted about the release of the 'Final' data. IPEDS, in general, releases data in two phases: the first release is called provisional and the second is called final. According to Sam, when the final data are released, the new/final csv file will have a '_rv' suffix added to the file name, even though the .sps file isn't updated.
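A downstream consequence is that an import job should prefer the revised file when both are present. A tiny sketch (the folder layout and base name are hypothetical):

```python
from pathlib import Path

def pick_data_file(folder, base):
    """Prefer the revised/final '<base>_rv.csv' over the
    provisional '<base>.csv' when both were downloaded."""
    rv = Path(folder) / f"{base}_rv.csv"
    if rv.exists():
        return rv
    return Path(folder) / f"{base}.csv"
```

Since the .sps file isn't updated, the same parsed label definitions apply to both the provisional and final csv.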

The other problem I ran into is the line termination used in the downloaded .sps files. Most of my processes run under a Windows environment, where lines in a file are terminated by a Carriage-Return/Line-Feed pair instead of the single Line-Feed character used on most Unix systems. After modifying my code to accommodate that, I am currently able to process a couple of years of data without running into errors.
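One way to sidestep the CRLF/LF issue entirely is to normalize the bytes before parsing, as in this sketch:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Normalize CRLF (Windows) and bare CR line endings to LF,
    so downstream parsing sees one convention regardless of the
    platform the .sps file was produced on."""
    return data.replace(b'\r\n', b'\n').replace(b'\r', b'\n')

print(normalize_newlines(b'VARIABLE LABELS\r\n UNITID\r\n'))
```

The CRLF pair must be replaced before the bare CR, otherwise each Windows line ending would turn into two newlines.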

Overall, the process went smoothly, and the next task is to use these data effectively and efficiently.


Wednesday, May 17, 2017

NCES IPEDS data for Database/IT professionals

Personally, I am an IT professional working in an education agency. I have been in this position long enough to deal with social science researchers a lot. One interesting observation is that even though social science researchers deal with data all the time, the lack of an IT background still limits their ability to handle large amounts of data efficiently. A lot of the time, these staff rely on expensive commercial software and hardware to perform their tasks. When leading projects, they are often limited by their vision in providing and delivering efficient data products.

On the other hand, people with strong IT training can have a better vision of how things work, know the real limits of things, and set goals that others can't - I love this Elon Musk story, Simple math is why Elon Musk's companies keep doing what others don't even consider possible, where physics is said to be applied first, as the fundamental that dictates the limits. The value of real STEM training is the vision and the knowledge of limits. Applied to data processing, IT is that knowledge.

The Integrated Postsecondary Education Data System (IPEDS) refers to a set of data collected from a large set of postsecondary education institutions in the United States. The survey is conducted by the National Center for Education Statistics, and the collected data are available for anyone to use. For casual use, you can easily obtain the data you are interested in manually. However, as we all know, the real power of data multiplies when you have all of it in one place, in a ready-to-use state. Yes, most likely we are talking about a database.

Glancing over the data retrieval options offered by NCES/IPEDS, the 'Complete data files' option seems to be the best way to retrieve the whole IPEDS data set. After practicing a bit manually, you soon realize that manually selecting and downloading would still take a long time just to fetch the files, let alone import them.

With enough IT knowledge, a reasonable approach to this problem could be: 1) save the download page; 2) make minimal fixes to the page so that it conforms to XML; 3) devise a short XSLT translation script; 4) copy the translated page into the database; 5) with the list of files to download on hand, write scripts to download the files automatically.
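Steps 3 through 5 can be sketched roughly as follows. The page fragment and URLs below are invented stand-ins for the real 'Complete data files' page (which would first need the manual XML fixes from step 2), and the extraction here uses Python's XML parser rather than XSLT, just to keep the sketch self-contained.

```python
import urllib.request
import xml.etree.ElementTree as ET

# A toy stand-in for the cleaned-up download page (step 2).
# The real IPEDS page is much larger; these URLs are hypothetical.
page = """<root>
  <a href="https://example.gov/data/HD2015.zip">HD2015</a>
  <a href="https://example.gov/data/IC2015.zip">IC2015</a>
</root>"""

def list_zip_links(xml_text):
    """Steps 3-4 in miniature: pull every zip href out of the page."""
    root = ET.fromstring(xml_text)
    return [a.get('href') for a in root.iter('a')
            if a.get('href', '').endswith('.zip')]

def download_all(urls, dest='.'):
    """Step 5: fetch each file in the list to a local folder."""
    for url in urls:
        name = url.rsplit('/', 1)[-1]
        urllib.request.urlretrieve(url, f"{dest}/{name}")

print(list_zip_links(page))
```

In practice, the extracted link list would live in a database table so the download script can track what has already been fetched.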

In addition to the implementation above, a scheduling mechanism is also implemented to work within the time windows available for downloading.


Thursday, February 23, 2017

Brain Drain - 2015 ACS State Migration for Working Age

Updated on Feb. 27, 2017:
An attempt has been made to rank states by brain drain indexes based on migration. See article Brain drain - ranking and analysis with 2015 ACS data.

As will be discussed in my upcoming article about the various possible conceptions/perceptions of the idea of 'brain' in the context of 'brain drain', one definition or interpretation would be the educational attainment of the workforce, hence the age range of 22 to 64 years old.

As mentioned in my previous article, Population migration derived from ACS 2015 5-year PUMS dataset, I was in the process of producing a more detailed result from the ACS 2015 5-year PUMS dataset, and here it is. For this article, we ignore in-migration from foreign countries, which was included in the previous article. Time allowing, we will look into foreign-country migration in detail.

Basically, we look at all samples aged between 22 and 64 in the PUMS file, along with each sample's educational attainment level and state of residence a year ago. By analyzing these data, we can estimate the number of people moving in and out of a state, and with what level of educational attainment.

It happened that I attended a Tableau promotional meeting recently and decided to give it a try, even though I would have preferred an open source solution - I did try to look some up, but have not had enough time to evaluate them yet.

The rest of this article simply provides notes on the presentation, since I carry the baggage of an old IT worker who abbreviates almost everything.

First, the citation for the data source: ACS 2015-2011 5-year PUMS file, processed by Dr. Duncan Hsu.

The Brain Drain Migration between states presentation can be found at Tableau Public and below are some of the summaries: 

The first tab/page/slide, "Map - Migrated to To_State", is the in-migration map for the state of interest, specified by the right-hand dropdown control: the To_State. The map shows the number of people moving from each state to the To_State, with the educational attainment level specified by the second dropdown list, labeled EdAttnmnt. To see the numbers, hover your mouse over a state of interest. For example, the chart below shows that 724 people moved from Kansas to Nebraska, which was selected as the 'To_State'. Possible values for the EdAttnmnt dropdown are: Less than High School Degree (LssHsDgr), High School Degree or Equivalent (HsDgrEqv), Some College Experience/Course-work but no degree (SomeCllg), Associate Degree (AssctDgr), Bachelor Degree (BchlrDgr), Master Degree (Mstr), First Professional Degree (FP), and Doctor's Degree (Drs).
In-Migration to Nebraska

The second tab, "Map - Migrated out From_State", is the out-migration map for the state of interest. Operation-wise, it is very similar to the first tab. The map below shows that 421 people migrated to Iowa from Nebraska, which was selected as the 'From_State'.
Out-Migration from Nebraska

The third tab, "Map - Net migration", shows the net migration. Hovering over each state shows three numbers: the HdCnt (head count; negative for out-migration and positive for in-migration) and the upper and lower bounds at the 90% confidence level. The dropdown to the right displays 6 educational attainment levels, with Graduates (Grdts) encompassing Master, First Professional, and Doctor's degrees.
Net-Migration for Nebraska

The fourth tab is the net migration bar chart for each state, where the bar indicates the Margin of Error (MOE) at the 90% confidence level. Again, the dropdowns are on the right.
Net-Migration with 90% MOE

The fifth tab provides the data used in the fourth tab in table format, with the upper and lower bounds of the 90% MOE.
Net-Migration for Nebraska's neighboring states

The sixth tab allows selecting states with the map.
Selecting States with Map


Monday, January 30, 2017

Population migration derived from ACS 2015 5-year PUMS dataset

Now that we have got all the data imported, let's have some fun.

For those of you who know me, I have been an advocate of the open source movement for a while now. The statistics software I prefer is R. However, I have not spent a lot of time on R - I think we all understand that people have a lot of things to do and we revisit a tool only when we need to.

A couple of months ago, I spent my spare time writing quite a bit of R code, and I thought I would be right at home when I decided to take on this migration project. Boy, was I wrong... gosh. Well, I spent almost a whole day and ended up fixing some bugs - well, not really bugs: because I had decided to include the NA definition in my definition database, some problems arose when referencing those definitions from my old code. Anyway, I got it fixed but did not really use R.

Well - my IT training kicked in. I realized that instead of using statistics software for this project, a few SQL statements would largely reduce the task to nothing. Come to think of it, SQL is not only easier, it actually runs much faster: a database is designed to run from disk, unlike most statistics software, which loads all the data into memory and ties up the computer's resources. By the way, a while back I had the idea of using a database as my statistics software. I actually checked the MS SQL documentation on custom functions and, you know what, it is totally possible. Now the question is who is going to take on that project.
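To give a flavor of how little SQL this takes, here is a toy version using SQLite from Python. The table layout and rows are made up, loosely modeled on PUMS columns: current state, state of residence a year ago, education level, and the person weight.

```python
import sqlite3

# A toy person-level table in the spirit of the PUMS records.
con = sqlite3.connect(':memory:')
con.execute("""CREATE TABLE pums
               (st TEXT, migsp TEXT, edu TEXT, pwgtp INTEGER)""")
con.executemany("INSERT INTO pums VALUES (?,?,?,?)", [
    ('NE', 'KS', 'BchlrDgr', 40),
    ('NE', 'KS', 'BchlrDgr', 35),
    ('NE', 'NE', 'BchlrDgr', 50),   # non-mover, excluded below
    ('IA', 'NE', 'Mstr',     20),
])

# Migration flows: sum the weights of people whose state a year
# ago differs from their current state, grouped by flow and
# education level.
rows = con.execute("""
    SELECT migsp, st, edu, SUM(pwgtp)
    FROM pums
    WHERE migsp <> st
    GROUP BY migsp, st, edu
    ORDER BY migsp
""").fetchall()
print(rows)
```

One GROUP BY replaces what would be a fair amount of data-frame bookkeeping, and the database never needs to hold the whole file in memory.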

Anyway, I ended up running a few SQL statements and dumping the results into Excel with a bunch of formulas - sorry, I haven't really invested in Open Office yet.

OK, let's get back to the topic. The American Community Survey is conducted by the US Census Bureau on an annual basis. The PUMS file is sampled from the collected data and allows users to derive results that aren't readily tabulated by the US Census Bureau.

Inside the ACS survey, there is a question that asks respondents where they lived a year ago. Based on this question, we can look into the PUMS data and derive some useful information. One interesting application is combining this question with the educational attainment of the respondents. This allows data analysts to see, for people moving out of a state, what kind of education those people acquired - and, hence, the brain drain if highly educated people leave a state.

Click here for the resulting file. Please note that any result derived from sampling has associated errors; this file does not come with the 'margins of error', which describe the range in which the real value may lie. In our case, with a large enough margin of error, the real value for an in-migration could end up negative and, hence, actually be an out-migration. So the file is for reference only. The author is working on consolidating some of the categories and, hopefully, can report data with reasonable 'margins of error'.
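For reference, the ACS PUMS files ship 80 replicate weights (PWGTP1-PWGTP80) for exactly this purpose: the documented successive difference replication formula turns the 80 replicate estimates into a standard error, and the 90% confidence multiplier is 1.645. A sketch with made-up numbers:

```python
import math

def acs_moe(full_estimate, replicate_estimates, z=1.645):
    """90% margin of error for an ACS PUMS estimate using the
    80 replicate-weight estimates (successive difference
    replication): variance = (4/80) * sum((rep_r - full)^2)."""
    var = (4 / 80) * sum((r - full_estimate) ** 2
                         for r in replicate_estimates)
    return z * math.sqrt(var)

# Toy numbers: a full-weight estimate of 724 movers, with a few
# hypothetical replicate estimates repeated to fill 80 slots.
reps = [730, 718, 741, 705] * 20
print(round(acs_moe(724, reps), 1))
```

If the resulting MOE is larger than the estimate itself, the true value could be negative - which is precisely the in-migration/out-migration ambiguity described above.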



Wednesday, January 25, 2017

ACS 2015 5-year PUMS for database/IT professional

Continuing with the ACS PUMS database project, the task is to import the 5-year ACS PUMS product of 2015.

Comparing Census's 2015 Data Dictionary file (PUMS_Data_Dictionary_2011-2015.txt) to that of 2014, we again noticed that the Census added leading spaces to a lot of lines, possibly to make the file more readable for 'human' users. Following the steps used for the 2015 1-year file, I removed those leading spaces via sed before processing it with my definition processor.

Processing the Data Dictionary file with my program yields the following parsing errors. Some are clearly unintended errors; others may just be because the Census did not spend the time and effort to establish clear syntax rules so that their products would be machine friendly. Here are the parsing errors:

        - value 1001264: the blank after '1001264' is actually an A0h instead of a regular space
        - just before TEN: a two-line 'NOTE:'
        - no blank line after the 'PERSON RECORD' section mark
        - value 100264: the blank after it is actually an A0h instead of a regular space
        - just before GCL: a two-line 'NOTE:'
        - no empty line before FPINCP

The A0h one is really interesting. For those interested, A0h is the NBSP (non-breaking space) character used in HTML. Without a good hex editor, it took me a lot of effort to figure out what was going wrong.
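A few lines of Python can substitute for the hex editor here: scan the raw bytes for anything that is neither printable ASCII nor ordinary whitespace, and an invisible A0h gives itself away immediately.

```python
def find_suspect_bytes(data: bytes, ok=b'\t\r\n'):
    """Report (offset, hex value) for bytes that are neither
    printable ASCII nor ordinary whitespace - this is how an
    invisible A0h (non-breaking space) shows up."""
    return [(i, hex(b)) for i, b in enumerate(data)
            if not (0x20 <= b < 0x7f) and b not in ok]

# A made-up dictionary line with an A0h where a space should be.
line = b'1001264\xa0.Record count'
print(find_suspect_bytes(line))
```

Running this over the whole dictionary file pinpoints every offending byte by offset, which beats eyeballing hex dumps.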


Wednesday, December 28, 2016

ACS 2015 1-year PUMS for database/IT professional

Continuing with the ACS PUMS database project, the task is to import the 1-year ACS PUMS product of 2015.

Comparing Census's 2015 Data Dictionary file (PUMSDataDict15.txt) to that of 2014, we noticed that the Census added leading spaces to a lot of lines, possibly to make the file more readable for 'human' users. After some thought, I decided to simply remove those leading spaces via sed before processing it with my definition processor. I also noticed that in this year's file there are a lot of value definitions spanning more than one line, and I decided to modify my program to adapt to that.
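The preprocessing can also be done in a few lines of Python instead of sed. The continuation rule below is an assumption for illustration: an indented line that does not look like the start of a new 'value .description' entry is folded into the previous line.

```python
import re

def preprocess_dictionary(text):
    """Mimic the sed step: strip leading spaces, and fold a value
    definition that spans two lines back onto one line. The
    continuation heuristic (indented and not starting a new
    'value .description' entry) is an assumption, not the exact
    rule my definition processor uses."""
    lines = []
    for raw in text.splitlines():
        stripped = raw.lstrip()
        is_continuation = (raw.startswith(' ')
                           and not re.match(r'\S+\s*\.', stripped)
                           and lines)
        if is_continuation:
            lines[-1] += ' ' + stripped
        else:
            lines.append(stripped)
    return '\n'.join(lines)

# A made-up two-line value definition in the dictionary's style.
sample = ("WAGP 6\n"
          "    0 .N/A (less than 16 years\n"
          "        old)\n")
print(preprocess_dictionary(sample))
```

After this pass, each value definition sits on a single left-aligned line and the rest of the parser can stay simple.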

Processing the Data Dictionary file with my program yields the following parsing errors. Some are clearly unintended errors; others may just be because the Census did not spend the time and effort to establish clear syntax rules so that their products would be machine friendly. Here are the parsing errors:

      - An unrecognized line before DIALUP
        - it reads 's line intentionally blank; content continues.'
      - A two-line 'Note:' before TEN
      - An unrecognized line before PARTNER
        - it reads 's line purposely blank; content continues.'
      - The NWAB, NWAV, NWLA, NWLK, NWRE variables deviate from the format of the others
        - extra text after the variable size
      - PAP: a blank line before 00001..99999
        - this is an obvious error
      - NAICSP bbbbbbbb: no space before the description
        - in most value definitions, there is a separating space between the value and the description
      - NAICSP 928110P1 to 928110P7: no space before the description
        - same as above
      - Just before the end(*), a 'Note' line instead of a 'Note:' line
        - a Note, not a Note:

Since there were only a few issues, I was able to manually edit the file and make it machine parseable/processable.