SocialPond

Things about society.

Monday, January 30, 2017

Population migration derived from ACS 2015 5-year PUMS dataset


Now that we have got all the data imported, let's have some fun.

For those of you who know me, I have been an advocate of the open-source movement for a while now. The statistical software I prefer is R. However, I had not spent a lot of time on R lately - I think we all understand that people have a lot of things to do, and we revisit a tool only when we need to.

A couple of months ago, I spent my spare time writing quite a bit of code in R, so I thought I would be right at home when I decided to take on this migration project. Boy, was I wrong about that... gosh. I spent almost a whole day and ended up fixing some of the bugs - well, not really bugs: because I had decided to include the NA definitions in my definition database, my old code ran into problems when referencing those definitions. Anyway, I got it fixed, but did not really end up using R.

Then my IT training kicked in - I realized that instead of using statistical software for this project, a few SQL statements would reduce the task to almost nothing. Come to think of it, SQL is not only easier, it actually runs much faster: a database is designed to work from disk, whereas most statistical software loads all the data into memory and ties up the computer's resources. By the way, a while back I had the idea of using a database as my statistical software. I checked the MS SQL documentation on user-defined functions and, you know what, it is totally possible. Now the question is who is going to take on that project.
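To give a flavor of what those statements look like - assuming the person records have been loaded into a table named pums_p_2015_5yr (the table name is mine for illustration; the columns ST, MIG, MIGSP and PWGTP are from the PUMS record layout) - a weighted tally of inter-state movers is a single query:

    -- Estimated in-migrants per state: people who lived in a different
    -- US state one year ago.  Summing the person weight PWGTP gives a
    -- population estimate rather than a raw sample count.
    SELECT st, migsp, SUM(pwgtp) AS est_in_migrants
    FROM   pums_p_2015_5yr
    WHERE  mig = 3                    -- lived in a different house in the US/PR
      AND  migsp BETWEEN 1 AND 56    -- state of residence one year ago
      AND  migsp <> st               -- ...and it was a different state
    GROUP  BY st, migsp
    ORDER  BY st, migsp;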

Anyway, I ended up running a few SQL statements and dumping the results into Excel with a bunch of formulas - sorry, I have not really invested in OpenOffice yet.

OK, let's get back to the topic. The American Community Survey is conducted by the US Census Bureau on an annual basis. The PUMS file is a sample drawn from the collected data; it allows users to derive results that are not readily tabulated by the Census Bureau.

Inside the ACS survey, there is a question that asks respondents where they lived a year ago. Based on this question, we can dig into the PUMS data and derive some useful information. One interesting application of this question is to combine it with the educational attainment of the respondents. That lets a data analyst see, for the people moving out of a state, what kind of education they have acquired - and hence gauge the brain drain when highly educated people leave a state, as sketched below.
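Sketching that in SQL, with the same hypothetical table as above: SCHL is the educational-attainment code from the 2015 data dictionary (codes 21 through 24 run from bachelor's degree to doctorate), and the two-bucket grouping is simply my illustration:

    -- Out-migration by education: for each state that people left (MIGSP),
    -- estimate how many of the movers hold at least a bachelor's degree.
    SELECT migsp AS state_left,
           CASE WHEN schl >= 21 THEN 'BA or higher' ELSE 'below BA' END
                                                    AS education,
           SUM(pwgtp)                               AS est_out_migrants
    FROM   pums_p_2015_5yr
    WHERE  mig = 3
      AND  migsp BETWEEN 1 AND 56
      AND  migsp <> st
    GROUP  BY migsp,
           CASE WHEN schl >= 21 THEN 'BA or higher' ELSE 'below BA' END
    ORDER  BY migsp, education;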

Click here for the resulting file - please note that any result derived from sampling comes with an associated error, and this file does not include the 'margin of error', which describes the range in which the real value may lie. In our case, with a large enough margin of error, the real value of an in-migration figure could end up negative and thus actually represent an out-migration. So the file is for reference only. The author is working on consolidating some of the categories and, hopefully, will then be able to report data with reasonable margins of error.
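For what it is worth, the PUMS files ship with 80 replicate weights (PWGTP1-PWGTP80) for exactly this purpose. Per the Census Bureau's 'Accuracy of the PUMS' documentation, the direct standard error of a weighted estimate X, and the 90-percent margin of error derived from it, are

      SE(X)  = sqrt( (4/80) * sum over r = 1..80 of (Xr - X)^2 )
      MOE(X) = 1.645 * SE(X)

where Xr is the same estimate recomputed with replicate weight r - in SQL terms, the query above rerun with SUM(PWGTPr) in place of SUM(PWGTP).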


Wednesday, January 25, 2017

ACS 2015 5-year PUMS for database/IT professionals

Continuing with the ACS PUMS database project, the task this time is to import the 2015 5-year ACS PUMS product.

Comparing the Census Bureau's 2015 Data Dictionary file (PUMS_Data_Dictionary_2011-2015.txt) to that of 2014, we again noticed that leading spaces had been added to a lot of lines, possibly to make the file more readable for 'human' users. Following the steps used for the 2015 1-year file, I removed those leading spaces via sed before processing the file with my definition processor.
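A one-liner along these lines does that kind of stripping (the output filename is only an illustration):

      sed 's/^ *//' PUMS_Data_Dictionary_2011-2015.txt > dict_2015_5yr.txt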

Processing the Data Dictionary file with my program yielded the following parsing errors. Some of them are clearly unintended mistakes; others may simply reflect the fact that the Census Bureau has not spent the time and effort to establish clear syntax rules that would make its products machine friendly. Here are the parsing errors:


      ADJINC
        value 1001264, the blank after '1001264' is actually an A0h instead of
          20h.
      TEN
        just before TEN, a two-line 'NOTE:'
      PERSON RECORD
        no blank line after the 'PERSON RECORD' section mark
      ADJINC
        value 1001264, the blank after '1001264' is actually an A0h instead of
          20h.
      GCL
        just before GCL, a two-line 'NOTE:'
      FPINCP
        no empty line before FPINCP


The A0h one is really interesting. For those interested, A0h is the NBSP (non-breaking space) character used in HTML. Without a good hex editor, it took me a lot of effort to figure out what was going wrong.
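If you run into the same thing, one way to normalize the file before parsing - assuming a single-byte encoding such as Latin-1, since in UTF-8 the NBSP is the two-byte sequence C2h A0h - is a tr one-liner (the output filename is only an illustration):

      tr '\240' ' ' < PUMS_Data_Dictionary_2011-2015.txt > dict_fixed.txt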

