/*
This do-file covers OLS/GLS estimation and basic test commands.
There are five main steps in this do-file:
1- Load the data with the command "use".
2- Code the variables with the commands "gen" and "replace".
3- Perform the regression with the command "reg".
4- Perform some statistical tests.
5- Draw some graphs as examples.
*/

clear //Clear everything that was previously in memory
set mem 50m //Set Stata's memory to 50 megabytes (ignored by Stata 12 and later, which manage memory automatically).

global DIRECTORY = "/Users/pabsta/Documents/2-Enseignement/ECON452/tutorial2/"
cd "$DIRECTORY"

//LOAD THE DATA
//SLID: Survey of Labour and Income Dynamics (StatCan survey available through ODESI)
use http://www.pabsta.qc.ca/files/ECON452/DATA/slid2008.dta //Downloading this may take a while, so it is better to put it in a common directory in DUN350
/*
The name must specify the full path.
The loaded database is the Survey of Labour and Income Dynamics (SLID).
This is a monthly survey of Canadians about their income and their occupation.
*/

//CODING SOME NEW VARIABLES
/*
Now that the database is loaded, we generate new variables to run the Mincer regression.
Now is a good time to look at how Stata represents the data in memory with the data browser.
It is basically a giant 'Excel file' (a matrix) that holds the data.
Each row is an observation, each column is a variable.

The following code builds the dependent and the independent variables.
You will notice some recurring patterns.
*/

//The command below replaces some values:
replace ttinc42 = . if(ttinc42 >= 99999996)
/*
Here, all observations of the variable ttinc42 (total income) that are greater than or equal to 99999996 are replaced by a period (".").
The qualifier "if(ttinc42 >= 99999996)" tells Stata to make this replacement only where the condition is met.
The period is the code Stata uses for a missing value, i.e. to indicate that there is actually no observation.
A value of 99999996 or higher is used by Statistics Canada to indicate that the respondent did not answer this question.
Hence, this command is a simple conversion of conventions: it replaces the code used by StatCan with the code used by Stata.
*/

//The command "gen" generates a new variable, i.e. a new column in Stata's matrix:
gen lnttinc42 = ln(ttinc42)
/*
The name of the variable is lnttinc42 and it is equal to the logarithm of ttinc42, which is the TOTAL INCOME of each observation.
This is the dependent variable in the regression we wish to perform.

The following commands, mostly "replace", change the variables in the dataset in a similar fashion.
*/

/*
The variable cmphi18 encodes high-school graduation for each observation.
If the variable is equal to one, the respondent is a high-school graduate.
If the variable is equal to two, the respondent is not a high-school graduate.
Any value higher than 5 indicates that the respondent refused to answer.
Hence, the following code converts the variable into an "indicator variable":
*/
replace cmphi18 = . if(cmphi18 >= 6)
replace cmphi18 = 0 if(cmphi18 == 2)

//The story is exactly the same for the following variables.
//The variable dgcoll18 encodes whether a respondent ever received a post-secondary but non-university degree.
replace dgcoll18 = . if(dgcoll18 >= 6)
replace dgcoll18 = 0 if(dgcoll18 == 2)

//The variable dguniv18 encodes whether a respondent ever received a university degree.
replace dguniv18 = . if(dguniv18 >= 6)
replace dguniv18 = 0 if(dguniv18 == 2)

//The variable ecsex99 encodes the sex of each respondent.
replace ecsex99 = . if(ecsex99 >= 6)
replace ecsex99 = 0 if(ecsex99 == 2)

//The variable ecage26 encodes the age of each respondent.
//Values of 996 or higher are used by StatCan to indicate no response.
replace ecage26 = . ///
if(ecage26 >= 996)

//GENERATING NEW VARIABLES
gen age2 = ecage26^2 //Age squared
gen ageddgcoll18 = ecage26*dgcoll18 //Interaction variable between age and college
gen ageddguniv18 = ecage26*dguniv18 //Interaction variable between age and university
gen aged2DGUNI18 = age2*dguniv18 //Interaction variable between age squared and university
gen aged2dgcoll18 = age2*dgcoll18 //Interaction variable between age squared and college

//PERFORMING THE REGRESSION
/*
The command below ("reg") performs the regression and displays the output in Stata.
The first variable is the dependent variable and the others are independent variables.
A star ("*") after some text acts as a wildcard.
Hence, all variables that start with "aged" are added to the regression.
This is useful if one wishes to include many variables without typing all of them.
The part of the command between square brackets indicates which type of weights is used ("aw", analytic weights) and which variable in the dataset holds the weights (in this case wtcsld26).
*/
reg lnttinc42 cmphi18 dguniv18 dgcoll18 ecage26 age2 aged* [aw= wtcsld26] if(pvreg25 == 24)

/*
A great document to help with the interpretation of a regression can be found here:
http://www.pabsta.qc.ca/sites/default/files/351memo1.pdf
*/

//Test some coefficients to see if they are jointly equal to zero.
test ageddguniv18 aged2DGUNI18 //We should drop them
reg lnttinc42 cmphi18 dguniv18 dgcoll18 ecage26 age2 ageddgcoll18 aged2dgcoll18 [aw= wtcsld26] if(pvreg25 == 24)
test ageddgcoll18 aged2dgcoll18 //We should keep them
test ageddgcoll18 //Drop it (this could be seen directly from the output)
test dguniv18 = dgcoll18 //Same effect for university and college? No.

//'Final' regression:
reg lnttinc42 cmphi18 dguniv18 dgcoll18 ecage26 age2 ageddgcoll18 [aw= wtcsld26] if(pvreg25 == 24)

//AN EXAMPLE OF A LIKELIHOOD RATIO TEST.
estimates store estRestricted //Store the restricted estimates (last regression)
reg lnttinc42 cmphi18 dguniv18 dgcoll18 ecage26 age2 aged* [aw= wtcsld26] if(pvreg25 == 24) //First regression (unrestricted estimates, since all coefficients are there)
estimates store estUnrestricted
lrtest estUnrestricted estRestricted
//Note that in the case of regressions, this is equivalent to:
//di (ln(8000.21448)-ln(7997.76395))*8956

//GENERATING A NICE GRAPH AS IN THE FIRST WEEK'S SLIDES
//This command generates a new variable based on the fitted values.
predict estimatedWages, xb
//Since we are interested in the wages (and not the log of wages), we exponentiate the variable.
replace estimatedWages = exp(estimatedWages)

//Graph of the predicted wages for men in Québec
graph twoway scatter estimatedWages ecage26 if(pvreg25 == 24 & ecsex99==1 & cmphi18==1)

/*
It would be nice to have those different curves in different colors to see which group earns more.
Let's build different variables.
*/
gen prWageUniv = estimatedWages if(pvreg25==24 & ecsex99==1 & cmphi18 == 1 & dgcoll18 ==1 & dguniv18==1)
gen prWageUnivNoColl = estimatedWages if(pvreg25==24 & ecsex99==1 & cmphi18 == 1 & dgcoll18 ==0 & dguniv18==1)
gen prWageColl = estimatedWages if(pvreg25==24 & ecsex99==1 & cmphi18 == 1 & dgcoll18 ==1 & dguniv18==0)
gen prWageHS = estimatedWages if(pvreg25==24 & ecsex99==1 & cmphi18 == 1 & dgcoll18 ==0 & dguniv18==0)

//Same graph with different colors
sort ecage26
graph twoway line prWageUniv prWageUnivNoColl prWageColl prWageHS ecage26 if(pvreg25==24 & ecsex99==1)
//The legend could be clarified: use the graph editor.

//Save the graph:
capture mkdir OUTPUT
/*
The capture command suppresses Stata's output, including any error message.
If the directory OUTPUT has already been created, mkdir ("make directory") yields an error.
If not, there is no error.
Since my goal is to save in the directory OUTPUT, this error does not matter here.
Hence, I remove it from the output.
Capture should be used sparingly, though.
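*/
//A .gph file is Stata's own graph format and is written with "graph save"; "graph export" is for formats such as .png or .pdf.
graph save OUTPUT/predictedWageMen.gph, replace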
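
//A HEDGED SKETCH: THE LIKELIHOOD RATIO STATISTIC COMPUTED BY HAND
/*
The note after lrtest above types in three numbers, which appear to be the two residual sums of squares and the sample size.
The lines below are a minimal sketch of the same calculation, reading those quantities from Stata's stored results instead.
Assumptions: both models were stored above as estRestricted and estUnrestricted, and both were estimated on the same sample.
The scalar names (rssR, rssU, dfR, dfU, nObs, lrStat) are illustrative and are not part of the original tutorial.
*/
estimates restore estRestricted
scalar rssR = e(rss) //Residual sum of squares of the restricted model
scalar dfR = e(df_m) //Model degrees of freedom of the restricted model
estimates restore estUnrestricted
scalar rssU = e(rss) //Residual sum of squares of the unrestricted model
scalar dfU = e(df_m) //Model degrees of freedom of the unrestricted model
scalar nObs = e(N) //Number of observations used in the regression
scalar lrStat = nObs*(ln(rssR) - ln(rssU)) //LR statistic for a linear regression
di "LR statistic: " lrStat
di "p-value: " chi2tail(dfU - dfR, lrStat) //Chi-squared with as many degrees of freedom as dropped coefficients
//The two displayed numbers should be close to what lrtest reports; if they are not, check that both models use the same observations.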