Assignment2_2016.docx Page 1 of 5
RESEARCH SCHOOL OF FINANCE, ACTUARIAL STUDIES AND STATISTICS
College of Business & Economics, The Australian National University
REGRESSION MODELLING (STAT2008/STAT4038/STAT6038)
Assignment 2 for 2016
Instructions
This assignment is worth 20% of your overall marks for your course (for all students enrolled in STAT2008, STAT4038 or STAT6038). If you wish, you may work together with another student in doing the analyses and present a single (joint) report. If you choose to do this, then both of you will be awarded the same total mark. Students enrolled under different course codes may work together. You may NOT work in groups of more than two students and the usual ANU examination rules on plagiarism still apply with respect to people not in your group.
Research School of Finance, Actuarial Studies and Statistics assignment cover sheets are available on Wattle. Please complete and attach a copy of the cover sheet to the front of your report. Remember to keep a copy of your assignment.
Assignments should be written, typed or printed on sheets of A4 paper stapled together at the top left-hand corner (do NOT submit the assignment in plastic covers or envelopes).
Your assignment may include some carefully edited computer output (e.g. graphs) showing the results of your data analysis and a discussion of those results. Please be selective about what you present: only include as many pages and as much computer output as necessary to justify your solution, and be concise in your discussion of the results. Clearly label each part of your report with the question number and the part of the question that it refers to. Unless otherwise advised, use a significance level of 5%.
Marks may be deducted if these instructions are not strictly adhered to, and marks will certainly be deducted if the total report is of an unreasonable length, i.e. more than 12 pages including graphs. You may include, as an appendix, any R commands you used to produce your computer output. This appendix and the cover sheet are in addition to the above page limits, but the appendix will generally not be marked, only checked if there is some question about what you have actually done.
Assignments will be marked by your allocated tutor. Assignments should be submitted in the assignment box labelled with your course code (STAT2008 or STAT4038/STAT6038) and your tutor's name, located next to the Research School of Finance, Actuarial Studies and Statistics office, by 3 pm on Friday 20 May 2016.
You may ask one of the tutors or me (Ian McDermid) questions about this assignment, in person, up to the deadline (3 pm on Friday 20 May 2016), after which we will NOT answer any further questions about this assignment until after the marked assignments have been returned to students. Written questions sent to me via e-mail or posted on Wattle will be answered on Wattle, but must be received no later than 12 noon on Thursday 19 May 2016.
Late assignments will NOT be accepted after the deadline without an extension. Extensions will usually be granted on medical or compassionate grounds on production of appropriate evidence, but must have my permission by no later than 12 noon on Thursday 19 May 2016. Even with an extension, all assignments must be received no later than 9 am on Monday 23 May 2016, when the solutions to this assignment will be released on Wattle. These solutions will be discussed in the tutorials in week 13 (Monday 23 to Friday 27 May 2016).
Data
Many of the projects I have worked on as a statistician have involved data that was considered private (such as health data) or data to which access was restricted (for example, data that was designated commercial-in-confidence). For these reasons, it is not always easy to source realistic data for use in teaching statistics, and so groups of statisticians maintain repositories of examples of real data that are in the public domain. In many countries, there are Internet repositories of data available for use in the teaching of introductory statistics. The data to be used in this year's assignments come from two such repositories: the data archive associated with the Journal of Statistics Education (JSE), a publication of the American Statistical Association (www.amstat.org/publications/jse/jse_data_archive.htm); and OzDASL, the Australasian Data and Story Library, which is part of the Statistical Science Web (www.statsci.org/data/index.html), maintained by Gordon Smyth, a member of the Statistical Society of Australia.
Datasets in the JSE data archive are typically accompanied by a file which gives a description of the variables included in the data (the meta-data) and are also often accompanied by an associated article in the journal (and occasionally even by references to other sources). The cigarettes data, which we will be using in question 1 of this year's assignments, includes both of the above accompanying documents. You can download a text file containing the cigarettes data and the associated documents from the JSE website (www.amstat.org/publications/jse/jse_data_archive.htm), or the data is also available on Wattle in the file cigarettes.csv, which includes a header row with the variable names. I have also downloaded a copy of the meta-data text file (cigarettes.txt), and made this file available on Wattle.
Datasets in OzDASL also typically come with some resources. The meta-data for the stroke data, which we will be using in question 2 of this year's assignments, is available on the Statistical Science Web (http://www.statsci.org/data/oz/strokeass.html), where you can also download a text file containing the data. The stroke data is also available on Wattle in the file stroke.csv and the meta-data description is available in the file stoke.pdf.
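As a starting point, the two Wattle data files could be read into R along the following lines. This is only a minimal sketch: it assumes the files have been saved in the current working directory, and it assumes that stroke.csv has a header row (this is stated above only for cigarettes.csv). The helper name read_assignment_data and the object names cigarettes and stroke are illustrative and not part of the assignment.

```r
# Hypothetical helper: read one of the assignment CSV files, if it is present.
# Assumption: the file has been saved in the current working directory.
read_assignment_data <- function(path) {
  if (!file.exists(path)) {
    warning("File not found: ", path, " (download it from Wattle first)")
    return(NULL)
  }
  # header = TRUE: cigarettes.csv is described as having a header row;
  # for stroke.csv this is an assumption that should be checked.
  read.csv(path, header = TRUE)
}

cigarettes <- read_assignment_data("cigarettes.csv")
stroke <- read_assignment_data("stroke.csv")

# Inspect the variable names and types (only if the files were found)
if (!is.null(cigarettes)) str(cigarettes)
if (!is.null(stroke)) str(stroke)
```

Checking the output of str() against the meta-data files is a quick way to confirm that each variable has been read with the expected name and type.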
Question 1 (20 marks)
The United States Federal Trade Commission annually rates varieties of domestic cigarettes according to their tar, nicotine, and carbon monoxide content. The US Surgeon General considers each of these substances hazardous to a smoker's health. Past studies have shown that increases in the tar and nicotine content of a cigarette are accompanied by an increase in the carbon monoxide emitted from the cigarette smoke. The file cigarettes.csv, available on Wattle, contains a summary of the data collected in one year for 25 brands of cigarettes and includes the variables: Brand name (Brand); Tar content in milligrams (Tar); Nicotine content in milligrams (Nicotine); Weight in grams (Weight); and Carbon monoxide content in milligrams (CO). In this assignment, we would like to use all of the available variables to build a multiple regression model to examine the factors that affect the amount of carbon monoxide emitted by these cigarettes.
(a) Produce a pair-wise scatterplot matrix for the cigarettes data. Comment on the relationships shown in this scatterplot matrix, assuming that CO is going to be the response variable for your multiple regression analysis. What type of variable is Brand name (Brand)? Can Brand be sensibly included as an explanatory variable in your multiple regression model? (3 marks)
(b) Using CO as the response variable and Tar, Nicotine and Weight as explanatory variables, fit a multiple regression model. Experiment with the order in which you fit the explanatory variables in the model and examine the ANOVA table for each model. Why do the models change so dramatically depending on the order in which you include the explanatory variables? Present an appropriate ANOVA table and conduct a nested F test to determine if Nicotine and Weight are a significant addition to a model that already includes Tar.
(3 marks)
(c) Find a multiple regression model with CO as the response variable, which includes all three of Tar, Nicotine and Weight as predictors, but in an order where all three have significant sequential F-tests in the ANOVA table. Call this model A. Present the ANOVA table and the summary output for model A. Interpret the coefficients of the explanatory variables in model A and the various F-tests and t-tests shown in the summary output for model A. What does model A suggest is happening to the emitted carbon monoxide as weight and nicotine increase? Does this make sense? (4 marks)
(d) For model A, present a plot of the internally Studentised residuals against the fitted values, a normal quantile plot and a bar plot of Cook's distances. Is there a problem with potential outliers with this model? What other diagnostics can you produce to investigate these potential outliers? (Note: this is not an invitation to produce a large amount of R output; choose just one or two additional relevant plots or summaries and discuss any output that you do produce.) (3 marks)
(e) Now apply a natural log transformation to all of the variables included in model A and re-fit model A to these log transformed variables. Present the same plots and other diagnostics you produced for model A in part (d) for this new transformed model. Discuss the differences between this new output and the output in part (d). Has the transformation solved the problems with potential outliers? (4 marks)
(f) Finally, without presenting a lot of additional R output, discuss which of the possible models (including the model we fitted and discussed in Assignment 1) you would recommend for these data and why. What was probably the underlying research question when these data were collected (i.e. what do you think the researchers were really interested in)?
Does your chosen model really address this underlying question? (3 marks)
Question 2 (20 marks)
The dataset stroke contains data from a pilot study of twenty patients selected from two large public hospitals in Brisbane. All twenty patients had recently suffered a cerebrovascular accident resulting in hemiplegia lasting at least 24 hours, had not previously been incapacitated from stroke or other disease and were currently receiving occupational therapy. The pilot study collected a number of variables which could be used as evaluation tools for assessing the recovery of patients who had recently suffered a stroke.
Following on from Assignment 1, the client is now interested in building a series of multiple regression models to assess recovery from stroke with Barthel as the response variable. The client believes that a patient's age (variable Age, measured in years), sex and the side of the brain affected by the stroke are all important factors that potentially affect recovery. These variables must be included in any multiple regression model that you fit, to control for their possible effects and so the client can assess these effects. The client is also interested in the areas measured by the Bobath Assessment Form, but (following your report on Assignment 1) would prefer to use the related Goteburg Assessment Form, which is divided into seven components (variables Arms, Legs, Hands, Balance, Sensation, JointPain and JointMotion, which are further described in the file stoke.pdf).
(a) Create two new indicator variables, Female (which equals 1 if Sex = F and 0 otherwise) and Right (which equals 1 if Side = R and 0 otherwise) and fit the multiple regression model of Barthel on Age, Female and Right. Do the residual plots for this model show any obvious problems? Check the plots and answer the question, but only present one of the plots if you want to argue that there is a problem.
Present a bar plot of the leverage values for this model. Which observation stands out as having relatively high leverage? What is different about this observation? (3 marks)
(b) Present the ANOVA table and the summary output for the model in part (a). Interpret the various F-tests and t-tests shown in this output (you do not need to present formal hypothesis tests). What do the signs of the estimated regression coefficients suggest about the effects of age, sex and location of the stroke as predictors of Barthel as a measure of recovery from stroke? (3 marks)
(c) We could potentially include any of the variables in the stroke data as explanatory variables in a multiple regression model, but what is the obvious problem with including the variable Subject? Given the results of Assignment 1, is there a problem with including Kenny as a possible predictor for Barthel? If we are going to fit a multiple regression that already includes some demographic variables (Age and Female), some classification variables (Right) and some of the seven components of the Goteburg Assessment as predictors, is there a problem with also including Bobath as a predictor? (3 marks)
(d) The variable Lapse (the time since the occurrence of the stroke in weeks) might be another factor that affects recovery. Present an added variable plot for Lapse as a possible addition to the model in part (a). Also present the added variable plot for log(Lapse). Use these plots to comment on the inclusion of Lapse as a possible predictor. (3 marks)
(e) Now experiment with including some or all of the seven components of the Goteburg Assessment as additional predictors in the model in part (a) (note that the model should already include Age, Female and Right, regardless of their significance).
You should treat the seven variables from the Goteburg Assessment (Arms, Legs, Hands, Balance, Sensation, JointPain and JointMotion) as observational covariates and only include them if they significantly improve the model. Remember that the order in which you include them will be important, so experiment with including these variables in different orders. [To make this a little easier, we will observe a few restrictions. Do NOT include the variables Subject, Sex, Side, Kenny, Bobath or Lapse in your model. Use variables only on their original scales, i.e. do NOT use transformations. Do NOT include any higher order terms (e.g. interactions or quadratic terms). Do NOT delete
any observations as potential outliers.] Present full details of a nested model F-test that compares your chosen model with the full multiple regression model of Barthel on Age, Female, Right, Arms, Legs, Hands, Balance, Sensation, JointPain and JointMotion. (3 marks)
(f) Now delete one (but only one) potential outlier from your chosen model in part (e). Which observation do you choose to delete and why? Present the details of a test to decide if this observation is really an influential outlier. Choose one of the residual plots and present two versions of this plot: one before and one after the deletion of this observation. Discuss your results. (3 marks)
(g) Present the ANOVA table and the summary output for your chosen model from either part (e) or part (f), depending on whether or not you consider the deletion of the potential outlier to be warranted. Interpret the various F-tests and t-tests shown in this output (again, you do not need to present formal hypothesis tests). Discuss the overall fit of this model and what the model suggests about the underlying research question. (2 marks)
(h) Finally, suppose we forget about all of the restrictions in part (e). See if you can find a better model for these data and present a concise summary of your results. This last bit is optional and not worth any marks, but if you do a good job and don't bore the client in the process (in this instance, your tutor, who has to mark your assignment), they may be impressed and give you back some of the marks that you may have lost elsewhere in this question (you will also need to respect the overall page limit, as your tutor may simply stop reading once you have exceeded that limit). (0 marks)
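Both questions ask for a nested model F-test (Question 1(b) and Question 2(e)). The sketch below illustrates the mechanics only, using simulated data rather than the assignment data; the variable names y, x1, x2 and x3 and all of the simulated values are purely illustrative assumptions. It computes the statistic F = [(SSE_reduced - SSE_full) / (df_reduced - df_full)] / [SSE_full / df_full] by hand and compares it with the result that anova() reports for the same pair of models.

```r
# Simulated data for illustration only (NOT the assignment data)
set.seed(1)
n <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)

# Reduced model is nested within the full model
reduced <- lm(y ~ x1)
full <- lm(y ~ x1 + x2 + x3)

# Residual sums of squares and residual degrees of freedom
sse_r <- sum(resid(reduced)^2)
df_r <- df.residual(reduced)
sse_f <- sum(resid(full)^2)
df_f <- df.residual(full)

# Nested F statistic and its p-value, computed by hand
f_stat <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
p_value <- pf(f_stat, df_r - df_f, df_f, lower.tail = FALSE)

# The same test as reported by anova() for the two nested models
tab <- anova(reduced, full)
print(tab)
```

The hand calculation is useful when the details of the test have to be written out in full, while anova(reduced, full) gives a quick check of the arithmetic. The diagnostics mentioned in the questions are available in base R for a fitted lm object through rstandard() (internally Studentised residuals), hatvalues() (leverage) and cooks.distance().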