Statistics

NB: The Excel sheet with explanations must be submitted as an attachment.

Q.1 Let X and Z be independent and normally distributed, N(0, 1). Set Y = Z - X. Show that

$$E(XY) = -1$$

and that the (theoretical) correlation coefficient between X and Y is

$$\rho(X, Y) = -\frac{1}{\sqrt{2}} \approx -0.707$$
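One possible route through the derivation is sketched below. It uses only the independence of X and Z and the N(0, 1) moments E(X) = E(Z) = 0 and var(X) = var(Z) = 1.

```latex
\begin{aligned}
E(XY) &= E\bigl(X(Z - X)\bigr) = E(X)\,E(Z) - E(X^2) = 0 - 1 = -1, \\
\operatorname{cov}(X, Y) &= E(XY) - E(X)\,E(Y) = -1 - 0 = -1, \\
\operatorname{var}(Y) &= \operatorname{var}(Z) + \operatorname{var}(X) = 2, \\
\rho(X, Y) &= \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}}
            = \frac{-1}{\sqrt{1 \cdot 2}} = -\frac{1}{\sqrt{2}} \approx -0.707.
\end{aligned}
```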

Q.2 Use Excel (menu: Data -> Data Analysis -> Random Number Generation) to simulate (draw) n = 20 observations of both X and Z: (x1, z1), (x2, z2), ..., (xn, zn). Place the Xs and Zs side by side in two columns. Plot Z against X (create a scatter plot). By construction, X and Z should be independent.

Does the plot give this impression? Then calculate Y (yi = zi - xi for i = 1, 2, ..., 20) and plot Y against X. Does the plot give an impression of dependence between X and Y? Is the dependence positive or negative?
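For readers who want to cross-check the Excel work, the simulation step can be sketched in Python. This is only an illustration: the function names simulate_xz and make_y and the seed are hypothetical choices, and Python's random.gauss stands in for Excel's Random Number Generation tool.

```python
import random


def simulate_xz(n, seed=None):
    """Draw n independent pairs (x, z), each component N(0, 1)."""
    rng = random.Random(seed)  # seed is an arbitrary choice for reproducibility
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    zs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return xs, zs


def make_y(xs, zs):
    """Compute y_i = z_i - x_i for every pair."""
    return [z - x for x, z in zip(xs, zs)]


if __name__ == "__main__":
    xs, zs = simulate_xz(20, seed=1)
    ys = make_y(xs, zs)
    # Print the three columns; they can be pasted into a sheet and plotted.
    print(" i        x        z        y")
    for i, (x, z, y) in enumerate(zip(xs, zs, ys), start=1):
        print(f"{i:2d} {x:8.3f} {z:8.3f} {y:8.3f}")
```

The printed columns correspond to the X, Z and Y columns of the spreadsheet, so the two scatter plots can be drawn from them in the same way.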

Q.3 Use the same method as above to simulate n = 20 observation pairs of (X, Y) when

ρ(X, Y) = -0.2 and when ρ(X, Y) = 0.9. Create a scatter plot in both cases.

Hint: Put Y = Z + aX. Show that

$$\rho(X, Y) = \frac{a}{\sqrt{1 + a^2}}$$

Then choose a so that ρ gets the desired value. It may be wise to solve the expression for ρ with respect to a. Note that a and ρ must have the same sign.
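A small sketch of the hint, assuming the relation ρ = a / sqrt(1 + a^2) for Y = Z + aX. Squaring and solving gives a = ρ / sqrt(1 - ρ^2), which has the same sign as ρ. The function names a_for_rho and simulate_xy are hypothetical.

```python
import math
import random


def a_for_rho(rho):
    """Solve rho = a / sqrt(1 + a^2) for a; requires -1 < rho < 1."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return rho / math.sqrt(1.0 - rho * rho)


def simulate_xy(rho, n, seed=None):
    """Draw n pairs (x, y) with y = z + a*x, where x and z are N(0, 1)."""
    rng = random.Random(seed)
    a = a_for_rho(rho)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) + a * x for x in xs]
    return xs, ys


if __name__ == "__main__":
    for rho in (-0.2, 0.9):
        a = a_for_rho(rho)
        # Plug a back into the formula to confirm it reproduces rho.
        print(f"rho = {rho:5.2f}  a = {a:8.4f}  check = {a / math.sqrt(1 + a * a):.4f}")
```

In Excel the same value of a can be computed in a single cell and then used in the formula for the Y column.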

Q.4 In the above examples the theoretical correlation coefficient, ρ, is known. In practice, with real data, the population quantities var(X), var(Y) and cov(X, Y), and thus ρ, are unknown and must be estimated. If (x1, y1), (x2, y2), ..., (xn, yn) are n paired observations of X and Y, natural estimators (intuitively justified by the law of large numbers) for var(X), var(Y) and cov(X, Y) based on the sample are, respectively:

$$S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \approx \operatorname{var}(X), \qquad S_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 \approx \operatorname{var}(Y), \qquad S_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \approx \operatorname{cov}(X, Y)$$

where ≈ means approximately equal. The expressions to the left of ≈ are estimators (i.e. the stochastic variables behind the estimates) for the unknown population quantities on the right. It can be shown that the estimators tend to produce better estimates (estimated values) of the estimands (the expressions on the right) the larger n is.

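Since

$$\rho = \frac{\operatorname{cov}(X, Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}},$$

a natural estimator for ρ is

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

(the classic Pearson product-moment correlation coefficient).

Excel calculates this with the function CORREL (under statistical functions).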

Consider the data from Q.2, where ρ = -1/√2 ≈ -0.707 is known. We pretend that we do not know ρ and estimate it from the data by r. Do this and report how large the estimation error, |r - ρ|, was. Comment on the result.
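The estimation step can also be sketched outside Excel. The hypothetical helper pearson_r below computes the Pearson product-moment coefficient directly from sums of squares and cross-products, and the demo draws a fresh sample with Y = Z - X rather than reusing any particular Excel data.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    rng = random.Random(2)  # arbitrary seed
    xs = [rng.gauss(0.0, 1.0) for _ in range(20)]
    ys = [rng.gauss(0.0, 1.0) - x for x in xs]  # Y = Z - X
    rho = -1.0 / math.sqrt(2.0)  # theoretical value for Y = Z - X
    r = pearson_r(xs, ys)
    print(f"r = {r:.3f}, rho = {rho:.3f}, |r - rho| = {abs(r - rho):.3f}")
```

The common divisor (n or n - 1) in the variance and covariance estimators cancels in r, so the helper works with raw sums.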

Q.5 To get an idea of how good (or bad) r is as an estimator, we will look at how r behaves under repeated use. Repeat the experiment in Q.2 25 times and compute r each time. (It may save work to simulate the 500 observation pairs you need for X and Z at once and compute r for the 25 data sets by exploiting the copying functions in Excel.) Finally, gather the 25 observations of r directly below each other in a column.

You have now created a sample of 25 observations of r. Describe the empirical distribution (by a histogram) and give the mean, median, quartiles and standard deviation of this sample. Does r seem to be normally distributed? Does r seem reliable as an estimator for ρ?

Q.6 Repeat the experiment in Q.5, but now with n = 50 observations (instead of n = 20) of X and Y for each calculation of r (i.e. you now need 1250 observation pairs, or 2500 observations in total, of X and Y to get 25 observations of r). What effect does the larger number of observations of X and Y seem to have on the properties of r as an estimator?
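The repeated experiments in Q.5 and Q.6 can be sketched as follows, assuming Y = Z - X as in Q.2. The names repeat_r and summarize are hypothetical, the seed is arbitrary, and the quartile convention (method="inclusive") is one of several in use, so quartiles from other tools may differ slightly.

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equally long samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def repeat_r(reps, n, seed=None):
    """Simulate `reps` data sets of n pairs with Y = Z - X; return the r values."""
    rng = random.Random(seed)
    rs = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) - x for x in xs]
        rs.append(pearson_r(xs, ys))
    return rs


def summarize(values):
    """Mean, median, quartiles and sample standard deviation of a sample."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "sd": statistics.stdev(values),
    }


if __name__ == "__main__":
    for n in (20, 50):
        stats = summarize(repeat_r(25, n, seed=3))
        line = ", ".join(f"{k} = {v:.3f}" for k, v in stats.items())
        print(f"n = {n}: {line}")
```

Comparing the two printed rows gives a quick impression of how the spread of r changes with n. A histogram of the returned values plays the same role as the Excel histogram.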

Q.7 Now let Y = Z - 3X², where X and Z are independent and normally distributed N(0, 1), as in Q.1. Simulate (draw) n = 50 observations of (X, Y) and plot Y against X. Estimate ρ(X, Y) and comment on the result. What is the true ρ in this case? Do your results suggest that X and Y are stochastically independent?

(Hint: You may find it useful to know that if X is normally distributed with expectation 0, then E(X³) = 0, which incidentally holds for any symmetric distribution with expectation 0.)
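The hint can be turned into a short calculation. The sketch below assumes Y = Z - 3X²; the sign of the quadratic term does not affect the conclusion, because the covariance vanishes either way.

```latex
\begin{aligned}
\operatorname{cov}(X, Y) &= E(XY) - E(X)\,E(Y) \\
                         &= E(XZ) - 3\,E(X^3) - 0 \cdot E(Y) \\
                         &= E(X)\,E(Z) - 3 \cdot 0 = 0,
\end{aligned}
\qquad\text{so}\qquad \rho(X, Y) = 0 .
```

Y is nevertheless a function of X plus noise, so zero correlation here does not mean that X and Y are independent.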

Q.8 The dangers of smoking have been studied and documented by many statistical studies since the war. This has led to a ban on advertising, mandatory printed warnings on tobacco products and a certain change in attitudes. We will look at some figures from the 1960s giving average cigarette consumption and mortality from cardiovascular disease (HKS) for n = 21 countries.

Year  Country        Cigarette consumption   HKS mortality per 100 000
                     per adult per year      (age 35-64)
----  -------------  ----------------------  -------------------------
1962  USA            3900                    256.9
1962  Canada         3350                    211.6
1962  Australia      3220                    238.1
1962  New Zealand    3220                    211.8
1963  Great Britain  2790                    194.1
1962  Switzerland    2780                    124.5
1962  Ireland        2770                    187.3
1962  Iceland        2290                    110.5
1962  Finland        2160                    233.1
1963  West Germany   1890                    150.3
1962  Netherlands    1810                    124.7
1962  Greece         1800                     41.2
1962  Austria        1770                    182.1
1962  Belgium        1700                    118.1
1962  Mexico         1680                     31.9
1963  Italy          1510                    114.3
1961  Denmark        1500                    144.9
1962  France         1410                     59.7
1962  Sweden         1270                    126.9
1961  Spain          1200                     43.9
1962  Norway         1090                    136.3

(i) Calculate the sample standard deviations of x and y and the sample covariance between them, where x stands for cigarette consumption and y for HKS mortality. Estimate the correlation coefficient, ρ, between cigarette consumption and HKS mortality.

(ii) Suppose we changed the unit for HKS mortality from per 100 000 to per 10 000 (for example, for the United States: 256.9 per 100 000 = 25.69 per 10 000). What effect does this change have on the standard deviations, the covariance, the correlation coefficient and their estimates? Why is the correlation coefficient unaffected?
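As a cross-check on the Excel calculations in Q.8, the estimates can be sketched in Python using the 21 data pairs from the table. The helper name sample_stats is hypothetical, and the divisor n - 1 in the standard deviations and covariance is an assumption; r does not depend on that choice.

```python
import math

# Cigarette consumption per adult per year (x) and HKS mortality
# per 100 000 (y), in the order the countries are listed in the table.
cigarettes = [3900, 3350, 3220, 3220, 2790, 2780, 2770, 2290, 2160, 1890, 1810,
              1800, 1770, 1700, 1680, 1510, 1500, 1410, 1270, 1200, 1090]
mortality = [256.9, 211.6, 238.1, 211.8, 194.1, 124.5, 187.3, 110.5, 233.1,
             150.3, 124.7, 41.2, 182.1, 118.1, 31.9, 114.3, 144.9, 59.7,
             126.9, 43.9, 136.3]


def sample_stats(xs, ys):
    """Return (s_x, s_y, s_xy, r) using the divisor n - 1 (an assumption)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return math.sqrt(sxx), math.sqrt(syy), sxy, sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    per_10000 = [y / 10.0 for y in mortality]  # change of unit from part (ii)
    for label, ys in (("per 100 000", mortality), ("per 10 000", per_10000)):
        sx, sy, sxy, r = sample_stats(cigarettes, ys)
        print(f"{label:>12}: s_x = {sx:.1f}, s_y = {sy:.3f}, "
              f"s_xy = {sxy:.2f}, r = {r:.4f}")
```

Rescaling y by a constant c multiplies s_y and s_xy by c, and the factor cancels in r = s_xy / (s_x s_y).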

Q.9 Introduction. The numbers in Q.8 are regarded as a representative sample of observations drawn from a larger population (which may well include more countries and other periods) with unknown correlation coefficient, ρ. It is this ρ we are really interested in. The problem we will look at is the following:

How strong is the evidence in the data that the unknown ρ is actually positive (> 0), when we take into account the uncertainty in the estimate?

It is in fact conceivable that we got the estimate we got by coincidence even though the true value is actually ρ = 0, since the numbers in the data represent only a sample drawn from the population. If, on the other hand, the probability of this occurring is sufficiently small, we take it as evidence that the assumption ρ = 0 is incorrect. The smaller this probability is, the stronger the evidence.

This is one of the fundamental ideas in hypothesis testing.

More precisely: Let r0 denote your estimate (a concrete number). This number is now regarded as an observation of the stochastic variable r defined in Q.4. We are therefore interested in calculating the probability

P(getting an r at least as large as the one we got) = P(r ≥ r0), calculated under the assumption ρ = 0.

If this probability (also called the p-value in hypothesis testing theory) is small enough (for example, about 0.05 or less), it is conventionally considered strong evidence against the assumption ρ = 0, i.e. strong evidence in the data for ρ > 0. If the p-value is even smaller (for example, approximately 0.001), it is regarded as even stronger evidence for ρ > 0. If, on the other hand, the p-value is somewhat larger (for example, 0.09), it is usually not considered strong enough evidence.

In order to calculate P(r ≥ r0) we need to know the probability distribution of r. This distribution is complicated, but it can be shown to be approximately normal if n, the number of observation pairs, is large enough. Here n = 21 is on the small side, as the simulation experiment in Q.5 may also indicate. However, classical statistical theory offers a famous transformation (Fisher's z-transformation) which improves the approximation to the normal distribution for moderate n:

Under general conditions (which we shall not take up here),

$$Z = \frac{\sqrt{n-3}}{2}\,\ln\!\left(\frac{1+r}{1-r}\right) \qquad (1)$$

is approximately standard normally distributed, N(0, 1), when ρ = 0, which we assume applies here for n = 21.

Answer:

(i) Consider the function

$$h(x) = \frac{\sqrt{n-3}}{2}\,\ln\!\left(\frac{1+x}{1-x}\right)$$

defined for -1 < x < 1.

It is not difficult to show by differentiation that h(x) is a strictly increasing function of x for -1 < x < 1. You do not need to do that here; instead, get Excel to create a graph of the function that shows that it is increasing.

[Hint: Choose some values of x (e.g. the values -0.99, -0.9, -0.8, ..., -0.1, 0, 0.1, 0.2, ..., 0.8, 0.9, 0.99) and lay them out in a column. In the next column, calculate the associated values of h(x). Select the range of numbers and create a scatter plot with lines between the points, i.e. use the menu

Insert -> Scatter -> Scatter with smooth lines]

(ii) Explain from your graph why the events (r ≥ r0) and (Z ≥ h(r0)) are logically equivalent (i.e. if one event occurs, the other occurs as well, and vice versa) and therefore equally probable.

(iii) Use (ii) and (1) to calculate the p-value, P(r ≥ r0), approximately, under the assumption that ρ = 0, and comment on the evidence in the data that ρ > 0.

[Hint: Use the NORM.DIST function in Excel 2010 (NORMDIST in Excel 2007).]
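A minimal sketch of steps (i) to (iii), assuming that h is the Fisher transformation scaled by sqrt(n - 3), i.e. h(x) = sqrt(n - 3)/2 · ln((1 + x)/(1 - x)), and that Z = h(r) is approximately N(0, 1) when ρ = 0. The function names are hypothetical, and the value r0 = 0.5 in the demo is a placeholder, not the estimate from Q.8.

```python
import math
from statistics import NormalDist


def h(x, n):
    """Scaled Fisher transformation: sqrt(n - 3) / 2 * ln((1 + x) / (1 - x))."""
    if not -1.0 < x < 1.0:
        raise ValueError("x must lie strictly between -1 and 1")
    return math.sqrt(n - 3) / 2.0 * math.log((1.0 + x) / (1.0 - x))


def p_value_positive(r0, n):
    """Approximate P(r >= r0) under rho = 0 as P(Z >= h(r0)), Z ~ N(0, 1)."""
    return 1.0 - NormalDist().cdf(h(r0, n))


if __name__ == "__main__":
    n = 21
    # Tabulate h on a grid, as in the hint for (i); the values should increase.
    grid = [-0.99, -0.9, -0.5, 0.0, 0.5, 0.9, 0.99]
    for x in grid:
        print(f"x = {x:5.2f}  h(x) = {h(x, n):8.3f}")
    r0 = 0.5  # placeholder; replace with your own estimate
    print(f"p-value for r0 = {r0}: {p_value_positive(r0, n):.4f}")
```

In Excel the last step corresponds to 1 - NORM.DIST(z0, 0, 1, TRUE), where z0 is the cell holding h(r0).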