Sample Size Estimation Skewed Continuous Sas

October 12, 2022

Problem Statement

In our sample dataset, students reported their writing placement test scores and their gender (male or female). Suppose we want to know if the average writing score is different for males versus females. This involves testing whether the sample means for writing scores among males and females in our sample are statistically different (and, by extension, inferring whether the mean writing scores in the population differ between these two groups). You can use an Independent Samples t Test to compare the mean writing scores for males and females.

Before the Test

State the Null and Alternative Hypotheses

The hypotheses for this example can be expressed as:

H₀: µ_males = µ_females ("the mean writing score in the population of males is identical to the mean writing score in the population of females")
H₁: µ_males ≠ µ_females ("the two population means are not equal")

where µ_males and µ_females are the population means for males and females, respectively.

Before we perform our hypothesis test, we should decide on a significance level (denoted α). The significance level is the threshold we will use to decide whether a test result is significant. For this example, let's use α = 0.05, or 5%.

Data Set-Up

In the sample data, we will use two variables: Gender and Writing. The variable Gender has values of either "1" or "0", which correspond to females and males, respectively. It will function as the independent variable in this t test. The variable Writing is a numeric variable, and it will function as the dependent variable. In SAS, the first few rows of data look like this (if variable and value labels have been applied):
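The screenshot of the data is not reproduced here. As a rough sketch, the first few rows could be listed with PROC PRINT. This assumes Gender is stored as a numeric 0/1 variable; the format name genderfmt is hypothetical and only illustrates how value labels might be applied.

```sas
/* Hypothetical value labels: genderfmt is an illustrative name, */
/* and Gender is assumed to be numeric (0 = male, 1 = female).   */
PROC FORMAT;
    VALUE genderfmt 0 = 'Male'
                    1 = 'Female';
RUN;

/* List the first five rows of the two variables used in the test */
PROC PRINT DATA=sample (OBS=5) LABEL;
    VAR Gender Writing;
    FORMAT Gender genderfmt.;
RUN;
```

The OBS=5 data set option limits the listing to five rows; the FORMAT statement affects only how Gender is displayed, not the stored values.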

Exploratory Data Analysis + Check the Assumptions

Recall that the Independent Samples t Test has several assumptions that we must take into account:

The dependent variable should be normally distributed in both groups
The variance of the dependent variable should be the same in both groups; if it isn't, use the alternate version of the test statistic

So before we jump into the Independent Samples t Test, it is a good idea to look at descriptive statistics and graphs to get an idea of what to expect, and to see if the assumptions of the test have been reasonably met. To do this, we'll want to look at the means and standard deviations of Writing for males and females, as well as graphs that compare the distribution of Writing for males versus females. PROC TTEST automatically runs descriptive statistics and graphs for us, but we can also use PROC MEANS to produce descriptive statistics by group:

          PROC MEANS DATA=sample;
              VAR Writing;
              CLASS Gender;
          RUN;

PROC MEANS tells us several important things. First, there were 204 males and 222 females in the dataset, but only 191 males and 204 females reported a writing score. This is important to know because PROC TTEST can only use cases with nonmissing values for both gender and writing score, so our effective sample size for this test is 191 + 204 = 395, which is less than the total number of rows in the sample dataset (435). Second, the mean writing score for males is 77.14 points, while the mean writing score for females is 81.73 points. This is a difference of more than four points. Third, the standard deviations for males' and females' writing scores are very similar: 4.88 for males, and 5.09 for females.
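To make the missing-data counts explicit, one option is to request them directly from PROC MEANS. The sketch below uses the same data set and variable names as above; the particular choice of statistics is only a suggestion.

```sas
/* N     = number of nonmissing Writing values per group          */
/* NMISS = number of missing Writing values per group             */
/* MISSING keeps rows with a missing Gender as their own category */
PROC MEANS DATA=sample N NMISS MEAN STD MIN MAX MISSING;
    CLASS Gender;
    VAR Writing;
RUN;
```

Without the MISSING option, rows with a missing value of Gender are dropped from the table, which is why the group counts alone do not add up to the total number of rows.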

For graphs, we can use the two graphs that PROC TTEST produces for an independent samples t test:

The first graph contains histograms (top two panels) and boxplots (bottom panel) comparing the distributions of males' and females' writing scores. From the histograms, we can see that the distributions of writing scores for both males and females are roughly symmetric, but the distribution of females' writing scores is "shifted" slightly to the right of the males' distribution. From the boxplots, we can see that the total length of the boxplots and the interquartile range (the distance between the 1st and 3rd quartiles, i.e. the edges of the boxes) are similar for males and females. This is what we would expect to see if the two groups had the same variance. By contrast, when we look at the center lines in the boxplots (which represent the median scores), we see that they do not line up: the center line for the females' boxplot is to the right of the center line for the males' boxplot. Additionally, the diamond shape in each boxplot represents the mean score; we see that the mean score for the females is to the right of the mean score for the males. If the two groups had the same mean, we would expect these center lines and/or diamonds to "line up" vertically.

The second graph contains Q-Q plots of the writing scores for males (left panel) versus females (right panel). The Q-Q plots produced by PROC TTEST can be used to check if a variable's observed values are consistent with what we would expect them to be if the variable were truly normally distributed. To read a Q-Q plot, we look to see if the dots (the observed values) match up with the expected values for a normal distribution (the diagonal line). If the points fall along the line, then the values are consistent with a normal distribution. In this case, we see that the values in the middle of the range are consistent with a normal distribution for both males and females, and that both groups show only slight deviations from normality in the tails. Because these deviations are minor, the normality assumption required for the independent samples t test appears to be reasonably satisfied.
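If a supplementary check is wanted beyond the PROC TTEST graphs, PROC UNIVARIATE can produce per-group Q-Q plots together with formal normality tests. This is not part of the original walkthrough and is offered only as a sketch, assuming the same data set and variable names.

```sas
ODS GRAPHICS ON;

/* NORMAL requests the "Tests for Normality" table for each group */
PROC UNIVARIATE DATA=sample NORMAL;
    CLASS Gender;
    VAR Writing;
    /* Q-Q plot with a reference line based on the estimated mean and SD */
    QQPLOT Writing / NORMAL(MU=EST SIGMA=EST);
RUN;
```

With samples of roughly 200 per group, formal normality tests can flag deviations too small to matter for the t test, so the plots remain the more useful guide.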

Running the Test

SAS Program

          PROC TTEST DATA=work.sample ALPHA=.05;
              VAR Writing;
              CLASS Gender;
          RUN;

Output

Tables

Four tables appear in the PROC TTEST output; let's go through each one in the order they appear.
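If it helps to work with these four tables as data sets (for example, to report values without retyping them), they can be captured with ODS OUTPUT. In the sketch below, the output data set names (tt_stats, tt_cl, tt_tests, tt_equality) are arbitrary and chosen for illustration.

```sas
/* Save each PROC TTEST table to a data set in the WORK library */
ODS OUTPUT Statistics = tt_stats
           ConfLimits = tt_cl
           TTests     = tt_tests
           Equality   = tt_equality;

PROC TTEST DATA=work.sample ALPHA=.05;
    VAR Writing;
    CLASS Gender;
RUN;

/* Example: print the saved t-test results */
PROC PRINT DATA=tt_tests NOOBS;
RUN;
```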

The first table contains descriptive statistics for both groups, including the valid sample size (n), mean, standard deviation, standard error (s/sqrt(n)), minimum, and maximum. Much of this we already saw in the PROC MEANS output, but this table also contains the computed difference between the two means. In this case, the first mean (males) was 4.5961 points lower than the second mean (females). In plain English, this means that, on average, females scored over 4 points higher on their writing placement test than males. Keep in mind that the independent samples t test is testing whether or not this difference is significantly different from zero.

The second table contains confidence limits for the group means, confidence limits for the group standard deviations, and confidence limits for the difference in the means (which is what we're interested in). Notice that there are two different confidence interval formulas for the difference. The first, Pooled, assumes that both groups have the same variance in writing scores. The second, Satterthwaite, does not make this assumption (i.e., it allows the two groups to have different variances in writing scores). We know from our exploratory data analysis that males and females have similar standard deviations, so we should look at the Pooled confidence interval. The 95% confidence interval for the difference in mean writing scores (males minus females) is (-5.5846, -3.6076).
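For reference, the Pooled interval follows the standard textbook formula, written here with subscripts m and f for males and females (the notation is ours, not taken from the SAS output):

```latex
\[
(\bar{x}_m - \bar{x}_f) \;\pm\; t_{1-\alpha/2,\; n_m + n_f - 2}\; s_p \sqrt{\frac{1}{n_m} + \frac{1}{n_f}},
\qquad
s_p^2 = \frac{(n_m - 1)\,s_m^2 + (n_f - 1)\,s_f^2}{n_m + n_f - 2}
\]
% The reported interval is symmetric about the observed difference:
\[
-4.5961 \pm 0.9885 \;=\; (-5.5846,\; -3.6076)
\]
```

The half-width 0.9885 is simply half the distance between the two reported limits. Because the interval excludes zero, it agrees with a significant result at α = 0.05.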

The third table contains the actual t-test results, and the fourth table contains the "Equality of Variances" test results:

Previously, we had used informal methods (descriptive statistics and graphs) to check if the two groups had the same variance in writing scores. However, we can do a "formal" hypothesis test of whether the two variances are equal, using the Folded F test in the "Equality of Variances" table. This can help us decide whether we should use the Pooled or Satterthwaite result. The null hypothesis of the Folded F test is that the variances are equal; the alternative is that the variances are not equal. Because the p-value is greater than alpha (.05), we fail to reject the null hypothesis: we have no evidence that the variance of writing scores differs between these two groups. Therefore, we will use the Pooled version of the independent samples t test.
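As a rough illustration of what the Folded F statistic measures, it is the ratio of the larger sample variance to the smaller one. Plugging in the rounded standard deviations from the descriptive statistics gives an approximate value; the figure in the actual output may differ slightly because SAS uses unrounded values.

```latex
\[
F' = \frac{\max(s_m^2,\, s_f^2)}{\min(s_m^2,\, s_f^2)}
   \approx \frac{5.09^2}{4.88^2}
   = \frac{25.9081}{23.8144}
   \approx 1.09
\]
% Numerator df = 204 - 1 = 203 (females, the larger variance);
% denominator df = 191 - 1 = 190 (males).
```

A ratio this close to 1 is what we would expect when the two group variances are similar.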

Going back to the third table, we see that there are two versions of the t test: Pooled (which assumes equal variances) and Satterthwaite (which does not assume equal variances). The columns of the table, from left to right, are:

Method and Variances show which formula and variance assumptions are used for that test
df is the degrees of freedom
t Value is the test statistic, determined using the appropriate formula
Pr > |t| is the p-value corresponding to the given test statistic and degrees of freedom
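To make the t Value and df columns concrete, the Pooled statistic can be sketched from the summary statistics reported earlier. The subscripts m and f are our notation, and the numeric check uses rounded standard deviations, so it only approximates the SAS output.

```latex
\[
t = \frac{\bar{x}_m - \bar{x}_f}{s_p \sqrt{\dfrac{1}{n_m} + \dfrac{1}{n_f}}},
\qquad
s_p = \sqrt{\frac{(n_m - 1)\,s_m^2 + (n_f - 1)\,s_f^2}{n_m + n_f - 2}},
\qquad
df = n_m + n_f - 2 = 191 + 204 - 2 = 393
\]
% Approximate check with rounded inputs (s_m = 4.88, s_f = 5.09):
\[
s_p \approx 4.99, \qquad
s_p \sqrt{\tfrac{1}{191} + \tfrac{1}{204}} \approx 0.50, \qquad
t \approx \frac{-4.5961}{0.50} \approx -9.1
\]
```

The Satterthwaite version replaces the pooled standard error with the square root of s_m²/n_m + s_f²/n_f and uses an adjusted (generally non-integer) degrees of freedom.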

Based on the Folded F test, we decided to use the Pooled version of the test statistic. To determine if the result is significant or not, we compare the Pooled p-value (p < .0001) against our chosen significance level alpha (.05). Since the p-value is smaller than alpha, we reject the null hypothesis.

Decision and Conclusions

Since p < .0001 is less than our chosen significance level α = 0.05, we can reject the null hypothesis, and conclude that males and females had a statistically significant difference in their average writing scores.

Based on the results, we can state the following:

There was a significant difference in mean writing scores between males and females (t₃₉₃ = -9.14, p < .0001).
The average writing score for females was over 4 points greater than the average writing score for males (95% confidence interval for the male-minus-female difference: -5.5846 to -3.6076).

alvaradoliffe1958.blogspot.com

Source: https://libguides.library.kent.edu/SAS/IndependentTTest
