chisqtestGC() Tutorial
01 Feb 2014
- Two Factor Variables (Association)
- One Factor Variable (Goodness of Fit)
- Want Less Output?
You use the $\chi^2$-test when you are addressing the inferential aspect of research questions about:
- the relationship between two factor variables (test for association);
- the distribution of one factor variable (goodness-of-fit test).
The function (and some of the data) that we will use comes from the
tigerstats package, so make sure that it is loaded:
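A minimal way to load the package, assuming tigerstats is already installed in your R library:

```r
# Attach tigerstats so that chisqtestGC() and its data sets are available.
library(tigerstats)
```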
Note: If you are not working with the RStudio server hosted by Georgetown College, then you will need to install tigerstats on your own machine. You can get the current version from GitHub by first installing the devtools package from the CRAN repository, and then running the following commands in a fresh R session:
Two Factor Variables (Association)
When your data are in raw form, straight from a data frame, you can perform the test using “formula-data input”. For example, in the m111survey data, we might wonder whether sex and seating preference are related in the population from which the sample was (allegedly randomly) drawn. The function call and the output are as follows:
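As a sketch, the call for this example would presumably look like the following. It assumes that the two variables are named sex and seat in the m111survey data frame, with the explanatory variable listed first in the formula. The printed output is not reproduced here.

```r
library(tigerstats)

# Formula-data input: explanatory variable first, then the response.
chisqtestGC(~ sex + seat, data = m111survey)
```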
Two-Way Table Input
Sometimes you already have a two-way table on hand:
In that case you can save yourself some typing by entering the table in place of the formula and the data argument:
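One way this might look is sketched below. The object name SexSeat is a hypothetical choice for illustration, and the table is built with base R's xtabs() from the same two variables (assumed to be named sex and seat).

```r
library(tigerstats)

# Build the two-way table once with base R ...
SexSeat <- xtabs(~ sex + seat, data = m111survey)

# ... then hand the table itself to the function.
chisqtestGC(SexSeat)
```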
A Table From Summary Data
Remember: if you are given only summary data, then you can construct a two-way table and enter it into chisqtestGC(). Suppose that you want this two-way table:
You can get it as follows:
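A base-R sketch of the construction uses rbind() for the rows and colnames() for the column labels. The counts below are assumed for illustration only and stand in for whatever summary data you were given.

```r
# Each row is one level of the explanatory variable (counts are illustrative).
MySexSeat <- rbind(female = c(19, 16, 5),
                   male   = c(8, 16, 7))

# Label the columns with the levels of the response variable.
colnames(MySexSeat) <- c("front", "middle", "back")
```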
Let’s just check to see that this worked:
Then you can enter MySexSeat into the function:
When the Null’s expected counts are low, chisqtestGC() delivers a warning and suggests the use of simulation to compute the $P$-value. You do this by way of the argument simulate.p.value, and you have three options:
simulate.p.value = "fixed"
simulate.p.value = "random"
simulate.p.value = "TRUE"
Explanatory Tallies Fixed
Suppose that the objects under study are not a random sample from some larger population, and that chance enters the production of the data through random variation in all of the other factors (besides the explanatory variable) that might be associated with the response variable. Then, since the items being observed are fixed, the tally of values for the explanatory variable is fixed. The response values for these items are the product of chance, but only through random variation in those other factors.
The study from the ledgejump data frame was an example of this. The 21 incidents were fixed, so there were nine cold-weather incidents and 12 warm-weather incidents, no matter what. The crowd behavior at each incident, however, is still a matter of chance.
In such a case you might want to resample under the restriction that in all of your resamples, the tally for the explanatory variable stays just the same as it was in the data you observed. Then your function call looks like:
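A sketch of such a call follows. The variable names weather and crowd.behavior are assumptions about the ledgejump data frame, and B = 2500 is just one reasonable choice for the number of resamples.

```r
library(tigerstats)

# "fixed": every resample keeps the observed tally of the explanatory variable.
# Variable names are assumed; check names(ledgejump) before running.
chisqtestGC(~ weather + crowd.behavior, data = ledgejump,
            simulate.p.value = "fixed", B = 2500)
```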
You can set B, the number of resamples, as you wish, but it should be at least a few thousand. Of course the $P$-value, having been determined by random resampling, will vary from one run of the function to another.
Explanatory Tallies Random
In the m111survey study on sex and seating preference, the subjects are a random sample from a larger population. In that case the tallies for both the explanatory and the response variables depend upon chance. If you simulate in such a case, then you set simulate.p.value to “random”:
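For the survey example this might be written as below, again assuming the variables are named sex and seat and picking B = 2500 only for illustration.

```r
library(tigerstats)

# "random": neither tally is held fixed from one resample to the next.
chisqtestGC(~ sex + seat, data = m111survey,
            simulate.p.value = "random", B = 2500)
```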
Both Tallies Fixed
If you want to resample in such a way that the tallies for BOTH the explanatory and response variables stay exactly the same as they were in the actual data, then you set simulate.p.value to “TRUE”. This invokes R’s standard method for resampling:
It’s not easy to understand why R would adopt such a method, but there is some good theoretical support for it. If you are ever in doubt about how to simulate, just use this third option.
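To make "both tallies fixed" concrete, here is a small base-R sketch that does not depend on tigerstats. It uses an illustrative table of counts (an assumption, not data quoted from this tutorial), draws a few random tables with r2dtable(), and confirms that each one has the same row and column totals as the original. Base R's chisq.test() samples tables with fixed margins in this way when simulate.p.value = TRUE.

```r
# Illustrative 2 x 3 table of counts.
observed <- rbind(female = c(19, 16, 5),
                  male   = c(8, 16, 7))

# Draw three random tables that share the observed row and column totals.
set.seed(2014)
resamples <- r2dtable(3, rowSums(observed), colSums(observed))

# Check that every resampled table keeps both sets of margins.
margins_kept <- sapply(resamples, function(tab) {
  all(rowSums(tab) == rowSums(observed)) &&
    all(colSums(tab) == colSums(observed))
})
margins_kept

# Simulated P-value from base R, based on 2500 such tables.
sim <- chisq.test(observed, simulate.p.value = TRUE, B = 2500)
sim$p.value
```

Because the resampling is random, the simulated P-value changes slightly from run to run unless a seed is set.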
Graphs of the P-Value
You can get a graph of the $P$-value in the plot window by setting the argument graph to TRUE. When you do not simulate, the graph shows a density curve for the $\chi^2$ random variable with the relevant degrees of freedom. When you simulate, the graph is a histogram of the resampled $\chi^2$-statistics.
Here is a case with no simulation:
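For instance, using the survey variables as an assumed example:

```r
library(tigerstats)

# No simulation: the plot window shows a chi-square density curve.
chisqtestGC(~ sex + seat, data = m111survey, graph = TRUE)
```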
Here is a case with simulation:
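A sketch with resampling turned on, where the choice of "random" and B = 2500 is only illustrative:

```r
library(tigerstats)

# With simulation: the plot window shows a histogram of resampled statistics.
chisqtestGC(~ sex + seat, data = m111survey,
            simulate.p.value = "random", B = 2500, graph = TRUE)
```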
One Factor Variable (Goodness of Fit)
The variable seat in the m111survey data frame indicates the classroom seating preference of each person in the survey. Suppose we want to know whether or not the sample data provide strong evidence that seating preference in the Georgetown College population is not uniformly distributed among the three possible options (Front, Middle, and Back). That is, letting
$p_1$ = proportion of all GC students who prefer the front,
$p_2$ = proportion of all GC students who prefer the middle,
$p_3$ = proportion of all GC students who prefer the back,
we want to test the hypotheses:
$H_0$: $p_1 = p_2 = p_3 = 1/3$
$H_a$: at least one of the above proportions is not $1/3$
We can do so using the $\chi^2$-test for goodness of fit. The argument p gives the distribution of the variable seat in the Georgetown College population that the Null Hypothesis claims:
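A sketch of the goodness-of-fit call, assuming a one-variable formula is the expected input and that p lists the Null proportions in the order of the levels of seat:

```r
library(tigerstats)

# Null Hypothesis: each of the three seating options has proportion 1/3.
chisqtestGC(~ seat, data = m111survey, p = c(1/3, 1/3, 1/3))
```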
Suppose that the data had only come to us in summary form:
We could still perform the test, by making a table and storing it as an R-object:
Then we could perform the test using
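The arithmetic behind the test can be checked with base R alone. The sketch below stores assumed summary counts (chosen for illustration) in a named vector called Seat, a hypothetical name, computes the statistic by hand as the sum of (observed - expected)^2 / expected, and compares it with base R's chisq.test().

```r
# Summary counts assumed for illustration (front, middle, back).
Seat <- c(front = 27, middle = 32, back = 12)
null_p <- c(1/3, 1/3, 1/3)

# Expected counts under the Null, then the statistic by hand.
expected <- sum(Seat) * null_p
stat_by_hand <- sum((Seat - expected)^2 / expected)

# The same goodness-of-fit test with base R.
gof <- chisq.test(Seat, p = null_p)
c(by_hand = stat_by_hand, chisq.test = unname(gof$statistic))
```

Passing Seat and p to chisqtestGC() should report the same statistic, assuming the function accepts a named vector of counts in place of a formula.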
When expected cell counts fall below 5, chisqtestGC() issues a warning and suggests the use of simulation. We can perform simulation at any time, though.
For goodness-of-fit tests, the only relevant form of simulation is the one provided by setting simulate.p.value to TRUE. Of course we also need to set the number B of resamples.
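Put together, a simulated goodness-of-fit test might be requested as follows. The value B = 2500 is an illustrative choice, and the variable name seat is assumed as before.

```r
library(tigerstats)

# Goodness of fit with a simulated P-value based on 2500 resamples.
chisqtestGC(~ seat, data = m111survey, p = c(1/3, 1/3, 1/3),
            simulate.p.value = TRUE, B = 2500)
```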
Want Less Output?
If you do not want to see quite so much output to the console and are only interested in the essentials for reporting a $\chi^2$-test, then set the argument