Five Reasons to Teach Elementary Statistics With R: #3

Introduction

This is the third in a projected five-part series of posts aimed at colleagues who teach elementary statistics. If you teach with R but hesitate to spring such a powerful and complex tool on unsuspecting introductory students—many of whom will have had no prior experience with the command line, much less with coding—then I hope these posts will give you some encouragement.

The previous post in this series described RStudio’s package manipulate and its applications in the easy authoring of instructional applets. Today we’ll look at shiny, a related RStudio project.

In order to try the ensuing examples, download an ancillary package that we use for our elementary course:
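If tigerstats is available on CRAN for you, install.packages() suffices; otherwise you can install from the package’s GitHub repository. (The repository path below is an assumption on my part; substitute the actual one if it differs.)

```r
# Standard CRAN route, if the package is available there:
install.packages("tigerstats")

# Or install the development version from GitHub
# (repository path assumed; adjust if yours differs):
# devtools::install_github("homerhanumat/tigerstats")

library(tigerstats)
```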

Reason #3: RStudio’s shiny

Shiny appears to be intended primarily for data analysts working in industry or in academic or institutional research, but on the very day of its public release Victor Moreno pointed out its implications for statistics education (see his comment on this RStudio blog post). For statistics instructors Shiny offers essentially the same benefits as manipulate, but in addition comes equipped with:

• options for dynamic user input;
• output formats that go well beyond manipulate’s home in the Plots pane;
• default Bootstrap styling.

Examples

“Slow” Simulation

At my College we believe that simulation is important to understanding probability concepts, but we also find that our students don’t easily grasp the import of a simulation when the computer simply generates, say, 3000 re-samples and summarizes the results, all in a flash. We feel the need for plenty of “one at a time” simulation experiences that serve as transitions to the analysis of large-scale simulation results, and we don’t always find apps on the web that cater to our needs in just the way we would like.

Suppose for example you are wondering whether a certain die is loaded. You don’t want to crack it open, so you roll it sixty times, getting the following results:

Spots   One   Two   Three   Four   Five   Six
Freq      8    18      11      7      9     7

This looks like an awful lot of two-spots, but we were not expecting this in advance. By this point in the course students have been made aware of the perils of “data snooping” and hence should be disinclined to employ an inferential procedure that is based specifically on a pattern that one happens to notice in one’s data. Therefore, rather than perform inferential procedures keyed to the “Two-spot” side of the die, we might turn instead to the chi-square statistic as a neutral measure of the difference between the observed results and what one would expect if the die were fair.
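The observed value of the chi-square statistic is easy to compute directly in R (a quick sketch using only base functions):

```r
# Observed counts from the sixty rolls, and the counts
# expected from a fair die
observed <- c(8, 18, 11, 7, 9, 7)
expected <- rep(60 / 6, 6)    # 10 of each face, if the die is fair

# The chi-square statistic: squared deviations from expectation,
# scaled by expectation, then summed
chisq_stat <- sum((observed - expected)^2 / expected)
chisq_stat    # 8.8
```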

The situation is addressed in this Shiny app:

http://rstudio.georgetowncollege.edu:3838/SlowGoodness

After re-sampling for a few minutes, students are convinced that it’s not so unlikely, after all, to get results like the ones we observed, if the die is fair all along.

Students are then prepared to understand a full-scale re-sampling simulation like the following one:

## Pearson's chi-squared test with simulated p-value
## 	 (based on 3000 resamples)
##
##   observed counts Expected by Null contribution to chisq statistic
## A               8               10                             0.4
## B              18               10                             6.4
## C              11               10                             0.1
## D               7               10                             0.9
## E               9               10                             0.1
## F               7               10                             0.9
##
##
## Chi-Square Statistic = 8.8
## Degrees of Freedom of the table = 5
## P-Value = 0.125

Sure enough, if the die is fair then there is a reasonably good chance—about 12.5%—of getting results at least as extreme as the ones we got in our 60 rolls.
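The same large-scale simulation can be run with base R’s chisq.test() and its simulate.p.value option (the simulated p-value will vary a bit from run to run):

```r
observed <- c(8, 18, 11, 7, 9, 7)

# Simulated p-value based on 3000 resamples under a fair-die null
set.seed(2014)   # for reproducibility
chisq.test(observed, p = rep(1/6, 6), simulate.p.value = TRUE, B = 3000)
```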

Note: Shiny users know that the apps are liable to run more quickly if you run them locally. To run the foregoing app locally from an R session, pull it out of the tigerstats package:
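For example, something along the following lines should work, assuming the app ships as a SlowGoodness directory inside the installed package (the exact location may vary with the package version):

```r
library(shiny)
library(tigerstats)

# system.file() locates a directory inside your installed copy of the
# package; runApp() then launches the app in your browser.
runApp(system.file("SlowGoodness", package = "tigerstats"))
```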

Understanding Model Assumptions

Students tend to be somewhat rigid in their handling of “safety checks”—the diagnostics they are instructed to perform in order to judge whether the statistical model underlying a given inferential procedure is appropriate to the data at hand. This rigidity stems partly from a lack of understanding of what the inferential procedure is intended to deliver (for example, that a method for making 95%-confidence intervals for a parameter should produce intervals that cover the parameter about 95% of the time in repeated sampling), and partly from a lack of experience with situations in which the mathematical assumptions of the model are not perfectly satisfied.

The following Shiny app illustrates:

• coverage properties of confidence intervals (e.g., what “95% confidence” means, from a frequentist point of view);
• the effect on coverage properties, at various sample sizes, of departures from normality assumptions in procedures based upon the t-statistic.

Both “slow” (one-at-a-time) simulation and large-scale simulation (5000 samples) are available to the student.
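The large-scale side of such an experiment amounts to something like the following sketch. (The exponential population and the sample size are my choices for illustration, not necessarily the app’s.)

```r
set.seed(3030)

true_mean <- 1    # mean of an exponential population with rate 1
n <- 10           # a small sample size, where skewness hurts the most
reps <- 5000

# For each repetition, draw a sample from the skewed population and
# record whether the 95% t-interval covers the true mean
covers <- logical(reps)
for (i in seq_len(reps)) {
  samp <- rexp(n, rate = 1)
  ci <- t.test(samp, conf.level = 0.95)$conf.int
  covers[i] <- ci[1] <= true_mean && true_mean <= ci[2]
}

mean(covers)   # the observed coverage rate falls short of the nominal 0.95
```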

Types of Error

Simulation is also helpful in coming to understand such notions as the level of significance of a hypothesis test (i.e., the probability of rejecting a true Null Hypothesis in repeated sampling), and the notion of power as well. See the following app:
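Behind such an app, the significance level can be estimated by simulating many tests in a world where the Null is exactly true (a base-R sketch; the app’s own setup may differ):

```r
set.seed(4040)

alpha <- 0.05
reps <- 5000

# Draw 5000 samples from a population where the Null (mu = 0) holds,
# and record the p-value of a t-test against that Null each time
p_values <- replicate(reps, t.test(rnorm(25, mean = 0, sd = 1), mu = 0)$p.value)

mean(p_values < alpha)   # the rejection rate should hover near 0.05
```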

Illustrating Fine Points

Sometimes you want to have an app on hand, not because it addresses a major course objective, but simply in case students ask a particular question. For example, sometimes when the class is looking at a scatter plot—with regression line—of data that comes from a bivariate normal distribution, a student will remark that the regression line looks “too shallow”. The root of this question is a confusion, in the student’s mind, between two purposes that a line might serve:

• to provide a “linear summary” of the scatter plot;
• to provide linear predictions, based on the scatter plot, of y-values from x-values.

The so-called “SD line”—the line that runs through the point of averages and whose slope is the ratio of the standard deviation of the y-value to the standard deviation of the x-value—is well-suited to the former task, whereas the regression line is, of course, the right choice for the latter one. When many students first look at a scatter plot, they see an SD line in their mind’s eye; when they get around to producing the regression line, it can look like a misfire.

The following app helps clear things up for students. It is based on a discussion of the “shallow regression line” issue in Statistics, the classic text by Freedman, Pisani and Purves.
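You can reproduce the effect with simulated data: whenever the correlation is well below 1, the regression line comes out visibly shallower than the SD line (a base-R sketch):

```r
set.seed(5050)

# Bivariate data with correlation about 0.5
x <- rnorm(300, mean = 50, sd = 10)
y <- 0.5 * x + rnorm(300, mean = 25, sd = sqrt(75))

plot(x, y)
abline(lm(y ~ x), lwd = 2)          # regression line: slope r * sd(y)/sd(x)

sd_slope <- sd(y) / sd(x)           # SD line (here r > 0, so slope is positive)
abline(a = mean(y) - sd_slope * mean(x), b = sd_slope, lty = 2)
```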

Playing Games

Here’s yet another of those “find the regression line” apps that you see all over the web:

You have the option to keep score. Your score is the sum of the number of times you have submitted a guess and the following “closeness measure”:

Shiny vs. manipulate

You don’t need to know much at all about web development in order to program in Shiny, but for R users there is the extra requirement of becoming comfortable with the reactive programming paradigm. The hurdle is not all that high: as an intermediate-level R programmer, I was able to pick up Shiny over a weekend. The online Shiny tutorials and a few consultations with Stack Overflow provided almost everything I needed to know.

The payback for the extra learning is considerable. Shiny apps permit a much more flexible user interface than manipulate does. For example, it is easy to make input “dynamic”, in the sense that the requests a user can make of the app can depend upon choices the user has made previously. It’s also easy to provide plenty of written explanation for the activity as it proceeds, which can be somewhat difficult with manipulate apps.
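To give the flavor of dynamic input, here is a minimal sketch of a Shiny app (names invented for illustration) in which the second control offered to the user depends on the user’s first choice:

```r
library(shiny)

ui <- fluidPage(
  selectInput("family", "Population shape:", c("normal", "skewed")),
  uiOutput("paramInput"),    # filled in dynamically by the server
  plotOutput("densityPlot")
)

server <- function(input, output) {
  # The control rendered here depends on the chosen family
  output$paramInput <- renderUI({
    if (input$family == "normal") {
      sliderInput("sigma", "Standard deviation:", min = 0.5, max = 3, value = 1)
    } else {
      sliderInput("rate", "Rate:", min = 0.5, max = 3, value = 1)
    }
  })
  output$densityPlot <- renderPlot({
    if (input$family == "normal") {
      req(input$sigma)
      curve(dnorm(x, sd = input$sigma), from = -6, to = 6)
    } else {
      req(input$rate)
      curve(dexp(x, rate = input$rate), from = 0, to = 6)
    }
  })
}

# shinyApp(ui, server)   # run this line in an interactive session
```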

On the other hand, since manipulate apps run directly within RStudio, they can easily be programmed to work with any data frame that the user specifies. Shiny apps will allow you to upload a CSV file, but for elementary students this process is usually too much of a burden.

Show Me Shiny has some wonderful instructional apps.

Considering all of the buzz surrounding Shiny, I am baffled at how difficult it has been for me to find other up-to-date sites featuring Shiny apps for statistics instruction. Perhaps readers of this post could direct me to any that they know of. Eventually it would be nice to develop something like a ShinyTeachingTube, which could serve as a central hub for Shiny instructional applets.

Course Management With the RStudio Server

Introduction

At my institution we teach both elementary and upper-level undergraduate statistics using R, in the environment of the RStudio Linux server installed and configured on our campus network. Although students are made aware of the existence of the desktop version of RStudio and eventually are encouraged to install it on their personal machines, the default course environment is that of the server.

One reason for this choice is that the server allows us—instructors working in consultation with our sysadmin—to standardize the R environment (R version, installed packages, etc.) for all class members, so that if we add a feature or fix a problem we have some reasonable confidence that it will work for everyone.

Another reason—which constitutes the theme of this post—is that the server environment facilitates course management, especially in technical respects specific to a statistics course, where standard online content management systems such as Moodle or Blackboard may fall short. The aim of this post is to record, for colleagues at our institution and for folks at other institutions who are considering making the switch to R, the principal ways in which we have tweaked the server for course-management purposes. R and RStudio are wonderful free software, but like all free software, they come with a certain “cost of ownership”, and those costs can be considerable if (like me) you begin with little in the way of programming/hacking skills. I hope that the following information will reduce the ownership costs for others who choose to teach with R in a similar vein.

Installation

I assume that you have persuaded your sysadmin to install and configure some version of the RStudio Linux server. My sysadmin chose to set up the CentOS version, and configured it so that all members of the campus community can access it by means of their username and password.

If your personal machine runs Linux—either the Ubuntu or CentOS distribution—it’s a good idea to brush up on (or acquire) some very basic command-line skills and to install the server on your own machine as well, so you can replicate some of the strategies described below. Just a little bit of knowledge of the innards—file permissions, etc.—pays off handsomely in being able to work with your sysadmin to quickly diagnose and resolve any problems that arise. I myself run Ubuntu, but have not found significant differences between how the server works for me and how it works on campus.

Establishing a “Common Source” Folder

Ask your sysadmin to grant superuser privileges to you and other course instructors. Then one of you should create a folder in your Home directory on the server that will serve as a common source for course material. The sysadmin can create a symbolic link to the folder and can set permissions so that all users may read files in the folder but only you and fellow instructors can write to it. This folder serves as the repository for assignments, solutions, syllabi, etc.

If you are not the owner of the folder, you can get to it using the ellipsis button in the upper right-hand corner of the Files pane. Simply enter the path name as specified by your sysadmin. For one of our courses it is simply: /mat111.

From there you can navigate the directory structure in the Files pane, in the usual way. To reset the Files pane view back to your Home directory, push the ellipsis button again and enter: ~.

All of the foregoing will make sense to you once you have studied Unix-like directory structures.

Automated Assignment Collection/Return

Once our elementary students have acquired some proficiency with R, we introduce them to R Markdown and require them to turn in certain homework and project assignments as R Markdown documents. We write comments into a copy of the assignment and return it to the student. One of the best arguments for teaching in the server environment is that this collection and return process can be automated. Here’s how we do it these days.

First of all, each instructor should create a text file consisting of the network usernames of the students in his or her course (or section thereof), one username per line, and name it something like students.txt.

Save the file in your Home directory on the server.

You are going to create some sub-directories in the Home directories of your students, and for this you will need to act as a superuser. This action will in turn require you to provide your password to the computer. For security reasons, you don’t want to send your password in the clear every time you perform a superuser action, so you need to encrypt your password and provide a key in its place. For this purpose our sysadmin has written the following Perl script:

The above script, and others to follow, are housed in /scripts. You will use it to create an encrypted version of your password that is stored in a new file in your Home directory. To run the script, issue the following (suitably modified) command in R:

After you run the script, clear your R History: you don’t want to leave your password hanging out in the open.

Create Subdirectories

Here is the Perl script that we currently use to create submit and returned directories in the Home directory of each student in the class. Obviously your sysadmin will modify it to suit the file structure of your server.

To run the script, issue the following R command, suitably modified:

system("perl createdirectories.pl --studentfile=<StudentFileName>")

There are options to receive an email report confirming the creation of the directories, and to set permissions for them as well. Currently we use the default settings.

Collect Assignments

Students save an assignment into their submit directory, named according to some convention that you establish. Specifics vary, but the name must end with an underscore followed by the student username. For example: HW05_jdoe.Rmd is the fifth homework assignment, submitted by the student with username jdoe.

The Perl script for collection of assignments is as follows:

To run the script issue a command like the following:

If you would like to receive an email with a list of all students from whom you got an assignment, run this instead:

You can run the collection script as often as you like: it will pick up newly-submitted assignments but will not overwrite assignments collected from other students in a previous run.

Return Assignments

All of the assignments you collect appear in a homework folder in your Home directory, in sub-directories by assignment name and sub-sub-directories by student username. Navigate to the assignments one by one. For each assignment, open the R Markdown file and save it with an additional tag in the file name that will mark it out as the graded/commented copy to be returned to the student. We use _com as our tag, creating files like this: HW05_jdoe_com.Rmd.

For returning assignments, we have the following Perl script:

To run the script, you need the key for your encrypted password. Run a command like the following:

Note that the sysadmin has established, for each instructor, a file in /usr/local/sbin of student usernames for the instructor’s course. As students drop your course and you edit your local student file accordingly, the two files may fall out of sync, but the return script will still work correctly for students still enrolled in the course.

All in all, the server environment has proven to be quite useful for our courses. Nevertheless, there are a few complications and potential problems to keep in mind.

• Students can read from the Common Source directory, but cannot write to it. If a student wishes to perform a “knitting” type of action on a file in the Common Source directory—e.g., knitting an R Markdown document to HTML or previewing an R Presentation document—then she must save a copy into her Home directory and perform the knitting operations upon it. The same often goes for other instructors (default file permissions are still a bit unclear to us).
• Shiny apps are wonderful. We put them into the ancillary package that we use for our own elementary course, so that R users can run them locally once the package is installed, or after downloading them from the package’s GitHub repository. However, at many institutions firewalls don’t permit execution of the Shiny scripts. If this is the case at your own institution and you want your students to work with Shiny apps, then you must either install and configure the Shiny server or deploy the apps yourself on a site hosted by RStudio, e.g., http://shinyapps.io/. We have experimented with both venues and are pleased with the results.
• A small percentage of users eventually experience mysterious problems—e.g., loss of ability to knit an R markdown document more than once in a single server session—that we have not been able to diagnose and resolve completely. If the problem becomes sufficiently severe, a student could always use the desktop version, but this in itself creates a course management problem. Larger institutions than ours may wish to consider paying for the Enterprise version of the RStudio server, and the support that accompanies it.

We are grateful for the work of Scott Switzer, who serves as Server System Manager in the Office of Information Technology Services at Georgetown College. Scott manages the RStudio server and the College’s Shiny server, created the Perl scripts in this post, helped establish other website support for our elementary statistics course, and at an early stage played the role of informal command-line guru to the author of this post. If you have the good fortune to work with such a sysadmin at your own institution, make sure that she gets lots of love and special ice cream!

qnorm() Tutorial

Comparison of qnorm() with pnorm()

Both the function pnorm() in regular R and the function pnormGC() in the tigerstats package compute probabilities from known bounding values. For example, suppose that $X$ is a normally distributed random variable with mean 70 and standard deviation 3, and that you want to know:

$P(X < 72).$

Then you know the boundary value 72, but you don’t know the probability: the area under the normal density curve before 72. Functions like pnormGC() aim to give you that area, that probability:

require(tigerstats)

pnormGC(72, region = "below", mean = 70,
        sd = 3, graph = TRUE)

## [1] 0.7475

The function qnorm(), which comes standard with R, aims to do the opposite: given an area, find the boundary value that determines this area.

For example, suppose you want to find that 85th percentile of a normal distribution whose mean is 70 and whose standard deviation is 3. Then you ask for:

qnorm(0.85,mean=70,sd=3)

## [1] 73.11

The value 73.1093 is indeed the 85th percentile, in the sense that 85% of the values in a population that is normally distributed with mean 70 and standard deviation 3 will lie below 73.1093. In other words, if you were to pick a random member $X$ from such a population, then

$P(X < 73.1093) = 0.85$.

You can check that this is correct by plugging 73.1093 into pnormGC():

pnormGC(73.1093, region = "below", mean = 70,
        sd = 3, graph = TRUE)

## [1] 0.85

Sure enough, the area under the curve before 73.1093 is 0.85.

A Few More Examples

Making the Top Ten Percent (An Area Above)

Suppose that SAT scores are normally distributed, and that the mean SAT score is 1000, and the standard deviation of all SAT scores is 100. How high must you score so that only 10% of the population scores higher than you?

Here’s the solution. If 10% score higher than you, then 90% score lower. So just call qnorm() with 0.90 as the boundary value:

qnorm(0.90,mean=1000,sd=100)

## [1] 1128

In other words, the 90th percentile of SAT scores is around 1128.
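As always, you can check a qnorm() result by feeding it back into pnorm():

```r
# qnorm() finds the cutoff; pnorm() recovers the area below it
cutoff <- qnorm(0.90, mean = 1000, sd = 100)
cutoff                                  # about 1128.155

pnorm(cutoff, mean = 1000, sd = 100)    # 0.9
```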

Note: qnorm() deals by default with areas below the given boundary value. If we had asked for:

qnorm(0.10,mean=1000,sd=100)

then we would have got only the 10th percentile of the SAT scores, not the desired 90th percentile. If you would like to input 0.10 directly, then you can do so provided that you fiddle with the lower.tail argument:

qnorm(0.10,mean=1000,sd=100,
lower.tail=FALSE)

## [1] 1128

But really it seems easier just to do the math:

$1 - 0.10 = 0.90.$

An Area Between

Find a positive number $z$ so that the area under the standard normal curve between $-z$ and $z$ is 0.95.

Here’s the solution. If 95% of the area lies between $-z$ and $z$, then 5% of the area must lie outside of this range. Since normal curves are symmetric, half of this amount–2.5%–must lie before $-z$. Then the area under the curve before $z$ must be:

$0.95 + 0.025 = 0.975.$

Hence the number $z$ is actually the 97.5th percentile of the standard normal distribution, and we can find it as follows:

qnorm(0.975,mean=0,sd=1)

## [1] 1.96

So $z$ is about 1.96. We can check this result graphically as follows:

pnormGC(c(-1.96, 1.96), region = "between", mean = 0,
        sd = 1, graph = TRUE)

## [1] 0.95

Roctopress: Configure Your Octopress Blog for R

Introduction

This post records how I configured Octopress for blogging about R, my favorite statistical programming environment. It is intended for colleagues and students who would like to begin blogging in a similar vein. I will assume that you have got Octopress up and running, and that you have chosen to have it hosted by GitHub Pages.

I claim little in the way of originality: most of this post consists of tips I picked up from knowledgeable people on the web.

Directory Housekeeping

Create the following directories in the Octopress directory:

• Rmd_sources (your posts will start here)
• Rmd_sources_old (to archive old posts’ source files)
• source/images/figure (your R graphs will be placed here)
• source/images/cache (if you cache the results of an expensive computation, the results go here)

From R Markdown to Markdown

We will adopt the approach of Jason Bryer.

In your Octopress directory, place Jason’s R script (perhaps call it rmarkdown.r):

#' This R script will process all R markdown files (those with in_ext file extension,
#' .rmd by default) in the current working directory. Files with a status of
#' 'process' will be converted to markdown (with out_ext file extension, '.markdown'
#' by default). It will change the published parameter to 'true' and change the
#' status parameter to 'publish'.
#'
#' @param dir the directory to process R Markdown files.
#' @param images.dir the base directory where images will be generated.
#' @param images.url the base URL for generated images.
#' @param out_ext the file extension to use for processed files.
#' @param in_ext the file extension of input files to process.
#' @param recursive should rmd files in subdirectories be processed.
#' @return nothing.
#' @author Jason Bryer <jason@bryer.org>
convertRMarkdown <- function(dir=getwd(), images.dir=dir,
                             images.url='/images/',
                             out_ext='.markdown', in_ext='.rmd', recursive=FALSE) {
    require(knitr, quietly=TRUE, warn.conflicts=FALSE)
    files <- list.files(path=dir, pattern=in_ext, ignore.case=TRUE, recursive=recursive)
    for(f in files) {
        message(paste("Processing ", f, sep=''))
        content <- readLines(f)
        frontMatter <- which(substr(content, 1, 3) == '---')
        if(length(frontMatter) >= 2 & 1 %in% frontMatter) {
            statusLine <- which(substr(content, 1, 7) == 'status:')
            publishedLine <- which(substr(content, 1, 10) == 'published:')
            if(statusLine > frontMatter[1] & statusLine < frontMatter[2]) {
                status <- unlist(strsplit(content[statusLine], ':'))[2]
                status <- sub('[[:space:]]+$', '', status)
                status <- sub('^[[:space:]]+', '', status)
                if(tolower(status) == 'process') {
                    # This is a bit of a hack, but if a line has zero length (i.e. a
                    # blank line), it will be removed in the resulting markdown file.
                    # This will ensure that all line returns are retained.
                    content[nchar(content) == 0] <- ' '
                    content[statusLine] <- 'status: publish'
                    content[publishedLine] <- 'published: true'
                    outFile <- paste(substr(f, 1, (nchar(f)-(nchar(in_ext)))), out_ext, sep='')
                    render_markdown(strict=TRUE)
                    opts_knit$set(out.format='markdown')
                    opts_knit$set(base.dir=images.dir)
                    opts_knit$set(base.url=images.url)
                    opts_chunk$set(fig.path="images/")
                    opts_chunk$set(fig.width=3.5, fig.height=3, tidy=FALSE)
                    try(knit(text=content, output=outFile), silent=FALSE)
                } else {
                    warning(paste("Not processing ", f, ", status is '", status,
                                  "'. Set status to 'process' to convert.", sep=''))
                }
            } else {
                warning("No 'status:' line found in front matter. Will not process this file.")
            }
        } else {
            warning("No front matter found. Will not process this file.")
        }
    }
    invisible()
}

From R Markdown to HTML

Octopress ships with rdiscount as the converter from markdown to HTML, but to handle math you want another converter. I played around with pandoc for a while, but eventually decided on kramdown.

Octopress may be “locked” to a specific older version of kramdown, so install that version:

Now modify your Gemfile to include kramdown. For example, my Gemfile now looks like:

To get MathJax to work, my source is Zheng Dong.

Considerations of Style

Drawback: your blog will look exactly like my blog.

Notes:

• CSS for caption and figcaption are left over from earlier experiments with pandoc, but I’m retaining them in case they come in handy later.

Producing a Post

Create a new .Rmd document, and put it in Rmd_sources. At the top, make sure you have stuff like this:

It’s a good idea to set some chunk options (use include=FALSE):
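For instance, a setup chunk along these lines (with include=FALSE in the chunk header, so it doesn’t show up in the post) sets sensible defaults; these particular option values are only suggestions, matching the figure settings used in the conversion script:

```r
# knitr chunk-option defaults for blog posts; adjust to taste
knitr::opts_chunk$set(
  fig.width = 3.5,   # small figures suit a blog column
  fig.height = 3,
  tidy = FALSE       # keep code formatted as written
)
```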

All images from all posts end up in the same directory, so when a code chunk results in a graphic, give that chunk a unique name (different from all other graph chunks that you will ever produce with this blog).

When you are ready to process run:

source("~/octopress/rmarkdown.r")

(This assumes your working directory is your home directory, and that octopress is directly under your home. Modify the path name if this is not the case.)

Then change working directory to Rmd_sources and run:

convertRMarkdown()
system("cp figure/* ../source/images/figure")

If you created any cache make sure you named the chunks uniquely. Also you need to run:

system("cp cache/* ../source/images/cache")

Finally, get the markdown file into the source/_posts directory:

system("cp 2014-04-01-HowToSurviveZombieArmageddonUsingR.markdown ../source/_posts")

Then the usual:

until it looks good, then

Then commit your changes and push to your GitHub repository:

Creating a Topics Feed

Say you produce something you think merits wider distribution, and want to pass it on to a great site like R-bloggers: then you need to create a category feed. Here’s some help I got from Matt Harrison.

Create the file source/YourCategoryName.xml (make sure to modify occurrences of “YourCategoryName” in what follows to whatever you please):

Archive Old R Markdowns

Once you are done publishing a post, move its R Markdown source into the Rmd_sources_old directory, so that convertRMarkdown() won’t keep processing it needlessly.

Lingering Problems

• kramdown differs from knitr in how it recognizes math. If you see odd results, consult the kramdown documentation on the web to learn how to do it right. If you want to stick with the R Markdown way, then from time to time you may have to escape certain special characters.
• R-bloggers takes your stuff from a blog feed. It does not recognize MathJax, so your math won’t render.