The first thing is to model rolling a standard six-sided die. We want to sample from the uniform distribution of integers between 1 and 6. Don't forget to sample with replacement. As usual, I use yellow background for my code, and blue background to show what the program prints to the screen.
As we expect, the mean or expected value is 1/6 * (1 + 2 + 3 + 4 + 5 + 6) = 21/6 = 3.5 and the variance is 1/6 * 2 * (0.52 + 1.52 + 2.52) = 1/3 * (0.25 + 2.25 + 6.25) = 1/3 * 8.75 = 2.92, and the (population) standard deviation is √2.92 = 1.71.
We will take a look at the distribution of the numbers using the hist function. There are a couple of details to consider when using hist. (I often have to remind myself about its arguments by doing "?hist"). One is the argument "breaks"
I often specify the number of cells explicitly, especially when I want to compare multiple plots. Since I had a little trouble with this plot, I tried a couple of different things, and I want to plot them all in the same window. For this, I use the formatting command "par" and tell it to make a set of plots in 1 row and 3 columns.
As you can see, the first two histograms look funny. What is going on is that R is trying to bin the numbers in the vector with breakpoints exactly on the integer values 1, 2, 3... Since the numbers are themselves drawn from 1, 2, 3... it has to go right or left, and at the boundaries of the plot it looks weird. One argument controlling how this works is:
The default setting is "TRUE", which has cells formed as "right-closed (left-open) intervals"---whatever that means. The result is that at the left boundary all the values "1" and "2" have been binned together. Neither setting for "right" is what we want. My solution was to shift all the values to the left by 0.5 and then plot the result. The "1"s are plotted in the cell between 0 and 1.
Now, let's see what happens if we roll the dice again. What we're going to do is start with a vector of the size we need called z, that is filled with zeroes. (It has a length of 50,000). We need to initialize it because we will be updating at each round. We roll the dice six times and plot the results at each stage.
As you can see, we already have a reasonable approximation to the normal after summing just two numbers drawn at random from the "sample space" of 1, 2... , 6. And by n=6 the approximation is very good. Since the standard deviation goes as √variance, and the variances add, by n=6 we have a range of 31 (from 6 to 36) but the sd is only √(6*2.91) = 4.18 (4.19 in the figure).
Here is the code:
I found the code for the function that plots the normal distribution with the same mean and sd as our samples somewhere on the web. It is doing something a bit funny. I think it is making an "anonymous" (i.e. unnamed) function to feed to plot, and then applying that function with bounds = 0 and max(w), but I'm not real clear about how this works. The parameter lwd is for line width.
Finally, let's look a little more closely at the z vector after 6 rounds. We plot it in the same histogram as the normal density of the same mean and sd, switching which one is in back as we move from the left panel to the right.
The code:
Notice, however, that the correspondence is not perfect, as shown by the results of calling the summary function.
Our vector z seems a little fatter at the first and third quartiles and yet its minimum and maximum values are closer to the mean than the true normal distribution.