[Continued from Grasping the normal distribution (Part 3)]
<b>What if I <i>want</i> areas?</b>
As we've discussed, the sum of the probabilities for all possible results must add up to unity. And indeed they do, in <b>Figure 8</b>. If, for a given "curve," you add up all the <i>y</i>-axis values at the marker points, you will absolutely get 1.0. <p>
But what if we actually <i>prefer</i> the "curves" to be shown such that they enclose equal areas? In that case, we have one more transformation to perform. <p>
<b>Figure 5</b> gives us a hint. In that graph, the width of each bar—call it <i>Δx</i>—is exactly equal to 1. So the area of the bar is the same as its height. Mathematically, we can say: <p>
(17) <i>A</i> = <i>y</i> · <i>Δx</i> = <i>y</i> <p>
In this special case, adding up all the occurrences is the same thing as computing the total area, which had better come out to be 7,776. <p>
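As a quick check on that claim, a few lines of Python (mine, not the column's) can tally every possible five-dice roll:

```python
from itertools import product

# Tally every possible outcome of rolling five six-sided dice
# (a sketch; the article's Figure 5 was of course produced elsewhere).
counts = {}
for roll in product(range(1, 7), repeat=5):
    total = sum(roll)
    counts[total] = counts.get(total, 0) + 1

print(sum(counts.values()))    # 7776 = 6**5
print(counts[17], counts[18])  # the twin peaks: 780 occurrences each
```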
The situation is the same in <b>Figure 8</b>. Now the graph is showing probabilities, but the width of the (not very apparent) bar is still unity, so the area of the bar is: <p>
(18) <i>A</i> = <i>P</i>(<i>n</i>) <i>Δx</i> = <i>P</i>(<i>n</i>) <p>
Adding them all up, we should get: <p>
(19) <i>A</i> = Σ<sub><i>n</i></sub> <i>P</i>(<i>n</i>) <i>Δx</i> = <i>Δx</i> <p>
Why would I want to include this new parameter, <i>Δx</i>, if its value is unity anyhow? For two reasons. First, the variable <i>x</i> may have units, like meters, volts, or pomegranates. The parameter <i>Δx</i> might have the <i>value</i> 1, but it will still have the same <i>units</i> as <i>x</i>. Mixing parameters with and without units is not allowed. <p>
More importantly, I just got through <i>scaling</i> the curves to force them onto the same horizontal range. In doing so, I multiplied by the scale factor given in <b>Equation 16</b>. Now I see that this scale factor is in fact the very same thing as <i>Δx</i>. In the figure, you can see that the marker points are getting closer together as we add dice to the experiment. As in <b>Equation 19</b>, the total area is no longer unity, but <i>Δx</i>. <p>
To force the curves to have areas of unity, I have to divide the <i>y</i>-values by <i>Δx</i> again. Since these values are no longer probabilities, I'll just call them <i>y</i>. <b>Figure 9</b> shows the results. <p>
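The whole shrink-and-stretch pipeline can be sketched in a few lines of Python. The spacing dx below is my own plausible stand-in for the <b>Equation 16</b> scale factor, which isn't reproduced here:

```python
from itertools import product

N = 5                                 # number of dice
counts = {}
for roll in product(range(1, 7), repeat=N):
    counts[sum(roll)] = counts.get(sum(roll), 0) + 1

# Occurrences -> probabilities: divide by 6**N, so they sum to 1.
probs = {s: c / 6 ** N for s, c in counts.items()}

# Squeeze the marker points onto roughly +/-1.  This dx is an assumed
# spacing, not necessarily the article's exact Equation 16 factor.
dx = 2.0 / (5 * N + 1)
ys = {s: p / dx for s, p in probs.items()}   # divide the y-values by dx ...

area = sum(y * dx for y in ys.values())      # ... so the enclosed area is 1
print(round(area, 12))  # 1.0
```

Whatever spacing dx we pick, dividing the probabilities by it guarantees the bars enclose unit area.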
<b>Now you see it ...</b>
Now <i>here's</i> a graph we can learn to love. Now that we have equal areas under each curve, we can see more clearly how they morph to look more like continuous curves. Not only do the (apparent) curves get smoother as we add dice, but the peak also gets higher, while the sides pinch in to maintain the equal area requirement. <p>
But hang on ... is that a fifth curve I spy? According to the legend, the dotted black line is something called "<i>Normal</i>." Unlike the other "curves," it's a truly continuous curve. <p>
That, my friends, is the <i>normal distribution function</i>. It's taken us a while to get to it, but the evidence of <b>Figure 9</b> is overwhelming. If, seeing <b>Figure 9</b>, you still aren't convinced that the sum of separate random processes trends to the bell curve of the normal distribution, there's no hope for you. <p>
<b>Sum vs. integral</b><br>Before we go forward, I want to call your attention to a very important aspect of <b>Figure 9</b>. As you know, the two-dice through five-dice "curves" are not really curves at all, but discrete functions, with <i>y</i>-values that only exist at the marker points. But the curve labelled "normal" is very much a continuous curve. <p>
It's not often that you get to see both discrete and continuous functions on the same graph. How did we do this? <p>
The answer becomes clear when you compare the area under the curves. When I scaled the <i>y</i>-axis values to force the areas for the discrete curves to be unity, I required: <p>
(20) Σ<sub><i>n</i></sub> <i>y</i>(<i>n</i>) <i>Δx</i> = 1 <p>
For the continuous curve, I require: <p>
(21) ∫<sub>−∞</sub><sup>+∞</sup> <i>p</i>(<i>x</i>) <i>dx</i> = 1 <p>
See how the two formulas complement each other? For the discrete version, we're measuring the area of a bar whose height is <I>P</I>(<i>n</i>), and whose width is <i>Δx</i>. Similarly, for the continuous function <i>p</i>(<i>x</i>) we get the area by integrating it over all real numbers. So what is this new function <i>p</i>(<i>x</i>)? <p>
Well, it's a probability all right, but it's not the probability that a measurement is exactly equal to the <i>x</i>-axis value. Since <i>x</i> can range over all real numbers, the probability that the result is <i>exactly</i> equal to <i>x</i> is zero. <p>
Instead, <i>p</i>(<i>x</i>)<i>dx</i> is the probability that the measurement falls into an infinitesimally narrow range, <i>between</i> <i>x</i> and <i>x</i>+<i>dx</i>. <p>
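That interpretation is easy to check numerically. The sketch below uses Python's built-in NormalDist as a concrete stand-in for <i>p</i>(<i>x</i>); the distribution, test point, and step size are all arbitrary choices of mine:

```python
from statistics import NormalDist

nd = NormalDist(0.0, 1.0)   # an example p(x): the standard normal
x, dx = 0.5, 1e-6           # a test point and a "nearly infinitesimal" width

# Probability of landing between x and x+dx, from the cumulative distribution:
prob = nd.cdf(x + dx) - nd.cdf(x)

# For small dx, that probability divided by dx approaches p(x) itself.
print(prob / dx, nd.pdf(x))  # the two values nearly agree
```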
<b>The math of it all</b><br>Now that you've seen the curve, I still must show you the math behind it. Here again, I'm given the opportunity to derive the math from first principles. But I'm going to duck it again. As I mentioned earlier, the classical derivation is pretty horrible. If you'd like to see it done the easy way, see the exquisite paper <p>
<i>The Normal Distribution: A derivation from basic principles</i>, Dan Teague, North Carolina School of Science and Mathematics
<a href="http://courses.ncssm.edu/math/Talks/PDFS/normal.pdf" target="blank">http://courses.ncssm.edu/math/Talks/PDFS/normal.pdf</a><p>
To learn all there is to know about the normal distribution (including its origin, inspired by a gambler), see the exhaustive study by Saul Stahl: <p>
"Evolution of the Normal Distribution," Saul Stahl, <i>Mathematics Magazine</i>, Vol. 79, No. 2, April 2006, pp. 96-113<p>
<a href="http://mathdl.maa.org/images/upload_library/22/Allendoerfer/stahl96.pdf" target="blank">http://mathdl.maa.org/images/upload_library/22/Allendoerfer/stahl96.pdf</a><p>
As for my "derivation," I'm going to follow the example set by J. Willard Gibbs, the father of statistical mechanics, circa 1900. He said (and I paraphrase), "We use this form because it's the simplest one we can think of, that works." Now, <i>that's</i> my kind of physicist! <p>
Take another look at the shapes in <b>Figure 9</b>. There are a lot of things we can say about them, without knowing anything about the mathematical formula underlying them. Indeed, if we'd been clever enough, we could have said these things from the outset. These things are: <p>
- The most probable value of <i>x</i> (the peak of the distribution) should be zero
- The distribution should decrease monotonically as <i>x</i> moves away from zero
- The functions should be symmetric around zero
- The distribution should tail off to zero at the extremes (which are ±∞)<p>
As soon as you hear the words, "tail off to zero," you should be thinking of an exponential function. One function that does this is: <p>
(22) <i>f</i>(<i>x</i>) = <i>e</i><sup>−<i>x</i></sup>
But that one's no good, because it's not symmetric. In fact, it grows to infinity as <i>x</i> goes more and more negative. <p>
So what's the next simplest function we can think of? Why, it's the one that doesn't care if <i>x</i> is positive or negative: <p>
(23) <i>f</i>(<i>x</i>) = <i>e</i><sup>−<i>x</i>²</sup> <p>
This is the function Sir Willard used, and if it's good enough for him, it's good enough for me. <b>Figure 10</b> shows the function in all its glory. <p>
That's definitely the shape we want. We still have to add some bric-a-brac to make it functional, but the shape is perfect. <p>
<b>The area</b><br>By now we should be very comfortable with the fact that any probability distribution curve must enclose an area equal to unity. Does this one? Let's find out. The area under the curve of <b>Figure 10</b> is: <p>
(24) <i>A</i> = ∫<sub>−∞</sub><sup>+∞</sup> <i>e</i><sup>−<i>x</i>²</sup> <i>dx</i>
Did you see that I had to integrate from −∞ to +∞, which is of course the full range of real numbers? The function in <b>Figure 10</b> sure looks as though there's little or no area out past <i>x</i> = ±4, but since the function never quite gets to zero, we still have to include those tiny slivers of area out in the suburbs. <p>
Now, what's the value of the integral? We can find it in a number of ways. If you're feeling adventurous and like to do things from first principles (as I usually do), you can derive the integral yourself. It's fairly easy, but not at all obvious. See how here: <p>
<a href="http://www.youtube.com/watch?v=fWOGfzC3IeY" target="blank">http://www.youtube.com/watch?v=fWOGfzC3IeY</a><p>
If you still have your book called <i>Tables of Integrals</i>, you can simply look up the answer. Your book is probably not the same as mine—mine was Pierce, printed in 1939. <p>
Or, you can do as I did: Ask Mathcad, who says: <p>
(25) ∫<sub>−∞</sub><sup>+∞</sup> <i>e</i><sup>−<i>x</i>²</sup> <i>dx</i> = √π <p>
Noting very astutely that √π is not the same thing as 1, I see that I must modify <b>Equation 23</b> to read: <p>
(26) <i>p</i>(<i>x</i>) = (1/√π) <i>e</i><sup>−<i>x</i>²</sup> <p>
In this form, the function has an integral of 1, so it's earned the right to be called a probability distribution function (hence the name change). Note that the central peak of <i>p</i>(<i>x</i>) occurs at <i>x</i> = 0, where its height is clearly: <p>
(27) <i>p</i>(0) = 1/√π <p>
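If you'd like to check both the √π result and the new unit area without a table of integrals or Mathcad, a rough numerical sketch (my own, with an arbitrary truncation at ±10) does the job:

```python
import math

def f(x):
    return math.exp(-x * x)           # the bare Gaussian, e**(-x**2)

def p(x):
    return f(x) / math.sqrt(math.pi)  # normalized to unit area

# Trapezoid rule on +/-10; the tails beyond that are astronomically small.
def integrate(g, a=-10.0, b=10.0, n=100_000):
    h = (b - a) / n
    return h * (0.5 * g(a) + sum(g(a + i * h) for i in range(1, n)) + 0.5 * g(b))

print(integrate(f))  # ~ sqrt(pi) ~ 1.7724539
print(integrate(p))  # ~ 1.0
print(p(0.0))        # peak height 1/sqrt(pi) ~ 0.5641896
```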
<b>On the home stretch</b><br>As you'll recall, in building <b>Figure 9</b> I had to shrink and stretch the <I>N</I>-dice "curves" to force them onto the same <i>x</i>-axis interval (±1) and keep their areas equal. We need to be able to do something similar for <i>p</i>(<i>x</i>). The new multiplying constant takes care of the area constraint, but we still need to be able to scale the <i>x</i>-axis width. I think it's safe to say that we won't always want the width of the central peak to be about ±2 or so. Even if we did, we still need a scale factor on <i>x</i>, because remember, <i>x</i> can—and often does—have units. I'm pretty sure that I don't know how to raise <i>e</i> to the power 1.618 <i>pomegranates</i>.<sup>2</sup> <p>
To take care of this, let's make the change of variables: <p>
(28) <i>u</i> = <i>x</i>/(σ√2), &nbsp;or&nbsp; <i>x</i> = σ√2 <i>u</i> <p>
I'm sure you must be wondering where that factor of 2 came from. It seems like an unnecessary complexity, added for no good reason. Actually, there <i>is</i> a good reason—even a very good reason—but it won't be apparent until later. For now, just trust me, Ok? <p>
Note carefully that it's not enough to just substitute for <i>x</i> in <b>Equation 26</b>. If we try to just stretch or shrink the horizontal scale, the function will still have the same height, so the area will change. We really need to go back to <b>Equation 24</b> and evaluate the integral again. Differentiating the last of <b>Equation 28</b> gives: <p>
(29) <i>dx</i> = σ√2 <i>du</i> <p>
Substituting for both <i>x</i> and <i>dx</i> in <b>Equation 24</b> gives the new integral: <p>
(30) <i>A</i> = ∫<sub>−∞</sub><sup>+∞</sup> <i>e</i><sup>−<i>u</i>²</sup> σ√2 <i>du</i> = σ√2 ∫<sub>−∞</sub><sup>+∞</sup> <i>e</i><sup>−<i>u</i>²</sup> <i>du</i> <p>
Since we're integrating over the range ±∞, the change of variables doesn't alter the limits: infinity divided by σ√2 is still infinity. So the integral still evaluates to √π, which makes the new area: <p>
(31) <i>A</i> = σ√2 · √π = σ√(2π) <p>
And our function now takes the form: <p>
(32) <i>p</i>(<i>x</i>) = [1/(σ√(2π))] <i>e</i><sup>−<i>x</i>²/2σ²</sup> <p>
<b>I mean...</b>
There is one last little tweak to <i>p</i>(<i>x</i>). Sometimes, people need to translate the curve along the <i>x</i>-axis so that the central peak no longer occurs at <i>x</i> = 0. This isn't so much a problem for us, because when you're dealing with noise, its most likely value will always be zero. But for the sake of completeness, here is the normal distribution function in its most general form. <p>
(33) <i>p</i>(<i>x</i>) = [1/(σ√(2π))] <i>e</i><sup>−(<i>x</i> − µ)²/2σ²</sup> <p>
As you can see, we now have two parameters we can adjust to match the situation. The constant µ is an additive factor that shifts the peak left and right, while σ allows for scaling (and removing the units of) <i>x</i>. <p>
These two parameters have names, and those names—which come from the science of statistics—should be familiar to you. µ is the <i>mean</i>, and σ is the <i>standard deviation</i>. As my last trick for this column, I'll prove to you that these names fit the statistical definitions of these parameters. <p>
Because we had to scale <i>x</i>, we now have a factor of σ in the multiplicative constant. This means, of course, that the height of the central peak will change as we vary σ. <p>
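As a sanity check on this general form, here's a sketch comparing a direct transcription of <b>Equation 33</b> against Python's standard-library NormalDist; the parameter values are arbitrary assumptions of mine:

```python
import math
from statistics import NormalDist

mu, sigma = 1.5, 0.7        # arbitrary example values

def p(x):
    # Equation 33, written out directly
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / norm

nd = NormalDist(mu, sigma)  # the same curve, from the standard library
for x in (-1.0, 0.0, 1.5, 3.0):
    assert math.isclose(p(x), nd.pdf(x))

print(p(mu))  # peak height = 1/(sigma*sqrt(2*pi)): it grows as sigma shrinks
```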
<b>The expectation value</b><br>Let's look back for a moment, to the things we were doing with dice. For any number of dice, I showed you the histograms, which can be easily turned into probability distributions using <b>Equation 10</b>. Until now, we've only concerned ourselves with the probabilities of having a certain result, like 2, 12, or 7. But what if the thing we're interested in is not the result itself, but something that <i>depends</i> on it? To stick with the dice-game theme, what if you get, say, $10 every time you roll two dice and get a 5, but only $2 if you roll a 9 (which, you may recall, has the same probability: 1/9)? In that case, it's not enough just to know the probability of getting a certain result from a dice roll; you also need to know what happens when you get that roll. In other words, you need the rules of the game. <p>
To take another example, suppose I buy a $1 lottery ticket, for a pot that's currently worth $300,000,000. What can I expect to get out of the deal? Well, one thing's for sure: It's not the 300 mill, because my likelihood of winning is very, very low. <p>
There's a mathematical term for this concept, and it's the same one the gamblers use; the only difference is that the gamblers got to it several thousand years earlier. The term is <i>expectation value</i>. <p>
Mathematically, if <I>P</I> is the probability of winning, and <i>v</i> the payout value, then the expectation value of my lottery ticket is: <p>
(34) <i>E</i>(<i>v</i>) = ⟨<i>v</i>⟩ = <i>P v</i> <p>
Here I've shown two popular notations for the expectation value. I tend to prefer the angle-bracket notation ⟨..⟩, because it's completely unambiguous. But the <i>E</i>(..) notation seems more popular lately. <p>
The same principle works for games like the dice game, only then we need to compute the average of all possible outcomes. If there are <i>n</i> possible outcomes from a given dice roll, then the expectation value becomes: <p>
(35) ⟨<i>f</i>⟩ = Σ<sub><i>i</i> = 1</sub><sup><i>n</i></sup> <i>P</i><sub><i>i</i></sub> <i>f</i><sub><i>i</i></sub> <p>
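Here's <b>Equation 35</b> in action for the two-dice game; the payout rule below is a hypothetical of my own choosing:

```python
from itertools import product

# Hypothetical payout rule: $10 for rolling a 5, $2 for a 9 (each has
# probability 4/36 = 1/9 with two dice), nothing otherwise.
def payout(total):
    return {5: 10.0, 9: 2.0}.get(total, 0.0)

rolls = list(product(range(1, 7), repeat=2))   # all 36 equally likely rolls

# Equation 35: sum of probability * payoff over every possible outcome
expected = sum(payout(a + b) for a, b in rolls) / len(rolls)
print(expected)  # (4*10 + 4*2)/36 = 1.333...
```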
Now that we see the concept, it's easy enough to extend it to the case of continuous functions. If <i>f(x)</i> represents some function of <i>x</i> (the rules of the game, if you will), then its expectation value is: <p>
(36) ⟨<i>f</i>⟩ = ∫<sub>−∞</sub><sup>+∞</sup> <i>f</i>(<i>x</i>) <i>p</i>(<i>x</i>) <i>dx</i> <p>
This important integral embodies the central idea of how to deal with random processes. <p>
For everything we'll be doing from now on, we'll be using the normal distribution, so we might as well insert it into <b>Equation 36</b> explicitly, to get: <p>
(37) ⟨<i>f</i>⟩ = [1/(σ√(2π))] ∫<sub>−∞</sub><sup>+∞</sup> <i>f</i>(<i>x</i>) <i>e</i><sup>−(<i>x</i> − µ)²/2σ²</sup> <i>dx</i> <p>
Just to emphasize: This definition works for any function <i>f</i>(<i>x</i>)—at least, any "well-behaved" function, meaning that it doesn't have any internal infinities. Of course, there's no guarantee that we'll be able to get a closed-form solution; we might have to resort to a numerical method such as Simpson's rule. <p>
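Since Simpson's rule just came up, here's a sketch of that numerical route applied to <b>Equation 37</b>, with <i>f</i>(<i>x</i>) = <i>x</i>² and example values of µ and σ that I've picked arbitrarily; the exact answer in that case is σ² + µ²:

```python
import math

mu, sigma = 0.0, 2.0    # assumed example parameters

def p(x):
    # Equation 33: the normal distribution function
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / norm

def simpson(g, a, b, n):
    """Composite Simpson's rule with n (even) intervals."""
    h = (b - a) / n
    s = g(a) + g(b)
    s += 4.0 * sum(g(a + i * h) for i in range(1, n, 2))
    s += 2.0 * sum(g(a + i * h) for i in range(2, n, 2))
    return s * h / 3.0

# Equation 37 with f(x) = x**2; truncating the infinite range at
# +/-10 sigma loses essentially nothing.  Exact answer: sigma**2 + mu**2.
result = simpson(lambda x: x * x * p(x), mu - 10 * sigma, mu + 10 * sigma, 2000)
print(result)  # ~ 4.0
```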
[To be continued at Grasping the normal distribution (Part 5)]