[section] <font size="+3">Applied <br> <font size="+2">Probability and Stochastic Processes <br> <font size="-1">Lecture Notes for 577/578 Class</font>

Applied
Probability and Stochastic Processes
Lecture Notes for 577/578 Class

Wlodzimierz Bryc
Department of Mathematics
University of Cincinnati
P. O. Box 210025
Cincinnati, OH 45221-0025
E-mail: brycw@math.uc.edu

Created: October 22, 1995
Modified: 1996
Printed: Mar 13, 1998
© 1995 Wodzimierz Bryc
WWW version

Course description

This course is aimed at students in applied fields and assumes a prerequisite of calculus. The goal is a working knowledge of the concepts and uses of modern probability theory. A significant part of such a ``working knowledge" in modern applications of mathematics is computer-dependent.

The course contains mathematical problems and computer exercises. Students will be expected to write and execute programs pertinent to the material of the course. No programming experience is necessary or assumed. But the willingness to accept a computer as a tool is a requirement.

For novices, BASIC programming language, (QBASIC in DOS, Visual Basic in Windows, or BASIC on Macintosh) is recommended. BASIC is perhaps the easiest programming language to learn, and the first programming language is always the hardest to pick.

Programs in QBASIC 4.5 on IBM-compatible PC and, to a lesser extend, programs on Texas Instrument Programmable calculator TI-85, and WindowsTM programs in Microsoft Visual Basic 3.0 are supported. This means that I will attempt to answer technical questions and provide examples. Other programming languages (SAS, C, C++, Fortran, Cobol, Assembler, Mathematica, LISP, TEX (!), Excel, etc.) can be used, but I will not be able to help with the technical issues.



Contents of the course (subject to change)

Supporting materials

This text is available through the Internet in PDF, PostScript, or DVI formats. A number of other math related resources can be found on WWW. Also available are supporting BASIC program files.

Auxiliary textbooks:

Conventions

Exercises

The text has three types of ``practice questions", marked as Problems, Exercises, and Projects. Examples vary the most and could be solved in class by an instructor; Exercises are intended primarily for computer-assisted analysis; Problems are of more mathematical nature. Projects are longer, and frequently open-ended. Exercises, Projects, and Problems are numbered consecutively within chapters; for instance Exercise 10.2 follows Problem 10.1, meaning that there is no Exercise 10.1.

Programming

The text refers to BASIC instructions with the convention that BASIC key-words are capitalized, the names of variables, functions, and the SUBs are mixed case like ThisExample. Program listings are typeset in a special ``computer" font to distinguish them from the rest of the text.

Copyright and License for 1996 version

This textbook is copyrighted © W. Bryc 1996. All rights reserved. GTDR.

Academic users are granted the right to make copies for their personal use, and for distribution to their students in academic year 1995/1996.

Reproduction in any form by commercial publishers, including textbook publishers, requires written permission from the copyright holder.

WWW version© 1996 W. Bryc

Part 1
Probability


Chapter 1
Random phenomena

Definition of a Tree: A tree is a woody plant with erect perennial trunk of at least 3.5 inches (7.5 centimeters) in diameter at breast height (41/2 feet or 1.3 meters), a definitely formed crown of foliage, and a height of at least 13 feet (4 meters).
The Auborn Society Field Guide to North American Trees.

This chapter introduces fundamental concepts of probability theory; events, and their chances. For the readers who are familiar with elementary probability it may be refreshing to see the computer used for counting elementary events, and randomization used to solve a deterministic optimization problem.

The questions are

1.1  Mathematical models, and stochastic models

Every theory has its successes, and its limitations. These notes are about the successes of probability theory. But it doesn't hurt to explain in non-technical language some of its limitations up front. This way the reader can understand the basic premise before investing considerable time.

To begin with, we start with a truism. Real world is complicated, often to a larger degree than scientists will readily admit. Most real phenomena have multi-aspect form, and can be approached in multiple ways. Various questions can be asked. Even seemingly the same question can be answered on many incompatible levels. For instance, the generic question about dinosaur extinction has the following variants.

Theses questions are ordered from the most individual level to the most abstract. The reader should be aware that probability theory, and stochastic modelling deal only with the most abstract levels of the question. Thus, a stochastic model may perhaps shed some light whether dinosaurs had to go extinct, or whether mammals will go extinct, but it wouldn't go into details of which comet had to be responsible for dinosaurs, or which is the one that will be responsible for the extinction of mammals.

It isn't our contention that individual questions have no merit. They do, and perhaps they are as important as the general theories. For example, a detective investigating the cause of a mysterious death of a young woman, will have little interest in the ``abstract statistical fact" that all humans eventually die anyhow. But individual questions are as many as the trees in the forest, and we don't want to overlook the forest, either.

Probabilistic models deal with general laws, not individual histories. Their predictions are on the same level, too. To come back to our motivating example, even if a stochastic model did predict the extinction of dinosaurs (eventually), it would not say that it had to happen at the time when it actually happened, some 60 million years ago. And the more concrete a question is posed, say if we want to know when Dino the dinosaur died, the less can be extracted from the stochastic model.

On the other hand, many concepts of modern science are define in statistical, or probabilistic sense.(If you think this isn't true, ask yourself how many trees do make a forest.) Such concepts are best studied from the probabilistic perspective. An extreme view is to consider everything random, deterministic models being just approximations that work for small levels of noise.

1.2  Events and their chances

Suppose W is a set, called the probability, or the sample space. We interpret W as a mathematical model listing all relevant outcomes of an experiment.

Let \cal M be a s-field of its subsets, called the events. Events A,B Î \cal M model sentences about the outcomes of the experiment to which we want to assign probabilities. Under this interpretation, the union AÈB of events corresponds to the alternative, the intersection AÇB corresponds to the conjunction of sentences, and the complement A¢ corresponds to the negation of a sentence. For A,B Î \cal M, A\B: = AÇB¢ denotes the set-theoretical difference.

For an event A Î \cal M the probability Pr(A) is a number interpreted as the degree of certainty (in unique experiments), or asymptotic frequency of A (in repeated experiments). Probability Pr(A) is assigned to all events A Î \cal M, but it must satisfy certain requirements (axioms). A set function Pr(·) is a probability measure on (W,\cal M), if it fulfills the following conditions:

  1. 0 £ Pr(A) £ 1
  2. Pr(W) = 1
  3. For disjoint1 Pr(AÈB) = Pr(A)+Pr(B).
  4. If An are such that Çn ³ 1 An = Æ and A1 É A2 É ... É An É An+1 É ... are decreasing events, then
    Pr
    (An)® 0.
    (1)

Probability axioms do not determine the probabilities in a unique way. The axioms provide only minimal consistency requirements, which are satisfied by many different models.

1.2.1  Uniform Probability

For finite set W let
Pr
(A) = # A
#W
.
(2)
This captures the intuition that the probability of an event is proportional to the number of ways that the event might occur.


The uniform assignment of probability involves counting. For small sample spaces this can be accomplished by examining all cases. Moderate size sample spaces can be inspected by a computer program. Counting arbitrary large spaces is the domain of combinatorics. It involves combinations, permutations, generating functions, combinatorial identities, etc. Short recalls the most elementary counting techniques. Problem 1 Three identical dice are tossed. What is the probability of two of a kind?

The following BASIC program inspects all outcomes when five dice are rolled, and counts how many are ``four of a kind".

PROGRAM yahtzee.bas
'


'declare function and  variables
DECLARE FUNCTION CountEq! (a!, b!, c!, d!, e!)

'prepare screen
CLS
PRINT "Listing four of a kind outcomes in Yahtzee..."

'*** go through all cases
FOR a = 1 TO 6: FOR b = 1 TO 6: FOR c = 1 TO 6: FOR d = 1 TO 6: FOR e = 1 TO 6
   IF CountEq(a + 0, b + 0, c + 0, d + 0, e + 0) = 4 THEN
       PRINT a; b; c; d; e; : ct = ct + 1
       IF ct MOD 5 = 0 THEN PRINT :  ELSE PRINT "|";
   END IF
NEXT: NEXT: NEXT: NEXT: NEXT

'print result
PRINT
PRINT "Total of "; ct; " four of a kind."

FUNCTION CountEq (a, b, c, d, e)
'*** count how many of five numbers are the same
DIM x(5)
x(1) = a
x(2) = b
x(3) = c
x(4) = d
x(5) = e
max = 0
FOR j = 1 TO 5
    ck = 0
    FOR k = 1 TO 5
        IF x(j) = x(k) THEN ck = ck + 1
    NEXT k
    ck = -ck
    IF ck > max THEN max = ck
NEXT j
'assign value to function
CountEq = max
'
END FUNCTION

Here is a portion of its output:

The program is written explicitly for tossing five dice, and you may want to modify it to answer similar question for any number of dice, and an arbitrary k-of-a-kind question.

[ 1 Run and time YAHTZEE.BAS. Then estimate how long a similar problem would run if the question involved tossing 15 fair dice. The answer depends on your computer, and the software. Both Pascal and C-programs seem to run on my computer about 15 times faster than the (compiled) QuickBasic. ANS: Running time would take years!

Exercise 1.2.1 shows the power of old-fashioned pencil-and-paper calculation. Problem 2 Continuing Problem 1.2.1, suppose now n identical dice are tossed. What is the probability of n-1 of a kind? ANS: [5 n/( 6n-1)].

1.2.2  Geometric Probability

For bounded subsets W Ì IRd, put

Pr
(A) = |A|
|W|
.
(3)
This captures the intuition that the probability of hitting a target is proportional to the the size of the target.

Geometric probability usually involves multivariate integrals. Example 1 A point is selected from the 32 cm wide circular dartboard. The probability that it lies within the 8cm wide bullseye is [1/ 16].

Example 2 A needle of length 2l < 2 is thrown onto a paper ruled every 2 inches. The probability that the needle crosses a line is 2l/p.

Analysis of Example 1.2.2 is available on WWW.

[ 2 Test by experiment (or simulation on the computer) if the above two examples give correct answers. (Before writing a program, you may want to read Section first.)

Example 3 Two drivers arrive at an intersection between 8:00 and 8:01 every day. If they arrive within 15 seconds of each other, both cars have to stop at the stop-sign. How often do the drivers pass through the intersection without stopping?

[ 1 Continuing Example 1.2.2: What if there are three cars in this neighborhood? Four? How does the probability change with the number of cars? At what number of users a stop light should be installed?

1.3  Elementary probability models

1.3.1  Consequences of axioms

Here are some useful formulas that are easy to check with the help of the Venn diagrams. For all events A, B
Pr
(AÈB) = Pr
(A)+Pr
(B)-Pr
(AÇB)
(4)
Pr
(A¢) = 1-Pr
(A).
(5)
If A Ì B then
Pr
(B\A) = Pr
(B)-Pr
(A).
(6)
If An are pairwise disjoint events, then
Pr
( ¥
È
n = 1 
An) = ¥
å
n = 1 
Pr
(An).
(7)

1.3.2  General discrete sample space

For a finite or countable set W Ì IN and a given summable sequence of non-negative numbers an put

Pr
(A) =

å
n Î A 
an


å
n Î W 
an
.
(8)
In probability theory it is customary to denote

pk = [(ak)/( ån Î W an)] and rewrite (8) as Pr(A) = ån Î A pn.

Formula (8) generalizes the uniform assignment (2), which corresponds to the choice of equal weights ak = 1. At the same time it is more flexible2 and applies also to infinite sample spaces. Example 4 Let W = IN and put pk = [1/( 2k)]. Then the probability that an odd integer is selected is Pr(Odd) = åj = 0¥ 2-2j-1 = 2/3. ()

Table list the most frequently encountered discrete probability assignments.

Name W Probabilities pk
Binomial {0,...,n} pk = (nk)pk(1-p)n-k
Poisson ZZ\scriptscriptstyle + pk = e-l[(lk)/ k!]
Geometric IN pk = p(1-p)k-1
Equally likely outcomes {x1,...,xk} pk = 1/k

Table 1.1: Probability assignments for discrete sample spaces.

Problem 3 For each of the choices of numbers pk in Table 1.1 verify that (8) indeed defines a probability measure.

The reasons behind the particular choices of the expressions for pk in Table 1.1 involve modeling.

1.3.3  General continuous sample space

There is no easily accessible general theory for infinite non-countable sample spaces. When W Ì IRk, the generalization of the geometric probability uses a non-negative density function f(x1,x2,...,xk)

Pr
(A) = C ó
õ


A 
f(x1,x2,...,xk) dx1 dx2... dxk
(9)
For one-dimensional case k = 1, examples of densities are collected in Table on page .

1.4  Simulating discrete events

The most natural way to verify whether a mathematical model reflects reality is to compare theoretically computed probabilities with the corresponding empirical frequencies. Unfortunately, each part of this procedure is often time consuming, expensive, inconvenient, or impossible. Computer simulations are used as a substitute for both: they provide numerical estimates of theoretical probabilities, and they are closer to the direct experiment.

The first step in a simulation is to decide how to generate ``randomness" within a deterministic algorithm in a computer program. Programming languages, and even calculators, provide a method for generating consecutive numbers from the interval (0,1) ``at random". For instance, BASIC instruction3 PRINT RND(1) returns different number at every use. We shall assume that the reader has access to a of generating uniform numbers from the unit interval (0,1). These are called pseudo-random numbers, since the program usually begin the same ``random" sequence at every run, unless special precautions4 are taken.

Once a pseudo-random number from the interval (0,1) is selected, an event that occurs with some known probability p can be simulated by verifying if { RAND(1) < p} occurs in the program. For instance, the number of heads in a toss of a 1,000 fair coins is simulated by the following BASIC program.

PROGRAM tosscoin.bas
'


'Simulating 1000 tosses of a fair coins

H = 0
FOR n = 1 TO 1000  'main loop
        IF RND(1) < .5 THEN H = H + 1
NEXT n
'print final message
PRINT "Got "; H; " heads this time"
END
'

Here is its output:Got 520 heads this time


A simple method for simulating an outcome on a six-face die is to take the integer part of a re-scaled uniformly selected number INT(1+6*RND(1)). Can you use this to write a simulation of a roll of 5 dice? Such simulations are often used to evaluate probabilities empirically as a substitute for a real empirical study.

[ 3 Write the simulation (as opposed to deterministic inspection of all sample points on page pageref) to estimate how often the event ``four of a kind" occurs in a roll of five dice.

[ 2 Run the (modified) coin tossing program in a loop and to answer the following questions.

Hint: Move the main part of TOSSCOIN.BAS into a SUB, or a FUNCTION. This way you can easily use it without cluttering your program with irrelevant details. (See a generic template below.)

More complicated objects are often of interest in simulations. For instance we may want to draw two cards from the deck of 52. One possible way to do it is to number the cards, and select two numbers a,b.

  1. Select the first card a=INT(RND(1)*52+1) as a random integer between 1 and 52.
  2. Select the second card b=INT(RND(1)*52+1) in the same way.
  3. Compare a,b

    1. If a = b then repeat step 2
    2. Otherwise, got two different cards a ¹ b at random

How efficient is this procedure? [ 4 How would you simulate on a computer a random permutation? A random subset?

1.4.1  Generic simulation template

The purpose of simulation is to investigate the unknown values of parameters of interest. In the initial exercises you may want to simulate the events that you know how to compute probabilities of. The purpose of such exercises is to develop intuition about reliability of simulations.

In more advanced exercises you may want to estimate probabilities that aren't known. In such cases it is always a good idea to run simulations of various lengths and compare the results. In this section we briefly discuss how such a simulation can be organized in a way that promotes multiple uses of the same program.

The key is organizing the programs carefully into manageable blocks of small size. Modern BASIc is a structural programming language. The generic program to study the effects of the length of simulation on its output can be written as follows

' PROGRAM Generic.bas
'Generic Simulator
'Size is simulation size varied from Min=100 to Max=10000
For Size 100 to 10000 Step 100
Simulate(Size, Result)
Print "Simulation size="; Size ; Öutput="; Result
Next Size
End

The actual simulation is performed by

SUB Simulate (SizeRequested, Result)
'Runs requested number of simulations and returns average
'Trial numbers consecutive simulations from 1 to SizeRequested
For Trial=1 to SizeRequested
SimulateOne(Score)
Result=Result+Score
Next Size
'Most simulations return averages of single trials
Result=Score/Size
End SUB

The actual modeling is performed in another SUB, which in the generic program we named SimulateOne. This SUB may be as simple as simulating a toss of a single coin

SUB SimulateONe(Outcome)
'simulate One occurrence, return numerical outcome
OutCome=0
if RND(1)<1/2 THEN Outcome=1
END SUB

Or it can be as complicated as we wish. The example below simulates a toss of five dice, and uses previously introduced function CountEq(a,b,c,d,e). four-of-a-kind.

SUB SimulateONe(Outcome)
'simulate One occurrence, return numerical outcome
d1=int(RND(1)*6+1)
d2=int(RND(1)*6+1)
d3=int(RND(1)*6+1)
d4=int(RND(1)*6+1)
d5=int(RND(1)*6+1)
IF CountEq(d1,d2,d3,d4,d5)=4 THEN Outcome=1
END SUB

1.4.2  Blind search

Elementary probability when coupled with a fast computer is one of the simplest effective optimization method. The method is the blind search - a search for the best answer at random6. Pure blind search is usually simple to run, and therefore fast to realize. It often finds answers that are good enough for practical purposes, or at least can serve as the preliminary estimates. Various ad hoc modifications increase accuracy and are usually easy to implement, too.

[ 3 Write a blind-search program to find the maximum of a function.

1.4.3  Application: Traveling salesman problem

The following program searches for the shortest way to pass through the first n cities7 in the USA in alphabetic ordering.

Planning such a tour is easy by hand for 3-4 cities. For longer tours some help is needed. To check the performance of the blind search, you may want to know what the usual algorithms involve. A greedy method starts with the shortest distance, and then keeps going to the closest city not yet visited. Another heuristic method is to to select a pair of closest cities and accept the best (shortest) connection among those that do not complete a shorter loop, and do not introduce a spurious third connection to a city. Eventually, the disjoint pieces will form a path that often can be further improved upon inspection.

The program is longer but not at all sophisticated - it just selects paths at random. Notice that a solution to Exercise 1.4 - how to generate a random permutation - is given in one of the subprograms (which one?). The method for the latter is simple-minded - the algorithm attempts to place consecutive numbers at random spots until an empty spot is selected.

The following is the main part of the program. You can use it as a template in designing your own version of Blind search programs. The full code with SUBs is online in RANDTOUR.BAS.

'****
CLS
'**** get number of cities (no choice which) from user
LOCATE 2, 1
INPUT ; "Shortest distance between how many cities?", n

'*** initialize program
CLS
nMax = 19 ' current data size. Make sure not exceeded!
IF n > nMax THEN n = nMax

'declare arrays
DIM SHARED dist(nMax, nMax) ' matrix of distances
DIM SHARED city(nMax) AS STRING
DIM P(n), BestP(n)

'read distances
CALL AssignDistances(nMax)

'initial permutation
FOR j = 1 TO n
        P(j) = j
        BestP(j) = j
NEXT j
'initial length of trip
MinLen = PathLen(P())

'*** main loop
DO  'infinite loop till user stops
'count trials
no = no + 1
        '*** interacting with user
        'check if user pressed key to stop
        k$ = INKEY$
        IF k$ > "" THEN EXIT DO  'exit infinite loop
        'display currrent progress
        display (no)
        '*** get any path
        CALL GetPermutation(P())
        x = PathLen(P())
        IF x < MinLen THEN
                'Better path found, so memorize and display
                Dlen = MinLen - x
                MinLen = x
                '*Memorize best order and print 
                FOR j = 1 TO n
                        BestP(j) = P(j)
                        PRINT city(P(j)); "->";
                NEXT j
                'Finish printing
                PRINT city(P(1))
                PRINT "Best so far: "; MinLen
                PRINT "Progress rate "; Dlen / (no - Slen); " miles per trial"
                Slen = no
        END IF
LOOP
'Print final message
CLS
PRINT "ALPHABETIC TOUR OF FIRST "; n; " CITIES in the USA"
PRINT "Blind Search Recommended Route found in "; Slen; "-th search"
FOR j = 1 TO n - 1
        PRINT city(BestP(j)); "-->";
NEXT j
PRINT city(BestP(n)); "-->"; city(BestP(1))
PRINT "Total distance: "; MinLen
LOCATE 22, 40
PRINT "(Distances subject to change)"
END
The program runs in the infinite loop until it is stopped by the user. Once stopped, the program prints the best route it found. For larger sets of cities we may have hard time deciding when to stop it. Here is a sample output (from the improved version, as marked in the actual code):


When you run this program on larger sets of cities, you will notice that the program is not fast. One possible improvement in the design of this program is to modify the randomization to be less likely to pick long paths. For instance, you can attempt to modify paths that are known to be short, or weight the modifications by lengths of resulting paths. Such methods are actually in use in image restoration problems (simulated annealing), see page .

1.4.4  Improving blind search

A bit of experimenting with various ``pure" blind search programs should convince you that

It is possible to improve on the last aspect without complicating the program much. The idea is to make random modifications of the currently best found value. For example, in a one-dimensional maximization of a function f(x), we would do the following steps

  1. Pick an initial ``best-so-far" point x0 and compute initial value y0 = f(x0)
  2. Select at random x1 in the ``neighborhood" of x0 and compute y1 = f(x1)
  3. Compare y0, y1.

    1. If y1 < y2 then repeat Step 2.
    2. If y1 ³ y0 then make x1 the new ``best-so-far" y0: = y1,x0: = x1. Then repeat Step 2.

  4. Stop the program at user request, or when no changes to y0 occur for prolonged number of attempts to improve it.

We want to allow for the chance of checking points far away from the ``best-so-far" answer. But we don't want this to happen too often, because x0 might be rather close to the optimum. The tradeoffs are that the program will tend to follow ``direct path" to the maximum, but the danger is that it will get stuck longer in local maxima.

Improved blind search with time/state dependent randomization is actually implemented within RANDTOUR. It is commented out in the version on the disk, so that it isn't operational. To make it active, uncomment the call to ImproveBest as a replacement for GetPermutation.

1.4.5  Random permutations

Program RANDTOUR.BAS selects permutations at random only in its ``simplest" variant. Here are a few examples of problems that require selecting random permutations.

The ``naive" selection of random permutation wastes many random numbers. Here is an algorithm that conserves resources better. The basic idea is to select a number from the beginning of the list of all numbers, and move it down to the end. The randomly re-arranged numbers are P(1),P(2),...,P(n)

SUB GetPermutationFast(P())
'Put random integers into P()
n=UBOUND(P) 'how many entries
'this loop can be omitted is we are sure that P(j) list all the numbers we want
FOR j=1 TO n
P(j)=j
next j
for j=1 to UBOUND(P) 'not n as n changes in the loop
k=INT(RND(1)*n+1)


SWAP P(n), P(k)
n=n-1
next j

[ 5 There are many incompatible measures of ``quality" of an algorithm. One of the ``objective" criteria is the number of ``If" verifications. Another ``objective criterion" might be the number of calls to a function. A ``subjective" criterion, which depends on the hardware and circumstances, is timing.

Does SUB GetPermutationFast deserve adverb Fast in its name?

[ 4 The Subset-Sum Problem is stated as follows:

Let S be a set of positive integers and let t be a positive integer. Decide of there is a subset S¢ Ì S such that the sum of integers in S¢ is exactly t.

The associated optimization problem is to find a subset S¢ Ì S whose elements' sum is the largest but doesn't exceed t. This optimization problem is NP-complete, ie it isn't known if there is a polynomial time algorithm (polynomial in the size of S) to find S¢.

Investigate how the blind search will do on sets S selected at random, and on sets S constructed in more regular fashion like arithmetic progression S = {a, 2a, 3a, ...}, geometric progression S = a,a2, a3,....

1.5  Conditional probability

In modeling more complicated phenomena we may want to use different probabilities under different circumstances. For instance, in a modified blind search for the minimum of a non-negative function, the randomization strategy might be different when we already made some progress, and it might be different when we are ``stuck" in a non-optimal location. Thus we may want to consider probabilities of the same event A (say, hitting a maximum) under different conditions B.

To formalize this idea, suppose B is an event such that Pr(B) ¹ 0. The last condition merely says that B is an event that does have some chance of occurring. Conditional probability of event A given event B is denoted by Pr(A|B). It is defined as

Conditional probability satisfies the axioms of probability, and Pr(A|B) = 0 if A,B are disjoint. In particular, Pr(A¢|B) = 1-Pr(A|B), Pr(B|B) = 1.

The easiest way to find Pr(A|B) by simulations is to repeatedly simulate the complete experiment, discarding all the outcomes except the ones resulting in B.

[ 6 Suppose we toss repeatedly a fair coin, and the ``success" is to get heads.

Use computer simulations to find the conditional probability that the very first trial was successful, if 10 consecutive (and independent) trials resulted in 8 successes.

Try to answer the same question under the condition that 50 independent trials resulted in 40 successes.

You should notice that it takes forever to simulate events that happen rarely. Section indicates one possible way out of this difficulty.

1.5.1  Properties of conditional probability

Conditional probability is used in modeling. Often Pr(A|B) can be assigned by ``intuitive" considerations. It can then be used to compute other probabilities. The simplest example is Pr(AÇB) = Pr(B)Pr(A|B), which is a direct consequence of the definition.

Example 5 Suppose we have a deck of 52 cards numbered 1 through 52. Since it isn't obvious how to simulate selecting 5 cards without replacement, we may want to select them with replacement instead. Let A denote the event that all five cards are different. What is the probability of A?

We may perform the experiment sequentially, drawing one card at a time. Let Ak denote the event that the k consecutive draws resulted in different cards. Then A = A5 Ì A4 Ì ... Ì A1. Moreover, Pr(A1) = 1.

Clearly, Pr(A2|A1) = [51/ 52], so Pr(A2) = Pr(A2ÇA1) = Pr(A2|A1)Pr(A1) = [51/ 52]. Similarly, Pr(A3) = Pr(A3ÇA2) = Pr(A3|A2)Pr(A2) = [50/ 52][51/ 52]. Continuing this we get Pr(A5) = [51 50 49 48/( 524)] » 0.82.

The following identities are also of interest.

  1. Path Probability: Pr(Çk = 1n Ak) = Õk = 1nPr(Ak|Çj = 1k-1 Aj)
  2. Bayes theorem: Pr(A|B) = [(Pr(B|A)Pr(A))/Pr(B)]
  3. Total probability formula: If {Bn} are pairwise disjoint and exhaustive, ie. Pr(AiÇBj) = Æ for i ¹ j and ÈBn = W, then
    Pr
    (A) =
    å
    n 
    P(A|Bn)Pr
    (Bn).
    (10)

[ 7 What is the probability that in a class of 30 students no matching birthdays occur?

Example 6 A lake has 200 white fish and a 100 black fish, and a nearby pond contains 20 black fish and 10 white ones. No other fish live there.

A fish is selected at random from the lake and moved to the pond. Then a fish is selected from the pond and moved back to the lake. What is the probability that all fish in the pond are black?

1.5.2  Sequential experiments

Often the main experiment consist of a sequence of sub-experiments, each depending on the outcome of the previous one. If n such sub-experiments are chained, then the full experiment results in a chain of events, or a path \cal P = A1ÇA2Ç...ÇAn. If we assume that k-th experiment depends on the outcome of the k-1-th experiment only, then Pr(Ak|Ak-1Ç... ÇA1) = Pr(Ak|Ak-1).

Denoting by \cal P the generic path A1ÇA2Ç...ÇAn, and by \cal P(k) = Ak we have the following path integral formula for the probability of an event \cal F specifying the outcome of the complete experiment, and consisting of many paths \cal P.

Pr
(\cal F) =
å
\cal P Î \cal F 
Pr
(\cal P) =
å
\cal P Î F 
|\cal P |
Õ
k = 0 
Pr
(\cal P(k+1)|\cal P(k)).
(11)

Example 7 Suppose that the double transfer operation from Example 1.5.1 was repeated twice. That is, a random selection was done four times. What is the probability that all fish in the pond are black?

[ 8 Check by simulation how the proportion of black fish in the pond changes when the random transfers from Example 1.5.1 are performed repeatedly for a long time.

1.6  Independent events

This section introduces the main modeling concept behind the entries in the Table 1.1.

Two events A,B are independent, if the conditional probability is the same as unconditional, Pr(A|B) = Pr(A). This is stated in multiplicative form which exhibits symmetry and includes trivial events8 Definition 1 Events A, B are independent if Pr(AÇB) = Pr(A)Pr(B).

Independence captures the intuition of non-interaction, and lack of information. In modeling it is often assumed rather than verified. For instance, we shall assume that the events generated by consecutive outputs of the random generator are independent. We also assume that tosses of a coin (fair, or not!) are independent.

Beginners sometimes confuse disjoint versus independent events. Exclusive (ie. disjoint) events capture the intuition of non-compatible outcomes. Not compatible outcomes cannot happen at the same time. This is not the same as independent outcomes. If A, B are disjoint and you know that A occurred, then you do know a lot about B. Namely you know that B cannot occur. Thus there is an interaction between A and B. Knowing whether A occurred influences chances of B, which is not possible under independence.

Independence (or, more properly, mutual stochastic independence) of families of events is defined by requesting a much larger number of multiplicative conditions. The reason behind is Theorem , which provides a very convenient tool. Definition 2 Events A1, A2, ..., An are independent, if Pr(Çj Î J Aj) = Õj Î J Pr(Aj) for all finite subsets J Ì IN.

Example 8 A coin is tossed repeatedly. Find the probability that heads appears for the first time on the fourth toss.

Problem 4 SUB GetPermutation from the program RANDTOUR.BAS selects numbers between 1 and n at random until it finds a number not yet on the list. Then it ads the number to its list, and repeats the process.

  1. What is the probability that the second number added to the list required more than k attempts?
  2. What is the probability that the last number added to the list required more than k attempts?

Another important concept is the conditional independence. For example, many events in the past and in the future are dependent. But in many mathematical models, past and future are independent conditionally on the present situation. In such a model future depends on past only through present events!

Definition 3 Let C be a non-trivial event. Events A,B are C-conditionally independent if Pr(AÇB|C) = Pr(A|C)Pr(B|C).

1.6.1  Random variables

The general concept of probability space uses ``abstract" sets to represent outcomes of an experiment. But many examples considered so far, represented the outcomes in numerical terms.

Random variables are introduced for convenient description of experiments with numerical outcomes. (The other option is to select W Ì IR, or W Ì IRd.) If we want to run computer simulations, we need to represent even non-numerical experiments (like tossing coins) in numerical terms anyhow. Thus the language of random variables becomes the natural extension of elementary probability theory, expressing many of the same concepts in a little different language.

A random variable is the numerical quantity assigned to every outcome of the experiment. In mathematical terms, random variable is a function X:W® IR with the property that sets {w Î W: X(w) £ a} are events in \cal M for all a Î IR. Recall that the last conditions means that we may talk about probabilities of events {w Î W: X(w) £ a}.

Probabilities for a one-dimensional r. v. are determined by the cumulative distribution function

F(x) = Pr
(X £ x)
(12)
The corresponding tail function R(x) = 1-F(x) = Pr(X > x) is sometimes called the reliability9 function.

Cumulative distribution function can be used to express probabilities of intervals Pr(a < X £ b) = F(b)-F(a). Since probability is continuous, (1) we can also compute Pr(X = a) = limb® a+Pr(a < X £ b) = F(a+)-F(a). The right hand side limit F(a+) exists, as F is a non-decreasing function.

Example 9 Suppose F(x) = (1-e-x)Ú0. Then Pr(|X-2| < 1) = e-1-e-3.

In probability theory we are concerned with probabilities. Random variables that have the same probabilities are therefore considered equivalent. We write X @ Y to denote the equality of distributions, ie. Pr(X Î U) = Pr(Y Î U) for all Borel sets U Ì IR (say, all intervals U).

Vector valued r. v. are measurable the functions W® IRd. In the vector case we also refer to X = (X1,... Xd) as the d-variate, or multivariate, random variable.

We will use the ordinary notation for sums and inequalities between random variables. There is however a word of caution. In probability theory, equalities and inequalities between random variables are interpreted almost surely. For instance X £ Y+1 means Pr(X £ Y+1) = 1; the latter is a shortcut that we use for the expression Pr({w Î W: X(w) £ Y(w)+1}) = 1. Problem 5 Show that F(x) = Pr(X £ x) is right continuous: limx®a+F(x) = F(a).

1.6.2  Binomial trials

The statistical analysis of repeated experiments is based on the following. Theorem 1 Suppose that for j Î IN event Bj is either Sj or Sj¢, where events {Sj} are independent. Then {Bj} are independent.

A binomial experiment, called also binomial trials, consists of the sequence of simpler identical experiments that have two possible outcomes each. The independent events Sj represent successes in consecutive experiments. We assume that we have an infinite sequence of events S1,S2,... Sk,... that are independent and have the same probability p = Pr(Sj). We denote by Fj = Sj¢ the failure in the j-th experiment, and put q = 1-p.

Two important random variables are associated with the binomial experiment are the number X of successes in n trials, and the number T of trials until first success. Example 10 The probability that number X of successes in n trials is k is Pr(X = k) = (kn)pkqn-k. (Here k = 0,..., n.)

Example 11 The probability of more than n attempts needed for the first success is Pr(T > n) = qn. The probability that first success occurs at the n-th trial is Pr(T = n) = pqn-1 (geometric).

Example 12 Geometric distribution has lack of memory property: Pr(T > n+k|T > n) = Pr(T > k).

Random variables are often described solely in terms of cumulative distribution function F(x), or formulas for Pr(X = x) without reference to the underlying probability space W. For instance, the number of minutes T that we spend waiting for a bird to come to the bird feeder at the back of my house is random, and I believe Pr(T = n) = pqn-1 because Pr(T > n+k|T > n) = Pr(T > k).


It is intuitively obvious that on average we get np successes in n trials. It is perhaps less obvious10 that on average we need 1/p trials to get the first success.

[ 9 Write a simulation program to verify the claims about the averages for several values of p.

Example 13 The probability that in 2n tosses of a fair coin, half are heads is [(2n)!/( 4n(n!)2)] » [1/( Ö{ pn})]® 0 as n®¥. The latter isn't easy to prove, but the computer printout is quite convincing, see Table (note that [1/( Ö{p})] » 0.5642).

2nPr(X = n)Frequency in 1000 trials ÖnPr(X = n)
100 0.07959 0.08200 0.56278
300 0.04603 0.06100 0.56372
500 0.03566 0.03700 0.56391
700 0.03015 0.02200 0.56399

Table 1.2: Probabilities Pr(X = n) in 2n Binomial trials.

1.7  Further notes about simulations

By now you should have written some simple simulation programs, and printed out the results. It is perhaps a good moment to pause and consider what are the aspects of simulations that we are interested in.

In general, we would like to get answers to questions that we don't know how to answer in any other way. But before we do that, we should develop some intuition on the cases that can check the answers. Therefore we begin with simulation of probabilities or averages that are known.

1.8  Questions

Problem 6 [Exercise] A family has two children, and one of them is a boy. What is the probability that they have two boys? (If you think this is too hard mathematics, do it as a computer assignment!)ANS: 1/3

Problem 7 [Exercise] A die is thrown until an ace turns up.

Assuming the ace doesn't turn up on the first throw, what is the probability that more than three throws (ie. at least four) will be necessary? ANS: 197/198

Suppose that an ace turns up on an even throw. What is the probability that it turned up on the second throw?

(If you think this is too hard mathematics, do it as a computer assignment!)

[ 5 A deck of 52 cards has 4 suits and 13 values per suit.

  1. Write a program simulating the hand of 5 cards.
  2. Use your program to answer the following questions:

    1. How often does a pair occur?
    2. How often does a two-pair occur?
    3. How often does a three of a kind (three of same value and two different) occur?
    4. How often does a four of a kind occur?
    5. How often does a full house (2+3) occur?
    6. How often does a straight (five cards in a row, not all same suit) occur?
    7. How often does a flush (five cards of one suit, not in order) occur?
    8. How often does a straight flush (five cards in a row all same suit) occur?

    If you think this is too difficult on a computer, compute the probabilities by hand.

[ 10 A math teacher in a certain school likes to give multiple choice tests, grade them as either right, or wrong, and then lets the students to go over the test and correct the ones they got wrong. This gives them two chances to get a problem right, and the chance of getting a question right increases even if the student just guesses the answers. Suppose a student simply guessed the first time, got the corrections, and guessed differently the second time on the wrong answers. How much his grade improves?

[ 6 This is the expanded version of Exercise 1.5.1. In a group of n people, how often at least two have the same birthday?

  1. Find the formula for the probability p(n) assuming 365 days per year, and equally likely birthdays.
  2. Compute the probabilities for n = 20, 30, 40, 50
  3. Write a simulation program, and verify if the simulation agrees with the theoretical answers.
  4. Modify the simulation program to allow for not equal birthdays. Assume January, February, March are less likely then the other days of the year. Change the parameters, and verify how the probabilities p(n) change as you depart from the uniform probabilities. If the change of p(n) is of the magnitude comparable to the simulation accuracy, clearly it is irrelevant.
  5. A randomly selected person has chance 1/4 to be born on a leap year. How does this affect the answers?

Chapter 2
Random variables (continued)

Tree species are not distributed at random but are associated with special habitats.
The Auborn Society Field Guide to North American Trees.

Intuitively, random variables are numerical quantities measured in an experiment. The concept11 is the core of probability theory; it leads outside of elementary probability and it touches advanced concepts of integration, function transforms and weak limits.

For convenience random variables are split into three groups: continuous, discrete, and the rest.

2.1  Discrete r. v.

Definition 4 X is a discrete r. v. if X(W) is countable.

The definition says that X is a discrete r. v. if there is a finite, or countable set \cal V of numbers (values) of X such that pv = Pr(X = v) > 0 and åv Î \cal V pv = 1. The function f(v) = pv is then called the probability mass function of X. For completeness, the domain of the probability mass function is often extended to all x Î IR (or to x Î IRd in the multivariate case) by f(x) = Pr(X = x).

It is easy to see that if f is a function which satisfies two natural conditions:

f(x) ³ 0
(13)

å
x Î IR 
f(x) = 1
(14)
(15)
then there is a probability space with a random variable X such that f is its probability mass function. In modeling random phenomena we can therefore avoid the difficulties of designing appropriate sample spaces, and pick directly relevant densities. The question, if a density does describe the actual outcomes of experiment is to some extend the question of statistics. Properties of various distributions, like lack-of-memory come also handy when selecting appropriate density function.

For discrete r. v. the cumulative distribution function (12) plays lesser role. It is a discontinuous function given by the expression

F(x) =
å
v £ x 
pv.
(16)
This expression does show up in the ``generic simulation method in Section .

2.1.1  Examples of discrete r. v.

Table list the most frequently encountered discrete distributions.

Name Values Probabilities Symbol Parameters
Binomial 0,...,n Pr(X = k) = (nk)pk(1-p)n-kBin(n,p)0 £ p £ 1, n Î IN
Poisson ZZ\scriptscriptstyle + Pr(X = k) = e-l[(lk)/ k!] Poiss(l)l > 0
Geometric ZZ\scriptscriptstyle + Pr(X = k) = p(1-p)k-1 0 £ p £ 1
Uniform {x1,...,xk} Pr(X = xj) = 1/k k Î IN, x1,...,xk Î IR
Hypergeometric

Negative Binomial

Table 2.1: Discrete random variables.

Example 14 Let the random variable X denote the number of heads in three tosses of a fair coin.

Example 15 Let the random variable X denote the score of a randomly selected student on the final exam.

Problem 8 Let N be Poiss(l), and assume N balls are placed at random into n boxes. Find the probability that exactly m of the boxes are empty. ANS: (mn)e-lm/n(1-el/n)n-m.

2.1.2  Simulations of discrete r. v.

Discrete random variables with finite number of values are simulated by assigning values according to the ranges taken by the (pseudo)random uniform random variable U from the random number generator, U=Rand(1). To decide which value of X should be generated, take a partition {0 = a0 £ a1 £ ... £ an-1 £ an... £ 1} of interval (0,1). This means that we simulate X = f(U) using a piecewise constant function f on the interval (0,1). If f(x) = vk for x Î (ak,ak+1), then Pr(X = vk) = ak+1-ak. Therefore we choose a1 = 0, a2 = p1, ...,ak+1 = p1+...+pk. Notice that ak = Pr(X £ vk) = F(vk).

Other methods are also available for the distributions from Table 2.1. For example, program TOSSCOIN.BAS on page pageref simulates binomial distribution Bin(n=100, p=1/2).

The following exercise provides tools to run more involved simulations. [ 11 Write functions SimOneBin(n,p) and SimOneGeom(p) that will simulate a single occurrence of the Bin(n,p) r. v. and the geometric r. v. The sample usage: PRINT SimOneBin(15,.5) should simulate the number of heads in tossing 15 fair coins.

Also write function SimGeneric(p()) which simulates generic r. v. with values {0,1,...,n} and prescribed probabilities pk = p(k).

2.2  Continuous r. v.

Continuous random variables have uncountable sets of values, and the probability of each of them is zero, Pr(X = x) = 0 for all x Î IR.

Since probability satisfies continuity axiom (1), Pr(X Î (a,a+h))® 0 as h® 0 for all a. The main interest in continuous case is that the rate of convergence to 0 is also known, Pr(X Î (a,a+h)) » f(a) h +o(h). Function f(x) in this expansion is called the density function.

In terms of the cumulative distribution function 12), the probability is Pr(X Î (a,a+h)) = F(a+h)-F(a), and thus f(a) = limh® 0[(F(a+h)-F(a))/ h] = F¢(a) is the derivative of the cumulative distribution function F. Therefore when the Fundamental Theorem of Calculus can be invoked (say, when f is piecewise continuous)

F(x) = ó
õ
x

-¥ 
f(u) du.
(17)

Definition 5 Random variable X is (absolutely) continuous, if there is a function f such that Pr(X Î U) = òU f(x)dx. Function f is called the probability density function of X.

It is known that if f is a function which satisfies two natural conditions:

f(x) ³ 0
(18)
ó
õ


IR 
f(x) dx = 1
(19)
(20)
then there is a probability space with a random variable X such that f is its density. This is in complete analogy with the discrete case. In modeling random phenomena we can therefore avoid the difficulties of designing appropriate sample spaces, and pick directly relevant densities. The question, if a density does describe the actual outcomes of experiment is to some extend the question of statistics. Properties of various distributions come also handy when selecting appropriate density function.

It is convenient to use the heuristic probability density function in continuous case corresponds to probability mass function in discrete case, and that expressions that involve in discrete case sums are replaced by integrals, compare (17) and (16).

2.2.1  Examples of continuous r. v.

The following table lists more often encountered densities. Figures and give the graphs of the normal and exponential densities.

Name Range Density Symbol Parameters
Normal -¥ < x < ¥ f(x) = [1/(Ö{2p}s)]exp-[((x-m)2)/(2s2)]N(m,s)m Î IR,s > 0
Exponential x > 0 f(x) = le-lx l > 0
Uniforma < x < b f(x) = [1/( b-a)]U(a,b) a < b real
Gamma x > 0 f(x) = Cxa-1e-x/b Gamma(a,b) a > 0,b > 0,C = [1/( baG(a))]
Weibull

Table 2.2: Continuous random variables.

Missing graph

This graph was generated with java. The browser on your computer does not run java. Please use java-enabled browser, or another computer.

Figure 2.1: Graph of the standard normal N(0,1) density f(x) = (2p)-1/2 e-x2/2.

Missing graph

This graph was generated with java. The browser on your computer does not run java. Please use java-enabled browser, or another computer.

Figure 2.2: Graph of the exponential density f(x) = e-x as the function of x > 0.

Example 16 A dart is thrown at a circular dart board of radius 6. Let X denote the distance of the dart from the center. Assuming uniform assignment of probability (3), the density of X is

f(x) = {
[x/ 18]
if 0 £ x £ 6
0
otherwise

Problem 9 Referring to Exercise 1.2.2, let X,Y denote the arrival times of the two drivers at the intersection. Find the density of the time lapse |X-Y| between their arrivals.

2.2.2  Histograms

Simulations and experiments do not give direct access to the density, but often a histogram will approximate it reasonably well. Histograms are graphical representations of empirical data. A sample histogram is drawn on the side of the square in Figure on page .

To create a useful histogram, split the range into the finite number of intervals. Then graph over each interval the rectangle with the area equal to the observed frequency. The number, and positioning of intervals depends on the amount of data available, and personal preference.

2.2.3  Simulations of continuous r. v.

The generic method for simulating a continuous random variable is similar to the method used in the discrete case. Namely, we take X = f(U) with the suitable function f.

To find f assume it is increasing and thus invertible with inverse g. Then Pr(f(U) < x) = Pr(U < g(x)) = g(x). This completes the generic prescription: take as f(x) the inverse12 of the cumulative distribution functions F(x) = Pr(X £ x).

This method of simulation is quite effective if the inverse of F can be found analytically. It becomes slow when the inverse (or, worse still, cumulative distribution function F itself) is computed by numerical procedures. Since this is the case of the normal distribution, special methods are used to simulate the normal distribution.

Example 17 To simulate X which is exponential with parameter l, use X = -[1/( l)]lnU.

2.3  Expected values

Expected values are perhaps the single most important numerical characterization of a random phenomenon. Definition 6 For discrete random variable X the expected value EX is given by EX = åv vPr(X = v), provided the series converges.

Expected value captures the intuition of the average of a random quantity. It is also this intuition that leads to estimating the expected value by averaging the outcomes of simulations. Example 18 If X has values x1,...,xn with equal probability, then EX = [`x] is the arithmetic mean of x1,...,xn.

Simulationsare oten used to get answers that are too difficult to find analytically. The following exercise can be answered by simulation if you figure out how to shuffle cards from a deck at random (Exercise 1.4). [ 12 What is the expected number of cards which must be turned over in order to see each of one suit.

Example 19 If X takes two values a < b and Pr (X = a) = p, then EX = pa+(1-p)b is the number in the closed interval [a,b].

Definition 7 For continuous random variable X the expected value EX is given by EX = òIRxf(x) dx, provided the integral converges.

Name Probability distribution EX
Normal f(x) = [1/(Ö{2p}s)]exp-[((x-m)2)/( 2s2)]m
Exponential f(x) = le-lx [1/( l)] > 0
Uniform f(x) = [1/( b-a)] 1/2(a+b) real
Gamma
Weibull
Binomial Pr(X = k) = (nk)pk(1-p)n-knp
Poisson Pr(X = k) = e-l[(lk)/ k!]l
Geometric Pr(X = k) = p(1-p)k-1 1/p
Hypergeometric
Negative Binomial

Table 2.3: Expected values of some random variables.

Problem 10 Compute EX for the entries in Table 2.3.

[ 13 Simulate EX for the entries in Table 2.3 (except normal) for different values of parameters involved.

Problem 11 [Exercise] Referring to Example 2.2.1, what is the average distance of a dart from the center?ANS: 4.

If you chose to do the simulations, you should perhaps notice that it is rather difficult to decide how many simulations to take in order to achieve the desired accuracy. Typically, you need to increase the size of a simulation four times to half the error.

Another point to keep in mind is that simulations do return numbers. But from numbers alone it is difficlut to see how parameters of the model change it, and this is a more interesting question.

2.3.1  Tail integration formulas

The following tail integration formula is of considerable convenience in theoretical analysis. Theorem 2 For non-negative random variables, both in the discrete, and in the continuous case
EX = ó
õ
¥

0 
Pr
(X > t) dt
(21)
(The expected value is finite if and only if the corresponding integral converges.)

Proof. To simplify the proof and expose the main idea more clearly, consider the discrete case with a finite number of values. Similarly, to avoid technicalities we consider continuous case with bounded range only.

Discrete case:

EX = x1p1+x2p2+...+xnpn = x1(p1+p2+...+pn-( p2+...+pn))+x2( p2+...+pn-( p3+...+pn))+...+xn-1( pn-1+pn -pn)+xnpn = x1( p1+p2+...+pn)+(x2-x1)( p2+...+pn-(p3+...+pn))+...+( xn-1-xn-2)( pn-1+pn)+(xn-xn-1)pn The latter is ò0xn Pr(X > t)  dt.

Continuous case: Let f denote the density and F be the cumulative distribution function. Integrating by parts EX = ò0b x f(x)  dx = b-ò0bF(x) dx = ò0b(1-F(x) dx. [¯]
If X is discrete integer valued X Î {0,1,...}, then (21) can be written as

EX = ¥
å
n = 0 
Pr
(X > n).
(22)

It is natural to define EX for more general r. v. in the same way through formula (21). Write X = X+-X- to decompose X into its non-negative, and non-positive parts, and then define EX = ò0¥ P(X > t) dt-ò0¥ P(X < -t) dt. Clearly, if one of the integrals diverges, EX is not defined.

Example 20 Suppose X is exponential with the density f(x) = e-x. Let Y be X truncated at level 3. That is

Y = {
X
if X £ 3
3
if X ³ 3

Clearly, Y is not continuous, as Pr(Y = 3) = Pr(X > 3) = e-3 > 0. On the other hand, Y is not discrete as it takes uncountable number of values; in fact all the numbers between 0 and 3 are possible. The definitions of the expected value we gave do not apply, but (21) can be used to show that E(Y) = 1-e-3.

Indeed, Pr(Y > t) = Pr(X > t) = e-t for 0 < t < 3, and Pr(Y > t) = 0 for t > 3.

Example 21 To determine the mean of the geometric distribution we can either compute the sum pån = 1¥ n (1-p)n-1, or use (21) and find the easier sum ån = 0¥ (1-p)n.

2.3.2  Chebyshev-Markov inequality

The following inequality is known as Markov's, or Chebyshev's inequality. Despite its simplicity it has numerous non-trivial applications, see eg. Theorem .

[ 1 If U ³ 0 then

P(U > t) £ EU
t
(23)

Indeed, by (21) we have EU = ò0¥ Pr(U > x) dx ³ ò0tPr(U > x) dx ³ tP(|X| > t).

Problem 12 Suppose U is uniform U(0,1). Then Pr(U > t) £ [1/ 2t]. This means that 1-t < [1/ 2t].

  1. Is the above inequality ``sharp"? (Graph both curves).
  2. Is (21) sharp? That is, given t > 0, is there an X > 0 such that equality occurs?

2.4  Expected values and simulations

Expected value EX is approximated without much difficulty13 on a computer by averaging large number of independent trials.

Example 22 To find the average of the uniform U(0,1) distribution, take 1/n(U1+...+Un).

This is the basis of Monte-Carlo methods , which is a family of related probabilistic methods for computing the integrals. To find òab f(x) dx we simulate Xj = f(a+bUj). The average 1/nåj = 1nXj approximates [1/( b-a)]òab f(u) du for large n.

The variance of the n-th approximation is of the order n-1/2, which is worse than the trapezoidal rule for smooth functions. In return the approximation is insensitive to the smoothness of the integrands, and also to the dimension of the integral. Monte Carlo methods can be used effectively for multiple integrals over irregular domains.

[ 14 Use Monte Carlo method to approximate p = ò-112[Ö(1-x2)]  dx. (You may want to compare the output with numerical procedures described in Section .)

Another method of similar nature is to pick points (X,Y) at random from the rectangle containing the graph of f and check if Y < f(X) holds. The proportion of ``successes" approximates the proportion of the area under the graph of f.

[ 15 Approximate p/4 by selecting points (X,Y) at random from the unit square, and checking if X2+Y2 < 1.

The following sample program computes numerically double integral òòUcos(10x+20y) dxdy over a circle x2+y2 = 1. The only conceptual difficulty as compared to single integrals is how to select points at random from the unit disk. This is done by picking points from a bigger square and discarding those that didn't make it. Can you do this integral analytically? Or by another numerical procedure?

PROGRAM dblint.bas
'

'declarations
DECLARE FUNCTION Integrand! (X!, Y!)
DECLARE FUNCTION InDomain! (X!, Y!)

CONST True = -1

' simulation loop
NumTrials = 10000
FOR j = 1 TO NumTrials
        'select random points from the square [-1,1]x[-1,1]
        X = 2 * RND(1) - 1
        Y = 2 * RND(1) - 1
        'check if this is in the domain
        IF InDomain(X, Y) THEN
                NumTested = NumTested + 1
                Sum = Sum + Integrand(X, Y)
                Var = Var + Integrand(X, Y) ^ 2
        END IF
NEXT j


'Print the answer
PRINT "Examined "; NumTested; " random points"

IF NumTested = 0 THEN END 'nothing found
N = NumTested
PRINT "The integral is approximately "; Sum / N
PRINT "With 95% confidence  the error is less than "; 1.96 * SQR(Var / N - (Sum / N) ^ 2) / SQR(N)

END

FUNCTION InDomain (X, Y)
'This function checks if $x,y$ is in the integration domain
'The definition of the domain can be easily modified here, including
'more complicated domains
IF X ^ 2 + Y ^ 2 < 1 THEN InDomain = True
END FUNCTION

FUNCTION Integrand (X, Y)
'This is the function to be integrated

Integrand = COS(10 * X + 20 * Y)
'
END FUNCTION

It is important to have some idea about how accurate the answer is. The program uses the to estimate the magnitude of the error. A less sharp (and thus safer to use!) error estimate can be obtained from (23)

[ 7 There are two natural choices to estimate by simulation events of small probability. We can pick a large sample size n, run the simulation experiment and hopefully get several data points. The outcome X of such an experiment is a binomial random variable with unknown probability p of success, and we would estimate p » X/n. The trouble is that if we don't know how small the chances are, we might get none, estimating probability to be zero. Or we could run the experiment until we get the prescribed number of successes. The observation would then consist of a set of geometric random variables T1,..., Tk with unknown mean ET = 1/p. We could then estimate p = 1/ET by taking the inverse of the arithmetic mean of Tj. In this approach we are guaranteed to get some observations, as long as p ¹ 0.

The question is Which of the methods would you recommend to use? (Of course, you would recommend a better method, but what the word ``better" might mean here?)

2.5  Joint distributions

Often an experiment involves measuring two or more random numbers, say X and Y. The fact that we know the distribution of X, and the distribution of Y separately doesn't determine probabilities of events that involve both X and Y simultaneously. Example 23 Suppose

Pr
(X = 1,Y = 1) = 1
4
+e
(24)
Pr
(X = -1,Y = -1) = 1
4
+e
Pr
(X = -1,Y = 1) = 1
4
-e
Pr
(X = 1,Y = -1) = 1
4
-e
Then Pr(X = 1) = Pr(Y = 1) = 1/2 regardless of the value of e. On the other hand, Pr(X = Y) = 1/2+2e clearly depends on the value of e.

It is clear that if X is discrete, and Y is discrete, then (X,Y) is an IR2 valued discrete r. v. That is, the values of the pair are countable. Probabilities Pr(X = x, Y = y) are called the joint distribution. Corresponding Pr(X = x) and Pr(Y = y) are the so called marginals. Example 2.5 points out that marginals do not determine joint probabilities uniquely. But if we know the joint probabilities then we can compute the marginals, eg Pr(X = x) = åyPr(X = x, Y = y).

In contrast to the discrete case, joint continuity can not be recognized from the continuity of the components, and requires full definition.

Definition 8 Let X = (X1,...,Xn). Random variables X1,...,Xn are jointly (absolutely) continuous, if there is a function f such that

Pr
(X Î U) = ó
õ
... ó
õ


U 
f(x1,... xn) dx1 ... dxn
for all measurable U. Function f is then called the probability density function of X.

Example 24 Suppose X,Y have uniform distribution in the unit disk. Then the joint density is f(x,y) = 1/p for x2+y2 £ 1 and the density of X is fX(x) = [2/( p)][Ö(1-x2)].

The relation between probabilities and the density is

f(a,b) = 2
a b
Pr
(X £ a, Y £ b).
(25)
Occasionally this can be used to determine the joint density.

2.5.1  Independent random variables

Independence of random variables is defined in terms of joint distributions. The intuition behind this definition is that the events that random variables may generate should be independent. Notice however that the actual definition is simpler than Definition 1.6. Can you explain why? Definition 9 Random variables X1,...,Xn are independent, or stochastically independent, if
Pr
(X1 Î U1, ..., Xn Î Un) = Pr
(X1 Î U1) ... Pr
(Xn Î Un)
(26)
for all measurable U1,..., Un Ì IR.

(Similarly we define the stochastic independence of random vectors).

We say that X1,...,Xn are independent identically distributed (i. i. d) if (26) holds and Pr(Xi Î U) = Pr(Xj Î U) for all Borel U Ì IR. [ 2 If X,Y are discrete with the probability mass function f(x,y), then independence of X,Y is equivalent to f(x,y) = fX(x)fY(y).

If X,Y are continuous with the joint density f(x,y) then independence of X,Y is equivalent to f(x,y) = fX(x)fY(y).

Independence is often part of the model. Independence allows to determine joint distributions from marginals. Thus each independent random variable can be analyzed separately, and then more complex questions can be answered. From the mathematical perspective, under independence we can determine joint distributions if we know marginals. Example 25 Suppose X is binomial Bin(n,p) and Y is Poiss(l). If X,Y are independent, then the joint probability mass function of X,Y is given by f(x,y) = (xn)pn(1-p)n-xe-lly/y! (or 0).

Example 26 [Example 1.2.2 continued] Two drivers arrive at an intersection between 8:00 and 8:01 every day. Their arrival times are independent random variables. Indeed, using formula (25) and elementary area computation, Pr(X < a,Y < b) = ab for 0 < a,b < 1.

2.6  Functions of r. v.

Some random variables are obtained by taking functions of another ones, possibly multi-dimensional. In the notes we often limit our attention to a single random variable Z given by a function Z = f(X,Y) of two arguments; this is convenient for notation and exhibits most of the interesting techniques.

Sums and linear combinations, medians, maxima, and minima are perhaps the most often encountered functions of multidimensional random variables. Methods to compute the distribution, or the expected value of such a function are of considerable practical significance.

2.7  Moments of functions

If Z = f(X,Y) has the expected value, then EZ can be computed directly without computing the density, or probability mass function of Z. The following identity is useful.

If X,Y are discrete and EZ exists, then

Ef(X,Y) =
å
x,y 
f(x,y)Pr
(X = x, Y = y).
(27)

If X,Y are jointly continuous then Z might be continuous, discrete, or say of mixed type. Regardless of the case

Ef(X,Y) = ó
õ
ó
õ


IR2 
f(x,y)f(x,y) dx dy,
(28)
and the double integral converges if EZ exists. Conversely, if the integral converges, then EZ exists and is given by formula (28).

In particular the expected value is linear

E(aX+bY+c) = aEX+bEY+c.
(29)
This can be easily verified using (28), but the identity (28) is beyond the scope of this notes, as we do not want to dwell on the general definition of EZ that would encompass all cases.

The fact that expected value is linear provides a simple method of computing some otherwise difficult sums. Example 27 Suppose X is Binomial Bin(n,p). Then X = X1+...+Xn, where Xj is the number of successes in j-th trial. Clearly each Xj is 0, or 1, and EXj = p.

[ 16 [Example 1.2.2 continued] Two drivers arrive at an intersection between 8:00 and 8:01 every day. On average how much time lapses between their arrivals?

Definition 10 Let m = EX. The variance Var(X) is defined as Var(X) = E(X-m)2.

Notice that

Var(aX+b) = a2Var(X).
(30)
In particular, Var(X) = 0 when X = const. Sometimes a more convenient expression for the variance is
Var(X) = EX2-(EX)2 = E(X-EX)2.

The standard deviation is s = [ÖVar(X)].

Name Probability distribution Var(X)
Normal f(x) = [1/(Ö{2p}s)]exp-[((x-m)2)/( 2s2)]s2
Exponential f(x) = le-lx [1/( l2)]
Uniform f(x) = [1/( b-a)] [1/ 12](b-a)2
Gamma
Weibull
Binomial Pr(X = k) = (nk)pk(1-p)n-knp(1-p)
Poisson Pr(X = k) = e-l[(lk)/ k!]l
Geometric Pr(X = k) = p(1-p)k-1 [1/( p2)]
Hypergeometric
Negative Binomial

Table 2.4: Variances of some random variables.

Problem 13 Compute variances Var(X) for the entries in Table 2.4. (Some of these are a real challenge to your computational skills, so you may safely give up. Another method will make it easier in Chapter ).

Since Var(X) = EX2-(EX)2 ³ 0, the following inequality follows

(EX)2 £ EX2
(31)

Definition 11 The covariance of random variables X,Y with expected values mX, mY is defined as cov(X,Y) = E(X-mX)(Y-mY).

Clearly Cov(X,X) = Var(X) and Var(X+Y) = Var(X)+Var(Y)+2cov(X,Y).

Theorem 3 If X,Y are independent, then Var(X+Y) = Var(X)+Var(Y).

Example 28 let X1,X2,...,Xn be independent {0,1}-valued random variables, and suppose that Pr(Xj = 1) = p. Then Var(åj = 1nXj) = np(1-p). What is the distribution of åj = 1nXj?

Tail integration formula revisited

Example 29 If X > 0 then EeX = 1+ò0¥ et P(X > t) dt

Problem 14 Show that EX2 = ò0¥ t P(|X| > t) dt.

Generalize this formula to E|X|p.

Chebyshev-Markov inequality

Special cases of Chebyshev's inequality (23) are:

Pr
(|X| > t) < 1
t
E|X|
(32)
Pr
(|X-m| > t) < 1
t2
Var(X)
(33)
Pr
(|X| > t) < e-atEeaX
(34)

Chebyshev's inequality is one reason we often strive for small average quadratic error. If E|X-X0|2 < e then we can be sure that Pr(|X-X0| > 3Ö{e}) < 3Ö{e}. The following may be used (with some caution) in computer programs to asses error in estimating probabilities by sampling.

Example 30 If X is Bin(n,p) then Pr(|X/n-p| > [1/( 3Ön)]) £ [1/( 43Ön)]

[ 17 Run a simulation of the event that you know probability of, printing out the error estimate.

Do the same for the event that you don't know the probability analytically. (Use programs written for previous exercises)

2.7.1  Simulations

The unknown variance s2 of a sequence of simulated random variables Xj can be approximated by 1/n åj = 1n (Xj-[`X])2, where [`X] is the arithmetic mean of X1,..., Xn. Thus can also use (33) and Theorem 2.7 to asses errors in estimating variances. Another more accurate method is presented later on in Chapter , but it also requires estimating variances.

From now on, in the output of your simulation programs you should provide some error estimates.

2.8  Application: scheduling

A critical path analysis involves estimating time of completing a project consisting of many tasks of varying lengths. Some of the tasks can be done concurrently, while other may begin only after other preliminary tasks are completed. This is modeled by the dependency graph together with the estimated times.

2.8.1  Deterministic scheduling problem

As an example of simple scheduling problem consider the following. Example 31 Suppose that we want to bake a batch of chocolate-chip cookies. The tasks and their (estimated) times are:

The dependency graph is quite obvious here; for instance, we cannot start baking before batter is ready. What is the shortest time we can eat the cookies?

2.8.2  Stochastic scheduling problem

In some projects, the actual numbers are only the averages, and the actual completion times of the projects may be random. The distribution of the actual completion time, or even its average may be difficult to compute analytically. Nevertheless, simulations let us estimate the average and analyze the probabilities of events.

[ 18 Suppose for the sake of this exercise that the numbers presented in the cookie-baking example are just the average values of the exponential random variables.

2.8.3  More scheduling questions

In more realistic analysis of production processes we also have to decide how to split the tasks between available personnel. Example 32 This exercise refers to the tasks presented in Example 2.8.1. On average, how fast can a single person bake chocolate-chip cookies? What if there are two people?

2.9  Distributions of functions

Distributions of functions of random variables are often difficult to compute explicitly. Special methods deal with more frequent cases.

Sums of discrete r. v.

Sums can be handled directly, but a more efficient method uses generating functions of Chapter .

Suppose X,Y are discrete and f(x,y) = Pr(X = x, Y = y) is their joint probability mass function. Then Z = X+Y is discrete with values z = x+y. Therefore

fZ(z) =
å
x 
f(x,z-x)
(35)

For independent random variables X, Y this takes a slightly simpler form.

fZ(z) =
å
x 
fX(x)fZ(z-x)
(36)

Formula (36) can be used to prove the so called summation formulas. Theorem 4 If X, Y are independent Binomial with the same parameter p, ie. X is Bin(n,p) and Y is Bin(m, p), then X+Y is Binomial Bin(n+m,p).

If X, Y are independent Poisson Poiss(lX) and Poiss(lY), then X+Y is Poisson Poiss(lX+lY).

Sums of continuous r. v.

One can prove that if X,Y are independent and continuous than X+Y is continuous with the density

f(z) = ó
õ
¥

-¥ 
fX(u)fY(z-u)  du
(37)

Formula (37) defines the convolution fX*fY. It can be used to prove the so called summation formulas. Theorem 5 If X, Y are independent Normal, then X+Y is Normal.

If X, Y are independent Gamma with the same parameter b, then X+Y is Gamma(aX+aY,b).

Example 33 [Example 1.2.2 continued] Two drivers arrive at an intersection between 8:00 and 8:01 every day. What is the density of the time that lapsed between their arrivals?

Example 34 Suppose X,Y are independent U(0,1). The density of Z = X+Y is

f(z) = {
z
if 0 £ z £ 1
2-z
if 1 £ z £ 2
.

Minima, maxima

Minima and maxima occur for instance if we are waiting for one of the independent events, and then we follow the first one (minimum), or the last one (maximum). Embedded Markov chains construction in Section are based on minima of independent exponential r. v.

Suppose X,Y are independent. If U = min{X,Y} then Pr(U > t) = Pr(X > t)Pr(Y > t). Therefore the reliability function of U can be computed from the two given ones.

Example 35 If X, Y are independent exponential, then U = min{X,Y} is exponential.

If U = max{X,Y} then Pr(U < t) = Pr(X < t)Pr(Y < t). Therefore the cumulative distribution function of U can be computed from the two given ones.

Example 36 Suppose X,Y are independent uniform U(0,1). Then U = max{X,Y} has the density f(u) = 2u for 0 < u < 1.

Problem 15 Let U1,..., Un be independent uniform U(0,1). Find the density of

<