[section] <font size="+3">Applied <br> <font size="+2">Probability and Stochastic Processes <br> <font size="-1">Lecture Notes for 577/578 Class</font>

Probability and Stochastic Processes
Lecture Notes for 577/578 Class

Wlodzimierz Bryc
Department of Mathematics
University of Cincinnati
P. O. Box 210025
Cincinnati, OH 45221-0025
E-mail: brycw@math.uc.edu

Created: October 22, 1995
Modified: 1996
Printed: Mar 13, 1998
© 1995 Wodzimierz Bryc
WWW version

Course description

This course is aimed at students in applied fields and assumes a prerequisite of calculus. The goal is a working knowledge of the concepts and uses of modern probability theory. A significant part of such a ``working knowledge" in modern applications of mathematics is computer-dependent.

The course contains mathematical problems and computer exercises. Students will be expected to write and execute programs pertinent to the material of the course. No programming experience is necessary or assumed. But the willingness to accept a computer as a tool is a requirement.

For novices, BASIC programming language, (QBASIC in DOS, Visual Basic in Windows, or BASIC on Macintosh) is recommended. BASIC is perhaps the easiest programming language to learn, and the first programming language is always the hardest to pick.

Programs in QBASIC 4.5 on IBM-compatible PC and, to a lesser extend, programs on Texas Instrument Programmable calculator TI-85, and WindowsTM programs in Microsoft Visual Basic 3.0 are supported. This means that I will attempt to answer technical questions and provide examples. Other programming languages (SAS, C, C++, Fortran, Cobol, Assembler, Mathematica, LISP, TEX (!), Excel, etc.) can be used, but I will not be able to help with the technical issues.

Contents of the course (subject to change)

Supporting materials

This text is available through the Internet in PDF, PostScript, or DVI formats. A number of other math related resources can be found on WWW. Also available are supporting BASIC program files.

Auxiliary textbooks:



The text has three types of ``practice questions", marked as Problems, Exercises, and Projects. Examples vary the most and could be solved in class by an instructor; Exercises are intended primarily for computer-assisted analysis; Problems are of more mathematical nature. Projects are longer, and frequently open-ended. Exercises, Projects, and Problems are numbered consecutively within chapters; for instance Exercise 10.2 follows Problem 10.1, meaning that there is no Exercise 10.1.


The text refers to BASIC instructions with the convention that BASIC key-words are capitalized, the names of variables, functions, and the SUBs are mixed case like ThisExample. Program listings are typeset in a special ``computer" font to distinguish them from the rest of the text.

Copyright and License for 1996 version

This textbook is copyrighted © W. Bryc 1996. All rights reserved. GTDR.

Academic users are granted the right to make copies for their personal use, and for distribution to their students in academic year 1995/1996.

Reproduction in any form by commercial publishers, including textbook publishers, requires written permission from the copyright holder.

WWW version© 1996 W. Bryc

Part 1

Chapter 1
Random phenomena

Definition of a Tree: A tree is a woody plant with erect perennial trunk of at least 3.5 inches (7.5 centimeters) in diameter at breast height (41/2 feet or 1.3 meters), a definitely formed crown of foliage, and a height of at least 13 feet (4 meters).
The Auborn Society Field Guide to North American Trees.

This chapter introduces fundamental concepts of probability theory; events, and their chances. For the readers who are familiar with elementary probability it may be refreshing to see the computer used for counting elementary events, and randomization used to solve a deterministic optimization problem.

The questions are

1.1  Mathematical models, and stochastic models

Every theory has its successes, and its limitations. These notes are about the successes of probability theory. But it doesn't hurt to explain in non-technical language some of its limitations up front. This way the reader can understand the basic premise before investing considerable time.

To begin with, we start with a truism. Real world is complicated, often to a larger degree than scientists will readily admit. Most real phenomena have multi-aspect form, and can be approached in multiple ways. Various questions can be asked. Even seemingly the same question can be answered on many incompatible levels. For instance, the generic question about dinosaur extinction has the following variants.

Theses questions are ordered from the most individual level to the most abstract. The reader should be aware that probability theory, and stochastic modelling deal only with the most abstract levels of the question. Thus, a stochastic model may perhaps shed some light whether dinosaurs had to go extinct, or whether mammals will go extinct, but it wouldn't go into details of which comet had to be responsible for dinosaurs, or which is the one that will be responsible for the extinction of mammals.

It isn't our contention that individual questions have no merit. They do, and perhaps they are as important as the general theories. For example, a detective investigating the cause of a mysterious death of a young woman, will have little interest in the ``abstract statistical fact" that all humans eventually die anyhow. But individual questions are as many as the trees in the forest, and we don't want to overlook the forest, either.

Probabilistic models deal with general laws, not individual histories. Their predictions are on the same level, too. To come back to our motivating example, even if a stochastic model did predict the extinction of dinosaurs (eventually), it would not say that it had to happen at the time when it actually happened, some 60 million years ago. And the more concrete a question is posed, say if we want to know when Dino the dinosaur died, the less can be extracted from the stochastic model.

On the other hand, many concepts of modern science are define in statistical, or probabilistic sense.(If you think this isn't true, ask yourself how many trees do make a forest.) Such concepts are best studied from the probabilistic perspective. An extreme view is to consider everything random, deterministic models being just approximations that work for small levels of noise.

1.2  Events and their chances

Suppose W is a set, called the probability, or the sample space. We interpret W as a mathematical model listing all relevant outcomes of an experiment.

Let \cal M be a s-field of its subsets, called the events. Events A,B \cal M model sentences about the outcomes of the experiment to which we want to assign probabilities. Under this interpretation, the union AB of events corresponds to the alternative, the intersection AB corresponds to the conjunction of sentences, and the complement A corresponds to the negation of a sentence. For A,B \cal M, A\B: = AB denotes the set-theoretical difference.

For an event A \cal M the probability Pr(A) is a number interpreted as the degree of certainty (in unique experiments), or asymptotic frequency of A (in repeated experiments). Probability Pr(A) is assigned to all events A \cal M, but it must satisfy certain requirements (axioms). A set function Pr(·) is a probability measure on (W,\cal M), if it fulfills the following conditions:

  1. 0 Pr(A) 1
  2. Pr(W) = 1
  3. For disjoint1 Pr(AB) = Pr(A)+Pr(B).
  4. If An are such that n 1 An = and A1 A2 ... An An+1 ... are decreasing events, then
    (An) 0.

Probability axioms do not determine the probabilities in a unique way. The axioms provide only minimal consistency requirements, which are satisfied by many different models.

1.2.1  Uniform Probability

For finite set W let
(A) = # A
This captures the intuition that the probability of an event is proportional to the number of ways that the event might occur.

The uniform assignment of probability involves counting. For small sample spaces this can be accomplished by examining all cases. Moderate size sample spaces can be inspected by a computer program. Counting arbitrary large spaces is the domain of combinatorics. It involves combinations, permutations, generating functions, combinatorial identities, etc. Short recalls the most elementary counting techniques. Problem 1 Three identical dice are tossed. What is the probability of two of a kind?

The following BASIC program inspects all outcomes when five dice are rolled, and counts how many are ``four of a kind".

PROGRAM yahtzee.bas

'declare function and  variables
DECLARE FUNCTION CountEq! (a!, b!, c!, d!, e!)

'prepare screen
PRINT "Listing four of a kind outcomes in Yahtzee..."

'*** go through all cases
FOR a = 1 TO 6: FOR b = 1 TO 6: FOR c = 1 TO 6: FOR d = 1 TO 6: FOR e = 1 TO 6
   IF CountEq(a + 0, b + 0, c + 0, d + 0, e + 0) = 4 THEN
       PRINT a; b; c; d; e; : ct = ct + 1
       IF ct MOD 5 = 0 THEN PRINT :  ELSE PRINT "|";

'print result
PRINT "Total of "; ct; " four of a kind."

FUNCTION CountEq (a, b, c, d, e)
'*** count how many of five numbers are the same
DIM x(5)
x(1) = a
x(2) = b
x(3) = c
x(4) = d
x(5) = e
max = 0
FOR j = 1 TO 5
    ck = 0
    FOR k = 1 TO 5
        IF x(j) = x(k) THEN ck = ck + 1
    NEXT k
    ck = -ck
    IF ck > max THEN max = ck
'assign value to function
CountEq = max

Here is a portion of its output:

The program is written explicitly for tossing five dice, and you may want to modify it to answer similar question for any number of dice, and an arbitrary k-of-a-kind question.

[ 1 Run and time YAHTZEE.BAS. Then estimate how long a similar problem would run if the question involved tossing 15 fair dice. The answer depends on your computer, and the software. Both Pascal and C-programs seem to run on my computer about 15 times faster than the (compiled) QuickBasic. ANS: Running time would take years!

Exercise 1.2.1 shows the power of old-fashioned pencil-and-paper calculation. Problem 2 Continuing Problem 1.2.1, suppose now n identical dice are tossed. What is the probability of n-1 of a kind? ANS: [5 n/( 6n-1)].

1.2.2  Geometric Probability

For bounded subsets W IRd, put

(A) = |A|
This captures the intuition that the probability of hitting a target is proportional to the the size of the target.

Geometric probability usually involves multivariate integrals. Example 1 A point is selected from the 32 cm wide circular dartboard. The probability that it lies within the 8cm wide bullseye is [1/ 16].

Example 2 A needle of length 2l < 2 is thrown onto a paper ruled every 2 inches. The probability that the needle crosses a line is 2l/p.

Analysis of Example 1.2.2 is available on WWW.

[ 2 Test by experiment (or simulation on the computer) if the above two examples give correct answers. (Before writing a program, you may want to read Section first.)

Example 3 Two drivers arrive at an intersection between 8:00 and 8:01 every day. If they arrive within 15 seconds of each other, both cars have to stop at the stop-sign. How often do the drivers pass through the intersection without stopping?

[ 1 Continuing Example 1.2.2: What if there are three cars in this neighborhood? Four? How does the probability change with the number of cars? At what number of users a stop light should be installed?

1.3  Elementary probability models

1.3.1  Consequences of axioms

Here are some useful formulas that are easy to check with the help of the Venn diagrams. For all events A, B
(AB) = Pr
(A) = 1-Pr
If A B then
(B\A) = Pr
If An are pairwise disjoint events, then

n = 1 
An) =

n = 1 

1.3.2  General discrete sample space

For a finite or countable set W IN and a given summable sequence of non-negative numbers an put

(A) =

n A 

n W 
In probability theory it is customary to denote

pk = [(ak)/( n W an)] and rewrite (8) as Pr(A) = n A pn.

Formula (8) generalizes the uniform assignment (2), which corresponds to the choice of equal weights ak = 1. At the same time it is more flexible2 and applies also to infinite sample spaces. Example 4 Let W = IN and put pk = [1/( 2k)]. Then the probability that an odd integer is selected is Pr(Odd) = j = 0 2-2j-1 = 2/3. ()

Table list the most frequently encountered discrete probability assignments.

Name W Probabilities pk
Binomial {0,...,n} pk = (nk)pk(1-p)n-k
Poisson ZZ\scriptscriptstyle + pk = e-l[(lk)/ k!]
Geometric IN pk = p(1-p)k-1
Equally likely outcomes {x1,...,xk} pk = 1/k

Table 1.1: Probability assignments for discrete sample spaces.

Problem 3 For each of the choices of numbers pk in Table 1.1 verify that (8) indeed defines a probability measure.

The reasons behind the particular choices of the expressions for pk in Table 1.1 involve modeling.

1.3.3  General continuous sample space

There is no easily accessible general theory for infinite non-countable sample spaces. When W IRk, the generalization of the geometric probability uses a non-negative density function f(x1,x2,...,xk)

(A) = C

f(x1,x2,...,xk) dx1 dx2... dxk
For one-dimensional case k = 1, examples of densities are collected in Table on page .

1.4  Simulating discrete events

The most natural way to verify whether a mathematical model reflects reality is to compare theoretically computed probabilities with the corresponding empirical frequencies. Unfortunately, each part of this procedure is often time consuming, expensive, inconvenient, or impossible. Computer simulations are used as a substitute for both: they provide numerical estimates of theoretical probabilities, and they are closer to the direct experiment.

The first step in a simulation is to decide how to generate ``randomness" within a deterministic algorithm in a computer program. Programming languages, and even calculators, provide a method for generating consecutive numbers from the interval (0,1) ``at random". For instance, BASIC instruction3 PRINT RND(1) returns different number at every use. We shall assume that the reader has access to a of generating uniform numbers from the unit interval (0,1). These are called pseudo-random numbers, since the program usually begin the same ``random" sequence at every run, unless special precautions4 are taken.

Once a pseudo-random number from the interval (0,1) is selected, an event that occurs with some known probability p can be simulated by verifying if { RAND(1) < p} occurs in the program. For instance, the number of heads in a toss of a 1,000 fair coins is simulated by the following BASIC program.

PROGRAM tosscoin.bas

'Simulating 1000 tosses of a fair coins

H = 0
FOR n = 1 TO 1000  'main loop
        IF RND(1) < .5 THEN H = H + 1
'print final message
PRINT "Got "; H; " heads this time"

Here is its output:Got 520 heads this time

A simple method for simulating an outcome on a six-face die is to take the integer part of a re-scaled uniformly selected number INT(1+6*RND(1)). Can you use this to write a simulation of a roll of 5 dice? Such simulations are often used to evaluate probabilities empirically as a substitute for a real empirical study.

[ 3 Write the simulation (as opposed to deterministic inspection of all sample points on page pageref) to estimate how often the event ``four of a kind" occurs in a roll of five dice.

[ 2 Run the (modified) coin tossing program in a loop and to answer the following questions.

Hint: Move the main part of TOSSCOIN.BAS into a SUB, or a FUNCTION. This way you can easily use it without cluttering your program with irrelevant details. (See a generic template below.)

More complicated objects are often of interest in simulations. For instance we may want to draw two cards from the deck of 52. One possible way to do it is to number the cards, and select two numbers a,b.

  1. Select the first card a=INT(RND(1)*52+1) as a random integer between 1 and 52.
  2. Select the second card b=INT(RND(1)*52+1) in the same way.
  3. Compare a,b

    1. If a = b then repeat step 2
    2. Otherwise, got two different cards a b at random

How efficient is this procedure? [ 4 How would you simulate on a computer a random permutation? A random subset?

1.4.1  Generic simulation template

The purpose of simulation is to investigate the unknown values of parameters of interest. In the initial exercises you may want to simulate the events that you know how to compute probabilities of. The purpose of such exercises is to develop intuition about reliability of simulations.

In more advanced exercises you may want to estimate probabilities that aren't known. In such cases it is always a good idea to run simulations of various lengths and compare the results. In this section we briefly discuss how such a simulation can be organized in a way that promotes multiple uses of the same program.

The key is organizing the programs carefully into manageable blocks of small size. Modern BASIc is a structural programming language. The generic program to study the effects of the length of simulation on its output can be written as follows

' PROGRAM Generic.bas
'Generic Simulator
'Size is simulation size varied from Min=100 to Max=10000
For Size 100 to 10000 Step 100
Simulate(Size, Result)
Print "Simulation size="; Size ; Öutput="; Result
Next Size

The actual simulation is performed by

SUB Simulate (SizeRequested, Result)
'Runs requested number of simulations and returns average
'Trial numbers consecutive simulations from 1 to SizeRequested
For Trial=1 to SizeRequested
Next Size
'Most simulations return averages of single trials

The actual modeling is performed in another SUB, which in the generic program we named SimulateOne. This SUB may be as simple as simulating a toss of a single coin

SUB SimulateONe(Outcome)
'simulate One occurrence, return numerical outcome
if RND(1)<1/2 THEN Outcome=1

Or it can be as complicated as we wish. The example below simulates a toss of five dice, and uses previously introduced function CountEq(a,b,c,d,e). four-of-a-kind.

SUB SimulateONe(Outcome)
'simulate One occurrence, return numerical outcome
IF CountEq(d1,d2,d3,d4,d5)=4 THEN Outcome=1

1.4.2  Blind search

Elementary probability when coupled with a fast computer is one of the simplest effective optimization method. The method is the blind search - a search for the best answer at random6. Pure blind search is usually simple to run, and therefore fast to realize. It often finds answers that are good enough for practical purposes, or at least can serve as the preliminary estimates. Various ad hoc modifications increase accuracy and are usually easy to implement, too.

[ 3 Write a blind-search program to find the maximum of a function.

1.4.3  Application: Traveling salesman problem

The following program searches for the shortest way to pass through the first n cities7 in the USA in alphabetic ordering.

Planning such a tour is easy by hand for 3-4 cities. For longer tours some help is needed. To check the performance of the blind search, you may want to know what the usual algorithms involve. A greedy method starts with the shortest distance, and then keeps going to the closest city not yet visited. Another heuristic method is to to select a pair of closest cities and accept the best (shortest) connection among those that do not complete a shorter loop, and do not introduce a spurious third connection to a city. Eventually, the disjoint pieces will form a path that often can be further improved upon inspection.

The program is longer but not at all sophisticated - it just selects paths at random. Notice that a solution to Exercise 1.4 - how to generate a random permutation - is given in one of the subprograms (which one?). The method for the latter is simple-minded - the algorithm attempts to place consecutive numbers at random spots until an empty spot is selected.

The following is the main part of the program. You can use it as a template in designing your own version of Blind search programs. The full code with SUBs is online in RANDTOUR.BAS.

'**** get number of cities (no choice which) from user
INPUT ; "Shortest distance between how many cities?", n

'*** initialize program
nMax = 19 ' current data size. Make sure not exceeded!
IF n > nMax THEN n = nMax

'declare arrays
DIM SHARED dist(nMax, nMax) ' matrix of distances
DIM P(n), BestP(n)

'read distances
CALL AssignDistances(nMax)

'initial permutation
FOR j = 1 TO n
        P(j) = j
        BestP(j) = j
'initial length of trip
MinLen = PathLen(P())

'*** main loop
DO  'infinite loop till user stops
'count trials
no = no + 1
        '*** interacting with user
        'check if user pressed key to stop
        k$ = INKEY$
        IF k$ > "" THEN EXIT DO  'exit infinite loop
        'display currrent progress
        display (no)
        '*** get any path
        CALL GetPermutation(P())
        x = PathLen(P())
        IF x < MinLen THEN
                'Better path found, so memorize and display
                Dlen = MinLen - x
                MinLen = x
                '*Memorize best order and print 
                FOR j = 1 TO n
                        BestP(j) = P(j)
                        PRINT city(P(j)); "->";
                NEXT j
                'Finish printing
                PRINT city(P(1))
                PRINT "Best so far: "; MinLen
                PRINT "Progress rate "; Dlen / (no - Slen); " miles per trial"
                Slen = no
        END IF
'Print final message
PRINT "Blind Search Recommended Route found in "; Slen; "-th search"
FOR j = 1 TO n - 1
        PRINT city(BestP(j)); "-->";
PRINT city(BestP(n)); "-->"; city(BestP(1))
PRINT "Total distance: "; MinLen
LOCATE 22, 40
PRINT "(Distances subject to change)"
The program runs in the infinite loop until it is stopped by the user. Once stopped, the program prints the best route it found. For larger sets of cities we may have hard time deciding when to stop it. Here is a sample output (from the improved version, as marked in the actual code):

When you run this program on larger sets of cities, you will notice that the program is not fast. One possible improvement in the design of this program is to modify the randomization to be less likely to pick long paths. For instance, you can attempt to modify paths that are known to be short, or weight the modifications by lengths of resulting paths. Such methods are actually in use in image restoration problems (simulated annealing), see page .

1.4.4  Improving blind search

A bit of experimenting with various ``pure" blind search programs should convince you that

It is possible to improve on the last aspect without complicating the program much. The idea is to make random modifications of the currently best found value. For example, in a one-dimensional maximization of a function f(x), we would do the following steps

  1. Pick an initial ``best-so-far" point x0 and compute initial value y0 = f(x0)
  2. Select at random x1 in the ``neighborhood" of x0 and compute y1 = f(x1)
  3. Compare y0, y1.

    1. If y1 < y2 then repeat Step 2.
    2. If y1 y0 then make x1 the new ``best-so-far" y0: = y1,x0: = x1. Then repeat Step 2.

  4. Stop the program at user request, or when no changes to y0 occur for prolonged number of attempts to improve it.

We want to allow for the chance of checking points far away from the ``best-so-far" answer. But we don't want this to happen too often, because x0 might be rather close to the optimum. The tradeoffs are that the program will tend to follow ``direct path" to the maximum, but the danger is that it will get stuck longer in local maxima.

Improved blind search with time/state dependent randomization is actually implemented within RANDTOUR. It is commented out in the version on the disk, so that it isn't operational. To make it active, uncomment the call to ImproveBest as a replacement for GetPermutation.

1.4.5  Random permutations

Program RANDTOUR.BAS selects permutations at random only in its ``simplest" variant. Here are a few examples of problems that require selecting random permutations.

The ``naive" selection of random permutation wastes many random numbers. Here is an algorithm that conserves resources better. The basic idea is to select a number from the beginning of the list of all numbers, and move it down to the end. The randomly re-arranged numbers are P(1),P(2),...,P(n)

SUB GetPermutationFast(P())
'Put random integers into P()
n=UBOUND(P) 'how many entries
'this loop can be omitted is we are sure that P(j) list all the numbers we want
FOR j=1 TO n
next j
for j=1 to UBOUND(P) 'not n as n changes in the loop

SWAP P(n), P(k)
next j

[ 5 There are many incompatible measures of ``quality" of an algorithm. One of the ``objective" criteria is the number of ``If" verifications. Another ``objective criterion" might be the number of calls to a function. A ``subjective" criterion, which depends on the hardware and circumstances, is timing.

Does SUB GetPermutationFast deserve adverb Fast in its name?

[ 4 The Subset-Sum Problem is stated as follows:

Let S be a set of positive integers and let t be a positive integer. Decide of there is a subset S S such that the sum of integers in S is exactly t.

The associated optimization problem is to find a subset S S whose elements' sum is the largest but doesn't exceed t. This optimization problem is NP-complete, ie it isn't known if there is a polynomial time algorithm (polynomial in the size of S) to find S.

Investigate how the blind search will do on sets S selected at random, and on sets S constructed in more regular fashion like arithmetic progression S = {a, 2a, 3a, ...}, geometric progression S = a,a2, a3,....

1.5  Conditional probability

In modeling more complicated phenomena we may want to use different probabilities under different circumstances. For instance, in a modified blind search for the minimum of a non-negative function, the randomization strategy might be different when we already made some progress, and it might be different when we are ``stuck" in a non-optimal location. Thus we may want to consider probabilities of the same event A (say, hitting a maximum) under different conditions B.

To formalize this idea, suppose B is an event such that Pr(B) 0. The last condition merely says that B is an event that does have some chance of occurring. Conditional probability of event A given event B is denoted by Pr(A|B). It is defined as

Conditional probability satisfies the axioms of probability, and Pr(A|B) = 0 if A,B are disjoint. In particular, Pr(A|B) = 1-Pr(A|B), Pr(B|B) = 1.

The easiest way to find Pr(A|B) by simulations is to repeatedly simulate the complete experiment, discarding all the outcomes except the ones resulting in B.

[ 6 Suppose we toss repeatedly a fair coin, and the ``success" is to get heads.

Use computer simulations to find the conditional probability that the very first trial was successful, if 10 consecutive (and independent) trials resulted in 8 successes.

Try to answer the same question under the condition that 50 independent trials resulted in 40 successes.

You should notice that it takes forever to simulate events that happen rarely. Section indicates one possible way out of this difficulty.

1.5.1  Properties of conditional probability

Conditional probability is used in modeling. Often Pr(A|B) can be assigned by ``intuitive" considerations. It can then be used to compute other probabilities. The simplest example is Pr(AB) = Pr(B)Pr(A|B), which is a direct consequence of the definition.

Example 5 Suppose we have a deck of 52 cards numbered 1 through 52. Since it isn't obvious how to simulate selecting 5 cards without replacement, we may want to select them with replacement instead. Let A denote the event that all five cards are different. What is the probability of A?

We may perform the experiment sequentially, drawing one card at a time. Let Ak denote the event that the k consecutive draws resulted in different cards. Then A = A5 A4 ... A1. Moreover, Pr(A1) = 1.

Clearly, Pr(A2|A1) = [51/ 52], so Pr(A2) = Pr(A2A1) = Pr(A2|A1)Pr(A1) = [51/ 52]. Similarly, Pr(A3) = Pr(A3A2) = Pr(A3|A2)Pr(A2) = [50/ 52][51/ 52]. Continuing this we get Pr(A5) = [51 50 49 48/( 524)] 0.82.

The following identities are also of interest.

  1. Path Probability: Pr(k = 1n Ak) = k = 1nPr(Ak|j = 1k-1 Aj)
  2. Bayes theorem: Pr(A|B) = [(Pr(B|A)Pr(A))/Pr(B)]
  3. Total probability formula: If {Bn} are pairwise disjoint and exhaustive, ie. Pr(AiBj) = for i j and Bn = W, then
    (A) =


[ 7 What is the probability that in a class of 30 students no matching birthdays occur?

Example 6 A lake has 200 white fish and a 100 black fish, and a nearby pond contains 20 black fish and 10 white ones. No other fish live there.

A fish is selected at random from the lake and moved to the pond. Then a fish is selected from the pond and moved back to the lake. What is the probability that all fish in the pond are black?

1.5.2  Sequential experiments

Often the main experiment consist of a sequence of sub-experiments, each depending on the outcome of the previous one. If n such sub-experiments are chained, then the full experiment results in a chain of events, or a path \cal P = A1A2...An. If we assume that k-th experiment depends on the outcome of the k-1-th experiment only, then Pr(Ak|Ak-1... A1) = Pr(Ak|Ak-1).

Denoting by \cal P the generic path A1A2...An, and by \cal P(k) = Ak we have the following path integral formula for the probability of an event \cal F specifying the outcome of the complete experiment, and consisting of many paths \cal P.

(\cal F) =

\cal P \cal F 
(\cal P) =

\cal P F 
|\cal P |

k = 0 
(\cal P(k+1)|\cal P(k)).

Example 7 Suppose that the double transfer operation from Example 1.5.1 was repeated twice. That is, a random selection was done four times. What is the probability that all fish in the pond are black?

[ 8 Check by simulation how the proportion of black fish in the pond changes when the random transfers from Example 1.5.1 are performed repeatedly for a long time.

1.6  Independent events

This section introduces the main modeling concept behind the entries in the Table 1.1.

Two events A,B are independent, if the conditional probability is the same as unconditional, Pr(A|B) = Pr(A). This is stated in multiplicative form which exhibits symmetry and includes trivial events8 Definition 1 Events A, B are independent if Pr(AB) = Pr(A)Pr(B).

Independence captures the intuition of non-interaction, and lack of information. In modeling it is often assumed rather than verified. For instance, we shall assume that the events generated by consecutive outputs of the random generator are independent. We also assume that tosses of a coin (fair, or not!) are independent.

Beginners sometimes confuse disjoint versus independent events. Exclusive (ie. disjoint) events capture the intuition of non-compatible outcomes. Not compatible outcomes cannot happen at the same time. This is not the same as independent outcomes. If A, B are disjoint and you know that A occurred, then you do know a lot about B. Namely you know that B cannot occur. Thus there is an interaction between A and B. Knowing whether A occurred influences chances of B, which is not possible under independence.

Independence (or, more properly, mutual stochastic independence) of families of events is defined by requesting a much larger number of multiplicative conditions. The reason behind is Theorem , which provides a very convenient tool. Definition 2 Events A1, A2, ..., An are independent, if Pr(j J Aj) = j J Pr(Aj) for all finite subsets J IN.

Example 8 A coin is tossed repeatedly. Find the probability that heads appears for the first time on the fourth toss.

Problem 4 SUB GetPermutation from the program RANDTOUR.BAS selects numbers between 1 and n at random until it finds a number not yet on the list. Then it ads the number to its list, and repeats the process.

  1. What is the probability that the second number added to the list required more than k attempts?
  2. What is the probability that the last number added to the list required more than k attempts?

Another important concept is the conditional independence. For example, many events in the past and in the future are dependent. But in many mathematical models, past and future are independent conditionally on the present situation. In such a model future depends on past only through present events!

Definition 3 Let C be a non-trivial event. Events A,B are C-conditionally independent if Pr(AB|C) = Pr(A|C)Pr(B|C).

1.6.1  Random variables

The general concept of probability space uses ``abstract" sets to represent outcomes of an experiment. But many examples considered so far, represented the outcomes in numerical terms.

Random variables are introduced for convenient description of experiments with numerical outcomes. (The other option is to select W IR, or W IRd.) If we want to run computer simulations, we need to represent even non-numerical experiments (like tossing coins) in numerical terms anyhow. Thus the language of random variables becomes the natural extension of elementary probability theory, expressing many of the same concepts in a little different language.

A random variable is the numerical quantity assigned to every outcome of the experiment. In mathematical terms, random variable is a function X:W IR with the property that sets {w W: X(w) a} are events in \cal M for all a IR. Recall that the last conditions means that we may talk about probabilities of events {w W: X(w) a}.

Probabilities for a one-dimensional r. v. are determined by the cumulative distribution function

F(x) = Pr
(X x)
The corresponding tail function R(x) = 1-F(x) = Pr(X > x) is sometimes called the reliability9 function.

Cumulative distribution function can be used to express probabilities of intervals Pr(a < X b) = F(b)-F(a). Since probability is continuous, (1) we can also compute Pr(X = a) = limb a+Pr(a < X b) = F(a+)-F(a). The right hand side limit F(a+) exists, as F is a non-decreasing function.

Example 9 Suppose F(x) = (1-e-x)0. Then Pr(|X-2| < 1) = e-1-e-3.

In probability theory we are concerned with probabilities. Random variables that have the same probabilities are therefore considered equivalent. We write X @ Y to denote the equality of distributions, ie. Pr(X U) = Pr(Y U) for all Borel sets U IR (say, all intervals U).

Vector valued r. v. are measurable the functions W IRd. In the vector case we also refer to X = (X1,... Xd) as the d-variate, or multivariate, random variable.

We will use the ordinary notation for sums and inequalities between random variables. There is however a word of caution. In probability theory, equalities and inequalities between random variables are interpreted almost surely. For instance X Y+1 means Pr(X Y+1) = 1; the latter is a shortcut that we use for the expression Pr({w W: X(w) Y(w)+1}) = 1. Problem 5 Show that F(x) = Pr(X x) is right continuous: limxa+F(x) = F(a).

1.6.2  Binomial trials

The statistical analysis of repeated experiments is based on the following. Theorem 1 Suppose that for j IN event Bj is either Sj or Sj, where events {Sj} are independent. Then {Bj} are independent.

A binomial experiment, called also binomial trials, consists of the sequence of simpler identical experiments that have two possible outcomes each. The independent events Sj represent successes in consecutive experiments. We assume that we have an infinite sequence of events S1,S2,... Sk,... that are independent and have the same probability p = Pr(Sj). We denote by Fj = Sj the failure in the j-th experiment, and put q = 1-p.

Two important random variables are associated with the binomial experiment are the number X of successes in n trials, and the number T of trials until first success. Example 10 The probability that number X of successes in n trials is k is Pr(X = k) = (kn)pkqn-k. (Here k = 0,..., n.)

Example 11 The probability of more than n attempts needed for the first success is Pr(T > n) = qn. The probability that first success occurs at the n-th trial is Pr(T = n) = pqn-1 (geometric).

Example 12 Geometric distribution has lack of memory property: Pr(T > n+k|T > n) = Pr(T > k).

Random variables are often described solely in terms of cumulative distribution function F(x), or formulas for Pr(X = x) without reference to the underlying probability space W. For instance, the number of minutes T that we spend waiting for a bird to come to the bird feeder at the back of my house is random, and I believe Pr(T = n) = pqn-1 because Pr(T > n+k|T > n) = Pr(T > k).

It is intuitively obvious that on average we get np successes in n trials. It is perhaps less obvious10 that on average we need 1/p trials to get the first success.

[ 9 Write a simulation program to verify the claims about the averages for several values of p.

Example 13 The probability that in 2n tosses of a fair coin, half are heads is [(2n)!/( 4n(n!)2)] [1/( { pn})] 0 as n. The latter isn't easy to prove, but the computer printout is quite convincing, see Table (note that [1/( {p})] 0.5642).

2nPr(X = n)Frequency in 1000 trials nPr(X = n)
100 0.07959 0.08200 0.56278
300 0.04603 0.06100 0.56372
500 0.03566 0.03700 0.56391
700 0.03015 0.02200 0.56399

Table 1.2: Probabilities Pr(X = n) in 2n Binomial trials.

1.7  Further notes about simulations

By now you should have written some simple simulation programs, and printed out the results. It is perhaps a good moment to pause and consider what are the aspects of simulations that we are interested in.

In general, we would like to get answers to questions that we don't know how to answer in any other way. But before we do that, we should develop some intuition on the cases that can check the answers. Therefore we begin with simulation of probabilities or averages that are known.

1.8  Questions

Problem 6 [Exercise] A family has two children, and one of them is a boy. What is the probability that they have two boys? (If you think this is too hard mathematics, do it as a computer assignment!)ANS: 1/3

Problem 7 [Exercise] A die is thrown until an ace turns up.

Assuming the ace doesn't turn up on the first throw, what is the probability that more than three throws (ie. at least four) will be necessary? ANS: 197/198

Suppose that an ace turns up on an even throw. What is the probability that it turned up on the second throw?

(If you think this is too hard mathematics, do it as a computer assignment!)

[ 5 A deck of 52 cards has 4 suits and 13 values per suit.

  1. Write a program simulating the hand of 5 cards.
  2. Use your program to answer the following questions:

    1. How often does a pair occur?
    2. How often does a two-pair occur?
    3. How often does a three of a kind (three of same value and two different) occur?
    4. How often does a four of a kind occur?
    5. How often does a full house (2+3) occur?
    6. How often does a straight (five cards in a row, not all same suit) occur?
    7. How often does a flush (five cards of one suit, not in order) occur?
    8. How often does a straight flush (five cards in a row all same suit) occur?

    If you think this is too difficult on a computer, compute the probabilities by hand.

[ 10 A math teacher in a certain school likes to give multiple choice tests, grade them as either right, or wrong, and then lets the students to go over the test and correct the ones they got wrong. This gives them two chances to get a problem right, and the chance of getting a question right increases even if the student just guesses the answers. Suppose a student simply guessed the first time, got the corrections, and guessed differently the second time on the wrong answers. How much his grade improves?

[ 6 This is the expanded version of Exercise 1.5.1. In a group of n people, how often at least two have the same birthday?

  1. Find the formula for the probability p(n) assuming 365 days per year, and equally likely birthdays.
  2. Compute the probabilities for n = 20, 30, 40, 50
  3. Write a simulation program, and verify if the simulation agrees with the theoretical answers.
  4. Modify the simulation program to allow for not equal birthdays. Assume January, February, March are less likely then the other days of the year. Change the parameters, and verify how the probabilities p(n) change as you depart from the uniform probabilities. If the change of p(n) is of the magnitude comparable to the simulation accuracy, clearly it is irrelevant.
  5. A randomly selected person has chance 1/4 to be born on a leap year. How does this affect the answers?

Chapter 2
Random variables (continued)

Tree species are not distributed at random but are associated with special habitats.
The Auborn Society Field Guide to North American Trees.

Intuitively, random variables are numerical quantities measured in an experiment. The concept11 is the core of probability theory; it leads outside of elementary probability and it touches advanced concepts of integration, function transforms and weak limits.

For convenience random variables are split into three groups: continuous, discrete, and the rest.

2.1  Discrete r. v.

Definition 4 X is a discrete r. v. if X(W) is countable.

The definition says that X is a discrete r. v. if there is a finite, or countable set \cal V of numbers (values) of X such that pv = Pr(X = v) > 0 and v \cal V pv = 1. The function f(v) = pv is then called the probability mass function of X. For completeness, the domain of the probability mass function is often extended to all x IR (or to x IRd in the multivariate case) by f(x) = Pr(X = x).

It is easy to see that if f is a function which satisfies two natural conditions:

f(x) 0

x IR 
f(x) = 1
then there is a probability space with a random variable X such that f is its probability mass function. In modeling random phenomena we can therefore avoid the difficulties of designing appropriate sample spaces, and pick directly relevant densities. The question, if a density does describe the actual outcomes of experiment is to some extend the question of statistics. Properties of various distributions, like lack-of-memory come also handy when selecting appropriate density function.

For discrete r. v. the cumulative distribution function (12) plays lesser role. It is a discontinuous function given by the expression

F(x) =

v x 
This expression does show up in the ``generic simulation method in Section .

2.1.1  Examples of discrete r. v.

Table list the most frequently encountered discrete distributions.

Name Values Probabilities Symbol Parameters
Binomial 0,...,n Pr(X = k) = (nk)pk(1-p)n-kBin(n,p)0 p 1, n IN
Poisson ZZ\scriptscriptstyle + Pr(X = k) = e-l[(lk)/ k!] Poiss(l)l > 0
Geometric ZZ\scriptscriptstyle + Pr(X = k) = p(1-p)k-1 0 p 1
Uniform {x1,...,xk} Pr(X = xj) = 1/k k IN, x1,...,xk IR

Negative Binomial

Table 2.1: Discrete random variables.

Example 14 Let the random variable X denote the number of heads in three tosses of a fair coin.

Example 15 Let the random variable X denote the score of a randomly selected student on the final exam.

Problem 8 Let N be Poiss(l), and assume N balls are placed at random into n boxes. Find the probability that exactly m of the boxes are empty. ANS: (mn)e-lm/n(1-el/n)n-m.

2.1.2  Simulations of discrete r. v.

Discrete random variables with finite number of values are simulated by assigning values according to the ranges taken by the (pseudo)random uniform random variable U from the random number generator, U=Rand(1). To decide which value of X should be generated, take a partition {0 = a0 a1 ... an-1 an... 1} of interval (0,1). This means that we simulate X = f(U) using a piecewise constant function f on the interval (0,1). If f(x) = vk for x (ak,ak+1), then Pr(X = vk) = ak+1-ak. Therefore we choose a1 = 0, a2 = p1, ...,ak+1 = p1+...+pk. Notice that ak = Pr(X vk) = F(vk).

Other methods are also available for the distributions from Table 2.1. For example, program TOSSCOIN.BAS on page pageref simulates binomial distribution Bin(n=100, p=1/2).

The following exercise provides tools to run more involved simulations. [ 11 Write functions SimOneBin(n,p) and SimOneGeom(p) that will simulate a single occurrence of the Bin(n,p) r. v. and the geometric r. v. The sample usage: PRINT SimOneBin(15,.5) should simulate the number of heads in tossing 15 fair coins.

Also write function SimGeneric(p()) which simulates generic r. v. with values {0,1,...,n} and prescribed probabilities pk = p(k).

2.2  Continuous r. v.

Continuous random variables have uncountable sets of values, and the probability of each of them is zero, Pr(X = x) = 0 for all x IR.

Since probability satisfies continuity axiom (1), Pr(X (a,a+h)) 0 as h 0 for all a. The main interest in continuous case is that the rate of convergence to 0 is also known, Pr(X (a,a+h)) f(a) h +o(h). Function f(x) in this expansion is called the density function.

In terms of the cumulative distribution function 12), the probability is Pr(X (a,a+h)) = F(a+h)-F(a), and thus f(a) = limh 0[(F(a+h)-F(a))/ h] = F(a) is the derivative of the cumulative distribution function F. Therefore when the Fundamental Theorem of Calculus can be invoked (say, when f is piecewise continuous)

F(x) =

f(u) du.

Definition 5 Random variable X is (absolutely) continuous, if there is a function f such that Pr(X U) = U f(x)dx. Function f is called the probability density function of X.

It is known that if f is a function which satisfies two natural conditions:

f(x) 0

f(x) dx = 1
then there is a probability space with a random variable X such that f is its density. This is in complete analogy with the discrete case. In modeling random phenomena we can therefore avoid the difficulties of designing appropriate sample spaces, and pick directly relevant densities. The question, if a density does describe the actual outcomes of experiment is to some extend the question of statistics. Properties of various distributions come also handy when selecting appropriate density function.

It is convenient to use the heuristic probability density function in continuous case corresponds to probability mass function in discrete case, and that expressions that involve in discrete case sums are replaced by integrals, compare (17) and (16).

2.2.1  Examples of continuous r. v.

The following table lists more often encountered densities. Figures and give the graphs of the normal and exponential densities.

Name Range Density Symbol Parameters
Normal - < x < f(x) = [1/({2p}s)]exp-[((x-m)2)/(2s2)]N(m,s)m IR,s > 0
Exponential x > 0 f(x) = le-lx l > 0
Uniforma < x < b f(x) = [1/( b-a)]U(a,b) a < b real
Gamma x > 0 f(x) = Cxa-1e-x/b Gamma(a,b) a > 0,b > 0,C = [1/( baG(a))]

Table 2.2: Continuous random variables.

Missing graph

This graph was generated with java. The browser on your computer does not run java. Please use java-enabled browser, or another computer.

Figure 2.1: Graph of the standard normal N(0,1) density f(x) = (2p)-1/2 e-x2/2.

Missing graph

This graph was generated with java. The browser on your computer does not run java. Please use java-enabled browser, or another computer.

Figure 2.2: Graph of the exponential density f(x) = e-x as the function of x > 0.

Example 16 A dart is thrown at a circular dart board of radius 6. Let X denote the distance of the dart from the center. Assuming uniform assignment of probability (3), the density of X is

f(x) = {
[x/ 18]
if 0 x 6

Problem 9 Referring to Exercise 1.2.2, let X,Y denote the arrival times of the two drivers at the intersection. Find the density of the time lapse |X-Y| between their arrivals.

2.2.2  Histograms

Simulations and experiments do not give direct access to the density, but often a histogram will approximate it reasonably well. Histograms are graphical representations of empirical data. A sample histogram is drawn on the side of the square in Figure on page .

To create a useful histogram, split the range into the finite number of intervals. Then graph over each interval the rectangle with the area equal to the observed frequency. The number, and positioning of intervals depends on the amount of data available, and personal preference.

2.2.3  Simulations of continuous r. v.

The generic method for simulating a continuous random variable is similar to the method used in the discrete case. Namely, we take X = f(U) with the suitable function f.

To find f assume it is increasing and thus invertible with inverse g. Then Pr(f(U) < x) = Pr(U < g(x)) = g(x). This completes the generic prescription: take as f(x) the inverse12 of the cumulative distribution functions F(x) = Pr(X x).

This method of simulation is quite effective if the inverse of F can be found analytically. It becomes slow when the inverse (or, worse still, cumulative distribution function F itself) is computed by numerical procedures. Since this is the case of the normal distribution, special methods are used to simulate the normal distribution.

Example 17 To simulate X which is exponential with parameter l, use X = -[1/( l)]lnU.

2.3  Expected values

Expected values are perhaps the single most important numerical characterization of a random phenomenon. Definition 6 For discrete random variable X the expected value EX is given by EX = v vPr(X = v), provided the series converges.

Expected value captures the intuition of the average of a random quantity. It is also this intuition that leads to estimating the expected value by averaging the outcomes of simulations. Example 18 If X has values x1,...,xn with equal probability, then EX = [`x] is the arithmetic mean of x1,...,xn.

Simulationsare oten used to get answers that are too difficult to find analytically. The following exercise can be answered by simulation if you figure out how to shuffle cards from a deck at random (Exercise 1.4). [ 12 What is the expected number of cards which must be turned over in order to see each of one suit.

Example 19 If X takes two values a < b and Pr (X = a) = p, then EX = pa+(1-p)b is the number in the closed interval [a,b].

Definition 7 For continuous random variable X the expected value EX is given by EX = IRxf(x) dx, provided the integral converges.

Name Probability distribution EX
Normal f(x) = [1/({2p}s)]exp-[((x-m)2)/( 2s2)]m
Exponential f(x) = le-lx [1/( l)] > 0
Uniform f(x) = [1/( b-a)] 1/2(a+b) real
Binomial Pr(X = k) = (nk)pk(1-p)n-knp
Poisson Pr(X = k) = e-l[(lk)/ k!]l
Geometric Pr(X = k) = p(1-p)k-1 1/p
Negative Binomial

Table 2.3: Expected values of some random variables.

Problem 10 Compute EX for the entries in Table 2.3.

[ 13 Simulate EX for the entries in Table 2.3 (except normal) for different values of parameters involved.

Problem 11 [Exercise] Referring to Example 2.2.1, what is the average distance of a dart from the center?ANS: 4.

If you chose to do the simulations, you should perhaps notice that it is rather difficult to decide how many simulations to take in order to achieve the desired accuracy. Typically, you need to increase the size of a simulation four times to half the error.

Another point to keep in mind is that simulations do return numbers. But from numbers alone it is difficlut to see how parameters of the model change it, and this is a more interesting question.

2.3.1  Tail integration formulas

The following tail integration formula is of considerable convenience in theoretical analysis. Theorem 2 For non-negative random variables, both in the discrete, and in the continuous case
EX =

(X > t) dt
(The expected value is finite if and only if the corresponding integral converges.)

Proof. To simplify the proof and expose the main idea more clearly, consider the discrete case with a finite number of values. Similarly, to avoid technicalities we consider continuous case with bounded range only.

Discrete case:

EX = x1p1+x2p2+...+xnpn = x1(p1+p2+...+pn-( p2+...+pn))+x2( p2+...+pn-( p3+...+pn))+...+xn-1( pn-1+pn -pn)+xnpn = x1( p1+p2+...+pn)+(x2-x1)( p2+...+pn-(p3+...+pn))+...+( xn-1-xn-2)( pn-1+pn)+(xn-xn-1)pn The latter is 0xn Pr(X > t)  dt.

Continuous case: Let f denote the density and F be the cumulative distribution function. Integrating by parts EX = 0b x f(x)  dx = b-0bF(x) dx = 0b(1-F(x) dx. [¯]
If X is discrete integer valued X {0,1,...}, then (21) can be written as

EX =

n = 0 
(X > n).

It is natural to define EX for more general r. v. in the same way through formula (21). Write X = X+-X- to decompose X into its non-negative, and non-positive parts, and then define EX = 0 P(X > t) dt-0 P(X < -t) dt. Clearly, if one of the integrals diverges, EX is not defined.

Example 20 Suppose X is exponential with the density f(x) = e-x. Let Y be X truncated at level 3. That is

Y = {
if X 3
if X 3

Clearly, Y is not continuous, as Pr(Y = 3) = Pr(X > 3) = e-3 > 0. On the other hand, Y is not discrete as it takes uncountable number of values; in fact all the numbers between 0 and 3 are possible. The definitions of the expected value we gave do not apply, but (21) can be used to show that E(Y) = 1-e-3.

Indeed, Pr(Y > t) = Pr(X > t) = e-t for 0 < t < 3, and Pr(Y > t) = 0 for t > 3.

Example 21 To determine the mean of the geometric distribution we can either compute the sum pn = 1 n (1-p)n-1, or use (21) and find the easier sum n = 0 (1-p)n.

2.3.2  Chebyshev-Markov inequality

The following inequality is known as Markov's, or Chebyshev's inequality. Despite its simplicity it has numerous non-trivial applications, see eg. Theorem .

[ 1 If U 0 then

P(U > t) EU

Indeed, by (21) we have EU = 0 Pr(U > x) dx 0tPr(U > x) dx tP(|X| > t).

Problem 12 Suppose U is uniform U(0,1). Then Pr(U > t) [1/ 2t]. This means that 1-t < [1/ 2t].

  1. Is the above inequality ``sharp"? (Graph both curves).
  2. Is (21) sharp? That is, given t > 0, is there an X > 0 such that equality occurs?

2.4  Expected values and simulations

Expected value EX is approximated without much difficulty13 on a computer by averaging large number of independent trials.

Example 22 To find the average of the uniform U(0,1) distribution, take 1/n(U1+...+Un).

This is the basis of Monte-Carlo methods , which is a family of related probabilistic methods for computing the integrals. To find ab f(x) dx we simulate Xj = f(a+bUj). The average 1/nj = 1nXj approximates [1/( b-a)]ab f(u) du for large n.

The variance of the n-th approximation is of the order n-1/2, which is worse than the trapezoidal rule for smooth functions. In return the approximation is insensitive to the smoothness of the integrands, and also to the dimension of the integral. Monte Carlo methods can be used effectively for multiple integrals over irregular domains.

[ 14 Use Monte Carlo method to approximate p = -112[(1-x2)]  dx. (You may want to compare the output with numerical procedures described in Section .)

Another method of similar nature is to pick points (X,Y) at random from the rectangle containing the graph of f and check if Y < f(X) holds. The proportion of ``successes" approximates the proportion of the area under the graph of f.

[ 15 Approximate p/4 by selecting points (X,Y) at random from the unit square, and checking if X2+Y2 < 1.

The following sample program computes numerically double integral Ucos(10x+20y) dxdy over a circle x2+y2 = 1. The only conceptual difficulty as compared to single integrals is how to select points at random from the unit disk. This is done by picking points from a bigger square and discarding those that didn't make it. Can you do this integral analytically? Or by another numerical procedure?

PROGRAM dblint.bas

DECLARE FUNCTION Integrand! (X!, Y!)

CONST True = -1

' simulation loop
NumTrials = 10000
FOR j = 1 TO NumTrials
        'select random points from the square [-1,1]x[-1,1]
        X = 2 * RND(1) - 1
        Y = 2 * RND(1) - 1
        'check if this is in the domain
        IF InDomain(X, Y) THEN
                NumTested = NumTested + 1
                Sum = Sum + Integrand(X, Y)
                Var = Var + Integrand(X, Y) ^ 2
        END IF

'Print the answer
PRINT "Examined "; NumTested; " random points"

IF NumTested = 0 THEN END 'nothing found
N = NumTested
PRINT "The integral is approximately "; Sum / N
PRINT "With 95% confidence  the error is less than "; 1.96 * SQR(Var / N - (Sum / N) ^ 2) / SQR(N)


FUNCTION InDomain (X, Y)
'This function checks if $x,y$ is in the integration domain
'The definition of the domain can be easily modified here, including
'more complicated domains
IF X ^ 2 + Y ^ 2 < 1 THEN InDomain = True

FUNCTION Integrand (X, Y)
'This is the function to be integrated

Integrand = COS(10 * X + 20 * Y)

It is important to have some idea about how accurate the answer is. The program uses the to estimate the magnitude of the error. A less sharp (and thus safer to use!) error estimate can be obtained from (23)

[ 7 There are two natural choices to estimate by simulation events of small probability. We can pick a large sample size n, run the simulation experiment and hopefully get several data points. The outcome X of such an experiment is a binomial random variable with unknown probability p of success, and we would estimate p X/n. The trouble is that if we don't know how small the chances are, we might get none, estimating probability to be zero. Or we could run the experiment until we get the prescribed number of successes. The observation would then consist of a set of geometric random variables T1,..., Tk with unknown mean ET = 1/p. We could then estimate p = 1/ET by taking the inverse of the arithmetic mean of Tj. In this approach we are guaranteed to get some observations, as long as p 0.

The question is Which of the methods would you recommend to use? (Of course, you would recommend a better method, but what the word ``better" might mean here?)

2.5  Joint distributions

Often an experiment involves measuring two or more random numbers, say X and Y. The fact that we know the distribution of X, and the distribution of Y separately doesn't determine probabilities of events that involve both X and Y simultaneously. Example 23 Suppose

(X = 1,Y = 1) = 1
(X = -1,Y = -1) = 1
(X = -1,Y = 1) = 1
(X = 1,Y = -1) = 1
Then Pr(X = 1) = Pr(Y = 1) = 1/2 regardless of the value of e. On the other hand, Pr(X = Y) = 1/2+2e clearly depends on the value of e.

It is clear that if X is discrete, and Y is discrete, then (X,Y) is an IR2 valued discrete r. v. That is, the values of the pair are countable. Probabilities Pr(X = x, Y = y) are called the joint distribution. Corresponding Pr(X = x) and Pr(Y = y) are the so called marginals. Example 2.5 points out that marginals do not determine joint probabilities uniquely. But if we know the joint probabilities then we can compute the marginals, eg Pr(X = x) = yPr(X = x, Y = y).

In contrast to the discrete case, joint continuity can not be recognized from the continuity of the components, and requires full definition.

Definition 8 Let X = (X1,...,Xn). Random variables X1,...,Xn are jointly (absolutely) continuous, if there is a function f such that

(X U) =

f(x1,... xn) dx1 ... dxn
for all measurable U. Function f is then called the probability density function of X.

Example 24 Suppose X,Y have uniform distribution in the unit disk. Then the joint density is f(x,y) = 1/p for x2+y2 1 and the density of X is fX(x) = [2/( p)][(1-x2)].

The relation between probabilities and the density is

f(a,b) = 2
a b
(X a, Y b).
Occasionally this can be used to determine the joint density.

2.5.1  Independent random variables

Independence of random variables is defined in terms of joint distributions. The intuition behind this definition is that the events that random variables may generate should be independent. Notice however that the actual definition is simpler than Definition 1.6. Can you explain why? Definition 9 Random variables X1,...,Xn are independent, or stochastically independent, if
(X1 U1, ..., Xn Un) = Pr
(X1 U1) ... Pr
(Xn Un)
for all measurable U1,..., Un IR.

(Similarly we define the stochastic independence of random vectors).

We say that X1,...,Xn are independent identically distributed (i. i. d) if (26) holds and Pr(Xi U) = Pr(Xj U) for all Borel U IR. [ 2 If X,Y are discrete with the probability mass function f(x,y), then independence of X,Y is equivalent to f(x,y) = fX(x)fY(y).

If X,Y are continuous with the joint density f(x,y) then independence of X,Y is equivalent to f(x,y) = fX(x)fY(y).

Independence is often part of the model. Independence allows to determine joint distributions from marginals. Thus each independent random variable can be analyzed separately, and then more complex questions can be answered. From the mathematical perspective, under independence we can determine joint distributions if we know marginals. Example 25 Suppose X is binomial Bin(n,p) and Y is Poiss(l). If X,Y are independent, then the joint probability mass function of X,Y is given by f(x,y) = (xn)pn(1-p)n-xe-lly/y! (or 0).

Example 26 [Example 1.2.2 continued] Two drivers arrive at an intersection between 8:00 and 8:01 every day. Their arrival times are independent random variables. Indeed, using formula (25) and elementary area computation, Pr(X < a,Y < b) = ab for 0 < a,b < 1.

2.6  Functions of r. v.

Some random variables are obtained by taking functions of another ones, possibly multi-dimensional. In the notes we often limit our attention to a single random variable Z given by a function Z = f(X,Y) of two arguments; this is convenient for notation and exhibits most of the interesting techniques.

Sums and linear combinations, medians, maxima, and minima are perhaps the most often encountered functions of multidimensional random variables. Methods to compute the distribution, or the expected value of such a function are of considerable practical significance.

2.7  Moments of functions

If Z = f(X,Y) has the expected value, then EZ can be computed directly without computing the density, or probability mass function of Z. The following identity is useful.

If X,Y are discrete and EZ exists, then

Ef(X,Y) =

(X = x, Y = y).

If X,Y are jointly continuous then Z might be continuous, discrete, or say of mixed type. Regardless of the case

Ef(X,Y) =

f(x,y)f(x,y) dx dy,
and the double integral converges if EZ exists. Conversely, if the integral converges, then EZ exists and is given by formula (28).

In particular the expected value is linear

E(aX+bY+c) = aEX+bEY+c.
This can be easily verified using (28), but the identity (28) is beyond the scope of this notes, as we do not want to dwell on the general definition of EZ that would encompass all cases.

The fact that expected value is linear provides a simple method of computing some otherwise difficult sums. Example 27 Suppose X is Binomial Bin(n,p). Then X = X1+...+Xn, where Xj is the number of successes in j-th trial. Clearly each Xj is 0, or 1, and EXj = p.

[ 16 [Example 1.2.2 continued] Two drivers arrive at an intersection between 8:00 and 8:01 every day. On average how much time lapses between their arrivals?

Definition 10 Let m = EX. The variance Var(X) is defined as Var(X) = E(X-m)2.

Notice that

Var(aX+b) = a2Var(X).
In particular, Var(X) = 0 when X = const. Sometimes a more convenient expression for the variance is
Var(X) = EX2-(EX)2 = E(X-EX)2.

The standard deviation is s = [Var(X)].

Name Probability distribution Var(X)
Normal f(x) = [1/({2p}s)]exp-[((x-m)2)/( 2s2)]s2
Exponential f(x) = le-lx [1/( l2)]
Uniform f(x) = [1/( b-a)] [1/ 12](b-a)2
Binomial Pr(X = k) = (nk)pk(1-p)n-knp(1-p)
Poisson Pr(X = k) = e-l[(lk)/ k!]l
Geometric Pr(X = k) = p(1-p)k-1 [1/( p2)]
Negative Binomial

Table 2.4: Variances of some random variables.

Problem 13 Compute variances Var(X) for the entries in Table 2.4. (Some of these are a real challenge to your computational skills, so you may safely give up. Another method will make it easier in Chapter ).

Since Var(X) = EX2-(EX)2 0, the following inequality follows

(EX)2 EX2

Definition 11 The covariance of random variables X,Y with expected values mX, mY is defined as cov(X,Y) = E(X-mX)(Y-mY).

Clearly Cov(X,X) = Var(X) and Var(X+Y) = Var(X)+Var(Y)+2cov(X,Y).

Theorem 3 If X,Y are independent, then Var(X+Y) = Var(X)+Var(Y).

Example 28 let X1,X2,...,Xn be independent {0,1}-valued random variables, and suppose that Pr(Xj = 1) = p. Then Var(j = 1nXj) = np(1-p). What is the distribution of j = 1nXj?

Tail integration formula revisited

Example 29 If X > 0 then EeX = 1+0 et P(X > t) dt

Problem 14 Show that EX2 = 0 t P(|X| > t) dt.

Generalize this formula to E|X|p.

Chebyshev-Markov inequality

Special cases of Chebyshev's inequality (23) are:

(|X| > t) < 1
(|X-m| > t) < 1
(|X| > t) < e-atEeaX

Chebyshev's inequality is one reason we often strive for small average quadratic error. If E|X-X0|2 < e then we can be sure that Pr(|X-X0| > 3{e}) < 3{e}. The following may be used (with some caution) in computer programs to asses error in estimating probabilities by sampling.

Example 30 If X is Bin(n,p) then Pr(|X/n-p| > [1/( 3n)]) [1/( 43n)]

[ 17 Run a simulation of the event that you know probability of, printing out the error estimate.

Do the same for the event that you don't know the probability analytically. (Use programs written for previous exercises)

2.7.1  Simulations

The unknown variance s2 of a sequence of simulated random variables Xj can be approximated by 1/n j = 1n (Xj-[`X])2, where [`X] is the arithmetic mean of X1,..., Xn. Thus can also use (33) and Theorem 2.7 to asses errors in estimating variances. Another more accurate method is presented later on in Chapter , but it also requires estimating variances.

From now on, in the output of your simulation programs you should provide some error estimates.

2.8  Application: scheduling

A critical path analysis involves estimating time of completing a project consisting of many tasks of varying lengths. Some of the tasks can be done concurrently, while other may begin only after other preliminary tasks are completed. This is modeled by the dependency graph together with the estimated times.

2.8.1  Deterministic scheduling problem

As an example of simple scheduling problem consider the following. Example 31 Suppose that we want to bake a batch of chocolate-chip cookies. The tasks and their (estimated) times are:

The dependency graph is quite obvious here; for instance, we cannot start baking before batter is ready. What is the shortest time we can eat the cookies?

2.8.2  Stochastic scheduling problem

In some projects, the actual numbers are only the averages, and the actual completion times of the projects may be random. The distribution of the actual completion time, or even its average may be difficult to compute analytically. Nevertheless, simulations let us estimate the average and analyze the probabilities of events.

[ 18 Suppose for the sake of this exercise that the numbers presented in the cookie-baking example are just the average values of the exponential random variables.

2.8.3  More scheduling questions

In more realistic analysis of production processes we also have to decide how to split the tasks between available personnel. Example 32 This exercise refers to the tasks presented in Example 2.8.1. On average, how fast can a single person bake chocolate-chip cookies? What if there are two people?

2.9  Distributions of functions

Distributions of functions of random variables are often difficult to compute explicitly. Special methods deal with more frequent cases.

Sums of discrete r. v.

Sums can be handled directly, but a more efficient method uses generating functions of Chapter .

Suppose X,Y are discrete and f(x,y) = Pr(X = x, Y = y) is their joint probability mass function. Then Z = X+Y is discrete with values z = x+y. Therefore

fZ(z) =


For independent random variables X, Y this takes a slightly simpler form.

fZ(z) =


Formula (36) can be used to prove the so called summation formulas. Theorem 4 If X, Y are independent Binomial with the same parameter p, ie. X is Bin(n,p) and Y is Bin(m, p), then X+Y is Binomial Bin(n+m,p).

If X, Y are independent Poisson Poiss(lX) and Poiss(lY), then X+Y is Poisson Poiss(lX+lY).

Sums of continuous r. v.

One can prove that if X,Y are independent and continuous than X+Y is continuous with the density

f(z) =

fX(u)fY(z-u)  du

Formula (37) defines the convolution fX*fY. It can be used to prove the so called summation formulas. Theorem 5 If X, Y are independent Normal, then X+Y is Normal.

If X, Y are independent Gamma with the same parameter b, then X+Y is Gamma(aX+aY,b).

Example 33 [Example 1.2.2 continued] Two drivers arrive at an intersection between 8:00 and 8:01 every day. What is the density of the time that lapsed between their arrivals?

Example 34 Suppose X,Y are independent U(0,1). The density of Z = X+Y is

f(z) = {
if 0 z 1
if 1 z 2

Minima, maxima

Minima and maxima occur for instance if we are waiting for one of the independent events, and then we follow the first one (minimum), or the last one (maximum). Embedded Markov chains construction in Section are based on minima of independent exponential r. v.

Suppose X,Y are independent. If U = min{X,Y} then Pr(U > t) = Pr(X > t)Pr(Y > t). Therefore the reliability function of U can be computed from the two given ones.

Example 35 If X, Y are independent exponential, then U = min{X,Y} is exponential.

If U = max{X,Y} then Pr(U < t) = Pr(X < t)Pr(Y < t). Therefore the cumulative distribution function of U can be computed from the two given ones.

Example 36 Suppose X,Y are independent uniform U(0,1). Then U = max{X,Y} has the density f(u) = 2u for 0 < u < 1.

Problem 15 Let U1,..., Un be independent uniform U(0,1). Find the density of

Problem 16 If X,Y are independent exponential random variables with parameters l, m, show that Pr(X < Y) = [(l)/( l+m)].

Order statistics

Order statistics generalize minima and maxima. Their main use is in (robust) statistics, and this section can be safely skipped. Let X1,..., Xn be independent continuous random variables with the cumulative distribution function G and density g = G.

Let R1,..., Rn be the corresponding order statistics. This means that at the end of each experiment we re-arrange the numbers X1,... Xn into the increasing sequence R1,..., Rn. This means that R1 = minj Xj is the smallest, R2 = maximinj i Xj is the second largest, etc.

The density of Rk can be found by the following method. In order for the inequality Rk > x to hold, there must be at least k values among the Xj above level x. Since Xj are independent and have the same probability p = Pr(Xj > x) of ``success" in crossing over the x-level, this means that Pr(Rk > x) is given by the binomial formula with n trials and probability of success p = 1-G(x).

(Rk > x) = n

j = k 

When the derivative is taken, the sum collapses into just one term, giving the elegant answer rk(x) = [n!/( (k-1)!(n-k)!)](G(x))k(1-G(x))n-kg(x).

2.10  L2-spaces

Inequalities related to expected values are best stated in the geometric language of norms and normed spaces. We say that X L2, if X is square integrable, ie. EX2 < .

The L2 norm is

||X||2 =   _____

Notice that ||X-EX||2 is just another notation for the standard deviation. Thus standard deviation is the L2 distance of X from a constant.

We say that Xn converges to X in L2, if ||Xn-X||2 0 as n . We shall also use the phrase sequence Xn converges to X in mean-square. An example of the latter is Theorem .

Several useful inequalities are collected in the following14. Theorem 6 For all square-integrable X,Y

Proof. Proof of (39): Quadratic function f(t) = E|X+tY|2 = EX2+2tEXY+t2EY2 is non-negative for all t. Therefore its determinant D 0. (Compute D to finish the proof.)

Proof of (40): Use (39) with Y = 1.

Proof of (41): By (39) E|X+Y|2 ||X||22+||Y||22+2||X||2||Y||2. [¯]

2.11  Correlation coefficient

Correlation is a concept deeply rooted in statistics. The correlation coefficient corr(X,Y) is defined for square-integrable non-degenerate r. v. X, Y by the formula
r = corr(X,Y) = EXY-EXEY
The Cauchy-Schwarz inequality (39) implies that -1 corr(X,Y) 1.

Random variables with r = 0 are called uncorrelated. Correlation coefficient close to one of the extremes 1 means that there is a strong linear relation between X,Y; this is stated more precisely in ().

Theorem 2.7 states that independent random variables with finite variances are uncorrelated.

Problem 17 Give an example of dependent random variables that are uncorrelated.

2.11.1  Best linear approximation

Suppose we would like to approximate random variable Y by another quantity X that is perhaps better accessible. Of all the possible ways to do it, linear function Y mX+b is perhaps the simplest. Of all such linear functions, we now want to pick the best. In a single experiment, the error is |Y-mX-b|. We could minimize the average empirical error over many experiments 1/nj|Yj-mXj-b|. This approximates the average error E|Y-mX-b|. Let us agree to measure the error of the approximation by a quadratic error E(Y-mX-b)2 instead. (This choice leads to simpler mathematics.)

Question: For what values of m, b the error E(Y-mX-b)2 is the smallest? When is it 0?

Let H(m,b) = E(Y-mX-b)2. Clearly, H is a quadratic function of m,b IR. The unique minimum is determined by the system of equations

= 0
= 0.

The answer is m = [cov(X,Y)/ Var X], b = EY-mEX, and the minimal error is


2.12  Application: length of a random chain

A chain in the x,y-plane consists of n links each of unit length. The angle between two consecutive links is a, where a > 0 is a constant. Assume the sign is taken and random, with probability 1/2 for each. Let Ln be the distance from the beginning to the end of the chain. The angle between the k-th link and the positive x-axis is a random variable Sk-1, where we may assume (why?) S0 = 0 and Sk = Sk-1+xka, where x = 1 with probability 1/2. The following steps determine the average length of the random chain.

  1. Ln2 = (k = 0n-1cosSk)2+(k = 0n-1sinSk)2.
  2. EcosSn = cosn a
  3. EsinSn = 0
  4. EcosSm cosSn = cosn-maEcos2 Sm for m < n
  5. EsinSm sinSn = cosn-maEsin2 Sm for m < n
  6. ELn2-Ln-12 = 1+2cosa[(1-cosn-1a)/( 1-cosa)]
  7. ELn2 = n [(1+cosa)/( 1-cosa)] -2cosa[(1-cosna)/( (1-cosa)2)]

2.13  Conditional expectations

2.13.1  Conditional distributions

For discrete X,Y the conditional distribution of variable Y given the value of X is just the conditional probability, Pr(Y = y|X = x). In jointly continuous case, define the conditional density
f(y|X = x) = f(x,y)
Conditional density f(y|X = x) is defined only for x such that fX(x) > 0; this is a reasonable approach for the most often encountered continuous, or piecewise continuous densities. Since the densities are actually the elements of L1 space rather than functions, special care is needed in the definition of the conditional density. In fact the theory of probability is often developed without the reference to conditional distributions.

Definition 12 The conditional expectation E{X|Y = y} is defined as xPr(X = x|Y = y) in discrete case, and as IR x f(x|Y = y)  dx in the continuous case. One can show that the expected values exist, when E|X| < .

Example 37 A game consists of tossing a die. If the face value on the die is X then a coin is tossed X times. Let Y be the number of heads. Then E(Y|X = x) = 1/2x.

2.13.2  Conditional expectations as random variables

Since E(X|Y = y) depends on the actual value of Y, and Y is random, the conditional expectation is a random variable itself. We shall write E{X|Y} or EYX for the random variable defined by the conditional expectation E{X|Y = y}.

Example 38 Suppose Y is a discrete with different values on the events A1, A2, , An which form a non-degenerate disjoint partition of the probability space W. Then

E{X|Y}(w) = n

k = 1 
mk IAk(w),
where mk = AkX dP /P(Ak). In other words, on Ak we have E{X|\cal F} = AkX dP /P(Ak). In particular, if X is discrete and X = xj IBj, then we get intuitive expression
E{X|\cal F} =
xj P(Bj|Ak) for w Ak.

Example 39 Suppose that f(x, y) is the joint density with respect to the Lebesgue measure on IR2 of the bivariate random variable (X, Y) and let fY(y) 0 be the (marginal) density of Y. Put f(x|y) = f(x, y)/fY(y). Then E{X|Y} = h(Y), where h(y) = - x f(x|y) dx.

Total probability formula for conditional expectations is as follows.

EY = E(E(Y|X))
compare (10). Example 40 In Example 2.13.1, EY = 3/2.

2.13.3  Conditional expectations (continued)

In discrete case conditional expectations of functions are given by,

E(g(X)|Y = y) =

(X = x|Y = y)
The following version of total probability formula is often useful.

Eg(X) = E(E(g(X)|Y))
Example 41 Suppose N is Binomial Bin(m,q) and given the value of N, r. v. X is Bin(N,p). What is the distribution of X?

Similar question can be solved for N having a Poisson distribution.

Example 42 What is the distribution of a geometric sum of i. i. d. exponential r. v.?

Example 43 Stock market fluctuations can be modelled by Z = x1+...+xN, where N, the number of transactions, is Poisson(l) and x are normal N(0,s). There is no explicit formula for the density of Z, but there is one for the moment generating function. Thus Chebyshev inequality gives bounds of the form Pr(Z > t) exp....

2.14  Best non-linear approximations

This section explains the relevance of conditional expectations to the problem of best mean-square approximation.

Theorem gives geometric interpretation of the conditional expectation E{·|Z}; for square integrable functions E{.|Z} is just the orthogonal projection of the Banach (normed) space L2 onto its closed subspace L2(Z), consisting of all 2-integrable random variables of the form f(Z).

Theorem 7 For square integrable Y the quadratic error E(Y-h(X))2 among all square integrable functions h(X) is the smallest if h(x) = E(Y|X = x).

Theorem 2.14 implies that the linear approximation from Section is usually less accurate. In Chapter we shall see that linear approximation is the best one can get in the all important normal case. Even in non-normal case linear approximations offer quick solutions based on simple second order statistics. In contrast, the non-linear approximations require elaborate numerical schemes to process the empirical data.

2.15  Lack of memory

Conditional probabilities help us to arrive at important classes of densities in modeling. In this section we want to analyze an non-aging device, which characteristics do not change with time.

Suppose T represents a failure time of some device. If the device is working at time t, then the probability of surviving additional s seconds is Pr(T > t+s|T > t). For a device that doesn't exhibit aging this probability should be the same as for the brand new device.

(T > t+s|T > t) = Pr
(T > s)
Problem 18 Show that exponential T satisfies (47).

Problem 19 Show that geometric T satisfies (47) for integer t,s.

Equation (47) implies Pr(T > t+s) = Pr(T > t)Pr(T > s), an equation that can be solved. Theorem 8 If T > 0 satisfies (47) for all t,s > 0, then T is exponential.

Proof. The tail distribution function N(x) = P(T > x) satisfies equation

N(x+y) = N(x)N(y)
for arbitrary x, y > 0. Therefore to prove the theorem, we need only to solve functional equation (48) for the unknown function N(·) under the conditions that 0 N (·) 1, N(·) is left-continuous, non-increasing, N(0+) = 1, and N(x) 0 as x .

Formula (48) implies that for all integer n and all x 0

N(nx) = N(x)n.
Since N(0+) = 1 and N(·), it follows from (49) that r = N(1) > 0. Therefore (49) implies N(n) = rn and also N(1/n) = r1/n (to see this, plug in (49) values x = 1 and x = 1/n respectively). Hence N(n/m) = N(1/m)n = rn/m (by putting x = 1/m in (49)).

This shows that for rational q > 0

N(q) = rq.
Since N(x) is left-continuous (why?), N(x) = limq\nearrow x N(q) = rx for all x 0. It remains to notice that since N(x) 0 as x , we have r < 1. Therefore r = exp(-l) for some l > 0 and N(x) = exp(-lx), x 0. [¯]
Remark 1 Geometric distribution also has the lack of memory property. If equation (48) is assumed to hold for integer values of x,y only, and T > 0 is integer valued, then T is geometric.

2.16  Intensity of failures

The intuitive lack-of-memory, or non-aging property of the exponential distribution can be generalized to include simple models of aging. We may want to assume that a component analyzed becomes less reliable, or more reliable with time. An example of the first one is perhaps a brand new car. An example of the latter is perhaps a software operating system when updates are installed promptly.

Let T > 0 be a continuous r. v. interpreted as a failure time of a certain device. If the device is in operational condition at time t, then the probability that it will fail immediately afterwards may be assumed negligible. The probability of failing within h units of time is Pr(T < t+h|T > t). The failure rate at time t is defined as

l(t) =
h 0 
(T < t+h|T > t)

Example 44 If T is exponential then the failure rate is constant.

A family of failure rates that exhibit interesting aging patterns is provided by the family of power functions l(t) = ta.

Theorem 9 If T is continuous with failure rate l(t) = ta, where a > 0 then T has the Weibull density:

f(t) = C ta-1e-bta for t > 0.
(Here C = ab is the normalization constant).

2.17  Poisson approximation

Of the discrete distributions, the formula for the Poisson distribution is perhaps mysterious. Poisson distribution is often called the law of rare events. Theorem 10 Suppose Xn are Bin(n,pn) and npn l. Then Pr(Xn = k)e-llk/k!.

Proof. Rewrite the expression (kn)[(lk)/( nk)](1-[(l)/ n])n-k = lk/k!(1-l/n)n/(1-l/n)kj = 0k(1-j/n). [¯]
Example 45 How many raisins should a cookie have on average so that no more than one cookie in a hundred has no raisins? So that no more than one cookie in a thousand has no raisins?

Example 46 A bag of chocolate chip cookies has 50 cookies. The manufacture claims there are a 1,000 chips in a bag. Is it likely to find a cookie with 15 or less chips in such a bag?

2.18  Questions

Problem 20 The performance of the algorithm for selecting a random permutation in GetPermutation SUB of RANDTOUR.BAS can be estimated by the following.

From numbers 1, ... n, select at random k > n numbers. On average, how many of these numbers repeat? (and hence should be thrown out)

Problem 21 The performance of the algorithm for selecting a random permutation in GetPermutation SUB of RANDTOUR.BAS can be estimated by analyzing the following ``worst-case" scenario.

When the algorithm attempts to select last of the random numbers 1, ... n, then

  1. What is the probability if will find the ``right number" on first attempt?
  2. How many attempts on average does the algorithm take to find the last random number?

[ 19 What is the probability that in the group of 45 people one can find two born on the same day of the year? (Compare Example 1.5.1).

Problem 22 For a group of n person, find the expected number of days of the year which are birthdays of exactly k people. (Assume 365 days in a year and that all birthdays are equally likely.)

Problem 23 A large number N of people are subject to a blood test. The test can be administered in one of the two ways:

(i) Each person can be tested separately (N tests are needed).

(ii) The blood sample of k people can be pooled (mixed) and analyzed. If the test is positive, each of the k people must be tested separately and in all k+1 tests are then required for k people.

Assume the probability p that the test is positive is the same for all people and that the test results for different people are stochastically independent.

  1. What is the probability that the test for a pooled sample of k people is positive?

  2. What is the expected number of tests necessary under plan (ii)?

  3. Find the equation for the value of k which will minimize the expected number of tests under plan (ii).
  4. Show that the optimal k is close to [1/( p)] and hence that the minimum expected number of tests is on average about [2N/( p)].

[ 20 Solve Exercise 1.2.2 on page pageref assuming that the arrival times are exponential rather than uniform. Assume independence.

[ 21 There are 3 stop-lights spaced within 1km of each other and operating asynchronously. (They are reset at midnight.) Assuming each is red for 1 minute and then green for one minute, what is the average time to pass through the three lights by a car that can instantaneously accelerate to 60km/h.

This exercise can be developed into a simulation project that may address some of the following questions

More theoretical questions

Problem 24 [Hoeffding] Show that if XY, X, Y are discrete, then


(P(X t, Y s)-P(X t)P(Y s)) dt ds.

Problem 25 Let X 0 be a random variable and suppose that for every 0 < q < 1 there is T = T(q) such that

P(X > 2t) q P(X > t) for all t > T.
Show that all the moments of X are finite.

Problem 26 If x, U are independent, Pr(x = 0) = Pr(x+1) = 1/2 and U is uniform U(0,1). What is the distribution of U+x?

Chapter 3
Moment generating functions

3.1  Generating functions

Properties of a sequence {an} are often reflected in properties of the generating function h(z) = nanzn.

3.2  Properties

The moment generating function of a real-valued random variable X is defined by MX(t) = Eexp(tX). If X > 0 has the density f(x), the moment generating function is its Laplace transform: M(t) = 0 etx f(x) dx.

A moment generating function is non-negative, and convex (concave up). The typical example is the moment generating function of the {0,1}-valued random variable.

Missing graph

This graph was generated with java. The browser on your computer does not run java. Please use java-enabled browser, or another computer.

Figure 3.1: Graph of the moment generating function M(t) = 3/4+1/4et.

A linear transformations of X changes the moment generating function by the following formula.

MaX+b(t) = etbMX(at).

Important properties of moment generating functions are proved in more theoretical probability courses15.

Theorem 11 (i) The distribution of X is determined uniquely by its moment generating function M(t).

(ii) If X, Y are independent random variables, then MX+Y(t) = MX(t) MY(t) for all t IR.

(iii) M(0) = 1, M(0) = EX, M(0) = EX2

Name Distribution Moment generating function
Normal N(0,1) f(x) = [1/({2p})]exp-[(x2)/ 2]M(t) = et2/2
Exponential f(x) = le-lx M(t) = [(l)/( l-t)]
Uniform U(-1,1) f(x) = 1/2 for -1 x 1 M(t) = 1/tsinht
Gamma f(x) = 1/G(a)b-a xa-1exp-x/b M(t) = (1-bt)-a
Binomial Pr(X = k) = (nk)pk(1-p)n-kM(t) = (1-p+pet)n
Poisson Pr(X = k) = e-l[(lk)/ k!]M(t) = expl(et-1)
Geometric Pr(X = k) = p(1-p)k-1 M(t) = [(pet)/( 1-(1-p)et)]

Table 3.1: Moment generating functions.

Problem 27 Find moment generating functions for each of the entries in Table 3.1.

Problem 28 Use moment generating functions to compute EX, Var(X) for each of the entries in Table 2.4.

Problem 29 Prove the summation formulas stated in Theorems 36 and 37

For a d-dimensional random variable X = (X1, , Xd) the moment generating function MX: IRd \sf CC is defined by MX(t) = Eexp(t·X), where the dot denotes the dot (scalar) product, ie. x·y = xkyk. For a pair of real valued random variables X, Y, we also write M(t, s) = M(X, Y)((t, s)) and we call M(t, s) the joint moment generating function of X and Y.

The following is the multi-dimensional version of Theorem 3.2.

Theorem 12 (i) The distribution of X is determined uniquely by its moment generating function M(t).

(ii) If X, Y are independent IRd-valued random variables, then

MX+Y(t) = MX(t) MY(t)
for all t in IRd.

3.2.1  Probability generating functions

For ZZ\scriptscriptstyle + -valued random variables it is convenient to consider the so called generating function G(z) = M(lnz). In this case Theorem 3.2(i) is elementary, as G(z) = k = 0 pkzk determines uniquely its Taylor series coefficients pk = Pr(X = k).

3.3  Characteristic functions

The major nuisance in using the moement generating functions is the fact that the moment generating functions may not exist, when the definiting integral diverges.

For this reason it is more preferable to use an expression that is always bounded, and yet has the same convenient algebraic properties. The natural candidate is

eix = cosx+isinx.
The characteristic function is accordingly defined as

fX(t) = EeitX.

For symmetric random variables complex numbers can be avoided at the expense of trigonometric identities. Example 47 If X is -1,1 valued, Pr(X = 1) = 1/2, then f(t) = cost.

Example 48 The characteristic function of the normal N(0,1) distribution is f(t) = e-t2/2.

3.4  Questions

Problem 30 Let Sn = X1+... + Xn be the sum of mutually independent random variables each assuming the values 1,2,...,a with probability 1/a.

  1. Show that EeuSn = ([(eu(1-eau))/( a(1-eu))])n.
  2. Use the above identity to show that for k n
    (Sn = k) = a-n

    j = 0 

    (For a = 6 Pr(Sn = k) is the probability of scoring the sum k+n in a throw with n dice. The solution of this problem is due to de Moivre.)

Problem 31 Suppose the probability pn that a family has exactly n children is apn when n 1 and suppose p0 = 1-a[p/( 1-p)]. (Notice that this is a constaint on the admissible values of a, p since p0 0.

Suppose that all distributions of the sexes for n children are equally likely. Find the probability that a family has exactly k girls.
Hint: The answer is at first as the infinite series. To find its sum, use generating function, or negative binomial expansion: (1+x)-k = 1+[(k+1)/ 1!]x+[((k+1)k+2)/ 2!]x2+....


[(2apk)/( (2-p)k+1)].

Problem 32 Show that if X 0 is a random variable such that

P(X > 2t) 10 (P(X > t))2 for all t > 0,
then Eexp(l|X|) < for some l > 0.

Problem 33 Show that if Eexp(lX2) = C < for some a > 0, then

Eexp(tX) Cexp( t2
for all real t.

Problem 34 Prove that function f(t): = Emax{X, t} determines uniquely the distribution of an integrable random variable X in each of the following cases:

Problem 35 Let p > 2. Show that exp(|t|p) is not a moment generating function.

Chapter 4
Normal distribution

Next to a stream in a forest you see a small tree with tiny, bell-shaped, white flowers in dropping clusters.
The Auborn Society Field Guide to North American Trees.

The predominance of normal distribution is often explained by the . This theorem asserts that under fairly general conditions the distribution of the sum of many independent components is approximately normal. In this chapter we give another reason why normal distribution might occur. The usual definition of the standard normal variable Z specifies its density f(x) = [1/( {2p})]e-[(x2)/ 2], see Figure 2.1 on page pageref. In general, the so called N(m, s) density is given by

f(x) = 1

e -[((x-m)2)/(2s2)].
By completing the square one can check that the moment generating function M(t) = EetZ = - eitx f(x) dx of the standard normal r. v. Z is given by
M(t) = e[(t2)/ 2].

In multivariate case it is more convenient to use moment generating functions directly. For consistency we shall therefore adopt the following definition.

Definition 13 A real valued random variable X has the normal N(m, s) distribution if its moment generating function has the form

M(t) = exp(tm+ 1
where m, s are real numbers.

From Theorem 3.2 one can check by taking the derivatives that m = EX and s2 = Var(X). Using (53) it is easy to see that every univariate normal X can be written as

X = sZ + m,
where Z is the standard N(0,1) random variable with the moment generating function e[(t2)/ 2]. This is perhaps the most convenient representation16 of the general univariate normal distribution. Traditionally, it was used to answer questions like

If X has given mean m = 123 and given variance s2 = 456, for what values of a we have Pr(|X| > a) = .8?

with the help of tabularized values of the cumulative distribution function of standard normal Z.

[ 22 [Basketball coach's problem] Collect data on the heights of 25 to 50 randomly selected males. Make a histogram of the data, compute the empirical mean and standard deviation.

4.1  Herschel's law of errors

The following narrative comes from J. F. W. Herschel17.

``Suppose a ball is dropped from a given height, with the intention that it shall fall on a given mark. Fall as it may, its deviation from the mark is error, and the probability of that error is the unknown function of its square, ie. of the sum of the squares of its deviations in any two rectangular directions. Now, the probability of any deviation depending solely on its magnitude, and not on its direction, it follows that the probability of each of these rectangular deviations must be the same function of its square. And since the observed oblique deviation is equivalent to the two rectangular ones, supposed concurrent, and which are essentially independent of one another, and is, therefore, a compound event of which they are the simple independent constituents, therefore its probability will be the product of their separate probabilities. Thus the form of our unknown function comes to be determined from this condition...''

Ten years after Herschel, the reasoning was repeated by J. C. Maxwell18. The fact that velocities are normally distributed is sometimes called Maxwell's theorem.

The beauty of the reasoning lies in the fact that the interplay of two very natural assumptions: of independence and of rotation invariance, gives rise to the normal law of errors - the most important distribution in statistics.

Theorem 13 Suppose random variables X, Y have joint probability distribution m(dx, dy) such that

(i) m(·) is invariant under the rotations of IR2;

(ii) X, Y are independent.

Then X, Y are normal.

The following technical lemma asserts that moment generating function exists. [ 1 If X,Y are independent and X+Y, X-Y are independent, then EexpaX < for all a.

Proof. Consider a real function N(x): = P(|X| x). We shall show that there is x0 such that

N(2x) 8(N(x - x0))2
for each x x0. By Problem 3.4 this will end the proof.

Let X1, X2 be the independent copies of X. Inequality (56) follows from the fact that event {|X1| 2x} implies that either the event {|X1| 2x}{|X2| 2x0}, or the event {|X1+X2| 2(x - x0)}{|X1 - X2 | 2(x - x0)} occurs.

Indeed, let x0 be such that P(|X2| 2x0) 1/2. If |X1| 2x and |X2| < 2x0 then |X1X2| |X1| - |X2| 2(x -x0). Therefore using independence and the trivial bound P(|X1+X2| 2a) P(|X1| a)+P(|X2| a), we obtain

P(|X1| 2x) P(|X1| 2x)P(|X2| 2x0)
+ P(|X1+X2| 2(x - x0)) P(|X1- X2| 2(x - x0) )
N(2x)+4N2(x - x0)
for each x x0. [¯]
Proof of Theorem 4.1. Let M(u) = EuX be the moment generating function of X. Since E eu(X+Y)+v(X-Y) = E e(u+v)X+(u-v)Y

M(2 u)M(2v) = M(u+v)M(u-v)
This implies that Q(x) = lnM(x) satisfies

Q(2 u)+ Q(2v) = Q(u+v)+Q(u-v)
Differentiating (58) with respect to u and then v we get
Q(u+v) = Q(u-v)

Therefore (take u = v) the second derivative Q(u) = Q(0) = const 0. This means M(u) = exp( au+bu2). [¯]

4.2  Bivariate Normal distribution

Definition 14 We say that X,Y have jointly normal distribution (bivariate normal), if aX+bY is normal for all a,b IR.

If EX = EY = 0, the moment generating function M(t,s) = EetX+sY is given by M(t,s) = e1/2(s12 t2+2tsrs1s2+s2t2)

Clearly s12 = Var(X), s22 = Var(Y), r = Cov(X,Y).

When r 1 the joint density of X,Y exists and is given by

f(x,y) = 1

exp- x2-y2-2rs1s2 xy

Example 49 If X,Y are jointly normal with correlation coefficient r 1 then the conditional distribution of Y given X is normal.

4.3  Questions

Problem 36 Show that if X,Y are independent and normal, then X+Y is normal. (Hint: moment generating functions are easier than convolution formula (36).)

Problem 37 If X,Y are independent normal N(0,1), find the density of X2+Y2. (Hint: compute cumulative distribution function, integrating in polar coordinates.)

Problem 38 For jointly normal X,Y show that E(Y|X) = aX+b is linear.

Problem 39 If X,Y are jointly normal then Y-rXsY/sX and X are independent.

Problem 40 If X, Y are jointly normal with variances sX2,sY2 and the correlation coefficient r, then X = sX(g1 cosq+g2sinq, Y = sY(g1 sinq+g2cosq, where gj are independent N(0,1) and sin2q = r.

Chapter 5
Limit theorems

This is a short chapter on asymptotic behavior of sums and averages of independent observations. Theorem justifies simulations as the means for computing probabilities and expected values. Theorem provides error estimates.

5.1  Stochastic Analysis

There are several different concepts of convergence of random variables.

Definition 15 Xn X in probability, if Pr(|Xn-X| > e) 0 for all e > 0

Example 50 Let Xn be N(0,s = 1/n). Then Xn o in probability.

Definition 16 Xn X almost surely, if Pr(Xn X) = 1.

Example 51 Let U be the uniform U(0,1) r. v. Then 1/n U 0 almost surely.

Example 52 Let U be the uniform U(0,1) r. v. et

Xn = {
if U > 1/n
. Then Xn 0 almost surely.

Definition 17 Xn X in L2 (mean square), if E|Xn-X|2 0 as n .

Remark 2 If Xn X in L2, then by Chebyshev's inequality Xn X in probability.

If Xn X almost surely, then Xn X in probability.

5.2  Law of large numbers

Each law of large numbers (there are many of them) states that empirical averages converge to the expected value. In statistical physics the law of large numbers impies that trajectory averages and population averages are asymptoticaly the same. In simulations, it provides a theoretical foundation, and connects the frequency with the probability of an event.

Theorem 14 Suppose Xk are such that EXk = m, Var(Xk) = s2, and cov(Xi,Xj) = 0 for all i j. Then 1/nj = 1n Xj m in L2

Proof. To show that 1/nj = 1n Xj m in mean square, compute the variance. [¯]
[ 1 If Xn is Binomial Bin(n,p) then 1/n Xn p in probability.

5.2.1  Strong law of large numbers

The following method can be used to prove strong law of large numbers for Binomial r. v.

  1. If X is Bin(n,p), use moment generating function to show that E(X-np)4 Cn2
  2. Use Chebyshev's inequality and fourth moments to show that

    nPr(|Xn/n-p| > e) < .

  3. Use the convergence of the series to show that limNPr(n N {|Xn/n-p| > e}) = 0.

  4. Use the continuity of the probability measure to show that Pr(Nn N {|Xn/n-p| > e}) = 0.

  5. Show that with probability one for every rational e > 0 there is N = N(e) such that for all n > N the inequality |Xn/n-p| < e holds. Hint: if Pr(Ae) = 1 for rational e, then Pr(eAe) = 1.

5.3  Convergence of distributions

In addition to the types of convergence introduced in Section , we also have the convergence in distribution.

Definition 18 Xn converges to X in distribution, if Ef(Xn) Ef(X) for all bounded continuous functions f.

Theorem 15 If Xn X in distribution, then Pr(Xn (a,b)) Pr(X (a,b)) for all a < b such that Pr(X = a) = Pr(X = b) = 0.

Theorem 16 If MXn(u) MX(u) for all u, then Xn X in distribution.

Theorem 2.17 states convergence in distribution to Poisson limit. Here is a proof that uses moment generating functions.

Proof of Theorem 2.17. MXn(u) = (1+pn(eu-1))n el(eu-1). [¯]

5.4  Central Limit Theorem

We state first normal approximation to binomial. Theorem 17 If Xn is Binomial Bin(n,p) and 0 < p < 1 is constant, then [(Xn-np)/( [npq])] Z in distribution to N(0,1) random variable Z.

Proof. For p = 1/2 only. The moment generating function of [(Xn-np)/( [npq])] = [(2Xn-n)/( n)] is

Mn(u) = e-nu(1/2+1/2e-2u/n)n = ([(e-u/n+eu/n)/ 2])n = coshn (u/n) eu2/2. (Can you justify the limit?) [¯]

Picture Omitted

Figure 5.1: Empirical proportions [^p]n as the function of sample size. Output of LIMTHS.BAS for p = .25 and 5 n 30

Limit theorems are illustrated by the program LIMTHS.BAS. This program graphs empirical proportions [^p]t as the (random) function of t n, makes a histogram, and compares it with the Normal and Poisson histograms. By trying various lengths of path n, one can see the almost sure Law of Large Numbers, and for moderate n see the normal approximation.

Problem 41 The graph of [^p]n as the function of n as given by LIMTHS.BAS suggests a pair of ``curves" between which the averages are ``squeezed". What are the equations of these curves?

[ 23 Suppose a poll of size n is to be taken, and the actual proportion of the voters supporting an issue in question is p = 1/2. Determine the size n such that the observed proportion [^p] = 1/nX satisfies Pr([^p] > .8) .01.

[ 24 Plot the histogram for a 100 independent Binomial Bin(n = 100,p = .5) random variables.

Theorem 18 If Xj are i.i.d. with EX = m, Var(X) = s2 then [(j = 1n Xj- nm)/( sn)] Z

Proof. For m = 0, s = 1 only.

The moment generating function is

Mn(u) = (M1(u/n))n = (1+[u/( n)]M(0)+[(u2)/ 2n]M(0)+O([1/( n2)]))n eu2/2 [¯]
A lesser known aspect of the central limit theorem is that one can actually simulate paths of certain continuous time processes by taking Xn(t) = [1/( n)]k nt xk, where xk are independent mean-zero r. v.

The following program uses this to simulate random curves. The program requires a graphics card on the PC. PROGRAM firewalk.bas

'This program simulates random walk paths (with uniform incerments)
'reflected at the boundaries of a region
'declarations of subs
DECLARE SUB CenterPrint (Text$)

' minimal error handling - graphics card is required

'request good graphics (some cards support SCREEN 7, etc)

LOCATE 1, 1 'title
CenterPrint "Path of reflected random walk"
LOCATE 9, 1  'timer location
PRINT "Timer"

scale = 10 '
WINDOW (0, 0)-(scale, scale): VIEW (150, 100)-(600, 300)
LINE (0, 0)-(scale, scale), 10, B': LINE (scale, scale)-(2 * scale, 0), 11, B
FOR j = -4 TO 4
LINE (0, scale / 2 + j)-(scale / 50, scale / 2 + j), 2
X = scale / 2
Y = scale / 2
dead = 0
T = 0
col = 14 * RND(1)
speed = scale / 100
WHILE INKEY$ = "" 'infinite loop until a key is pressed
T = T + 1
X0 = X
Y0 = Y
X = X + (RND(1) - 1 / 2) * speed
Y = Y + (RND(1) - 1 / 2) * speed
IF X < 0 THEN X = -X: col = 14 * RND(1)
IF Y < 0 THEN Y = -Y: col = 14 * RND(1)
IF X > scale THEN X = 2 * scale - X: col = 14 * RND(1)
IF Y > scale THEN Y = 2 * scale - Y: col = 14 * RND(1)
IF X > 2 * scale THEN X = 4 * scale - X: col = 14 * RND(1)
LINE (X0, Y0)-(X, Y), col
LOCATE 10, 1


ErrTrap: 'if there are errors then quit
PRINT "This error requires graphics card VGA"
PRINT "Error running program"
PRINT "Press any key ...'"

SUB CenterPrint (Text$)
' Print text centered in 80 column screen
offset = 41 - LEN(Text$) \ 2
IF offset < 1 THEN offset = 1
LOCATE , offset

5.5  Limit theorems and simulations

One role of limit theorems is to justify the simulations and provide error estimates. The simulation presented on page pageref uses the Central Limit Theorem to print out the so called 95% confidence interval.

Central limit theorem is also a basis for a fast simulation of the normal distribution, see Section .

5.6  Large deviation bounds

We begin with the bound for binomial distribution.

Recall that the relative entropy is H(q|p) = -qlnq/p - (1-q) ln[(1-q)/( 1-p)].

Theorem 19 Suppose Xn is Binomial Bin(n,p). Then for p p

( 1
Xn p) expn H(p|p)

Proof. Use Chebyshev's inequality (34) and the moment generating function. Pr(Xn n p) = Pr(uXn un p) e-npu EuXn =

( e-pu (1-p+peu))n = ( (1-p) e-pu+pe(1-p)u)n.

Now the calculus question: for what u 0 the function f(u) = (1-p)e-pu+pe(1-p)u attains a minimum? The answer is u = 0 if p p, or u = ln([(1-p)/ p][(p)/( 1-p)]). Therefore the minimal value is f(umin) = [p/( p)]e(1-p)u = exp(-plnp/p-(1-p)ln(1-p)/(1-p)) [¯]
[ 2 If p = 1/2 then

(| 1
Xn- 1
| > t) e-nt2/2

Proof. This follows from the inequality (1+x)ln(1+x) +(1-x)ln(1-x) x2. [¯]

5.7  Conditional limit theorems

Suppose Xj are i.i.d. Conditional limit theorems say what is the conditional distribution of X1, given the value of the empirical average 1/nj = 1n h(Xj). Such probabilities are difficult to simulate when 1/nj = 1n h(Xj) differs significantly from Eh(X).

5.8  Questions

[ 25 A 400-point multiple choice test has four possible responses, one of which is correct.

Part 2
Stochastic processes

Chapter 6

This chapter collects information about how to simulate special distributions.

Considerations of machine efficiency, in particular convenience of doing fixed precision arithmetics, make the uniform U(0,1) the fundamental building block of simulations. We do not consider in much detail how such sequences are produced - efficient methods are hardware-dependent. We also assume these are i. i. d., even though the typical random number generator returns the same pseudo-random sequence for each value of the seed, and often the sequence is periodic and correlated. This is again a question of speed and hardware only.

6.1  Generating random numbers

Suppose x0 is an arbitrary number between 0 and 1 with 5 decimal places or more. Let x1 = {147x0}, and xn+1 = {147xn}, where {a} = a-[a] denotes the fractional part. Here are the questions: Is xn a random sequence? Does it have ``enough" propertries of a random sequence to be used for simulations?

6.1.1  Random digits

The classical Peano curve actually maps the unit interval onto the unit square preserving the Lebesgue measure. Thus two independent U(0,1) are as ``random" as one!

Experiments with discrete outcomes aren't necessarily less random than continuous models. Expansion into binary fractions connects infinite tosses of a coin with a single uniform r. v.

Example 53 Let U be uniform U(0,2p). Random variables Xk = sign (sin(2k U)) are i. i. d. symmetric independent.

Theorem 20 If xj are independent identically distributed discrete random variables with values {0,1} and Pr(x = 1) = 1/2 then k = 1 [1/( 2k)]xk in uniform U(0,1).

Proof. We show by induction that if U is independent of {xj} uniform U(0,1) r. v., then [1/( 2n)]U+k = 1n [1/( 2k)]xk is uniform for all n 0. For induction step, notice that in distribution

[1/( 2n)]U+k = 1n[1/( 2k)]xk @ 1/2x1+1/2U. This reduces the proof to n = 1 case.

The rest of the proof is essentially the solution of Problem 2.18. Clearly Pr(1/2x1+1/2U < x) = Pr(x1 = 0)Pr(U/2 < x)+Pr(x1 = 1)Pr(1/2+U/2 < x) Expression Pr(U/2 < x) is 0, 2x, or 1. Expression Pr(1/2+U/2 < x) is 0, 2x-1, or 1. Their average (check carefully ranges of x!) is

if x < 0
if 0 x 1
if x > 1

Coefficient 2 does not play any special role. The same fact holds true in any number system - if N > 1 is fixed, and U = Xj N-j is expanded in pase N, then Xj are independent uniform {0,1,..., N-1}-valued discrete random variables.

From the mathematical point of view all of the simulations can be based on a single U(0,1) random variable. In particular, to generate independently uniform integer numbers19 in the prescribed range 0 ... N we need to pick any u (0,1), and define Xj = Nju mod N.

Clearly Xj is the integer part of NUj, where Uj solve the recurrence

Uj+1 = {UjN}
and {} denotes the fractional part. A computer realization of this construction would use Uj = nj/M with some M > N, Thus nj+1 = Nnj mod M.

Many programs provide access to uniform numbers. These however might be platform-dependent, and are often of ``low quality". Often a person performing simulation may want to use the code they have explicit access to. What is ``random enough" for one application may not be random enough for another.

6.1.2  Overview of random number generators

There is good evidence, both theoretical (see (62)) and empirical, that the simple multiplicative congruential algorithm

nj+1 = a nj mod N
can be as good as the more general linear congruential generator. Park & Miller propose a minimal standard generator based on the choices a = 75, N = 231-1. The computer implementation of this method is not obvious due to overflows when multippying large integers in finite computer arithmetics, see Schrage's algorithm in [].

The linear congruential method has the advantage of being very fast. It has the disadvantage that it is not free of sequential correlations between successive outputs. This shows up clearly in the fact that the consecutive k-points lie in at most N1/k subspaces of dimension k-1.

Many system-supplied random number generators are linear congruential generators, which generate a sequence of integers n1, n2, ... each between 0 and N-1 by the recurrence relation

nj+1 = anj+b mod N
The value of N is the largest integer representable in the machine. This is 216 on 16-bit machines, unless long integers are used (then it is 232). Value used in C-supplied generator are a = 1103515245, b = 12345, N = 232.

6.2  Simulating discrete r. v.

6.2.1  Generic Method - discrete case

Section 2.1.2 described the generic method for simulating a discrete random variables with finite number of values. Namely, take X = f(U), where f is a suitable piecewise constant functions on the interval (0,1). Let Pr(X = vk) = pk. Then f(x) = vk for x (j = 1kpj,j = 1k+1pj).

This is rather easy to convert into the computer algorithm.

6.2.2  Geometric

A simple method for simulating geometric distribution is to simulate independent binomial trials until the first success.

6.2.3  Binomial

A simple method for simulating Binomial Bin(n,p) random variable is to simulate binomial trials with the prescribed probability of success. For a sample program illustrating this method, see TOSSCOIN.BAS page pageref.

6.2.4  Poisson

An exact method to simulate Poisson distribution is based on the fact that it occurs the Poisson process, and that sojourn times in the Poisson process are exponential (see Theorem on page ). Therefore, to simulateX with Poiss(l) distribution, simulate independent exponential r. v. T1, T2,..., with parameter l = 1 and put as the value of X the first value of n such that T1+...+Tn > l.

A reasonable approximation to Poisson Poiss(l) random variable X is obtained by simulating binomial Bin(n,p) random variable X with l = np. Since X n, use n large enough to exceed any realistic values of X. Run program LIMTHMS.BAS to compare the histograms - why is Poisson distribution more spread out than the binomial?

6.3  Simulating continuous r. v.

6.3.1  Generic Method - continuous case

Section 2.2.3 described the generic method for simulating a continuous random variable, similar to the method used in the discrete case. Namely, take X = f(U) where f is the inverse20 of the cumulative distribution functions F(x) = Pr(X x).

Problem 42 Given a cumulative distribution function G(x) with inverse H(), and density g(x) = G(x), let U1,U2 be two independent uniform U(0,1) random variables. Show that X = H(U1), Y = U2g(X) are uniformly distributed in the region {(x,y): 0 < y < g(x)}.

6.3.2  Randomization

If the conditional density of Y given X = x is f(y|x), and the density of X is g(x), then the density of Y is f(y|x)g(x) dx.

Example 54 Suppose Y has density cye-y = Cn = 0y(1-y)n/n!. Let Pr(X = n) = c/n!. Then the conditional distribution is f(y|X = n) = Cy(1-y)n, which is the distribution of a median from the sample of size 2n.

6.3.3  Simulating normal distribution

By the central limit theorem (Theorem 5.4), if U1,...,Un are i. i. d. uniform U(0,1) then {[12/ n]}k = 1n (Uk-1/2) is asymptotically normal N(0,1). In particular a computer-efficient approximation to normal distribution is given by

k = 1 
Uk - 6.

The exact simulation of normal distribution uses the fact that an independent pair X1,X2 of normal N(0,1) random variables can be written as

X1 = RcosQ
X2 = RsinQ
where Q is uniform U(0,2p), R = [(X2+Y2)] is exponential (see Problem 4.3) with parameter l = 1/2, and random variables Q,R are independent. Clearly R = {-2lnU}, see Example 2.2.3.

We simulate two independent normal N(0,1) r. v. from two independent uniform r. v. U1, U2 by taking Q = 2pU1, R = {-2lnU2} and using formulas (65)-(66).

6.4  Rejection sampling

The idea of rejection method is very simple. In order to simulate a random variable with the density f(x), select a point X,Y at random uniformly from the region {(x,y): y < g(x)}. Then Pr(X < x) = -x f(t)  dt is the cumulative distribution function, which we do not have to know analytically.

The name comes from the technique suggested by Problem 6.3.1. Instead of selecting points under the graph of f(x), we pick another function g(x) > f(x), which has known antiderivative with explicitly available inverse. We pick points under the graph of g(x), and "reject" those that didn't make it below the graph of f(x).

Often used density g(x) = c/(1+x2) leads to X = H(U1), where H(u) = tan(pu/c).

Rejection sampling can be used to simulate continuous or discrete distributions. The idea behind using it in discrete case is to convert discrete distribution to a function of continous random variable. For example, to use rejection sampling for Poisson distribution, simulate the density f(x) = e-ll[x]/[x]! and take the integer part [X] of the resulting random variable.

6.5  Simulating discrete experiments

6.5.1  Random subsets

To simulate uniformly selected random subsets of {1,..., n}, define sets by a sequences S(j) {0,1} with the interpretation j S if S(j) = 1. Now select independently 0 or 1 with probablity 1/2 for each of the values of S(j), j = 1,..., n.

SUB RandomSubset( S())
n = UBOUND(S) ' read out the size of array
FOR j = 1 TO n
'select entries at random
S(j) = INT(RND(1) + 1)

For subsets of low dimension, a sequence of 0,1 can be identified with binary numbers. Set operations are then easily converted to binary operations on integers/long integers.

6.5.2  Random Permutations

Suppose we want to re-arrange elements a(1),..., a(n) into a random order. A quick method to accomplish this goal is to pick one element at a time, and set it aside. This can be easily implemented within the same sequence.

SUB Rearrange( A())
n = UBOUND(A) ' read out the size of array
ns = n 'initial size of randomization
FOR j = 1 TO n
'take a card at random and put it away
k = INT(RND(1) * ns + 1)
SWAP A(k), A(ns)
ns = ns - 1 'select from remaining a(j)

Problem 43 Let a1,... an be numbers such that j aj = 0, j aj2 = 1 Let X denote the sum of the first half of those numbers, after a random rearrangement.

6.6  Integration by simulations

Program listing on page pageref explains how to perform double integration by simulation. Here we concentrate on refining the procedure for single integrals by reducing the variance.

To evaluate J = ab f(x)  dx we may use the approximation 1/Nj = 1N f(Xj)/g(Xj), where Xj are i.i.d. with the density g(x) such that abg(x) dx = 1.

The error, as measured by the variance21, is



j = 1 

= 1





The variance is the smallest if g(x) = C|f(x)|, therefore a good approximation g to function f will reduce the variance22. This procedure of reducing variance is called importance sampling.

For smooth functions, a better approximation is obtained by selecting points more uniformly than the pure random choice. The so called Sobol sequences are based on sophisticated mathematics. Cookbook prescriptions can be found eg in Numerical recipes.

Note: The reliability of the Monte Carlo Method, and the associated error bounds depends on the quality of the random number generator. It is rather unwise to use an unknown random number generator in questions that require large number of randomizations. Example 55 The following problem is a challenge to any numerical integration method due to rapid oscillations, 012sin2(1000 x)  dx = 1-[1/ 2000]sin2000 1.0004367

6.6.1  Stratified sampling

The idea of stratified sampling is to select different number of points from different subregions. As a simple example, suppose we want to integrate a smooth function over the interval [0,1] using n points. Instead of following with the standard Monte Carlo prescription, we can divide [0,1] into k non-overlapping segments and choose nj points from the j-th subinterval Ij. An extreme case is to take k = n and nj = 1 - this becomes a variant of the trapezoidal method. The optimal choice of nj is to select them proportional to the local standard deviation of the usual Monte Carlo estimate of Ij f(x)  dx. Indeed, denoting by Fj the estimator of the integral over Ij, the variance of the answer is Var(j = 1k Fj) = j Var(Fj) = j sj2/nj. The minimum under the constraint jnj = n is nj sj.

A simple variant of recursive stratified sampling is to generate points and subdivisions based on estimated values of the variances.

6.7  Monte Carlo estimation of small probabilities

Unlikely events happen too rarely to have any reasonable hope of simulating them directly. Under such circumstances a special method of selective sampling23 was developed.


Suppose we want to find Pr(X1+...+Xn > an) for large n and a given density f(x) of independent r. v. X. Consider instead independent random variables Yj with the ``tilted density" Celx f(x), where C is the normalizer, and l is such that EY = a. By the law of large numbers (Theorem 5.2), the event Y1+...+Yn > an has large probability, and

Pr(X1+...+Xn > an) = y1+...+yn > ane-ly1... e-lynf(y1)... f(yn) dy1... dyn. This leads to the following procedure.

Example 56 Suppose Xj are {0,1}-valued so that X1+... Xn is binomial Bin(n,p). What is the distribution of Yj? Simplify the expression e-lY1... e-lYn.

[ 26 Write the program computing by simulation the probability that in a n = 10 tosses of a fair coin, at least 8 heads occur. Once you have a program that does for n = 10 a comparable job to the ``naive" program below, try Exercise 1.5 on page pageref.

Here is a naive program, that does the job for n = 10, but not for n = 100. It should be used to test the more complex ``tilted density" simulator.

PROGRAM heads.bas

'Simulating N fair coins
' declarations
DECLARE FUNCTION NumHeads% (p!, n%)
DEFINT I-N  ' declare integer variables

'prepare screen

PRINT "Simulating toss of n fair coins"

'get users input
n = 10 'number of trials 
INPUT "Number of coins n=", n
pr = .5 ' fairness of a coin
frac = .8 ' percentage of heads seeked
' get the frac from the user
PRINT " How often we get more than f heads? (where 0<f<"; n; ")"
INPUT "f=", frac
IF frac >= 1 THEN frac = frac / n 'rescale if too large
'tell user what is going on
PRINT "Hit any key to see the final answer"
LOCATE 20, 10
PRINT "With some patience you may see digits stabilize"

DO 'main loop
        T = T + 1
        IF NumHeads(pr, n) > frac * n THEN s = s + 1
        IF INKEY$ > "" THEN EXIT DO
        LOCATE 10, 10
        PRINT "Trial #"; T
        PRINT "Current estimate"; USING "##.#####"; s / T
'print the answer
PRINT "Prob of more then "; INT(frac * n); " heads in "; n; " trials is about "; s / T

FUNCTION NumHeads (p!, n)
'simulate a run of n coins
 h = 0
FOR k = 1 TO n
        IF RND(1) < p! THEN h = h + 1
NumHeads = h

Chapter 7
Introduction to stochastic processes

stochastic, a. conjectural; able to conjecture
Webster's New Universal Unabridged Dictionary

Stochastic processes model evolution of systems that either exhibit inherent randomness, or operate in an unpredictable environment. This unpredictability may have more than one form, see Section .

Probability provides models for analyzing random or unpredictable outcomes. The main new ingredient in stochastic processes is the explicit role of time. A stochastic process is described by its position X(t) at time t [0,1], t [0,), or t {0,1,...}.

From the conceptual point of view, stochastic processes that use discrete moments of time t {0,1,...} are the simplest. Since the discrete moments of time can represent arbitrarily small time increments, discrete time models are rich enough to model real-world phenomena. Continuous time versions are convenient mathematical idealizations.

7.1  Difference Equations

Mathematical models of time evolution of deterministic systems often involve differential equations.

Difference equations are discrete analogues of the differential equations. Difference equations occur in applied problems, and also when solving differential equations by series expansions, or by Euler's method. The concepts of numerical versus analytical solution, initial values, boundary values, linearity, and superposition of solutions occur in the discrete setup in complete analogy with the theory of differential equations.

A difference equation determines an unknown sequence (yn) through a relation that specifies the pattern. The relation often has the form of an equation that ties the consecutive values of the sequence. The technical term for such an equation is recurrence.

We will consider only special cases of classes of equation

yn+1 = f(n,yn,yn-1,...,yn-k+1).
Here coefficient k is the order of the equation. For instance, yn+1 = f(n,yn) is an equation of order 1, yn+1 = f(n,yn,yn-1) is an equation of order 2, etc.

7.1.1  Examples

Example 57

Suppose a sequence yn is to satisfy

yn+1 = cos(yn),
where the cosine function is in radians.

It is easy to write a short program that computes the values of yn. But to compute the actual sequence we need to specify the initial value y0. Table gives sample outputs of such a program for several choices of y0.

y0 y1 y2 y3 y4 y5 y6 y7 y8


.5403023 .8575532 .6542898 .7934803 .7013688 .7639596 .7221025 .7504177
-.5 .8775826 .6390125 .8026851 .694778 .7681958 .7191654 .7523558 .7300811
0 1 .5403023 .8575532 .6542898 .7934803 .7013688 .7639596 .7221025
.5 .8775826 .6390125 .8026851 .694778 .7681958 .7191654 .7523558 .7300811
1 .5403023 .8575532 .6542898 .7934803 .7013688 .7639596 .7221025 .7504177
1.5 .0707372 .9974992 .542405 .8564697 .6551088 .7929816 .7017242 .7637303
2 -.4161468 .9146533 .6100653 .8196106 .6825058 .7759946 .7137247 .7559287

Table 7.1: Sequences yn satisfying equation () with different initial values y0.

Similar numerical procedures show up in differential equations, where they are used to approximate continuous solutions by discretizing time. The procedure is called Euler method, or tangent line method.


Some difference equations are simpler than others. Suppose a sequence yn is to satisfy
yn+1 = yn +d,
where d is a given number. It is easy to write a short program that computes values of yn. To compute them we again need to specify the initial value y0.

On the other hand, we may notice that equation (68) defines the arithmetic progression. Instead of a table like Table 7.1, we can write down the solution for all possible values of y0 and for all n. Namely,

yn = y0+dn.
In general, an analytical solution of the difference equation is the formula that expresses yn as the function of n. This should be contrasted with the numerical solution which is an algorithm, or computer program, that computes values of yn. The general solution is the function of both y0 and n, as contrasted with a particular solution that works for a prescribed initial value24 y0 only.

Example 58 Here is another well known difference equation with obvious solution. Suppose

yn+1 = ryn ,
where r is a given number. Equation (70) defines a geometric progression and its solution is.
yn = y0rn

Examples 7.1.1 and 7.1.1 are deceptively simple. Not every difference equation has a simple or easy to guess answer.


The following equation defines the Fibonacci sequence
yn+1 = yn+yn-1.
In a typical application, yn denotes the number of rabbits at the end of the nth month. In particular, if we buy one newborn rabbit at the beginning of the first month then the first terms of the sequence are easy25 to write down:
1, 1, 2, 3, 5, 8, 13, 21, ...
Without much difficulty this can be converted to a computer program and used to answer questions like When will the population of rabbits exceed 1 million? The general expression (solution) of the equation corresponding to this situation is given by the following formula.

yn = 1



- 1



This is the outcome of the standard computation!

7.2  Linear difference equations

Many interesting difference equations, including (68), (70), (72) fall into the category of linear difference equations with constant coefficients. The first order linear difference equations with constant coefficients has the form

yn+1+a yn = g(n).
The second order linear difference equations with constant coefficients has the form
yn+2+ayn+1+b yn = g(n).

A method to solve difference equation of order 2 consists of the following steps:


Here is an example of the applied problem that leads to a natural, but not obviously solvable difference equation.

Suppose you borrow y0 dollars on fixed monthly interest rate r. If you do not make any payments on your loan, then your balance will ``balloon" exponentially. Formula yn = (1+r)ny0 expresses monthly balance after n months when no loan payments are made. Some people prefer to pay a fixed amount of p dollars at the end of each month. If p is large enough, then the balance may even shrink down! This situations is easily described by a difference equation. To write it down, compute the next month balance, if the previous month balance is known:

yn+1 = (1+r)yn - p .
Equation (73) and its general solution are of interest to bank officers and to their customers. From the formula for yn, they can determine monthly payments that will pay a loan off in the prescribed amount of time.

In differential equations, we often solve a similar problem in continuous time, with continuous compounding and continuous payments schedule. This is a mathematical simplification of the actual banking situation, but the answers are reasonably close, and they are easier to get.

Here is an example of the mathematical problem that leads to a natural, but non-trivial26 difference equation.

Example 59 Suppose we want to find the formula for the sum of all consecutive integers from 1 to n. Let the answer by yn. Then the recurrence formula

yn+1 = yn +n+1

Now (74) can be written as yn+1-yn = n+1 and actually this is why we use the name difference equations rather than recurrence relations. Notice that we do know the solution of its continuous analogue. Equation y = t+1 for unknown function y = y(t) resembles (74). Its solution27 is y(t) = 1/2t2+t. So to find the answer to (74) we may try substituting un = 1/2n2+n into the equation. Unfortunately, simple arithmetics shows that we don't get the right answer, as

un+1 = un +n+ 3
But then, we are not that far off the target. Let vn = un-yn. Subtracting equation (74) from equation (75) we get the differential equation for (vn)
vn+1 = vn+ 1
This is the special case of the arithmetic progression equation (68). From Example 7.1.1 equation (69) we know that vn = n/2. Therefore yn = 1/2n2+n-n/2 = [(n(n+1))/ 2]

The method we used - subtracting the equations - works for the so called linear difference equations (and also linear differential) equations. It is closely related to the Principle of Superposition for linear differential equations.

7.2.1  Problems

  1. Check if the given sequence solves the difference equation.

    1. Equation: yn+1 = -yn. Sequence yn = cosnp.
    2. Equation: yn+1 = yn+2yn-1. Sequence yn = 2n.
    3. Equation: yn+1 = yn+2yn-1. Sequence yn = -1/2n3+3n2-5/2n+1.
    4. Equation: yn+1 = 2yn-yn-1+2. Sequence yn = n2.

  2. Write down the difference equation that you need to solve each of the following problems (do not solve the equation, nor the problem).

    1. A loan of $1000 has interest rate that varies with time as follows: first month interest is 0%, second month interest is [1/ 12]%, third month interest is [2/ 12]%, etc with monthly interest increasing by [1/ 12]% every month. Determine the fixed monthly payment p that will pay this loan within one year.

    2. A loan of $1000 has constant monthly interest rate of [1/ 12]%. I arranged my monthly payments in the following fashion: at the end of the first month I will pay nothing, at the end of the second month I will pay $10, at the end of the third month I will pay $20, etc, with monthly payments increasing by $10 every month. Determine when I will pay off this loan and how much money I will pay in total.

  3. Find the general solution of the following difference equations.

    1. yn+1 = 1/n yn
    2. yn+1 = [n/( n+1)]yn
    3. n(n+1)yn+1 = yn

  4. Solve the given initial value problem for the difference equation.

    1. yn+1 = 1/n yn; y0=1
    2. yn+1 = [n/( n+1)]yn-1 ; y0 = 1, y1 = 0
    3. n(n+1)yn+1 = 2yn ;y0 = 1, y1 = 0

  5. Find the formula for

    1. yn = 12+22+...+n2
    2. yn = 13+23+...+n3
    3. yn = 1/2+1/6+...+[1/( n(n+1))]
    4. yn = 1+r+r2+...+rn-1
    5. yn = r+2r2+3r3+...+nrn

  6. Each solution of equation (67) has the limit limnyn. Show that the first digits of this limit are .739085133.

7.3  Recursive equations, chaos, randomness

Before jumping to models that use explicit randomness in their evolution, it is quite illuminating to analyze first some mathematically simple and well defined ``deterministic evolutionary processes". The following set of examples describes deterministic evolution in discrete time of a system described by a single numerical parameter x (0,1). All of the examples fall into the category of (non-linear) difference equations: given initial value x0 (0,1) and a simple evolution equation of the form xn+1 = g(xn), we are supposed to make inferences about the behavior of the solutions.

There is no randomness in the evolution itself. But since we are allowed to choose any initial condition, and no initial condition can be measured exactly, we may as well consider the initial value to be random. Example 60 Equation

xn+1 = cos(xn)
is analyzed numerically in Example 7.1.1. A useful technique to analyze such equations is to graph function y = g(x) together with line y = x, and represent the sequence xk by points (xk,xk) on the line. The actual proof that xn x* may perhaps be not that obvious, but the geometric argument seems to be quite convincing.

Example 61 Let {x} denote the fractional part of x. Equation

xn+1 = {2xn}
uses discontinuous function g(x). The previous graphical technique is more difficult to apply here, but the reason for this difficulty might be not apparent. Special initial points, like x0 = 1/2, x0 = 1/4, x0 = 1/8, x0 = 3/4, ... are relatively easy to analyze. But these are exceptional - the majority of the initial points actually doesn't follow this pattern.

Here is a rather surprising fact. Suppose x0 is selected at random. Since xk is a function of x0, and x0 is random, xk becomes a random variable. It may happen that xk < 1/2. Call this event Ak. The reasoning presented in Section 6.1.1 implies that events {Ak} are independent and have the same probability Pr(Ak) = 1/2. Therefore the deterministic evolution equation contains at least as much randomness as the tosses of a coin.

The last example may perhaps give the impression that it is the discontinuity of the evolution equation that is the source of difficulties. This is not at all the case.

Example 62 Equation

xn+1 = 4xn(1-xn)
is another example of the "chaotic equation" with solutions exhibiting as much irregularity as the tosses of a coin. Attempts at graphing its solutions do no indicate any patterns. Tiny differences in the choice of initial value x0 significantly change the evolution within short time.

Example 7.3 indicates that ``naive" prediction of the future of a system is unreliable even within the realm of deterministic evolution equations.

On the other hand, there are aspects of the evolution that we can analyze reliably. These deal with the average behavior of the evolution.

There are two classes of questions that we can answer, but both deal with statistical nature of the evolution:

For instance, we can ask and get reliable answers to questions like: Theory of stochastic processes uses descriptive rather than casual models. Its primary goal is to isolate methods that answer questions that can be answered - about the averages and chances of events. It is setup in the form that makes it more natural to ask the "correct" questions. But in real life, and in simulations, we do have access to aspects of the phenomenon than what the theory does not expose. In analyzing simulations it is important to keep in mind the examples above. Avoid collecting data that deal with instances rather than statistical phenomena. Print out well defined statistics of the simulation only. Do not clutter your simulations with irrelevant details.

7.4  Modeling and simulation

The quick way to get insights into operation of real systems is to model their behavior. Here are examples of enterprises that operate under randomness. Mathematicians devised methods of modelling each of these. But it is interesting also to simulate their behavior. Notice that simulations require assumptions, but so do the analytical methods. Regardless of the method, we have to be careful about what are the questions we can answer.

Example 63 Suppose we want to study how much capital is needed to run a casino. We need to answer the following.

There is also a number of questions that we do not want to answer.

Example 64 Suppose we want to study how much capital is needed to insure cars. We need to answer the following.

There is also a number of questions that we do not want to answer. For instance we do not want to predict whether I'll file an insurance claim today, driving back home on I-74 without enough sleep since I was preparing this class until 3AM.

Example 65 A construction company has n jobs to be performed in the future. For each of these jobs, experts provide estimate of the cost. Then, after questioning, they complement the estimates by the lower/upper bound for the costs. How much money should be allotted?

Example 66

A store averages l(t) customers per hour at time t of the day. Each customer brings some (random) profit. However, a customer may just leave the store without shopping, if the lines are too long.

7.5  Random walks

A random walk is a process of the form Xn = j = 1n xj, where xj are i. i. d. Random walks have independent increments , and describe the accumulation of independent contributions over time.

Random walks are examples of Markov processes which will be studied in detail in Chapter . Their special structure allows to analyze them independently of the general theory.

7.5.1  Stopping times

The stopping times are random variables that describe phenomena which depend on the trajectory of a random walk. The definition captures the intuition that their values are determined by the history of a Markov chain.

Definition 19 T:W IN is a stopping time, if the event {T = n} is independent of xn+1,xn+2,....

This definition is specialized to random walks. In a more general Markov case the definition is less transparent but captures the same idea.

The most important example of a stopping time is the first entrance time T = inf{k: Xk A}. An example that is not a stopping time is the last exit from a set.

When T < we define random sums XT = j Txj. The following theorem is an exercise when T is independent of x.

Theorem 21 If xj are i. i. d., Ex = m, and ET < is a stopping time then


Proof. EXT = n EXn Pr(T = n) = n k = 1nExk Pr(T k). Since xk and {T k} = {T < k} are independent, therefore EST = mn Pr(T n) = mET by tail integration formula (21). [¯]
Theorem 22 If xj are i.i.d., Ex = m, Var(x) = s2 < , and ET < then

E(ST-Tm)2 = s2ET

(These formulas are of interest in branching processes, and in chromatography.)

Example 67 The number of checks cashed at a bank per day is Poisson random variable N with mean l = 200. The amount of each check is a random variable with a mean of $30 and a standard deviation of $5. If the bank has $6860 on hand, is the demand likely to be met?

Problem 44 Suppose xk, T are independent. Find the variance of XT in terms of the first two moments of x, T.

7.5.2  Example: chromatography

Chromatography is a technique of separation mixtures into compounds. One of its uses is to produce the DNA bands.

The sample is injected into a column, and the molecules are transported along the length by electric potential, flow of gas, or liquid. The basis for chromatographic separation of a sample of molecules is difference in their physical characteristics. The molecules switch between two phases: mobile, and stationary, and the separation of compounds is caused by the difference in times spend in each of the phases.

Suppose that the molecules of a compound spend random independent amounts of time Uk in mobile phase and random amount of time Wk in the stationary phase. Thus at time t the position of a molecule is given by a random sum vj = 1T(t) Uj, where T(t) = inf{k: j = 1kUj+Wj > t}.

Section 7.5.1 gives formulas for the mean and the variance of the position. Since the number of transitions T is likely to be large for a typical molecule, it isn't surprising that the actual position has (asymptotically) normal distribution. (The actual Central Limit Theorem for random sums is not stated in these notes.)

[ 27 Simulate the output of the chromatography column of fixed length separating a pair of substances that have different distributions of mobile and stationary phases Uk, Wk. Select a hundred particles of each substance, and measure the degree of separation at the end of the column.

7.5.3  Ruining a gambler

The following model is a reasonable approximation to some of the examples in Section 7.4.

Suppose a gambler can afford to loose amount L > 0, while the casino has capital C < 0. Let xj = 1 be i. i. d. random variables modelling the outcomes of consecutive games, Sn be the partial sums (representing gains of the gambler), and let T = inf{k:Sk L or Sk C} be the total number of games played. Then Pr(T > k) = Pr(C < Sk < L) and thus ET = kPr(C < Sk < L).

The special case Pr(x = 1) = 1/2 is easily solved, since in this case EST = ExET = 0. Let p = Pr(ST = C) denote the probability of ruining the casino. Since ST is either C, or L we have 0 = EST = pC+(1-p)L, giving p = L/(L-C). This formula means that a gambler has a fair chance of ruining a casino in a fair game, provided he brings with him enough cash L.

For more general random walks (and less fair games) probability of gambler's ruin can be found explicitly using the one-step-analysis (Section ). It is also interesting to find how long a game like that would last on average. (The expression for ET given above is not explicit.)

7.5.4  Random growth model

The following models various growth phenomena like the spread of a disease, where the infected individual may either die, or infect a number of other individuals. Here we concentrate on bacteria which have simple reproduction mechanism, and all spatial relations are neglected.

Let Xt denote the number of bacteria in t-th generation, with X0 = 1. Assume that a bacteria can die with probability q > 0, or divide into two cells with probability p = 1-q, and that all deaths occur independently. Our goal here is to find the average number of bacteria m(t) = E(Xt) in the t-th generation. This can be recovered from Theorem 7.5.1. Instead, we show another method based on conditioning.

The number of bacteria in the next generation is determined by binomial probabilities: Pr(Xt+1 = 2k|Xt = n) = (nk)pk qn-k. Therefore E(Xt+1|Xt) = 2p Xt and the average population size m(t) = E(Xt) satisfies difference equation m(t+1) = 2p m(t). We have m(t) = (2p)t. In particular, the population on average grows exponentially when p > 1/2.

It is perhaps surprising that a population of bacteria with p = 3/4, which on average grows by 50% per generation, has still a 1/3 chance of going extinct. One way to interpret this is to say that infections by a ``deadly" and rapidly developing desease may still have a large survival rate without any intervention of medicine, or immune system. (The methods to compute such probabilities will be introduced in Section . The answer above assumes infection by a single cell.)

Problem 45 Find the formula for the variance Var(Xt) of the number of bacteria in t-th generation.

Problem 46 What is the probability that an infection by 10 identical bacteria with the doubling probability p = 3/4 dies out?

Chapter 8
Markov processes

evolution, n. [L. evolutio (-onis)], an unrolling or opening...
Webster's New Universal Unabridged Dictionary

Markov processes are perhaps the simplest model of a random evolution without long-term memory.

Markov process is a sequence Xt of random variables indexed by discrete time t ZZ\scriptscriptstyle + , or continuous t 0 that satisfies the so called Markov property. The set of all possible values of variables Xt is called the state space of the Markov chain. Typical examples of state spaces are IR, IN, the set of all non-negative pairs of integers, and finite sets.

Markov chains are Markov processes with discrete time. Thus a Markov chain is an infinite sequence {Xk}k ZZ\scriptscriptstyle + of (usually, dependent) random variables with short-term (one-step) memory.

8.1  Markov chains

The formal definition of the Markov property is as follows. Definition 20 A family of discrete r. v. {Xk}k ZZ\scriptscriptstyle + is a Markov chain, if

(Xk+1 U|X0,...,Xk) = Pr
(Xk+1 U|Xk)
depends only on the present value Xk.

Examples of Markov chains are:

Examples of non-Markov processes are easy to construct, but lack of Markov propertry is not obvious to verify. In general, if Xk is a Markov process, Yk = f(Xk) may fail to be Markov.

8.1.1  Finite state space

If a Markov chain has a finite state space, we can always assume28 it consists of integers.

Markov condition

(Xk+1 = x|X0,..., Xk) = Pr
(Xk+1 = x|Xk)
implies that probability of reaching x in the next step depends only on the present value Xk. The probabilistic behavior of such a chain is completely determined by the initial distribution pk = Pr(X0 = k) and the transition matrices Pn(i,j) = Pr(Xn = j|Xn-1 = i), see formula (11) on page pageref. For mathematical convenience we shall assume that one step transition matrices Pt = P do not depend on time t. Such Markov chains are called homogeneous. This assumption isn't realistic, nor always convenient. For instance, the Markov simulation in Section uses a Markov chain with transitions that vary with time. But homogeneous Markov chains are still flexible enough to handle some time dependencies efficiently through modifications to the state space.

Example 68 Suppose Xn is a Markov chain with periodic transition probabilities Pn = Pn+T. Then Yn = (Xn+1,Xn+2,..., Xn+T) is a homogeneous Markov chain.

Problem 47 Suppose xj are independent {0,1}-valued with Pr(x = 1) = p. Let Xn = axn+bxn+1, where ab 0.

Explain why Xn is a Markov chain.

Write the transition matrix for the Markov chain Xn.

[ 3 The probabilities Pr(Xn = j|X0 = i) are given by the i,j-entries of the matrix Pn

Proof. This is the consequence of Markov property (78) and the total probability formula (10). [¯]
Powers of moderately sized matrices are easy to compute on the computer. Section indicates a mathematical method of computing Pn for small dimensions using the Cayley-Hamilton theorem. Under certain conditions the powers converge. [ 28 Find limnPn for the matrix from Problem 8.1.1.

8.1.1 Stationary Markov processes

Suppose Pr(X0 = k) = pk, where pk solve stationarity equations

pk = 1
[p1,..., pd] = [p1,..., pd]×P

Then the resulting process is stationary: the distribution of each k-tuple (X(t1),...,X(tk)) is invariant under shifts in time,

(X(t1+s),...,X(tk+s)) @ (X(t1),...,X(tk)).

This is interpreted as ``equilibrium", or ``steady state". Notice that ``steady state" is a statistical concept, and is not easily visible in a simulation of the single trajectory. In order to be able to see it, one has to simulate a large number of independently evolving Markov chains that begin with the same initial distribution and have the same transition matrix.

If Xt is a stationary Markov process and f is a function on its state space, then Y(t) = f(Xt) is also stationary, although not necessarily Markov.

If Xj(t) are independent realizations of the same Markov process and f is a function on their state space then Yn(t) = n-1/2j = 1n (f(Xj(t))-m) is stationary and approximately normal random sequence.

8.1.2  Markov processes and graphs

The states of a Markov chain may be represented by vertices of a graph, and one step transitions may be described by directed edges with weights. Such representation of a markov chain aids in visualizing a Markov chain.

8.1.2 Classification of states

Graph notions have bearing on properties of the Markov chain. In particular, Markov chain is irreducible, if the corresponding graph is connected. Markov chain is periodic, if there is N > 1 (the period) such that all cycles of the graph are multiples of N. If there is no such N then Markov chain is called aperiodic.

Finite state Markov chain is regular, if there is deterministic number N such that all states are connected by paths of length at most N. Problem 48 Show that regular Markov chain is aperiodic and irreducible.

8.1.2 Trajectory averages

Additive functionals of a Markov process are expressions of the form 1/nj = 0n-1 f(Xj). Under certain conditions, the averages converge and the limit doesn't depend on the initial distribution. Under certain conditions, partial sums are approximately normal.

Problem 49 Let Xk be an irreducible {0,1}-valued Markov chain with invariant initial distribution.

8.1.2 Asymptotic probabilities

Let pk(n) = Pr(Xn = k), and suppose that the limit pk() = limnpk(n) exists. Since pk(n+1) = j = 1dpj(n)P(j,k), therefore the limit probabilities satisfy the stationarity equations (79)

For regular (finite state space) Markov chains the limit actually exists independently of the initial state. Therefore the stationarity equations can be used to find the limit.

Problem 50 Let

P = [

Problem 51 Let

P = [

Problem 52 Let

P = [

8.1.2 Example: two-state Markov chain

Suppose Xk is a Markov chain with transition matrix

P = [
]. Then

Pn = 1


+ (1-a-b)n



If 0 < a,b < 1




and the rate of convergence is exponentially fast.

Problem 53 Suppose Xk is a Markov chain with transition matrix

P = [
Then Yn = (Xn,Xn+1) is also a Markov process. Find its transition matrix and the stationary distribution.

8.2  Simulating Markov chains

Homogeneous and non-homogeneous Markov chains with finite state space are fairly straightforward to simulate. A generic prescription is to simulate Xn+1 using the conditional distribution determined by Xn. This simulation differs very little from the generic method described in Section 2.1.2

8.2 Example: how long a committee should discuss a topic?

This example involves simulation of a Markov chain. For Markov chains many theoretical results are available, but simulation is often the most expedient way to study it.

[ 29 Suppose n people on a committee discuss a certain issue. When one person finishes speaking, we assume that it is equally likely that any of the other n-1 begins. We also assume that each person speaks for exponentially distributed random time. How long does it take on average for all participants to take part in the discussion?

8.2.1  Example: soccer

The soccer game29 is played by two teams, each with 10 players in the field and a goalkeeper. A modern line-up split the players into between the zones of defence, center, and attack. Thus a configuration (3-4-4) means 3 defenders (backs), 4 midfield link men and 4 strikers (forwards). In the Markov model of soccer we will just watch the position of the ball, and we assume it can be only in one of the five positions: left goal, left defence, midfield, right defence, right goal. Wee shall assume that at every unit of time the ball has to move left or right with chances proportional to the number of players lined-up for a given position.

For each possible configurations of teams, we could determine the average score, and thus determine which player distribution is the best, if there is one. If there is no best single arrangement that wins against all other arrangements, then we should still be able to find the optimal mixed strategy.

[ 30 For the Markov chain defined above, select a pair of configurations and determine the following.

8.2.1  A brief review of game theory

In game theory, a rational player is supposed to minimize her losses against the best choice of the opponent. The terminology uses minimax= minimize maximal loss, maximin= maximize minimal gain; the amazing fact is that these are equal and often the optimal strategies are randomized (mixed).

The optimal strategy in a game is the best (in the above minimax sense) randomized strategy against every deterministic strategy of the opponent. Problem 54 What is the ``optimal" arrangement of players in soccer?

8.2.1 Modifications

As given above, the model of the soccer game is simple enough to be analyzed without computer. In a more realistic model one could consider additional factors.

8.3  One step analysis

For homogeneous Markov chains a surprising number of quantities of interest can be computed using the so called one step analysis. The idea is to find out what happens to the quantity of interest within one step of the evolution, and solve the resulting difference equation.

Example 69 In the setup of Section 7.5.3, suppose Pr(x = 1) = p = 1-Pr(x = -1) . Let T be the first time random walk reaches L > 0 or C < 0. Find the probability of winning (ruining the casino) Pr(XT = C|X0 = 0).

Note that evven though we are interested in a single number Pr(XT = C|X0 = 0), the fist-step analysis requires computing probabilities p(x) = Pr(XT = C|X0 = x) for all initial positions x.

Example 70 Suppose Pr(x = 1) = p, Pr(x = -1)1-p . Let T be the first time random walk reaches L > 0 or C < 0. Find the average length of the game E0(T).

Note that the fist-step analysis requires computing m(x) = Ex(T) for all initial positions x.

Problem 55 On average, how long does it take for a symmetric random walk on a square lattice to exit from a d×d square?

Problem 56 For the Markov chain defined in Section 8.2.1, select a pair of player configurations and determine the following.

Problem 57 A fair coin is tossed repeatedly until k = 2 successive heads appear. What is the average number of tosses required? (Hint: see Problem 8.1.1, or run a simulation.)

Problem 58 One of the quality control rules is to stop the process when on k = 8 successive days the process runs above, or below, specification line. On average, how often a process that does not need any adjustment will be stopped by this rule?

8.3.1  Example: vehicle insurance claims

Suppose that you have an insurance contract for a vehicle. The premium payment is due yearly in advance and there are four levels P1 > P2 > P3 > P4. The basic premium is P1, unless no claims were made in the previous year. If no claims were made in a given year, then the premium for the following year is upgraded to the next category.

Suppose that the probability of the damage larger than s hundred dollars is e-s (substitute your favorite density for the exponential density used in this example). Because of the incentive of lowered premium, not all damage should be reported. The goal of this example is to determine numbers s1,s2,s3,s4 above which the claims should be filed; sj is a cutoff used in a year when premium Pj is paid.

Let Xt be the type of premium paid in t-th year. Clearly, the transitions are i 1 with probability e-si, 1 2 with probability 1-e-s1, etc.

In the long run, the yearly cost is

C(s1,s2,s3,s4) =
where pj are the equilibrium probabilities and mj are average un-reimbursed costs: mj = 1000sj s e-s  ds. The optimal claim limits follow by minimizing expression (82).

[ 31 Find optimal sj when P1 = 800, P2 = 720, P3 = 650, P4 = 600 (about 10% discount).

8.3.2  Example: a game of piggy

In a game of piggy, each player tosses two dice, and has an option of adding the outcome to his score, or rolling again. The game ends when the first player exceeds 100 points.

Each player has to toss the dice at least once per turn. The player can choose to toss the dice as many times as (s)he wishes as long as no ace occurs. However the current total is added to player's score only when the player ends his turn voluntarily.

If an ace occurs, then the player's turn is over, and his score is not increased. If two aces occur then the score is set to zero.

A player chooses the following strategy: toss the dice for as long as an ace occurs, or the sum of current subtotal+score exceeds number K. (If her score exceeds K she tosses the dice once, as this is required by the rules.) The quantity to optimize is the average number of turns that takes the player to reach the score of a hundred. The number of turns under this strategy is the number of aces when the score is less than K, plus the number of throws when the score is larger than K.

If Xk denotes players score after the k-th throw, then clearly Xk is a Markov chain with the finite number of states, and with chance [1/ 36] of returning to zero at each turn.

Which value of K minimizes the average number of turns that takes the player to reach a hundred?

A simple-minded computation would just take into account the average gain and disregard the score completely. If the players subtotal is t then after the additional throw the total is on average 23/36×(t+7). This is less then t for t 12, thus there is no point continuing beyond 12. Is this conclusion correct? Should the player choose to stop once the total on the dice exceeds 12? [ 32 What is the average number of turns it takes to reach a 100 under this strategy?

A less simple-minded computation would take into account the average gain: If the players score is s then after the additional throw his score on average is 1/36 ×0+ 1/3 ×s + 23/36×(s+7) which is more then s for s < 7×23. Is this conclusion correct? Should the player always choose to toss again? [ 33 What is the average number of turns it takes to reach a 100 under this strategy?

[ 34 What is the average number of turns it takes to reach a 100 under the cautious strategy of no additional tosses?

Another computation would take into account the current score s and current subtotal t. After the additional throw the numbers on average are s1 = 1/36 ×0+ 35/36 ×s, t1 = 23/36×(t+7). On average, t1+s1 > t+s when t < 12-s/13.

More complicated strategies could depend on current totals of other players, current score of the player, and the subtotal in some more complicated fashion. For instance, if s denotes the score, t denotes current subtotal, one could stop using two-parameter criterion t > A-Bs. This may have seemingly different effects than the previous strategy: when A is large, and score s is close to 0 there is no need for stopping; but if accumulated score s is large, the player may behave more cautiously.

One could optimize the probability of winning against k = 3,4 players instead of just minimizing the average number of tosses. The latter puts this problem into game theory framework. Simulations seem to indicate that the strategy based on t < 25-1/9s works well against inexperienced human players.

8.4  Recurrence

A state x of Markov chain is recurrent, if with probability one the chain returns back to x. Otherwise it is called transient. If Xt is irreducible then all states are simultaneously either recurrent, or transient.

Theorem 23 State x is recurrent iff Px,x(t) = .

Proof. Let M be the number of times Xt returns to state x.

Let f be the probability of returning to x. State x is recurrent iff f = 1. Suppose f < 1 and let M be the number of returns. Clearly

Prx(M k) = fk, thus Ex(M) = f/(1-f) < . Since M = t I{Xt = x}, Ex(M) = Px,x(t).

Interesting fact: simple random walks in Rd are recurrent when d < 3, transient when d 3. However the return times have infinite average.

Theorem 24 Let T be a return time, and suppose m(x) = ExT < . Then Px,x(t) 1/m(x) as t.

Problem 59 Suppose Markov chain Xt moves to the right kk+1 with probability 1/k or returns to 1 with probability 1-1/k, k = 1,2,.... Find its stationary distribution, and the average return time to state k.

8.5  Simulated annealing

This sections describes a more refined method for randomized search of minima of functions.

Suppose a finite set V is given, and we are to minimize a function U:VIR.

The first step of design is to specify a connected directed graph \cal G = (V,E). In other words, for every point u V we need to pick the set of directed edges (u,v) for the markov chain to follow from state u. (This step is usually performed for computational efficiency; theoretically, all possible transitions could be admissible.)

The second step is to choose "control parameter" q that will vary as the program is running.

The third step is to define transition probabilities:

P(u,v) = C(u)e-q(U(v)-U(u))+,
where C(u,v) = v e-q(U(v)-U(u))+ is the norming constant, and the only v's considered are those with (u,v) E. (Can you explain now why \cal G shouldn't be the complete graph).

Theorem 25 The invariant measure of transition matrix (83) is pq(u) = 1/Z eqU(u)

Proof. To check the invariance condition, denote \cal N(u) = {v: (u,v) E}.

An efficient realization of the above is to pick v \cal Nu at random and accept it with probability

1 if U(v) U(u)
e-QU(v) otherwise.

8.5.1  Program listing

The program is available online, or on the disk.

PROGRAM fasttour.bas

'Declarations of subprograms
DECLARE SUB Modify (Path!(), i!)
DECLARE SUB GetNeighbor (P!(), T!)
DECLARE SUB GetRandomPair (i0!, P!(), n!)
DECLARE SUB display ()
DECLARE SUB AssignDistances (n!)
DECLARE SUB GetPermutation (P!())
'Clear display and write a title
INPUT ; "Shortest distance between how many cities?", n

nMax = 19  'only 19 cities in DATA statement
IF n > nMax THEN n = nMax
'declare matrices required
DIM P(n), BestP(n) ' permutation and best permutation P(i)=# of i-th city visited
'shared matrices
DIM SHARED dist(nMax, nMax) 'distances  between cities i, j
DIM SHARED city(nMax) AS STRING 'names of cities

'read distances
CALL AssignDistances(nMax)

'initial permutation
FOR j = 1 TO n
        P(j) = j + 1
        BestP(j) = j + 1
P(n) = 1
BestP(n) = 1

MinLen = PathLen(P())
T = 1 / 10000
DO  'infinite loop
'theta =1/temperature
T = T * 1.1
PRINT "T="; 1 / T
FOR ct = 1 TO 100 * n

        'check if key pressed
        k$ = INKEY$
        IF k$ > "" THEN EXIT DO  'exit infinite loop
        CALL GetNeighbor(P(), T)
        x = PathLen(P())
        IF x < MinLen THEN   'memorize
                MinLen = x
                FOR j = 1 TO n
                        BestP(j) = P(j)
                        PRINT city(P(j)); "->";
                NEXT j
                PRINT city(P(1))
                PRINT "Best so far: "; MinLen
        END IF
'Print final message
PRINT "Simulated Annealing Recommended Route ";
FOR j = 1 TO n - 1
        PRINT city(BestP(j)); "-->";
PRINT city(BestP(n)); "-->"; city(BestP(1))
PRINT "Total distance: "; MinLen
PRINT "Can you find a better one?"
LOCATE 22, 40
PRINT "(Distances subject to change)"

cities: 'FIRST 19 cities in alphabetic order (Source: AAA map)
DATA " Albany", "Albuquerque", "Atlanta", "Atlantic City", "Augusta"
DATA "Austin", "Baltimore","Baton Rouge", "Billings", "Birmingham", "Bismarck", "Boise", "Boston"
DATA "Buffalo", "Calgary", "Charlotte", "Cheyenne", "Chicago", "Cincinnati"

distances: 'from city number RowNumber to RowNumber+k, k=0 to nMax
DATA 0,2046,1067,315,336,1985,358,1585,2071,1167,1654,2615,174,285,2485,808,1789,823,724
DATA 0,1430,1986,2384,849,1949,1110,1006,1298,1423,1205,2222,1761,1712,1663,541,1289,1347
DATA 0,872,1287,1011,671,580,2019,158,1571,2451,1125,926,2425,258,1625,740,487
DATA 0,535,1790,163,1390,2085,968,1668,2629,373,455,2522,614,1803,837,664
DATA 0,2205,578,1805,2407,1383,1990,2951,162,621,2677,1029,2125,1159,1060
DATA 0,1627,436,1650,853,1558,2011,2043,1591,2356,1269,1204,1155,1152
DATA 0,1227,1948,805,1531,2492,416,459,2385,451,1666,700,527
DATA 0,1878,422,1589,2239,1643,1309,2443,838,1432,976,870
DATA 0,1861,417,732,2245,1786,706,2019,465,1248,1542
DATA 0, 1521,2293,1221,922,2375,416,1467,690,483
DATA 0,1149,1828,1369,927,1602,882,831,1125
DATA 0,2789,2330,875,2495,826,1792,2040
DATA 0,459,2515,867,1963,997,898
DATA 0,2200,693,1504,538,439
DATA 0,2456,1171,1685,1979
DATA 0,1669,771,490
DATA 0,966,1214
DATA 0,294

SUB AssignDistances (n)
'This reads all distances to dist(i,j)
RESTORE cities
FOR j = 1 TO n
   READ city(j)
RESTORE distances

REDIM dist(n, n)
FOR j = 1 TO n
   FOR i = j TO n
      READ x
      dist(i, j) = x
      dist(j, i) = x
      'minimal error handling
      IF x = 0 AND i <> j THEN PRINT "Error in DATA before row "; j; "col"; i: STOP
   NEXT i

SUB display
'show information about what the program does
  STATIC tst
        tst = tst + 1
        y = CSRLIN
        x = POS(1)
        LOCATE 1, 45
        PRINT "    Hit any key to stop"
        LOCATE 2, 60
        PRINT "                  "
        LOCATE 2, 60
        PRINT "Trying path #"; tst
        LOCATE y, x


SUB GetNeighbor (P(), T)
'Move at  random  to another nearby permutation P
CP = PathLen(P())
DIM Q(n)
FOR k = 1 TO n: Q(k) = P(k): NEXT
i = INT(n * RND(1)) + 1
j = INT(i * RND(1)) + 2
IF j > n THEN j = 1
SWAP Q(i), Q(j)
Prob = EXP(-T * (CP - PathLen(Q())))
SWAP P(i), P(j)

FUNCTION PathLen (P())
'Compute length of path P according to distances in Dist(i,j)
s = dist(P(1), P(n))
FOR j = 1 TO n - 1
   s = s + dist(P(j), P(j + 1))
PathLen = s

[ 35 Notice that the DO ... LOOP selecting the next entry in GetNeighbor has random length. What is the average length of its execution?

8.6  Solving difference equations through simulations

In the example below we limit our attention to the one-dimensional problem. This simplifies the notation, while the basic ideas are the same.

Let u(x,t) be an unknown function of x IRd, t 0. The difference equation we may want to solve is the following discrete analog of the diffusion equation.

u(x,t+1) = u(x,t)+A 1

v = ek 
u(x,o) = u0(x)

The solution is u(x,t) = E(u0(St)), where St = j tej is the sum of independent random variables with 2d equally likely values ek IRd.

8.7  Markov Autoregressive processes

Suppose xk is a stationary Markov chain and let Xn be the solution of the difference equation Xn+1-a Xn = xn+1. One can write the transition Matrix for Markov process Xt and try find the stationary distribution for X0.

A more direct method is based on the fact that the solution of the difference equation is Xt = atX0+at-1x1+...+axt-1+xt. Therefore if |a| < 1, the stationary initial distribution is X0 = ak xk. Thus Xt = k = 0 ak xt-k

Problem 60 Suppose xk are i. i. d. Find the covariance EX0Xk.

Problem 61 Suppose xk are i. i. d. Find E(Xk|X0).

Solutions of higher order difference equations can be easily out into the Markov framework, too. If Xn+2+aXn+1+bXn = xn+1 then Yn = (Xn+1,Xn) is Markov and satisfies the corresponding equation in matrix form: Yn+1 = AYn+Xn+1. Therefore the stationary solution exist provided that the eigenvalues of A satisfy inequality |lj| < 1.

8.8  Sorting at random

Efficiency of sorting algorithms is often measured by the number of comparisons required.

To sort efficiently a set S of numbers into ascending order we should find a number y such that about half of the elements of S is below y. Then the total number of steps required is T(n) T(/n/2)+n+C(n), where C(n) is the number of comparisons required to find y.

Random Quick Sort is based on an idea that a random choice of y is good enough on average.

Theorem 26 The expected number of comparisons in random quick sort is at most 2nln(n+1).

Here is one possible realization of the subprogram:

PROGRAM qsrt.bas

SUB QuickSort (Index(), Aux(), First%, Last%)
'sorts two matrices in increasing order by the valuies of Index() from pocz to kon
' Note: mixes order of indices corresponding to equal Index(j)

  '** Quick-sort (ascending) the fields in Array$(), from field First% thru Field Last%
 IF First% >= Last% THEN EXIT SUB
 CONST max = 30
 DIM Lft%(max + 1), Rght%(max + 1)
 Temp% = 1
 Lft%(1) = First%
 Rght%(1) = Last%
   Start% = Lft%(Temp%)
   Ende% = Rght%(Temp%)
   Temp% = Temp% - 1
    IndexLft% = Start%
    IndexRght% = Ende%
    x = Index((Start% + Ende%) \ 2)
     WHILE Index(IndexLft%) < x AND IndexLft% < Ende%
     IndexLft% = IndexLft% + 1
     WHILE x < Index(IndexRght%) AND IndexRght% > Start%
     IndexRght% = IndexRght% - 1
     IF IndexLft% > IndexRght% THEN EXIT DO 
     SWAP Index(IndexLft%), Index(IndexRght%)    '** switch elements
     SWAP Aux(IndexLft%), Aux(IndexRght%)
      IndexLft% = IndexLft% - (IndexLft% < Ende%)
      IndexRght% = IndexRght% + (IndexRght% > Start%)
    IF IndexRght% - Start% >= Ende% - IndexLft% THEN
      IF Start% < IndexRght% THEN
        Temp% = Temp% + 1
        Lft%(Temp%) = Start%
        Rght%(Temp%) = IndexRght%
      END IF
      Start% = IndexLft%
      IF IndexLft% < Ende% THEN
        Temp% = Temp% + 1
        Lft%(Temp%) = IndexLft%
        Rght%(Temp%) = Ende%
      END IF
       Ende% = IndexRght%
    END IF
   LOOP WHILE Start% < Ende%
  IF Temp% > max THEN Temp% = 0

8.9  An application: find k-th largest number

The following theorem occasionally helps to estimate the average time of accomplishing a numerical task.

Theorem 27 Suppose g(x) is increasing (non-decreasing) function and Xt is a Markov chain on IN that moves left only and E(Xt+1|Xt = m) m+g(m). Let T be the time of reaching 1. Then En(T) 1n [1/ g(x)] dx.

Proof. By induction

En(T) 1+E1X [dx/ g(x)] = 1+1n[dx/ g(x)]-EXn[dx/ g(x)] 1+1n[dx/ g(x)]-EXn[dx/ g(n)] 1+1n [dx/ g(x)]+E[(X-n)/ g(n)] [¯]
As an application we consider the following algorithm to pick the k-th number in order from a set S of n numbers.

  1. Initialize S1 = S, S2 = .
  2. Pick y at random from S1.
  3. Revise sets S1 = {x: x < y}, S2 = {x: x > y}.
  4. If |S1| = k-1 then y was found.
  5. If |S1| > k then repeat the process with new S1.
  6. If |S1 < k-1 then swap S1 and S2, replace k by k-|S1|-1, and repeat the process.

Problem 62 Estimate the average running time of the above algorithm. (ANS: ET 4lnn).

Chapter 9
Branching processes

Suppose certain objects multiply independently and in discrete time intervals. Each object at the end of every period produces a random number x of descendants (offspring) with the probability distribution pk = Pr(x = k). Let Xt be the total number of objects at t-th generation.

Then in distribution Xt+1 = j = 1Xtxj. The three questions of interest are the average size of the population, its variance, and the probability of extinction.

Definition 21 By extinction we mean the event that the random sequence {Xt} consists of zeros for all but the finite number of values of t IN.

Probability of extinction by time n can be found directly from the first-step-analysis: numbers un = Pr(Xn = 0) satisfy

un+1 =

k = 0 
pk (un)k.

9.1  The mean and the variance

Let mn = E(Xn), Vn = Var(Xn). By Theorem 7.5.1 and induction we have

mn = mn
Vn = s2mn-1(1-mn)/(1-m).

9.2  Generating functions of Branching processes

Let g(z) = k = 0 pkzk be the generating function. Clearly g(z) p0+mz for small z. Equation (86) for probabilities un of extinction by the n-th generation is un+1 = g(un).

Theorem 28 [Watson(1874)] The generating function Hn(z) of Xn is the n-fold composition (iteration) of g.

Proof. E(zXn+1|Xn = k) = (g(z))k. Thus by total probability formula (46) Hn+1(z) = EzXn+1 = k(g(z))kPr(Xn = k) = Hn(g(z)). [¯]
In particular, EXn = [d/ dz] g(n)(z)|z = 1 = mn and un = Pr(Xn = 0) = g(n)(0).

Problem 63 Prove (87) using moment generating function directly.

9.2.1  Extinction

Notice that if there is a limit of q = limng(n)(0), then it has to satisfy the equation

g(s) = s.

Since Xt is integer valued and the events An = {Xt = 0} are decreasing, by continuity (1) the probability of extinction q = limnPr(Xn = 0) exists.

Theorem 29 If EX1 1, the extinction probability q = 1. If EX1 > 1, the extinction probability is the unique nonnegative solution less than 1 of the equation (89).

Proof. This is best understood by graphing g(s) and marking the iterates on the diagonal.

Check by induction that gn(0) < 1 for all n IN. If there is a solution s0 < 1 of g(s) = s, then it is the ``attractive point" of the iteration.


9.3  Two-valued case

Below we re-analyzes the growth model presented in Section 7.5.4. Suppose that the probabilities of offspring are p0 = q, p2 = 1-q.

The generating function is g(z) = q+(1-q) z2. Asymptotic probability of extinction solves quadratic equation (1-q)z2-z+q = 0. The roots are z1 = 1 and z2 = [(q)/( 1-q)]. In particular, probability of extinction is 1 when q 1/2.

When q = 1/4 probabilities of extinction at n-th generation are u0 = 0,u1 = .25, u2 = .296875, u3 = .3161, u4 = .3249, u5 = .33127, u6 = .3323.

9.4  Geometric case

Suppose that the probabilities of offspring are p0 = 1-rq,pk = q(1-r)rk.

The generating function is g(z) = (1-rq) +q(1-r) r[z/( 1-rz)]. The most interesting feature of this moment generating function is that it can be readily composed.

[ 2 The composition of fractional linear functions f(z) = [(a+bz)/( c + dz)] and g(z) = [(A+Bz)/( C + Dz)] is a fractional linear function (with the coefficients given by the matrix multiplication!).

Asymptotic probability of extinction has to solve quadratic equation 1-rq+q(1-r) r[z/( 1-rz)] = z. The roots are z1 = 1 and z2 = [(q)/( 1-q)]. In particular, probability of extinction is 1 when q 1/2.

Since the iterates of the generating function can actually be written down explicitly, in geometric case Pr(Xn = 0) = 1-mn(1-pe)/(mn-pe) is explicit. Here pe = z2, m = ....

Problem 64 Suppose that in a branching process the number of offspring of the initial seedling has a distribution with generating function F(z). Each member of the next generation has the number of offspring whose distribution has generating function G(z). The distributions alternate between generations.

Problem 65 In a simple model of linear immunological response, the doubling probability p of the population of bacteria changes with time due to the increased number of lymphocytes. If there are X(t) bacteria at t-th generation, then assume p = a/(t+a). Find the probability ut of extinction by t-th generation for infection by 10 bacteria. What is the average length of the disease?

Chapter 10
Multivariate normal distribution

Univariate normal distribution, standardization, and its moment generating function were introduced in Chapter 4. Below we define multivariate normal distribution.

10.1  Multivariate moment generating function

We follow the usual linear algebra notation. Vectors are denoted by small bold letters x, v, t, matrices by capital bold initial letters A, B, C and vector-valued random variables by capital boldface X, Y, Z; by the dot we denote the usual dot product in IRd, ie. x·y: = j = 1d xjyj; ||x|| = (x·x)1/2 denotes the usual Euclidean norm. For typographical convenience we sometimes write (a1,,ak) for the vector

]. By AT we denote the transpose of a matrix A.

Below we shall also consider another scalar product ·,· associated with the normal distribution; the corresponding semi-norm will be denoted by the triple bar |||·|||.

Definition 22 An IRd-valued random variable Z is multivariate normal, or Gaussian (we shall use both terms interchangeably; the second term will be preferred in abstract situations) if for every t IRd the real valued random variable t·Z is normal.

Example 71 Let x1,x2, be i. i. d. N(0,1). Then X = (x1,x2,,xd) is multivariate normal.

Example 72 Let x be N(0,1). Then X = (x,x,,x) is multivariate normal.

Example 73 Let x1,x2, be i. i. d. N(0,1). Then X = (X1, X2, , XT), where

Xk = j = 1kxj are partial sums, is multivariate normal.

Clearly the distribution of univariate t·Z is determined uniquely by its mean m = mt and its standard deviation s = st. It is easy to see that mt = t·m, where m = EZ. Indeed, by linearity of the expected value mt = Et·Z = t·EZ. Evaluating the moment generating function M(s) of the real-valued random variable t·Z at s = 1 we see that the moment generating function of Z can be written as

M(t) = exp( t·m+ st2

10.2  Bivariate normal distribution

In this section we consider a pair of (jointly) normal random variables X1, X2. For simplicity of notation we suppose EX1 = 0, EX2 = 0. Let Var(X1) = s12, Var(X2) = s2 2 and denote corr(X1, X2) = r. Then the covariance matrix is

C = [
] and the joint moment generating function is

M(t1, t2) = exp( 1
t12s12+ 1
t22 s2 2+t1t2r).
If s1 s2 0 we can normalize the variables and consider the pair Y1 = X1/s1 and Y2 = X2/s2. The covariance matrix of the last pair is

CY = [
]; it generates scalar product given by




= x1y1+x2y2+rx1y2+rx2y1
and the corresponding (semi)-norm is

] ||| = (x12+x22+2rx1x2)1/2. Notice that when r = 1 the semi-norm is degenerate and equals |x1x2|.

Denoting r = sin2q, it is easy to check that

Y1 = g1cosq+ g2sinq,
Y2 = g1sinq+ g2cosq
for some i.i.d normal N(0, 1) r. v. g1, g2. One way to see this, is to compare the variances and the covariances of both sides.

This implies that the joint density of Y1 and Y2 is given by

f(x, y) = 1
exp(- 1
( x2+ y2-2xysin2q))
which is a variant of (59).

Another representation

Y1 = g1,
Y2 = rg1+   æ

illustrates non-uniqueness of the linear representation.

The latter representation makes the following Theorem obvious in the bivariate case.

Theorem 30 Let X,Y be jointly normal.

(i) If X,Y are uncorrelated, then they are independent.

(ii) E(Y|X) = m+AX is linear

(iii) Y-AX and X are independent.

Returning back to random variables X1, X2, we have X1 = g1 s1cosq+ g2 s1sinq and X2 = g1 s2sinq+ g2 s2cosq; this representation holds true also in the degenerate case.

10.2.1  Example: normal random walk

In this example we analyze a discrete time Gaussian random walk {Xk}0 k T. Let x1,x2, be i. i. d. N(0,1). We are interested in explicit formulas for the moment generating function and for the density of the IRT-valued random variable X = (X1, X2, , XT), where

Xk = k

j = 1 
are partial sums.

Clearly, m = 0. Equation (91) expresses X as a linear transformation X = Ag of the i. i. d. standard normal vector with

A =


Therefore from () we get

M(t) = exp 1
To find the formula for joint density, notice that A is the matrix representation of the linear operator, which to a given sequence of numbers (x1,x2,,xT) assigns the sequence of its partial sums (x1, x1+x2,, x1+x2++xT). Therefore, its inverse is the finite difference operator D: (x1,x2,,xT)(x1, x2-x1,, xT-xT-1). This implies

A-1 =


Since det A = 1, we get

f(x) = (2p)-n/2exp- 1
Interpreting X as the discrete time process X1, X2,, the probability density function for its trajectory x is given by f(x) = Cexp(-1/2 ||Dx||2).

Expression 1/2 ||Dx||2 can be interpreted as proportional to the kinetic energy of the motion described by the path x; assigning probabilities by Ce-Energy/(kT) is a well known practice in statistical physics. In continuous time, the derivative plays analogous role.

10.3  Simulating a multivariate normal distribution

To simulate any d-dimensional normal distribution we need only to simulate d independent N(0,1) random variables and use linear representations like in Theorem . For such simulation the covariance matrix needs to be inverted and diagonalized, a numerical nuisance in itself. When the multivariate normal distribution occurs as the so-called time series, a method based on Fourier expansion is then convenient, see Section , or the introductory examples in Chapter .

10.3.1  General multivariate normal law

The linear algebra results imply that the moment generating function corresponding to a normal distribution on IRd can be written in the form

M(t) = exp( t·m+ 1

10.4  Covariance matrix

Theorem 3.2 identifies m IRd as the mean of the normal random variable Z = (Z1, , Zd); similarly, double differentiation M(t) at t = 0 shows that C = [ci, j] is given by ci, j = Cov(Zi, Zj). This establishes the following.

Theorem 31   The moment generating function corresponding to a normal random variable Z = (Z1, , Zd) is given by (93), where m = EZ and C = [ci, j], where ci, j = Cov(Zi, Zj), is the covariance matrix.

From (93) and linear algebra we get also

M(t) = exp( t·m+ 1

We have the following multivariate generalization of (55).

Theorem 32 Each d-dimensional normal random variable Z has the same distribution as m+Ag, where m IRd is deterministic, A is a (symmetric) d×d matrix and g = (g1, , gd) is a random vector such that the components g1, , gd are independent N(0, 1) random variables.

Proof. Clearly, Eexp( t·(m+Ag)) = exp(t·m) Eexp( t·(Ag)). Since the moment generating function of g is Eexp( x·g) = exp1/2 || x|| 2 and t·(Ag) = (ATtg, therefore we get

Eexp( t·(m+Ag)) = expt·mexp+1/2 || ATt|| 2, which is another form of (94). [¯]

10.4.1  Multivariate normal density

Now we consider the multivariate normal density. The density of independent g1,..., in Theorem 10.4 is the product of the one-dimensional standard normal densities, ie.
fg (x) = (2p)-d/2exp(- 1
|| x||2).
Suppose that det C 0, which ensures that A is nonsingular. By the change of variable formula, from Theorem 10.4 we get the following expression for the multivariate normal density.

Theorem 33 If Z is centered normal with the nonsingular covariance matrix C, then the density of Z is given by

fZ (x) = (2p)-d/2 (det
A)-1exp(- 1
|| A-1x||2),
fZ (x) = (2p)-d/2 (det
C)-1/2exp(- 1
where matrices A and C are related by C = A×AT.

In the nonsingular case the density expression implies strong integrability.

Theorem 34 If Z is normal, then there is e > 0 such that

Eexp(e|| Z||2) < .

Remark 3 Theorem 10.4.1 holds true also in the singular case and for Gaussian random variables with values in infinite dimensional spaces.

10.4.2  Linear regression

For general multivariate normal random variables X and Y we have the following linearity of regression result.

Theorem 35 If (X, Y) has jointly normal distribution on IRd1+d2, then

E{X|Y} = a+QY;
Random vectors Y-QY and X are stochastically independent.

Vector a = mX-QmY and matrix Q are determined by the expected values mX, mY and by the (joint) covariance matrix C (uniquely if the covariance CY of Y is non-singular). To find Q, multiply (95) (as a column vector) from the right by (Y-EY)T and take the expected value. By Theorem (i) we get Q = R×CY-1, where we have written C as the (suitable) block matrix

C = [

Problem 66 For the random walk from Section 10.2.1, what is E(Xk|X1,...,Xk-1,Xk+1,...,Xn)?

Problem 67 Suppose X1,...,Xd are jointly normal, EXj = 0, EXj2 = 1 and all covariances EXiXj = r are the same for i j.

Find E(X1|X2,X3,...,Xd). (Notice that in this example r > -1/d.)

10.5  Gaussian Markov processes

Suppose (Xt)t = 0,1,... is a Markov chain with multivariate normal distribution. That is, suppose X0 is normal, and the transition probabilities are normal, too.

Without loss of generality we assume EXt = 0, EXt2 = 1 and let EX0X1 = r. Then E(Xt+1|Xt) = rXt and therefore EX0Xt = rt. This means that the covariance matrix of the Markov Gaussian process depends on one parameter r only. Comparing the answer with Section 8.7 we find out that all homogeneous Markov Gaussian processes have the form Xt = k = 0 gk+trk, where gk are independent normal r. v.

Chapter 11
Continuous time processes

Continuous time processes are the families of random variables Xt, with t 0 interpreted as time.

11.1  Poisson process

Poisson distribution occurs as an approximation to binomial. Another reason for its occurrence is related to exponential random variables and counting customers in queues. The latter is perhaps the most efficient way of simulating Poisson random variables, see Section 6.2.4. A related reason for the frequent use of Poisson distribution in modeling is the law of rare events.

Definition 23 A Poisson process of intensity l > 0 is an integer-valued stochastic process {Nt:} with the following properties.

Suppose cars pass by an intersection and the times between their arrivals are independent and exponential. Thus we are given i. i. d sequence Tj of exponential r. v. (with parameter l). The number of cars that passed by within time interval (0,t) is a random variable Nt. Clearly, Nt is the first integer k such that T1+...Tk > t.

Theorem 36 For t > 0,s > 0 random variable N(t+s)-N(t) is independent of Nt and has Poisson distribution Poiss(sl).

Proof. We will prove only that Nt has the Poisson distribution. To simplify notation assume that exponential r.`v. have parameter l = 1. We prove the formula Pr(Nt = k) = [(tk)/ k!]e-t by induction.

k = 0: Pr(Nt = 0) = Pr(T > t) = t e-x  dx = e-t

Suppose the formula is true for k. Then

(Nt = k+1) = Pr
(T1+...+Tk+1 > t) =

1/G(k+1)xke-x  dx.
Integrating by parts we check that
(Nt = k+1) = t
(Nt = k).
Similar processes are considered in reliability theory, and in queueing theory, also for non-exponential sojourn times Tj.

Problem 68 Assume a device fails when a cumulative effect of k shocks occurs. If the shocks occur according to the Poisson process with parameter l, what is the density function for the life T of the device?

Problem 69 Let f(x,t) = Ef(x+Nt), where Nt is the Poisson process of intensity l = 1. Show that [(f)/(t)] = f(x+1)-f(x).

In particular, pt(k) = Pr(Nt = k) satisfies [(pt(k))/(t)] = pt(k+1)-pt(k).

Problem 70 Customers arrive at a facility at random according to a Poisson process of rate l. The customers are dispatched (processed) in groups at deterministic times T, 2T,3T,....

There is a waiting time cost c per customer per unit of time, and a dispatch cost K.

Problem 71 Find the mean ENt, variance Var(Nt) and the covariance cov(Nt,Ns).

Problem 72 Let X(t) = (-1)Nt. Find the mean EXt, variance Var(Xt) and the covariance cov(Xt,Xs).

The rate l in the Poisson process has probabilistic interpretation:

l =
h 0 
(Nt+h-Nt = 1)

In many applications we wish to consider rates l(t) that vary with time. The corresponding process is just a time-change of the Poisson process X(t) = NL(t), where L(t) = 0tl(s)ds.

11.1.1  The law of rare events

Let N(I) denote the number of events that occur in interval I. We make the following postulates.

  1. If intervals I1,... Ir are disjoint, then random variables {N(Ij)} are independent.
  2. For any t and h > 0 the probability distribution of N((t,t+h]) does not depend on t.
  3. [(Pr(N(Ih) 2))/ h] 0 as h 0
  4. [(Pr(N(Ih) = 1))/ h] l as h 0

Theorem 37 Nt = N((0,t]) is the Poisson process of intensity l.

Example 74 A careless programmer assigns memory locations to the variables in his program30 at random . Suppose that there are M locations and N = lM variables. Let Xi be the number of variables assigned to each location. If each location is equally likely to be chosen, show that

In the limit, what fraction of memory locations has two or more variables assigned?

Example 75 While testing a program, the number of bugs discoverd in the program follows the Poisson process with intensity l = 5 errors per hour. Tester's fiance enters the test area and agrees to wait for the tester to find just one more bug. How long will she wait on average: 12 minutes, or 6 minutes?

11.1.2  Compound Poisson process

The Poisson process Nt counts the number of events. If each event results in a random (and independent) outcome xj, then the total is the compound Poisson process Z(t) = j = 1Nt xj.

The moments and also the moment generating function of Z(t) can be determined through conditioning. Section 7.5.1 implies that if Ex = m, Var(x) = s2 then E(Z(t)) = lmt, Var(Z(t)) = l(s2+m2)t.

11.1.2 Examples

  1. Risk assessment: Insurance company has M customers. Suppose claims arrive at an insurance company in accordance with the Poisson process with rate lM. Let Yk be the magnitude of the k-th claim. The net profit of the company is then Z(t)-M qt, where q is the (fixed in this example) premium.
  2. A shock model: Let Nt be the number of shocks to a system up to time t and let xk denote the damage or wear incurred by the k-th shock.

    If the damage accumulates additively, then Z(t) represents the total damage sustained up to time t. Suppose that the system continues to operate until this total damage is less than some critical value c and fails otherwise. Then the (random) failure time T satisfies T > t if and only if Z(t) < c. Therefore Pr(T > t) = n = 0Pr(k = 1nxk z|Nt = n)(lt)n e-lt/n!. Thus ET = [1/( l)]n = 0 Pr(j = 1nxj c). In particular if xk are exponential ET = [(1+cm)/( l)].

11.2  Continuous time Markov processes

Given a discrete-time Markov chain, there are many ways of ``running it" in continuous time. One particular method is to make the moves at random moments of time. If these instances are exponential, then the resulting continuous-time process is Markov, too. This is the so-called embedded Markov chain.

Non-pathological continuous time Markov processes with countable state space have embedded Markov chain representation. In such representation we run a continuous time clock based on the independent exponential random variables. Once the time comes, we select the next position according to the transition probabilities of a discrete-time Markov chain.

The theory of continuous time Markov chains (that is - processes with countable state space) is similar to discrete time theory. The linear first-order difference equations for probabilities are replaced by the systems of first-order differential equations.

Example 76 Suppose Xn is a two-state Markov chain with the following transitions: 0 1 with probability a, 1 0 with probability b. From Section 8.1 we know that P(Xk = 1) a/(a+b) as k.

Let Tk be i. i. d. exponential r. v. and let Y(t) = Xk when T1+...+Tk < t < T1+...+Tk+1.

Function p(t) = Pr(Y(t) = 1) satisfies differential equation: p(t) = -lp(t)+lb p(t)+ la (1-p(t). Indeed, Pr(Y(t+h) = 1) Pr(Y(t) = 1, T > h)+Pr(Y(t) = 0,T < h).

Therefore p(t) = a/(a+b)+b/(a+b) exp(-l(1-b+a) t).

Here is a slightly different method to run the continuous time finite-valued Markov chain.

Pick the initial value according to prescribed distribution. Then select an exponential random variable for each of the possible transitions. Each of these can have different parameter lk. Then select the smallest, T = min Tj. It can be shown that T is exponential with parameter lj and that Pr(T = Tj) = [(lj)/( klk)].

11.2.1  Examples of continuous time Markov processes

Despite many similarities, continuous models differ from discrete ones in their predictions.

11.2.1 Growth model

In Section 7.5.4 we considered discrete-time growth model which assumed that bacteria divide at fixed time intervals. This assumption is well known not to be satisfied - mitosis is a process that consists of several stages of various lengths, of which the longest may perhaps be considered to have exponential density. In this section we shall assume that a colony of bacteria consists of X(t) cells which divide at exponential moments of time, or die. We assume individual cells multiply at a rate a and die at a rate b. One way to interpret these numbers is to assume that there are two competing effects: extinction or division. When there are k such cells, the population grows one cell at a time, rate of growth is ak, rate of death is kb. We assume a > b.

Let pk(t) = Pr(X(t) = k). Then pk(t) = -(a+b)kpk+(k+1)bpk+1+a(k-1)pk-1. Population average m(t) = k k pk(t) satisfies m(t) = (a-b)m(t), thus m(t) = e(a-b)t.

Problem 73 Suppose Xt is a Markov process whose birth rate is an+a and death rate is bn with b > a. This describes a population of species that die out in a given habitat, but have a constant rate of ``invasion". One would expect that such competing effects will result in some sort of equilibrium. Find the average population size as t.

Example 77 Suppose Xt = (-1)Nt, where Nt is the Poisson Process. Is Xt Markov?

[ 36 Customers arrive at a burger outlet at a rate l, and after exponential service time with mean 1/m leave.

[ 37 Customers arrive at a burger outlet at a rate l, and after exponential service time with parameter m leave. If there second cashier is opened, the service time will be reduced twice on average, but the cost of hiring the second cashier is $5 per hour. A customer purchases of average $5, with the profit of $2 over the costs. If k customers are waiting in the line, the next person driving by will stop with probability 2-k.

11.3  Gaussian processes

Continuous-time Markov processes with uncountable state space require more advanced mathematical tools. Only Gaussian case can be briefly mentioned here.

Definition 24 A stochastic process {Xt}0 t < is Gaussian, if the n-dimensional r. v. (Xt1,, Xtn) has multivariate normal distribution for all n 1 and all t1,,tn [0,).

11.4  The Wiener process

The simplest way to define the Wiener process is to list its properties as follows. Definition 25 The Wiener process {Wt} is a Gaussian process with continuous trajectories such that

0 for all t 0;
{t, s} for all t, s 0.

A stochastic process {Xt}t [0,1] has continuous trajectories if it is defined by a C[0,1]-valued random vector, or if all of its paths are continuous31. For infinite time interval t [0,), a stochastic process has continuous trajectories if its restriction to t [0,N] has continuous trajectories for all N IN.

Conditions (97)-(99) imply that the Wiener process has independent increments, ie. W0, Wt-W0, Wt+s-Wt, are independent.

Series expansions for the Wiener process are available in the literature. One way to obtain these is from Fourier expansion for the covariance function.

Problem 74 Let u(x,t) = Ef(Wt+x), where f is a smooth function. Show that u satisfies the parabolic equation [(u)/( t)] = 1/2 [(2 u)/( x2)].

Problem 75 What partial differential equation is solved by u(x,t) = Ef(aWt+x+bt) when a,b are non-zero constants?

Scaled Wiener process is a good model of diffusion. Use the two-dimensional Wiener process to model a two-dimensional diffusion. The two-dimensional Wiener process is obtained from two independent one-dimensional components. The diffusion coefficient a2 has units [length2/time] and is implemented by scaling the Wiener process: aWt has diffusion coefficient a2.

[ 38 An eye-irritant pollutant is emitted from a factory chimney located at x = 0, y = 10 on the xy plane, and the wind blows left-to-right with velocity v(y) = y at height y. At a distance L = 100 down-wind, there is a residential building of height 15. Which floor of the building is polluted the most? Is there a significant difference in pollution level between the floors? (Assume that the distances are in units such that the diffusion coefficient is 1.)

Chapter 12
Time Series

This chapter contains additional topics on discrete time processes. We begin with several examples of possible simulations. The resulting random curves are very different, but they have the same ``mean-square" behavior.

Example 78 Suppose we wish to pick a curve at random. In other words, we need a random function X(t) of integer parameter t. Here is one way to do it: take X(t) = kgkakcos(kt), where gk are i. i. d. normal N(0,1). To ensure the series is convergent we need k ak2 < ; thus we can assume ak = 02p g(q)cos(kq) dq. The theory of Fourier series tells us that if g2(q) < , then the coefficients ak are square-summable.

Example 79 Suppose we wish to pick a curve at random. Here is one way to do it: take X(t) = cj cos(2pt+Fj), where Fj are independent and uniformly distributed on the interval (0,2p). Again select ck = 02p g(q)cos(kq) dq.

[ 39 Write simulations of the curves as described. Pick as g a trigonometric polynomial, say g(q) = 1+cos(q)cos(2q).

12.1  Second order stationary processes

Stationary processes are those that their probabilistic characteristics (distributions, conditional distributions, moments, covariances) do not change with time. Example 80 Suppose Xn is a Markov chain, and Pr(X0 = j) = p(j), where pj is its invariant distribution. Then (X0,Xk) @ (Xn,Xn+k).

For second-order stationary processes only means, variances, and covariances do not change with time. That is, m(t) = EX(t) = const,cov(X(t), X(s)) = K(t-s). The second-order theory of processes is a very coarse theory. Nevertheless, it does solve the best linear prediction problem.

[ 4 In each of the introductory Examples, E X(t) = 0 and cov(X(t),X(s)) = K(t-s), where K(t) = 02p cos(qt)f(q)  dq.

12.1.1  Positive definite functions

The covariance K(t) of the weakly stationary process is a positive definite function. That is, ci cj K(ti-tj) = E|j cjX(tj)|2 0 for all cj \sf CC, tj IR. In addition, K(t) = K(-t). Theorem 38 Given a positive definite even function K:ZZ IR, there is a (0,2p)-valued random variable Q such that for all integer t

K(t) = K(0)Ecos(tQ).

[ 5 Suppose Q from Theorem 12.1.1 has density f(q). Let g(q) = {f(q)} and define Xt as in Example 12. Then Xt is Gaussian, mean zero, with covariance (100).

12.2  Trajectory averages

When a time series Xt is observed, 1/n(X1+...+Xn) is the ``trajectory average". It is interesting how does this compare to the ``probabilistic" average EX.

The following theorem follows immediately from the proof of Theorem 5.2. The assumption holds true in particular when Xn has spectral density. Theorem 39 Suppose Xn is weakly stationary. If cov(X0,Xn) 0 as n then 1/n(X1+...+Xn) EX in L2-norm.

Proof. Compute the variance, and use the Cesaro summability result (Theorem ). [¯]

The covariance argument can be rewritten in spectral notation. Suppose EX0Xk = (2p)-1-pp eiks f(s)  ds. Then E(X1+...+Xn)2 = (2p)-1-pp |k = 1neiks|2 f(s)  ds, so Var(1/n(X1+...+Xn)) = (2p)-1[1/( n2)]-pp [(sin2 (1/2ns))/( sin2 (1/2s))] f(s)  ds 0 as n.

12.3  The general prediction problem

The basic problem in filtering and prediction is as follows. Given variables X1,...,Xn, find the estimator of the value of Y with smallest quadratic error. Case n = 1 is presented is Sections 2.11.1 and 2.14 for the linear and non-linear case.

In the linear prediction problem we deal with linear estimators a0+jajXj. The quadratic error involves variances, covariances and averages only. Thus it is appropriate to handle this in through the second order processes, and the solution should depend on the density of Q - the so called spectral density only.

The general Hilbert space theory tells us that the best linear prediction of Xt+1 based on the past is j = 0t ajXj with coefficients aj such that EXj(Xt+1- j = 0t ajXj) = 0 for all 0leq j t. This is a system of t+1 linear equations for t+1 unknown coefficients. Gramm-Schmidt orthogonalization allows to replace Xj by orthonormal xj. Optimal prediction uses j = 0t aj xj with aj = EX0xt-j.

12.4  Autoregressive processes

Suppose xk are i. i. d. A stochastic process Xt is an autoregressive process, if it satisfies a linear difference equation

Xt+1 = d

j = 0 
aj Xt-j + xt+1
with random coefficients xt+1 . Example 81 Autoregressive process Xt+1 = aXt+xt+1 is Markov. For |a| < 1 it has a stationary distribution. What is it? What happens for |a| > 1?

Example 82 A moving average Xt+1 = 1/dj = 1dxt-j is an autoregressive process. What is the corresponding difference equation (101)?

For autoregressive process the optimal one-step prediction of Xt+1 is


j = 0 
The spectral theory asserts that this is best linear prediction. However if xj are i. i. d., then this is actually optimal non-linear prediction as well.

More general autoregressive sequences are defined as solutions of


j = 0 
aj Xt-j =

bi xt-i

These generalize simultaneously autoregressive and moving average processes.

Chapter 13
Additional topics

13.1  A simple probabilistic modeling in Genetics

First we describe the population at a single instance. We consider the model in which each hereditary character is carried by a pair of genes. For simplicity, we assume only one (pair) of a gene, and only one hereditary character corresponding to one locus. There are several possible alleles (categories) in a locus - we assume there are only two, denoted by a,A, of which A is dominant. Each individual thus is described by one of the pairs AA, Aa, aa, the so called phenotype . However, a is obstructed from the view by A, thus for an outside observer individuals AA and Aa are indistinguishable. A statistical study of such a population can only yield the proportion PA of individuals with "A-feature", and not the actual probabilities of the three possible phenotype.

Now we turn to the modelling of the generation change. Under simple assumptions we shall be able to find out what are the frequencies of phenotype and genotypes. Usually this information is not directly available. We assume that the next generation occurs by random mating. Let

pAA(0),pAa(0),paa(0) be the actual (and as yet unknown) probabilities of the phenotype. Under random mating with independent selection of parents, the probability that an offspring has phenotype AA is pAA(1) = (pAA(0)+1/2pAa(0))2.

Denoting by pA(t) = pAA(t)+1/2pAa(t) we get the Hardy-Weinberg equilibrium: after one generation the proportions pA(t) of genotypes stabilize and the phenotype frequencies become

PAA = pA2
PAa = 2papA
Paa = pa2

This determines the actual proportions of the phenotype from the proportion of observed A-carriers and a-carriers.

Problem 76 Show that PAa = 2({Paa}-Paa).

The next question is to study the effects of the selection, where, say phenotype aa has different chance of survival. This leads to total probability formula and Markov chains.

13.2  Application: verifying matrix multiplication

Suppose one has an algorithm to multiply large matrices and we want to check if the output is correct. A possible method is to pick the vector X of 0,1 and check if AB X = CX. This is the so called Freivalds technique

Theorem 40 If A,B,C are n×n matrices such that AB C then Pr(ABX = CX) 1/2.

Proof. For a non-zero v we have Pr( |v·X| > 0) < 1/2 [¯]

13.3  Exchangeability

Definition 26 A sequence (Xk) of random variables is exchangeable, if the joint distribution of Xs(1),Xs(2), , Xs(n) is the same as the joint distribution of X1, X2, , Xn for all n 1 and for all permutations s of {1, 2, ,n}.

The following beautiful theorem due to B. de Finetti points out the role of exchangeability as a substitute for independence.

Theorem 41 Suppose that X1, X2, is an infinite exchangeable sequence. Then there exist a s-field \cal N such that X1, X2, are \cal N-conditionally i. i. d., that is

P(X1 < a1, X2 < a2, , Xn < an |\cal N)
= P(X1 < a1|\cal N) P(X1 < a2 |\cal N)P(X1 < an |\cal N)
for all a1, , an IR and all n 1.

We will use the following (weak) version of the martingale32 convergence theorem.

Theorem 42 Suppose \cal Fn is a decreasing family of s-fields, ie. \cal Fn+1 \cal Fn for all n 1. If X is integrable, then E{X|\cal Fn} E{X|\cal F} in L1-norm, where \cal F is the intersection of all \cal Fn.

Proof. Suppose first that X is square integrable. Subtracting m = EX if necessary, we can reduce the convergence question to the centered case EX = 0. Denote Xn = E{X|\cal Fn}. Since \cal Fn+1 \cal Fn, by Jensen's inequality EXn2 0 is a decreasing non-negative sequence. In particular, EXn2 converges.

Let m < n be fixed. Then E(Xn-Xm)2 = EXn2+EXm2-2EXnXm. Since \cal Fn \cal Fm, by Theorem we have

EXnXm = EE{XnXm|\cal Fn} = EXnE{Xm|\cal Fn}
= EXnE{E{X|\cal Fm}|\cal Fn} = EXnE{X|\cal Fn} = EXn2.
Therefore E(Xn-Xm)2 = EXm2-EXn2. Since EXn2 converges, Xn satisfies the Cauchy condition for convergence in L2 norm. This shows that for square integrable X, sequence {Xn} converges in L2.

If X is not square integrable, then for every e > 0 there is a square integrable Y such that E|X-Y| < e. By Jensen's inequality E{X|\calFn} and E{Y|\cal Fn} differ by at most e in L1-norm; this holds uniformly in n. Since by the first part of the proof E{Y|\calFn} is convergent, it satisfies the Cauchy condition in L2 and hence in L1. Therefore for each e > 0 we can find N such that for all n, m > N we have E{|E{X|\cal Fn}-E{X|\cal Fm}|} < 3e. This shows that E{X|\cal Fn} satisfies the Cauchy condition and hence converges in L1.

The fact that the limit is X = E{X|\cal F} can be seen as follows. Clearly X is \cal Fn-measurable for all n, ie. it is \cal F-measurable. For A \cal F (hence also in \cal Fn), we have EXIA = EXnIA. Since

|EXnIA-EX IA| E|Xn-X |IA E|Xn-X | 0, therefore EXnIA EX IA. This shows that EXIA = EX IA and by definition, X = E{X|\cal F}. [¯]
Proof of Theorem 13.3. Let \cal N be the tail s-field, ie.

\cal N =

k = 1 
s(Xk, Xk+1, )
and put \cal Nk = s(Xk, Xk+1, ). Fix bounded measurable functions f,g, h and denote
Fn = f(X1, , Xn);
G n, m = g(Xn+1, , Xm+n) ;
H n, m, N = h(Xm+n+N+1, Xm+n+N+2, ),
where n, m, N 1. Exchangeability implies that

EFnGn, mHn, m, N = EFnGn+r, mHn, m, N
for all r N. Since Hn, m, N is an arbitrary bounded \cal Nm+n+N+1-measurable function, this implies

E{FnGn, m|\cal Nm+n+N+1} = E{FnGn+r, m|\cal Nm+n+N+1}.
Passing to the limit as N , see Theorem 13.3, this gives
E{FnGn, m|\cal N} = E{FnGn+r, m|\cal N}.

E{FnGn, m|\cal N} = E{Gn+r, mE{Fn|\cal Nn+r+1}|\cal N}.
Since E{Fn|\cal Nn+r+1} converges in L1 to E{Fn|\cal N} as r , and since g is bounded,
E{Gn+r, mE{Fn|\cal Nn+r+1}|\cal N}
is arbitrarily close (in the L1 norm) to

E{Gn+r, mE{Fn |\cal N}|\cal N} = E{Fn |\cal N} E{Gn+r, m |\cal N}
as r . By exchangeability E{Gn+r, m |\cal N} = E{Gn, m |\cal N} almost surely, which proves that
E{FnGn, m|\cal N} = E{Fn|\cal N} E{Gn, m |\cal N}.
Since f, g are arbitrary, this proves \cal N-conditional independence of the sequence. Using the exchangeability of the sequence once again, one can see that random variables X1, X2, have the same \cal N-conditional distribution and thus the theorem is proved. [¯]

13.4  Distances between strings

A string is a sequence of letters, or symbols from the finite alphabet. For the purpose of computer modelling, we can assume that a string is a sequence of natural numbers {1,... d}, parameter d being the size of the alphabet.

Three simple examples of strings are the words, sentences in, say, English, and DNA molecules. Here d = 26 (for lower-case words), d = 94 for sentences, and d = 6 for the DNA (there are only four proteins, but extra symbols are used to mark various undecided cases).

The question of comparing two strings for similarities arises in molecular biology and in designing a spell-checker, or a speech recognition system. Accordingly, one would like to say which strings are similar, and how likely it is that they are similar due to chance only. Additional complications arise from the fact that two strings compared do not necessarily have the same length.

A simple way to compare two strings is to measure the number of symbols that don't match (the hamming distance). For instance abbacd and babacd would then have distance 2. But abbacd and bbacda would have distance 6, even though they differ just by one transposition.

A less obvious way to compare two strings is to measure the edit distance: how many editing operations are needed to transform one of the strings into another. Usually the editing operations are:

These are suitable for spell-checkers, where it is known that about 80% of typing errors are of the above form, thus most of mistyped words have edit-distance 1 from the original.

Accordingly, the edit distance is set to 0, if the words are identical, 1 if they differ by a single error of one of the listed types, 2 if there were two such errors, etc. Formally, it is defined as the smallest number of elementary ``TE" transformations required to transform one of the words into another.

The method of computation is based on recurrence. It is easy to see that a number of transformations between and empty word and another one is exactly the length33 of the word. If the two words U, W are formed from shorter ones U0,W0 by adding letters at the end, then the distance Dist(U,W) is the smallest of the numbers

Here is a complete VB-listing:

Function Dist (U$, V$) As Integer
'Returns the edit distance
'(number of elementary changes: replacement, deletion, insertion,transposition)
'that are required to transform word U$ into V$
'Declare auxiliary variables
Dim m As Integer, n As Integer, j As Integer, i  As Integer
Dim x As Integer, y As Integer, z As Integer, A$, B$
If Len(U$) < Len(V$) Then A$ = U$: B$ = V$:  Else A$ = V$: B$ = U$
m = Len(A$)
n = Len(B$)
'Declare matrix of distances between substrings of i,j characters
ReDim D(m, n) As Integer
'Assign boundary values: distances from empty string
 For i = 0 To m: D(i, 0) = i: Next i
 For j = 1 To n: D(0, j) = j: Next j
'Main recurrence: Compute next distance D(i,j) from previously found values
  For i = 1 To m
        For j = 1 To n
          x = D(i - 1, j) + 1   'delete character
          y = D(i, j - 1) + 1   'inserte character
          x = Intmin(x, y)         'choose better (Integer Minimum)
          y = D(i - 1, j - 1) - (Mid$(A$, i, 1) <> Mid$(B$, j, 1)) 'swap characters i,j
          x = Intmin(x, y)        ' choose better
          z = 0
          If i > 1 And j > 1 Then
              'If Mid$(A$, i, 1) <> Mid$(B$, j, 1) Then
              z = (Mid$(A$, i, 1) = Mid$(B$, j - 1, 1)) * (Mid$(A$, i - 1, 1) = Mid$(B$, j, 1))
              'End If
          y = (1 + D(i - 2, j - 2)) * z + x * (1 - z)
          x = Intmin(x, y)
          End If
          D(i, j) = x
      Next j
  Next i
Dist = D(m, n)'current value
End Function

The main problem with using the edit distance to analyze DNA sequences is processing time. (Speed-ups are known.)

[ 40 Write a program computing the edit distance between strings, and another one, which does the editing by simulation. (Try the randomization based on random number of edit operations from the currently best candidate)

13.5  A model of cell growth

A cell in its growth goes through several phases, which have different probabilistic characteristics. In a simple model, the cell doubles after a random time, which is the sum of the exponential and deterministic portion. The average of the exponential phase can be assumed to depend on the external circumstances.

13.5 Questions of interest

How does the growth of cells affect other cells? How to control mixed populations of cells to stay within prescribed limits?

13.6  Shannon's Entropy

Let X1,... Xn be independent identically distributed (i. i. d.) discrete r. v with k values, say {v1,..., vk}. Put X = (X1,...,Xk).

For a fixed vector y we have thus the joint probability mass function f(x) = Pr(X = x). The average information H(X) contained in X is defined as

H(X) = -Elogf(X) = -


Problem 77 Prove Gibbs' inequality j pjlogpj pj logqj for all qj > 0, qj = 1.

Notice that H(X) 0 and H(X) logk

13.6.1  Optimal search


Huffman's code

Problem 78 Suppose you have 15 identical in appearance coins, except that one of them has a different weight. Find the optimal34 weighting strategy to identify the odd coin by using a scale.

13.7  Application: spread of epidemic

Modeling of the spread of disease is complicated due to multiple factors that influence its development. The birth-and-death process does not seem to be a good model for the spread of an epidemic in a finite population, since when a large proportion of the population has been infected, we cannot suppose that the rate of infections is independent of past history.

Appendix A
Theoretical complements

A.1  Lp-spaces

Inequalities related to expected values are best stated in geometric language of norms and normed spaces. We say that X Lp, if X is p-integrable, i.e. E|X|p < . In particular, X is square integrable if EX2 < .

The Lp norm is

||X||p =

if p 1;
ess sup|X|
if p = .

Notice that ||X-EX||2 is just the standard deviation.

We say that Xn converges to X in Lp, if ||Xn-X||p 0 as n . If Xn converges to X in L2, we shall also use the phrase sequence Xn converges to X in mean-square. An example of the latter is Theorem 5.2.

Several useful inequalities are collected in the following. Theorem 43

Special case p = q = 2 of Hölder's inequality (110) reads EXY [(EX2EY2)]. It is frequently used and is known as the Cauchy-Schwarz inequality.

For the proof of Theorem A.1 we need the following elementary inequality. [ 3 For a,b > 0, 1 < p < and 1/p+1/q = 1 we have

ab ap/p+bq/q.

Proof. Function t tp/p+t-q/q has the derivative tp-1-t-q-1. The derivative is positive for t > 1 and negative for 0 < t < 1. Hence the maximum value of the function for t > 0 is attained at t = 1, giving

tp/p+t-q/q 1.
Substituting t = a1/q b-1/p we get (112). [¯]
Proof of Theorem A.1 (ii). If either ||X||p = 0 or ||Y||q = 0, then XY = 0 a. s. Therefore we consider only the case ||X||p||Y||q > 0 and after rescaling we assume ||X||p = ||Y||q = 1. Furthermore, the case p = 1, q = is trivial as |XY| |X| ||Y||. For 1 < p < by (112) we have
|XY| |X|p/p+|Y|q/q.
Integrating this inequality we get |EXY| E|XY| 1 = ||X||p||Y||q. [¯]
Proof of Theorem A.1 (i). For p = 1 this is just Jensen's inequality; for a more general version see Theorem . For 1 < p < by Hölder's inequality applied to the product of 1 and |X|p we have

||X||pp = E{|X|p ·1} (E|X|q)p/q (E1r)1/r = ||X||qp,
where r is computed from the equation 1/r+p/q = 1. (This proof works also for p = 1 with obvious changes in the write-up.) [¯]
Proof of Theorem A.1 (iii). The inequality is trivial if p = 1 or if ||X+Y||p = 0. In the remaining cases

||X+Y||pp E{(|X|+|Y|)|X+Y|p-1} = E{|X||X+Y|p-1}+ E{|Y||X+Y|p-1}.
By Hölder's inequality
||X+Y||pp ||X||p||X+Y||pp/q+||Y||p||X+Y||pp/q.
Since p/q = p-1, dividing both sides by ||X+Y||pp/q we get the conclusion. [¯]

A.2  Properties of conditional expectations

In more advanced courses conditional expectation E(X|Y) is defined as a random variable f(Y) that satisfies EXf(Y) = Ef(Y)f(Y) for all bounded measurable (or just continuous) functions f.

The next theorem lists useful properties of conditional expectations.

Theorem 44

Remark 4 Inequality (iv) is known as Jensen's inequality and this is how we shall refer to it.

The abstract proof uses the following35. [ 4 If Y1 and Y2 are \cal F-measurable and AY1 dP A Y2 dP for all A \cal F, then Y1 Y2 almost surely. If AY1 dP = A Y2 dP for all A \cal F, then Y1 = Y2.

Proof. Let Ae = {Y1 > Y2+e} \cal F. Since AeY1 dP AeY2 dP + eP(Ae), thus P(Ae) > 0 is impossible. Event {Y1 > Y2} is the countable union of the events Ae (with e rational); thus it has probability 0 and Y1 Y2 with probability one.

The second part follows from the first by symmetry. [¯]
Proof of Theorem A.2.

(i) This is verified first for Y = IB (the indicator function of an event B \cal F). Let Y1 = E{XY|\cal F}, Y2 = YE{X|\cal F}. From the definition one can easily see that both AY1 dP and A Y2 dP are equal to A B X dP. Therefore Y1 = Y2 by the Lemma A.2.

For the general case, approximate Y by simple random variables and use (vi).

(ii) This follows from Lemma A.2: random variables Y1 = E{X|\cal F}, Y2 = E{X|\cal G} are \cal G-measurable and for A in \cal G both AY1 dP and A Y2 dP are equal to AX dP.

(iii) Let Y1 = E{X|\cal N\cal F}, Y2 = E{X|\cal F}. We check first that

Y1 dP =

Y2 dP
for all A = BC, where B \cal N and C \cal F. This holds true, as both sides of the equation are equal to P(B)CX dP. Once equality AY1 dP = A Y2 dP is established for the generators of the s-field, it holds true for the whole s-field \cal N\cal F; this is standard measure theory36.

(iv) Here we need the first part of Lemma A.2. We also need to know that each convex function g(x) can be written as the supremum of a family of affine functions fa, b (x) = ax+b. Let Y1 = E{g(X)|\cal F}, Y2 = fa, b(E{X|\cal F}), A \cal F. By (vi) we have

Y1 dP =

g(X) dP fa, b(

X) dP = fa, b(

E{X|\cal F}) dP =

Y2 dP.
Hence fa, b(E{X|\cal F}) E{g(X)|\cal F}; taking the supremum (over suitable a, b) ends the proof.

(v), (vi), (vii) These proofs are left as exercises. [¯]
Problem 79 Prove part (v) of Theorem A.2.

Problem 80 Prove part (vi) of Theorem A.2.

Problem 81 Prove part (vii) of Theorem A.2.

Problem 82 Prove the following conditional version of Chebyshev's inequality: if E|X| < , then

P(|X| > t |Y) E{|X| |Y}/t
almost surely.

Problem 83 Show that if (U,V,X) are such that in distribution (U,X) @ (V,X) then E{U|X} = E{V|X} almost surely.

Problem 84 Show that if X, Y are integrable non-degenerate random variables, such that

E{X|Y} = aY,  E{Y|X} = bX,
then |ab| 1.

Problem 85 Show that if X, Y are integrable such that E{X|Y} = Y and E{Y|X} = X, then X = Y a. s.

Appendix B
Math background

The following sections are short reference on material from general math (calculus, linear algebra, etc).

B.1  Interactive links

The following links are operational as of Mar 13, 1998 . Please note that these may change at any time.

B.2  Elementary combinatorics

The art of counting is called combinatorics. Here is a short listing of the formulas. All are the consequences of the product rule of counting.

Permutation is an arrangement (ordering) of k out of n distinct objects without repetitions. The number of permutations is [n!/((n-k)!)]. In particular, the number of ways to order n objects is n!
Combinations: Combinations are k-element subsets of n distinct elements. The number of combinations is (nk).
Variations: Variations are arrangements of k out of n distinct objects with repetitions allowed. The number of variations is nk

B.3  Limits

The following limit can be computed by L'Hospital rule.

(1+a/n)n = ea
The Cesaro summability formula is Theorem 45 If an a then 1/n(a1+...+an) a.

B.4  Power series expansions

The following power series expansions are of interest in this course.

ex =

k = 0 

k = 0 

They give immediately the expansions

ln1+x =

k = 0 
(-1)k xk+1

ex2/2 =

k = 0 


e-t2/2 dt =

k = 0 

In particular for x > 0 we have ln1+x < x and ln1+x > x-x2/2.

B.5  Multivariate integration

Suppose x = x(u,v), y = y(u,v). The change of variables formula in multivariate case is

f(x,y) dxdy =

f(x(u,v), y(u,v)) |J(u,v)|  dudv,
where the Jacobian J is given by

J = det




B.6  Differential equations

The solution of the linear differential equation y+a y = g(x) with y(0) = y0 is given by y(x) = y0 e-ax 0x eat g(t)/, dt.

Second order linear equation y+a y+by = g(x) is often solved by the method of varying a constant. The first step is to solve the homogenous equation y+a y+by = 0 first using y = erx. The general solution of homogenous equation is y = C1er1x+C2er2x, or y = C1erx+C2xerx if the characteristic equation has double root.

B.7  Linear algebra

Definition 27 A scalar product of vectors in \sf V is a bilinear, positive definite, non-degenerate mapping ·|·: \sf V× \sf V IR.

Definition 28 Vectors x,y are orthogonal with respect to scalar product |, if x|y = 0.

The length of a vector is ||x|| = {x|x}. Orthogonal vectors of unit length are called orthonormal.. If ej are the orthonormal vectors and x is in their linear span, then the coefficients in the expansion x = j ajej are given by aj = x|ej.

Example 83 Let \sf V be the vector space of all continuous functions on the interval [-p,p]. In the scalar product f|g = -ppf(t)g(t) dt the functions {sink}k IN, {coskt}k IN, 1 are orthogonal.

B.8  Fourier series

The Fourier series for a function f(x) is the expansion f(x) = n an sinpnx+ bn cospnx. Every periodic continuous function f can be expanded in the Fourier series. The coefficients are

b0 = 1


f(x) dx
bn = 1


f(x)cosnpx  dx for n 0
an = 1


f(x)sinnpx  dx
Example 84 Expand |x| into the Fourier series, and graph the corresponding partial sums.

B.9  Powers of matrices

The Cayley - Hamilton theorem asserts that each d×d matrix A satisfies the polynomial equation Qd(A) = 0, where Qd(x) = det(A-xI) is the characteristic polynomial of degree d.

This implies that An = a0(n)I+a1(n)A+...+ad-1(n)Ad-1, where xn = D(x)Q(x)+a0(n)+a1(n)x+...+ad-1(n)xd-1. If A has n distinct eigenvalues lj, the coefficients aj(n) can be found by solving the system of equations ljn = a0(n)+a1(n)lj+...+ad-1(n)ljd-1.

A quick method that finds a characteristic polynomial due to K... is to solve the system of linear equations for the coefficients: pick a vector X at random and solve37 for a0,..., ad-1 the equations a0X+a1 AX+...+ad-1Ad-1X = 0. If the resulting matrix is singular, re-sample X until a non-singular matrix is found.

Appendix C
Numerical Methods

Calculators and more sophisticated math-geared software have efficient numerical methods built-in. Here are short prescriptions that may be used by general programmer. A more complete source is e.g.Numerical Recipes: The Art of Scientific Computing series.

C.1  Numerical integration

To approximate the integral ab f(x) dx in calculus we use left and right sums: Ln = [(b-a)/ n]k = 0n-1 f(a+[(b-a)/ n]k), Rn = [(b-a)/ n]k = 1n f(a+[(b-a)/ n]k)

The more exact trapezoidal rule uses

abf(x) dx Sn = 1/2(Ln+Rn).

Still more powerful and easy to program integration method is the Simpson rule: ab f(x) dx 4/3S2n-1/3Sn. The Simpson rule is exact for cubic polynomials. Typical way to program it is to call the subroutine performing trapezoid method integration twice. Many calculators have Simpson rule build in. Before you use it, be sure to check if it is reliable enough. A simple test that catches some poorly written routines is 0500e-x2  dx {p/2}.

C.2  Solving equations

The fast and powerful bisection method requires correct end-points, and finds one solution only. But it has virtually no assumptions, except the continuity: If f(x) is continuous and f(a) < 0, f(b) > 0 then there is a < x < b such that f(x) = 0 and the interval length can be halved by computing f([(a+b)/ 2]).

Three difficulties arise in real applications.

The second point has satisfactory answer for polynomial equations. The third point can be tackled through search for minimum. Namely, if equations are f(x,y) = 0,g(x,y) = 0 then we need to find minima of f2(x,y)+g2(x,y).

C.3  Searching for minimum

The analog of the bisection method for finding minima is to begin with three points a < b < c such that f(b) is the smallest of the three. The next triple is produced by partitioning the larger of the two segments in proportion38 [(3-5)/ 2] 0.38197, and comparing the resulting four values to pick the narrower triplet.

Appendix D
Programming Help

Programming help is grouped by the type of software. Currently available (as of Mar 13, 1998 ) are preliminary versions of:

D.1  Introducing BASIC

Any DOS-running PC comes also with BASIC. You can start it with the command


from the DOS command line, or from Windows File Manager (click on an entry QBASIC.EXE, usually located in). If you are a devoted Windows user, you may install an Icon in Program Manager to run BASIC with a click of the mouse.

Correctly written programs halt at the end. But not all programs do that, so an ``emergency exit" is build in.

To stop the execution of any program, press simultaneously Ctrl+Break.

D.1 how to use this chapter

This text was written for a complete novice to BASIC. If you fit the description, read the pages below, and type in each of the sample programs. Once you have them in the QBASIC, run them to see the effects of various instrunctions.

There is no possibility that by running these programs you will do any harm to your equipment. Experiment, and if something goes realy wrong, you can always turn OFF or restart the computer.

D.2  Computing

The mathematical conventions in QBASIC are 2^(1/2) for 21/2 , SQR(13) for [13], LOG(2) for the natural logarithm ln2, etc. With these, one can use QBASIC as a calculator. For instance the instruction
PRINT LOG(1+2^(1/2))

will print the decimal value of the expression ln(1+2). This probably is the simplest program to begin with. It is so short that there is no point in saving it.

The real power comes from repetitive operations explained in Section .

D.3  The structure of a program

BASIC programs are actually text files. The instructions are read consecutively by the BASIC interpreter. The QBASIC interpreter that comes with DOS is sufficient for our purposes. Sample programs below introduce certain more exotic build-in functions. Example 85 The following simple program prints current date & time

PRINT "Today is "; DATE$

Besides the actual code, programs should contain comments with explanations of instructions and their purpose. Comments and explanations in the program can be hidden from the processor through REM command. Everything in a line after this command is skipped (the only exception being metacommands, which we don't have to concern ourselves with here). The sample39 programs, use a shortcut ' instead of the full command REM. This is faster to type and equivalent in action. Example 86 Here is the previous program with comments

'Prints current date & time
PRINT "HELLO" 'greet the user
PRINT "Today is "; DATE$ 'print date
PRINT TIME$ ' print time could have printer TIMER=# seconds since 1980

The typical program consists of the main module with fairly general instructions that call subroutines to perform specific jobs, and a number of subprograms that are marked in text by SUB ... END SUB, and are displayed separately by the QBASIC Editor. Subprograms make it easier to write, test, and maintain the program.

Within QBasic the SUBs are separated, each in each own text window, but the final program is saved as one file with SUBs following the main module. To create a SUB, choose New SUB from the Edit menu. More details follow in Section .

Larger projects often use several modules40 that are compiled and run together.

Example 87 Here is the revision of the previous program with a simple but useful SUB.

'Prints current date & time in the middle of the screen
CLS 'clear display
CenterPrint "HELLO" 'greet the user
CenterPrint "Today is " & DATE$ 'print date - string has to be concatenated
CenterPrint TIME$ ' print time could have printer TIMER=# seconds since 1980

SUB CenterPrint (Text$)
'**** print text Text$ centered on screen
if l<0 then l=0 'too long text cannot be centered
print TAB(l); Text$

New things: CLS, concatenation of strings, calling SUB, screen positioning by TAB, LEN("hello")

Every subprogram should begin with a (commented) short description of its purpose, and the meaning of parameters.

D.4  Conditionals and loops

Loops of fixed size are best handled by

FOR j=1 to 10 STEP .5 ... NEXT j.

STEP is optional, and if none is given, then 1 is used by default. Example 88 Program

FOR j=1 TO 100

computes and prints the total of the first 100 integers (5050).

Conditional action is accomplished by

IF ... THEN ... ELSE ... END IF

END IF is required only when multiple lines of instructions are to be executed.

Selection from several cases is perhaps easiest through

SELECT var ...
CASE 0 ...
CASE .5 ...

Conditional loops (the ones that last indefinitely, unless special circumstance is encountered) can be programmed through

WHILE cond ... WEND

or through






Other ways of breaking out of DO ... LOOP are available, too.

Example 89 The following program illustrates several conditional instructions. It finds consecutive max = 30000 prime numbers larger than n = 1

max = 30000
nc = n - 1
WHILE k < max
        nc = nc + 1
        prime = -1    'TRUE
        FOR j = 2 TO SQR(nc)
        IF (nc MOD j) = 0 THEN prime = 0: EXIT FOR
        NEXT j
        IF prime THEN EXIT DO
    k = k + 1
    Print k;"-th prime is "; nc

[ 41 Write the program solving recurrence (67).

D.5  User input

To stop the execution of any program, press simultaneously Ctrl+Break.

To have user supply a value for variable X, write INPUT "Real number=";X

Note: instruction INPUT stops the program. The user has to hit Enter for the program to continue.

To scan for the key pressed by the user without stopping the program, use INKEY$ function instead of INPUT. This allows the user to control the program by simple menu functions. For example, embed the following lines within DO ... LOOP and stop the program by pressing KeyQ.




END 'stop the program

CASE "q"

END 'stop the program

CASE "?"

PRINT "Did you ask for help?"




A simple way to let the user know that something went wrong is to BEEP. [ 42 Write a program that complains (beeps) when user presses any key.

[ 43 Write a program that will type the text provided by user from the keyboard to screen in upper-case regardless what the user selected. (Hint: Function T2$=UCASE$(T1$) converts to upper-case)

D.6  More about printing

Instruction PRINT is used to print to the screen, and in a slightly modified version, to the file on the disk.

To print to the printer, use LPRINT instead of PRINT. In the latter case, to force the page out of the printer, end every printing job with LPRINT CHR$(12).

QBASIC provides sophisticated ways of controlling the text output by format, colors, location on the screen. In addition it does have graphic statements, as long as the computer has graphics card. But the only thing needed for us is the regular

PRINT "New value is X=";X

which outputs the string (in quotation marks) and the value New value is X=1.234

Occasionally we may want to do minimal ``format" through the semi-colon, or TAB().

PRINT ".";

To see the effect of the semi-colon, run the above statement in a loop41. Then delete the semicolon, and run it again.

For professional formatting of output, look up the instruction PRINT USING "###.##" in any BASIC textbook (your public library is a good source!).

D.7  Arrays and matrices

Vectors and matrices are handled as arrays. They need to be declared to reserve room in memory. This is accomplished by the dimensioning statement
tt DIM ArrayName (Size1, Size 2, Size 3).
For instance DIM A(100) specifies a vector, DIM A(20,50) defines a 20×50 matrix. Arrays use up a lot of memory, so don't declare arrays larger than what you need, and pay attention to .

D.8  Data types

QBASIC supports many built in data types. Use declaration like DIM n AS INTEGER, or append the name by the corresponding marker n%.

These can be combined into more complex user defined data structures through Type declaration.

If variables are not declared, and the name isn't appended with the symbol indicating type, the variables are treated as single precision. This is admissible42 in small programs where speed and memory aren't of major concern.

D.9  User defined FUNCTIONs and SUBs

To create a SUB, choose New SUB from the Edit menu. To create a FUNCTION, choose New FUNCTION from the Edit menu.

User defined functions are listed as separate windows within QBASIC. Select View to switch between various functions.

Each user=defined function starts with BEGIN FUNCTION FunctName (x, y,z) and ends with END FUNCTION. The code between these two lines is executed whenever the function is envoked from main program, from another function, or SUB, or from itself. FunctName is a name for your function (choose a descriptive one). Arguments (x,y,z) are the variables passed to the function.

Example 90 The following function soves the linear equation ax+b = 0

FUNCTION Solution(a, b)

[ 44 Write a function that solves quadratic equation a x2+bx +c = 0.

There are just a couple things to remember when making functions:

Once a function is created, you can use it from within a program in the same way as the build in functions. The name of the function carries its value, eg x=FunctName(23)+FunctName(24). Notice that the function can also change values of the variables in its arguments, and perform any other tasks - this behavior is like that of any program.

SUBs act like functions, except that the name doesn't carry ``value" and the only change (if any) is to the variables passed. To invoke SUB from the program, use CALL SubName(Parameters). This method is recommended as it is more resistant to typographical errors.

You can pass a whole array as an argument of a function. The arrays are recognized by the parentheses: FuncName(A()) is a function of array A(). [ 45 Write functions SimulateBinomial (p,n), SimulateNormal(n,sigma), SimulateExponential(m) that simulate a single instance of the Binomial, Normal, Exponential distribution with given parameters.

D.10  Graphics

Graphic commands are available only on computers with a graphics card (all PC's that run Windows have a graphics card).

It is nice to be able to make simple graphs, but the full topic is beyond the scope of this introduction. Program LIMTHS.BAS draws lines and boxes of various colors, in the display rectangle covering a portion of the screen.

If you are seriously interested in graphing the results of your computations, and printing the outcomes, you may want to switch to Visual Basic in Windows.

D.11  Binary operations

Long integers can be used to represent subsets of an n-element set, for n up to n = 15 (Why?). Single number k corresponds to long integer 2k, the set {j, k} is represented by 2j+2k, etc.

If S, T are integers representing sets, then S AND T represents the intersection of sets, S OR T is the union, and S AND NOT T is the difference of sets.

To check if j S, verify if 2^j AND T is non-zero.

D.12  File Operations

Beginners in BASIC need no file operations to solve the exercises.

If you want to save your printout to file, print to the ASCII file sequentially. This is slower and less versatile than binary, but easier to master.

The syntax is quite rigid. The following example contains the basic idea:

OPEN FileName FOR OUTPUT as #17



You can print to file text, numbers, etc. If the file with the sam name as output file already exists, it is replaced by the new one (overwritten). To add rows to an existing file without loosing its contents use OPEN FileName FOR APPEND as #17

You can read input from programs back into the program by the corresponding INPUT statements. Beware that the file has to have the format expected by the INPUT.

The following simple program reads entries from the file FileName and prints it onto the screen, one by one.

OPEN FileName FOR INPUT as #17


INPUT #17, H$





Change INPUT #17, H$ to LINE INPUT #17, H$ to get full text rather than words.

D.13  Good programming

Adhering to good programming principles pays in clarity of programs, and facilitates debugging (finding errors). Adhering to the principles below is not a guarantee that the programs will work. It is also possible to write programs that execute correctly without any of the below. Nevertheless, it is a good habit to follow these recommendation. The gain is in clarity of the program, readability of its portions. Consequently, you will be able to design more complex programs that execute as expected. You will also re-use components easier.

If you wrote a program that uses GOTO statement(s), it is a good exercise to re-write it without a single GOTO instruction!

D.14  Example: designing an automatic card dealer

Modern BASIC is a structural language. The objective of this example is to show how various features of BASIC interplay in a design of a card-dealing application.

Here is the description of the situation.

A deck of cards consists of 52 cards. Each card has two attributes:

  • Suit (hearts, spades diamonds, and clubs)
  • Value (1 through 13)

Card games require shuffling the deck, and dealing some number of cards from the top of the deck. The objective of this example is to write a program that will print out the number of requested cards twice. That is, the first player will get as many cards as she requests, and then the second player will get as many cards as he requests (from the rest of the deck!

D.14.1  First Iteration

Once we realize what are the natural steps the real person would go through, the program is very easy to design. Here is the program(!)

'ask first player
'ask second player

What remains is only to decide what each step should do, and how the information about the cards will be stored. Since storing the information determines how it is passed between subprograms, we begin with determining this part.

We may want to use an array Deck(52) which will be of ``user defined" type. The advantage of this approach is that we may modify the information we w ``store" with each card with minimal changes in the program itself.

Suite as string
Value as integer ' we may want string here, too!
End type

Afterwards we may declare two shared arrays

Dim Shared Deck(52) as cards
DIM Shared Order(52)

Shared means that every SUB in the program can access values of Order(j), and Deck(j). The first card dealt will be Deck(Order(1)). The definition of type says that its value is Deck(Order(1)).value and its suit is Deck(Order(1)).suit.

D.14.1 SUBS

The simplest way to begin designing SUBS is to describe the purpose of each function/SUB with no code.

SUB InitializeDeck
'Initialize cards to their values.

The next routine is perhaps not easy to write for a beginner, but we have a good example in the book.

SUB ShuffleDeck
'Make a random permutation
'Store it in shared array Order()
'Order(1), Order(2) are distinct random numbers range 1,...n,

The next routine interacts with the user. User interaction should ALWAYS be implemented as a separate SUB. As straightforward as it seems, reliable coding requires extensive checking for errors resulting from unpredicted user reactions.

'ask first player
' Ask user how many cards (s)he requests
' store in variable.
' Check for possible errors in input
' return value if enough cards are left.

The following routine is in charge of giving out cards. Since the order of cards was already determined, it seems straightforward. But again complications arise, and this portion will be much easier to handle if coded as a separate SUB

SUB DealCards(n)
' Remember how many cards are left
' Check how many cards are left
' Print out Error message if not enough cards
'Print next n cards
' You have to decide here HOW the cards will appear on screen:
'words? pictures? numbers?
' (In this example, it will be numbers)

D.14.2  Second Iteration

Rather than beginning to code the actual functions, we may want to double check that the ``flow" of our program is as we expect it. We may write a ``dummy" versions of the more difficult parts of the program, and test its operation. Only after we are sure that the program behaves as expected, we can invest more time into coding more difficult parts.

Here is a test program. It was produced from the previously described code; all newly added parts are clearly marked so that they can be removed once not needed.

REM Suite as string
Suite as integer
'*** end test ***
Value as integer ' we may want string here, too!
End type

Dim Shared Deck(52) as cards
DIM Shared Order(52)

'ask first player
'ask second player

SUB InitializeDeck
'Initialize cards to their values.
'*** test ***
for j = 1 to 52
Deck(j).value=j mod 13 +1
Deck(j).suit=j mod 4 +1
next j
'*** end test ***

SUB ShuffleDeck
'Make a random permutation
'Store it in Order()
'Order(1), Order(2) are distinct random numbers range 1,...n,
'*** test ***
'Factory order
For j= 1 to 52
'*** end test ***

FUNCTION HowManyCards%
' Ask user how many cards (s)he requests
' store in variable.
' Check how many cards are left
' Print out Error message if not enough cards
' return value if enough cards are left.
'*** test ***
INPUT ``How many cards"; x
' should check for "crazy" answers here
'*** end test ***

SUB DealCards(n)
' Remember how many cards are left
'Print next n cards
'*** test ***
' just print first n cards for now
Print "Your cards are:"
FOR j=1 TO n
Print Deck(Order(j)).Value ; " of suit No" ;Deck(Order(j)).Suit
'*** end test ***

With this ``skeleton" program we can check the following things:

  1. Does the program do what we wanted? Are shared variables shared between subprograms?
  2. Are there any preliminary coding mistakes/typos? Are the variables of correct type (as declared in each SUB/FUNCTION)?
  3. Does the output routine operate correctly? (Ask for various numberst of cards. Reverse the order in SUB ShuffleDeck.)

D.14.3  Third iteration

Now we are ready to design/code each SUB. This is left for the reader to do. Here are some hints.


R. Motwani & P. Raghavan, Randomized Algorithms, Cambridge University Press, Cambridge 1995.

N. I. Akhiezer, The Classical Moment Problem, Oliver & Boyd, Edinburgh, 1965.

P. Billingsley, Probability and measure, Wiley, New York 1986.

P. Billingsley, Convergence of probability measures, Wiley New York 1968.

W. Bryc, Normal Distribution, characterizations with applications, Lecture Notes in Statist. v. 100 (1995).

A. Dembo & O. Zeitouni, Large Deviation Techniques and Applications, Jones and Bartlett Publ., Boston 1993.

R. Durrett, Probability: Theory and examples, Wadsworth, Belmont, Ca 1991.

K. T. Fang & T. W. Anderson, Statistical inference in elliptically contoured and related distributions, Allerton Press, Inc., New York 1990.

K. T. Fang, S. Kotz & K.-W. Ng, Symmetric Multivariate and Related Distributions, Monographs on Statistics and Applied Probability 36, Chapman and Hall, London 1990.

W. Feller, An Introduction to Probability Theory, Vol. I, Wiley 196. Vol II, Wiley, New York 1966.

J. F. W. Herschel, Quetelet on Probabilities, Edinburgh Rev. 92 (1850) pp. 1-57.

K. Ito & H. McKean Diffusion processes and their sample paths, Springer, New York 1964.

N. L. Johnson & S. Kotz, Distributions in Statistics: Continuous Multivariate Distributions, Wiley, New York 1972.

N. L. Johnson & S. Kotz & A. W. Kemp, Univariate discrete distributions, Wiley, New York 1992.

A. M. Kagan, Ju. V. Linnik & C. R. Rao, Characterization Problems of Mathematical Statistics, Wiley, New York 1973.

A. N. Kolmogorov, Foundations of the Theory of Probability, Chelsea, New York 1956.

H. H. Kuo, Gaussian Measures in Banach Spaces, Lecture Notes in Math., Springer, Vol. 463 (1975).

J. C. Maxwell, Illustrations of the Dynamical Theory of Gases, Phil. Mag. 19 (1860), pp. 19-32. Reprinted in The Scientific Papers of James Clerk Maxwell, Vol. I, Edited by W. D. Niven, Cambridge, University Press 1890, pp. 377-409.

W. Magnus & F. Oberhettinger, Formulas and theorems for the special functions of mathematical physics, Chelsea, New York 1949.

K. S. Miller, Multidimensional Gaussian Distributions, Wiley, New York 1964.

J. K. Patel & C. B. Read, Handbook of the normal distribution, Dekker, New York 1982.

B. L. S. Prakasa Rao, Identifiability in Stochastic Models Acad. Press, Boston 1992.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery, Numerical recipes in C, Cambridge University Press, New York 1992.

M. Rosenblatt, Stationary sequences and random fields, Birkhäuser, Boston 1985.

R. Sikorski, Advanced Calculus, PWN, Warsaw 1969.

Y. L. Tong, The Multivariate Normal Distribution, Springer, New York 1990.

p-l theorem, 123
A model of cell growth, 118
A simple probabilistic modeling in Genetics, 113
Additional topics, 113
Aliasing, 105
Almost sure convergence, 55
An application: find k-th largest number, 94
Aperiodic Markov chains, 85
Application: length of a random chain, 41
Application: scheduling, 36
Application: spread of epidemic, 119
Application: Traveling salesman problem, 10
Application: verifying matrix multiplication, 114
Arrays and matrices, 135
Autoregressive processes, 111
Bayes theorem, 15
Best linear approximation, 40
Best mean-square approximation, 40, 43
Best non-linear approximations, 43
Binary operations, 136
Binomial experiment, 18
Binomial trials, 17
Binomial, 65
Bisection method, 129
Bivariate normal distribution, 53, 98
Bivariate normal, 53
Blind search, 9
Branching processes, 95
Cauchy-Schwarz inequality, 121, 39
Central Limit Theorem, 57
Characteristic functions, 49
Characteristic function, 49
Chebyshev's inequality, 29
Chebyshev-Markov inequality, 29
Combinations, 125
Combinatorics, 125
Compound Poisson process, 105
Computing, 131
Conditional distributions, 41
Conditional expectations (continued), 42
Conditional expectations as random variables, 42
Conditional expectations, 41
Conditional Expectation, 122
Conditional independence, 16
Conditional limit theorems, 60
Conditional probability, 14
Conditionals and loops, 133
Consequences of axioms, 5
Continuous distribution, 25
Continuous r. v., 24
Continuous time Markov processes, 106
Continuous time processes, 103
Convergence in distribution, 56
Convergence in probability, 55
Convergence of distributions, 56
Convolution, 38
Correlation coefficient, 40
Covariance matrix, 100
Covariance, 35
Cumulative distribution function, 17
Data types, 135
De Finetti, 114
Decreasing events, 3
Deterministic scheduling problem, 36
Difference Equations, 71
Differential equations associated with the Poisson process, 105
Differential equations, 126
Discrete r. v., 23
Disjoint events, 2
Distances between strings, 116
Distributions of functions, 37
Edit distance, 116
Elementary combinatorics, 125
Elementary probability models, 5
Embedded Markov chain, 106
Equality of distributions, 17
Events and their chances, 2
Events, 2
Example: a game of piggy, 89
Example: chromatography, 80
Example: designing an automatic card dealer, 137
Example: how long a committee should discuss a topic?, 87
Example: soccer, 87
Example: vehicle insurance claims, 89
Examples of continuous r. v., 25
Examples of discrete r. v., 24
Exchangeability, 114
Exchangeable r. v., 114
Exclusive events, 2
Expected values and simulations, 30
Expected values, 27
Extinction probability, 95
File Operations, 136
Finite state space, 83
First Iteration, 138
Fourier series, 127
Freivalds technique, 114
Functions of r. v., 33
Further notes about simulations, 19
Game theory, 87
Gaussian Markov processes, 102
Gaussian processes, 108
Gaussian process, 108
General continuous sample space, 6
General multivariate normal law, 100
Generating functions, 47, 95
Generating function, 49
Generating random numbers, 63
Generic Method - continuous case, 65
Generic Method - discrete case, 65
Geometric case, 96
Geometric Probability, 4
Geometric r. v., 18
Geometric, 65
Graphics, 136
Herschel's law of errors, 52
Herschel, on normal distribution, 52
Hoeffding's Lemma, 46
Holder's inequality, 121
Homogeneous Markov chains, 84
Importance sampling, 68
Improving blind search, 12
Independent events, 16
Independent i. i. d., 33
Independent increments, 78
Independent random variables, 33
Initial value, 72
Integration by simulations, 68
Intensity of failures, 44
Interactive links, 125
Introducing BASIC, 131
Introduction to stochastic processes, 71
Jensens's inequality, 123
Joint distributions, 32
Jointly (absolutely) continuous, 32
Lack of memory, 18, 43
Large deviation bounds, 59
Law of large numbers, 55
Limit theorems and simulations, 59
Limit theorems, 55
Limits, 125
Linear algebra, 126
Linear congruential random number generators, 64
Linear regression, 101
Markov Autoregressive processes, 92
Markov chain for annealing, 91
Markov chains, 83
Markov chain, 83
Markov processes, 83
Martingale Convergence Theorem, 114
Math background, 125
Maxwell, on normal distribution, 52
Mean square convergence, 55
Mean-square convergence, 121, 39
Minkowski's inequality, 121, 39
Mixed strategy, 87
Modeling and simulation, 77
Moment generating function-normal distribution, 100
Moment generating functione, 48
Moment generating functions, 47
Moment generating function, 47, 48, 48
Moments of functions, 34
Monte Carlo estimation of small probabilities, 68
Monte-Carlo methods, 30
More about printing, 134
Multivariate integration, 126
Multivariate moment generating function, 97
Multivariate normal density, 101
Multivariate normal distribution, 97
Multivariate random variables, 17
Normal distribution - moment generating function, 100
Normal distribution-bivariate, 98
Normal distribution-covariance, 100
Normal distribution-multivariate, 0, 97
Normal distribution, 51
NP-complete, 14
Numerical integration, 129
Numerical Methods, 129
One step analysis, 88
Order statistics, 39
Orthonormal vectors, 126
Overview of random number generators, 64
Pairwise disjoint, 5
Permutations, 125
Phenotype, 113
Poisson approximation, 44
Poisson process, 103
Poisson, 65
Positive definite functions, 110
Power series expansions, 126
Powers of matrices, 127
Probability density function, 25, 32
Probability mass function, 23
Probability measure, 2
Probability space, 2
Programming Help, 131
Properties of conditional expectations, 122
Properties, 47
Pseudo-random numbers, 7
Quadratic error, 40
Quick Sort, 92
Random digits, 63
Random growth model, 80
Random Permuatations, 67
Random permutations, 13
Random permutation, 8
Random phenomena, 1
Random subsets, 67
Random variables (continued), 23
Random variables, 17
Random walks, 78
Randomization, 66
Recurrence, 90
Recursive equations, chaos, randomness, 76
Rejection sampling, 66
Reliability function, 17
Risk assessment, 106
Ruining a gambler, 80
Sample space, 2
Searching for minimum, 129
Second order stationary processes, 109
Selective sampling, 68
Sequential experiments, 15
Shannon's Entropy, 118
Shocks, 106
Simpson rule, 129
Simulated annealing, 12, 91
Simulating a multivariate normal distribution, 100
Simulating continuous r. v., 65
Simulating discrete events, 7
Simulating discrete experiments, 67
Simulating discrete r. v., 65
Simulating Markov chains, 86
Simulating normal distribution, 66
Simulations of continuous r. v., 26
Simulations of discrete r. v., 24
Simulations, 63
Sobol sequences, 68
Solving difference equations through simulations, 91
Solving equations, 129
Sorting at random, 92
Spectral density, 110
Square integrable r. v., 121, 39
State space, 83
Stationarity equations, 85
Stationary processes, 84
Stochastic Analysis, 55
Stochastic scheduling problem, 37
Stochastically independent r. v., 33
Stopping QBASIC, 131
Stopping times, 79
Stratified sampling, 68
Strong law of large numbers, 56
Subset-Sum Problem, 14
Tail integration formulas, 28
The general prediction problem, 110
The mean and the variance, 95
The structure of a program, 132
The Wiener process, 108
Theoretical complements, 121
Time Series, 109
Total probability formula, 15, 42
Trajectory averages, 110
Trapezoidal rule, 129, 30
Triangle inequality, 121, 39
Trivial events, 16
Two-valued case, 96
Uncorrelated r. v., 40
Univariate normal, 51
User defined FUNCTIONs and SUBs, 135
Variance, 34
Variations, 125
Venn diagrams, 5
Weibull distribution, 44
Wiener process, 108
Exhaustive events, 15


1 Disjoint, or exclusive events A, B W are such that AB = is empty.

2 The price for flexibility is that now we have to decide how to chose pn.

3 Similar instruction on TI-85 is Display rand.

4 In BASIC, to avoid repetitions use instruction RANDOMIZE TIMER at the beginning of a simulation program.

5 You need to find out how to handle graphic in BASIC. Otherwise, make a table of values instead.

6 A related method is brute force - checking all possible cases.

7 If you want to include more cities, you have to type the distances in a suitable format. If you embark on this project, try first to implement a method for selecting an arbitrary subset of cities to visit.

8 Trivial events are those with probabilities 0, or 1.

9 This terminology arises under the interpretation that X represents failure time.

10 A possible heuristic argument may argue that Tp is on average 1.

11 The precise definition is in Section 1.6.1.

12 Actually, we need only a right-inverse, i.e a function such that F(f(u)) = u.

13 Provided that limited accuracy is admissible

14 Theorem gives a more general statement.

15 See eg. W. Feller, An Introduction to Probability Theory, Vol II, Wiley, New York 1966.

16 For the multivariate analog, see Theorem

17 J. F. W. Herschel, Quetelet on Probabilities, Edinburgh Rev. 92 (1850) pp. 1-57

18 J. C. Maxwell, Illustrations of the Dynamical Theory of Gases, Phil. Mag. 19 (1860), pp. 19-32. Reprinted in The Scientific Papers of James Clerk Maxwell, Vol. I, Edited by W. D. Niven, Cambridge, University Press 1890, pp. 377-409.

19 This is all we can hope for in the finite computer arithmetics. Whenever Uj (0,1) is selected and finite approximation is chosen, Nj = NUj is an integer, where N = 2b is the characteristic of the operating system.

20 Actually, we need only a right-inverse, i.e a function such that F(f(u)) = u.

21 Why variance?

22 The smallest possible variance is 0! Why doesn't this happen in the actual application??

23 See J. S. Sadowski, IEEE Trans. Inf. Th. IT-39 (1993) pp. 119-128, and the references therein.

24 Such as y0 = 0.

25 Following good mathematical practise, we simplify the real world and assume that a single mature rabbit will produce one offspring every month.

26 Most likely, you know the solution to this one, but the method is useful for other problems as well.

27 Clearly, we request y0 = 0.

28 Notice that this is a mathematical simplification that might be not worth pursuing if the actual state space has some convenient interpretation.

29 Source: football in A. Hornby, Oxford Advanced Learner's Dictionary of Current English , Oxford University Press 1974

30 Variable aliasing is the mistake of assigning the same location to two or more variables in a program.

31 This is a very imprecise statement!

32 A martingale with respect to a family of increasing s-fields \cal Fn is and integrable sequence Xn such that E(Xn+1|\cal Fn) = Xn. The sequence Xn = E(X|\cal Fn) is a martingale. The sequence in the theorem is of the same form, except that the s-fields are decreasing rather than increasing.

33 Only deletions are required.

34 That is, find the strategy that costs the least if you have to pay for each use of the scale.

35 Readers not familiar with measure theory should skip the proofs.

36 See p-l Theorem (Theorem 3.3) P. Billingsley, Probability and measure, Wiley, New York 1986.

37 Use Gaussian elimination.

38 This is golden section. The explanation why it occurs here can be found in numerical methods books.

39 The actual sample programs on the disk also contain commented LATEX typesetting instructions at the beginning and at the end. Their sole purpose is to insert the listings into this text.

40 For example, in many applied math programs there is a need for integration routines. These perhaps would be collected in a separate module to facilitate repeated usage.

41 FOR j=1 TO 1000
PRINT ".";

42 Use DEFINT I-L to over-ride the default and force all the variables with names that begin with I though L to be of Integer type.

File translated from TEX by TTH, version 1.31.