This course is aimed at students in applied fields and assumes a prerequisite of calculus. The goal is a working knowledge of the concepts and uses of modern probability theory. A significant part of such a "working knowledge" in modern applications of mathematics is computer-dependent.
The course contains mathematical problems and computer exercises. Students will be expected to write and execute programs pertinent to the material of the course. No programming experience is necessary or assumed. But the willingness to accept a computer as a tool is a requirement.
For novices, the BASIC programming language (QBASIC in DOS, Visual Basic in Windows, or BASIC on Macintosh) is recommended. BASIC is perhaps the easiest programming language to learn, and the first programming language is always the hardest to pick up.
Programs in QBASIC 4.5 on IBM-compatible PCs and, to a lesser extent, programs on the Texas Instruments programmable calculator TI-85, and Windows programs in Microsoft Visual Basic 3.0 are supported. This means that I will attempt to answer technical questions and provide examples. Other programming languages (SAS, C, C++, Fortran, Cobol, Assembler, Mathematica, LISP, TeX (!), Excel, etc.) can be used, but I will not be able to help with the technical issues.
Moment generating functions, limit theorems, characteristic functions. Stochastic processes: random walks, Markov sequences, the Poisson process.
Multivariate normal distribution. Gaussian processes, white noise. Conditional expectations. Fourier expansions, time series.
This text is available through the Internet in PDF, PostScript, or DVI formats. A number of other math-related resources can be found on the WWW. Also available are supporting BASIC program files.
Auxiliary textbooks:
This textbook is copyrighted © W. Bryc 1996. All rights reserved. GTDR. Academic users are granted the right to make copies for their personal use, and for distribution to their students in academic year 1995/1996.
Reproduction in any form by commercial publishers, including textbook publishers, requires written permission from the copyright holder.
Definition of a Tree: A tree is a woody plant with an erect perennial trunk of at least 3.5 inches (7.5 centimeters) in diameter at breast height (4 1/2 feet or 1.3 meters), a definitely formed crown of foliage, and a height of at least 13 feet (4 meters).
The Audubon Society Field Guide to North American Trees.
This chapter introduces fundamental concepts of probability theory: events and their chances. For readers who are familiar with elementary probability it may be refreshing to see the computer used for counting elementary events, and randomization used to solve a deterministic optimization problem.
The questions are
To begin with, we start with a truism. The real world is complicated, often to a larger degree than scientists will readily admit. Most real phenomena have many aspects and can be approached in multiple ways. Various questions can be asked. Even seemingly the same question can be answered on many incompatible levels. For instance, the generic question about dinosaur extinction has the following variants.
These questions are ordered from the most individual level to the most abstract. The reader should be aware that probability theory and stochastic modelling deal only with the most abstract levels of the question. Thus, a stochastic model may perhaps shed some light on whether dinosaurs had to go extinct, or whether mammals will go extinct, but it wouldn't go into the details of which comet had to be responsible for the dinosaurs, or which one will be responsible for the extinction of mammals.
It isn't our contention that individual questions have no merit. They do, and perhaps they are as important as the general theories. For example, a detective investigating the cause of a mysterious death of a young woman will have little interest in the "abstract statistical fact" that all humans eventually die anyhow. But individual questions are as many as the trees in the forest, and we don't want to overlook the forest, either.
Probabilistic models deal with general laws, not individual histories. Their predictions are on the same level, too. To come back to our motivating example, even if a stochastic model did predict the extinction of dinosaurs (eventually), it would not say that it had to happen at the time when it actually happened, some 60 million years ago. And the more concrete a question is posed, say if we want to know when Dino the dinosaur died, the less can be extracted from the stochastic model.
On the other hand, many concepts of modern science are defined in a statistical, or probabilistic, sense. (If you think this isn't true, ask yourself how many trees make a forest.) Such concepts are best studied from the probabilistic perspective. An extreme view is to consider everything random, deterministic models being just approximations that work for small levels of noise.
Let ℳ be a σ-field of subsets of the sample space Ω; its elements are called the events. Events A, B ∈ ℳ model sentences about the outcomes of the experiment to which we want to assign probabilities. Under this interpretation, the union A∪B of events corresponds to the alternative, the intersection A∩B corresponds to the conjunction of sentences, and the complement A′ corresponds to the negation of a sentence. For A, B ∈ ℳ, A∖B := A∩B′ denotes the set-theoretical difference.
For an event A ∈ ℳ the probability Pr(A) is a number interpreted as the degree of certainty (in unique experiments), or the asymptotic frequency of A (in repeated experiments). Probability Pr(A) is assigned to all events A ∈ ℳ, but it must satisfy certain requirements (axioms). A set function Pr(·) is a probability measure on (Ω, ℳ) if it fulfills the following conditions:
Pr(Ω) = 1, Pr(A) ≥ 0 for all A ∈ ℳ, and Pr(⋃_{n} A_{n}) = Σ_{n} Pr(A_{n}) for every sequence of pairwise disjoint events A_{1}, A_{2}, ... ∈ ℳ.   (1)
Probability axioms do not determine the probabilities in a unique way. The axioms provide only minimal consistency requirements, which are satisfied by many different models.
Pr(A) = #A / #Ω  (the uniform assignment: all outcomes of a finite sample space Ω are equally likely).   (2)
The uniform assignment of probability involves counting. For small sample spaces this can be accomplished by examining all cases. Moderate-size sample spaces can be inspected by a computer program. Counting arbitrarily large spaces is the domain of combinatorics. It involves combinations, permutations, generating functions, combinatorial identities, etc.
A short review recalls the most elementary counting techniques.

Problem 1 Three identical dice are tossed. What is the probability of two of a kind?
The following BASIC program inspects all outcomes when five dice are rolled, and counts how many are "four of a kind".
'declare function and variables
DECLARE FUNCTION CountEq! (a!, b!, c!, d!, e!)
'prepare screen
CLS
PRINT "Listing four of a kind outcomes in Yahtzee..."
'*** go through all cases
FOR a = 1 TO 6: FOR b = 1 TO 6: FOR c = 1 TO 6: FOR d = 1 TO 6: FOR e = 1 TO 6
  IF CountEq(a + 0, b + 0, c + 0, d + 0, e + 0) = 4 THEN
    PRINT a; b; c; d; e; : ct = ct + 1
    IF ct MOD 5 = 0 THEN PRINT ELSE PRINT "";
  END IF
NEXT: NEXT: NEXT: NEXT: NEXT
'print result
PRINT
PRINT "Total of "; ct; " four of a kind."

FUNCTION CountEq (a, b, c, d, e)
  '*** count how many of five numbers are the same
  DIM x(5)
  x(1) = a: x(2) = b: x(3) = c: x(4) = d: x(5) = e
  max = 0
  FOR j = 1 TO 5
    ck = 0
    FOR k = 1 TO 5
      IF x(j) = x(k) THEN ck = ck + 1
    NEXT k
    IF ck > max THEN max = ck
  NEXT j
  'assign value to function
  CountEq = max
END FUNCTION
Here is a portion of its output:
The program is written explicitly for tossing five dice; you may want to modify it to answer a similar question for any number of dice, and an arbitrary k-of-a-kind question.
Exercise 1 Run and time YAHTZEE.BAS. Then estimate how long a similar problem would run if the question involved tossing 15 fair dice. The answer depends on your computer and the software. Both Pascal and C programs seem to run on my computer about 15 times faster than the (compiled) QuickBasic. ANS: Running time would take years!
Exercise 1.2.1 shows the power of old-fashioned pencil-and-paper calculation. Problem 2 Continuing Problem 1.2.1, suppose now n identical dice are tossed. What is the probability of n−1 of a kind? ANS: 5n/6^{n−1}.
For bounded subsets Ω ⊂ ℝ^{d}, put

Pr(A) = |A| / |Ω|,   (3)

where |·| denotes the length, area, or volume of the set.
Geometric probability usually involves multivariate integrals. Example 1 A point is selected from a 32-cm-wide circular dartboard. The probability that it lies within the 8-cm-wide bullseye is 1/16.
Example 2 A needle of length 2ℓ < 2 is thrown onto a paper ruled every 2 inches. The probability that the needle crosses a line is 2ℓ/π.
Analysis of Example 1.2.2 is available on WWW.
Exercise 2 Test by experiment (or simulation on the computer) if the above two examples give correct answers. (Before writing a program, you may want to read Section first.)
Example 3 Two drivers arrive at an intersection between 8:00 and 8:01 every day. If they arrive within 15 seconds of each other, both cars have to stop at the stop sign. How often do the drivers pass through the intersection without stopping?
Exercise 1 Continuing Example 1.2.2: What if there are three cars in this neighborhood? Four? How does the probability change with the number of cars? At what number of users should a stop light be installed?
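Example 3 is easy to check empirically. Here is a sketch in one of the "other languages" allowed in this course, Python; the function name is ours.

```python
import random

def no_stop_probability(trials=100_000, window=15.0):
    """Estimate the chance that two arrival times, each uniform on a
    60-second minute, differ by more than `window` seconds."""
    hits = 0
    for _ in range(trials):
        x = random.uniform(0, 60)
        y = random.uniform(0, 60)
        if abs(x - y) > window:
            hits += 1
    return hits / trials

print(no_stop_probability())   # should be close to (45/60)**2 = 0.5625
```

The estimate should be close to (45/60)^{2} = 0.5625, the exact answer obtained by computing the area of the region |x − y| > 15 inside the 60 × 60 square.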
 (4) 
 (5) 
 (6) 
 (7) 
For a finite or countable set Ω ⊂ ℕ and a given summable sequence of nonnegative numbers a_{n}, put

Pr(A) = (Σ_{n ∈ A} a_{n}) / (Σ_{n ∈ Ω} a_{n}).   (8)

It is convenient to put p_{k} = a_{k} / (Σ_{n ∈ Ω} a_{n}) and rewrite (8) as Pr(A) = Σ_{n ∈ A} p_{n}.
Formula (8) generalizes the uniform assignment (2), which corresponds to the choice of equal weights a_{k} = 1. At the same time it is more flexible^{2} and applies also to infinite sample spaces. Example 4 Let Ω = ℕ and put p_{k} = 1/2^{k}. Then the probability that an odd integer is selected is Pr(Odd) = Σ_{j = 0}^{∞} 2^{−2j−1} = 2/3.
Table 1.1 lists the most frequently encountered discrete probability assignments.
Name                     Ω                   Probabilities p_{k}
Binomial                 {0,...,n}           p_{k} = C(n,k) p^{k}(1−p)^{n−k}
Poisson                  ℤ_{+}               p_{k} = e^{−λ} λ^{k}/k!
Geometric                ℕ                   p_{k} = p(1−p)^{k−1}
Equally likely outcomes  {x_{1},...,x_{k}}   p_{k} = 1/k
Problem 3 For each of the choices of numbers p_{k} in Table 1.1 verify that (8) indeed defines a probability measure.
The reasons behind the particular choices of the expressions for p_{k} in Table 1.1 involve modeling.
 (9) 
The first step in a simulation is to decide how to generate "randomness" within a deterministic algorithm in a computer program. Programming languages, and even calculators, provide a method for generating consecutive numbers from the interval (0,1) "at random". For instance, the BASIC instruction^{3} PRINT RND(1) returns a different number at every use. We shall assume that the reader has access to a method of generating uniform numbers from the unit interval (0,1). These are called pseudorandom numbers, since the program usually begins the same "random" sequence at every run, unless special precautions^{4} are taken.
Once a pseudorandom number from the interval (0,1) is selected, an event that occurs with some known probability p can be simulated by verifying if {RND(1) < p} occurs in the program. For instance, the number of heads in a toss of 1,000 fair coins is simulated by the following BASIC program.
'Simulating 1000 tosses of a fair coin
H = 0
FOR n = 1 TO 1000 'main loop
  IF RND(1) < .5 THEN H = H + 1
NEXT n
'print final message
PRINT "Got "; H; " heads this time"
END
Here is its output:

Got 520 heads this time
A simple method for simulating an outcome on a six-face die is to take the integer part of a rescaled uniformly selected number, INT(1+6*RND(1)). Can you use this to write a simulation of a roll of 5 dice? Such simulations are often used to evaluate probabilities empirically as a substitute for a real empirical study.
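The same rescaling trick carries over to other languages directly; a Python sketch (function names are ours) rolling five dice and counting the largest matching group, mirroring CountEq from the earlier program:

```python
import random

def roll_die():
    # same idea as INT(1 + 6*RND(1)): rescale a uniform (0,1) number
    return int(1 + 6 * random.random())

def count_eq(dice):
    # size of the largest group of equal faces, as in CountEq
    return max(dice.count(d) for d in dice)

roll = [roll_die() for _ in range(5)]
print(roll, "largest matching group:", count_eq(roll))
```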
Exercise 3 Write the simulation (as opposed to the deterministic inspection of all sample points on page pageref) to estimate how often the event "four of a kind" occurs in a roll of five dice.
Exercise 2 Run the (modified) coin tossing program in a loop to answer the following questions.
More complicated objects are often of interest in simulations. For instance, we may want to draw two cards from a deck of 52. One possible way to do it is to number the cards, and select two distinct numbers a, b.
In more advanced exercises you may want to estimate probabilities that aren't known. In such cases it is always a good idea to run simulations of various lengths and compare the results. In this section we briefly discuss how such a simulation can be organized in a way that promotes multiple uses of the same program.

The key is organizing the programs carefully into manageable blocks of small size. Modern BASIC is a structured programming language. The generic program to study the effects of the length of simulation on its output can be written as follows.
' PROGRAM Generic.bas
' Generic Simulator
' Size is simulation size, varied from Min=100 to Max=10000
For Size = 100 To 10000 Step 100
  Call Simulate(Size, Result)
  Print "Simulation size="; Size; " Output="; Result
Next Size
End
The actual simulation is performed by
SUB Simulate (SizeRequested, Result)
'Runs requested number of simulations and returns average
'Trial numbers consecutive simulations from 1 to SizeRequested
Result = 0
For Trial = 1 To SizeRequested
  Call SimulateOne(Score)
  Result = Result + Score
Next Trial
'Most simulations return averages of single trials
Result = Result / SizeRequested
END SUB
The actual modeling is performed in another SUB, which in the generic program we named SimulateOne. This SUB may be as simple as simulating a toss of a single coin
SUB SimulateOne (Outcome)
'simulate one occurrence, return numerical outcome
Outcome = 0
If RND(1) < 1 / 2 Then Outcome = 1
END SUB
Or it can be as complicated as we wish. The example below simulates a toss of five dice, and uses the previously introduced function CountEq(a,b,c,d,e) to detect four of a kind.
SUB SimulateOne (Outcome)
'simulate one occurrence, return numerical outcome
Outcome = 0
d1 = INT(RND(1) * 6 + 1)
d2 = INT(RND(1) * 6 + 1)
d3 = INT(RND(1) * 6 + 1)
d4 = INT(RND(1) * 6 + 1)
d5 = INT(RND(1) * 6 + 1)
IF CountEq(d1, d2, d3, d4, d5) = 4 THEN Outcome = 1
END SUB
Exercise 3 Write a blind-search program to find the maximum of a function.
Planning such a tour is easy by hand for 3-4 cities. For longer tours some help is needed. To check the performance of the blind search, you may want to know what the usual algorithms involve. A greedy method starts with the shortest distance, and then keeps going to the closest city not yet visited. Another heuristic method is to select a pair of closest cities and accept the best (shortest) connection among those that do not complete a shorter loop, and do not introduce a spurious third connection to a city. Eventually, the disjoint pieces will form a path that often can be further improved upon inspection.
The program is longer but not at all sophisticated: it just selects paths at random. Notice that a solution to Exercise 1.4 (how to generate a random permutation) is given in one of the subprograms (which one?). The method for the latter is simple-minded: the algorithm attempts to place consecutive numbers at random spots until an empty spot is selected.
The following is the main part of the program. You can use it as a template in designing your own version of Blind search programs. The full code with SUBs is online in RANDTOUR.BAS.
The program runs in an infinite loop until it is stopped by the user. Once stopped, the program prints the best route it found. For larger sets of cities we may have a hard time deciding when to stop it. Here is a sample output (from the improved version, as marked in the actual code):

'****
CLS
'**** get number of cities (no choice which) from user
LOCATE 2, 1
INPUT ; "Shortest distance between how many cities?", n
'*** initialize program
CLS
nMax = 19 ' current data size. Make sure not exceeded!
IF n > nMax THEN n = nMax
'declare arrays
DIM SHARED dist(nMax, nMax) ' matrix of distances
DIM SHARED city(nMax) AS STRING
DIM P(n), BestP(n)
'read distances
CALL AssignDistances(nMax)
'initial permutation
FOR j = 1 TO n
  P(j) = j
  BestP(j) = j
NEXT j
'initial length of trip
MinLen = PathLen(P())
'*** main loop
DO 'infinite loop till user stops
  'count trials
  no = no + 1
  '*** interacting with user
  'check if user pressed key to stop
  k$ = INKEY$
  IF k$ > "" THEN EXIT DO 'exit infinite loop
  'display current progress
  display (no)
  '*** get any path
  CALL GetPermutation(P())
  x = PathLen(P())
  IF x < MinLen THEN
    'Better path found, so memorize and display
    Dlen = MinLen - x
    MinLen = x
    '*Memorize best order and print
    FOR j = 1 TO n
      BestP(j) = P(j)
      PRINT city(P(j)); "->";
    NEXT j
    'Finish printing
    PRINT city(P(1))
    PRINT "Best so far: "; MinLen
    PRINT "Progress rate "; Dlen / (no - Slen); " miles per trial"
    Slen = no
  END IF
LOOP
'Print final message
CLS
PRINT "ALPHABETIC TOUR OF FIRST "; n; " CITIES in the USA"
PRINT "Blind Search Recommended Route found in "; Slen; "th search"
FOR j = 1 TO n - 1
  PRINT city(BestP(j)); "->";
NEXT j
PRINT city(BestP(n)); "->"; city(BestP(1))
PRINT "Total distance: "; MinLen
LOCATE 22, 40
PRINT "(Distances subject to change)"
END
When you run this program on larger sets of cities, you will notice that the program is not fast. One possible improvement in the design of this program is to modify the randomization to be less likely to pick long paths. For instance, you can attempt to modify paths that are known to be short, or weight the modifications by lengths of resulting paths. Such methods are actually in use in image restoration problems (simulated annealing), see page .
It is possible to improve on the last aspect without complicating the program much. The idea is to make random modifications of the currently best found value. For example, in a one-dimensional maximization of a function f(x), we would do the following steps.

We want to allow for the chance of checking points far away from the "best-so-far" answer. But we don't want this to happen too often, because x_{0} might be rather close to the optimum. The trade-offs are that the program will tend to follow a "direct path" to the maximum, but the danger is that it will get stuck longer in local maxima.
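Here is one way such an improved blind search might look for maximizing a function on [0, 1] (a Python sketch; the 10% long-jump rate and the step size 0.05 are arbitrary choices of ours):

```python
import random

def improved_blind_search(f, trials=10_000):
    """Blind search that perturbs the best-so-far point x0,
    with an occasional long jump to escape local maxima."""
    x0 = random.random()
    best = f(x0)
    for _ in range(trials):
        if random.random() < 0.1:
            x = random.random()              # rare long jump anywhere in [0, 1]
        else:
            x = x0 + random.gauss(0, 0.05)   # usual small step near x0
            x = min(max(x, 0.0), 1.0)
        if f(x) > best:
            x0, best = x, f(x)
    return x0, best

x, fx = improved_blind_search(lambda t: t * (1 - t))
print("maximum found near x =", round(x, 2))
```

On the test function t(1 − t) the search should settle near x = 1/2, the true maximum.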
Improved blind search with time/state dependent randomization is actually implemented within RANDTOUR. It is commented out in the version on the disk, so that it isn't operational. To make it active, uncomment the call to ImproveBest as a replacement for GetPermutation.
Program RANDTOUR.BAS selects permutations at random only in its ``simplest" variant. Here are a few examples of problems that require selecting random permutations.
Suppose there are 7 items hidden under 12 cups, and a person is allowed to try to find them. How often will all seven be recovered by pure luck (as opposed to, say, parapsychic abilities)?
SUB GetPermutationFast (P())
'Put a random permutation into P()
n = UBOUND(P) 'how many entries
'this loop can be omitted if we are sure that P(j) lists all the numbers we want
FOR j = 1 TO n
  P(j) = j
NEXT j
FOR j = 1 TO UBOUND(P) 'not n, as n changes in the loop
  k = INT(RND(1) * n + 1)
  SWAP P(n), P(k)
  n = n - 1
NEXT j
END SUB
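The same swap-based method in Python (our names; Python's own random.shuffle uses essentially this scheme):

```python
import random

def get_permutation_fast(n):
    """Random permutation of 1..n by the same shrinking-swap scheme."""
    p = list(range(1, n + 1))
    m = n
    for _ in range(n):
        k = random.randrange(m)          # random index 0..m-1
        p[m - 1], p[k] = p[k], p[m - 1]  # SWAP P(n), P(k) as in the SUB
        m -= 1
    return p

print(get_permutation_fast(10))
```

Each of the n! orderings is produced with the same probability, since every swap chooses uniformly among the remaining slots.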
Exercise 5 There are many incompatible measures of "quality" of an algorithm. One of the "objective" criteria is the number of "IF" verifications. Another "objective" criterion might be the number of calls to a function. A "subjective" criterion, which depends on the hardware and circumstances, is timing.

Does SUB GetPermutationFast deserve the word Fast in its name?
Exercise 4 The Subset-Sum Problem is stated as follows:

Let S be a set of positive integers and let t be a positive integer. Decide if there is a subset S′ ⊂ S such that the sum of the integers in S′ is exactly t.

The associated optimization problem is to find a subset S′ ⊂ S whose elements' sum is the largest but doesn't exceed t. This optimization problem is NP-complete, i.e., it isn't known if there is a polynomial-time algorithm (polynomial in the size of S) to find S′.

Investigate how the blind search will do on sets S selected at random, and on sets S constructed in a more regular fashion, like the arithmetic progression S = {a, 2a, 3a, ...} or the geometric progression S = {a, a^{2}, a^{3}, ...}.
To formalize this idea, suppose B is an event such that Pr(B) ≠ 0. The last condition merely says that B is an event that does have some chance of occurring. The conditional probability of event A given event B is denoted by Pr(A|B). It is defined as

Pr(A|B) = Pr(A∩B) / Pr(B).
Conditional probability satisfies the axioms of probability, and Pr(A|B) = 0 if A, B are disjoint. In particular, Pr(A′|B) = 1 − Pr(A|B), Pr(B|B) = 1.
The easiest way to find Pr(A|B) by simulations is to repeatedly simulate the complete experiment, discarding all the outcomes except the ones resulting in B.
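The discard-and-repeat scheme can be sketched as follows (Python; as a test case we estimate Pr(first toss is heads | exactly two heads in three tosses), whose exact value is 2/3):

```python
import random

def conditional_prob(trials=100_000):
    """Estimate Pr(A | B) by discarding outcomes outside B.
    Here B = "exactly two heads in three tosses", A = "first toss is heads"."""
    kept = hits = 0
    while kept < trials:
        tosses = [random.random() < 0.5 for _ in range(3)]
        if sum(tosses) != 2:
            continue          # outcome not in B: discard and repeat
        kept += 1
        if tosses[0]:
            hits += 1         # outcome is also in A
    return hits / kept

print(conditional_prob())     # should be near 2/3
```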
Exercise 6 Suppose we repeatedly toss a fair coin, and the "success" is to get heads.

Use computer simulations to find the conditional probability that the very first trial was successful, if 10 consecutive (and independent) trials resulted in 8 successes.

Try to answer the same question under the condition that 50 independent trials resulted in 40 successes.
You should notice that it takes forever to simulate events that happen rarely. Section indicates one possible way out of this difficulty.
Example 5 Suppose we have a deck of 52 cards numbered 1 through 52. Since it isn't obvious how to simulate selecting 5 cards without replacement, we may want to select them with replacement instead. Let A denote the event that all five cards are different. What is the probability of A?
We may perform the experiment sequentially, drawing one card at a time. Let A_{k} denote the event that the first k consecutive draws resulted in different cards. Then A = A_{5} ⊂ A_{4} ⊂ ... ⊂ A_{1}. Moreover, Pr(A_{1}) = 1.
Clearly, Pr(A_{2}|A_{1}) = 51/52, so Pr(A_{2}) = Pr(A_{2}∩A_{1}) = Pr(A_{2}|A_{1})Pr(A_{1}) = 51/52. Similarly, Pr(A_{3}) = Pr(A_{3}∩A_{2}) = Pr(A_{3}|A_{2})Pr(A_{2}) = (50/52)(51/52). Continuing this we get Pr(A_{5}) = (51·50·49·48)/52^{4} ≈ 0.82.
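The chain of conditional probabilities above is easy to check numerically; a short Python computation:

```python
# Pr(A_5) = product of Pr(A_{k+1} | A_k) = (52 - k)/52 for k = 1, 2, 3, 4
p = 1.0
for k in range(1, 5):
    p *= (52 - k) / 52
print(round(p, 2))   # 0.82
```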
The following identities are also of interest.

Pr(A∩B) = Pr(A|B) Pr(B),   Pr(A) = Pr(A|B) Pr(B) + Pr(A|B′) Pr(B′).   (10)
Example 6 A lake has 200 white fish and 100 black fish, and a nearby pond contains 20 black fish and 10 white ones. No other fish live there.
A fish is selected at random from the lake and moved to the pond. Then a fish is selected from the pond and moved back to the lake. What is the probability that all fish in the pond are black?
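A single double transfer is straightforward to simulate; a Python sketch (with our names), returning the pond's composition after one round trip:

```python
import random

def double_transfer(lake_white=200, lake_black=100, pond_white=10, pond_black=20):
    """One random fish lake -> pond, then one pond -> lake;
    returns the pond's final (white, black) counts."""
    # lake -> pond: each lake fish equally likely
    if random.random() < lake_white / (lake_white + lake_black):
        lake_white -= 1
        pond_white += 1
    else:
        lake_black -= 1
        pond_black += 1
    # pond -> lake: each pond fish equally likely
    if random.random() < pond_white / (pond_white + pond_black):
        pond_white -= 1
        lake_white += 1
    else:
        pond_black -= 1
        lake_black += 1
    return pond_white, pond_black

print(double_transfer())
```

Repeating the transfer inside a loop gives the long-run behavior asked about in the exercise below.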
Denoting by 𝒫 the generic path A_{1}∩A_{2}∩...∩A_{n}, with 𝒫(k) = A_{k}, we have the following path integral formula for the probability of an event ℱ specifying the outcome of the complete experiment and consisting of many paths 𝒫.

Pr(ℱ) = Σ_{𝒫 ⊂ ℱ} Pr(𝒫(1)) Pr(𝒫(2)|𝒫(1)) ··· Pr(𝒫(n)|𝒫(1)∩...∩𝒫(n−1)).   (11)
Example 7 Suppose that the double transfer operation from Example 1.5.1 was repeated twice. That is, a random selection was done four times. What is the probability that all fish in the pond are black?
Exercise 8 Check by simulation how the proportion of black fish in the pond changes when the random transfers from Example 1.5.1 are performed repeatedly for a long time.
Two events A, B are independent if the conditional probability of A given B is the same as the unconditional one: Pr(A|B) = Pr(A). This is stated in a multiplicative form which exhibits the symmetry and includes trivial events^{8}. Definition 1 Events A, B are independent if Pr(A∩B) = Pr(A)Pr(B).
Independence captures the intuition of noninteraction, and lack of information. In modeling it is often assumed rather than verified. For instance, we shall assume that the events generated by consecutive outputs of the random generator are independent. We also assume that tosses of a coin (fair, or not!) are independent.
Beginners sometimes confuse disjoint and independent events. Exclusive (i.e., disjoint) events capture the intuition of non-compatible outcomes. Incompatible outcomes cannot happen at the same time. This is not the same as independent outcomes. If A, B are disjoint and you know that A occurred, then you do know a lot about B: namely, you know that B cannot occur. Thus there is an interaction between A and B. Knowing whether A occurred influences the chances of B, which is not possible under independence.
Independence (or, more properly, mutual stochastic independence) of families of events is defined by requesting a much larger number of multiplicative conditions. The reason behind this is Theorem , which provides a very convenient tool. Definition 2 Events A_{1}, A_{2}, ..., A_{n} are independent if Pr(⋂_{j ∈ J} A_{j}) = Π_{j ∈ J} Pr(A_{j}) for all subsets J ⊂ {1, ..., n}.
Example 8 A coin is tossed repeatedly. Find the probability that heads appears for the first time on the fourth toss.
Problem 4 SUB GetPermutation from the program RANDTOUR.BAS selects numbers between 1 and n at random until it finds a number not yet on the list. Then it adds the number to its list, and repeats the process.
Another important concept is conditional independence. For example, many events in the past and in the future are dependent. But in many mathematical models, past and future are independent conditionally on the present situation. In such a model the future depends on the past only through present events!
Definition 3 Let C be a nontrivial event. Events A, B are C-conditionally independent if Pr(A∩B|C) = Pr(A|C)Pr(B|C).
Random variables are introduced for convenient description of experiments with numerical outcomes. (The other option is to select Ω ⊂ ℝ, or Ω ⊂ ℝ^{d}.) If we want to run computer simulations, we need to represent even non-numerical experiments (like tossing coins) in numerical terms anyhow. Thus the language of random variables becomes the natural extension of elementary probability theory, expressing many of the same concepts in slightly different language.
A random variable is a numerical quantity assigned to every outcome of the experiment. In mathematical terms, a random variable is a function X: Ω → ℝ with the property that the sets {ω ∈ Ω: X(ω) ≤ a} are events in ℳ for all a ∈ ℝ. Recall that the last condition means that we may talk about probabilities of the events {ω ∈ Ω: X(ω) ≤ a}.
Probabilities for a one-dimensional r. v. are determined by the cumulative distribution function

F(x) = Pr(X ≤ x), x ∈ ℝ.   (12)
The cumulative distribution function can be used to express probabilities of intervals: Pr(a < X ≤ b) = F(b) − F(a). Since probability is continuous (1), we can also compute Pr(X = a) = lim_{b→a−} Pr(b < X ≤ a) = F(a) − F(a^{−}). The left-hand limit F(a^{−}) exists, as F is a nondecreasing function.
Example 9 Suppose F(x) = (1 − e^{−x}) ∨ 0. Then Pr(|X − 2| < 1) = e^{−1} − e^{−3}.
In probability theory we are concerned with probabilities. Random variables that have the same probabilities are therefore considered equivalent. We write X ≅ Y to denote the equality of distributions, i.e., Pr(X ∈ U) = Pr(Y ∈ U) for all Borel sets U ⊂ ℝ (say, all intervals U).
Vector-valued r. v. are measurable functions Ω → ℝ^{d}. In the vector case we also refer to X = (X_{1}, ..., X_{d}) as the d-variate, or multivariate, random variable.
We will use the ordinary notation for sums and inequalities between random variables. There is however a word of caution. In probability theory, equalities and inequalities between random variables are interpreted almost surely. For instance, X ≤ Y + 1 means Pr(X ≤ Y + 1) = 1; the latter is a shortcut that we use for the expression Pr({ω ∈ Ω: X(ω) ≤ Y(ω) + 1}) = 1. Problem 5 Show that F(x) = Pr(X ≤ x) is right continuous: lim_{x→a+} F(x) = F(a).
A binomial experiment, also called binomial trials, consists of a sequence of simpler identical experiments that have two possible outcomes each. The independent events S_{j} represent successes in consecutive experiments. We assume that we have an infinite sequence of events S_{1}, S_{2}, ..., S_{k}, ... that are independent and have the same probability p = Pr(S_{j}). We denote by F_{j} = S_{j}′ the failure in the j-th experiment, and put q = 1 − p.
Two important random variables associated with the binomial experiment are the number X of successes in n trials, and the number T of trials until the first success. Example 10 The probability that the number X of successes in n trials is k is Pr(X = k) = C(n,k) p^{k} q^{n−k}. (Here k = 0, ..., n.)
Example 11 The probability that more than n attempts are needed for the first success is Pr(T > n) = q^{n}. The probability that the first success occurs at the n-th trial is Pr(T = n) = p q^{n−1} (geometric).

Example 12 The geometric distribution has the lack of memory property: Pr(T > n + k | T > n) = Pr(T > k).
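The lack of memory property follows from Pr(T > n) = q^{n}, and can be checked numerically (Python; p = 0.3, n = 5, k = 7 are arbitrary choices of ours):

```python
# Verify Pr(T > n+k | T > n) = Pr(T > k) using Pr(T > n) = q**n
p = 0.3
q = 1 - p
n, k = 5, 7
lhs = q ** (n + k) / q ** n   # Pr(T > n+k | T > n)
rhs = q ** k                  # Pr(T > k)
print(lhs, rhs)               # the two values agree (up to round-off)
```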
Random variables are often described solely in terms of the cumulative distribution function F(x), or formulas for Pr(X = x), without reference to the underlying probability space Ω. For instance, the number of minutes T that we spend waiting for a bird to come to the bird feeder at the back of my house is random, and I believe Pr(T = n) = p q^{n−1} because Pr(T > n + k | T > n) = Pr(T > k).
It is intuitively obvious that on average we get np successes in n trials. It is perhaps less obvious^{10} that on average we need 1/p trials to get the first success.
[ 9 Write a simulation program to verify the claims about the averages for several values of p.
Example 13 The probability that in 2n tosses of a fair coin half are heads is (2n)!/(4^{n}(n!)^{2}) ≈ 1/√(πn) → 0 as n → ∞. The latter isn't easy to prove, but the computer printout is quite convincing, see the table below (note that 1/√π ≈ 0.5642).
2n    Pr(X = n)   Frequency in 1000 trials   √n · Pr(X = n)
100   0.07959     0.08200                    0.56278
300   0.04603     0.06100                    0.56372
500   0.03566     0.03700                    0.56391
700   0.03015     0.02200                    0.56399
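The exact column of the table can be reproduced directly (Python; math.comb computes the binomial coefficient exactly):

```python
import math

def central(n):
    # Pr(X = n) = (2n)! / (4**n * (n!)**2) for 2n tosses of a fair coin
    return math.comb(2 * n, n) / 4 ** n

for two_n in (100, 300, 500, 700):
    n = two_n // 2
    print(two_n, round(central(n), 5), round(math.sqrt(n) * central(n), 5))
# the first line reads: 100 0.07959 0.56278
```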
By now you should have written some simple simulation programs, and printed out the results. It is perhaps a good moment to pause and consider what are the aspects of simulations that we are interested in.
In general, we would like to get answers to questions that we don't know how to answer in any other way. But before we do that, we should develop some intuition on the cases that can check the answers. Therefore we begin with simulation of probabilities or averages that are known.
Problem 7 [Exercise] A die is thrown until an ace turns up.
Assuming the ace doesn't turn up on the first throw, what is the probability that more than three throws (ie. at least four) will be necessary? ANS: 197/198
Suppose that an ace turns up on an even throw. What is the probability that it turned up on the second throw?
(If you think this is too hard mathematics, do it as a computer assignment!)
Exercise 5 A deck of 52 cards has 4 suits and 13 values per suit.
If you think this is too difficult on a computer, compute the probabilities by hand.
Exercise 10 A math teacher in a certain school likes to give multiple choice tests, grade them as either right or wrong, and then lets the students go over the test and correct the ones they got wrong. This gives them two chances to get a problem right, and the chance of getting a question right increases even if the student just guesses the answers. Suppose a student simply guessed the first time, got the corrections, and guessed differently the second time on the wrong answers. How much does his grade improve?
Exercise 6 This is the expanded version of Exercise 1.5.1. In a group of n people, how often do at least two have the same birthday?
Tree species are not distributed at random but are associated with special habitats.
The Audubon Society Field Guide to North American Trees.
Intuitively, random variables are numerical quantities measured in an experiment. The concept^{11} is the core of probability theory; it leads outside of elementary probability and it touches advanced concepts of integration, function transforms and weak limits.
For convenience random variables are split into three groups: continuous, discrete, and the rest.
The definition says that X is a discrete r. v. if there is a finite or countable set 𝒱 of numbers (values) of X such that p_{v} = Pr(X = v) > 0 and Σ_{v ∈ 𝒱} p_{v} = 1. The function f(v) = p_{v} is then called the probability mass function of X. For completeness, the domain of the probability mass function is often extended to all x ∈ ℝ (or to x ∈ ℝ^{d} in the multivariate case) by f(x) = Pr(X = x).
It is easy to see that if f is a function which satisfies the two natural conditions f(v) ≥ 0 and Σ_{v} f(v) = 1, then f is the probability mass function of some discrete random variable.
For discrete r. v. the cumulative distribution function (12) plays a lesser role. It is a discontinuous function given by the expression
F(x) = ∑_{v ∈ V: v ≤ x} p_v    (16)
Name | Values | Probabilities | Symbol | Parameters
Binomial | 0, …, n | Pr(X = k) = C(n,k) p^k (1-p)^{n-k} | Bin(n,p) | 0 ≤ p ≤ 1, n ∈ ℕ
Poisson | 0, 1, 2, … | Pr(X = k) = e^{-λ} λ^k/k! | Poiss(λ) | λ > 0
Geometric | 1, 2, … | Pr(X = k) = p(1-p)^{k-1} | | 0 ≤ p ≤ 1
Uniform | {x_1, …, x_k} | Pr(X = x_j) = 1/k | | k ∈ ℕ, x_1, …, x_k ∈ ℝ
Hypergeometric | | | |
Negative Binomial | | | |
Example 14 Let the random variable X denote the number of heads in three tosses of a fair coin.
Example 15 Let the random variable X denote the score of a randomly selected student on the final exam.
Problem 8 Let N be Poiss(λ), and assume N balls are placed at random into n boxes. Find the probability that exactly m of the boxes are empty. ANS: C(n,m) e^{-λm/n}(1 - e^{-λ/n})^{n-m}.
Other methods are also available for the distributions from Table 2.1. For example, the program TOSSCOIN.BAS on page pageref simulates the binomial distribution Bin(n = 100, p = 1/2).
The following exercise provides tools to run more involved simulations. Exercise 11. Write functions SimOneBin(n,p) and SimOneGeom(p) that will simulate a single occurrence of the Bin(n,p) r. v. and the geometric r. v. Sample usage: PRINT SimOneBin(15,.5) should simulate the number of heads in tossing 15 fair coins.
Also write function SimGeneric(p()) which simulates generic r. v. with values {0,1,...,n} and prescribed probabilities p_{k} = p(k).
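The course recommends BASIC for these exercises; purely as an illustration, here is a sketch of the same three functions in Python (the names mirror the exercise, the snake_case spellings are my own):

```python
import random

def sim_one_bin(n, p, rng=random):
    # One draw from Bin(n, p): count successes in n independent trials.
    return sum(1 for _ in range(n) if rng.random() < p)

def sim_one_geom(p, rng=random):
    # One draw from the geometric distribution Pr(X = k) = p(1-p)^(k-1):
    # the number of trials up to and including the first success.
    k = 1
    while rng.random() >= p:
        k += 1
    return k

def sim_generic(p, rng=random):
    # Generic discrete r.v. with values 0, 1, ..., n and probabilities p[k]:
    # locate where a uniform draw falls among the cumulative probabilities.
    u = rng.random()
    cumulative = 0.0
    for k, pk in enumerate(p):
        cumulative += pk
        if u < cumulative:
            return k
    return len(p) - 1  # guard against round-off in the last step

print(sim_one_bin(15, 0.5))  # number of heads in tossing 15 fair coins
```

The cumulative-sum search in sim_generic is the same idea as drawing from any finite table of probabilities.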
Since probability satisfies the continuity axiom (1), Pr(X ∈ (a, a+h)) → 0 as h → 0 for all a. The main interest in the continuous case is that the rate of convergence to 0 is also known: Pr(X ∈ (a, a+h)) ≈ f(a)h + o(h). The function f(x) in this expansion is called the density function.
In terms of the cumulative distribution function (12), the probability is Pr(X ∈ (a, a+h)) = F(a+h) - F(a), and thus f(a) = lim_{h→0} (F(a+h) - F(a))/h = F′(a) is the derivative of the cumulative distribution function F. Therefore, when the Fundamental Theorem of Calculus can be invoked (say, when f is piecewise continuous),
F(x) = ∫_{-∞}^{x} f(t) dt    (17)
Definition 5 Random variable X is (absolutely) continuous, if there is a function f such that Pr(X ∈ U) = ∫_U f(x) dx. The function f is called the probability density function of X.
It is known that if f is a function which satisfies the two natural conditions
(i) f(x) ≥ 0, and (ii) ∫_{-∞}^{∞} f(x) dx = 1,
then f is the probability density function of some random variable.
It is convenient to use the heuristic that the probability density function in the continuous case corresponds to the probability mass function in the discrete case, and that sums in the discrete case are replaced by integrals in the continuous case; compare (17) and (16).
Name | Range | Density | Symbol | Parameters
Normal | -∞ < x < ∞ | f(x) = (1/(√(2π)σ)) exp(-(x-μ)²/(2σ²)) | N(μ,σ) | μ ∈ ℝ, σ > 0
Exponential | x > 0 | f(x) = λe^{-λx} | | λ > 0
Uniform | a < x < b | f(x) = 1/(b-a) | U(a,b) | a < b real
Gamma | x > 0 | f(x) = C x^{α-1} e^{-x/β} | Gamma(α,β) | α > 0, β > 0, C = 1/(β^α Γ(α))
Weibull | | | |
Example 16 A dart is thrown at a circular dart board of radius 6. Let X denote the distance of the dart from the center. Assuming uniform assignment of probability (3), the density of X is
f(x) = x/18 for 0 ≤ x ≤ 6, and f(x) = 0 otherwise.
Problem 9 Referring to Exercise 1.2.2, let X, Y denote the arrival times of the two drivers at the intersection. Find the density of the time lapse X - Y between their arrivals.
To create a useful histogram, split the range into a finite number of intervals. Then graph over each interval the rectangle with area equal to the observed frequency. The number and positioning of the intervals depend on the amount of data available, and on personal preference.
The generic method for simulating a continuous random variable is similar to the method used in the discrete case. Namely, we take X = f(U) with the suitable function f.
To find f assume it is increasing and thus invertible with inverse g. Then Pr(f(U) < x) = Pr(U < g(x)) = g(x). This completes the generic prescription: take as f(x) the inverse^{12} of the cumulative distribution functions F(x) = Pr(X £ x).
This method of simulation is quite effective if the inverse of F can be found analytically. It becomes slow when the inverse (or, worse still, cumulative distribution function F itself) is computed by numerical procedures. Since this is the case of the normal distribution, special methods are used to simulate the normal distribution.
Example 17 To simulate X which is exponential with parameter λ, use X = -(1/λ) ln U.
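As a sanity check of the inverse-cdf prescription, a small Python sketch (an illustration only; the course programs are in BASIC): for the exponential distribution F(x) = 1 - e^{-λx}, the inverse is -ln(1-u)/λ, and since 1 - U is again uniform on (0,1) one may use -(1/λ) ln U.

```python
import math
import random

def sim_exponential(lam, rng=random):
    # Inverse-cdf method: F(x) = 1 - exp(-lam*x) has inverse -ln(1-u)/lam.
    # 1 - rng.random() lies in (0, 1], which keeps the logarithm finite.
    return -math.log(1.0 - rng.random()) / lam

rng = random.Random(0)
sample = [sim_exponential(2.0, rng) for _ in range(5000)]
print(sum(sample) / len(sample))  # should be close to 1/lam = 0.5
```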
Expected value captures the intuition of the average of a random quantity. It is also this intuition that leads to estimating the expected value by averaging the outcomes of simulations. Example 18 If X has values x_1, …, x_n with equal probabilities, then EX = x̄ is the arithmetic mean of x_1, …, x_n.
Simulations are often used to get answers that are too difficult to find analytically. The following exercise can be answered by simulation if you figure out how to shuffle cards from a deck at random (Exercise 1.4). Exercise 12. What is the expected number of cards which must be turned over in order to see every card of one suit?
Example 19 If X takes two values a < b and Pr(X = a) = p, then EX = pa + (1-p)b is a number in the closed interval [a,b].
Definition 7 For a continuous random variable X the expected value EX is given by EX = ∫_ℝ x f(x) dx, provided the integral converges.
Name | Probability distribution | EX
Normal | f(x) = (1/(√(2π)σ)) exp(-(x-μ)²/(2σ²)) | μ
Exponential | f(x) = λe^{-λx} | 1/λ
Uniform | f(x) = 1/(b-a) | (a+b)/2
Gamma | |
Weibull | |
Binomial | Pr(X = k) = C(n,k) p^k (1-p)^{n-k} | np
Poisson | Pr(X = k) = e^{-λ} λ^k/k! | λ
Geometric | Pr(X = k) = p(1-p)^{k-1} | 1/p
Hypergeometric | |
Negative Binomial | |
Problem 10 Compute EX for the entries in Table 2.3.
Exercise 13. Simulate EX for the entries in Table 2.3 (except the normal) for different values of the parameters involved.
Problem 11 [Exercise] Referring to Example 2.2.1, what is the average distance of a dart from the center? ANS: 4.
If you chose to do the simulations, you perhaps noticed that it is rather difficult to decide how many simulations to run in order to achieve the desired accuracy. Typically, you need to increase the size of a simulation four times to halve the error.
Another point to keep in mind is that simulations return only numbers. But from numbers alone it is difficult to see how the parameters of the model affect the answer, and this is often the more interesting question.
EX = ∫_0^∞ Pr(X > t) dt    (21)
Proof. To simplify the proof and expose the main idea more clearly, we consider the discrete case with a finite number of values only. Similarly, to avoid technicalities, we consider the continuous case with a bounded range only.
Discrete case:
EX = x_1 p_1 + x_2 p_2 + … + x_n p_n = x_1(p_1 + p_2 + … + p_n) + (x_2 - x_1)(p_2 + … + p_n) + … + (x_{n-1} - x_{n-2})(p_{n-1} + p_n) + (x_n - x_{n-1})p_n. The latter sum is ∫_0^{x_n} Pr(X > t) dt, since Pr(X > t) = p_{k+1} + … + p_n for x_k ≤ t < x_{k+1}.
Continuous case: Let f denote the density and F be the cumulative distribution function.
Integrating by parts
EX = ∫_0^b x f(x) dx = b - ∫_0^b F(x) dx = ∫_0^b (1 - F(x)) dx.
^{[¯]}
If X is discrete, integer valued, X ∈ {0, 1, …}, then (21) can be
written as
EX = ∑_{n=0}^∞ Pr(X > n)    (22)
It is natural to define EX for more general r. v. in the same way through formula (21). Write X = X⁺ - X⁻ to decompose X into its nonnegative and nonpositive parts, and then define EX = ∫_0^∞ Pr(X > t) dt - ∫_0^∞ Pr(X < -t) dt. Clearly, if one of the integrals diverges, EX is not defined.
Example 20 Suppose X is exponential with the density f(x) = e^{-x}. Let Y be X truncated at level 3. That is,
Y = X if X ≤ 3, and Y = 3 if X ≥ 3.
Clearly, Y is not continuous, as Pr(Y = 3) = Pr(X > 3) = e^{-3} > 0. On the other hand, Y is not discrete, as it takes an uncountable number of values; in fact all the numbers between 0 and 3 are possible. The definitions of the expected value we gave do not apply, but (21) can be used to show that E(Y) = 1 - e^{-3}.
Indeed, Pr(Y > t) = Pr(X > t) = e^{-t} for 0 < t < 3, and Pr(Y > t) = 0 for t > 3.
Example 21 To determine the mean of the geometric distribution we can either compute the sum p ∑_{n=1}^∞ n(1-p)^{n-1}, or use (21) and find the easier sum ∑_{n=0}^∞ (1-p)^n.
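A quick numeric cross-check of the two computations (a Python sketch, not part of the text): the direct sum and the tail sum of (21) should both return the mean 1/p.

```python
def geometric_mean_direct(p, terms=5000):
    # E X = sum over n >= 1 of n * p * (1-p)^(n-1)
    return sum(n * p * (1 - p) ** (n - 1) for n in range(1, terms + 1))

def geometric_mean_by_tails(p, terms=5000):
    # E X = sum over n >= 0 of Pr(X > n) = sum of (1-p)^n
    return sum((1 - p) ** n for n in range(terms))

print(geometric_mean_direct(0.3), geometric_mean_by_tails(0.3))
```

Both values agree with 1/p, here 1/0.3 ≈ 3.33.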
Pr(U > t) ≤ EU/t    (23)
Indeed, by (21) we have EU = ∫_0^∞ Pr(U > x) dx ≥ ∫_0^t Pr(U > x) dx ≥ t Pr(U > t).
Problem 12 Suppose U is uniform U(0,1). Then Pr(U > t) ≤ 1/(2t). This means that 1 - t ≤ 1/(2t).
Expected value EX is approximated without much difficulty^{13} on a computer by averaging a large number of independent trials.
Example 22 To find the average of the uniform U(0,1) distribution, take ^{1}/_{n}(U_{1}+...+U_{n}).
This is the basis of Monte Carlo methods, a family of related probabilistic methods for computing integrals. To find ∫_a^b f(x) dx we simulate X_j = f(a + (b-a)U_j). The average (1/n) ∑_{j=1}^n X_j approximates (1/(b-a)) ∫_a^b f(u) du for large n.
The error of the n-sample approximation is of order n^{-1/2}, which is worse than the trapezoidal rule for smooth functions. In return, the approximation is insensitive to the smoothness of the integrand, and also to the dimension of the integral. Monte Carlo methods can be used effectively for multiple integrals over irregular domains.
Exercise 14. Use the Monte Carlo method to approximate π = ∫_{-1}^{1} 2√(1-x²) dx. (You may want to compare the output with the numerical procedures described in Section .)
Another method of similar nature is to pick points (X,Y) at random from the rectangle containing the graph of f and check if Y < f(X) holds. The proportion of ``successes" approximates the proportion of the area under the graph of f.
Exercise 15. Approximate π/4 by selecting points (X,Y) at random from the unit square, and checking if X² + Y² < 1.
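Both exercises can be sketched in a few lines (Python here purely for illustration; the seed and the trial counts are arbitrary choices):

```python
import random

def pi_by_averaging(n, rng=random):
    # Exercise 14: pi = integral from -1 to 1 of 2*sqrt(1-x^2) dx,
    # approximated by (b - a) times the average of the integrand.
    total = 0.0
    for _ in range(n):
        x = -1.0 + 2.0 * rng.random()
        total += 2.0 * (1.0 - x * x) ** 0.5
    return 2.0 * total / n  # (b - a) = 2

def pi_by_hit_or_miss(n, rng=random):
    # Exercise 15: proportion of random points of the unit square
    # falling under the quarter circle x^2 + y^2 < 1, times 4.
    hits = sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 < 1.0)
    return 4.0 * hits / n

rng = random.Random(1)
print(pi_by_averaging(20000, rng), pi_by_hit_or_miss(20000, rng))
```

Both estimates approach π as n grows, with errors of order n^{-1/2}.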
The following sample program computes numerically the double integral ∫∫_U cos(10x + 20y) dx dy over the disk x² + y² < 1. The only conceptual difficulty as compared to single integrals is how to select points at random from the unit disk. This is done by picking points from a bigger square and discarding those that didn't make it. Can you do this integral analytically? Or by another numerical procedure?
'declarations
DECLARE FUNCTION Integrand! (X!, Y!)
DECLARE FUNCTION InDomain! (X!, Y!)
CONST True = 1

' simulation loop
NumTrials = 10000
FOR j = 1 TO NumTrials
   'select random points from the square [-1,1]x[-1,1]
   X = 2 * RND(1) - 1
   Y = 2 * RND(1) - 1
   'check if this is in the domain
   IF InDomain(X, Y) THEN
      NumTested = NumTested + 1
      Sum = Sum + Integrand(X, Y)
      Var = Var + Integrand(X, Y) ^ 2
   END IF
NEXT j

'Print the answer
PRINT "Examined "; NumTested; " random points"
IF NumTested = 0 THEN END 'nothing found
N = NumTested
'Sum / N estimates the average of the integrand over the domain;
'multiply by the (estimated) area of the domain to get the integral
Area = 4 * NumTested / NumTrials
PRINT "The integral is approximately "; Area * Sum / N
PRINT "With 95% confidence the error is less than ";
PRINT Area * 1.96 * SQR(Var / N - (Sum / N) ^ 2) / SQR(N)
END

FUNCTION InDomain (X, Y)
   'This function checks if (X, Y) is in the integration domain.
   'The definition of the domain can be easily modified here,
   'including more complicated domains.
   IF X ^ 2 + Y ^ 2 < 1 THEN InDomain = True
END FUNCTION

FUNCTION Integrand (X, Y)
   'This is the function to be integrated
   Integrand = COS(10 * X + 20 * Y)
END FUNCTION
It is important to have some idea about how accurate the answer is. The program uses the normal approximation (Central Limit Theorem) to estimate the magnitude of the error. A less sharp (and thus safer to use!) error estimate can be obtained from (23).
Exercise 7. There are two natural choices for estimating by simulation events of small probability. We can pick a large sample size n, run the simulation experiment, and hopefully get several data points. The outcome X of such an experiment is a binomial random variable with unknown probability p of success, and we would estimate p ≈ X/n. The trouble is that if we don't know how small the chances are, we might get no successes at all, and estimate the probability to be zero. Or we could run the experiment until we get a prescribed number of successes. The observation would then consist of a set of geometric random variables T_1, …, T_k with unknown mean ET = 1/p. We could then estimate p = 1/ET by taking the inverse of the arithmetic mean of the T_j. In this approach we are guaranteed to get some observations, as long as p ≠ 0.
The question is: which of the two methods would you recommend? (Of course, you would recommend the better method, but what might the word ``better" mean here?)
Often an experiment involves measuring two or more random numbers, say X and Y. The fact that we know the distribution of X, and the distribution of Y separately doesn't determine probabilities of events that involve both X and Y simultaneously. Example 23 Suppose

It is clear that if X is discrete and Y is discrete, then (X,Y) is an ℝ²-valued discrete r. v. That is, the values of the pair are countable. The probabilities Pr(X = x, Y = y) are called the joint distribution. The corresponding Pr(X = x) and Pr(Y = y) are the so-called marginals. Example 2.5 points out that marginals do not determine the joint probabilities uniquely. But if we know the joint probabilities, then we can compute the marginals, e.g. Pr(X = x) = ∑_y Pr(X = x, Y = y).
In contrast to the discrete case, joint continuity cannot be recognized from the continuity of the components, and requires a full definition.
Definition 8 Let X = (X_{1},...,X_{n}). Random variables X_{1},...,X_{n} are jointly (absolutely) continuous, if there is a function f such that

Example 24 Suppose X, Y have the uniform distribution in the unit disk. Then the joint density is f(x,y) = 1/π for x² + y² ≤ 1, and the density of X is f_X(x) = (2/π)√(1-x²).
The relation between probabilities and the density is
Pr((X_1, …, X_n) ∈ A) = ∫_A f(x_1, …, x_n) dx_1 … dx_n    (25)
Pr(X_1 ∈ U_1, …, X_n ∈ U_n) = Pr(X_1 ∈ U_1) ⋯ Pr(X_n ∈ U_n) for all Borel sets U_1, …, U_n    (26)
(Similarly we define the stochastic independence of random vectors).
We say that X_1, …, X_n are independent identically distributed (i.i.d.) if (26) holds and Pr(X_i ∈ U) = Pr(X_j ∈ U) for all Borel U ⊂ ℝ. Theorem 2 If X, Y are discrete with the joint probability mass function f(x,y), then the independence of X, Y is equivalent to f(x,y) = f_X(x) f_Y(y).
If X, Y are continuous with the joint density f(x,y), then the independence of X, Y is equivalent to f(x,y) = f_X(x) f_Y(y).
Independence is often part of the model. Independence allows us to determine joint distributions from the marginals. Thus each independent random variable can be analyzed separately, and then more complex questions can be answered. Example 25 Suppose X is binomial Bin(n,p) and Y is Poiss(λ). If X, Y are independent, then the joint probability mass function of X, Y is given by f(x,y) = C(n,x) p^x (1-p)^{n-x} e^{-λ} λ^y/y! for 0 ≤ x ≤ n and y = 0, 1, …, and f(x,y) = 0 otherwise.
Example 26 [Example 1.2.2 continued] Two drivers arrive at an intersection between 8:00 and 8:01 every day. Their arrival times are independent random variables. Indeed, using formula (25) and elementary area computation, Pr(X < a,Y < b) = ab for 0 < a,b < 1.
Sums and linear combinations, medians, maxima, and minima are perhaps the most often encountered functions of multidimensional random variables. Methods to compute the distribution, or the expected value of such a function are of considerable practical significance.
If X, Y are discrete and EZ exists for Z = h(X,Y), then
EZ = ∑_{x,y} h(x,y) Pr(X = x, Y = y)    (27)
If X, Y are jointly continuous, then Z = h(X,Y) might be continuous, discrete, or, say, of mixed type. Regardless of the case,
EZ = ∫∫ h(x,y) f(x,y) dx dy    (28)
In particular, the expected value is linear:
E(aX + bY) = a EX + b EY    (29)
The fact that the expected value is linear provides a simple method of computing some otherwise difficult sums. Example 27 Suppose X is binomial Bin(n,p). Then X = X_1 + … + X_n, where X_j is the number of successes in the j-th trial. Clearly each X_j is 0 or 1, and EX_j = p. Hence EX = EX_1 + … + EX_n = np.
Exercise 16. [Example 1.2.2 continued] Two drivers arrive at an intersection between 8:00 and 8:01 every day. On average, how much time lapses between their arrivals?
Definition 10 Let m = EX. The variance Var(X) is defined as Var(X) = E(Xm)^{2}.
Notice that
Var(X) = EX² - (EX)²    (30)

The standard deviation is σ = √Var(X).
Name | Probability distribution | Var(X)
Normal | f(x) = (1/(√(2π)σ)) exp(-(x-μ)²/(2σ²)) | σ²
Exponential | f(x) = λe^{-λx} | 1/λ²
Uniform | f(x) = 1/(b-a) | (b-a)²/12
Gamma | |
Weibull | |
Binomial | Pr(X = k) = C(n,k) p^k (1-p)^{n-k} | np(1-p)
Poisson | Pr(X = k) = e^{-λ} λ^k/k! | λ
Geometric | Pr(X = k) = p(1-p)^{k-1} | (1-p)/p²
Hypergeometric | |
Negative Binomial | |
Problem 13 Compute variances Var(X) for the entries in Table 2.4. (Some of these are a real challenge to your computational skills, so you may safely give up. Another method will make it easier in Chapter ).
Since Var(X) = EX² - (EX)² ≥ 0, the following inequality follows:
(EX)² ≤ EX²    (31)
Definition 11 The covariance of random variables X,Y with expected values m_{X}, m_{Y} is defined as cov(X,Y) = E(Xm_{X})(Ym_{Y}).
Clearly cov(X,X) = Var(X) and Var(X+Y) = Var(X) + Var(Y) + 2 cov(X,Y).
Theorem 3 If X,Y are independent, then Var(X+Y) = Var(X)+Var(Y).
Example 28 Let X_1, X_2, …, X_n be independent {0,1}-valued random variables, and suppose that Pr(X_j = 1) = p. Then Var(∑_{j=1}^n X_j) = np(1-p). What is the distribution of ∑_{j=1}^n X_j?
Example 29 If X > 0 then E e^X = 1 + ∫_0^∞ e^t Pr(X > t) dt.
Problem 14 Show that for X ≥ 0, EX² = 2 ∫_0^∞ t Pr(X > t) dt.
Generalize this formula to EX^p.
Special cases of Chebyshev's inequality (23) are:
Pr(|X| > t) ≤ E|X|/t    (32)
Pr(|X| > t) ≤ EX²/t²    (33)
Pr(|X - EX| > t) ≤ Var(X)/t²    (34)
Chebyshev's inequality is one reason we often strive for a small average quadratic error. If E|X - X_0|² < ε then we can be sure that Pr(|X - X_0| > ∛ε) < ∛ε. The following may be used (with some caution) in computer programs to assess the error in estimating probabilities by sampling.
Example 30 If X is Bin(n,p) then Pr(|X/n - p| > 1/∛n) ≤ 1/(4∛n).
Exercise 17. Run a simulation of an event whose probability you know, printing out the error estimate.
Do the same for an event whose probability you do not know analytically. (Use the programs written for previous exercises.)
From now on, in the output of your simulation programs you should provide some error estimates.
As an example of a simple scheduling problem, consider the following. Example 31 Suppose that we want to bake a batch of chocolate-chip cookies. The tasks and their (estimated) times are:
The dependency graph is quite obvious here; for instance, we cannot start baking before the batter is ready. What is the shortest time after which we can eat the cookies?
Exercise 18. Suppose, for the sake of this exercise, that the numbers presented in the cookie-baking example are just the average values of exponential random variables.
Suppose X, Y are discrete and f(x,y) = Pr(X = x, Y = y) is their joint probability mass function. Then Z = X + Y is discrete with values of the form z = x + y. Therefore
Pr(Z = z) = ∑_x Pr(X = x, Y = z - x)    (35)
For independent random variables X, Y this takes a slightly simpler form.
Pr(Z = z) = ∑_x Pr(X = x) Pr(Y = z - x)    (36)
Formula (36) can be used to prove the so-called summation formulas. Theorem 4 If X, Y are independent binomial with the same parameter p, i.e. X is Bin(n,p) and Y is Bin(m,p), then X + Y is binomial Bin(n+m, p).
If X, Y are independent Poisson Poiss(l_{X}) and Poiss(l_{Y}), then X+Y is Poisson Poiss(l_{X}+l_{Y}).
One can prove that if X, Y are independent and continuous, then X + Y is continuous with the density
f_{X+Y}(z) = ∫_{-∞}^{∞} f_X(x) f_Y(z - x) dx    (37)
Formula (37) defines the convolution f_{X}*f_{Y}. It can be used to prove the so called summation formulas. Theorem 5 If X, Y are independent Normal, then X+Y is Normal.
If X, Y are independent Gamma with the same parameter b, then X+Y is Gamma(a_{X}+a_{Y},b).
Example 33 [Example 1.2.2 continued] Two drivers arrive at an intersection between 8:00 and 8:01 every day. What is the density of the time that lapsed between their arrivals?
Example 34 Suppose X, Y are independent U(0,1). The density of Z = X + Y is
f(z) = z for 0 ≤ z ≤ 1, f(z) = 2 - z for 1 ≤ z ≤ 2, and f(z) = 0 otherwise.
Suppose X,Y are independent. If U = min{X,Y} then Pr(U > t) = Pr(X > t)Pr(Y > t). Therefore the reliability function of U can be computed from the two given ones.
Example 35 If X, Y are independent exponential, then U = min{X,Y} is exponential.
If U = max{X,Y} then Pr(U < t) = Pr(X < t)Pr(Y < t). Therefore the cumulative distribution function of U can be computed from the two given ones.
Example 36 Suppose X,Y are independent uniform U(0,1). Then U = max{X,Y} has the density f(u) = 2u for 0 < u < 1.
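A simulation cross-check of Example 36 (a Python sketch of my own, not from the text): with density 2u on (0,1), the mean of max{X,Y} is ∫_0^1 u · 2u du = 2/3, which the empirical average should reproduce.

```python
import random

def average_max_of_two_uniforms(n_trials, rng=random):
    # E max{X, Y} for independent U(0,1) variables; the density 2u gives 2/3.
    total = 0.0
    for _ in range(n_trials):
        total += max(rng.random(), rng.random())
    return total / n_trials

rng = random.Random(3)
print(average_max_of_two_uniforms(20000, rng))  # close to 2/3
```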
Problem 15 Let U_{1},..., U_{n} be independent uniform U(0,1). Find the density of
Problem 16 If X, Y are independent exponential random variables with parameters λ, μ, show that Pr(X < Y) = λ/(λ+μ).
Let R_1, …, R_n be the corresponding order statistics. This means that at the end of each experiment we rearrange the numbers X_1, …, X_n into the increasing sequence R_1 ≤ … ≤ R_n. Thus R_1 = min_j X_j is the smallest, R_2 = max_i min_{j≠i} X_j is the second smallest, etc.
The density of R_k can be found by the following method. In order for the inequality R_k > x to hold, there must be at least n - k + 1 values among the X_j above level x. Since the X_j are independent and have the same probability p = Pr(X_j > x) of ``success" in crossing over the x-level, Pr(R_k > x) is given by the binomial formula with n trials and probability of success p = 1 - G(x).
Pr(R_k > x) = ∑_{j=n-k+1}^{n} C(n,j) (1 - G(x))^j G(x)^{n-j}    (38)
When the derivative is taken, the sum collapses into just one term, giving the elegant answer r_k(x) = (n!/((k-1)!(n-k)!)) G(x)^{k-1} (1 - G(x))^{n-k} g(x).
The L_2 norm is
‖X‖_2 = (E|X|²)^{1/2}.
Notice that ‖X - EX‖_2 is just another notation for the standard deviation. Thus the standard deviation is the L_2 distance of X from a constant.
We say that X_n converges to X in L_2 if ‖X_n - X‖_2 → 0 as n → ∞. We shall also use the phrase: the sequence X_n converges to X in mean-square. An example of the latter is Theorem .
Several useful inequalities are collected in the following^{14}. Theorem 6 For all squareintegrable X,Y
|EXY| ≤ ‖X‖_2 ‖Y‖_2    (39)
|EX| ≤ ‖X‖_2    (40)
‖X + Y‖_2 ≤ ‖X‖_2 + ‖Y‖_2    (41)
Proof. Proof of (39): The quadratic function f(t) = ‖X + tY‖_2² = EX² + 2tEXY + t²EY² is nonnegative for all t. Therefore its discriminant Δ ≤ 0. (Compute Δ to finish the proof.)
Proof of (40): Use (39) with Y = 1.
Proof of (41): By (39)
‖X + Y‖_2² ≤ ‖X‖_2² + ‖Y‖_2² + 2‖X‖_2‖Y‖_2 = (‖X‖_2 + ‖Y‖_2)².
^{[¯]}

Random variables with ρ = 0 are called uncorrelated. A correlation coefficient close to one of the extremes ±1 means that there is a strong linear relation between X and Y; this is stated more precisely in ().
Theorem 2.7 states that independent random variables with finite variances are uncorrelated.
Problem 17 Give an example of dependent random variables that are uncorrelated.
Suppose we would like to approximate a random variable Y by another quantity X that is perhaps more accessible. Of all the possible ways to do it, a linear function Y ≈ mX + b is perhaps the simplest. Of all such linear functions, we now want to pick the best. In a single experiment, the error is |Y - mX - b|. We could minimize the average empirical error over many experiments, (1/n) ∑_j |Y_j - mX_j - b|. This approximates the average error E|Y - mX - b|. Let us agree to measure the error of the approximation by the quadratic error E(Y - mX - b)² instead. (This choice leads to simpler mathematics.)
Question: For what values of m, b is the error E(Y - mX - b)² the smallest? When is it 0?
Let H(m,b) = E(Y - mX - b)². Clearly, H is a quadratic function of m, b ∈ ℝ. The unique minimum is determined by the system of equations

The answer is m = cov(X,Y)/Var X, b = EY - m EX, and the minimal error is
E(Y - mX - b)² = Var(Y)(1 - ρ²)    (43)
A chain in the (x,y)-plane consists of n links, each of unit length. The angle between two consecutive links is ±α, where α > 0 is a constant. Assume the sign is taken at random, with probability 1/2 for each. Let L_n be the distance from the beginning to the end of the chain. The angle between the k-th link and the positive x-axis is a random variable S_{k-1}, where we may assume (why?) S_0 = 0 and S_k = S_{k-1} + ξ_k α, where ξ_k = ±1 with probability 1/2. The following steps determine the average length of the random chain.
EL_n² = n (1 + cos α)/(1 - cos α) - 2 cos α (1 - cos^n α)/(1 - cos α)²

Definition 12 The conditional expectation E{X | Y = y} is defined as ∑_x x Pr(X = x | Y = y) in the discrete case, and as ∫_ℝ x f(x | y) dx in the continuous case. One can show that the expected values exist when E|X| < ∞.
Example 37 A game consists of tossing a die. If the face value on the die is X, then a coin is tossed X times. Let Y be the number of heads. Then E(Y | X = x) = x/2.
Example 38 Suppose Y is discrete, with different values on the events A_1, A_2, …, A_n, which form a nondegenerate disjoint partition of the probability space Ω. Then


Example 39 Suppose that f(x,y) is the joint density with respect to the Lebesgue measure on ℝ² of the bivariate random variable (X,Y), and let f_Y(y) ≠ 0 be the (marginal) density of Y. Put f(x|y) = f(x,y)/f_Y(y). Then E{X | Y} = h(Y), where h(y) = ∫_{-∞}^{∞} x f(x|y) dx.
The total probability formula for conditional expectations is as follows.
EX = E(E{X | Y})    (44)
 (45) 
 (46) 
A similar question can be solved for N having a Poisson distribution.
Example 42 What is the distribution of a geometric sum of i. i. d. exponential r. v.?
Example 43 Stock market fluctuations can be modelled by Z = ξ_1 + … + ξ_N, where N, the number of transactions, is Poiss(λ) and the ξ_j are normal N(0,σ). There is no explicit formula for the density of Z, but there is one for the moment generating function. Thus the Chebyshev inequality gives bounds of the form Pr(Z > t) ≤ exp….
Theorem  gives a geometric interpretation of the conditional expectation E{· | Z}; for square integrable functions, E{· | Z} is just the orthogonal projection of the Banach (normed) space L_2 onto its closed subspace L_2(Z), consisting of all square integrable random variables of the form f(Z).
Theorem 7 For square integrable Y, the quadratic error E(Y - h(X))² among all square integrable functions h(X) is the smallest if h(x) = E(Y | X = x).
Theorem 2.14 implies that the linear approximation from Section  is usually less accurate. In Chapter  we shall see that the linear approximation is the best one can get in the all-important normal case. Even in the non-normal case, linear approximations offer quick solutions based on simple second-order statistics. In contrast, nonlinear approximations require elaborate numerical schemes to process the empirical data.
Suppose T represents the failure time of some device. If the device is working at time t, then the probability of surviving an additional s seconds is Pr(T > t+s | T > t). For a device that doesn't exhibit aging, this probability should be the same as for a brand new device.
Pr(T > t + s | T > t) = Pr(T > s)    (47)
Problem 19 Show that geometric T satisfies (47) for integer t,s.
Equation (47) implies Pr(T > t+s) = Pr(T > t)Pr(T > s), an equation that can be solved. Theorem 8 If T > 0 satisfies (47) for all t,s > 0, then T is exponential.
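A numeric illustration of the functional equation (a sketch; the particular values of λ, t, s are arbitrary): the exponential tail N(x) = e^{-λx} turns (47) into an identity.

```python
import math

def exponential_tail(x, lam):
    # Reliability (tail) function Pr(T > x) of the exponential distribution.
    return math.exp(-lam * x)

lam, t, s = 2.0, 1.3, 0.7
conditional = exponential_tail(t + s, lam) / exponential_tail(t, lam)
memoryless = exponential_tail(s, lam)
print(conditional, memoryless)  # the two numbers agree
```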
Proof. The tail distribution function N(x) = P(T > x) satisfies equation
N(t + s) = N(t) N(s)    (48)
Formula (48) implies that for all integers n and all x ≥ 0
N(nx) = (N(x))^n    (49)
This shows that for rational q > 0
N(q) = (N(1))^q    (50)
Let T > 0 be a continuous r. v. interpreted as the failure time of a certain device. If the device is in operational condition at time t, then the probability that it will fail immediately afterwards may be assumed negligible. The probability of failing within h units of time is Pr(T < t+h | T > t). The failure rate at time t is defined as
λ(t) = lim_{h→0} (1/h) Pr(T < t + h | T > t)    (51)
Example 44 If T is exponential then the failure rate is constant.
A family of failure rates that exhibit interesting aging patterns is provided by the family of power functions λ(t) = t^α.
Theorem 9 If T is continuous with failure rate λ(t) = t^α, where α > 0, then T has the Weibull density:
f(t) = t^α exp(-t^{α+1}/(α+1)), t > 0    (52)
Proof.
Rewrite the expression
C(n,k) (λ^k/n^k) (1 - λ/n)^{n-k} = (λ^k/k!) · (1 - λ/n)^n/(1 - λ/n)^k · ∏_{j=0}^{k-1} (1 - j/n). As n → ∞ with λ fixed, the right-hand side converges to e^{-λ} λ^k/k!. ∎
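The collapse of the binomial probabilities Bin(n, λ/n) onto the Poisson ones can be watched numerically (a Python sketch; n = 10000 is an arbitrary choice):

```python
import math

def binomial_pmf(n, p, k):
    # Pr(X = k) for X binomial Bin(n, p)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    # Pr(X = k) for X Poisson with mean lam
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam, n = 3.0, 10000
worst = max(abs(binomial_pmf(n, lam / n, k) - poisson_pmf(lam, k))
            for k in range(11))
print(worst)  # small, and it shrinks further as n grows
```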
Example 45 How many raisins should a cookie have on average so that no more than one cookie in a hundred has no raisins? So that no more than one cookie in a thousand has no raisins?
Example 46 A bag of chocolate chip cookies has 50 cookies. The manufacturer claims there are 1,000 chips in a bag. Is it likely to find a cookie with 15 or fewer chips in such a bag?
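Both cookie examples reduce to Poisson zero- and tail-probabilities; a sketch of the arithmetic (Python, for illustration; raisins and chips per cookie are modelled as Poisson counts):

```python
import math

def poisson_cdf(lam, k):
    # Pr(N <= k) for N Poisson(lam), summing the pmf terms iteratively.
    term = math.exp(-lam)
    total = term
    for j in range(1, k + 1):
        term *= lam / j
        total += term
    return total

# Example 45: Pr(no raisins) = exp(-lam) <= 1/100 forces lam >= ln 100.
lam_for_one_in_hundred = math.log(100)    # about 4.61 raisins per cookie
lam_for_one_in_thousand = math.log(1000)  # about 6.91

# Example 46: 1000 chips in 50 cookies gives lam = 20 chips per cookie.
p_cookie_with_at_most_15 = poisson_cdf(20.0, 15)
expected_such_cookies_per_bag = 50 * p_cookie_with_at_most_15
print(lam_for_one_in_hundred, p_cookie_with_at_most_15)
```

With λ = 20 the chance of 15 or fewer chips in a given cookie is roughly 0.16, so a 50-cookie bag is in fact quite likely to contain such a cookie.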
Problem 20 From the numbers 1, …, n, select k numbers at random, with repetitions possible. On average, how many of these numbers repeat (and hence should be thrown out)?
Problem 21 The performance of the algorithm for selecting a random permutation in the GetPermutation SUB of RANDTOUR.BAS can be estimated by analyzing the following ``worst-case" scenario.
When the algorithm attempts to select the last of the random numbers 1, …, n, then
Exercise 19. What is the probability that in a group of 45 people one can find two born on the same day of the year? (Compare Example 1.5.1.)
Problem 22 For a group of n persons, find the expected number of days of the year which are birthdays of exactly k people. (Assume 365 days in a year and that all birthdays are equally likely.)
Problem 23 A large number N of people are subject to a blood test. The test can be administered in one of the two ways:
(i) Each person can be tested separately (N tests are needed).
(ii) The blood samples of k people can be pooled (mixed) and analyzed. If the test is positive, each of the k people must be tested separately, and in all, k+1 tests are then required for the k people.
Assume the probability p that the test is positive is the same for all people and that the test results for different people are stochastically independent.
Exercise 20. Solve Exercise 1.2.2 on page pageref assuming that the arrival times are exponential rather than uniform. Assume independence.
Exercise 21. There are 3 stoplights spaced within 1 km of each other and operating asynchronously. (They are reset at midnight.) Assuming each is red for one minute and then green for one minute, what is the average time to pass through the three lights for a car that can instantaneously accelerate to 60 km/h?
This exercise can be developed into a simulation project that may address some of the following questions

Problem 25 Let X ³ 0 be a random variable and suppose that for every 0 < q < 1 there is T = T(q) such that

Problem 26 Suppose ξ, U are independent, Pr(ξ = 0) = Pr(ξ = 1) = 1/2, and U is uniform U(0,1). What is the distribution of U + ξ?
Properties of a sequence {a_{n}} are often reflected in properties of the generating function h(z) = å_{n}a_{n}z^{n}.
A moment generating function is nonnegative and convex (concave up). The typical example is the moment generating function of a {0,1}-valued random variable.
A linear transformation of X changes the moment generating function according to the following formula.
M_{aX+b}(t) = e^{bt} M_X(at)    (53)
Important properties of moment generating functions are proved in more theoretical probability courses^{15}.
Theorem 11 (i) The distribution of X is determined uniquely by its moment generating function M(t).
(ii) If X, Y are independent random variables, then M_{X+Y}(t) = M_{X}(t) M_{Y}(t) for all t Î IR.
(iii) M(0) = 1, M′(0) = EX, M″(0) = EX².
Name | Distribution | Moment generating function
Normal N(0,1) | f(x) = (1/√(2π)) exp(-x²/2) | M(t) = e^{t²/2}
Exponential | f(x) = λe^{-λx} | M(t) = λ/(λ - t)
Uniform U(-1,1) | f(x) = 1/2 for -1 ≤ x ≤ 1 | M(t) = (1/t) sinh t
Gamma | f(x) = (1/(Γ(α)β^α)) x^{α-1} e^{-x/β} | M(t) = (1 - βt)^{-α}
Binomial | Pr(X = k) = C(n,k) p^k (1-p)^{n-k} | M(t) = (1 - p + pe^t)^n
Poisson | Pr(X = k) = e^{-λ} λ^k/k! | M(t) = exp(λ(e^t - 1))
Geometric | Pr(X = k) = p(1-p)^{k-1} | M(t) = pe^t/(1 - (1-p)e^t)
Problem 27 Find moment generating functions for each of the entries in Table 3.1.
Problem 28 Use moment generating functions to compute EX, Var(X) for each of the entries in Table 2.4.
Problem 29 Prove the summation formulas stated in Theorems 36 and 37
For a d-dimensional random variable X = (X_1, …, X_d) the moment generating function M_X : ℝ^d → ℝ is defined by M_X(t) = E exp(t·X), where the dot denotes the dot (scalar) product, i.e. x·y = ∑ x_k y_k. For a pair of real valued random variables X, Y, we also write M(t,s) = M_{(X,Y)}((t,s)) and we call M(t,s) the joint moment generating function of X and Y.
The following is the multidimensional version of Theorem 3.2.
Theorem 12 (i) The distribution of X is determined uniquely by its moment generating function M(t).
(ii) If X, Y are independent ℝ^d-valued random variables, then

The major nuisance in using moment generating functions is that they may not exist: the defining integral may diverge.
For this reason it is preferable to use an expression that is always bounded, and yet has the same convenient algebraic properties. The natural candidate is the characteristic function

 φ(t) = E e^{itX}.  (54)
For symmetric random variables complex numbers can be avoided at the expense of trigonometric identities.
Example 47 If X is ±1-valued with Pr(X = ±1) = 1/2, then φ(t) = cos t.
Example 48 The characteristic function of the normal N(0,1) distribution is φ(t) = e^{-t²/2}.

(For a = 6, Pr(S_n = k) is the probability of scoring the sum k+n in a throw of n dice. The solution of this problem is due to de Moivre.)
Problem 31 Suppose the probability p_n that a family has exactly n children is αp^n when n ≥ 1, and suppose p_0 = 1 - αp/(1-p). (Notice that this is a constraint on the admissible values of α, p, since p_0 ≥ 0.)
Suppose that all distributions of the sexes for n children are equally likely.
Find the probability that a family has exactly k girls.
Hint: The answer appears at first as an infinite series. To find its sum, use a generating function, or the negative binomial expansion
(1-x)^{-(k+1)} = 1 + (k+1)x/1! + (k+1)(k+2)x²/2! + ….
ANS: 2αp^k/(2-p)^{k+1}.
Problem 32 Show that if X ≥ 0 is a random variable such that

Problem 33 Show that if E exp(λX²) = C < ∞ for some λ > 0, then

Problem 34 Prove that the function f(t) := E max{X, t} determines uniquely the distribution of an integrable random variable X in each of the following cases:
Problem 35 Let p > 2. Show that exp(t^{p}) is not a moment generating function.
Next to a stream in a forest you see a small tree with tiny, bell-shaped, white flowers in drooping clusters.
The Audubon Society Field Guide to North American Trees.
The predominance of the normal distribution is often explained by the Central Limit Theorem. This theorem asserts that under fairly general conditions the distribution of the sum of many independent components is approximately normal. In this chapter we give another reason why the normal distribution might occur. The usual definition of the standard normal variable Z specifies its density f(x) = (1/√(2π))e^{-x²/2}, see Figure 2.1. In general, the so-called N(μ, σ) density is given by

 f(x) = (1/(σ√(2π))) exp(-(x-μ)²/(2σ²)).

In the multivariate case it is more convenient to use moment generating functions directly. For consistency we shall therefore adopt the following definition.
Definition 13 A real-valued random variable X has the normal N(μ, σ) distribution if its moment generating function has the form

 M(t) = exp(tμ + t²σ²/2).
From Theorem 3.2 one can check, by taking derivatives, that μ = EX and σ² = Var(X). Using (53) it is easy to see that every univariate normal X can be written as

 X = μ + σZ,  (55)

where Z is standard normal N(0,1).
A typical exercise is to solve a question like the following with the help of tabulated values of the cumulative distribution function of the standard normal Z: if X is normal with given mean μ = 123 and given variance σ² = 456, for what value of a do we have Pr(X > a) = .8?
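The exercise above can also be checked numerically. A sketch in Python (the course's own programs are in BASIC; Python's standard-library NormalDist simply replaces the table lookup), with the numbers 123, 456, .8 taken from the exercise:

```python
from math import sqrt
from statistics import NormalDist

# X is normal with mean 123 and variance 456.
# Pr(X > a) = 0.8 means Pr(X <= a) = 0.2, so a is the 20th percentile.
X = NormalDist(mu=123, sigma=sqrt(456))
a = X.inv_cdf(0.2)
print(a)  # roughly 105, below the mean since Pr(X > a) must exceed 1/2
```

Note that a < μ, as it must be whenever the required probability exceeds 1/2.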
Exercise 22 (Basketball coach's problem) Collect data on the heights of 25 to 50 randomly selected males. Make a histogram of the data, and compute the empirical mean and standard deviation.
``Suppose a ball is dropped from a given height, with the intention that it shall fall on a given mark. Fall as it may, its deviation from the mark is error, and the probability of that error is the unknown function of its square, ie. of the sum of the squares of its deviations in any two rectangular directions. Now, the probability of any deviation depending solely on its magnitude, and not on its direction, it follows that the probability of each of these rectangular deviations must be the same function of its square. And since the observed oblique deviation is equivalent to the two rectangular ones, supposed concurrent, and which are essentially independent of one another, and is, therefore, a compound event of which they are the simple independent constituents, therefore its probability will be the product of their separate probabilities. Thus the form of our unknown function comes to be determined from this condition...''
Ten years after Herschel, the reasoning was repeated by J. C. Maxwell^{18}. The fact that velocities are normally distributed is sometimes called Maxwell's theorem.
The beauty of the reasoning lies in the fact that the interplay of two very natural assumptions, independence and rotation invariance, gives rise to the normal law of errors, the most important distribution in statistics.
Theorem 13 Suppose random variables X, Y have joint probability distribution μ(dx, dy) such that
(i) μ(·) is invariant under the rotations of ℝ²;
(ii) X, Y are independent.
Then X, Y are normal.
The following technical lemma asserts that the moment generating function exists.
Lemma 1 If X, Y are independent and X+Y, X-Y are independent, then E exp(aX) < ∞ for all a.
Proof. Consider the real function N(x) := P(X ≥ x). We shall show that there is x_0 such that
 N(2x) ≤ 8 N(x - x_0)².  (56)
Let X_1, X_2 be independent copies of X. Inequality (56) follows from the fact that the event {X_1 ≥ 2x} implies that either the event {X_1 ≥ 2x} ∩ {X_2 ≥ 2x_0}, or the event {X_1+X_2 ≥ 2(x - x_0)} ∩ {|X_1 - X_2| ≥ 2(x - x_0)} occurs.
Indeed, let x_0 be such that P(X_2 ≥ 2x_0) ≤ 1/2. If X_1 ≥ 2x and X_2 < 2x_0 then X_1 ± X_2 ≥ X_1 - |X_2| ≥ 2(x - x_0). Therefore, using independence and the trivial bound P(X_1+X_2 ≥ 2a) ≤ P(X_1 ≥ a) + P(X_2 ≥ a), we obtain



 (57) 
 (58) 

Therefore (take u = v) the second derivative Q″(u) = Q″(0) = const ≥ 0.
This means M(u) = exp(au + bu²).
□
If EX = EY = 0, the moment generating function M(t,s) = E e^{tX+sY} is given by M(t,s) = exp((1/2)(σ_1²t² + 2ρσ_1σ_2 ts + σ_2²s²)).
Clearly σ_1² = Var(X), σ_2² = Var(Y), and ρ is the correlation coefficient, so that Cov(X,Y) = ρσ_1σ_2.
When ρ ≠ ±1 the joint density of X, Y exists and is given by

 f(x,y) = (1/(2πσ_1σ_2√(1-ρ²))) exp(-(x²/σ_1² - 2ρxy/(σ_1σ_2) + y²/σ_2²)/(2(1-ρ²))).  (59)
Example 49 If X, Y are jointly normal with correlation coefficient ρ ≠ ±1 then the conditional distribution of Y given X is normal.
Problem 37 If X, Y are independent normal N(0,1), find the density of X²+Y². (Hint: compute the cumulative distribution function, integrating in polar coordinates.)
Problem 38 For jointly normal X, Y show that E(Y|X) = aX+b is linear.
Problem 39 If X, Y are jointly normal then Y - ρXσ_Y/σ_X and X are independent.
Problem 40 If X, Y are jointly normal with variances σ_X², σ_Y² and correlation coefficient ρ, then X = σ_X(γ_1 cos θ + γ_2 sin θ), Y = σ_Y(γ_1 sin θ + γ_2 cos θ), where γ_j are independent N(0,1) and sin 2θ = ρ.
This is a short chapter on the asymptotic behavior of sums and averages of independent observations. The Law of Large Numbers justifies simulations as a means of computing probabilities and expected values. The Central Limit Theorem provides error estimates.
Definition 15 X_n → X in probability, if Pr(|X_n - X| > ε) → 0 for all ε > 0.
Example 50 Let X_n be N(0, σ = 1/n). Then X_n → 0 in probability.
Definition 16 X_n → X almost surely, if Pr(X_n → X) = 1.
Example 51 Let U be the uniform U(0,1) r. v. Then (1/n)U → 0 almost surely.
Example 52 Let U be the uniform U(0,1) r. v. and let X_n = 1 if U ≤ 1/n, and X_n = 0 otherwise. Then X_n → 0 almost surely.
Definition 17 X_n → X in L_2 (mean square), if E|X_n - X|² → 0 as n → ∞.
Remark 2 If X_{n}® X in L_{2}, then by Chebyshev's inequality X_{n}® X in probability.
If X_{n}® X almost surely, then X_{n}® X in probability.
Theorem 14 Suppose X_k are such that EX_k = m, Var(X_k) = σ², and Cov(X_i, X_j) = 0 for all i ≠ j. Then (1/n)∑_{j=1}^n X_j → m in L_2.
Proof. To show that (1/n)∑_{j=1}^n X_j → m in mean square, compute the variance: by uncorrelatedness, Var((1/n)∑_{j=1}^n X_j) = σ²/n → 0.
□
Corollary If X_n is Binomial Bin(n,p) then (1/n)X_n → p in probability.
The following method can be used to prove the strong law of large numbers for Binomial r. v.: show that
∑_n Pr(|X_n/n - p| > ε) < ∞.
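The law of large numbers is easy to see in a simulation. Here is a sketch in Python (the text's own programs use BASIC; any language works, per the course policy): we toss n coins and watch the proportion of heads settle near p.

```python
import random

def binomial_proportion(n, p, rng):
    """Simulate X ~ Bin(n, p) by n coin tosses and return X/n."""
    x = sum(1 for _ in range(n) if rng.random() < p)
    return x / n

rng = random.Random(1)  # fixed seed for reproducibility
for n in (10, 100, 10000):
    print(n, binomial_proportion(n, 0.5, rng))
# the proportions approach p = 0.5 as n grows
```

Running the loop several times with different seeds shows the fluctuations shrinking on the order of 1/√n, in agreement with the error estimates of the next section.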
In addition to the types of convergence introduced in the preceding section, we also have convergence in distribution.
Definition 18 X_{n} converges to X in distribution, if Ef(X_{n})® Ef(X) for all bounded continuous functions f.
Theorem 15 If X_n → X in distribution, then Pr(X_n ∈ (a,b)) → Pr(X ∈ (a,b)) for all a < b such that Pr(X = a) = Pr(X = b) = 0.
Theorem 16 If M_{X_n}(u) → M_X(u) for all u, then X_n → X in distribution.
Theorem 2.17 states convergence in distribution to the Poisson limit. Here is a proof that uses moment generating functions.
Proof of Theorem 2.17. With np_n → λ,
M_{X_n}(u) = (1 + p_n(e^u - 1))^n → e^{λ(e^u-1)}.
□
We state first the normal approximation to the binomial distribution.
Theorem 17 If X_n is Binomial Bin(n,p) and 0 < p < 1 is constant, then (X_n - np)/√(npq) → Z in distribution, where Z is a N(0,1) random variable.
Proof. For p = 1/2 only. The moment generating function of (X_n - np)/√(npq) = (2X_n - n)/√n is
M_n(u) = e^{-√n u}(1/2 + (1/2)e^{2u/√n})^n = ((e^{u/√n} + e^{-u/√n})/2)^n = cosh^n(u/√n) → e^{u²/2}.
(Can you justify the limit?)
□
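The limit in the proof above can also be checked numerically. A sketch in Python (the course's own programs are in BASIC; Python is used here only for illustration):

```python
import math

def mgf_of_standardized_binomial(u, n):
    """For p = 1/2, the moment generating function from the proof:
    M_n(u) = cosh(u / sqrt(n)) ** n."""
    return math.cosh(u / math.sqrt(n)) ** n

u = 1.0
for n in (10, 1000, 100000):
    print(n, mgf_of_standardized_binomial(u, n))
print(math.exp(u * u / 2))  # the claimed limit e^{u^2/2}
```

The printed values visibly approach e^{1/2} ≈ 1.6487, and the same happens for any fixed u.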
Limit theorems are illustrated by the program LIMTHS.BAS. This program graphs the empirical proportions p̂_t as a (random) function of t ≤ n, makes a histogram, and compares it with the Normal and Poisson histograms. By trying various path lengths n, one can see the almost sure Law of Large Numbers, and for moderate n see the normal approximation.
Problem 41 The graph of p̂_n as a function of n, as given by LIMTHS.BAS, suggests a pair of ``curves" between which the averages are ``squeezed". What are the equations of these curves?
Exercise 23 Suppose a poll of size n is to be taken, and the actual proportion of the voters supporting an issue in question is p = 1/2. Determine the size n such that the observed proportion p̂ = (1/n)X satisfies Pr(p̂ > .8) ≤ .01.
Exercise 24 Plot the histogram for 100 independent Binomial Bin(n = 100, p = .5) random variables.
Theorem 18 If X_j are i.i.d. with EX = μ, Var(X) = σ², then (∑_{j=1}^n X_j - nμ)/(σ√n) → Z.
Proof. For μ = 0, σ = 1 only.
The moment generating function is
M_n(u) = (M_1(u/√n))^n = (1 + (u/√n)M′(0) + (u²/2n)M″(0) + O(n^{-3/2}))^n → e^{u²/2}. □
A lesser known aspect of the central limit theorem is that one can actually simulate paths of certain continuous time processes by taking X_n(t) = (1/√n)∑_{k ≤ nt} ξ_k, where ξ_k are independent mean-zero r. v.
The following program uses this to simulate random curves. The program requires a graphics card on the PC.
PROGRAM firewalk.bas

'This program simulates random walk paths (with uniform increments)
'reflected at the boundaries of a region
'
'declarations of subs
DECLARE SUB CenterPrint (Text$)
' minimal error handling - graphics card is required
ON ERROR GOTO ErrTrap
CLS
'request good graphics (some cards support SCREEN 7, etc)
SCREEN 9
LOCATE 1, 1 'title
CenterPrint "Path of reflected random walk"
LOCATE 9, 1 'timer location
PRINT "Timer"
scale = 10
'
WINDOW (0, 0)-(scale, scale): VIEW (150, 100)-(600, 300)
LINE (0, 0)-(scale, scale), 10, B ': LINE (scale, scale)-(2 * scale, 0), 11, B
FOR j = -4 TO 4
  LINE (0, scale / 2 + j)-(scale / 50, scale / 2 + j), 2
NEXT j
X = scale / 2
Y = scale / 2
dead = 0
T = 0
col = 14 * RND(1)
speed = scale / 100
WHILE INKEY$ = "" 'infinite loop until a key is pressed
  T = T + 1
  X0 = X
  Y0 = Y
  X = X + (RND(1) - 1 / 2) * speed
  Y = Y + (RND(1) - 1 / 2) * speed
  IF X < 0 THEN X = -X: col = 14 * RND(1)
  IF Y < 0 THEN Y = -Y: col = 14 * RND(1)
  IF X > scale THEN X = 2 * scale - X: col = 14 * RND(1)
  IF Y > scale THEN Y = 2 * scale - Y: col = 14 * RND(1)
  IF X > 2 * scale THEN X = 4 * scale - X: col = 14 * RND(1)
  LINE (X0, Y0)-(X, Y), col
  LOCATE 10, 1
  PRINT T
WEND
END

ErrTrap: 'if there are errors then quit
CLS
PRINT "This error requires graphics card VGA"
PRINT "Error running program"
PRINT "Press any key ..."
WHILE INKEY$ = ""
WEND
END

SUB CenterPrint (Text$)
' Print text centered in 80 column screen
offset = 41 - LEN(Text$) \ 2
IF offset < 1 THEN offset = 1
LOCATE , offset
PRINT Text$
END SUB
One role of limit theorems is to justify simulations and provide error estimates. The simulation presented earlier uses the Central Limit Theorem to print out the so-called 95% confidence interval.
The central limit theorem is also the basis for a fast simulation of the normal distribution, see Section .
We begin with a bound for the binomial distribution.
Recall that the relative entropy is H(q|p) = q ln(q/p) + (1-q) ln((1-q)/(1-p)).
Theorem 19 Suppose X_n is Binomial Bin(n,p). Then for p′ ≥ p

 Pr(X_n ≥ np′) ≤ e^{-nH(p′|p)}.  (60)
Proof. Use Chebyshev's inequality (34) and the moment generating function. For u ≥ 0,
Pr(X_n ≥ np′) = Pr(uX_n ≥ unp′) ≤ e^{-np′u} E e^{uX_n} = (e^{-p′u}(1-p+pe^u))^n = ((1-p)e^{-p′u} + pe^{(1-p′)u})^n.
Now the calculus question: for what u ≥ 0 does the function
f(u) = (1-p)e^{-p′u} + pe^{(1-p′)u} attain its minimum? The answer is u = 0 if p′ ≤ p, or
u = ln(((1-p)/p)(p′/(1-p′))). Therefore the minimal value is
f(u_min) = exp(-p′ ln(p′/p) - (1-p′) ln((1-p′)/(1-p))) = e^{-H(p′|p)}.
□
Corollary 2 If p = 1/2 then

 Pr(X_n ≥ n(1+x)/2) ≤ e^{-nx²/2}.  (61)
Proof. This follows from the inequality
(1+x)ln(1+x) + (1-x)ln(1-x) ≥ x².
□
Suppose X_j are i.i.d. Conditional limit theorems describe the conditional distribution of X_1, given the value of the empirical average (1/n)∑_{j=1}^n h(X_j). Such probabilities are difficult to simulate when (1/n)∑_{j=1}^n h(X_j) differs significantly from Eh(X).
This chapter collects information about how to simulate special distributions.
Considerations of machine efficiency, in particular the convenience of doing fixed precision arithmetic, make the uniform U(0,1) distribution the fundamental building block of simulations. We do not consider in much detail how such sequences are produced; efficient methods are hardware-dependent. We also assume these are i. i. d., even though the typical random number generator returns the same pseudorandom sequence for each value of the seed, and often the sequence is periodic and correlated. This again is a question of speed and hardware only.
The classical Peano curve actually maps the unit interval onto the unit square preserving the Lebesgue measure. Thus two independent U(0,1) are as ``random" as one!
Experiments with discrete outcomes aren't necessarily less random than continuous models. Expansion into binary fractions connects infinite tosses of a coin with a single uniform r. v.
Example 53 Let U be uniform U(0,2π). The random variables X_k = sign(sin(2^k U)) are i. i. d. symmetric.
Theorem 20 If ξ_j are independent identically distributed discrete random variables with values {0,1} and Pr(ξ = 1) = 1/2, then ∑_{k=1}^∞ ξ_k/2^k is uniform U(0,1).
Proof. We show by induction that if U is a uniform U(0,1) r. v. independent of {ξ_j}, then U/2^n + ∑_{k=1}^n ξ_k/2^k is uniform for all n ≥ 0. For the induction step, notice that in distribution
U/2^n + ∑_{k=1}^n ξ_k/2^k ≅ (1/2)ξ_1 + (1/2)U. This reduces the proof to the n = 1 case.
The rest of the proof is essentially the solution of Problem 2.18. Clearly Pr((1/2)ξ_1 + (1/2)U < x) = Pr(ξ_1 = 0)Pr(U/2 < x) + Pr(ξ_1 = 1)Pr(1/2 + U/2 < x). The expression Pr(U/2 < x) is 0, 2x, or 1. The expression Pr(1/2 + U/2 < x) is 0, 2x-1, or 1. Their average (check carefully the ranges of x!) is
 Pr((1/2)ξ_1 + (1/2)U < x) = 0 for x ≤ 0, x for 0 < x ≤ 1, and 1 for x > 1,

i.e. (1/2)ξ_1 + (1/2)U is uniform U(0,1).
□
The base 2 does not play any special role. The same fact holds true in any number system: if N > 1 is fixed, and U = ∑ X_j N^{-j} is expanded in base N, then the X_j are independent uniform {0,1,…,N-1}-valued discrete random variables.
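The digit expansion is easy to compute. A sketch in Python (an illustration only; the course's own programs are in BASIC): extract the base-N digits of u by repeatedly multiplying by N and splitting off the integer part.

```python
def digits(u, base=2, count=20):
    """Extract the first `count` base-`base` digits of u in (0,1)
    via the recurrence u <- base*u mod 1; the digit is the integer part."""
    out = []
    for _ in range(count):
        u *= base
        d = int(u)   # next digit
        u -= d       # keep the fractional part
        out.append(d)
    return out

# 0.625 = 0.101 in binary
print(digits(0.625, base=2, count=5))  # -> [1, 0, 1, 0, 0]
```

For a "random" starting value u the successive digits behave like independent fair coin tosses, which is exactly the content of Theorem 20.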
From the mathematical point of view all of the simulations can be based on a single U(0,1) random variable. In particular, to generate independent uniform integers^{19} in the prescribed range 0, …, N-1 we need only pick u ∈ (0,1) and define X_j = ⌊N^j u⌋ mod N.
Clearly X_j is the integer part of NU_{j-1}, where the U_j solve the recurrence

 U_{j+1} = N U_j mod 1, U_0 = u.  (62)
Many programs provide access to uniform random numbers. These, however, might be platform-dependent, and are often of ``low quality". Often a person performing a simulation may want to use code they have explicit access to. What is ``random enough" for one application may not be random enough for another.
There is good evidence, both theoretical (see (62)) and empirical, that the simple multiplicative congruential algorithm

 n_{j+1} = a n_j mod N  (63)

produces reasonably good pseudorandom numbers for suitable choices of a and N.
The linear congruential method has the advantage of being very fast. It has the disadvantage that it is not free of sequential correlations between successive outputs. This shows up clearly in the fact that consecutive k-tuples lie in at most about N^{1/k} hyperplanes of dimension k-1.
Many system-supplied random number generators are linear congruential generators, which generate a sequence of integers n_1, n_2, …, each between 0 and N-1, by the recurrence relation

 n_{j+1} = (a n_j + c) mod N.  (64)
This is rather easy to convert into a computer algorithm.
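For illustration, here is a sketch of the linear congruential recurrence in Python. The constants a, c, N below are the widely quoted ANSI-C-style choices; they are an assumption for the example, not a recommendation made by this text.

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Generate the linear congruential sequence n_{j+1} = (a*n_j + c) mod m."""
    n = seed
    while True:
        n = (a * n + c) % m
        yield n

gen = lcg(seed=1)
print([next(gen) for _ in range(3)])  # raw integers in the range 0..m-1
# uniform numbers in (0,1) are obtained by dividing each output by m
```

The same seed always reproduces the same sequence, which is convenient for debugging but is one of the ``low quality" aspects mentioned above.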
An exact method to simulate the Poisson distribution is based on the fact that it occurs in the Poisson process, and that sojourn times in the Poisson process are exponential (see Theorem on page ). Therefore, to simulate X with the Poiss(λ) distribution, simulate independent exponential r. v. T_1, T_2, …, with parameter 1 and put X = n-1, where n is the first value such that T_1+…+T_n > λ.
A reasonable approximation to a Poisson Poiss(λ) random variable X is obtained by simulating a binomial Bin(n,p) random variable X′ with λ = np. Since X′ ≤ n, use n large enough to exceed any realistic values of X. Run the program LIMTHS.BAS to compare the histograms: why is the Poisson distribution more spread out than the binomial?
Problem 42 Given a cumulative distribution function G(x) with inverse H(), and density g(x) = G′(x), let U_1, U_2 be two independent uniform U(0,1) random variables. Show that X = H(U_1), Y = U_2 g(X) are uniformly distributed in the region {(x,y): 0 < y < g(x)}.
If the conditional density of Y given X = x is f(y|x), and the density of X is g(x), then the density of Y is ∫f(y|x)g(x) dx.
Example 54 Suppose Y has density c y e^{-y} = C∑_{n=0}^∞ y(1-y)^n/n!. Let Pr(X = n) = c/n!. Then the conditional density is f(y|X = n) = C y(1-y)^n, which is the distribution of a median from the sample of size 2n.

The exact simulation of the normal distribution uses the fact that an independent pair X_1, X_2 of normal N(0,1) random variables can be written as

 X_1 = R cos Θ,  (65)
 X_2 = R sin Θ.  (66)

We simulate two independent normal N(0,1) r. v. from two independent uniform r. v. U_1, U_2 by taking Θ = 2πU_1, R = √(-2 ln U_2) and using formulas (65)-(66).
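In Python this exact (Box-Muller) method reads as follows; a sketch for illustration, since the course's own programs are in BASIC:

```python
import math
import random

def two_normals(rng):
    """Two independent N(0,1) from two independent U(0,1):
    Theta = 2*pi*U1, R = sqrt(-2*ln U2), return (R cos Theta, R sin Theta)."""
    u1, u2 = rng.random(), rng.random()
    theta = 2 * math.pi * u1
    r = math.sqrt(-2 * math.log(u2))
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(3)
xs = [x for _ in range(5000) for x in two_normals(rng)]
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs)
print(mean, var)  # near 0 and 1
```

Each call consumes two uniforms and produces two independent normals, so no randomness is wasted.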
The name comes from the technique suggested by Problem 6.3.1. Instead of selecting points under the graph of f(x), we pick another function g(x) > f(x) which has a known antiderivative with an explicitly available inverse. We pick points under the graph of g(x), and ``reject" those that didn't make it below the graph of f(x).
The often used dominating function g(x) = c/(1+x²) leads to X = H(U_1), where H(u) = tan(πu/c).
Rejection sampling can be used to simulate continuous or discrete distributions. The idea behind using it in the discrete case is to convert the discrete distribution to a function of a continuous random variable. For example, to use rejection sampling for the Poisson distribution, simulate the density f(x) = e^{-λ}λ^{[x]}/[x]! and take the integer part [X] of the resulting random variable.
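As a concrete sketch (the choice of target and envelope here is an illustration, not the text's): sample the standard normal density f by rejection from the Cauchy density g(x) = 1/(π(1+x²)), with an envelope constant M chosen so that f ≤ M g everywhere.

```python
import math
import random

def normal_by_rejection(rng, M=1.6):
    """Rejection sampling: propose X ~ Cauchy via X = tan(pi*(U - 1/2)),
    accept with probability f(X)/(M*g(X)).  M = 1.6 exceeds sup f/g ~ 1.52."""
    while True:
        x = math.tan(math.pi * (rng.random() - 0.5))       # Cauchy proposal
        f = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # normal density
        g = 1.0 / (math.pi * (1.0 + x * x))                # Cauchy density
        if rng.random() < f / (M * g):
            return x

rng = random.Random(4)
xs = [normal_by_rejection(rng) for _ in range(5000)]
m = sum(xs) / len(xs)
v = sum(x * x for x in xs) / len(xs)
print(m, v)  # near 0 and 1
```

The acceptance rate is 1/M, so a smaller envelope constant gives a faster sampler.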
SUB RandomSubset (S())
n = UBOUND(S) ' read out the size of array
FOR j = 1 TO n
'select entries at random: S(j) = 1 marks j as a member of the subset
S(j) = INT(RND(1) + .5)
NEXT j
END SUB
For subsets of low dimension, a sequence of 0,1 can be identified with binary numbers. Set operations are then easily converted to binary operations on integers/long integers.
Suppose we want to rearrange elements a(1),..., a(n) into a random order. A quick method to accomplish this goal is to pick one element at a time, and set it aside. This can be easily implemented within the same sequence.
SUB Rearrange (A())
n = UBOUND(A) ' read out the size of array
ns = n 'initial size of randomization
FOR j = 1 TO n
'take a card at random and put it away
k = INT(RND(1) * ns + 1)
SWAP A(k), A(ns)
ns = ns - 1 'select from remaining a(j)
NEXT j
END SUB
Problem 43 Let a_1, …, a_n be numbers such that ∑_j a_j = 0, ∑_j a_j² = 1. Let X denote the sum of the first half of these numbers, after a random rearrangement.
Program listing on page pageref explains how to perform double integration by simulation. Here we concentrate on refining the procedure for single integrals by reducing the variance.
To evaluate J = ∫_a^b f(x) dx we may use the approximation (1/N)∑_{j=1}^N f(X_j)/g(X_j), where X_j are i.i.d. with a density g(x) such that ∫_a^b g(x) dx = 1.
The error, as measured by the variance^{21}, is

For smooth functions, a better approximation is obtained by selecting points more uniformly than the pure random choice. The so-called Sobol sequences are based on sophisticated mathematics. Cookbook prescriptions can be found e.g. in Numerical Recipes.
Note: The reliability of the Monte Carlo method, and the associated error bounds, depends on the quality of the random number generator. It is rather unwise to use an unknown random number generator in questions that require a large number of randomizations.
Example 55 The following integral is a challenge to any numerical integration method, due to rapid oscillations: ∫_0^1 2 sin²(1000x) dx = 1 - (1/2000) sin 2000 ≈ 0.99954.
The idea of stratified sampling is to select different numbers of points from different subregions. As a simple example, suppose we want to integrate a smooth function over the interval [0,1] using n points. Instead of following the standard Monte Carlo prescription, we can divide [0,1] into k non-overlapping segments and choose n_j points from the j-th subinterval I_j. An extreme case is to take k = n and n_j = 1; this becomes a variant of the trapezoidal method. The optimal choice of n_j is to select them proportional to the local standard deviation of the usual Monte Carlo estimate of ∫_{I_j} f(x) dx. Indeed, denoting by F_j the estimator of the integral over I_j, the variance of the answer is Var(∑_{j=1}^k F_j) = ∑_j Var(F_j) = ∑_j σ_j²/n_j. The minimum under the constraint ∑_j n_j = n is attained at n_j ∝ σ_j.
A simple variant of recursive stratified sampling is to generate points and subdivisions based on estimated values of the variances.
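The simplest (equal-allocation) stratified estimate can be sketched in Python; the integrand f(x) = x² below is an example chosen for this sketch, not taken from the text.

```python
import random

def stratified_integral(f, n, k, rng):
    """Estimate the integral of f over [0,1] using k equal strata
    with n/k uniform points in each stratum."""
    per = n // k
    total = 0.0
    for j in range(k):
        lo = j / k                                        # left end of stratum j
        s = sum(f(lo + rng.random() / k) for _ in range(per))
        total += (s / per) / k                            # width 1/k times stratum average
    return total

rng = random.Random(6)
print(stratified_integral(lambda x: x * x, n=10000, k=100, rng=rng))
# the exact answer is 1/3
```

For a smooth integrand the stratified error is far smaller than the plain Monte Carlo error with the same total number of points, because each stratum only sees the small local variation of f.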
Pr(X_{1}+...+X_{n} > an) = ò_{y1+...+yn > an}e^{ly1}... e^{lyn}f(y_{1})... f(y_{n}) dy_{1}... dy_{n}. This leads to the following procedure.
Example 56 Suppose X_j are {0,1}-valued, so that X_1+…+X_n is binomial Bin(n,p). What is the distribution of Y_j? Simplify the expression e^{λY_1}···e^{λY_n}.
Exercise 26 Write a program computing by simulation the probability that in n = 10 tosses of a fair coin at least 8 heads occur. Once you have a program that does for n = 10 a comparable job to the ``naive" program below, try Exercise 1.5.
Here is a naive program that does the job for n = 10, but not for n = 100. It should be used to test the more complex ``tilted density" simulator.
'Simulating N fair coins
' declarations
DECLARE FUNCTION NumHeads% (p!, n%)
DEFINT I-N ' declare integer variables
'prepare screen
CLS
PRINT "Simulating toss of n fair coins"
'get users input
n = 10 'number of trials
INPUT "Number of coins n=", n
pr = .5 ' fairness of a coin
frac = .8 ' percentage of heads sought
' get the frac from the user
PRINT " How often we get more than f heads? (where 0<f<"; n; ")"
INPUT "f=", frac
IF frac >= 1 THEN frac = frac / n 'rescale if too large
'tell user what is going on
PRINT "Hit any key to see the final answer"
LOCATE 20, 10
PRINT "With some patience you may see digits stabilize"
DO 'main loop
T = T + 1
IF NumHeads(pr, n) > frac * n THEN s = s + 1
IF INKEY$ > "" THEN EXIT DO
LOCATE 10, 10
PRINT "Trial #"; T
PRINT "Current estimate"; USING "##.#####"; s / T
LOOP
'print the answer
PRINT
PRINT "Prob of more than "; INT(frac * n); " heads in "; n; " trials is about "; s / T
END

DEFINT A-H, O-Z
FUNCTION NumHeads (p!, n)
'simulate a run of n coins
h = 0
FOR k = 1 TO n
IF RND(1) < p! THEN h = h + 1
NEXT k
NumHeads = h
END FUNCTION
stochastic, a. conjectural; able to conjecture
Webster's New Universal Unabridged Dictionary
Stochastic processes model evolution of systems that either exhibit inherent randomness, or operate in an unpredictable environment. This unpredictability may have more than one form, see Section .
Probability provides models for analyzing random or unpredictable outcomes. The main new ingredient in stochastic processes is the explicit role of time. A stochastic process is described by its position X(t) at time t ∈ [0,1], t ∈ [0,∞), or t ∈ {0,1,…}.
From the conceptual point of view, stochastic processes that use discrete moments of time t ∈ {0,1,…} are the simplest. Since the discrete moments of time can represent arbitrarily small time increments, discrete time models are rich enough to model real-world phenomena. Continuous time versions are convenient mathematical idealizations.
Mathematical models of time evolution of deterministic systems often involve differential equations.
Difference equations are discrete analogues of the differential equations. Difference equations occur in applied problems, and also when solving differential equations by series expansions, or by Euler's method. The concepts of numerical versus analytical solution, initial values, boundary values, linearity, and superposition of solutions occur in the discrete setup in complete analogy with the theory of differential equations.
A difference equation determines an unknown sequence (y_{n}) through a
relation that specifies the pattern. The
relation often has the form of an equation that ties the
consecutive values of the sequence. The technical
term for such an equation is recurrence.
We will consider only special classes of equations

Suppose a sequence y_n is to satisfy

 y_{n+1} = cos(y_n).  (67)

It is easy to write a short program that computes the values of y_n. But to compute the actual sequence we need to specify the initial value y_0. Table 7.1 gives sample outputs of such a program for several choices of y_0.
y_0  y_1  y_2  y_3  y_4  y_5  y_6  y_7  y_8
-1  .5403023  .8575532  .6542898  .7934803  .7013688  .7639596  .7221025  .7504177
-.5  .8775826  .6390125  .8026851  .694778  .7681958  .7191654  .7523558  .7300811
0  1  .5403023  .8575532  .6542898  .7934803  .7013688  .7639596  .7221025
.5  .8775826  .6390125  .8026851  .694778  .7681958  .7191654  .7523558  .7300811
1  .5403023  .8575532  .6542898  .7934803  .7013688  .7639596  .7221025  .7504177
1.5  .0707372  .9974992  .542405  .8564697  .6551088  .7929816  .7017242  .7637303
2  -.4161468  .9146533  .6100653  .8196106  .6825058  .7759946  .7137247  .7559287
Similar numerical procedures show up in differential equations, where they are used to approximate continuous solutions by discretizing time. The procedure is called the Euler method, or the tangent line method.
 y_{n+1} = y_n + c  (68)

On the other hand, we may notice that equation (68) defines an arithmetic progression. Instead of a table like Table 7.1, we can write down the solution for all possible values of y_0 and for all n. Namely,

 y_n = y_0 + cn.  (69)
Example 58 Here is another well known difference equation with an obvious solution. Suppose

 y_{n+1} = a y_n.  (70)

Then

 y_n = a^n y_0.  (71)
The examples above are deceptively simple. Not every difference equation has a simple or easy to guess answer.
 (72) 

Many interesting difference equations, including (68), (70), (72), fall into the category of linear difference equations with constant coefficients. A first order linear difference equation with constant coefficients has the form

 y_{n+1} = a y_n + b.
A method to solve a difference equation of order 2 consists of the following steps:
Suppose you borrow y_0 dollars at a fixed monthly interest rate r. If you do not make any payments on your loan, then your balance will ``balloon" exponentially. The formula y_n = (1+r)^n y_0 expresses the monthly balance after n months when no loan payments are made. Some people prefer to pay a fixed amount of p dollars at the end of each month. If p is large enough, then the balance may even shrink down! This situation is easily described by a difference equation. To write it down, compute the next month's balance from the previous month's balance:

 y_{n+1} = (1+r)y_n - p.  (73)
In differential equations, we often solve a similar problem in continuous time, with continuous compounding and continuous payments schedule. This is a mathematical simplification of the actual banking situation, but the answers are reasonably close, and they are easier to get.
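The loan recurrence can be checked against its closed-form solution y_n = (1+r)^n y_0 - p((1+r)^n - 1)/r. A sketch in Python; the closed form is a standard formula quoted here as an assumption rather than derived in the text, and the dollar figures are made up for the example.

```python
def balance_by_recurrence(y0, r, p, n):
    """Iterate y_{k+1} = (1+r)*y_k - p for n months."""
    y = y0
    for _ in range(n):
        y = (1 + r) * y - p
    return y

def balance_closed_form(y0, r, p, n):
    """Closed-form solution of the same recurrence."""
    growth = (1 + r) ** n
    return growth * y0 - p * (growth - 1) / r

# borrow $10000 at 1% per month, pay $300 per month
print(balance_by_recurrence(10000, 0.01, 300, 12))
print(balance_closed_form(10000, 0.01, 300, 12))  # the two agree
```

The closed form makes visible the critical payment p = r·y_0: for smaller p the balance grows, for larger p it shrinks.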
Here is an example of the mathematical problem that leads to a natural, but nontrivial^{26} difference equation.
Example 59 Suppose we want to find a formula for the sum of all consecutive integers from 1 to n. Let the answer be y_n. Then we have the recurrence

 y_{n+1} = y_n + (n+1), y_1 = 1.  (74)
Now (74) can be written as y_{n+1} - y_n = n+1, and actually this is why we use the name difference equations rather than recurrence relations. Notice that we do know the solution of its continuous analogue. The equation y′ = t+1 for the unknown function y = y(t) resembles (74). Its solution^{27} is y(t) = (1/2)t² + t. So to find the answer to (74) we may try substituting u_n = (1/2)n² + n into the equation. Unfortunately, simple arithmetic shows that we don't get the right answer, as

 u_{n+1} - u_n = n + 3/2 ≠ n + 1.  (75)

The method we used, subtracting the equations, works for the so called linear difference equations (and also linear differential equations). It is closely related to the Principle of Superposition for linear differential equations.
Before jumping to models that use explicit randomness in their evolution, it is quite illuminating to analyze first some mathematically simple and well defined ``deterministic evolutionary processes". The following set of examples describes deterministic evolution in discrete time of a system described by a single numerical parameter x Î (0,1). All of the examples fall into the category of (nonlinear) difference equations: given initial value x_{0} Î (0,1) and a simple evolution equation of the form x_{n+1} = g(x_{n}), we are supposed to make inferences about the behavior of the solutions.
There is no randomness in the evolution itself. But since we are allowed to choose any initial condition, and no initial condition can be measured exactly, we may as well consider the initial value to be random.
Example 60 Equation

Example 61 Let {x} denote the fractional part of x. Consider the equation

 x_{n+1} = {2x_n}.
Here is a rather surprising fact. Suppose x_0 is selected at random. Since x_k is a function of x_0, and x_0 is random, x_k becomes a random variable. It may happen that x_k < 1/2; call this event A_k. The reasoning presented in Section 6.1.1 implies that the events {A_k} are independent and have the same probability Pr(A_k) = 1/2. Therefore the deterministic evolution equation contains at least as much randomness as tosses of a coin.
The last example may perhaps give the impression that it is the discontinuity of the evolution equation that is the source of difficulties. This is not at all the case.

Example 7.3 indicates that ``naive" prediction of the future of a system is unreliable even within the realm of deterministic evolution equations.
On the other hand, there are aspects of the evolution that we can analyze reliably. These deal with the average behavior of the evolution.
There are two classes of questions that we can answer, but both deal with statistical nature of the evolution:
The quick way to get insight into the operation of real systems is to model their behavior. Here are examples of enterprises that operate under randomness. Mathematicians have devised methods of modelling each of these, but it is interesting also to simulate their behavior. Notice that simulations require assumptions, but so do the analytical methods. Regardless of the method, we have to be careful about which questions we can answer.
Example 63 Suppose we want to study how much capital is needed to run a casino. We need to answer the following.
There are also a number of questions that we do not want to answer.
Example 64 Suppose we want to study how much capital is needed to insure cars. We need to answer the following.
There are also a number of questions that we do not want to answer. For instance, we do not want to predict whether I'll file an insurance claim today, driving back home on I-74 without enough sleep since I was preparing this class until 3 AM.
Example 65 A construction company has n jobs to be performed in the future. For each of these jobs, experts provide an estimate of the cost. Then, after questioning, they complement the estimates with lower/upper bounds for the costs. How much money should be allotted?
Example 66
A store averages λ(t) customers per hour at time t of the day. Each customer brings some (random) profit. However, a customer may just leave the store without shopping, if the lines are too long.
A random walk is a process of the form X_n = ∑_{j=1}^n ξ_j, where the ξ_j are i. i. d. Random walks have independent increments, and describe the accumulation of independent contributions over time.
Random walks are examples of Markov processes, which will be studied in detail in Chapter . Their special structure allows us to analyze them independently of the general theory.
Definition 19 T: Ω → ℕ ∪ {∞} is a stopping time if the event {T = n} is independent of ξ_{n+1}, ξ_{n+2}, ....
This definition is specialized to random walks. In a more general Markov case the definition is less transparent but captures the same idea.
The most important example of a stopping time is the first entrance time T = inf{k: X_k ∈ A}. An example that is not a stopping time is the last exit from a set.
When T < ∞ we define the random sum X_T = ∑_{j ≤ T} ξ_j. The following theorem is an exercise when T is independent of the ξ's.
Theorem 21 If the ξ_j are i. i. d. with Eξ = μ, and T is a stopping time with ET < ∞, then
 EX_T = μ ET.  (76)
Proof.
EX_T = ∑_n E(X_n; T = n) = ∑_n ∑_{k=1}^n E(ξ_k; T = n) = ∑_k E(ξ_k; T ≥ k).
Since ξ_k and {T ≥ k} = {T < k}′ are independent, E(ξ_k; T ≥ k) = μ Pr(T ≥ k);
therefore EX_T = μ ∑_k Pr(T ≥ k) = μ ET
by the tail integration formula (21).
^{[¯]}
Theorem 22 If the ξ_j are i.i.d. with Eξ = μ, Var(ξ) = σ² < ∞, and T is a stopping time with ET < ∞, then
 E(X_T − μT)² = σ² ET.  (77)
(These formulas are of interest in branching processes, and in chromatography.)
Example 67 The number of checks cashed at a bank per day is a Poisson random variable N with mean λ = 200. The amount of each check is a random variable with a mean of $30 and a standard deviation of $5. If the bank has $6860 on hand, is the demand likely to be met?
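A rough simulation check of this example (a Python sketch; the example does not specify the distribution of check amounts, so normal amounts are assumed here): the random-sum formulas give mean 200 × $30 = $6000 and, for a Poisson number of summands, variance 200 × (5² + 30²) = 185000, i.e. a standard deviation of about $430, so $6860 is two standard deviations above the mean.

```python
import math
import random

random.seed(1)

def poisson1():
    """One Poisson(1) variate by Knuth's multiplicative method."""
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def daily_demand():
    # N ~ Poisson(200), as a sum of 200 independent Poisson(1)'s
    n = sum(poisson1() for _ in range(200))
    # assumed: check amounts normal with mean 30, sd 5
    return sum(random.gauss(30, 5) for _ in range(n))

trials = [daily_demand() for _ in range(2000)]
mean = sum(trials) / len(trials)
p_short = sum(t > 6860 for t in trials) / len(trials)
print(round(mean), p_short)
```

The simulated shortfall probability is small (a few percent), consistent with the two-standard-deviation cushion.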
Problem 44 Suppose ξ_k, T are independent. Find the variance of X_T in terms of the first two moments of ξ, T.
The sample is injected into a column, and the molecules are transported along its length by electric potential, or by a flow of gas or liquid. The basis for chromatographic separation of a sample of molecules is the difference in their physical characteristics. The molecules switch between two phases, mobile and stationary, and the separation of compounds is caused by the difference in the times spent in each of the phases.
Suppose that the molecules of a compound spend random independent amounts of time U_k in the mobile phase and random amounts of time W_k in the stationary phase. Thus at time t the position of a molecule is given by a random sum v∑_{j=1}^{T(t)} U_j, where T(t) = inf{k: ∑_{j=1}^k (U_j + W_j) > t}.
Section 7.5.1 gives formulas for the mean and the variance of the position. Since the number of transitions T is likely to be large for a typical molecule, it isn't surprising that the actual position has an (asymptotically) normal distribution. (The actual Central Limit Theorem for random sums is not stated in these notes.)
[ 27 Simulate the output of the chromatography column of fixed length separating a pair of substances that have different distributions of mobile and stationary phases U_{k}, W_{k}. Select a hundred particles of each substance, and measure the degree of separation at the end of the column.
Suppose a gambler can afford to lose the amount L > 0, while the casino has capital C < 0. Let ξ_j = ±1 be i. i. d. random variables modelling the outcomes of consecutive games, let S_n be the partial sums (representing gains of the gambler), and let T = inf{k: S_k ≥ L or S_k ≤ C} be the total number of games played. Then Pr(T > k) = Pr(C < S_k < L) and thus ET = ∑_k Pr(C < S_k < L).
The special case Pr(ξ = ±1) = 1/2 is easily solved, since in this case ES_T = Eξ · ET = 0. Let p = Pr(S_T = C) denote the probability of ruining the casino. Since S_T is either C or L, we have 0 = ES_T = pC + (1−p)L, giving p = L/(L−C). This formula means that a gambler has a fair chance of ruining a casino in a fair game, provided he brings with him enough cash L.
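The formula p = L/(L−C) is easy to test by simulation; here is a Python sketch with illustrative values L = 3, C = −9, so that the predicted probability is 3/12 = 1/4.

```python
import random

random.seed(2)

def ruin_probability(L, C, trials=20000):
    """Fraction of fair +/-1 walks from 0 that hit C before L."""
    hits = 0
    for _ in range(trials):
        s = 0
        while C < s < L:
            s += random.choice((-1, 1))
        hits += (s <= C)
    return hits / trials

p_hat = ruin_probability(L=3, C=-9)
print(p_hat)   # theory: L/(L - C) = 3/12 = 0.25
```

The empirical frequency settles near 0.25, matching the optional-stopping argument above.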
For more general random walks (and less fair games) the probability of the gambler's ruin can be found explicitly using the one-step analysis (Section ). It is also interesting to find how long such a game would last on average. (The expression for ET given above is not explicit.)
Let X_t denote the number of bacteria in the t-th generation, with X_0 = 1. Assume that a bacterium can die with probability q > 0, or divide into two cells with probability p = 1 − q, and that all deaths occur independently. Our goal here is to find the average number of bacteria m(t) = E(X_t) in the t-th generation. This can be recovered from Theorem 7.5.1. Instead, we show another method based on conditioning.
The number of bacteria in the next generation is determined by binomial probabilities: Pr(X_{t+1} = 2k | X_t = n) = (n choose k) p^k q^{n−k}. Therefore E(X_{t+1} | X_t) = 2p X_t, and the average population size m(t) = E(X_t) satisfies the difference equation m(t+1) = 2p m(t). We have m(t) = (2p)^t. In particular, the population on average grows exponentially when p > 1/2.
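The relation m(t) = (2p)^t can be checked by simulating the population directly; a Python sketch (parameters chosen for illustration only):

```python
import random

random.seed(3)

def final_generation(p, generations):
    """One run: each cell independently dies (prob 1-p) or splits in two."""
    x = 1
    for _ in range(generations):
        x = sum(2 for _ in range(x) if random.random() < p)
    return x

p, t, runs = 0.75, 5, 4000
avg = sum(final_generation(p, t) for _ in range(runs)) / runs
print(avg)   # theory: (2p)^t = 1.5^5, about 7.59
```

Averaging over many runs is essential here: individual runs vary wildly (many go extinct while a few explode), exactly the phenomenon discussed next.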
It is perhaps surprising that a population of bacteria with p = 3/4, which on average grows by 50% per generation, still has a 1/3 chance of going extinct. One way to interpret this is to say that infections by a ``deadly" and rapidly developing disease may still have a large survival rate without any intervention of medicine or the immune system. (The methods to compute such probabilities will be introduced in Section . The answer above assumes infection by a single cell.)
Problem 45 Find the formula for the variance Var(X_t) of the number of bacteria in the t-th generation.
Problem 46 What is the probability that an infection by 10 identical bacteria with the doubling probability p = 3/4 dies out?
evolution, n. [L. evolutio (onis)], an unrolling or opening...
Webster's New Universal Unabridged Dictionary
Markov processes are perhaps the simplest models of a random evolution without long-term memory.
A Markov process is a sequence X_t of random variables indexed by discrete time t ∈ ℤ₊, or by continuous time t ≥ 0, that satisfies the so-called Markov property. The set of all possible values of the variables X_t is called the state space of the Markov chain. Typical examples of state spaces are ℝ, ℕ, the set of all nonnegative pairs of integers, and finite sets.
Markov chains are Markov processes with discrete time. Thus a Markov chain is an infinite sequence {X_k}_{k ∈ ℤ₊} of (usually, dependent) random variables with short-term (one-step) memory.
The formal definition of the Markov property is as follows. Definition 20 A family of discrete r. v. {X_k}_{k ∈ ℤ₊} is a Markov chain if
Pr(X_{n+1} = x_{n+1} | X_0 = x_0, ..., X_n = x_n) = Pr(X_{n+1} = x_{n+1} | X_n = x_n) for all n and all states x_0, ..., x_{n+1}.
Examples of Markov chains are:
Examples of non-Markov processes are easy to construct, but the lack of the Markov property is not always obvious to verify. In general, if X_k is a Markov process, Y_k = f(X_k) may fail to be Markov.
For a homogeneous chain with transition matrix P, the Markov condition takes the form
 Pr(X_{n+1} = j | X_n = i, X_{n−1}, ..., X_0) = P_{i,j}.  (78)
Example 68 Suppose X_{n} is a Markov chain with periodic transition probabilities P_{n} = P_{n+T}. Then Y_{n} = (X_{n+1},X_{n+2},..., X_{n+T}) is a homogeneous Markov chain.
Problem 47 Suppose the ξ_j are independent {0,1}-valued with Pr(ξ = 1) = p. Let X_n = aξ_n + bξ_{n+1}, where ab ≠ 0.
Explain why X_n is a Markov chain.
Write the transition matrix for the Markov chain X_n.
[ 3 The probabilities Pr(X_n = j | X_0 = i) are given by the (i,j) entries of the matrix P^n.
Proof.
This is the consequence of Markov property (78) and
the total probability formula (10).
^{[¯]}
Powers of moderately sized matrices are easy to compute on a computer. Section  indicates a mathematical method of computing P^n for small dimensions using the Cayley-Hamilton theorem. Under certain conditions the powers converge.
[ 28 Find lim_{n→∞} P^n for the matrix from Problem 8.1.1.
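Problem 8.1.1's matrix is not reproduced here; taking the matrix of Problem 50 as an illustration, the convergence of the powers P^n can be watched numerically (a Python sketch, with the matrix product written out by hand):

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    """n-th power of P, starting from the identity matrix."""
    R = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        R = mat_mul(R, P)
    return R

P = [[0.25, 0.75], [0.75, 0.25]]
P50 = mat_pow(P, 50)
print(P50)   # both rows approach (1/2, 1/2)
```

For this matrix the second eigenvalue is −1/2, so the entries of P^n approach 1/2 geometrically fast.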
Suppose Pr(X_0 = k) = p_k, where the p_k solve the stationarity equations
 p_k = ∑_j p_j P_{j,k},  ∑_k p_k = 1.
Then the resulting process is stationary: the distribution of each k-tuple (X(t_1), ..., X(t_k)) is invariant under shifts in time,
(X(t_1+s), ..., X(t_k+s)) ≅ (X(t_1), ..., X(t_k)).
This is interpreted as ``equilibrium", or ``steady state". Notice that ``steady state" is a statistical concept, and is not easily visible in a simulation of a single trajectory. In order to be able to see it, one has to simulate a large number of independently evolving Markov chains that begin with the same initial distribution and have the same transition matrix. If X_t is a stationary Markov process and f is a function on its state space, then Y(t) = f(X_t) is also stationary, although not necessarily Markov.
If X_j(t) are independent realizations of the same Markov process and f is a function on their state space, then Y_n(t) = n^{−1/2} ∑_{j=1}^n (f(X_j(t)) − m), where m = Ef(X(t)), is a stationary and approximately normal random sequence.
A finite state Markov chain is regular if there is a deterministic number N such that all states are connected by paths of length at most N.
Problem 48 Show that a regular Markov chain is aperiodic and irreducible.
Additive functionals of a Markov process are expressions of the form (1/n) ∑_{j=0}^{n−1} f(X_j). Under certain conditions, the averages converge and the limit doesn't depend on the initial distribution. Under certain conditions, partial sums are approximately normal.
Problem 49 Let X_k be an irreducible {0,1}-valued Markov chain with invariant initial distribution.
For regular (finite state space) Markov chains the limit actually exists independently of the initial state. Therefore the stationarity equations can be used to find the limit.
Problem 50 Let
P = [ 1/4  3/4
      3/4  1/4 ].
Problem 51 Let
P = [ 0  1
      1  0 ].
Problem 52 Let
P = [ 1  0
      0  1 ].
Suppose X_k is a Markov chain with transition matrix
P = [ a    1−a
      1−b  b   ].
If 0 < a, b < 1, the powers P^n converge, and each row of the limit equals the stationary distribution
( (1−b)/(2−a−b), (1−a)/(2−a−b) ).
Problem 53 Suppose X_k is a Markov chain with transition matrix
P = [ a    1−a
      1−b  b   ].
Then Y_n = (X_n, X_{n+1}) is also a Markov process.
Find its transition matrix and the stationary distribution.
[ 29 Suppose n people on a committee discuss a certain issue. When one person finishes speaking, we assume that it is equally likely that any of the other n1 begins. We also assume that each person speaks for exponentially distributed random time. How long does it take on average for all participants to take part in the discussion?
For each possible configuration of teams, we could determine the average score, and thus determine which player distribution is the best, if there is one. If there is no single arrangement that wins against all other arrangements, then we should still be able to find the optimal mixed strategy.
[ 30 For the Markov chain defined above, select a pair of configurations and determine the following.
The optimal strategy in a game is the best (in the above minimax sense) randomized strategy against every deterministic strategy of the opponent. Problem 54 What is the ``optimal" arrangement of players in soccer?
Example 69 In the setup of Section 7.5.3, suppose Pr(ξ = 1) = p = 1 − Pr(ξ = −1). Let T be the first time the random walk reaches L > 0 or C < 0. Find the probability of winning (ruining the casino), Pr(X_T = C | X_0 = 0).
Note that even though we are interested in a single number Pr(X_T = C | X_0 = 0), the first-step analysis requires computing the probabilities p(x) = Pr(X_T = C | X_0 = x) for all initial positions x.
Example 70 Suppose Pr(ξ = 1) = p, Pr(ξ = −1) = 1 − p. Let T be the first time the random walk reaches L > 0 or C < 0. Find the average length of the game E_0(T).
Note that the first-step analysis requires computing m(x) = E_x(T) for all initial positions x.
Problem 55 On average, how long does it take for a symmetric random walk on a square lattice to exit from a d×d square?
Problem 56 For the Markov chain defined in Section 8.2.1, select a pair of player configurations and determine the following.
Problem 57 A fair coin is tossed repeatedly until k = 2 successive heads appear. What is the average number of tosses required? (Hint: see Problem 8.1.1, or run a simulation.)
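Problem 57's hint suggests running a simulation; here is a Python sketch (for k successive heads the classical answer is 2^{k+1} − 2, i.e. 6 for k = 2, which the simulation should approximate):

```python
import random

random.seed(4)

def tosses_until_run(k):
    """Number of fair-coin tosses until k heads in a row appear."""
    run = tosses = 0
    while run < k:
        tosses += 1
        run = run + 1 if random.random() < 0.5 else 0
    return tosses

n = 20000
avg = sum(tosses_until_run(2) for _ in range(n)) / n
print(avg)   # should be close to 6
```

The same loop with k = 8 and success probability 1/2 per day answers Problem 58 as well.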
Problem 58 One of the quality control rules is to stop the process when on k = 8 successive days the process runs above, or below, the specification line. On average, how often will a process that does not need any adjustment be stopped by this rule?
Suppose that you have an insurance contract for a vehicle. The premium payment is due yearly in advance, and there are four levels P_1 > P_2 > P_3 > P_4. The basic premium is P_1, unless no claims were made in the previous year. If no claims were made in a given year, then the premium for the following year is upgraded to the next category.
Suppose that the probability of damage larger than s hundred dollars is e^{−s} (substitute your favorite density for the exponential density used in this example). Because of the incentive of a lowered premium, not all damage should be reported. The goal of this example is to determine the numbers s_1, s_2, s_3, s_4 above which claims should be filed; s_j is the cutoff used in a year when premium P_j is paid.
Let X_t be the type of premium paid in the t-th year. Clearly, the transitions are i → 1 with probability e^{−s_i}, 1 → 2 with probability 1 − e^{−s_1}, etc.
In the long run, the yearly cost is
 (82) 
[ 31 Find optimal s_{j} when P_{1} = 800, P_{2} = 720, P_{3} = 650, P_{4} = 600 (about 10% discount).
Each player has to toss the dice at least once per turn. The player can choose to toss the dice as many times as (s)he wishes as long as no ace occurs. However, the current total is added to the player's score only when the player ends his turn voluntarily.
If an ace occurs, then the player's turn is over, and his score is not increased. If two aces occur then the score is set to zero.
A player chooses the following strategy: toss the dice until an ace occurs, or until the sum of the current subtotal and score exceeds the number K. (If her score exceeds K she tosses the dice once, as this is required by the rules.) The quantity to optimize is the average number of turns it takes the player to reach the score of a hundred. The number of turns under this strategy is the number of aces while the score is less than K, plus the number of throws when the score is larger than K.
If X_k denotes the player's score after the k-th throw, then clearly X_k is a Markov chain with a finite number of states, and with chance 1/36 of returning to zero at each turn.
Which value of K minimizes the average number of turns it takes the player to reach a hundred?
A simple-minded computation would just take into account the average gain and disregard the score completely. If the player's subtotal is t, then after an additional throw the total is on average 23/36 × (t+7). This is less than t for t ≥ 13, thus there is no point continuing beyond that. Is this conclusion correct? Should the player choose to stop once the total on the dice exceeds 12? [ 32 What is the average number of turns it takes to reach a 100 under this strategy?
A less simple-minded computation would take into account the average gain: if the player's score is s, then after an additional throw his score on average is 1/36 × 0 + 1/3 × s + 23/36 × (s+7), which is more than s for s < 7 × 23. Is this conclusion correct? Should the player always choose to toss again? [ 33 What is the average number of turns it takes to reach a 100 under this strategy?
[ 34 What is the average number of turns it takes to reach a 100 under the cautious strategy of no additional tosses?
Another computation would take into account the current score s and the current subtotal t. After an additional throw the numbers on average are s_1 = 1/36 × 0 + 35/36 × s, t_1 = 23/36 × (t+7). On average, t_1 + s_1 > t + s when t < (161 − s)/13.
More complicated strategies could depend on the current totals of other players, the current score of the player, and the subtotal in some more complicated fashion. For instance, if s denotes the score and t denotes the current subtotal, one could stop using the two-parameter criterion t > A − Bs. This may have seemingly different effects than the previous strategy: when A is large and the score s is close to 0 there is no need for stopping; but if the accumulated score s is large, the player may behave more cautiously.
One could optimize the probability of winning against k = 3, 4 players instead of just minimizing the average number of tosses; this puts the problem into a game-theory framework. Simulations seem to indicate that the strategy based on t < 25 − (1/9)s works well against inexperienced human players.
Theorem 23 State x is recurrent iff ∑_t P_{x,x}(t) = ∞.
Proof. Let f be the probability of ever returning to x; state x is recurrent iff f = 1. Let M be the number of times X_t returns to state x. Suppose f < 1. Clearly
Pr_x(M ≥ k) = f^k, thus E_x(M) = f/(1−f) < ∞. Since M = ∑_t 1_{{X_t = x}}, we have E_x(M) = ∑_t P_{x,x}(t), so the series is finite exactly when f < 1.
^{[¯]}
An interesting fact: simple random walks in ℝ^d are recurrent when d < 3, and transient when d ≥ 3. However, the return times have infinite averages.
Theorem 24 Let T be a return time, and suppose m(x) = E_x T < ∞. Then P_{x,x}(t) → 1/m(x) as t → ∞.
Problem 59 Suppose a Markov chain X_t moves to the right, k → k+1, with probability 1/k, or returns to 1 with probability 1 − 1/k, k = 1, 2, .... Find its stationary distribution, and the average return time to state k.
This section describes a more refined method for the randomized search of minima of functions.
Suppose a finite set V is given, together with a set E of pairs (the edges of a graph on V), and we are to minimize a function U: V → ℝ.
For θ > 0, define a transition matrix on V by
 P_{u,v} = |N(u)|^{−1} min{1, e^{−θ(U(v) − U(u))}} for (u,v) ∈ E, and P_{u,v} = 0 for u ≠ v, (u,v) ∉ E,  (83)
with P_{u,u} chosen so that each row sums to 1.
Theorem 25 The invariant measure of the transition matrix (83) is π_θ(u) = (1/Z) e^{−θU(u)}.
Proof. To check the invariance condition, denote N(u) = {v: (u,v) ∈ E}.
^{[¯]}
An efficient realization of the above is to pick v ∈ N(u) at random and accept it with probability
 min{1, e^{−θ(U(v) − U(u))}}.
'Declarations of subprograms
DECLARE SUB Modify (Path!(), i!)
DECLARE SUB GetNeighbor (P!(), T!)
DECLARE SUB GetRandomPair (i0!, P!(), n!)
DECLARE SUB display ()
DECLARE SUB AssignDistances (n!)
DECLARE SUB GetPermutation (P!())
DECLARE FUNCTION PathLen! (P!())
'
'Clear display and write a title
CLS
LOCATE 2, 1
INPUT ; "Shortest distance between how many cities?", n
CLS
nMax = 19 'only 19 cities in DATA statement
IF n > nMax THEN n = nMax
'declare matrices required
DIM P(n), BestP(n) 'permutation and best permutation; P(i) = # of ith city visited
'shared matrices
DIM SHARED dist(nMax, nMax) 'distances between cities i, j
DIM SHARED city(nMax) AS STRING 'names of cities
'read distances
CALL AssignDistances(nMax)
'initial permutation
FOR j = 1 TO n
  P(j) = j + 1
  BestP(j) = j + 1
NEXT j
P(n) = 1
BestP(n) = 1
MinLen = PathLen(P())
T = 1 / 10000 'theta = 1/temperature
DO 'infinite loop
  T = T * 1.1
  PRINT "T="; 1 / T
  FOR ct = 1 TO 100 * n
    k$ = INKEY$ 'check if key pressed
    IF k$ > "" THEN EXIT DO 'exit infinite loop
    CALL GetNeighbor(P(), T)
    display
    x = PathLen(P())
    IF x < MinLen THEN 'memorize
      MinLen = x
      FOR j = 1 TO n
        BestP(j) = P(j)
        PRINT city(P(j)); ">";
      NEXT j
      PRINT city(P(1))
      PRINT "Best so far: "; MinLen
    END IF
  NEXT ct
LOOP
'Print final message
CLS
PRINT "ALPHABETIC TOUR OF FIRST "; n; " CITIES in the USA"
PRINT "Simulated Annealing Recommended Route ";
FOR j = 1 TO n - 1
  PRINT city(BestP(j)); ">";
NEXT j
PRINT city(BestP(n)); ">"; city(BestP(1))
PRINT "Total distance: "; MinLen
PRINT "Can you find a better one?"
LOCATE 22, 40
PRINT "(Distances subject to change)"
END

cities: 'FIRST 19 cities in alphabetic order (Source: AAA map)
DATA " Albany", "Albuquerque", "Atlanta", "Atlantic City", "Augusta"
DATA "Austin", "Baltimore", "Baton Rouge", "Billings", "Birmingham", "Bismarck", "Boise", "Boston"
DATA "Buffalo", "Calgary", "Charlotte", "Cheyenne", "Chicago", "Cincinnati"

distances: 'from city number RowNumber to RowNumber+k, k=0 to nMax
DATA 0,2046,1067,315,336,1985,358,1585,2071,1167,1654,2615,174,285,2485,808,1789,823,724
DATA 0,1430,1986,2384,849,1949,1110,1006,1298,1423,1205,2222,1761,1712,1663,541,1289,1347
DATA 0,872,1287,1011,671,580,2019,158,1571,2451,1125,926,2425,258,1625,740,487
DATA 0,535,1790,163,1390,2085,968,1668,2629,373,455,2522,614,1803,837,664
DATA 0,2205,578,1805,2407,1383,1990,2951,162,621,2677,1029,2125,1159,1060
DATA 0,1627,436,1650,853,1558,2011,2043,1591,2356,1269,1204,1155,1152
DATA 0,1227,1948,805,1531,2492,416,459,2385,451,1666,700,527
DATA 0,1878,422,1589,2239,1643,1309,2443,838,1432,976,870
DATA 0,1861,417,732,2245,1786,706,2019,465,1248,1542
DATA 0, 1521,2293,1221,922,2375,416,1467,690,483
DATA 0,1149,1828,1369,927,1602,882,831,1125
DATA 0,2789,2330,875,2495,826,1792,2040
DATA 0,459,2515,867,1963,997,898
DATA 0,2200,693,1504,538,439
DATA 0,2456,1171,1685,1979
DATA 0,1669,771,490
DATA 0,966,1214
DATA 0,294
DATA 0

SUB AssignDistances (n)
'This reads all distances to dist(i, j)
RESTORE cities
FOR j = 1 TO n
  READ city(j)
NEXT j
RESTORE distances
REDIM dist(n, n)
FOR j = 1 TO n
  FOR i = j TO n
    READ x
    dist(i, j) = x
    dist(j, i) = x
    'minimal error handling
    IF x = 0 AND i <> j THEN PRINT "Error in DATA before row "; j; "col"; i: STOP
  NEXT i
NEXT j
END SUB

SUB display
'show information about what the program does
STATIC tst
tst = tst + 1
y = CSRLIN
x = POS(1)
LOCATE 1, 45
PRINT " Hit any key to stop"
LOCATE 2, 60
PRINT " "
LOCATE 2, 60
PRINT "Trying path #"; tst
LOCATE y, x
END SUB

SUB GetNeighbor (P(), T)
'Move at random to another nearby permutation P
n = UBOUND(P)
CP = PathLen(P())
DIM Q(n)
DO
  FOR k = 1 TO n: Q(k) = P(k): NEXT
  i = INT(n * RND(1)) + 1
  j = INT(i * RND(1)) + 2
  IF j > n THEN j = 1
  SWAP Q(i), Q(j)
  Prob = EXP(T * (CP - PathLen(Q())))
  'IF Prob > 1 THEN EXIT DO
  IF RND(1) < Prob THEN EXIT DO
LOOP
SWAP P(i), P(j)
END SUB

FUNCTION PathLen (P())
'Compute length of path P according to distances in dist(i, j)
n = UBOUND(P)
s = dist(P(1), P(n))
FOR j = 1 TO n - 1
  s = s + dist(P(j), P(j + 1))
NEXT j
PathLen = s
END FUNCTION
[ 35 Notice that the DO ... LOOP selecting the next entry in GetNeighbor has random length. What is the average length of its execution?
In the example below we limit our attention to the onedimensional problem. This simplifies the notation, while the basic ideas are the same.
Let u(x,t) be an unknown function of x ∈ ℝ^d, t ≥ 0. The difference equation we may want to solve is the following discrete analog of the diffusion equation:
 u(x, t+1) = (1/(2d)) ∑_{k=1}^d ( u(x + e_k, t) + u(x − e_k, t) ),  u(x, 0) = u_0(x).
The solution is u(x,t) = E(u_0(x + S_t)), where S_t = ∑_{j ≤ t} ε_j is the sum of independent random variables with the 2d equally likely values ±e_k ∈ ℝ^d.
Suppose ξ_k is a stationary Markov chain and let X_n be the solution of the difference equation X_{n+1} − aX_n = ξ_{n+1}. One can write the transition matrix for the Markov process X_t and try to find the stationary distribution for X_0.
A more direct method is based on the fact that the solution of the difference equation is X_t = a^t X_0 + a^{t−1}ξ_1 + ... + aξ_{t−1} + ξ_t. Therefore if |a| < 1, the stationary initial distribution is X_0 = ∑_{k=0}^∞ a^k ξ_{−k}, and thus X_t = ∑_{k=0}^∞ a^k ξ_{t−k}.
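The stationary solution is easy to simulate by iterating the difference equation past a burn-in period; here is a Python sketch with i. i. d. uniform ξ, so that Var ξ = 1/3 and, summing the geometric series ∑ a^{2k}, the stationary variance is (1/3)/(1 − a²).

```python
import random

random.seed(5)

a, n, burn = 0.5, 200000, 1000
x, samples = 0.0, []
for t in range(n + burn):
    x = a * x + random.uniform(-1, 1)   # xi_t: mean 0, variance 1/3
    if t >= burn:                       # discard the transient
        samples.append(x)

var = sum(s * s for s in samples) / len(samples)
print(var)   # theory: (1/3)/(1 - a*a) = 4/9, about 0.444
```

A single long trajectory suffices here because the additive-functional averages of this chain converge, as discussed in Section 8.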
Problem 60 Suppose the ξ_k are i. i. d. Find the covariance EX_0X_k.
Problem 61 Suppose the ξ_k are i. i. d. Find E(X_k | X_0).
Solutions of higher order difference equations can be easily put into the Markov framework, too. If X_{n+2} + aX_{n+1} + bX_n = ξ_{n+1}, then Y_n = (X_{n+1}, X_n) is Markov and satisfies the corresponding equation in matrix form: Y_{n+1} = AY_n + Ξ_{n+1}, where Ξ_{n+1} = (ξ_{n+1}, 0). Therefore the stationary solution exists provided that the eigenvalues of A satisfy the inequality |λ_j| < 1.
To sort efficiently a set S of numbers into ascending order, we should find a number y such that about half of the elements of S are below y. Then the total number of steps required satisfies T(n) ≤ T(n/2) + n + C(n), where C(n) is the number of comparisons required to find y.
Random Quick Sort is based on the idea that a random choice of y is good enough on average.
Theorem 26 The expected number of comparisons in random quick sort is at most 2n ln(n+1).
Here is one possible realization of the subprogram:
SUB QuickSort (Index(), Aux(), First%, Last%)
'Sorts two matrices in increasing order by the values of Index(), from First% to Last%
'Note: mixes order of indices corresponding to equal Index(j)
'** Quicksort (ascending) the fields in Array$(), from field First% thru field Last%
IF First% >= Last% THEN EXIT SUB
CONST max = 30
DIM Lft%(max + 1), Rght%(max + 1)
Temp% = 1
Lft%(1) = First%
Rght%(1) = Last%
DO
  Start% = Lft%(Temp%)
  Ende% = Rght%(Temp%)
  Temp% = Temp% - 1
  DO
    IndexLft% = Start%
    IndexRght% = Ende%
    x = Index((Start% + Ende%) \ 2)
    DO
      WHILE Index(IndexLft%) < x AND IndexLft% < Ende%
        IndexLft% = IndexLft% + 1
      WEND
      WHILE x < Index(IndexRght%) AND IndexRght% > Start%
        IndexRght% = IndexRght% - 1
      WEND
      IF IndexLft% > IndexRght% THEN EXIT DO
      SWAP Index(IndexLft%), Index(IndexRght%) '** switch elements
      SWAP Aux(IndexLft%), Aux(IndexRght%)
      IndexLft% = IndexLft% - (IndexLft% < Ende%)
      IndexRght% = IndexRght% + (IndexRght% > Start%)
    LOOP
    IF IndexRght% - Start% >= Ende% - IndexLft% THEN
      IF Start% < IndexRght% THEN
        Temp% = Temp% + 1
        Lft%(Temp%) = Start%
        Rght%(Temp%) = IndexRght%
      END IF
      Start% = IndexLft%
    ELSE
      IF IndexLft% < Ende% THEN
        Temp% = Temp% + 1
        Lft%(Temp%) = IndexLft%
        Rght%(Temp%) = Ende%
      END IF
      Ende% = IndexRght%
    END IF
  LOOP WHILE Start% < Ende%
  IF Temp% > max THEN Temp% = 0
LOOP WHILE Temp%
END SUB
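The bound of Theorem 26 can also be watched empirically by counting comparisons in a short Python version of random quick sort (a sketch, counting one comparison per element examined against the pivot, not tuned for speed):

```python
import math
import random

random.seed(6)

def quicksort(a, counter):
    """Random quick sort; counter[0] accumulates pivot comparisons."""
    if len(a) <= 1:
        return a
    pivot = random.choice(a)          # the random choice of y
    counter[0] += len(a)              # each element is compared to the pivot
    left = [v for v in a if v < pivot]
    mid = [v for v in a if v == pivot]
    right = [v for v in a if v > pivot]
    return quicksort(left, counter) + mid + quicksort(right, counter)

n = 1000
data = random.sample(range(10 * n), n)
counter = [0]
out = quicksort(data, counter)
print(counter[0], round(2 * n * math.log(n + 1)))   # observed vs. 2n ln(n+1)
```

On typical runs the observed count stays comfortably below the 2n ln(n+1) bound.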
Theorem 27 Suppose g(x) is an increasing (nondecreasing) function and X_t is a Markov chain on ℕ that moves left only, with E(X_{t+1} | X_t = m) ≤ m − g(m). Let T be the time of reaching 1. Then E_n(T) ≤ ∫_1^n dx/g(x).
Proof. By induction,
E_n(T) ≤ 1 + E∫_1^X dx/g(x) = 1 + ∫_1^n dx/g(x) − E∫_X^n dx/g(x) ≤ 1 + ∫_1^n dx/g(x) − E[(n−X)/g(n)] ≤ ∫_1^n dx/g(x),
where X = X_1 is the position after one step; the last two inequalities use g(x) ≤ g(n) for x ≤ n and E(n − X) ≥ g(n).
^{[¯]}
As an application we consider the following algorithm to pick the k-th number in order from a set S of n numbers.
Problem 62 Estimate the average running time of the above algorithm. (ANS: ET ≤ 4 ln n).
Suppose certain objects multiply independently and in discrete time intervals. Each object at the end of every period produces a random number ξ of descendants (offspring) with the probability distribution p_k = Pr(ξ = k). Let X_t be the total number of objects in the t-th generation.
Then, in distribution, X_{t+1} = ∑_{j=1}^{X_t} ξ_j. The three questions of interest are the average size of the population, its variance, and the probability of extinction.
Definition 21 By extinction we mean the event that the random sequence {X_t} consists of zeros for all but a finite number of values of t ∈ ℕ.
The probability of extinction by time n can be found directly from the first-step analysis: the numbers u_n = Pr(X_n = 0) satisfy
 u_{n+1} = g(u_n), where g(z) = ∑_k p_k z^k and u_0 = 0.  (86)
Let m_n = E(X_n), V_n = Var(X_n). By Theorem 7.5.1 and induction we have
 m_n = m^n,  V_n = σ² m^{n−1} (m^n − 1)/(m − 1)  (for m ≠ 1),  (87)
where m = Eξ and σ² = Var(ξ).
Theorem 28 [Watson(1874)] The generating function H_n(z) of X_n is the n-fold composition (iteration) of g.
Proof. E(z^{X_{n+1}} | X_n = k) = (g(z))^k. Thus by the total probability
formula (46),
H_{n+1}(z) = Ez^{X_{n+1}} = ∑_k (g(z))^k Pr(X_n = k) = H_n(g(z)).
^{[¯]}
In particular, EX_n = (d/dz) g^{∘(n)}(z)|_{z=1} = m^n and
u_n = Pr(X_n = 0) = g^{∘(n)}(0).
Problem 63 Prove (87) using moment generating function directly.
Notice that if the limit q = lim_{n→∞} g^{∘(n)}(0) exists, then it has to satisfy the equation
 q = g(q).  (89)
Since X_t is integer valued and the events A_n = {X_n = 0} are increasing, by continuity (1) the probability of extinction q = lim_{n→∞} Pr(X_n = 0) exists.
Theorem 29 If EX_{1} £ 1, the extinction probability q = 1. If EX_{1} > 1, the extinction probability is the unique nonnegative solution less than 1 of the equation (89).
Proof. This is best understood by graphing g(s) and marking the iterates on the diagonal.
Check by induction that g^{∘(n)}(0) < 1 for all n ∈ ℕ. If there is a solution s_0 < 1 of g(s) = s, then it is the ``attractive point" of the iteration.
^{[¯]}
The generating function is g(z) = q + (1−q)z². The asymptotic probability of extinction solves the quadratic equation (1−q)z² − z + q = 0. The roots are z_1 = 1 and z_2 = q/(1−q). In particular, the probability of extinction is 1 when q ≥ 1/2.
When q = 1/4 the probabilities of extinction by the n-th generation are u_0 = 0, u_1 = .25, u_2 = .296875, u_3 = .3161, u_4 = .3249, u_5 = .33127, u_6 = .3323.
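The iterates u_n = g^{∘(n)}(0) for g(z) = q + (1−q)z² are trivial to compute numerically; the Python sketch below reproduces the values listed above for q = 1/4 and exhibits the limit q/(1−q) = 1/3.

```python
def extinction_probs(q, n):
    """Iterates u_{k+1} = g(u_k) with g(z) = q + (1-q) z^2, u_0 = 0."""
    u = 0.0
    out = [u]
    for _ in range(n):
        u = q + (1 - q) * u * u
        out.append(u)
    return out

u = extinction_probs(0.25, 60)
print([round(v, 6) for v in u[:5]])   # 0, 0.25, 0.296875, 0.316101, ...
print(round(u[-1], 6))                # approaches 1/3
```

The convergence is geometric with ratio g′(1/3) = 1/2, which is why a few dozen iterations already give the limit to machine accuracy.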
The generating function is g(z) = (1rq) +q(1r) r[z/( 1rz)]. The most interesting feature of this moment generating function is that it can be readily composed.
[ 2 The composition of fractional linear functions f(z) = (a+bz)/(c+dz) and g(z) = (A+Bz)/(C+Dz) is a fractional linear function (with the coefficients given by matrix multiplication!).
Asymptotic probability of extinction has to solve quadratic equation 1rq+q(1r) r[z/( 1rz)] = z. The roots are z_{1} = 1 and z_{2} = [(q)/( 1q)]. In particular, probability of extinction is 1 when q ³ ^{1}/_{2}.
Since the iterates of the generating function can actually be written down explicitly, in the geometric case Pr(X_n = 0) = 1 − m^n(1 − p_e)/(m^n − p_e) is explicit. Here p_e = z_2, m = ....
Problem 64 Suppose that in a branching process the number of offspring of the initial seedling has a distribution with generating function F(z). Each member of the next generation has the number of offspring whose distribution has generating function G(z). The distributions alternate between generations.
Problem 65 In a simple model of linear immunological response, the doubling probability p of the population of bacteria changes with time due to the increased number of lymphocytes. If there are X(t) bacteria in the t-th generation, then assume p = a/(t+a). Find the probability u_t of extinction by the t-th generation for an infection by 10 bacteria. What is the average length of the disease?
Univariate normal distribution, standardization, and its moment generating function were introduced in Chapter 4. Below we define multivariate normal distribution.
We follow the usual linear algebra notation. Vectors are denoted by small bold letters x, v, t, matrices by capital bold initial letters A, B, C, and vector-valued random variables by capital boldface X, Y, Z. By the dot we denote the usual dot product in ℝ^d, i.e. x·y := ∑_{j=1}^d x_j y_j; ||x|| = (x·x)^{1/2} denotes the usual Euclidean norm. For typographical convenience we sometimes write (a_1, …, a_k) for the column vector [a_1, …, a_k]^T.
Below we shall also consider another scalar product ⟨·,·⟩ associated with the normal distribution; the corresponding seminorm will be denoted by the triple bar |||·|||.
Definition 22 An ℝ^d-valued random variable Z is multivariate normal, or Gaussian (we shall use both terms interchangeably; the second term will be preferred in abstract situations), if for every t ∈ ℝ^d the real valued random variable t·Z is normal.
Example 71 Let ξ_1, ξ_2, … be i. i. d. N(0,1). Then X = (ξ_1, ξ_2, …, ξ_d) is multivariate normal.
Example 72 Let ξ be N(0,1). Then X = (ξ, ξ, …, ξ) is multivariate normal.
Example 73 Let ξ_1, ξ_2, … be i. i. d. N(0,1). Then X = (X_1, X_2, …, X_T), where
X_k = ∑_{j=1}^k ξ_j are the partial sums, is multivariate normal.
Clearly the distribution of the univariate t·Z is determined uniquely by its mean m = m_t and its standard deviation σ = σ_t. It is easy to see that m_t = t·m, where m = EZ. Indeed, by linearity of the expected value m_t = Et·Z = t·EZ. Evaluating the moment generating function M(s) of the real-valued random variable t·Z at s = 1, we see that the moment generating function of Z can be written as
 E e^{t·Z} = exp( t·m + ½σ_t² ).
C = [

 



C_{Y} = [

 



[
 

Denoting ρ = sin 2θ, it is easy to check that
Var(Y_1) = σ_1², Var(Y_2) = σ_2², corr(Y_1, Y_2) = ρ.
This implies that the joint density of Y_1 and Y_2 is given by
f(y_1,y_2) = (2πσ_1σ_2√(1−ρ²))^{−1} exp( −( y_1²/σ_1² − 2ρy_1y_2/(σ_1σ_2) + y_2²/σ_2² ) / (2(1−ρ²)) ).  (90)
Another representation is
Y_1 = σ_1 γ_1,
Y_2 = σ_2 (ργ_1 + √(1−ρ²) γ_2).
The latter representation makes the following Theorem obvious in the bivariate case.
Theorem 30 Let X, Y be jointly normal.
(i) If X, Y are uncorrelated, then they are independent.
(ii) E(Y|X) = m + AX is linear in X for some deterministic vector m and matrix A.
(iii) Y − AX and X are independent.
Returning to the random variables X_1, X_2, we have X_1 = γ_1 σ_1 cos θ + γ_2 σ_1 sin θ and X_2 = γ_1 σ_2 sin θ + γ_2 σ_2 cos θ; this representation holds true also in the degenerate case.
In matrix notation,
X = Aγ.  (91)
Clearly, m = 0. Equation (91) expresses X as a linear transformation X = Aγ of the i.i.d. standard normal vector γ = (γ_1, γ_2) with
A = [ σ_1 cos θ   σ_1 sin θ ]
    [ σ_2 sin θ   σ_2 cos θ ].  (92)
The expression ½‖Δx‖² can be interpreted as proportional to the kinetic energy of the motion described by the path x; assigning probabilities proportional to Ce^{−Energy/(kT)} is a well-known practice in statistical physics. In continuous time, the derivative plays an analogous role.
To simulate an arbitrary d-dimensional normal distribution we need only simulate d independent N(0,1) random variables and use a linear representation as in Theorem 32. For such a simulation the covariance matrix needs to be factored (for instance diagonalized), a numerical nuisance in itself. When the multivariate normal distribution occurs as a so-called time series, a method based on Fourier expansion is convenient; see the section on stationary processes below.
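The linear representation above can be sketched in a few lines. The snippet below is a Python/NumPy illustration rather than a BASIC program, and it uses the Cholesky factor A of the covariance matrix (so that C = AA^T) instead of diagonalization; both factorizations serve the same purpose.

```python
import numpy as np

def sample_multivariate_normal(m, C, n, rng):
    """Draw n samples from N(m, C) via the Cholesky factor A, C = A A^T."""
    A = np.linalg.cholesky(C)             # requires C positive definite
    g = rng.standard_normal((n, len(m)))  # i.i.d. N(0,1) entries: the vector gamma
    return m + g @ A.T                    # each row is m + A gamma

rng = np.random.default_rng(0)
C = np.array([[2.0, 1.0], [1.0, 2.0]])
X = sample_multivariate_normal(np.zeros(2), C, 200_000, rng)
print(np.cov(X.T))  # sample covariance, close to C
```

The sample covariance of the simulated points reproduces C up to sampling error.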
The linear algebra results imply that the moment generating function corresponding to a normal distribution on IR^{d} can be written in the form
M(t) = E exp(t·Z) = exp(t·m + ½ t·Ct).  (93)
Theorem 31 The moment generating function corresponding to a normal random variable Z = (Z_{1}, ¼, Z_{d}) is given by (93), where m = EZ and C = [c_{i, j}], where c_{i, j} = Cov(Z_{i}, Z_{j}), is the covariance matrix.
From (93) and linear algebra we get also
M(t) = exp(t·m + ½‖A^T t‖²), where C = AA^T.  (94)
We have the following multivariate generalization of (55).
Theorem 32 Each ddimensional normal random variable Z has the same distribution as m+Ag, where m Î IR^{d} is deterministic, A is a (symmetric) d×d matrix and g = (g_{1}, ¼, g_{d}) is a random vector such that the components g_{1}, ¼, g_{d} are independent N(0, 1) random variables.
Proof. Clearly, E exp(t·(m+Aγ)) = exp(t·m) E exp(t·(Aγ)). Since the moment generating function of γ is E exp(x·γ) = exp(½‖x‖²) and t·(Aγ) = (A^T t)·γ, we get
E exp(t·(m+Aγ)) = exp(t·m) exp(½‖A^T t‖²),
which is another form of (94).
□

Theorem 33 If Z is centered normal with the nonsingular covariance matrix C, then the density of Z is given by
f(z) = (2π)^{−d/2} (det C)^{−1/2} exp(−½ z·C^{−1}z).
In the nonsingular case the density expression implies strong integrability.
Theorem 34 If Z is normal, then there is ε > 0 such that
E exp(ε‖Z‖²) < ∞.
Remark 3 Theorem 34 holds true also in the singular case and for Gaussian random variables with values in infinite-dimensional spaces.
Theorem 35 If (X, Y) has jointly normal distribution on IR^{d1+d2}, then
E{X|Y} = a + QY.  (95)
The vector a = m_X − Qm_Y and the matrix Q are determined by the expected values m_X, m_Y and by the (joint) covariance matrix C (uniquely if the covariance C_Y of Y is nonsingular). To find Q, multiply (95) (as a column vector) from the right by (Y−EY)^T and take the expected value; this gives Q = R·C_Y^{−1}, where we have written C as the block matrix
C = [ C_X   R   ]
    [ R^T   C_Y ].
Problem 66 For the random walk from Section 10.2.1, what is E(X_k | X_1, …, X_{k−1}, X_{k+1}, …, X_n)?
Problem 67 Suppose X_1, …, X_d are jointly normal, EX_j = 0, EX_j² = 1, and all covariances EX_iX_j = r are the same for i ≠ j.
Find E(X_1 | X_2, X_3, …, X_d). (Notice that in this example necessarily r ≥ −1/(d−1).)
Suppose (X_t)_{t = 0,1,…} is a Markov chain with multivariate normal distribution; that is, suppose X_0 is normal, and the transition probabilities are normal, too.
Without loss of generality we assume EX_t = 0, EX_t² = 1 and let EX_0X_1 = r. Then E(X_{t+1}|X_t) = rX_t and therefore EX_0X_t = r^t. This means that the covariance matrix of the Markov Gaussian process depends on the single parameter r only. Comparing the answer with Section 8.7, we find that all homogeneous Markov Gaussian processes have the form X_t = Σ_{k=0}^∞ γ_{k+t} r^k, where the γ_k are independent normal r. v.
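The recursion behind this covariance structure is easy to check by simulation. The Python sketch below (names and parameters are illustrative) generates a stationary Gaussian Markov chain via X_{t+1} = rX_t + √(1−r²)γ_{t+1} and estimates a lagged covariance, which should be r^k for lag k:

```python
import numpy as np

def gaussian_markov(r, T, rng):
    """Stationary Gaussian Markov chain with EX_t = 0, EX_t^2 = 1 and
    corr(X_t, X_{t+k}) = r**k, built from X_{t+1} = r X_t + sqrt(1-r^2) gamma."""
    X = np.empty(T)
    X[0] = rng.standard_normal()          # start in the stationary N(0,1) law
    c = np.sqrt(1.0 - r * r)
    for t in range(T - 1):
        X[t + 1] = r * X[t] + c * rng.standard_normal()
    return X

rng = np.random.default_rng(1)
X = gaussian_markov(0.5, 100_000, rng)
print(np.mean(X[:-2] * X[2:]))  # lag-2 covariance, close to r^2 = 0.25
```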
The Poisson distribution occurs as an approximation to the binomial. Another reason for its occurrence is related to exponential random variables and counting customers in queues; the latter is perhaps the most efficient way of simulating Poisson random variables, see Section 6.2.4. A related reason for the frequent use of the Poisson distribution in modeling is the law of rare events.
Definition 23 A Poisson process of intensity λ > 0 is an integer-valued stochastic process {N_t : t ≥ 0} with the following properties: N_0 = 0, the increments over disjoint time intervals are independent, and N_{t+s} − N_s has the Poisson distribution Poiss(λt).
Suppose cars pass by an intersection and the times between their arrivals are independent and exponential. Thus we are given an i.i.d. sequence T_j of exponential r. v. (with parameter λ). The number of cars that passed within the time interval (0,t) is a random variable N_t. Clearly, N_t is the largest integer k such that T_1+…+T_k ≤ t.
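This description translates directly into a simulation. A Python sketch (the course recommends BASIC, but the counting logic is the same in any language):

```python
import random

def N(t, lam, rng):
    """Number of arrivals in (0, t]: the largest k with T_1+...+T_k <= t,
    where the T_j are i.i.d. exponential with rate lam."""
    k, total = 0, 0.0
    while True:
        total += rng.expovariate(lam)  # next interarrival time
        if total > t:
            return k
        k += 1

rng = random.Random(0)
counts = [N(10.0, 1.0, rng) for _ in range(20_000)]
print(sum(counts) / len(counts))  # close to lam * t = 10
```

The empirical mean of the counts approximates E N_t = λt.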
Theorem 36 For t > 0, s > 0 the random variable N_{t+s} − N_t is independent of N_t and has the Poisson distribution Poiss(sλ).
Proof. We will prove only that N_t has the Poisson distribution. To simplify notation assume that the exponential r. v. have parameter λ = 1. We prove the formula Pr(N_t = k) = (t^k/k!)e^{−t} by induction.
k = 0: Pr(N_t = 0) = Pr(T_1 > t) = ∫_t^∞ e^{−x} dx = e^{−t}.
Suppose the formula is true for k. Conditioning on the first arrival time T_1 = s, we get
Pr(N_t = k+1) = ∫_0^t Pr(N_{t−s} = k) e^{−s} ds = ∫_0^t ((t−s)^k/k!) e^{−(t−s)} e^{−s} ds = e^{−t} t^{k+1}/(k+1)!.
Problem 68 Assume a device fails when a cumulative effect of k shocks occurs. If the shocks occur according to the Poisson process with parameter l, what is the density function for the life T of the device?
Problem 69 Let u(x,t) = Ef(x+N_t), where N_t is the Poisson process of intensity λ = 1. Show that ∂u/∂t = u(x+1,t) − u(x,t).
In particular, p_t(k) = Pr(N_t = k) satisfies ∂p_t(k)/∂t = p_t(k−1) − p_t(k).
Problem 70 Customers arrive at a facility at random according to a Poisson process of rate λ. The customers are dispatched (processed) in groups at the deterministic times T, 2T, 3T, ….
There is a waiting-time cost c per customer per unit of time, and a dispatch cost K. What choice of T minimizes the long-run average cost per unit of time?
Problem 71 Find the mean EN_{t}, variance Var(N_{t}) and the covariance cov(N_{t},N_{s}).
Problem 72 Let X(t) = (−1)^{N_t}. Find the mean EX_t, the variance Var(X_t), and the covariance cov(X_t, X_s).
The rate λ in the Poisson process has a probabilistic interpretation:
Pr(N_{t+h} − N_t = 1) = λh + o(h) as h → 0.  (96)
Theorem 37 N_{t} = N((0,t]) is the Poisson process of intensity l.
Example 74 A careless programmer assigns memory locations to the variables in his program^{30} at random. Suppose that there are M → ∞ locations and N = λM variables. Let X_i be the number of variables assigned to location i. If each location is equally likely to be chosen, show that the distribution of X_i converges to the Poisson distribution Poiss(λ).
Example 75 While testing a program, the number of bugs discovered follows the Poisson process with intensity λ = 5 errors per hour. The tester's fiancée enters the test area and agrees to wait for the tester to find just one more bug. How long will she wait on average: 12 minutes, or 6 minutes?
The Poisson process N_t counts the number of events. If each event results in a random (and independent) outcome ξ_j, then the total is the compound Poisson process Z(t) = Σ_{j=1}^{N_t} ξ_j.
The moments and also the moment generating function of Z(t) can be determined through conditioning. Section 7.5.1 implies that if Eξ = μ and Var(ξ) = σ², then E(Z(t)) = λμt and Var(Z(t)) = λ(σ²+μ²)t.
If the damage accumulates additively, then Z(t) represents the total damage sustained up to time t. Suppose that the system continues to operate as long as this total damage is less than some critical value c, and fails otherwise. Then the (random) failure time T satisfies T > t if and only if Z(t) < c. Therefore Pr(T > t) = Σ_{n=0}^∞ Pr(Σ_{k=1}^n ξ_k ≤ c | N_t = n)(λt)^n e^{−λt}/n!. Thus ET = (1/λ)Σ_{n=0}^∞ Pr(Σ_{j=1}^n ξ_j ≤ c). In particular, if the ξ_k are exponential with parameter μ, then ET = (1+cμ)/λ.
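A minimal Python sketch of the compound Poisson process, assuming (as an illustrative choice) exponential damages with mean μ = 1, can be used to check the formula EZ(t) = λμt:

```python
import random

def compound_poisson(t, lam, rng):
    """Z(t) = sum of the N_t i.i.d. exponential(1) damages arriving by time t."""
    total, clock = 0.0, rng.expovariate(lam)  # clock = time of next event
    while clock <= t:
        total += rng.expovariate(1.0)         # damage xi_j, mean mu = 1
        clock += rng.expovariate(lam)
    return total

rng = random.Random(2)
Z = [compound_poisson(5.0, 2.0, rng) for _ in range(20_000)]
print(sum(Z) / len(Z))  # E Z(t) = lam * mu * t = 2 * 1 * 5 = 10
```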
Given a discrete-time Markov chain, there are many ways of ``running it'' in continuous time. One particular method is to make the moves at random moments of time. If the times between moves are exponential, then the resulting continuous-time process is Markov, too. This is the so-called embedded Markov chain.
Non-pathological continuous-time Markov processes with a countable state space have an embedded Markov chain representation. In such a representation we run a continuous-time clock based on independent exponential random variables. Once the time comes, we select the next position according to the transition probabilities of a discrete-time Markov chain.
The theory of continuous-time Markov chains (that is, processes with a countable state space) is similar to the discrete-time theory. The linear first-order difference equations for probabilities are replaced by systems of first-order differential equations.
Example 76 Suppose X_n is a two-state Markov chain with the transitions 0 → 1 with probability a and 1 → 0 with probability b. From Section 8.1 we know that P(X_k = 1) → a/(a+b) as k → ∞.
Let T_k be i.i.d. exponential r. v. and let Y(t) = X_k when T_1+…+T_k < t < T_1+…+T_{k+1}.
The function p(t) = Pr(Y(t) = 1) satisfies the differential equation p′(t) = −λ(a+b)p(t) + λa. Indeed, Pr(Y(t+h) = 1) ≈ Pr(Y(t) = 1, no switch to 0 by time h) + Pr(Y(t) = 0, a switch to 1 occurs by time h).
Therefore p(t) = a/(a+b) + Ce^{−λ(a+b)t}, where C = p(0) − a/(a+b); for p(0) = 1 this gives p(t) = a/(a+b) + (b/(a+b))e^{−λ(a+b)t}.
Here is a slightly different method to run the continuous time finitevalued Markov chain.
Pick the initial value according to a prescribed distribution. Then select an exponential random variable for each of the possible transitions; each can have a different parameter λ_k. Then select the smallest, T = min T_j. It can be shown that T is exponential with parameter Σλ_j and that Pr(T = T_j) = λ_j/Σ_k λ_k.
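The claim about the minimum of independent exponentials is easy to test empirically. A Python sketch, with illustrative rates of my choosing:

```python
import random

rng = random.Random(3)
rates = [1.0, 2.0, 3.0]   # transition rates lambda_k
wins = [0, 0, 0]
trials = 60_000
for _ in range(trials):
    clocks = [rng.expovariate(lam) for lam in rates]  # competing exponential clocks
    wins[clocks.index(min(clocks))] += 1              # which transition fired first
# Pr(T = T_j) should be lambda_j / sum(rates) = 1/6, 2/6, 3/6
print([w / trials for w in wins])
```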
Let p_k(t) = Pr(X(t) = k). Then p_k′(t) = −(a+b)k p_k + (k+1)b p_{k+1} + a(k−1)p_{k−1}. The population average m(t) = Σ_k k p_k(t) satisfies m′(t) = (a−b)m(t), thus m(t) = m(0)e^{(a−b)t}.
Problem 73 Suppose X_t is a Markov process whose birth rate is an+a and death rate is bn, with b > a. This describes a population of species that die out in a given habitat but have a constant rate of ``invasion''. One would expect that such competing effects result in some sort of equilibrium. Find the average population size as t → ∞.
Example 77 Suppose X_t = (−1)^{N_t}, where N_t is the Poisson process. Is X_t Markov?
Exercise 36 Customers arrive at a burger outlet at a rate λ, and after an exponential service time with mean 1/μ they leave.
Exercise 37 Customers arrive at a burger outlet at a rate λ, and after an exponential service time with parameter μ they leave. If a second cashier is opened, the average service time is halved, but the cost of hiring the second cashier is $5 per hour. A customer purchases $5 on average, with a profit of $2 over the costs. If k customers are waiting in line, the next person driving by will stop with probability 2^{−k}. Should the second cashier be opened?
Continuous-time Markov processes with an uncountable state space require more advanced mathematical tools. Only the Gaussian case can be briefly mentioned here.
Definition 24 A stochastic process {X_t}_{0 ≤ t < ∞} is Gaussian if the n-dimensional r. v. (X_{t_1}, …, X_{t_n}) has a multivariate normal distribution for all n ≥ 1 and all t_1, …, t_n ∈ [0,∞).
The simplest way to define the Wiener process is to list its properties as follows.
Definition 25 The Wiener process {W_t} is a Gaussian process with continuous trajectories such that
W_0 = 0,  (97)
EW_t = 0,  (98)
cov(W_t, W_s) = min(t, s).  (99)
A stochastic process {X_t}_{t ∈ [0,1]} has continuous trajectories if it is defined by a C[0,1]-valued random vector, or if all of its paths are continuous^{31}. For the infinite time interval t ∈ [0,∞), a stochastic process has continuous trajectories if its restriction to t ∈ [0,N] has continuous trajectories for all N ∈ ℕ.
Conditions (97)-(99) imply that the Wiener process has independent increments, i.e. W_0, W_t − W_0, W_{t+s} − W_t, … are independent.
Series expansions for the Wiener process are available in the literature. One way to obtain these is from Fourier expansion for the covariance function.
Problem 74 Let u(x,t) = Ef(W_t + x), where f is a smooth function. Show that u satisfies the parabolic equation ∂u/∂t = ½ ∂²u/∂x².
Problem 75 What partial differential equation is solved by u(x,t) = Ef(aW_{t}+x+bt) when a,b are nonzero constants?
The scaled Wiener process is a good model of diffusion; one can use the two-dimensional Wiener process to model a two-dimensional diffusion. The two-dimensional Wiener process is obtained from two independent one-dimensional components. The diffusion coefficient a² has units [length²/time] and is implemented by scaling the Wiener process: aW_t has diffusion coefficient a².
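As a sketch, the scaled two-dimensional Wiener process can be approximated by a Gaussian random walk with independent N(0, a²·dt) increments in each coordinate. The Python snippet below (step count and parameters are illustrative choices) checks that Ex² = a²t:

```python
import random

def wiener_2d(t, a, n_steps, rng):
    """Approximate a W_t in the plane by a Gaussian random walk:
    each coordinate gets independent N(0, a^2 dt) increments."""
    dt = t / n_steps
    x = y = 0.0
    sd = a * dt ** 0.5
    for _ in range(n_steps):
        x += rng.gauss(0.0, sd)
        y += rng.gauss(0.0, sd)
    return x, y

rng = random.Random(4)
pts = [wiener_2d(2.0, 1.5, 200, rng) for _ in range(20_000)]
print(sum(x * x for x, _ in pts) / len(pts))  # E x^2 = a^2 t = 4.5
```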
Exercise 38 An eye-irritant pollutant is emitted from a factory chimney located at x = 0, y = 10 on the xy-plane, and the wind blows left-to-right with velocity v(y) = y at height y. At a distance L = 100 downwind, there is a residential building of height 15. Which floor of the building is polluted the most? Is there a significant difference in pollution level between the floors? (Assume that the distances are in units such that the diffusion coefficient is 1.)
This chapter contains additional topics on discrete time processes. We begin with several examples of possible simulations. The resulting random curves are very different, but they have the same ``meansquare" behavior.
Example 78 Suppose we wish to pick a curve at random; in other words, we need a random function X(t) of the integer parameter t. Here is one way to do it: take X(t) = Σ_k γ_k a_k cos(kt), where the γ_k are i.i.d. normal N(0,1). To ensure that the series converges we need Σ_k a_k² < ∞; for instance we can take a_k = ∫_0^{2π} g(θ)cos(kθ) dθ. The theory of Fourier series tells us that if ∫g²(θ) dθ < ∞, then the coefficients a_k are square-summable.
Example 79 Suppose we wish to pick a curve at random. Here is another way to do it: take X(t) = Σ_k c_k cos(kt + Φ_k), where the Φ_k are independent and uniformly distributed on the interval (0,2π). Again select c_k = ∫_0^{2π} g(θ)cos(kθ) dθ.
Exercise 39 Write simulations of the curves as described. Pick as g a trigonometric polynomial, say g(θ) = 1 + cos(θ)cos(2θ).
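A Python version of the simulation in Example 78 might look as follows (the coefficients a_k are passed in directly rather than computed from g, an assumption made to keep the sketch short):

```python
import math
import random

def random_curve(a, T, rng):
    """X(t) = sum_k gamma_k a_k cos(k t), with gamma_k i.i.d. N(0,1).
    a is the list of coefficients a_k for k = 0, ..., len(a)-1."""
    g = [rng.gauss(0.0, 1.0) for _ in a]
    return [sum(gk * ak * math.cos(k * t)
                for k, (gk, ak) in enumerate(zip(g, a)))
            for t in range(T)]

rng = random.Random(5)
curve = random_curve([1.0, 1.0, -1.0], 10, rng)
print(curve)  # one random trajectory of length 10
```

At t = 0 all cosines equal 1, so X(0) is normal with variance Σ a_k², which gives a quick sanity check on the simulation.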
Stationary processes are those whose probabilistic characteristics (distributions, conditional distributions, moments, covariances) do not change with time.
Example 80 Suppose X_n is a Markov chain with Pr(X_0 = j) = p(j), where p is its invariant distribution. Then (X_0, X_k) ≅ (X_n, X_{n+k}) in distribution.
For second-order stationary processes only the means, variances, and covariances do not change with time; that is, m(t) = EX(t) = const and cov(X(t), X(s)) = K(t−s). The second-order theory of processes is a very coarse theory. Nevertheless, it does solve the best linear prediction problem.
Exercise 4 In each of the introductory Examples, EX(t) = 0 and cov(X(t), X(s)) = K(t−s), where K(t) = ∫_0^{2π} cos(θt) f(θ) dθ.
The covariance K(t) of a weakly stationary process is a positive definite function; that is, Σ_{i,j} c_i c̄_j K(t_i−t_j) = E|Σ_j c_j X(t_j)|² ≥ 0 for all c_j ∈ ℂ, t_j ∈ ℝ. In addition, K(t) = K(−t).
Theorem 38 Given a positive definite even function K: ℤ → ℝ, there is a (0,2π)-valued random variable Θ such that for all integer t
K(t) = K(0) E cos(tΘ).  (100)
Exercise 5 Suppose Θ from Theorem 38 has density f(θ). Let g(θ) = √f(θ) and define X_t as in Example 78. Then X_t is Gaussian, mean zero, with covariance (100).
The following theorem follows immediately from the proof of Theorem 5.2; the assumption holds true in particular when X_n has a spectral density.
Theorem 39 Suppose X_n is weakly stationary. If cov(X_0, X_n) → 0 as n → ∞, then (1/n)(X_1+…+X_n) → EX in L_2-norm.
Proof. Compute the variance and use Cesaro summability.
□
The covariance argument can be rewritten in spectral notation. Suppose EX_0X_k = (2π)^{−1}∫_{−π}^{π} e^{iks} f(s) ds. Then E(X_1+…+X_n)² = (2π)^{−1}∫_{−π}^{π} |Σ_{k=1}^n e^{iks}|² f(s) ds, so Var((1/n)(X_1+…+X_n)) = (2π)^{−1} n^{−2}∫_{−π}^{π} (sin²(½ns)/sin²(½s)) f(s) ds → 0 as n → ∞.
In the linear prediction problem we deal with linear estimators a_0 + Σ_j a_j X_j. The quadratic error involves variances, covariances, and averages only. Thus it is appropriate to handle it through second-order processes, and the solution should depend only on the density of Θ, the so-called spectral density.
The general Hilbert space theory tells us that the best linear prediction of X_{t+1} based on the past is Σ_{j=0}^t a_j X_j with coefficients a_j such that EX_j(X_{t+1} − Σ_{i=0}^t a_i X_i) = 0 for all 0 ≤ j ≤ t. This is a system of t+1 linear equations for t+1 unknown coefficients. Gram-Schmidt orthogonalization allows us to replace the X_j by orthonormal ξ_j; the optimal prediction uses Σ_{j=0}^t a_j ξ_j with a_j = E(X_{t+1}ξ_j).
Suppose the ξ_k are i.i.d. A stochastic process X_t is an autoregressive process if it satisfies a linear difference equation
X_{t+1} = a_0 X_t + a_1 X_{t−1} + … + a_p X_{t−p} + ξ_{t+1}.  (101)
Example 82 A moving average X_{t+1} = (1/d)Σ_{j=1}^d ξ_{t−j} is an autoregressive process. What is the corresponding difference equation (101)?
For an autoregressive process the optimal one-step prediction of X_{t+1} is
X̂_{t+1} = a_0 X_t + a_1 X_{t−1} + … + a_p X_{t−p}.  (102)
More general autoregressive sequences are defined as solutions of
X_t + b_1 X_{t−1} + … + b_p X_{t−p} = ξ_t + c_1 ξ_{t−1} + … + c_q ξ_{t−q}.  (103)
These generalize simultaneously autoregressive and moving average processes.
First we describe the population at a single instant. We consider the model in which each hereditary character is carried by a pair of genes. For simplicity, we consider only one gene pair, and only one hereditary character corresponding to one locus. There are several possible alleles (categories) in a locus; we assume there are only two, denoted by a, A, of which A is dominant. Each individual is thus described by one of the pairs AA, Aa, aa, the so-called genotype. However, a is obstructed from view by A, so for an outside observer the individuals AA and Aa are indistinguishable. A statistical study of such a population can therefore only yield the proportion P_A of individuals with the ``A-feature'', and not the actual probabilities of the three possible genotypes.
Now we turn to the modelling of the generation change. Under simple assumptions we shall be able to find the frequencies of the genotypes; usually this information is not directly available. We assume that the next generation occurs by random mating. Let
p_AA(0), p_Aa(0), p_aa(0) be the actual (and as yet unknown) probabilities of the genotypes. Under random mating with independent selection of parents, the probability that an offspring has genotype AA is p_AA(1) = (p_AA(0) + ½p_Aa(0))².
Denoting p_A(t) = p_AA(t) + ½p_Aa(t), we get the Hardy-Weinberg equilibrium: after one generation the proportion p_A(t) stabilizes and the genotype frequencies become
p_AA = p_A², p_Aa = 2p_A(1−p_A), p_aa = (1−p_A)².
This determines the actual proportions of the genotypes from the proportions of observed A-carriers and a-carriers.
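The Hardy-Weinberg frequencies can be verified by a direct simulation of random mating. In the Python sketch below (starting frequencies are an illustrative choice) each parent passes gene A with probability p_A = p_AA + ½p_Aa:

```python
import random

def offspring_genotype(p_AA, p_Aa, rng):
    """One child under random mating: each parent independently passes
    gene A with probability p_A = p_AA + p_Aa/2."""
    p_A = p_AA + p_Aa / 2
    genes = sorted("Aa"[rng.random() >= p_A] for _ in range(2))
    return "".join(genes)   # "AA", "Aa" or "aa"

rng = random.Random(6)
counts = {"AA": 0, "Aa": 0, "aa": 0}
trials = 50_000
for _ in range(trials):
    counts[offspring_genotype(0.5, 0.2, rng)] += 1
# p_A = 0.6, so expect AA: 0.36, Aa: 0.48, aa: 0.16
print({k: v / trials for k, v in counts.items()})
```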
Problem 76 Show that P_Aa = 2(√P_aa − P_aa).
The next question is to study the effects of selection, where, say, genotype aa has a different chance of survival. This leads to the total probability formula and Markov chains.
Theorem 40 If A, B, C are n×n matrices such that AB ≠ C, and X is a random vector whose coordinates are independent and take each of two values with probability ½, then Pr(ABX = CX) ≤ ½.
Proof. For a nonzero vector v we have Pr(v·X = 0) ≤ ½; apply this to a nonzero row v of the matrix AB − C.
□
The following beautiful theorem due to B. de Finetti points out the role of exchangeability as a substitute for independence.
Theorem 41 Suppose that X_1, X_2, … is an infinite exchangeable sequence. Then there exists a σ-field 𝒩 such that X_1, X_2, … are 𝒩-conditionally i.i.d., that is,
Pr(X_1 ∈ A_1, …, X_n ∈ A_n | 𝒩) = Pr(X_1 ∈ A_1 | 𝒩) ⋯ Pr(X_1 ∈ A_n | 𝒩)
for all n and all measurable sets A_1, …, A_n.
We will use the following (weak) version of the martingale^{32} convergence theorem.
Theorem 42 Suppose ℱ_n is a decreasing family of σ-fields, i.e. ℱ_{n+1} ⊂ ℱ_n for all n ≥ 1. If X is integrable, then E{X|ℱ_n} → E{X|ℱ} in L_1-norm, where ℱ is the intersection of all the ℱ_n.
Proof. Suppose first that X is square integrable. Subtracting m = EX if necessary, we can reduce the convergence question to the centered case EX = 0. Denote X_n = E{X|ℱ_n}. Since ℱ_{n+1} ⊂ ℱ_n, by Jensen's inequality EX_n² is a decreasing nonnegative sequence. In particular, EX_n² converges.
Let m < n be fixed. Then E(X_n−X_m)² = EX_n² + EX_m² − 2EX_nX_m. Since ℱ_n ⊂ ℱ_m, we have
EX_nX_m = E(E{X_m|ℱ_n}X_n) = EX_n²,
so E(X_n−X_m)² = EX_m² − EX_n² → 0 as m, n → ∞; thus X_n satisfies the Cauchy condition and converges in L_2 (and hence in L_1).
If X is not square integrable, then for every ε > 0 there is a square integrable Y such that E|X−Y| < ε. By Jensen's inequality E{X|ℱ_n} and E{Y|ℱ_n} differ by at most ε in L_1-norm; this holds uniformly in n. Since by the first part of the proof E{Y|ℱ_n} is convergent, it satisfies the Cauchy condition in L_2 and hence in L_1. Therefore for each ε > 0 we can find N such that for all n, m > N we have E|E{X|ℱ_n} − E{X|ℱ_m}| < 3ε. This shows that E{X|ℱ_n} satisfies the Cauchy condition and hence converges in L_1.
The fact that the limit is X_∞ = E{X|ℱ} can be seen as follows. Clearly X_∞ is ℱ_n-measurable for all n, i.e. it is ℱ-measurable. For A ∈ ℱ (hence also in ℱ_n), we have EXI_A = EX_nI_A. Since
|EX_nI_A − EX_∞I_A| ≤ E|X_n − X_∞|I_A ≤ E|X_n − X_∞| → 0,
we get EX_nI_A → EX_∞I_A.
This shows that EXI_A = EX_∞I_A and, by definition,
X_∞ = E{X|ℱ}.
□
Proof of Theorem 41. Let 𝒩 be the tail σ-field, i.e. 𝒩 = ⋂_{n ≥ 1} σ(X_n, X_{n+1}, …). The conditional i.i.d. property then follows from exchangeability together with the convergence theorem (Theorem 42); we omit the details of the computation.
A string is a sequence of letters, or symbols, from a finite alphabet. For the purpose of computer modelling, we can assume that a string is a sequence of natural numbers {1, …, d}, the parameter d being the size of the alphabet.
Three simple examples of strings are words and sentences in, say, English, and DNA molecules. Here d = 26 (for lowercase words), d = 94 for sentences, and d = 6 for DNA (there are only four nucleotides, but extra symbols are used to mark various undecided cases).
The question of comparing two strings for similarities arises in molecular biology and in designing a spellchecker, or a speech recognition system. Accordingly, one would like to say which strings are similar, and how likely it is that they are similar due to chance only. Additional complications arise from the fact that two strings compared do not necessarily have the same length.
A simple way to compare two strings of the same length is to count the symbols that don't match (the Hamming distance). For instance abbacd and babacd would then have distance 2. But abbacd and bbacda would have distance 5, even though the second is just the first with its initial letter moved to the end.
A less obvious way to compare two strings is to measure the edit distance: how many editing operations are needed to transform one of the strings into the other. Usually the editing operations are: replacement of one letter by another, deletion of a letter, insertion of a letter, and transposition of two adjacent letters.
These are suitable for spellcheckers, where it is known that about 80% of typing errors are of the above form; thus most mistyped words have edit distance 1 from the original.
Accordingly, the edit distance is set to 0, if the words are identical, 1 if they differ by a single error of one of the listed types, 2 if there were two such errors, etc. Formally, it is defined as the smallest number of elementary ``TE" transformations required to transform one of the words into another.
The method of computation is based on a recurrence. It is easy to see that the number of transformations between an empty word and another word is exactly the length^{33} of that word. If the two words U, W are formed from shorter ones U0, W0 by adding letters at the end, then the distance Dist(U,W) is the smallest of the numbers Dist(U0,W)+1 (deletion), Dist(U,W0)+1 (insertion), and Dist(U0,W0) increased by 1 when the added letters differ (replacement); a transposition of the two final letters is also allowed at cost 1.
Here is a complete Visual Basic listing:
Function Dist (U$, V$) As Integer
'Returns the edit distance
'(number of elementary changes: replacement, deletion, insertion, transposition)
'that are required to transform word U$ into V$
'
'Declare auxiliary variables
Dim m As Integer, n As Integer, j As Integer, i As Integer
Dim x As Integer, y As Integer, z As Integer, A$, B$
If Len(U$) < Len(V$) Then A$ = U$: B$ = V$: Else A$ = V$: B$ = U$
m = Len(A$)
n = Len(B$)
'Declare matrix of distances between substrings of i,j characters
ReDim D(m, n) As Integer
'Assign boundary values: distances from empty string
For i = 0 To m: D(i, 0) = i: Next i
For j = 1 To n: D(0, j) = j: Next j
'
'Main recurrence: Compute next distance D(i,j) from previously found values
For i = 1 To m
  For j = 1 To n
    x = D(i - 1, j) + 1 'delete character
    y = D(i, j - 1) + 1 'insert character
    x = Intmin(x, y) 'choose better (Integer Minimum)
    y = D(i - 1, j - 1) - (Mid$(A$, i, 1) <> Mid$(B$, j, 1)) 'replace character
    x = Intmin(x, y) 'choose better
    z = 0
    If i > 1 And j > 1 Then
      z = (Mid$(A$, i, 1) = Mid$(B$, j - 1, 1)) * (Mid$(A$, i - 1, 1) = Mid$(B$, j, 1)) 'swap characters i,j
      y = (1 + D(i - 2, j - 2)) * z + x * (1 - z)
      x = Intmin(x, y)
    End If
    D(i, j) = x
  Next j
Next i
Dist = D(m, n)
End Function
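For readers not using BASIC, here is an equivalent sketch of the same recurrence in Python (like the listing above, it counts replacement, deletion, insertion, and adjacent transposition, each at cost 1):

```python
def dist(u, v):
    """Edit distance with replacement, deletion, insertion and transposition."""
    m, n = len(u), len(v)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                       # delete everything
    for j in range(n + 1):
        D[0][j] = j                       # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(D[i - 1][j] + 1,   # deletion
                       D[i][j - 1] + 1,   # insertion
                       D[i - 1][j - 1] + (u[i - 1] != v[j - 1]))  # replacement
            if (i > 1 and j > 1 and u[i - 1] == v[j - 2]
                    and u[i - 2] == v[j - 1]):
                best = min(best, D[i - 2][j - 2] + 1)  # transposition
            D[i][j] = best
    return D[m][n]

print(dist("abbacd", "babacd"))  # 1: a single transposition suffices
```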
The main problem with using the edit distance to analyze DNA sequences is processing time. (Speedups are known.)
Exercise 40 Write a program computing the edit distance between strings, and another one which does the editing by simulation. (Try a randomization based on a random number of edit operations applied to the currently best candidate.)
A cell in its growth goes through several phases, which have different probabilistic characteristics. In a simple model, the cell doubles after a random time, which is the sum of the exponential and deterministic portion. The average of the exponential phase can be assumed to depend on the external circumstances.
For a fixed vector y we thus have the joint probability mass function f(x) = Pr(X = x), with values p_j = Pr(X = x_j). The average information H(X) contained in X is defined as
H(X) = −Σ_j p_j log p_j.  (108)
Problem 77 Prove Gibbs' inequality: Σ_j p_j log p_j ≥ Σ_j p_j log q_j for all q_j > 0 with Σ_j q_j = 1.
Notice that H(X) ≥ 0 and H(X) ≤ log k, where k is the number of values of X.
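A direct Python transcription of definition (108), using the natural logarithm:

```python
from math import log

def entropy(p):
    """H(X) = -sum_j p_j log p_j (natural log), with the convention 0 log 0 = 0."""
    return -sum(pj * log(pj) for pj in p if pj > 0)

print(entropy([0.5, 0.5]))   # log 2: one fair coin flip
print(entropy([1.0]))        # 0: a constant carries no information
# the uniform distribution on k values attains the maximum H = log k
```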
Problem 78 Suppose you have 15 coins identical in appearance, except that one of them has a different weight. Find the optimal^{34} weighing strategy to identify the odd coin using a scale.
Modeling of the spread of disease is complicated due to multiple factors that influence its development. The birthanddeath process does not seem to be a good model for the spread of an epidemic in a finite population, since when a large proportion of the population has been infected, we cannot suppose that the rate of infections is independent of past history.
The L_p norm is
‖X‖_p = (E|X|^p)^{1/p}.
Notice that ‖X − EX‖_2 is just the standard deviation.
We say that X_n converges to X in L_p if ‖X_n − X‖_p → 0 as n → ∞. If X_n converges to X in L_2, we shall also say that the sequence X_n converges to X in mean-square. An example of the latter is Theorem 5.2.
Several useful inequalities are collected in the following.
Theorem 43
Pr(|X| ≥ t) ≤ E|X|^p/t^p for all t > 0, p ≥ 1 (Chebyshev's inequality);  (109)
E|XY| ≤ ‖X‖_p ‖Y‖_q for 1 < p < ∞, 1/p + 1/q = 1 (Hölder's inequality);  (110)
‖X+Y‖_p ≤ ‖X‖_p + ‖Y‖_p for p ≥ 1 (Minkowski's inequality).  (111)
The special case p = q = 2 of Hölder's inequality (110) reads E|XY| ≤ √(EX²·EY²). It is frequently used and is known as the Cauchy-Schwarz inequality.
For the proof of Theorem A.1 we need the following elementary inequality.
Lemma For a, b > 0, 1 < p < ∞ and 1/p + 1/q = 1 we have
ab ≤ a^p/p + b^q/q.  (112)
Proof. The function t ↦ t^p/p + t^{−q}/q has derivative t^{p−1} − t^{−q−1}. The derivative is negative for 0 < t < 1 and positive for t > 1; hence the minimum value of the function for t > 0 is attained at t = 1, giving
t^p/p + t^{−q}/q ≥ 1/p + 1/q = 1.
Substituting t = a^{1/q} b^{−1/p} and multiplying both sides by ab gives (112).
The next theorem lists useful properties of conditional expectations.
Remark 4 Inequality (iv) is known as Jensen's inequality and this is how we shall refer to it.
The abstract proof uses the following^{35}.
Lemma If Y_1 and Y_2 are ℱ-measurable and ∫_A Y_1 dP ≤ ∫_A Y_2 dP for all A ∈ ℱ, then Y_1 ≤ Y_2 almost surely. If ∫_A Y_1 dP = ∫_A Y_2 dP for all A ∈ ℱ, then Y_1 = Y_2.
Proof. Let A_ε = {Y_1 > Y_2 + ε} ∈ ℱ. Since ∫_{A_ε} Y_1 dP ≥ ∫_{A_ε} Y_2 dP + εP(A_ε), P(A_ε) > 0 is impossible. The event {Y_1 > Y_2} is the countable union of the events A_ε (with ε rational); thus it has probability 0 and Y_1 ≤ Y_2 with probability one.
The second part follows from the first by symmetry.
□
Proof of Theorem A.2.
(i) This is verified first for Y = I_B (the indicator function of an event B ∈ ℱ). Let Y_1 = E{XY|ℱ}, Y_2 = YE{X|ℱ}. From the definition one can easily see that both ∫_A Y_1 dP and ∫_A Y_2 dP are equal to ∫_{A∩B} X dP. Therefore Y_1 = Y_2 by Lemma A.2.
For the general case, approximate Y by simple random variables and use (vi).
(ii) This follows from Lemma A.2: the random variables Y_1 = E{X|ℱ}, Y_2 = E{X|𝒢} are 𝒢-measurable, and for A in 𝒢 both ∫_A Y_1 dP and ∫_A Y_2 dP are equal to ∫_A X dP.
(iii) Let Y_1 = E{X|𝒩∨ℱ}, Y_2 = E{X|ℱ}. We check first that

(iv) Here we need the first part of Lemma A.2. We also need to know that each convex function g(x) can be written as the supremum of a family of affine functions f_{a,b}(x) = ax + b. Let Y_1 = E{g(X)|ℱ}, Y_2 = f_{a,b}(E{X|ℱ}), A ∈ ℱ. By (vi) we have

(v), (vi), (vii) These proofs are left as exercises.
□
Problem 79 Prove part (v) of Theorem A.2.
Problem 80 Prove part (vi) of Theorem A.2.
Problem 81 Prove part (vii) of Theorem A.2.
Problem 82 Prove the following conditional version of Chebyshev's inequality: if E|X| < ∞, then
Pr(|X| > t | ℱ) ≤ E{|X| | ℱ}/t for all t > 0.
Problem 83 Show that if (U, V, X) are such that (U,X) ≅ (V,X) in distribution, then E{U|X} = E{V|X} almost surely.
Problem 84 Show that if X, Y are integrable nondegenerate random variables, such that

Problem 85 Show that if X, Y are integrable and such that E{X|Y} = Y and E{Y|X} = X, then X = Y a. s.
 (113) 

They give immediately the expansions

In particular, for x > 0 we have ln(1+x) < x and ln(1+x) > x − x²/2.
Suppose x = x(u,v), y = y(u,v). The change of variables formula in the multivariate case is
∬ f(x,y) dx dy = ∬ f(x(u,v), y(u,v)) |J(u,v)| du dv,  (117)
where J(u,v) = ∂(x,y)/∂(u,v) is the Jacobian determinant of the transformation.
The solution of the linear differential equation y′ + ay = g(x) with y(0) = y_0 is given by y(x) = y_0 e^{−ax} + e^{−ax}∫_0^x e^{at} g(t) dt.
The second-order linear equation y″ + ay′ + by = g(x) is often solved by the method of variation of constants. The first step is to solve the homogeneous equation y″ + ay′ + by = 0 using y = e^{rx}. The general solution of the homogeneous equation is y = C_1e^{r_1x} + C_2e^{r_2x}, or y = C_1e^{rx} + C_2xe^{rx} if the characteristic equation has a double root.
Definition 27 A scalar product of vectors in V is a bilinear, positive definite, nondegenerate mapping ⟨·,·⟩: V×V → ℝ.
Definition 28 Vectors x, y are orthogonal with respect to the scalar product ⟨·,·⟩ if ⟨x,y⟩ = 0.
The length of a vector is ‖x‖ = √⟨x,x⟩. Orthogonal vectors of unit length are called orthonormal. If the e_j are orthonormal vectors and x is in their linear span, then the coefficients in the expansion x = Σ_j a_j e_j are given by a_j = ⟨x,e_j⟩.
Example 83 Let V be the vector space of all continuous functions on the interval [−π,π]. In the scalar product ⟨f,g⟩ = ∫_{−π}^{π} f(t)g(t) dt the functions {sin kt}_{k∈ℕ}, {cos kt}_{k∈ℕ}, 1 are orthogonal.

This implies that A^n = a_0(n)I + a_1(n)A + … + a_{d−1}(n)A^{d−1}, where x^n = D(x)Q(x) + a_0(n) + a_1(n)x + … + a_{d−1}(n)x^{d−1} and D(x) is the characteristic polynomial. If A has d distinct eigenvalues λ_j, the coefficients a_j(n) can be found by solving the system of equations λ_j^n = a_0(n) + a_1(n)λ_j + … + a_{d−1}(n)λ_j^{d−1}.
A quick method that finds the characteristic polynomial, due to K..., is to solve a system of linear equations for the coefficients: pick a vector X at random and solve^{37} for a_0, …, a_{d−1} the equations a_0X + a_1AX + … + a_{d−1}A^{d−1}X = −A^dX. If the resulting matrix is singular, resample X until a nonsingular matrix is found.
To approximate the integral ò_{a}^{b} f(x) dx in calculus we use left and right sums: L_{n} = [(ba)/ n]å_{k = 0}^{n1} f(a+[(ba)/ n]k), R_{n} = [(ba)/ n]å_{k = 1}^{n} f(a+[(ba)/ n]k)
The more exact trapezoidal rule uses
∫_{a}^{b} f(x) dx ≈ S_{n} = (L_{n} + R_{n})/2.
A still more powerful and easy-to-program integration method is the Simpson rule: ∫_{a}^{b} f(x) dx ≈ (4/3)S_{2n} − (1/3)S_{n}. The Simpson rule is exact for cubic polynomials. A typical way to program it is to call the subroutine performing trapezoidal integration twice. Many calculators have the Simpson rule built in. Before you use it, be sure to check if it is reliable enough. A simple test that catches some poorly written routines is ∫_{0}^{500} e^{−x^{2}} dx ≈ √π/2.
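A short BASIC sketch of ours (the names f and Trap are our choices, not from the text) shows the trapezoidal subroutine called twice to form the Simpson combination:

```basic
' Sketch: trapezoidal sums S_n and the Simpson combination (4/3)S_{2n} - (1/3)S_n
n = 100
PRINT "Simpson:"; 4 / 3 * Trap(0, 3, 2 * n) - 1 / 3 * Trap(0, 3, n)

FUNCTION f (x)
'**** the integrand
f = EXP(-x * x)
END FUNCTION

FUNCTION Trap (a, b, n)
'**** trapezoidal rule S_n = (L_n + R_n)/2 with n subintervals
h = (b - a) / n
s = (f(a) + f(b)) / 2
FOR k = 1 TO n - 1
  s = s + f(a + h * k)
NEXT k
Trap = h * s
END FUNCTION
```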
The fast and powerful bisection method requires correct endpoints and finds one solution only. But it makes virtually no assumptions except continuity: if f(x) is continuous and f(a) < 0, f(b) > 0, then there is a < x < b such that f(x) = 0, and the interval length can be halved by computing f((a+b)/2).
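A minimal BASIC sketch of bisection (our example, not from the text), with the integrand written inline:

```basic
' Bisection sketch: solve f(x) = 0 on [a, b] with f(a) < 0 < f(b);
' here f(x) = x^2 - 2, so the root is SQR(2)
a = 0: b = 2
WHILE b - a > .000001
  m = (a + b) / 2
  IF m * m - 2 < 0 THEN a = m ELSE b = m
WEND
PRINT "Root is near"; (a + b) / 2
```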
Three difficulties arise in real applications.
The second point has a satisfactory answer for polynomial equations. The third point can be tackled through a search for a minimum. Namely, if the equations are f(x,y) = 0, g(x,y) = 0, then we need to find the minima of f^{2}(x,y) + g^{2}(x,y).
Any PC running DOS also comes with BASIC. You can start it with the command
QBASIC
from the DOS command line, or from the Windows File Manager (click on the entry QBASIC.EXE). If you are a devoted Windows user, you may install an icon in Program Manager to run BASIC with a click of the mouse.
Correctly written programs halt at the end. But not all programs do that, so an ``emergency exit" is built in.
To stop the execution of any program, press simultaneously Ctrl+Break.
There is no possibility that by running these programs you will do any harm to your equipment. Experiment, and if something goes really wrong, you can always turn off or restart the computer.
The one-line program PRINT LOG(1 + SQR(2)) will print the decimal value of the expression ln(1+√2). This probably is the simplest program to begin with. It is so short that there is no point in saving it.
The real power comes from repetitive operations explained in Section .
BASIC programs are actually text files. The instructions are read consecutively by the BASIC interpreter. The QBASIC interpreter that comes with DOS is sufficient for our purposes. The sample programs below introduce certain more exotic built-in functions. Example 85 The following simple program prints the current date & time
PRINT "HELLO"
PRINT "Today is "; DATE$
PRINT TIME$
Besides the actual code, programs should contain comments with explanations of instructions and their purpose. Comments and explanations in the program can be hidden from the processor through the REM command. Everything in a line after this command is skipped (the only exception being metacommands, which we don't have to concern ourselves with here). The sample^{39} programs use the shortcut ' instead of the full command REM. This is faster to type and equivalent in action. Example 86 Here is the previous program with comments
'Prints current date & time
PRINT "HELLO" 'greet the user
PRINT "Today is "; DATE$ 'print date
PRINT TIME$ ' print time; could have printed TIMER = number of seconds since midnight
The typical program consists of the main module with fairly general instructions that call subroutines to perform specific jobs, and a number of subprograms that are marked in text by SUB ... END SUB, and are displayed separately by the QBASIC Editor. Subprograms make it easier to write, test, and maintain the program.
Within QBASIC the SUBs are separated, each in its own text window, but the final program is saved as one file with the SUBs following the main module. To create a SUB, choose New SUB from the Edit menu. More details follow in Section .
Larger projects often use several modules^{40} that are compiled and run together.
Example 87 Here is the revision of the previous program with a simple but useful SUB.
'Prints current date & time in the middle of the screen
CLS 'clear display
CenterPrint "HELLO" 'greet the user
CenterPrint "Today is " + DATE$ 'print date - strings are concatenated with +
CenterPrint TIME$ ' print time; could have printed TIMER = number of seconds since midnight
SUB CenterPrint (Text$)
'**** print text Text$ centered on screen
l = 41 - LEN(Text$) / 2
IF l < 1 THEN l = 1 'too long a text cannot be centered
PRINT TAB(l); Text$
END SUB
New things: CLS, concatenation of strings, calling SUB, screen positioning by TAB, LEN("hello")
Every subprogram should begin with a (commented) short description of its purpose, and the meaning of parameters.
Repetition is programmed with the FOR ... NEXT loop, as in FOR j=1 TO 10 STEP .5 ... NEXT j.
STEP is optional; if none is given, then 1 is used by default.
Example 88 Program
FOR j=1 TO 100
S=S+j
NEXT J
PRINT S
computes and prints the total of the first 100 integers (5050).
Conditional action is accomplished by
IF ... THEN ... ELSE ... END IF
END IF is required only when multiple lines of instructions are to be executed.
Selection from several cases is perhaps easiest through
SELECT CASE var ...
CASE 0 ...
CASE .5 ...
CASE ELSE
END SELECT
Conditional loops (ones that run indefinitely, unless a special circumstance is encountered) can be programmed through
WHILE cond ... WEND
or through
DO
....
IF cond THEN EXIT DO
...
LOOP
Other ways of breaking out of DO ... LOOP are available, too.
Example 89 The following program illustrates several conditional instructions. It finds the first max = 30000 consecutive prime numbers larger than n = 1.
n = 1: max = 30000: nc = n
k = 0
WHILE k < max
  DO
    nc = nc + 1
    prime = 1 'TRUE
    FOR j = 2 TO SQR(nc)
      IF (nc MOD j) = 0 THEN prime = 0: EXIT FOR
    NEXT j
    IF prime THEN EXIT DO
  LOOP
  k = k + 1
  PRINT k; "th prime is "; nc
WEND
Problem 41 Write a program solving recurrence (67).
To stop the execution of any program, press simultaneously Ctrl+Break.
To have the user supply a value for the variable X, write INPUT "Real number="; X
Note: the instruction INPUT stops the program. The user has to hit Enter for the program to continue.
To scan for a key pressed by the user without stopping the program, use the INKEY$ function instead of INPUT. This allows the user to control the program through a simple menu. For example, embed the following lines within DO ... LOOP and stop the program by pressing the Q key.
Key$=INKEY$
SELECT CASE Key$
CASE "Q"
END 'stop the program
CASE "q"
END 'stop the program
CASE "?"
PRINT "Did you ask for help?"
CASE ELSE
BEEP
END SELECT
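Embedded in a loop, the fragment becomes a complete program (a sketch of ours). One detail we add: INKEY$ returns the empty string when no key is waiting, so a CASE "" branch is needed to keep the program from beeping continuously:

```basic
DO
  Key$ = INKEY$
  SELECT CASE Key$
    CASE ""
      ' no key pressed - keep looping quietly
    CASE "Q", "q"
      END 'stop the program
    CASE "?"
      PRINT "Did you ask for help?"
    CASE ELSE
      BEEP
  END SELECT
LOOP
```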
A simple way to let the user know that something went wrong is to BEEP. Problem 42 Write a program that complains (beeps) when the user presses any key.
Problem 43 Write a program that will type the text provided by the user from the keyboard to the screen in uppercase, regardless of what the user typed. (Hint: The function T2$=UCASE$(T1$) converts to uppercase.)
To print to the printer, use LPRINT instead of PRINT. When printing to the printer, end every printing job with LPRINT CHR$(12) to force the page out of the printer.
QBASIC provides sophisticated ways of controlling the text output by format, colors, and location on the screen. In addition it has graphics statements, as long as the computer has a graphics card. But the only thing we need is the regular
PRINT "New value is X=";X
which outputs the string (in quotation marks) followed by the value, e.g. New value is X= 1.234
Occasionally we may want to do minimal ``formatting" through the semicolon, or TAB().
PRINT ".";
To see the effect of the semicolon, run the above statement in a loop^{41}. Then delete the semicolon, and run it again.
For professional formatting of output, look up the instruction PRINT USING "###.##" in any BASIC textbook (your public library is a good source!).
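For instance (a quick illustration of ours), the # characters reserve digit positions:

```basic
PRINT USING "###.##"; 3.14159   'prints   3.14
PRINT USING "###.##"; 100.5     'prints 100.50
```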
The elementary data types (integers, long integers, single and double precision numbers, and strings) can be combined into more complex user-defined data structures through the TYPE declaration.
If variables are not declared and the name isn't appended with a symbol indicating the type, the variables are treated as single precision. This is admissible^{42} in small programs where speed and memory aren't of major concern.
User defined functions are listed as separate windows within QBASIC. Select View to switch between various functions.
Each user-defined function starts with FUNCTION FunctName (x, y, z) and ends with END FUNCTION. The code between these two lines is executed whenever the function is invoked from the main program, from another function or SUB, or from itself. FunctName is a name for your function (choose a descriptive one). The arguments (x, y, z) are the variables passed to the function.
Example 90 The following function solves the linear equation ax + b = 0
FUNCTION Solution(a, b)
Solution = -b / a
END FUNCTION
Problem 44 Write a function that solves the quadratic equation a x^{2} + bx + c = 0.
There are just a couple of things to remember when making functions:
Once a function is created, you can use it from within a program in the same way as the built-in functions. The name of the function carries its value, e.g. x=FunctName(23)+FunctName(24). Notice that the function can also change the values of the variables in its arguments, and perform any other tasks; this behavior is like that of any program.
SUBs act like functions, except that the name doesn't carry ``value" and the only change (if any) is to the variables passed. To invoke SUB from the program, use CALL SubName(Parameters). This method is recommended as it is more resistant to typographical errors.
You can pass a whole array as an argument of a function. Arrays are recognized by the parentheses: FuncName(A()) is a function of the array A(). Problem 45 Write functions SimulateBinomial(p,n), SimulateNormal(n,sigma), SimulateExponential(m) that simulate a single instance of the Binomial, Normal, and Exponential distribution with the given parameters.
Graphic commands are available only on computers with a graphics card (all PC's that run Windows have a graphics card).
It is nice to be able to make simple graphs, but the full topic is beyond the scope of this introduction. Program LIMTHS.BAS draws lines and boxes of various colors, in the display rectangle covering a portion of the screen.
If you are seriously interested in graphing the results of your computations, and printing the outcomes, you may want to switch to Visual Basic in Windows.
Long integers can be used to represent subsets of an n-element set, for n up to n = 31 (Why?). A single number k corresponds to the long integer 2^{k}, the set {j, k} is represented by 2^{j} + 2^{k}, etc.
If S, T are integers representing sets, then S AND T represents the intersection of sets, S OR T is the union, and S AND NOT T is the difference of sets.
To check if j ∈ S, verify that 2^{j} AND S is nonzero.
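These operations can be tried out directly (an illustration of ours; the variable names are arbitrary):

```basic
' Sets as bit patterns: S = {0,2}, T = {1,2}
S& = 2 ^ 0 + 2 ^ 2          'S& = 5
T& = 2 ^ 1 + 2 ^ 2          'T& = 6
PRINT S& AND T&             'intersection {2}: prints 4
PRINT S& OR T&              'union {0,1,2}: prints 7
PRINT S& AND NOT T&         'difference {0}: prints 1
IF (2 ^ 2 AND S&) <> 0 THEN PRINT "2 is in S"
```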
If you want to save your printout to a file, print to an ASCII file sequentially. This is slower and less versatile than binary access, but easier to master.
The syntax is quite rigid. The following example contains the basic idea:
OPEN FileName FOR OUTPUT as #17
PRINT #17, "HELLO"
CLOSE #17
You can print text, numbers, etc., to a file. If a file with the same name as the output file already exists, it is replaced by the new one (overwritten). To add rows to an existing file without losing its contents use OPEN FileName FOR APPEND AS #17
You can read the output files back into a program with the corresponding INPUT # statements. Beware that the file has to have the format expected by INPUT #.
The following simple program reads entries from the file FileName and prints them onto the screen, one by one.
OPEN FileName FOR INPUT AS #17
WHILE NOT EOF(17)
INPUT #17, H$
PRINT H$
WEND
CLOSE #17
Change INPUT #17, H$ to LINE INPUT #17, H$ to get full lines of text rather than single words.
If you wrote a program that uses GOTO statement(s), it is a good exercise to rewrite it without a single GOTO instruction!
Modern BASIC is a structured language. The objective of this example is to show how various features of BASIC interplay in the design of a card-dealing application.
Here is the description of the situation.
A deck of cards consists of 52 cards. Each card has two attributes:
 Suit (hearts, spades, diamonds, and clubs)
 Value (1 through 13)
Card games require shuffling the deck and dealing some number of cards from the top of the deck. The objective of this example is to write a program that will deal the requested number of cards twice. That is, the first player will get as many cards as she requests, and then the second player will get as many cards as he requests (from the rest of the deck!).
'PROGRAM: GIVECARD.BAS
InitializeDeck
ShuffleDeck
'ask first player
n=HowManyCards
DealCards(n)
'ask second player
n=HowManyCards
DealCards(n)
What remains is only to decide what each step should do, and how the information about the cards will be stored. Since storing the information determines how it is passed between subprograms, we begin with determining this part.
We may want to use an array Deck(52) of ``user defined" type. The advantage of this approach is that we may modify the information we ``store" with each card with minimal changes in the program itself.
TYPE Cards
Suit AS STRING * 8 ' fixed-length string required inside TYPE
Value AS INTEGER ' we may want a string here, too!
END TYPE
Afterwards we may declare two shared arrays
Dim Shared Deck(52) as cards
DIM Shared Order(52)
Shared means that every SUB in the program can access the values of Order(j) and Deck(j). The first card dealt will be Deck(Order(1)). The definition of the type says that its value is Deck(Order(1)).Value and its suit is Deck(Order(1)).Suit.
The simplest way to begin designing SUBS is to describe the purpose of each function/SUB with no code.
SUB InitializeDeck
'Initialize cards to their values.
'
END SUB
The next routine is perhaps not easy for a beginner to write, but we have a good example in the book.
SUB ShuffleDeck
'Make a random permutation
'Store it in shared array Order()
'Order(1), ..., Order(52) are distinct random numbers in the range 1, ..., 52
END SUB
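One standard way to fill Order() with a random permutation (a sketch of ours, not necessarily the version in the book) swaps each position with a randomly chosen earlier one:

```basic
SUB ShuffleDeck
'**** fill the shared array Order() with a random permutation of 1..52
FOR j = 1 TO 52
  Order(j) = j
NEXT j
FOR j = 52 TO 2 STEP -1
  k = INT(RND * j) + 1    'random index in 1..j
  SWAP Order(j), Order(k)
NEXT j
END SUB
```

Remember RANDOMIZE TIMER at the start of the program, or the same ``random" permutation will come up in every run.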
The next routine interacts with the user. User interaction should ALWAYS be implemented as a separate SUB. As straightforward as it seems, reliable coding requires extensive checking for errors resulting from unpredicted user reactions.
FUNCTION HowManyCards
' Ask user how many cards (s)he requests
' store in variable.
' Check for possible errors in input
' return value if enough cards are left.
END FUNCTION
The following routine is in charge of giving out cards. Since the order of cards was already determined, it seems straightforward. But again complications arise, and this portion will be much easier to handle if coded as a separate SUB.
SUB DealCards(n)
' Remember how many cards are left
' Check how many cards are left
' Print out Error message if not enough cards
'Print next n cards
' You have to decide here HOW the cards will appear on screen:
'words? pictures? numbers?
' (In this example, it will be numbers)
END SUB
Rather than beginning to code the actual functions, we may want to double-check that the ``flow" of our program is as we expect. We may write ``dummy" versions of the more difficult parts of the program and test its operation. Only after we are sure that the program behaves as expected do we invest more time into coding the more difficult parts.
Here is a test program. It was produced from the previously described code; all newly added parts are clearly marked so that they can be removed once not needed.
TYPE Cards
'***test***
REM Suit AS STRING * 8
Suit AS INTEGER
'*** end test ***
Value AS INTEGER ' we may want a string here, too!
END TYPE
Dim Shared Deck(52) as cards
DIM Shared Order(52)
InitializeDeck
ShuffleDeck
'ask first player
n=HowManyCards
DealCards(n)
'ask second player
n=HowManyCards
DealCards(n)
SUB InitializeDeck
'Initialize cards to their values.
'*** test ***
for j = 1 to 52
Deck(j).value=j mod 13 +1
Deck(j).suit=j mod 4 +1
next j
'*** end test ***
END SUB
SUB ShuffleDeck
'Make a random permutation
'Store it in Order()
'Order(1), ..., Order(52) are distinct random numbers in the range 1, ..., 52
'*** test ***
'Factory order
For j= 1 to 52
Order(j)=j
NEXT j
'*** end test ***
END SUB
FUNCTION HowManyCards
' Ask user how many cards (s)he requests
' store in variable.
' Check how many cards are left
' Print out Error message if not enough cards
' return value if enough cards are left.
'*** test ***
INPUT "How many cards"; x
' should check for "crazy" answers here
HowManyCards=x
'*** end test ***
END FUNCTION
SUB DealCards(n)
' Remember how many cards are left
'Print next n cards
'*** test ***
' just print first n cards for now
Print "Your cards are:"
FOR j=1 TO n
Print Deck(Order(j)).Value ; " of suit No" ;Deck(Order(j)).Suit
NEXT j
'*** end test ***
END SUB
With this ``skeleton" program we can check the following things:
fractional numbers will go through, but strings at least will be stopped.
^{1} Disjoint, or exclusive, events A, B ⊂ Ω are such that A ∩ B = ∅.
^{2} The price for flexibility is that now we have to decide how to choose p_{n}.
^{3} Similar instruction on TI85 is Display rand.
^{4} In BASIC, to avoid repetitions use instruction RANDOMIZE TIMER at the beginning of a simulation program.
^{5} You need to find out how to handle graphic in BASIC. Otherwise, make a table of values instead.
^{6} A related method is brute force: checking all possible cases.
^{7} If you want to include more cities, you have to type the distances in a suitable format. If you embark on this project, try first to implement a method for selecting an arbitrary subset of cities to visit.
^{8} Trivial events are those with probabilities 0, or 1.
^{9} This terminology arises under the interpretation that X represents failure time.
^{10} A possible heuristic argument is that Tp is on average 1.
^{11} The precise definition is in Section 1.6.1.
^{12} Actually, we need only a right-inverse, i.e. a function such that F(f(u)) = u.
^{13} Provided that limited accuracy is admissible.
^{14} Theorem gives a more general statement.
^{15} See e.g. W. Feller, An Introduction to Probability Theory, Vol. II, Wiley, New York 1966.
^{16} For the multivariate analog, see Theorem
^{17} J. F. W. Herschel, Quetelet on Probabilities, Edinburgh Rev. 92 (1850) pp. 157
^{18} J. C. Maxwell, Illustrations of the Dynamical Theory of Gases, Phil. Mag. 19 (1860), pp. 19–32. Reprinted in The Scientific Papers of James Clerk Maxwell, Vol. I, edited by W. D. Niven, Cambridge University Press 1890, pp. 377–409.
^{19} This is all we can hope for in the finite computer arithmetics. Whenever U_{j} Î (0,1) is selected and finite approximation is chosen, N_{j} = NU_{j} is an integer, where N = 2^{b} is the characteristic of the operating system.
^{20} Actually, we need only a right-inverse, i.e. a function such that F(f(u)) = u.
^{21} Why variance?
^{22} The smallest possible variance is 0! Why doesn't this happen in the actual application??
^{23} See J. S. Sadowski, IEEE Trans. Inf. Th. IT-39 (1993) pp. 119–128, and the references therein.
^{24} Such as y_{0} = 0.
^{25} Following good mathematical practice, we simplify the real world and assume that a single mature rabbit will produce one offspring every month.
^{26} Most likely, you know the solution to this one, but the method is useful for other problems as well.
^{27} Clearly, we request y_{0} = 0.
^{28} Notice that this is a mathematical simplification that might be not worth pursuing if the actual state space has some convenient interpretation.
^{29} Source: football in A. Hornby, Oxford Advanced Learner's Dictionary of Current English , Oxford University Press 1974
^{30} Variable aliasing is the mistake of assigning the same location to two or more variables in a program.
^{31} This is a very imprecise statement!
^{32} A martingale with respect to a family of increasing σ-fields F_{n} is an integrable sequence X_{n} such that E(X_{n+1}|F_{n}) = X_{n}. The sequence X_{n} = E(X|F_{n}) is a martingale. The sequence in the theorem is of the same form, except that the σ-fields are decreasing rather than increasing.
^{33} Only deletions are required.
^{34} That is, find the strategy that costs the least if you have to pay for each use of the scale.
^{35} Readers not familiar with measure theory should skip the proofs.
^{36} See Theorem 3.3 in P. Billingsley, Probability and Measure, Wiley, New York 1986.
^{37} Use Gaussian elimination.
^{38} This is the golden section. The explanation of why it occurs here can be found in numerical methods books.
^{39} The actual sample programs on the disk also contain commented LaTeX typesetting instructions at the beginning and at the end. Their sole purpose is to insert the listings into this text.
^{40} For example, in many applied math programs there is a need for integration routines. These perhaps would be collected in a separate module to facilitate repeated usage.
^{41} FOR j=1 TO 1000
PRINT ".";
NEXT J
^{42} Use DEFINT I-L to override the default and force all variables with names that begin with I through L to be of Integer type.