[section]
This course is aimed at students in applied fields and assumes a prerequisite of calculus. The goal is a working knowledge of the concepts and uses of modern probability theory. A significant part of such a ``working knowledge" in modern applications of mathematics is computer-dependent.
The course contains mathematical problems and computer exercises. Students will be expected to write and execute programs pertinent to the material of the course. No programming experience is necessary or assumed. But the willingness to accept a computer as a tool is a requirement.
For novices, BASIC programming language, (QBASIC in DOS, Visual Basic in Windows, or BASIC on Macintosh) is recommended. BASIC is perhaps the easiest programming language to learn, and the first programming language is always the hardest to pick.
Programs in QBASIC 4.5 on IBM-compatible PC and, to a lesser extend, programs on Texas Instrument Programmable calculator TI-85, and WindowsTM programs in Microsoft Visual Basic 3.0 are supported. This means that I will attempt to answer technical questions and provide examples. Other programming languages (SAS, C, C++, Fortran, Cobol, Assembler, Mathematica, LISP, TEX (!), Excel, etc.) can be used, but I will not be able to help with the technical issues.
Moment generating functions, limit theorems, characteristic functions. Stochastic processes: random walks, Markov sequences, the Poisson process.
Multivariate normal distribution. Gaussian processes, white noise. Conditional expectations. Fourier expansions, time series.
This text is available through the Internet in PDF, PostScript, or DVI formats. A number of other math related resources can be found on WWW. Also available are supporting BASIC program files.
Auxiliary textbooks:
This textbook is copyrighted © W. Bryc 1996. All rights reserved. GTDR.Academic users are granted the right to make copies for their personal use, and for distribution to their students in academic year 1995/1996.
Reproduction in any form by commercial publishers, including textbook publishers, requires written permission from the copyright holder.
Definition of a Tree: A tree is a woody plant with erect perennial trunk of at least 3.5 inches (7.5 centimeters) in diameter at breast height (41/2 feet or 1.3 meters), a definitely formed crown of foliage, and a height of at least 13 feet (4 meters).
The Auborn Society Field Guide to North American Trees.
This chapter introduces fundamental concepts of probability theory; events, and their chances. For the readers who are familiar with elementary probability it may be refreshing to see the computer used for counting elementary events, and randomization used to solve a deterministic optimization problem.
The questions are
To begin with, we start with a truism. Real world is complicated, often to a larger degree than scientists will readily admit. Most real phenomena have multi-aspect form, and can be approached in multiple ways. Various questions can be asked. Even seemingly the same question can be answered on many incompatible levels. For instance, the generic question about dinosaur extinction has the following variants.
Theses questions are ordered from the most individual level to the most abstract. The reader should be aware that probability theory, and stochastic modelling deal only with the most abstract levels of the question. Thus, a stochastic model may perhaps shed some light whether dinosaurs had to go extinct, or whether mammals will go extinct, but it wouldn't go into details of which comet had to be responsible for dinosaurs, or which is the one that will be responsible for the extinction of mammals.
It isn't our contention that individual questions have no merit. They do, and perhaps they are as important as the general theories. For example, a detective investigating the cause of a mysterious death of a young woman, will have little interest in the ``abstract statistical fact" that all humans eventually die anyhow. But individual questions are as many as the trees in the forest, and we don't want to overlook the forest, either.
Probabilistic models deal with general laws, not individual histories. Their predictions are on the same level, too. To come back to our motivating example, even if a stochastic model did predict the extinction of dinosaurs (eventually), it would not say that it had to happen at the time when it actually happened, some 60 million years ago. And the more concrete a question is posed, say if we want to know when Dino the dinosaur died, the less can be extracted from the stochastic model.
On the other hand, many concepts of modern science are define in statistical, or probabilistic sense.(If you think this isn't true, ask yourself how many trees do make a forest.) Such concepts are best studied from the probabilistic perspective. An extreme view is to consider everything random, deterministic models being just approximations that work for small levels of noise.
Let \cal M be a s-field of its subsets, called the events. Events A,B Î \cal M model sentences about the outcomes of the experiment to which we want to assign probabilities. Under this interpretation, the union AÈB of events corresponds to the alternative, the intersection AÇB corresponds to the conjunction of sentences, and the complement A¢ corresponds to the negation of a sentence. For A,B Î \cal M, A\B: = AÇB¢ denotes the set-theoretical difference.
For an event A Î \cal M the probability Pr(A) is a number interpreted as the degree of certainty (in unique experiments), or asymptotic frequency of A (in repeated experiments). Probability Pr(A) is assigned to all events A Î \cal M, but it must satisfy certain requirements (axioms). A set function Pr(·) is a probability measure on (W,\cal M), if it fulfills the following conditions:
| (1) |
Probability axioms do not determine the probabilities in a unique way. The axioms provide only minimal consistency requirements, which are satisfied by many different models.
| (2) |
The uniform assignment of probability involves counting. For small sample spaces
this can be accomplished by examining all cases. Moderate size sample spaces can be
inspected by a computer program. Counting arbitrary large spaces is the domain of
combinatorics. It involves combinations, permutations, generating functions,
combinatorial identities, etc.
Short recalls
the most elementary counting techniques. Problem 1
Three identical dice are tossed. What is the probability of two of a kind?
The following BASIC program inspects all outcomes when five dice are rolled, and counts how many are ``four of a kind".
PROGRAM yahtzee.bas
'
'declare function and variables
DECLARE FUNCTION CountEq! (a!, b!, c!, d!, e!)
'prepare screen
CLS
PRINT "Listing four of a kind outcomes in Yahtzee..."
'*** go through all cases
FOR a = 1 TO 6: FOR b = 1 TO 6: FOR c = 1 TO 6: FOR d = 1 TO 6: FOR e = 1 TO 6
IF CountEq(a + 0, b + 0, c + 0, d + 0, e + 0) = 4 THEN
PRINT a; b; c; d; e; : ct = ct + 1
IF ct MOD 5 = 0 THEN PRINT : ELSE PRINT "|";
END IF
NEXT: NEXT: NEXT: NEXT: NEXT
'print result
PRINT
PRINT "Total of "; ct; " four of a kind."
FUNCTION CountEq (a, b, c, d, e)
'*** count how many of five numbers are the same
DIM x(5)
x(1) = a
x(2) = b
x(3) = c
x(4) = d
x(5) = e
max = 0
FOR j = 1 TO 5
ck = 0
FOR k = 1 TO 5
IF x(j) = x(k) THEN ck = ck + 1
NEXT k
ck = -ck
IF ck > max THEN max = ck
NEXT j
'assign value to function
CountEq = max
'END FUNCTION
The program is written explicitly for tossing five dice, and you may want to modify it to answer similar question for any number of dice, and an arbitrary k-of-a-kind question.
[ 1 Run and time YAHTZEE.BAS. Then estimate how long a similar problem would run if the question involved tossing 15 fair dice. The answer depends on your computer, and the software. Both Pascal and C-programs seem to run on my computer about 15 times faster than the (compiled) QuickBasic. ANS: Running time would take years!
Exercise 1.2.1 shows the power of old-fashioned pencil-and-paper calculation. Problem 2 Continuing Problem 1.2.1, suppose now n identical dice are tossed. What is the probability of n-1 of a kind? ANS: [5 n/( 6n-1)].
For bounded subsets W Ì IRd, put
| (3) |
Geometric probability usually involves multivariate integrals. Example 1 A point is selected from the 32 cm wide circular dartboard. The probability that it lies within the 8cm wide bullseye is [1/ 16].
Example 2 A needle of length 2l < 2 is thrown onto a paper ruled every 2 inches. The probability that the needle crosses a line is 2l/p.
Analysis of Example 1.2.2 is available on WWW.
[ 2 Test by experiment (or simulation on the computer) if the above two examples give correct answers. (Before writing a program, you may want to read Section first.)
Example 3 Two drivers arrive at an intersection between 8:00 and 8:01 every day. If they arrive within 15 seconds of each other, both cars have to stop at the stop-sign. How often do the drivers pass through the intersection without stopping?
[ 1 Continuing Example 1.2.2: What if there are three cars in this neighborhood? Four? How does the probability change with the number of cars? At what number of users a stop light should be installed?
| (4) |
| (5) |
| (6) |
| (7) |
For a finite or countable set W Ì IN and a given summable sequence of non-negative numbers an put
| (8) |
pk = [(ak)/( ån Î W an)] and rewrite (8) as Pr(A) = ån Î A pn.
Formula (8) generalizes the uniform assignment (2), which corresponds to the choice of equal weights ak = 1. At the same time it is more flexible2 and applies also to infinite sample spaces. Example 4 Let W = IN and put pk = [1/( 2k)]. Then the probability that an odd integer is selected is Pr(Odd) = åj = 0¥ 2-2j-1 = 2/3. ()
Table list the most frequently encountered discrete probability assignments.
| Name | W | Probabilities pk |
| Binomial | {0,...,n} | pk = (nk)pk(1-p)n-k |
| Poisson | ZZ\scriptscriptstyle + | pk = e-l[(lk)/ k!] |
| Geometric | IN | pk = p(1-p)k-1 |
| Equally likely outcomes | {x1,...,xk} | pk = 1/k |
Problem 3 For each of the choices of numbers pk in Table 1.1 verify that (8) indeed defines a probability measure.
The reasons behind the particular choices of the expressions for pk in Table 1.1 involve modeling.
| (9) |
The first step in a simulation is to decide how to generate ``randomness" within a deterministic algorithm in a computer program. Programming languages, and even calculators, provide a method for generating consecutive numbers from the interval (0,1) ``at random". For instance, BASIC instruction3 PRINT RND(1) returns different number at every use. We shall assume that the reader has access to a of generating uniform numbers from the unit interval (0,1). These are called pseudo-random numbers, since the program usually begin the same ``random" sequence at every run, unless special precautions4 are taken.
Once a pseudo-random number from the interval (0,1) is selected, an event that occurs with some known probability p can be simulated by verifying if { RAND(1) < p} occurs in the program. For instance, the number of heads in a toss of a 1,000 fair coins is simulated by the following BASIC program.
PROGRAM tosscoin.bas
'
'Simulating 1000 tosses of a fair coins
H = 0
FOR n = 1 TO 1000 'main loop
IF RND(1) < .5 THEN H = H + 1
NEXT n
'print final message
PRINT "Got "; H; " heads this time"
END
'
A simple method for simulating an outcome on a six-face die is to take the
integer part of a re-scaled uniformly selected number INT(1+6*RND(1)). Can
you use this to write a simulation of a roll of 5 dice? Such simulations are often
used to evaluate probabilities empirically as a substitute for a real empirical
study.
[ 3 Write the simulation (as opposed to deterministic inspection of all sample points on page pageref) to estimate how often the event ``four of a kind" occurs in a roll of five dice.
[ 2 Run the (modified) coin
tossing program in a loop and to answer the following questions.
Hint: Move the main part of TOSSCOIN.BAS into a SUB,
or a FUNCTION. This way you can easily use it without
cluttering your program with irrelevant details. (See a generic template below.)
More complicated objects are often of interest in simulations. For instance we may want to draw two cards from the deck of 52. One possible way to do it is to number the cards, and select two numbers a,b.
In more advanced exercises you may want to estimate probabilities that aren't known. In such cases it is always a good idea to run simulations of various lengths and compare the results. In this section we briefly discuss how such a simulation can be organized in a way that promotes multiple uses of the same program.
The key is organizing the programs carefully into manageable blocks of small size. Modern BASIc is a structural programming language. The generic program to study the effects of the length of simulation on its output can be written as follows
' PROGRAM Generic.bas
'Generic Simulator
'Size is simulation size varied from Min=100 to Max=10000
For Size 100 to 10000 Step 100
Simulate(Size, Result)
Print "Simulation size="; Size ; Öutput="; Result
Next Size
End
The actual simulation is performed by
SUB Simulate (SizeRequested, Result)
'Runs requested number of simulations and returns average
'Trial numbers consecutive simulations from 1 to SizeRequested
For Trial=1 to SizeRequested
SimulateOne(Score)
Result=Result+Score
Next Size
'Most simulations return averages of single trials
Result=Score/Size
End SUB
The actual modeling is performed in another SUB, which in the generic program we named SimulateOne. This SUB may be as simple as simulating a toss of a single coin
SUB SimulateONe(Outcome)
'simulate One occurrence, return numerical outcome
OutCome=0
if RND(1)<1/2 THEN Outcome=1
END SUB
Or it can be as complicated as we wish. The example below simulates a toss of five dice, and uses previously introduced function CountEq(a,b,c,d,e). four-of-a-kind.
SUB SimulateONe(Outcome)
'simulate One occurrence, return numerical outcome
d1=int(RND(1)*6+1)
d2=int(RND(1)*6+1)
d3=int(RND(1)*6+1)
d4=int(RND(1)*6+1)
d5=int(RND(1)*6+1)
IF CountEq(d1,d2,d3,d4,d5)=4 THEN Outcome=1
END SUB
[ 3 Write a blind-search program to find the maximum of a function.
Planning such a tour is easy by hand for 3-4 cities. For longer tours some help is needed. To check the performance of the blind search, you may want to know what the usual algorithms involve. A greedy method starts with the shortest distance, and then keeps going to the closest city not yet visited. Another heuristic method is to to select a pair of closest cities and accept the best (shortest) connection among those that do not complete a shorter loop, and do not introduce a spurious third connection to a city. Eventually, the disjoint pieces will form a path that often can be further improved upon inspection.
The program is longer but not at all sophisticated - it just selects paths at random. Notice that a solution to Exercise 1.4 - how to generate a random permutation - is given in one of the subprograms (which one?). The method for the latter is simple-minded - the algorithm attempts to place consecutive numbers at random spots until an empty spot is selected.
The following is the main part of the program. You can use it as a template in designing your own version of Blind search programs. The full code with SUBs is online in RANDTOUR.BAS.
'****
CLS
'**** get number of cities (no choice which) from user
LOCATE 2, 1
INPUT ; "Shortest distance between how many cities?", n
'*** initialize program
CLS
nMax = 19 ' current data size. Make sure not exceeded!
IF n > nMax THEN n = nMax
'declare arrays
DIM SHARED dist(nMax, nMax) ' matrix of distances
DIM SHARED city(nMax) AS STRING
DIM P(n), BestP(n)
'read distances
CALL AssignDistances(nMax)
'initial permutation
FOR j = 1 TO n
P(j) = j
BestP(j) = j
NEXT j
'initial length of trip
MinLen = PathLen(P())
'*** main loop
DO 'infinite loop till user stops
'count trials
no = no + 1
'*** interacting with user
'check if user pressed key to stop
k$ = INKEY$
IF k$ > "" THEN EXIT DO 'exit infinite loop
'display currrent progress
display (no)
'*** get any path
CALL GetPermutation(P())
x = PathLen(P())
IF x < MinLen THEN
'Better path found, so memorize and display
Dlen = MinLen - x
MinLen = x
'*Memorize best order and print
FOR j = 1 TO n
BestP(j) = P(j)
PRINT city(P(j)); "->";
NEXT j
'Finish printing
PRINT city(P(1))
PRINT "Best so far: "; MinLen
PRINT "Progress rate "; Dlen / (no - Slen); " miles per trial"
Slen = no
END IF
LOOP
'Print final message
CLS
PRINT "ALPHABETIC TOUR OF FIRST "; n; " CITIES in the USA"
PRINT "Blind Search Recommended Route found in "; Slen; "-th search"
FOR j = 1 TO n - 1
PRINT city(BestP(j)); "-->";
NEXT j
PRINT city(BestP(n)); "-->"; city(BestP(1))
PRINT "Total distance: "; MinLen
LOCATE 22, 40
PRINT "(Distances subject to change)"
END
The program runs in the infinite loop until it is stopped by the user. Once
stopped, the program prints the best route it found. For larger sets of cities we
may have hard time deciding when to stop it.
Here is a sample output (from the improved version, as marked in the actual code):
When you run this program on larger sets of cities, you will notice that the
program is not fast. One possible improvement in the design of this program is
to modify the randomization to be less likely to pick long paths. For instance,
you can attempt to modify paths that are known to be short, or weight the
modifications by lengths of resulting paths. Such methods are actually in use in
image restoration problems (simulated annealing),
see page .
It is possible to improve on the last aspect without complicating the program much. The idea is to make random modifications of the currently best found value. For example, in a one-dimensional maximization of a function f(x), we would do the following steps
We want to allow for the chance of checking points far away from the ``best-so-far" answer. But we don't want this to happen too often, because x0 might be rather close to the optimum. The tradeoffs are that the program will tend to follow ``direct path" to the maximum, but the danger is that it will get stuck longer in local maxima.
Improved blind search with time/state dependent randomization is actually implemented within RANDTOUR. It is commented out in the version on the disk, so that it isn't operational. To make it active, uncomment the call to ImproveBest as a replacement for GetPermutation.
Program RANDTOUR.BAS selects permutations at random only in its ``simplest" variant. Here are a few examples of problems that require selecting random permutations.
Suppose there are 7 items hidden under 12 cups, and a person is allowed to try to find them. How often all seven will be recovered by pure lack (as opposed to, say, parapsychic abilities?
SUB GetPermutationFast(P())
'Put random integers into P()
n=UBOUND(P) 'how many entries
'this loop can be omitted is we are sure that P(j) list all the numbers we want
FOR j=1 TO n
P(j)=j
next j
for j=1 to UBOUND(P) 'not n as n changes in the loop
k=INT(RND(1)*n+1)
SWAP P(n), P(k)
n=n-1
next j
[ 5
There are many incompatible measures of ``quality" of an algorithm.
One of the ``objective" criteria is the number of ``If" verifications.
Another ``objective criterion"
might be the number of calls to a function. A ``subjective" criterion, which depends on the
hardware and circumstances, is timing.
Does SUB GetPermutationFast deserve adverb Fast in its name?
[ 4
The Subset-Sum Problem is stated as follows:
Let S be a set of positive integers and let t be a positive integer. Decide of
there is a subset S¢ Ì S such that the sum of integers in S¢ is
exactly t.
The associated optimization problem is to find a subset S¢ Ì S whose
elements' sum is the largest but doesn't exceed t. This optimization problem is
NP-complete, ie it isn't known if there is a polynomial time
algorithm (polynomial in the size of S) to find S¢.
Investigate how the blind search will do on sets S selected at random,
and on sets S constructed in more regular fashion like
arithmetic progression S = {a, 2a, 3a, ...}, geometric progression S = a,a2, a3,....
To formalize this idea, suppose B is an event such that Pr(B) ¹ 0. The last condition merely says that B is an event that does have some chance of occurring. Conditional probability of event A given event B is denoted by Pr(A|B). It is defined as
Conditional probability satisfies the axioms of probability, and Pr(A|B) = 0 if A,B are disjoint. In particular, Pr(A¢|B) = 1-Pr(A|B), Pr(B|B) = 1.
The easiest way to find Pr(A|B) by simulations is to repeatedly simulate the complete experiment, discarding all the outcomes except the ones resulting in B.
[ 6
Suppose we toss repeatedly a fair coin, and the ``success" is to get heads.
Use computer simulations to find the
conditional probability that the very first trial was successful,
if 10 consecutive (and independent) trials
resulted in 8 successes.
Try to answer the same question under the condition that 50 independent trials
resulted in 40 successes.
You should notice that it takes forever to simulate events that happen rarely. Section indicates one possible way out of this difficulty.
Example 5 Suppose we have a deck of 52 cards numbered 1 through 52.
Since it isn't obvious how to simulate selecting 5 cards without replacement,
we may want to select them with replacement instead. Let A denote the event that all
five cards are different. What is the probability of A?
We may perform the experiment sequentially, drawing one card at a time.
Let Ak denote the event that the k consecutive draws resulted in different cards.
Then A = A5 Ì A4 Ì ... Ì A1. Moreover, Pr(A1) = 1.
Clearly, Pr(A2|A1) = [51/ 52], so
Pr(A2) = Pr(A2ÇA1) = Pr(A2|A1)Pr(A1) = [51/ 52]. Similarly,
Pr(A3) = Pr(A3ÇA2) = Pr(A3|A2)Pr(A2) = [50/ 52][51/ 52]. Continuing this we get
Pr(A5) = [51 50 49 48/( 524)] » 0.82.
| (10) |
Example 6 A lake has 200 white fish and a 100 black fish,
and a nearby pond contains 20
black fish and 10 white ones. No other fish live there.
A fish is selected at random from the lake and moved to the pond. Then a fish is
selected from the pond and moved back to the lake. What is the probability that
all fish in the pond are black?
Denoting by \cal P the generic path A1ÇA2Ç...ÇAn, and by \cal P(k) = Ak we have the following path integral formula for the probability of an event \cal F specifying the outcome of the complete experiment, and consisting of many paths \cal P.
| (11) |
Example 7 Suppose that the double transfer operation from Example 1.5.1 was repeated twice. That is, a random selection was done four times. What is the probability that all fish in the pond are black?
[ 8 Check by simulation how the proportion of black fish in the pond changes when the random transfers from Example 1.5.1 are performed repeatedly for a long time.
Two events A,B are independent, if the conditional probability is the same as unconditional, Pr(A|B) = Pr(A). This is stated in multiplicative form which exhibits symmetry and includes trivial events8 Definition 1 Events A, B are independent if Pr(AÇB) = Pr(A)Pr(B).
Independence captures the intuition of non-interaction, and lack of information. In modeling it is often assumed rather than verified. For instance, we shall assume that the events generated by consecutive outputs of the random generator are independent. We also assume that tosses of a coin (fair, or not!) are independent.
Beginners sometimes confuse disjoint versus independent events. Exclusive (ie. disjoint) events capture the intuition of non-compatible outcomes. Not compatible outcomes cannot happen at the same time. This is not the same as independent outcomes. If A, B are disjoint and you know that A occurred, then you do know a lot about B. Namely you know that B cannot occur. Thus there is an interaction between A and B. Knowing whether A occurred influences chances of B, which is not possible under independence.
Independence (or, more properly, mutual stochastic independence) of families of events is defined by requesting a much larger number of multiplicative conditions. The reason behind is Theorem , which provides a very convenient tool. Definition 2 Events A1, A2, ..., An are independent, if Pr(Çj Î J Aj) = Õj Î J Pr(Aj) for all finite subsets J Ì IN.
Example 8 A coin is tossed repeatedly. Find the probability that heads appears for the first time on the fourth toss.
Problem 4
SUB GetPermutation from the program RANDTOUR.BAS selects numbers between 1 and n
at random until it finds a number not yet on the list. Then it ads the number to its list, and repeats the process.
Definition 3 Let C be a non-trivial event. Events A,B are C-conditionally independent if Pr(AÇB|C) = Pr(A|C)Pr(B|C).
Random variables are introduced for convenient description of experiments with numerical outcomes. (The other option is to select W Ì IR, or W Ì IRd.) If we want to run computer simulations, we need to represent even non-numerical experiments (like tossing coins) in numerical terms anyhow. Thus the language of random variables becomes the natural extension of elementary probability theory, expressing many of the same concepts in a little different language.
A random variable is the numerical quantity assigned to every outcome of the experiment. In mathematical terms, random variable is a function X:W® IR with the property that sets {w Î W: X(w) £ a} are events in \cal M for all a Î IR. Recall that the last conditions means that we may talk about probabilities of events {w Î W: X(w) £ a}.
Probabilities for a one-dimensional r. v. are determined by the cumulative distribution function
| (12) |
Cumulative distribution function can be used to express probabilities of intervals Pr(a < X £ b) = F(b)-F(a). Since probability is continuous, (1) we can also compute Pr(X = a) = limb® a+Pr(a < X £ b) = F(a+)-F(a). The right hand side limit F(a+) exists, as F is a non-decreasing function.
Example 9 Suppose F(x) = (1-e-x)Ú0. Then Pr(|X-2| < 1) = e-1-e-3.
In probability theory we are concerned with probabilities. Random variables that have the same probabilities are therefore considered equivalent. We write X @ Y to denote the equality of distributions, ie. Pr(X Î U) = Pr(Y Î U) for all Borel sets U Ì IR (say, all intervals U).
Vector valued r. v. are measurable the functions W® IRd. In the vector case we also refer to X = (X1,... Xd) as the d-variate, or multivariate, random variable.
We will use the ordinary notation for sums and inequalities between random variables. There is however a word of caution. In probability theory, equalities and inequalities between random variables are interpreted almost surely. For instance X £ Y+1 means Pr(X £ Y+1) = 1; the latter is a shortcut that we use for the expression Pr({w Î W: X(w) £ Y(w)+1}) = 1. Problem 5 Show that F(x) = Pr(X £ x) is right continuous: limx®a+F(x) = F(a).
A binomial experiment, called also binomial trials, consists of the sequence of simpler identical experiments that have two possible outcomes each. The independent events Sj represent successes in consecutive experiments. We assume that we have an infinite sequence of events S1,S2,... Sk,... that are independent and have the same probability p = Pr(Sj). We denote by Fj = Sj¢ the failure in the j-th experiment, and put q = 1-p.
Two important random variables are associated with the binomial experiment are the number X of successes in n trials, and the number T of trials until first success. Example 10 The probability that number X of successes in n trials is k is Pr(X = k) = (kn)pkqn-k. (Here k = 0,..., n.)
Example 11 The probability of more than n attempts needed for the first success is Pr(T > n) = qn. The probability that first success occurs at the n-th trial is Pr(T = n) = pqn-1 (geometric).
Example 12 Geometric distribution has lack of memory property: Pr(T > n+k|T > n) = Pr(T > k).
Random variables are often described solely in terms of cumulative distribution function F(x), or formulas for Pr(X = x) without reference to the underlying probability space W. For instance, the number of minutes T that we spend waiting for a bird to come to the bird feeder at the back of my house is random, and I believe Pr(T = n) = pqn-1 because Pr(T > n+k|T > n) = Pr(T > k).
It is intuitively obvious that on average we get np successes in n trials.
It is perhaps less obvious10
that on average we need 1/p trials to get the first success.
[ 9 Write a simulation program to verify the claims about the averages for several values of p.
Example 13 The probability that in 2n tosses of a fair coin, half are heads is [(2n)!/( 4n(n!)2)] » [1/( Ö{ pn})]® 0 as n®¥. The latter isn't easy to prove, but the computer printout is quite convincing, see Table (note that [1/( Ö{p})] » 0.5642).
| 2n | Pr(X = n) | Frequency in 1000 trials | ÖnPr(X = n) |
| 100 | 0.07959 | 0.08200 | 0.56278 |
| 300 | 0.04603 | 0.06100 | 0.56372 |
| 500 | 0.03566 | 0.03700 | 0.56391 |
| 700 | 0.03015 | 0.02200 | 0.56399 |
By now you should have written some simple simulation programs, and printed out the results. It is perhaps a good moment to pause and consider what are the aspects of simulations that we are interested in.
In general, we would like to get answers to questions that we don't know how to answer in any other way. But before we do that, we should develop some intuition on the cases that can check the answers. Therefore we begin with simulation of probabilities or averages that are known.
Problem 7 [Exercise]
A die is thrown until an ace turns up.
Assuming the ace doesn't turn up on the first throw, what is the
probability that
more than three throws (ie. at least four) will be necessary? ANS: 197/198
Suppose that an ace turns up on an even throw. What is the probability that
it turned up on the second throw?
(If you think this is too hard mathematics, do it
as a computer assignment!)
[ 5 A deck of 52 cards has 4 suits and 13 values per suit.
If you think this is too difficult on a computer, compute the probabilities by hand.
[ 6 This is the expanded version of Exercise 1.5.1.
In a group of n people, how often at least two have the same birthday?
Tree species are not distributed at random but are associated with special habitats.
The Auborn Society Field Guide to North American Trees.
Intuitively, random variables are numerical quantities measured in an experiment. The concept11 is the core of probability theory; it leads outside of elementary probability and it touches advanced concepts of integration, function transforms and weak limits.
For convenience random variables are split into three groups: continuous, discrete, and the rest.
The definition says that X is a discrete r. v. if there is a finite, or countable set \cal V of numbers (values) of X such that pv = Pr(X = v) > 0 and åv Î \cal V pv = 1. The function f(v) = pv is then called the probability mass function of X. For completeness, the domain of the probability mass function is often extended to all x Î IR (or to x Î IRd in the multivariate case) by f(x) = Pr(X = x).
It is easy to see that if f is a function which satisfies two natural conditions:
| |||||||||||||
For discrete r. v. the cumulative distribution function (12) plays lesser role. It is a discontinuous function given by the expression
| (16) |
| Name | Values | Probabilities | Symbol | Parameters |
| Binomial | 0,...,n | Pr(X = k) = (nk)pk(1-p)n-k | Bin(n,p) | 0 £ p £ 1, n Î IN |
| Poisson | ZZ\scriptscriptstyle + | Pr(X = k) = e-l[(lk)/ k!] | Poiss(l) | l > 0 |
| Geometric | ZZ\scriptscriptstyle + | Pr(X = k) = p(1-p)k-1 | 0 £ p £ 1 | |
| Uniform | {x1,...,xk} | Pr(X = xj) = 1/k | k Î IN, x1,...,xk Î IR | |
| Hypergeometric | ||||
|
Negative Binomial |
Example 14 Let the random variable X denote the number of heads in three tosses of a fair coin.
Example 15 Let the random variable X denote the score of a randomly selected student on the final exam.
Problem 8 Let N be Poiss(l), and assume N balls are placed at random into n boxes. Find the probability that exactly m of the boxes are empty. ANS: (mn)e-lm/n(1-el/n)n-m.
Other methods are also available for the distributions from Table 2.1. For example, program TOSSCOIN.BAS on page pageref simulates binomial distribution Bin(n=100, p=1/2).
The following exercise provides tools to run more involved simulations.
[ 11
Write functions SimOneBin(n,p) and SimOneGeom(p)
that will simulate a single occurrence of the Bin(n,p) r. v. and the geometric r. v.
The sample usage: PRINT SimOneBin(15,.5) should simulate the number of heads
in tossing 15 fair coins.
Also write function SimGeneric(p()) which simulates generic r. v. with values
{0,1,...,n} and prescribed probabilities pk = p(k).
Since probability satisfies continuity axiom (1), Pr(X Î (a,a+h))® 0 as h® 0 for all a. The main interest in continuous case is that the rate of convergence to 0 is also known, Pr(X Î (a,a+h)) » f(a) h +o(h). Function f(x) in this expansion is called the density function.
In terms of the cumulative distribution function 12), the probability is Pr(X Î (a,a+h)) = F(a+h)-F(a), and thus f(a) = limh® 0[(F(a+h)-F(a))/ h] = F¢(a) is the derivative of the cumulative distribution function F. Therefore when the Fundamental Theorem of Calculus can be invoked (say, when f is piecewise continuous)
| (17) |
Definition 5 Random variable X is (absolutely) continuous, if there is a function f such that Pr(X Î U) = òU f(x)dx. Function f is called the probability density function of X.
It is known that if f is a function which satisfies two natural conditions:
| ||||||||||||||
It is convenient to use the heuristic probability density function in continuous case corresponds to probability mass function in discrete case, and that expressions that involve in discrete case sums are replaced by integrals, compare (17) and (16).
| Name | Range | Density | Symbol | Parameters |
| Normal | -¥ < x < ¥ | f(x) = [1/(Ö{2p}s)]exp-[((x-m)2)/(2s2)] | N(m,s) | m Î IR,s > 0 |
| Exponential | x > 0 | f(x) = le-lx | l > 0 | |
| Uniform | a < x < b | f(x) = [1/( b-a)] | U(a,b) | a < b real |
| Gamma | x > 0 | f(x) = Cxa-1e-x/b | Gamma(a,b) | a > 0,b > 0,C = [1/( baG(a))] |
| Weibull |
Example 16 A dart is thrown at a circular dart board of radius 6.
Let X denote the distance of the dart from the center. Assuming uniform
assignment of probability (3), the density of X is
f(x) = {
Problem 9 Referring to Exercise 1.2.2, let X,Y denote the arrival times
of the two drivers at the intersection. Find the density of the time lapse |X-Y|
between their arrivals.
To create a useful histogram, split the range into the finite number of
intervals. Then graph over each interval the rectangle with the area equal to the
observed frequency. The number, and positioning of intervals depends on the amount
of data available, and personal preference.
The generic method for simulating a continuous random variable is
similar to the method used in the discrete case. Namely, we take
X = f(U) with the suitable function f.
To find f assume it is increasing and thus invertible with inverse
g. Then Pr(f(U) < x) = Pr(U < g(x)) = g(x). This completes the generic
prescription: take as f(x) the inverse12 of the
cumulative distribution functions F(x) = Pr(X £ x).
This method of simulation is quite effective if the inverse of F can be found
analytically. It becomes slow when the inverse (or, worse still, cumulative
distribution function F itself) is computed by numerical procedures. Since this
is the case
of
the normal distribution, special methods are used to simulate the normal
distribution.
Example 17 To simulate X which is exponential with parameter
l, use X = -[1/( l)]lnU.
Expected value captures the intuition of the average of a random quantity.
It is also this intuition that leads to estimating the expected value by averaging
the outcomes of simulations.
Example 18 If X has values
x1,...,xn with equal probability, then EX = [`x] is the arithmetic
mean of x1,...,xn.
Simulationsare oten used to get answers that are too difficult
to find analytically. The following exercise can be answered by simulation if you
figure out how to shuffle cards from a deck at random (Exercise 1.4).
[ 12 What is the expected number of cards which must be turned over in
order to see each of one suit.
Example 19 If X takes two
values a < b and Pr (X = a) = p, then EX = pa+(1-p)b is the number in the
closed interval [a,b].
Definition 7
For continuous random variable X the expected value EX is given by
EX = òIRxf(x) dx, provided the integral converges.
Problem 10 Compute EX for the entries in Table 2.3.
[ 13 Simulate EX for the entries in Table 2.3 (except normal) for different values
of parameters involved.
Problem 11 [Exercise] Referring to Example 2.2.1,
what is the average distance of a dart from the center?ANS: 4.
If you chose to do the simulations, you should perhaps notice that it
is rather difficult to decide how many simulations to take in
order to achieve the desired accuracy. Typically, you need to increase
the size of a simulation four times to half the error.
Another point to keep in mind is that simulations do return numbers.
But from numbers alone it is
difficlut to see how parameters of the model change it, and this is a
more interesting question.
Proof. To simplify the proof and expose the main idea more clearly,
consider the discrete case with a finite number of values.
Similarly, to avoid
technicalities we consider continuous case with bounded
range
only.
Discrete case:
EX =
x1p1+x2p2+...+xnpn =
x1(p1+p2+...+pn-(
p2+...+pn))+x2(
p2+...+pn-(
p3+...+pn))+...+xn-1(
pn-1+pn -pn)+xnpn = x1(
p1+p2+...+pn)+(x2-x1)(
p2+...+pn-(p3+...+pn))+...+(
xn-1-xn-2)(
pn-1+pn)+(xn-xn-1)pn
The latter is ò0xn Pr(X > t) dt.
Continuous case: Let f denote the density and F be the cumulative distribution function.
Integrating by parts
EX = ò0b x f(x) dx = b-ò0bF(x) dx = ò0b(1-F(x) dx.
[¯]
It is natural to define EX for more general r. v. in the same way through
formula (21). Write
X = X+-X- to decompose X into its non-negative, and non-positive parts,
and then define EX = ò0¥ P(X > t) dt-ò0¥ P(X < -t) dt.
Clearly, if one of the integrals diverges, EX is not defined.
Example 20 Suppose X is exponential with the density f(x) = e-x. Let Y
be X truncated at level 3. That is
Y = {
Clearly, Y is not continuous, as Pr(Y = 3) = Pr(X > 3) = e-3 > 0. On the other
hand, Y is not discrete as it takes uncountable number of values; in fact all
the numbers between 0 and 3 are possible. The
definitions of the expected value we gave do not apply,
but (21) can be used to show that E(Y) = 1-e-3.
Indeed, Pr(Y > t) = Pr(X > t) = e-t for 0 < t < 3, and Pr(Y > t) = 0 for t > 3.
Example 21 To determine the mean of the geometric distribution we can either
compute the sum pån = 1¥ n (1-p)n-1, or use (21) and find
the easier sum ån = 0¥ (1-p)n.
[ 1
If U ³ 0 then
Indeed, by (21) we have
EU = ò0¥ Pr(U > x) dx ³ ò0tPr(U > x) dx ³ tP(|X| > t).
Problem 12 Suppose U is uniform U(0,1). Then Pr(U > t) £ [1/ 2t].
This means that 1-t < [1/ 2t].
Expected value EX is approximated without much difficulty13 on a computer by averaging large number of
independent trials.
Example 22 To find the average of the uniform U(0,1) distribution, take
1/n(U1+...+Un).
This is the basis of Monte-Carlo methods , which
is a family of related probabilistic methods for computing the integrals. To find
òab f(x) dx we simulate Xj = f(a+bUj). The average
1/nåj = 1nXj approximates [1/( b-a)]òab f(u) du for large
n.
The variance of the n-th approximation is of the order n-1/2, which is
worse than the trapezoidal rule for smooth
functions. In return the approximation is insensitive to the smoothness of the
integrands, and also to the dimension of the integral. Monte Carlo methods can be
used effectively for multiple integrals over irregular domains.
[ 14 Use Monte Carlo method to approximate
p = ò-112[Ö(1-x2)] dx. (You may want to compare the output with numerical
procedures described in Section .)
Another method of similar nature is to pick points (X,Y) at random from the
rectangle containing the graph of f and check if Y < f(X) holds. The proportion
of ``successes" approximates the proportion of the area under the graph of f.
[ 15 Approximate p/4 by selecting points (X,Y) at random from
the unit square, and checking if X2+Y2 < 1.
The following sample program computes numerically double integral
òòUcos(10x+20y) dxdy over a circle x2+y2 = 1. The only conceptual
difficulty as compared to single integrals is how to select points at random from
the unit disk. This is done by picking points from a bigger square and discarding
those that didn't make it. Can you do this integral analytically? Or by another
numerical procedure?
PROGRAM dblint.bas
[ 7 There are two natural choices to estimate by simulation
events of small probability. We can pick a large sample size n, run the
simulation experiment and hopefully get several data points. The outcome X of
such an experiment is a binomial random variable with unknown probability p of
success, and we would estimate p » X/n. The trouble is that if we don't
know how small the chances are, we might get none, estimating probability to be
zero. Or we could run the experiment until we get the prescribed number of
successes. The observation would then consist of a set of geometric random
variables T1,..., Tk with unknown mean ET = 1/p. We could then estimate
p = 1/ET by taking the inverse of the arithmetic mean of Tj.
In this approach we are guaranteed to get some observations, as long as p ¹ 0.
The question is Which of the methods would you recommend to use? (Of course,
you would recommend a better method, but what the word ``better" might mean here?)
Often an experiment involves measuring two or more random numbers, say X and Y.
The fact that we know the distribution of X, and the distribution of Y separately
doesn't
determine probabilities of events that involve both X and Y simultaneously.
Example 23 Suppose
It is clear that if X is discrete, and Y is discrete, then (X,Y)
is an IR2 valued discrete r. v. That is, the values of the pair are
countable. Probabilities Pr(X = x, Y = y) are called the joint
distribution. Corresponding Pr(X = x) and Pr(Y = y) are the so called
marginals. Example 2.5 points out that marginals do not determine
joint probabilities uniquely. But if we know the joint probabilities
then we can compute the marginals, eg Pr(X = x) = åyPr(X = x, Y = y).
In contrast to the discrete case, joint continuity can not be
recognized from the continuity of the components, and requires full definition.
Definition 8 Let X = (X1,...,Xn). Random variables X1,...,Xn are
jointly (absolutely) continuous, if there is a function f such that
Example 24
Suppose X,Y have uniform distribution in the unit disk. Then the joint density
is f(x,y) = 1/p for x2+y2 £ 1 and the density of X is
fX(x) = [2/( p)][Ö(1-x2)].
The relation between probabilities and the density is
(Similarly we define the stochastic independence of random vectors).
We say that X1,...,Xn are independent identically distributed
(i. i. d) if (26) holds and
Pr(Xi Î U) = Pr(Xj Î U) for all Borel U Ì IR.
[ 2
If X,Y are discrete with the probability mass function f(x,y), then
independence of X,Y is equivalent to f(x,y) = fX(x)fY(y).
If X,Y are continuous with the joint density f(x,y) then independence of X,Y
is equivalent
to f(x,y) = fX(x)fY(y).
Independence is often part of the model. Independence
allows to determine joint distributions from marginals. Thus each independent
random
variable can be analyzed separately, and then
more complex questions can be answered. From the mathematical perspective, under
independence we can determine
joint distributions if we know marginals.
Example 25 Suppose X is
binomial Bin(n,p) and Y is Poiss(l). If
X,Y
are independent, then the joint probability mass function of X,Y is given by
f(x,y) = (xn)pn(1-p)n-xe-lly/y! (or 0).
Example 26 [Example 1.2.2 continued]
Two drivers arrive at an intersection between
8:00 and 8:01 every day. Their arrival times are independent random variables.
Indeed, using formula (25) and elementary area computation,
Pr(X < a,Y < b) = ab for 0 < a,b < 1.
Sums and linear combinations, medians, maxima, and minima are perhaps the most
often encountered functions of multidimensional random variables. Methods to
compute the distribution, or the expected value of such a function are of
considerable practical significance.
If X,Y are discrete and EZ exists, then
If X,Y are jointly continuous then Z might be continuous, discrete, or say of mixed type.
Regardless of the case
In particular the expected value is linear
The fact that expected value is linear provides a simple method of computing some
otherwise difficult sums.
Example 27 Suppose X is Binomial Bin(n,p). Then X = X1+...+Xn, where
Xj is the number of successes in j-th trial. Clearly each Xj is 0, or
1, and EXj = p.
[ 16 [Example 1.2.2 continued]
Two drivers arrive at an intersection between
8:00 and 8:01 every day. On average how much time lapses between their arrivals?
Definition 10 Let m = EX. The variance Var(X) is defined
as Var(X) = E(X-m)2.
Notice that
The standard deviation is s = [ÖVar(X)].
Problem 13 Compute variances Var(X) for the entries in Table 2.4. (Some of these
are a real challenge to your computational skills, so you may safely give up. Another method will
make it easier in Chapter ).
Since Var(X) = EX2-(EX)2 ³ 0, the following inequality follows
Definition 11 The covariance of random variables X,Y with expected values
mX, mY is defined as cov(X,Y) = E(X-mX)(Y-mY).
Clearly Cov(X,X) = Var(X) and Var(X+Y) = Var(X)+Var(Y)+2cov(X,Y).
Theorem 3 If X,Y are independent, then Var(X+Y) = Var(X)+Var(Y).
Example 28 let X1,X2,...,Xn be independent {0,1}-valued random
variables, and suppose that Pr(Xj = 1) = p. Then Var(åj = 1nXj) = np(1-p).
What is the distribution of åj = 1nXj?
Example 29 If X > 0 then EeX = 1+ò0¥ et P(X > t) dt
Problem 14 Show that EX2 = ò0¥ t P(|X| > t) dt.
Generalize this formula to E|X|p.
Special cases of Chebyshev's inequality (23) are:
Chebyshev's inequality is one reason we often strive for small average quadratic
error. If E|X-X0|2 < e then we can be sure that
Pr(|X-X0| > 3Ö{e}) < 3Ö{e}. The following may be
used (with some caution) in computer programs to asses error in estimating
probabilities by sampling.
Example 30 If X
is
Bin(n,p)
then
Pr(|X/n-p| > [1/( 3Ön)]) £ [1/( 43Ön)]
[ 17
Run a simulation of the event that you know probability of, printing out the error estimate.
Do the same for the event that you don't know the probability analytically.
(Use programs written for previous exercises)
From now on, in the output of your simulation programs you should provide some
error estimates.
As an example of simple scheduling problem consider the following.
Example 31
Suppose that we want to bake a batch
of chocolate-chip cookies. The tasks and their (estimated) times
are:
[ 18
Suppose for the sake of this exercise that the numbers presented in the
cookie-baking example are just the average values of the exponential random
variables.
Suppose X,Y are discrete and f(x,y) = Pr(X = x, Y = y) is their joint probability
mass function. Then Z = X+Y is discrete with values z = x+y. Therefore
For independent random variables X, Y this takes a slightly simpler form.
Formula (36) can be used to prove the so called summation formulas.
Theorem 4
If X, Y are independent Binomial with the same parameter p, ie. X is
Bin(n,p) and Y is Bin(m, p), then X+Y is Binomial Bin(n+m,p).
If X, Y are independent Poisson Poiss(lX) and Poiss(lY), then X+Y is
Poisson Poiss(lX+lY).
One can prove that if X,Y are independent and continuous than X+Y is
continuous with the density
Formula (37) defines the convolution
fX*fY. It can be used to prove the so called summation formulas.
Theorem 5
If X, Y are independent Normal, then X+Y is Normal.
If X, Y are independent Gamma with the same parameter b, then X+Y is
Gamma(aX+aY,b).
Example 34 Suppose X,Y are independent U(0,1). The density of Z = X+Y is
f(z) = {
Suppose X,Y are independent.
If U = min{X,Y} then Pr(U > t) = Pr(X > t)Pr(Y > t). Therefore the reliability
function of U can be computed from the two given ones.
Example 35 If X, Y are independent exponential, then U = min{X,Y} is
exponential.
If U = max{X,Y} then Pr(U < t) = Pr(X < t)Pr(Y < t). Therefore the cumulative
distribution function of U can be computed from the two given ones.
Example 36
Suppose X,Y are independent uniform U(0,1). Then U = max{X,Y} has the density
f(u) = 2u for 0 < u < 1.
Problem 15
Let U1,..., Un be independent uniform U(0,1). Find the density of
Let R1,..., Rn be the corresponding order statistics. This means that at the end of each experiment we re-arrange the
numbers X1,... Xn into the increasing sequence R1,..., Rn. This
means that R1 = minj Xj is the smallest, R2 = maximinj ¹ i Xj is the
second largest,
etc.
The density of Rk can be found by the following method. In order for the
inequality Rk > x to hold, there must be at least k values among the Xj
above level x. Since Xj are independent and have the same probability
p = Pr(Xj > x) of ``success" in crossing over the x-level, this means that
Pr(Rk > x) is given by the binomial formula with n trials and probability of
success p = 1-G(x).
When the derivative is taken, the sum collapses into just one term, giving the
elegant answer rk(x) = [n!/( (k-1)!(n-k)!)](G(x))k(1-G(x))n-kg(x).
The L2 norm is
Notice that ||X-EX||2 is just another notation for the standard deviation. Thus
standard deviation is the L2 distance of X from a constant.
We say that Xn converges to X in L2, if ||Xn-X||2® 0 as
n® ¥. We shall also use the phrase
sequence Xn converges to X in mean-square.
An example of the latter is Theorem .
Several useful inequalities are collected in the following14.
Theorem 6
For all square-integrable X,Y
Proof of (40): Use (39) with Y = 1.
Proof of (41): By (39)
E|X+Y|2 £ ||X||22+||Y||22+2||X||2||Y||2.
[¯]
Random variables with r = 0 are called uncorrelated.
Correlation coefficient close to one of the extremes ±1 means that there is a
strong linear relation between X,Y; this is stated more precisely in
().
Theorem 2.7 states that independent random variables with finite variances are
uncorrelated.
Problem 17 Give an example of dependent random variables that are
uncorrelated.
Suppose we would like to approximate random variable Y by another quantity X
that is perhaps better accessible. Of all the possible ways to do it, linear
function Y » mX+b is perhaps the simplest. Of all such linear functions, we
now want to pick the best. In a single experiment, the error is |Y-mX-b|. We could minimize the
average empirical error over many experiments 1/nåj|Yj-mXj-b|. This
approximates the average error E|Y-mX-b|. Let us agree to measure the error of
the approximation by a quadratic error E(Y-mX-b)2 instead. (This choice leads
to simpler mathematics.)
Question: For what values of m, b the error E(Y-mX-b)2 is the smallest?
When is it 0?
Let H(m,b) = E(Y-mX-b)2. Clearly, H is a quadratic function of m,b Î IR.
The unique minimum is determined by the system of equations
The answer is m = [cov(X,Y)/ Var X], b = EY-mEX, and the minimal error is
A chain in the x,y-plane consists of n links each of unit length. The angle
between two consecutive links is ±a, where a > 0 is a constant.
Assume the sign is taken and random, with probability 1/2 for each.
Let Ln be the distance from the beginning to the end of the chain. The angle
between the k-th link and the positive x-axis is a random variable Sk-1,
where we may assume (why?) S0 = 0 and Sk = Sk-1+xka, where
x = ±1 with probability 1/2. The following steps determine the average length
of the random chain.
ELn2 = n [(1+cosa)/( 1-cosa)] -2cosa[(1-cosna)/( (1-cosa)2)]
Definition 12 The conditional expectation E{X|Y = y} is defined as
åxPr(X = x|Y = y) in discrete case, and as òIR x f(x|Y = y) dx in the
continuous case. One can show that the expected values exist, when E|X| < ¥.
Example 37 A game consists of tossing a die. If the face value
on the die is X then a coin is tossed X times. Let Y be the number of heads.
Then E(Y|X = x) = 1/2x.
Example 38 Suppose Y is a discrete with different values on
the events A1, A2, ¼, An which form a non-degenerate
disjoint partition of the probability space W. Then
Example 39 Suppose that f(x, y) is the joint density with respect
to the Lebesgue measure on IR2 of the bivariate random variable (X, Y)
and let fY(y) ¹ 0 be the (marginal) density of Y. Put
f(x|y) = f(x, y)/fY(y). Then E{X|Y} = h(Y), where
h(y) = ò-¥¥ x f(x|y) dx.
Total probability formula for conditional
expectations is as follows.
Similar question can be solved for N having a Poisson distribution.
Example 42 What is the distribution of a geometric sum of i. i. d. exponential
r. v.?
Example 43 Stock market fluctuations can be modelled by
Z = x1+...+xN, where N, the number of transactions, is Poisson(l)
and x are normal N(0,s). There is no explicit formula for the density
of Z, but there is one for the moment generating function. Thus Chebyshev
inequality gives bounds of the form Pr(Z > t) £ exp....
Theorem gives geometric interpretation of the conditional
expectation E{·|Z}; for square integrable functions E{.|Z} is just
the orthogonal projection of the Banach (normed) space L2 onto its closed
subspace L2(Z), consisting of all 2-integrable random variables of the
form f(Z).
Theorem 7 For square
integrable Y the quadratic error E(Y-h(X))2 among all square integrable
functions h(X) is the smallest if
h(x) = E(Y|X = x).
Theorem 2.14 implies that the linear approximation from Section is usually less accurate. In Chapter we shall see that linear approximation is the best one can
get in the all important normal case. Even in non-normal case linear
approximations offer quick solutions based on simple second order statistics. In
contrast, the non-linear approximations require elaborate numerical schemes to
process the empirical data.
Suppose T represents a failure time of some device.
If the device is working at
time t, then the probability of surviving additional s seconds is
Pr(T > t+s|T > t).
For a device that doesn't exhibit aging this probability should be the same as for
the brand new device.
Problem 19
Show that geometric T satisfies (47) for integer t,s.
Equation (47) implies Pr(T > t+s) = Pr(T > t)Pr(T > s), an equation that can
be solved.
Theorem 8 If T > 0 satisfies (47) for all t,s > 0, then T is
exponential.
Proof.
The tail distribution function N(x) = P(T > x) satisfies equation
Formula (48) implies that for all integer n
and all x ³ 0
This shows that for rational q > 0
Let T > 0 be a continuous r. v. interpreted as a failure time of a certain device.
If the device is in operational condition at time t, then the probability that
it will fail immediately afterwards may be assumed negligible. The probability of
failing within h units of time is Pr(T < t+h|T > t). The failure rate at time t
is defined as
Example 44 If T is exponential then the failure rate is constant.
A family of failure rates that exhibit interesting aging patterns is
provided by the family of power
functions l(t) = ta.
Theorem 9
If T is continuous with failure rate l(t) = ta, where a > 0
then T has the Weibull density:
Proof.
Rewrite the expression
(kn)[(lk)/( nk)](1-[(l)/ n])n-k = lk/k!(1-l/n)n/(1-l/n)kÕj = 0k(1-j/n). [¯]
Example 46
A bag of chocolate chip cookies has 50 cookies. The manufacture claims there are
a 1,000 chips in a bag. Is it likely to find a cookie with 15 or less chips
in such a bag?
From numbers 1, ... n, select at random k > n numbers.
On average, how many of these numbers repeat? (and hence should be thrown out)
Problem 21 The performance of the algorithm for selecting a random permutation
in GetPermutation SUB of RANDTOUR.BAS
can be estimated by analyzing the following ``worst-case" scenario.
When the algorithm attempts to select
last of the random
numbers 1, ... n, then
Problem 22
For a group of n person, find the expected number of days of the year which are
birthdays of exactly k people. (Assume 365 days in a year and that all birthdays
are equally likely.)
Problem 23
A large number N of people are subject to a blood test. The test can be
administered in one of the two ways:
(i) Each person can be tested separately (N tests are needed).
(ii) The blood sample of k people can be pooled (mixed) and analyzed. If the
test is positive, each of the k people must be tested separately and in all
k+1 tests are then required for k
people.
Assume the probability p that the test is positive is the same for all people
and that the test results for different people are stochastically independent.
[ 21
There are 3 stop-lights spaced within 1km of each other and operating asynchronously. (They are
reset at midnight.)
Assuming each is red for 1 minute and then green for one minute, what is the average time
to pass through the three lights by a car that can instantaneously accelerate to 60km/h.
This exercise can be developed into a simulation project that may address some of the following
questions
Problem 25
Let X ³ 0 be a random variable and suppose that
for every 0 < q < 1 there is T = T(q) such that
Problem 26 If x, U are independent,
Pr(x = 0) = Pr(x+1) = 1/2 and U is uniform U(0,1). What is the
distribution of U+x?
Properties of a sequence {an} are often reflected in properties of the
generating function h(z) = ånanzn.
A moment generating function is non-negative, and convex (concave up). The
typical example is the moment generating function of the {0,1}-valued random
variable.
A linear transformations of X changes the moment generating function by the following formula.
Important properties of moment generating functions are proved in
more theoretical probability courses15.
Theorem 11
(i) The distribution of X is determined uniquely by its moment generating
function M(t).
(ii) If X, Y are independent random variables, then
MX+Y(t) = MX(t) MY(t) for all t Î IR.
(iii) M(0) = 1, M¢(0) = EX, M¢¢(0) = EX2
Problem 27 Find moment generating functions for each
of the entries in Table 3.1.
Problem 28 Use moment generating functions to compute
EX, Var(X) for each
of the entries in Table 2.4.
Problem 29
Prove the summation formulas stated in Theorems 36 and 37
For a d-dimensional random variable X = (X1, ¼, Xd)
the moment generating function
MX: IRd® \sf CC is defined by
MX(t) = Eexp(t·X), where the dot denotes the
dot (scalar) product, ie. x·y = åxkyk.
For a pair of real valued random variables X, Y, we also
write
M(t, s) = M(X, Y)((t, s)) and we call M(t, s)
the joint moment generating
function
of X and Y.
The following is the multi-dimensional version of Theorem 3.2.
Theorem 12
(i) The distribution of X is determined uniquely by its
moment generating function M(t).
(ii) If X, Y are independent IRd-valued random variables,
then
The major nuisance in using the moement generating functions is the fact that
the moment generating functions may not exist, when the definiting integral diverges.
For this reason it is more preferable to use an expression that is always bounded,
and yet has the same convenient algebraic properties.
The natural candidate is
For symmetric random variables complex numbers can be avoided at the expense of
trigonometric identities.
Example 47 If X is -1,1 valued,
Pr(X = 1) = 1/2,
then
f(t) = cost.
Example 48
The characteristic function of the normal N(0,1) distribution is
f(t) = e-t2/2.
(For a = 6 Pr(Sn = k) is the probability of scoring the sum k+n in a throw
with n dice. The solution of this problem is due to de Moivre.)
Suppose that all distributions of the sexes for n children are
equally likely.
Find the probability that a family has exactly k girls.
ANS:
[(2apk)/( (2-p)k+1)]
Problem 32
Show that if X ³ 0 is a random variable such that
Problem 33
Show that if Eexp(lX2) = C < ¥ for some
a > 0, then
Problem 34
Prove that function f(t): = Emax{X, t} determines
uniquely the distribution of an integrable random variable X
in each of the following cases:
The predominance of normal distribution is often explained by the
. This theorem asserts that under fairly general conditions the
distribution of the sum of many independent
components is approximately normal. In this chapter we give another reason why
normal distribution might occur.
The usual definition of the standard normal variable Z specifies
its
density f(x) = [1/( Ö{2p})]e-[(x2)/ 2], see Figure 2.1 on page pageref. In
general, the so called N(m, s) density is given by
In multivariate case it is more convenient to use moment generating functions
directly. For consistency we shall therefore adopt the following definition.
Definition 13 A real valued random variable X has the
normal N(m, s)
distribution if its moment generating
function has the form
From Theorem 3.2 one can check by taking the derivatives
that m = EX and s2 = Var(X).
Using (53) it is
easy to see that every univariate normal X
can be written as
[ 22 [Basketball coach's problem]
Collect data on the heights of 25 to 50 randomly selected males. Make a histogram
of the data, compute the empirical mean and standard deviation.
Ten years after Herschel,
the reasoning was repeated by J. C. Maxwell18.
The fact that velocities are normally distributed is sometimes
called Maxwell's theorem.
The beauty of the reasoning lies in the fact
that the interplay of two very natural assumptions: of independence
and of rotation
invariance, gives rise to the normal law of errors - the most important
distribution in statistics.
Theorem 13 Suppose
random variables X, Y have joint probability distribution m(dx, dy) such
that
(i) m(·) is invariant under the rotations of IR2;
(ii) X, Y are independent.
Then X, Y are normal.
The following technical lemma asserts that moment generating function exists.
[ 1 If X,Y are independent and X+Y, X-Y are independent, then
EexpaX < ¥ for all a.
Let X1, X2 be the independent copies of X. Inequality (56)
follows from the fact that event {|X1| ³ 2x} implies that either the
event {|X1| ³ 2x}Ç{|X2| ³ 2x0}, or the event
{|X1+X2| ³ 2(x - x0)}Ç{|X1 - X2 | ³ 2(x - x0)} occurs.
Indeed, let x0 be such that P(|X2| ³ 2x0) £ 1/2. If
|X1| ³ 2x and |X2| < 2x0 then
|X1±X2| ³ |X1| - |X2| ³ 2(x -x0). Therefore using independence and the trivial bound
P(|X1+X2| ³ 2a) £ P(|X1| ³ a)+P(|X2| ³ a), we obtain
Therefore (take u = v) the second derivative Q¢¢(u) = Q¢¢(0) = const ³ 0.
This means M(u) = exp( au+bu2).
[¯]
If EX = EY = 0, the moment generating function M(t,s) = EetX+sY is given by
M(t,s) = e1/2(s12 t2+2tsrs1s2+s2t2)
Clearly s12 = Var(X), s22 = Var(Y), r = Cov(X,Y).
When r ¹ ±1 the joint density of X,Y exists and is given by
Example 49 If X,Y are jointly normal with correlation coefficient
r ¹ ±1 then the conditional distribution of Y given X is normal.
Problem 37 If X,Y are independent normal N(0,1), find the density of
X2+Y2. (Hint: compute cumulative distribution function, integrating in polar
coordinates.)
Problem 38 For jointly normal X,Y show that E(Y|X) = aX+b is linear.
Problem 39 If X,Y are jointly normal then Y-rXsY/sX and
X are independent.
Problem 40 If X, Y are jointly normal with variances
sX2,sY2 and the correlation coefficient r, then
X = sX(g1 cosq+g2sinq,
Y = sY(g1 sinq+g2cosq, where gj are independent N(0,1) and sin2q = r.
This is a short chapter on asymptotic behavior of sums and averages of independent
observations. Theorem justifies simulations as
the means for computing probabilities and expected values. Theorem
provides error estimates.
Definition 15
Xn® X in probability, if Pr(|Xn-X| > e)® 0 for all e > 0
Example 50
Let Xn be N(0,s = 1/n). Then Xn® o in probability.
Definition 16
Xn® X almost surely, if Pr(Xn® X) = 1.
Example 51
Let U be the uniform U(0,1) r. v. Then 1/n U® 0 almost surely.
Example 52
Let U be the uniform U(0,1) r. v. et
Xn = {
Definition 17
Xn® X in L2 (mean square), if E|Xn-X|2 ® 0 as n® ¥.
Remark 2
If Xn® X in L2, then by Chebyshev's inequality Xn® X in probability.
If Xn® X almost surely, then Xn® X in probability.
Theorem 14
Suppose Xk are such that EXk = m, Var(Xk) = s2, and cov(Xi,Xj) = 0
for all i ¹ j. Then 1/nåj = 1n Xj® m in L2
Proof. To show that 1/nåj = 1n Xj® m in mean square,
compute the variance.
[¯]
The following method can be used to prove strong law of large numbers for Binomial
r. v.
ånPr(|Xn/n-p| > e) < ¥.
In addition to the types of convergence introduced in Section , we also have the convergence in distribution.
Definition 18 Xn converges to X in distribution, if Ef(Xn)® Ef(X)
for all bounded continuous functions f.
Theorem 15
If Xn ® X in distribution, then Pr(Xn Î (a,b))® Pr(X Î (a,b))
for all a < b such that
Pr(X = a) = Pr(X = b) = 0.
Theorem 16
If MXn(u)® MX(u) for all u, then Xn® X in distribution.
Theorem 2.17 states convergence in distribution to Poisson limit.
Here is a proof that uses moment generating functions.
Proof of Theorem 2.17.
MXn(u) = (1+pn(eu-1))n® el(eu-1).
[¯]
We state first normal approximation to binomial.
Theorem 17
If Xn is Binomial Bin(n,p) and 0 < p < 1 is constant, then
[(Xn-np)/( [Önpq])]® Z in distribution to N(0,1) random variable Z.
Proof. For p = 1/2 only.
The moment generating function of
[(Xn-np)/( [Önpq])] = [(2Xn-n)/( Ön)] is
Mn(u) = e-Önu(1/2+1/2e-2u/Ön)n = ([(e-u/Ön+eu/Ön)/ 2])n = coshn (u/Ön)® eu2/2.
(Can you justify the limit?)
[¯]
Limit theorems are illustrated by the program LIMTHS.BAS. This program
graphs empirical proportions [^p]t as the (random) function of t £ n,
makes a histogram, and compares it with the Normal and Poisson histograms. By
trying various lengths of path n, one can see the almost sure Law of Large
Numbers, and for moderate n see the normal approximation.
Problem 41 The graph of [^p]n as the function of n as given by
LIMTHS.BAS suggests a pair of ``curves" between which the averages are
``squeezed". What are
the
equations of these curves?
[ 23 Suppose a poll of size n is to be taken, and the
actual
proportion of the voters supporting an issue in question is
p = 1/2. Determine
the size n such that the observed proportion [^p] = 1/nX
satisfies
Pr([^p] > .8) £ .01.
[ 24 Plot the histogram for a 100 independent Binomial
Bin(n = 100,p = .5) random variables.
Theorem 18 If Xj are i.i.d. with EX = m, Var(X) = s2 then
[(åj = 1n Xj- nm)/( sÖn)]® Z
Proof. For m = 0, s = 1 only.
The moment generating function is
Mn(u) = (M1(u/Ön))n = (1+[u/( Ön)]M¢(0)+[(u2)/ 2n]M¢¢(0)+O([1/( n2)]))n® eu2/2 [¯]
The following program uses this to simulate random curves. The program requires
a graphics card on the PC.
PROGRAM firewalk.bas
One role of limit theorems is to justify the simulations and provide error
estimates. The simulation presented on page pageref uses the Central
Limit Theorem to print out the so called 95% confidence interval.
Central limit theorem is also a basis for a fast simulation of the normal distribution,
see Section .
We begin with the bound for binomial distribution.
Recall that the relative entropy is
H(q|p) = -qlnq/p - (1-q) ln[(1-q)/( 1-p)].
Theorem 19
Suppose Xn is Binomial Bin(n,p). Then for p¢ ³ p
Proof.
Use Chebyshev's inequality (34) and the moment generating function.
Pr(Xn ³ n p¢) = Pr(uXn ³ un p¢) £ e-np¢u EuXn =
( e-p¢u (1-p+peu))n = ( (1-p) e-p¢u+pe(1-p¢)u)n.
Now the calculus question: for what u ³ 0 the function
f(u) = (1-p)e-p¢u+pe(1-p¢)u attains a minimum? The answer is u = 0 if p¢ £ p, or
u = ln([(1-p)/ p][(p¢)/( 1-p¢)]). Therefore the minimal value is
f(umin) = [p/( p¢)]e(1-p¢)u = exp(-p¢lnp¢/p-(1-p¢)ln(1-p¢)/(1-p))
[¯]
Proof. This follows from the inequality
(1+x)ln(1+x) +(1-x)ln(1-x) ³ x2.
[¯]
Suppose Xj are i.i.d. Conditional limit theorems say what
is the conditional distribution of X1, given the value of the empirical average
1/nåj = 1n h(Xj). Such probabilities are difficult to simulate when
1/nåj = 1n h(Xj) differs significantly from Eh(X).
This chapter collects information about how to simulate special
distributions.
Considerations of machine efficiency, in particular convenience of doing
fixed precision arithmetics, make the uniform U(0,1) the fundamental
building block of simulations. We do not consider in much detail how
such sequences are produced - efficient methods are hardware-dependent.
We also assume these are i. i. d., even though the typical random number
generator returns the same pseudo-random sequence for each value of the
seed, and often the sequence is periodic and correlated. This is again a
question of speed and hardware only.
The classical Peano curve actually maps the unit interval onto the unit
square preserving the Lebesgue measure. Thus two independent U(0,1)
are as ``random" as one!
Experiments with discrete outcomes aren't necessarily less random than
continuous models. Expansion into binary fractions connects infinite
tosses of a coin with a single uniform r. v.
Example 53 Let U be uniform U(0,2p). Random variables
Xk = sign (sin(2k U)) are i. i. d. symmetric independent.
Theorem 20 If xj are independent identically distributed
discrete random variables with values {0,1} and Pr(x = 1) = 1/2
then åk = 1¥ [1/( 2k)]xk in uniform U(0,1).
Proof. We show by induction that if U is
independent of {xj} uniform U(0,1) r. v., then
[1/( 2n)]U+åk = 1n [1/( 2k)]xk is uniform for all
n ³ 0. For induction step, notice that in distribution
[1/( 2n)]U+åk = 1n[1/( 2k)]xk @ 1/2x1+1/2U. This reduces the proof to
n = 1 case.
The rest of the proof is essentially the solution of Problem
2.18. Clearly
Pr(1/2x1+1/2U < x) = Pr(x1 = 0)Pr(U/2 < x)+Pr(x1 = 1)Pr(1/2+U/2 < x) Expression Pr(U/2 < x) is 0, 2x, or 1. Expression Pr(1/2+U/2 < x) is 0, 2x-1, or 1. Their average (check carefully ranges of x!) is
{
[¯]
From the mathematical point of view all of the simulations can be based
on a single U(0,1) random variable. In particular, to generate
independently uniform integer numbers19 in the
prescribed range 0 ... N we need to pick any u Î (0,1), and define
Xj = Nju mod N.
Clearly Xj is the integer part of NUj, where Uj solve the
recurrence
Many programs provide access to uniform numbers. These however might be
platform-dependent, and are often of ``low quality". Often a person performing
simulation may want to use the code they have explicit access to. What is ``random
enough" for one application may not be random enough for another.
There is good evidence, both theoretical (see (62)) and empirical, that the simple
multiplicative congruential algorithm
The linear congruential method has the advantage of being very fast. It
has the disadvantage that it is not free of sequential correlations
between successive outputs. This shows up clearly in the fact that the
consecutive k-points lie in at most N1/k subspaces of dimension
k-1.
Many system-supplied random number generators are linear
congruential generators, which generate a sequence of integers n1, n2, ...
each between 0 and N-1 by the recurrence relation
This is rather easy to convert into the computer algorithm.
An exact method to simulate Poisson distribution is based on the fact
that it occurs the Poisson process, and that sojourn times in the Poisson
process are exponential (see Theorem on page ).
Therefore, to simulateX with Poiss(l) distribution, simulate independent
exponential r. v. T1, T2,..., with parameter l = 1 and
put as
the value of X the first value of n such that T1+...+Tn > l.
A reasonable approximation to Poisson Poiss(l) random variable
X is obtained by simulating
binomial Bin(n,p) random variable X¢ with
l = np. Since X¢ £ n, use n large enough to exceed any realistic values
of X. Run program LIMTHMS.BAS to compare the histograms - why is
Poisson distribution more spread out than the binomial?
Problem 42
Given a cumulative distribution function G(x) with inverse H(), and density g(x) = G¢(x),
let U1,U2
be two independent uniform U(0,1) random variables.
Show that X = H(U1), Y = U2g(X) are uniformly distributed in the region {(x,y): 0 < y < g(x)}.
If the conditional density of Y given X = x is f(y|x), and the density of X
is g(x), then the density of Y is òf(y|x)g(x) dx.
Example 54
Suppose Y has density
cye-y = Cån = 0¥y(1-y)n/n!. Let Pr(X = n) = c/n!. Then the conditional distribution is
f(y|X = n) = Cy(1-y)n, which is the distribution of a median from the sample of
size 2n.
The exact simulation of normal distribution uses the fact that
an independent pair X1,X2 of normal N(0,1) random variables can be written as
We simulate two independent normal N(0,1) r. v. from two
independent uniform r. v. U1, U2
by taking Q = 2pU1, R = Ö{-2lnU2} and using
formulas (65)-(66).
The name comes from the technique suggested by Problem 6.3.1. Instead of selecting points under the graph of f(x),
we pick another function g(x) > f(x), which has known antiderivative
with explicitly available inverse. We pick points under the graph of
g(x), and "reject" those that didn't
make it below the graph of f(x).
Often used density g(x) = c/(1+x2) leads to X = H(U1), where H(u) = tan(pu/c).
Rejection sampling can be used to simulate continuous or discrete distributions.
The idea behind using it in discrete case is to convert discrete distribution to
a function of continous random variable. For example, to use rejection sampling for
Poisson distribution, simulate the density
f(x) = e-ll[x]/[x]! and take the integer part [X] of the resulting
random variable.
For subsets of low dimension, a sequence of 0,1 can be identified with binary numbers.
Set operations are then easily converted to binary operations on integers/long integers.
Suppose we want to re-arrange elements a(1),..., a(n) into a random order. A quick method to accomplish this goal
is to pick one element at a time, and set it aside. This can be easily implemented
within the same sequence.
Problem 43
Let a1,... an be numbers such that åj aj = 0, åj aj2 = 1
Let X denote the sum of the first half of those numbers, after a random rearrangement.
Program listing on page pageref explains how to perform
double integration by simulation.
Here we concentrate on refining the procedure for single integrals by
reducing the variance.
To evaluate J = òab f(x) dx we may use the approximation
1/Nåj = 1N f(Xj)/g(Xj), where Xj are i.i.d. with the density
g(x) such
that òabg(x) dx = 1.
The error, as measured by the variance21, is
For smooth functions, a better approximation is obtained by selecting points more uniformly than the pure random
choice. The so called Sobol sequences are based on
sophisticated mathematics. Cookbook prescriptions can be found eg in
Numerical recipes.
Note: The reliability of the Monte Carlo Method, and the associated error bounds
depends on the quality of the random number generator. It is rather unwise to use an unknown random number generator
in questions that require large number of randomizations.
Example 55 The following problem is a challenge to any numerical integration
method due to rapid oscillations,
ò012sin2(1000 x) dx = 1-[1/ 2000]sin2000 » 1.0004367
The idea of stratified sampling is to select different number of points from
different subregions. As a simple example, suppose we want to integrate a smooth function
over the interval [0,1] using n points. Instead of following with the
standard Monte Carlo prescription, we can divide [0,1] into k non-overlapping
segments and
choose nj points from the j-th subinterval Ij.
An extreme case is to take k = n and nj = 1 - this becomes a variant of the
trapezoidal method. The optimal choice of nj is to select them proportional to
the local standard deviation of the usual Monte Carlo estimate of òIj f(x) dx.
Indeed, denoting by Fj the estimator of the integral over Ij, the variance
of the answer is Var(åj = 1k Fj) = åj Var(Fj) = åj sj2/nj. The minimum under the constraint åjnj = n is
nj » sj.
A simple variant of recursive stratified sampling is to generate points and subdivisions
based on estimated values of the variances.
Pr(X1+...+Xn > an) = òy1+...+yn > ane-ly1... e-lynf(y1)... f(yn) dy1... dyn. This leads to the following procedure.
Example 56
Suppose Xj are {0,1}-valued so that X1+... Xn is binomial
Bin(n,p). What is the distribution of Yj? Simplify the expression
e-lY1... e-lYn.
[ 26 Write the program computing by simulation the
probability
that
in
a n = 10 tosses of a fair coin, at least 8 heads occur. Once you
have a program that does for n = 10 a comparable job to the ``naive"
program below, try Exercise 1.5 on page
pageref.
Here is a naive program, that does the job for n = 10, but not for
n = 100. It should be used to test the more complex ``tilted
density" simulator.
PROGRAM heads.bas
Stochastic processes model evolution of systems that either exhibit
inherent randomness, or operate in an unpredictable environment. This unpredictability
may have more than one form, see Section .
Probability provides models for analyzing random or unpredictable outcomes.
The main new ingredient in stochastic processes is the explicit
role of time. A stochastic
process
is described by its position X(t) at time t Î [0,1], t Î [0,¥), or
t Î {0,1,...}.
From the conceptual point of view, stochastic processes that use discrete moments of time
t Î {0,1,...} are the simplest.
Since the discrete moments of time can represent arbitrarily small
time increments, discrete time models are rich enough to model real-world phenomena.
Continuous time versions are convenient mathematical
idealizations.
Mathematical models of time evolution of deterministic systems often
involve differential
equations.
Difference equations are discrete
analogues of the differential equations. Difference equations occur
in applied problems, and also when solving differential equations by series
expansions, or by Euler's method. The concepts of numerical versus
analytical solution, initial values, boundary values, linearity, and superposition of
solutions occur in the discrete setup in
complete analogy with the theory of differential equations.
We will consider only special cases
of
classes of equation
Example 57
Suppose a sequence yn is to satisfy
It is easy to write a short program that computes the values of yn.
But to compute the actual sequence we need to specify the initial
value y0. Table gives sample
outputs of such a program for several choices of y0.
-1
Similar numerical procedures show up in differential equations, where
they are used to approximate continuous solutions by discretizing time. The
procedure is called
Euler method, or tangent line method.
On the other hand, we may notice that equation (68) defines
the arithmetic progression. Instead of a table like Table
7.1, we can write down the solution for all
possible values of y0 and for all n. Namely,
Example 58
Here is another well known difference equation with obvious
solution. Suppose
Examples 7.1.1 and 7.1.1 are deceptively simple.
Not every difference equation has a simple or easy to guess answer.
Many interesting difference equations, including (68), (70), (72) fall into the category of
linear difference equations with constant coefficients.
The first order linear difference equations with constant coefficients has the form
A method to solve difference equation of order 2 consists of the following steps:
Suppose you borrow y0 dollars on fixed monthly interest rate r.
If you do not make any payments on your loan, then your balance will
``balloon" exponentially. Formula yn = (1+r)ny0 expresses monthly
balance after n months when no loan payments are made.
Some people prefer to pay a fixed amount of p dollars at the end
of each month. If p is large enough, then
the balance may even shrink down! This situations is easily described by
a difference equation. To write it down, compute the next month
balance, if the previous month balance is known:
In differential equations, we often solve a similar problem in
continuous time, with
continuous compounding and continuous payments schedule. This is a mathematical
simplification of the actual banking situation, but the answers are reasonably
close, and they are easier to get.
Here is an example of the mathematical problem that leads to a natural,
but non-trivial26
difference equation.
Example 59 Suppose we want to find the formula for the sum of all consecutive
integers from 1 to n. Let the answer by yn. Then the
recurrence formula
Now (74) can be written as yn+1-yn = n+1 and actually this is
why we use the name difference equations rather than recurrence
relations. Notice
that we do know the solution of its continuous analogue. Equation
y¢ = t+1
for unknown function y = y(t) resembles (74). Its solution27 is y(t) = 1/2t2+t.
So to find the answer to (74) we may try substituting
un = 1/2n2+n into the equation. Unfortunately, simple arithmetics shows that we
don't get the right answer, as
The method we used - subtracting the equations - works for
the so called linear difference equations (and also linear differential)
equations. It is closely related to the Principle of
Superposition for linear differential equations.
Before jumping to models that use explicit randomness in their evolution,
it is quite illuminating to analyze first some mathematically simple
and well defined ``deterministic evolutionary processes".
The following set of examples describes deterministic
evolution in discrete time of a system described by a single numerical parameter
x Î (0,1). All of the examples fall into the category of (non-linear)
difference equations: given initial value x0 Î (0,1) and a simple
evolution equation of the form xn+1 = g(xn), we are supposed to make
inferences about the behavior of the solutions.
There is no randomness in the evolution itself. But since we are allowed
to choose any initial condition, and no initial condition can be
measured exactly, we may as well consider the initial value to be random. Example 60
Equation
Example 61 Let {x} denote the fractional part of x. Equation
Here is a rather surprising fact. Suppose x0 is selected at random.
Since xk is a function of x0, and x0 is random, xk
becomes a random variable. It may
happen that xk < 1/2. Call this event Ak. The reasoning presented in
Section 6.1.1 implies that events {Ak} are independent
and have the same probability Pr(Ak) = 1/2. Therefore the deterministic evolution equation contains
at least as much randomness as the tosses of a coin.
The last example may perhaps give the impression that it is the
discontinuity of the evolution equation that is the source of
difficulties. This is not at all the case.
Example 62 Equation
Example 7.3 indicates that ``naive" prediction of the future of a
system
is unreliable even
within the realm of deterministic evolution equations.
On the other hand, there are aspects of the evolution that we can
analyze reliably. These deal with the average behavior of the evolution.
There are two classes of questions that we can answer, but both deal
with statistical nature of the evolution:
The quick way to get insights into operation of real systems is to model
their behavior.
Here are examples of enterprises that operate under randomness.
Mathematicians devised methods of modelling each of these. But it is
interesting also to simulate their behavior. Notice that simulations
require assumptions, but so do the analytical methods. Regardless of the
method, we have to be careful about what are the questions we can
answer.
Example 63
Suppose we want to study how much capital is needed to run a casino.
We need to answer the following.
There is also a number of questions that we do not want to answer.
There is also a number of questions that we do not want to answer.
For instance we do not want to predict whether I'll file an insurance
claim today, driving back home on I-74 without enough sleep since I was
preparing this class until 3AM.
Example 65 A construction company has n jobs to be performed
in the future. For each of these jobs,
experts provide estimate of the cost. Then, after questioning, they
complement the estimates by the lower/upper bound for the costs. How
much money should be allotted?
Example 66
A store averages l(t) customers per hour at time t of the
day. Each customer brings some (random) profit. However, a
customer may just leave the store without shopping, if the lines are too long.
A random walk is a process of the form Xn = åj = 1n xj, where xj
are i. i. d. Random walks have independent increments
, and describe the
accumulation of independent contributions over time.
Random walks are examples of Markov processes which will be studied in
detail in Chapter . Their special structure allows
to analyze them independently of the general
theory.
Definition 19
T:W® INÈ¥ is a stopping time, if the event {T = n} is
independent of xn+1,xn+2,....
This definition is specialized to random walks. In
a more general Markov case the definition is less transparent but captures the same idea.
The most important example of a stopping time is the first entrance time
T = inf{k: Xk Î A}. An example that is not a stopping time is
the last exit from a set.
When T < ¥ we define random sums XT = åj £ Txj. The following theorem is
an exercise when T is independent of x.
Theorem 21 If xj are i. i. d., Ex = m, and
ET < ¥ is a stopping time then
(These formulas are of interest in branching processes, and in
chromatography.)
Example 67 The number of checks cashed at a bank
per day is Poisson random variable N with mean l = 200. The amount
of each check is a random variable with a mean of $30 and a standard
deviation of $5. If the bank has $6860 on hand, is the demand likely
to be met?
Problem 44 Suppose xk, T are independent. Find the variance of XT in terms of the first two moments
of x, T.
The sample is injected into a column, and the molecules are
transported along the length by electric potential, flow of gas, or
liquid. The basis for chromatographic
separation of a sample of molecules is difference in their physical
characteristics. The molecules switch between two phases: mobile, and stationary,
and the separation of compounds is caused by the difference in
times spend in each of the phases.
Suppose that the molecules of a compound spend random independent amounts of
time Uk in mobile phase and random amount of time Wk in the
stationary phase. Thus at time t the position of a molecule is given
by a random sum våj = 1T(t) Uj, where T(t) = inf{k: åj = 1kUj+Wj > t}.
Section 7.5.1 gives formulas for the mean and the
variance of the position. Since the number of transitions T
is likely to be large for a typical molecule, it isn't surprising that the actual position
has (asymptotically) normal distribution. (The actual Central Limit Theorem for random sums
is not stated in these notes.)
[ 27 Simulate the output of the chromatography column of fixed
length separating a pair of substances that have different distributions of
mobile and stationary phases Uk, Wk. Select a hundred particles of
each substance, and measure the degree of separation at the end of the
column.
Suppose a gambler can afford to loose amount L > 0, while the casino has
capital C < 0. Let xj = ±1 be i. i. d. random variables modelling the
outcomes of consecutive games, Sn be the partial sums (representing
gains of the gambler), and let T = inf{k:Sk ³ L or Sk £ C} be
the
total
number of games played. Then Pr(T > k) = Pr(C < Sk < L) and thus
ET = åkPr(C < Sk < L).
The special case Pr(x = ±1) = 1/2 is easily solved, since in this case
EST = ExET = 0. Let p = Pr(ST = C) denote the
probability of ruining the casino. Since ST is either
C, or L
we have 0 = EST = pC+(1-p)L, giving p = L/(L-C). This formula means that
a gambler has a fair chance of ruining a casino in a fair game,
provided he brings with him enough
cash L.
For more general random walks (and less fair games) probability of gambler's ruin
can be found explicitly using the
one-step-analysis (Section ).
It is also interesting to find
how long a game like that would last on average. (The expression for ET given above is not
explicit.)
Let Xt denote the number of bacteria in t-th generation, with
X0 = 1. Assume that a bacteria can die with probability q > 0, or
divide into two cells with probability p = 1-q, and that all deaths
occur independently.
Our goal here is to find the average number of bacteria m(t) = E(Xt) in
the t-th generation. This can be recovered from Theorem 7.5.1. Instead, we
show another method based on conditioning.
The number of bacteria in the next generation is determined by binomial probabilities:
Pr(Xt+1 = 2k|Xt = n) = (nk)pk qn-k. Therefore E(Xt+1|Xt) = 2p Xt and
the average population size m(t) = E(Xt) satisfies difference equation
m(t+1) = 2p m(t). We have m(t) = (2p)t.
In particular, the population on average grows exponentially when p > 1/2.
It is perhaps surprising that a population of bacteria with p = 3/4,
which on average
grows by 50% per generation, has still a 1/3 chance of going extinct.
One way to interpret this is to say that infections by a ``deadly" and
rapidly developing desease may still
have a large survival rate without any intervention of medicine, or immune system.
(The methods to compute such probabilities will be introduced in
Section . The answer above assumes infection by a single cell.)
Problem 45
Find the formula for the variance Var(Xt) of the number of bacteria
in t-th generation.
Problem 46
What is the probability that an infection by 10 identical bacteria with the doubling
probability p = 3/4 dies out?
evolution, n. [L. evolutio (-onis)], an unrolling or opening...
Markov processes are perhaps the simplest model of a random evolution
without long-term memory.
Markov process is a sequence Xt of
random variables indexed by discrete time t Î ZZ\scriptscriptstyle + , or continuous
t ³ 0 that satisfies the so called Markov property. The set of all possible values of variables Xt is
called the state space of the Markov chain.
Typical examples of state spaces are IR, IN, the set of all
non-negative pairs of integers, and finite
sets.
Markov chains are Markov processes with discrete
time. Thus a Markov chain is an infinite sequence
{Xk}k Î ZZ\scriptscriptstyle + of (usually, dependent) random variables with short-term
(one-step) memory.
The formal definition of the Markov property is as follows.
Definition 20 A family of discrete r. v. {Xk}k Î ZZ\scriptscriptstyle + is
a Markov chain, if
Examples of Markov chains are:
Examples of non-Markov processes are easy to construct, but lack of Markov propertry is
not obvious
to verify. In general, if Xk is a Markov process, Yk = f(Xk) may fail
to be Markov.
Markov condition
Example 68 Suppose Xn is a Markov chain with periodic transition
probabilities Pn = Pn+T. Then Yn = (Xn+1,Xn+2,..., Xn+T)
is a homogeneous Markov chain.
Problem 47 Suppose xj are independent
{0,1}-valued with Pr(x = 1) = p. Let Xn = axn+bxn+1, where
ab ¹ 0.
Explain why Xn is a Markov chain.
Write
the transition matrix for the Markov chain Xn.
[ 3 The probabilities Pr(Xn = j|X0 = i) are given by
the i,j-entries of the matrix Pn
Proof.
This is the consequence of Markov property (78) and
the total probability formula (10).
[¯]
Suppose Pr(X0 = k) = pk, where pk solve stationarity equations
Then the resulting process is stationary:
the distribution of each k-tuple
(X(t1),...,X(tk))
is invariant
under shifts in time,
(X(t1+s),...,X(tk+s)) @ (X(t1),...,X(tk)).
If Xt is a stationary Markov process and f is a function on its state space,
then Y(t) = f(Xt) is also stationary, although not necessarily Markov.
If Xj(t) are independent realizations of
the same Markov process and f is a function on their state space then
Yn(t) = n-1/2åj = 1n (f(Xj(t))-m) is stationary and approximately
normal random sequence.
Finite state Markov chain is regular, if there is deterministic number N such that all
states are connected by paths of length at most N.
Problem 48
Show that regular Markov chain is aperiodic and irreducible.
Additive functionals of a Markov process are expressions of the form
1/nåj = 0n-1 f(Xj). Under certain conditions, the averages
converge and the limit doesn't depend on the initial distribution. Under certain conditions,
partial sums are approximately normal.
Problem 49
Let Xk be an irreducible {0,1}-valued Markov chain with invariant
initial distribution.
For regular (finite state space) Markov chains the limit actually exists
independently of the initial state. Therefore the stationarity equations can be used to find
the limit.
Problem 50
Let
P = [
Problem 51
Let
P = [
Problem 52
Let
P = [
Suppose Xk is a Markov chain with transition matrix
P = [
If 0 < a,b < 1
Problem 53
Suppose Xk is a Markov chain with transition matrix
[ 29
Suppose n people on a committee discuss a certain issue. When one person
finishes speaking, we assume that it is equally likely that any of the other
n-1 begins. We also assume
that each person speaks for exponentially distributed random time.
How long does it take on average for all participants to take part in
the discussion?
For each possible configurations of teams, we could determine the
average score, and thus
determine which player distribution is the best, if there is one.
If there is no best single arrangement that wins against all other arrangements, then we should still be able to find the
optimal mixed strategy.
[ 30 For the Markov chain defined above,
select a pair of configurations and determine the following.
The optimal strategy in a game is the best (in the above minimax sense) randomized
strategy against every deterministic strategy of the opponent.
Problem 54 What is the ``optimal" arrangement of players in soccer?
Example 69 In the setup of Section 7.5.3, suppose
Pr(x = 1) = p = 1-Pr(x = -1) . Let T be the first time random walk
reaches L > 0 or C < 0. Find the probability of winning (ruining the
casino) Pr(XT = C|X0 = 0).
Note that evven though we are interested in a single number Pr(XT = C|X0 = 0),
the fist-step analysis requires computing probabilities
p(x) = Pr(XT = C|X0 = x) for all initial positions x.
Example 70 Suppose Pr(x = 1) = p, Pr(x = -1)1-p . Let T be the
first time random walk reaches L > 0 or C < 0. Find the average length of
the game E0(T).
Note that the fist-step analysis requires computing m(x) = Ex(T) for
all initial positions x.
Problem 55 On average, how long does it take for a symmetric random walk
on a square lattice to exit from a d×d square?
Problem 56
For the Markov chain defined in Section 8.2.1,
select a pair of player configurations and determine the following.
Problem 58 One of the quality control rules is to stop
the process when on k = 8 successive
days the process runs above, or below, specification line.
On average, how often a process that does not need any adjustment will
be stopped by this rule?
Suppose that you have an insurance contract for a vehicle. The premium
payment is due yearly in advance and there are four levels
P1 > P2 > P3 > P4. The basic premium is P1, unless no claims were
made in the previous year. If no claims were made in a given year, then
the premium for the following year is upgraded to the next category.
Suppose that the probability of the damage larger than s hundred
dollars is e-s (substitute your favorite density for the
exponential density used in this example). Because of the incentive of
lowered premium, not all damage should be reported. The goal of this
example is to determine numbers s1,s2,s3,s4 above which the claims
should be filed; sj is a cutoff used in a year when premium Pj is
paid.
Let Xt be the type of premium paid in t-th year. Clearly, the
transitions are i® 1 with probability e-si, 1® 2
with probability 1-e-s1, etc.
In the long run, the yearly cost is
[ 31
Find optimal sj when
P1 = 800, P2 = 720, P3 = 650, P4 = 600 (about 10% discount).
Each player has
to toss the dice at least once per turn. The player can choose to
toss the dice as many times as (s)he
wishes as long as no ace occurs. However the current total
is added to player's score
only when the player ends his turn voluntarily.
If an ace occurs, then the player's turn is over, and his score is not
increased. If two aces occur then the score is set to zero.
A player chooses the following strategy:
toss the dice for as long as
an ace occurs, or the sum of current subtotal+score exceeds number K.
(If her score exceeds
K she tosses the dice once, as this is required by the rules.) The
quantity to optimize is the average
number of turns that takes the player to reach the score of a hundred. The number of turns
under this strategy is the number of aces when the score is less than K, plus the number
of throws when the score is larger than K.
If Xk denotes players score after the k-th throw, then
clearly Xk is a Markov chain with the finite number of states,
and with chance [1/ 36] of returning to zero at each
turn.
Which value of K minimizes the average number of turns that takes the
player to reach a hundred?
A simple-minded computation would just take into account the average
gain and disregard the score completely. If the players subtotal is t
then after
the additional throw the total is on average 23/36×(t+7). This
is less then t for t ³ 12, thus there is no point
continuing beyond 12. Is this conclusion correct? Should the
player choose to stop once the total on the dice exceeds 12?
[ 32
What is the average number of
turns
it takes to reach a 100 under this strategy?
A less simple-minded computation would take into account the average
gain: If the players score is s then after the additional throw his
score on average is 1/36 ×0+ 1/3 ×s + 23/36×(s+7)
which is more then s for s < 7×23. Is this conclusion correct?
Should the player always choose to toss again? [ 33
What is the average number of
turns
it takes to reach a 100 under this strategy?
[ 34
What is the average number of
turns
it takes to reach a 100 under the cautious strategy of no additional tosses?
Another computation would take into account the current score s and
current subtotal t. After the additional throw the numbers on average
are s1 = 1/36 ×0+ 35/36 ×s, t1 = 23/36×(t+7). On average,
t1+s1 > t+s when t < 12-s/13.
More complicated strategies could depend on current totals of other
players, current score of the player, and the subtotal in some more
complicated fashion. For instance, if s denotes the score, t denotes
current subtotal, one could stop using two-parameter criterion t > A-Bs.
This may have seemingly different effects than the previous strategy:
when A is large, and score s is close to 0 there is no
need for stopping; but if accumulated score
s is large, the player may behave more cautiously.
One could optimize the probability of winning against k = 3,4
players instead of just minimizing the average number of tosses. The latter puts this
problem into game theory framework. Simulations seem to indicate that
the strategy based on t < 25-1/9s works well against inexperienced human players.
Theorem 23 State x is recurrent iff åPx,x(t) = ¥.
Proof. Let M be the number of times Xt returns to state x.
Let f be the probability of returning to x. State x is recurrent
iff f = 1. Suppose f < 1 and let M be the number of returns. Clearly
Prx(M ³ k) = fk,
thus Ex(M) = f/(1-f) < ¥. Since M = åt I{Xt = x},
Ex(M) = åPx,x(t).
[¯]
Theorem 24
Let T be a return time, and suppose m(x) = ExT < ¥. Then
Px,x(t)® 1/m(x) as t®¥.
Problem 59 Suppose Markov chain Xt moves to the right
k®k+1 with probability 1/k or returns to 1 with probability
1-1/k, k = 1,2,.... Find its stationary distribution, and the
average return time to state k.
This sections describes a more refined method for randomized search of
minima of functions.
Suppose a finite set V is given, and we are to minimize a function U:V®IR.
Theorem 25
The invariant measure of transition matrix (83) is pq(u) = 1/Z eqU(u)
Proof.
To check the invariance condition, denote \cal N(u) = {v: (u,v) Î E}.
[¯]
{
PROGRAM fasttour.bas
In the example below we limit our attention to the one-dimensional
problem. This simplifies the notation, while the basic ideas are the
same.
Let u(x,t) be an unknown function of x Î IRd, t ³ 0.
The difference equation we may want to solve is the following discrete analog of the diffusion
equation.
The solution is u(x,t) = E(u0(St)), where St = åj £ tej
is the sum of independent random variables with 2d equally likely
values ±ek Î IRd.
Suppose xk is a stationary Markov chain and let Xn be the solution of
the difference equation Xn+1-a Xn = xn+1. One can write the transition Matrix
for Markov process Xt and try find the stationary distribution for X0.
A more direct method is based on the fact that the solution of the
difference equation is
Xt = atX0+at-1x1+...+axt-1+xt. Therefore if |a| < 1, the stationary
initial distribution is X0 = åak xk. Thus Xt = åk = 0¥ ak xt-k
Problem 60
Suppose xk are i. i. d. Find the covariance EX0Xk.
Problem 61
Suppose xk are i. i. d. Find E(Xk|X0).
Solutions of higher order difference equations can be easily out into
the Markov framework, too. If Xn+2+aXn+1+bXn = xn+1 then
Yn = (Xn+1,Xn) is Markov and satisfies the corresponding equation
in matrix form: Yn+1 = AYn+Xn+1. Therefore the stationary
solution exist provided that the eigenvalues of A satisfy inequality
|lj| < 1.
To sort efficiently a set S
of numbers into ascending order we should find a number y such that about half of
the elements of S is below y.
Then the total number of steps required is T(n) £ T(/n/2)+n+C(n), where
C(n) is the number of comparisons required to find y.
Random Quick Sort is based on an idea that a random
choice of y is good enough on average.
Theorem 26
The expected number of comparisons in random quick sort is at most 2nln(n+1).
Here is one possible realization of the subprogram:
PROGRAM qsrt.bas
Theorem 27
Suppose g(x) is increasing (non-decreasing) function and Xt is a Markov chain
on IN that moves left only and E(Xt+1|Xt = m) £ m+g(m).
Let T be the time of reaching 1. Then En(T) £ ò1n [1/ g(x)] dx.
Proof. By induction
En(T) £ 1+Eò1X [dx/ g(x)] = 1+ò1n[dx/ g(x)]-EòXn[dx/ g(x)] £ 1+ò1n[dx/ g(x)]-EòXn[dx/ g(n)] £ 1+ò1n [dx/ g(x)]+E[(X-n)/ g(n)]
[¯]
Problem 62
Estimate the average running time of the above algorithm. (ANS: ET £ 4lnn).
Suppose certain objects multiply independently and in discrete time
intervals. Each object at the end of every period produces a random
number x of descendants (offspring) with the probability
distribution pk = Pr(x = k). Let Xt be the total number of
objects at t-th generation.
Then in distribution Xt+1 = åj = 1Xtxj. The three
questions of interest are the average size of the population, its
variance, and the probability of extinction.
Definition 21 By extinction we mean
the event that the random sequence {Xt} consists of zeros for all
but the finite number of values of t Î IN.
Probability of extinction by time n can be found directly from the
first-step-analysis: numbers un = Pr(Xn = 0) satisfy
Let mn = E(Xn), Vn = Var(Xn). By Theorem 7.5.1 and induction we
have
Theorem 28 [Watson(1874)] The generating function Hn(z) of Xn
is the n-fold composition (iteration) of g.
Proof. E(zXn+1|Xn = k) = (g(z))k. Thus by total probability
formula (46)
Hn+1(z) = EzXn+1 = åk(g(z))kPr(Xn = k) = Hn(g(z)).
[¯]
Problem 63 Prove (87) using moment generating function
directly.
Notice that if there is a limit of q = limn®¥g°(n)(0), then it
has to satisfy the equation
Since Xt is integer valued and the events An = {Xt = 0} are
decreasing, by continuity (1) the probability of extinction
q = limn®¥Pr(Xn = 0) exists.
Theorem 29 If EX1 £ 1, the extinction probability q = 1. If
EX1 > 1, the extinction probability is the unique nonnegative solution
less than 1 of the equation (89).
Proof.
This is best understood by graphing g(s) and marking the iterates on the diagonal.
Check by induction that gn(0) < 1 for all n Î IN. If there is a
solution s0 < 1 of g(s) = s, then it is the ``attractive point" of the
iteration.
[¯]
The generating function is g(z) = q+(1-q) z2. Asymptotic
probability of extinction solves quadratic equation
(1-q)z2-z+q = 0. The roots are z1 = 1 and z2 = [(q)/( 1-q)].
In particular, probability of extinction is 1 when
q ³ 1/2.
When q = 1/4 probabilities of extinction at n-th generation are
u0 = 0,u1 = .25, u2 = .296875, u3 = .3161, u4 = .3249, u5 = .33127, u6 = .3323.
The generating function is
g(z) = (1-rq) +q(1-r) r[z/( 1-rz)]. The most interesting feature of this moment generating
function is that it can be readily composed.
[ 2 The composition of fractional linear functions
f(z) = [(a+bz)/( c + dz)] and g(z) = [(A+Bz)/( C + Dz)] is a fractional
linear function (with the coefficients given by the matrix
multiplication!).
Asymptotic probability of extinction has to solve quadratic equation
1-rq+q(1-r) r[z/( 1-rz)] = z. The roots are
z1 = 1 and z2 = [(q)/( 1-q)]. In particular, probability of
extinction is 1 when q ³ 1/2.
Since the iterates of the generating function can actually be written
down explicitly, in geometric case Pr(Xn = 0) = 1-mn(1-pe)/(mn-pe)
is explicit. Here pe = z2, m = ....
Problem 64 Suppose that in a branching process the number of
offspring of the initial seedling has a distribution with generating
function F(z). Each member of the next generation has the number of
offspring whose distribution has generating function G(z). The
distributions alternate between generations.
Problem 65
In a simple model of linear immunological response, the doubling probability
p of the population of bacteria changes with time due to the increased number of
lymphocytes.
If there are X(t) bacteria at t-th generation, then assume
p = a/(t+a). Find the probability ut of extinction by t-th generation
for
infection by 10 bacteria. What is the average length of the disease?
Univariate normal distribution, standardization,
and its moment generating function were introduced in Chapter
4. Below we define multivariate normal distribution.
We follow the usual linear algebra notation. Vectors are denoted
by small bold letters x, v, t, matrices by capital bold
initial letters
A, B, C
and vector-valued random variables by capital boldface X, Y, Z;
by the dot we denote the usual dot product in IRd, ie.
x·y: = åj = 1d xjyj;
||x|| = (x·x)1/2 denotes the usual Euclidean norm.
For typographical convenience we sometimes write (a1,¼,ak) for the
vector
[
Below we shall also consider another scalar product
á·,·ñ associated with the normal
distribution; the corresponding semi-norm will be denoted by the
triple bar |||·|||.
Definition 22 An IRd-valued random variable
Z is multivariate normal, or Gaussian (we shall use
both terms interchangeably; the second term will be preferred
in abstract situations) if for every t Î IRd the real valued
random variable t·Z is normal.
Example 71
Let x1,x2,¼ be i. i. d. N(0,1).
Then X = (x1,x2,¼,xd) is multivariate normal.
Example 72
Let x be N(0,1).
Then X = (x,x,¼,x) is multivariate normal.
Example 73
Let x1,x2,¼ be i. i. d. N(0,1).
Then
X = (X1, X2, ¼, XT), where
Xk = åj = 1kxj
are partial sums, is multivariate normal.
Clearly the distribution of univariate t·Z is determined
uniquely by its mean m = mt and its standard deviation
s = st. It is easy to see that mt = t·m,
where m = EZ. Indeed, by linearity of the expected value
mt = Et·Z = t·EZ. Evaluating the moment generating
function M(s) of the real-valued random variable t·Z at
s = 1 we see that the moment generating function of Z can be
written as
C = [
CY = [
|||[
Denoting r = sin2q, it is easy to check that
This implies that the joint density of Y1 and Y2 is given by
Another representation
The latter
representation makes the following Theorem obvious in the bivariate case.
Theorem 30 Let X,Y be jointly normal.
(i) If X,Y are uncorrelated,
then they are independent.
(ii) E(Y|X) = m+AX is linear
(iii) Y-AX and X are independent.
Clearly, m = 0. Equation (91) expresses X as a linear transformation
X = Ag of the i. i. d. standard normal vector with
Expression 1/2 ||Dx||2 can be interpreted as proportional to the
kinetic energy of the motion described by the path x; assigning probabilities
by Ce-Energy/(kT) is a well known practice in statistical physics. In
continuous time, the derivative plays analogous role.
To simulate any d-dimensional normal distribution we need only to simulate d
independent N(0,1) random variables and use linear representations like
in Theorem . For such simulation the covariance matrix needs
to be inverted and diagonalized, a numerical nuisance in itself. When the
multivariate normal distribution occurs as the so-called time series, a method
based on Fourier expansion is then convenient, see Section , or the introductory examples in
Chapter
.
The linear algebra results imply that the moment generating function
corresponding to a normal distribution on IRd can be written in
the form
Theorem 31
The moment generating function corresponding to a normal random variable
Z = (Z1, ¼, Zd) is given by (93),
where m = EZ and C = [ci, j], where ci, j = Cov(Zi, Zj),
is the covariance matrix.
From (93) and linear algebra we get also
We have the following multivariate
generalization of (55).
Theorem 32
Each d-dimensional normal random variable Z has the
same distribution as m+Ag,
where m Î IRd is
deterministic, A is a (symmetric) d×d matrix
and g = (g1, ¼, gd) is a random vector
such that the components g1, ¼, gd are
independent N(0, 1) random variables.
Proof. Clearly,
Eexp( t·(m+Ag)) = exp(t·m) Eexp( t·(Ag)). Since the
moment generating function of g is
Eexp( x·g) = exp1/2 || x|| 2 and
t·(Ag) = (ATt)·g, therefore we get
Eexp( t·(m+Ag)) = expt·mexp+1/2 || ATt|| 2, which is another form of
(94).
[¯]
Theorem 33 If Z is centered normal with the nonsingular
covariance matrix C, then the density of Z is given by
In the nonsingular case the density expression
implies strong integrability.
Theorem 34 If Z is normal, then there
is e > 0 such that
Remark 3 Theorem 10.4.1 holds true also in the singular
case and for Gaussian random variables with values in infinite dimensional
spaces.
Theorem 35
If (X, Y) has jointly normal distribution on IRd1+d2,
then
Vector a = mX-QmY and matrix Q are
determined by the expected values mX, mY and by the
(joint) covariance matrix C (uniquely if the covariance CY
of Y is non-singular). To find Q, multiply (95)
(as a column vector) from the right by (Y-EY)T and take the
expected value. By Theorem (i) we get
Q = R×CY-1, where we have written C as the (suitable) block
matrix
C = [
Problem 66
For the random walk from Section 10.2.1,
what is E(Xk|X1,...,Xk-1,Xk+1,...,Xn)?
Problem 67
Suppose X1,...,Xd are jointly normal, EXj = 0, EXj2 = 1 and all covariances
EXiXj = r are the same for i ¹ j.
Find E(X1|X2,X3,...,Xd). (Notice that in this example r > -1/d.)
Suppose (Xt)t = 0,1,... is a Markov chain with multivariate normal
distribution. That is, suppose X0 is normal, and the transition probabilities
are normal, too.
Without loss of generality we assume EXt = 0, EXt2 = 1 and let
EX0X1 = r. Then E(Xt+1|Xt) = rXt and therefore
EX0Xt = rt. This means that the covariance matrix of the Markov Gaussian process
depends on one parameter r only. Comparing the answer with Section
8.7 we find out that all homogeneous Markov Gaussian processes
have the form
Xt = åk = 0¥ gk+trk, where gk are
independent normal r. v.
Poisson distribution occurs as an approximation to binomial. Another
reason for its occurrence is related to exponential random variables and
counting customers in queues. The latter is perhaps the most efficient
way of simulating Poisson random variables, see Section 6.2.4. A
related reason for the frequent use of Poisson distribution in modeling
is the law of rare events.
Definition 23 A Poisson process of intensity l > 0 is an
integer-valued stochastic process {Nt:} with the following
properties.
Theorem 36 For t > 0,s > 0 random variable
N(t+s)-N(t) is independent of Nt and has Poisson distribution
Poiss(sl).
Proof. We will prove only that Nt has the Poisson
distribution. To simplify notation assume that exponential r.`v. have
parameter l = 1. We prove the formula
Pr(Nt = k) = [(tk)/ k!]e-t by induction.
k = 0: Pr(Nt = 0) = Pr(T > t) = òt¥ e-x dx = e-t
Suppose the formula is true for k. Then
Problem 68 Assume a device fails when a cumulative effect of k
shocks occurs. If the shocks occur according to the Poisson process with
parameter l, what is the density function for the life T of the
device?
Problem 69 Let f(x,t) = Ef(x+Nt), where Nt is the Poisson
process of intensity l = 1. Show that
[(¶f)/(¶t)] = f(x+1)-f(x).
In particular, pt(k) = Pr(Nt = k) satisfies [(¶pt(k))/(¶t)] = pt(k+1)-pt(k).
Problem 70 Customers arrive at a facility at random
according to a Poisson process of rate l. The customers are
dispatched (processed) in groups at deterministic times
T, 2T,3T,....
There is a waiting time cost c per customer per unit of time, and a
dispatch cost K.
Problem 71 Find the mean ENt, variance Var(Nt) and the
covariance cov(Nt,Ns).
Problem 72 Let
X(t) = (-1)Nt. Find the mean EXt, variance Var(Xt) and the
covariance cov(Xt,Xs).
The rate l in the Poisson process has probabilistic interpretation:
Theorem 37
Nt = N((0,t]) is the Poisson process of intensity l.
Example 74
A careless programmer assigns memory locations to the variables in his
program30
at random . Suppose that there are M® ¥ locations and
N = lM variables. Let Xi be the number of variables assigned to each
location. If each location is equally likely to be chosen, show that
Example 75
While testing a program, the number of bugs discoverd in the program follows the Poisson
process with intensity l = 5 errors per hour.
Tester's fiance enters the test area and agrees to wait for the tester
to find just one more bug.
How long will she wait on average: 12 minutes, or 6 minutes?
The Poisson process Nt counts the number of events. If each event
results in a random (and independent) outcome xj, then the total is
the compound Poisson process Z(t) = åj = 1Nt xj.
The moments and also the moment generating function of Z(t) can be
determined through conditioning. Section 7.5.1 implies
that if Ex = m, Var(x) = s2 then E(Z(t)) = lmt,
Var(Z(t)) = l(s2+m2)t.
If the damage accumulates additively, then Z(t) represents the total
damage sustained up to time t. Suppose that the system continues to
operate until this total damage is less than some critical value c and
fails otherwise. Then the (random) failure time T satisfies T > t if
and only if Z(t) < c. Therefore
Pr(T > t) = ån = 0¥Pr(åk = 1nxk £ z|Nt = n)(lt)n e-lt/n!. Thus
ET = [1/( l)]ån = 0¥ Pr(åj = 1nxj £ c). In
particular if xk are exponential ET = [(1+cm)/( l)].
Given a discrete-time Markov chain, there are many ways of ``running it"
in continuous time. One particular method is to make the moves at random
moments of time. If these instances are exponential, then the resulting
continuous-time process is Markov, too. This is the so-called
embedded Markov chain.
Non-pathological continuous time Markov processes with countable state
space have embedded Markov chain
representation. In such representation we run a continuous time clock
based on the independent exponential random variables. Once the time
comes, we select the next position according to the transition
probabilities of a discrete-time Markov chain.
The theory of continuous time Markov chains (that is - processes with
countable state space) is similar to discrete time theory. The linear
first-order difference equations for probabilities are replaced by the
systems of first-order differential equations.
Example 76 Suppose Xn is a two-state Markov chain with the following
transitions:
0® 1 with probability a, 1® 0 with
probability b. From Section 8.1 we know that
P(Xk = 1)® a/(a+b) as k®¥.
Let Tk be i. i. d. exponential r. v. and let Y(t) = Xk when
T1+...+Tk < t < T1+...+Tk+1.
Function p(t) = Pr(Y(t) = 1) satisfies differential equation:
p¢(t) = -lp(t)+lb p(t)+ la (1-p(t). Indeed,
Pr(Y(t+h) = 1) » Pr(Y(t) = 1, T > h)+Pr(Y(t) = 0,T < h).
Therefore p(t) = a/(a+b)+b/(a+b) exp(-l(1-b+a) t).
Here is a slightly different method to run the continuous time
finite-valued Markov chain.
Pick the initial value according to prescribed distribution. Then select
an exponential random variable for each of the possible transitions.
Each of these can have different parameter lk. Then select the
smallest, T = min Tj. It can be shown that T is exponential with
parameter ålj and that
Pr(T = Tj) = [(lj)/( åklk)].
Let pk(t) = Pr(X(t) = k). Then
pk¢(t) = -(a+b)kpk+(k+1)bpk+1+a(k-1)pk-1. Population average
m(t) = åk k pk(t) satisfies m¢(t) = (a-b)m(t), thus
m(t) = e(a-b)t.
Problem 73 Suppose Xt is a Markov process whose birth rate is
an+a and death rate is bn with b > a. This describes a population
of species that die out in a given habitat, but have a constant rate of
``invasion". One would expect that such competing effects will result
in some sort of equilibrium. Find the average population size as
t®¥.
[ 36 Customers arrive at a burger outlet at a rate l, and after
exponential service time with mean 1/m leave.
Continuous-time Markov processes with uncountable state space require more
advanced mathematical tools. Only Gaussian case can be briefly mentioned here.
Definition 24 A stochastic process {Xt}0 £ t < ¥
is Gaussian, if the n-dimensional r. v.
(Xt1,¼, Xtn) has multivariate normal distribution for all n ³ 1
and all t1,¼,tn Î [0,¥).
The simplest way to define the Wiener process is to list its properties as follows.
Definition 25 The Wiener process {Wt} is a Gaussian
process with continuous trajectories such that
A stochastic process {Xt}t Î [0,1] has continuous trajectories
if it is defined by a C[0,1]-valued random vector, or if all of its
paths are continuous31. For
infinite time interval t Î [0,¥), a stochastic process has
continuous trajectories if its restriction to t Î [0,N] has
continuous trajectories for all N Î IN.
Conditions (97)-(99) imply that the Wiener
process has independent increments, ie.
W0, Wt-W0, Wt+s-Wt, ¼ are independent.
Series expansions for the Wiener process are available in the
literature. One way to obtain these is from Fourier expansion for
the covariance function.
Problem 74 Let u(x,t) = Ef(Wt+x), where f is a smooth function. Show that
u satisfies the parabolic equation
[(¶u)/( ¶t)] = 1/2 [(¶2 u)/( ¶x2)].
Problem 75
What partial differential equation is solved by u(x,t) = Ef(aWt+x+bt) when a,b are non-zero constants?
Scaled Wiener process is a good model of diffusion. Use the two-dimensional
Wiener process to model a
two-dimensional
diffusion. The two-dimensional Wiener
process is obtained from two independent one-dimensional
components.
The
diffusion coefficient a2 has units
[length2/time] and is implemented by scaling the Wiener process: aWt has
diffusion coefficient a2.
[ 38 An eye-irritant pollutant is emitted from a factory
chimney located at x = 0, y = 10 on the xy plane, and the wind blows
left-to-right with velocity v(y) = y at height y. At a distance
L = 100 down-wind, there is a residential building of height 15. Which
floor of the building is polluted the most? Is there a significant
difference in pollution level between the floors? (Assume that the
distances are in units such that the diffusion coefficient is 1.)
This chapter contains additional topics on discrete time processes.
We begin with several examples of possible simulations. The resulting random
curves are very different, but they have the same ``mean-square" behavior.
Example 78 Suppose we wish to pick a curve at
random. In other words, we need a random function X(t) of integer
parameter t. Here is one way to do it: take
X(t) = åkgkakcos(kt), where gk are i. i. d. normal N(0,1). To ensure
the series is convergent we need åk ak2 < ¥; thus we can
assume ak = ò02p g(q)cos(kq) dq. The theory
of Fourier series tells us that if òg2(q) < ¥, then the
coefficients ak are square-summable.
Example 79 Suppose we wish to pick a curve at random. Here is one
way to do it: take X(t) = åcj cos(2pt+Fj), where Fj
are independent and uniformly distributed on the interval (0,2p).
Again select ck = ò02p g(q)cos(kq) dq.
[ 39
Write simulations of the curves as described. Pick as g
a trigonometric polynomial, say
g(q) = 1+cos(q)cos(2q).
Stationary processes are those that their probabilistic characteristics
(distributions, conditional distributions, moments, covariances) do not
change with time.
Example 80 Suppose Xn is a Markov chain, and
Pr(X0 = j) = p(j), where pj is its invariant distribution. Then
(X0,Xk) @ (Xn,Xn+k).
For second-order stationary processes only means, variances, and
covariances do not change with time. That is,
m(t) = EX(t) = const,cov(X(t), X(s)) = K(t-s). The second-order theory of processes is a very coarse
theory. Nevertheless, it does solve the best linear
prediction problem.
[ 4
In each of the introductory Examples, E X(t) = 0 and
cov(X(t),X(s)) = K(t-s), where
K(t) = ò02p cos(qt)f(q) dq.
The covariance K(t) of the weakly stationary process is a
positive definite function.
That is, åci cj K(ti-tj) = E|åj cjX(tj)|2 ³ 0 for
all cj Î \sf CC, tj Î IR. In addition, K(t) = K(-t).
Theorem 38
Given a positive definite even function K:ZZ® IR, there
is a (0,2p)-valued random variable Q such that for all integer t
The following theorem follows immediately from the proof of Theorem 5.2.
The assumption holds true in
particular when Xn has spectral density.
Theorem 39 Suppose Xn is weakly stationary. If cov(X0,Xn)® 0 as
n® ¥ then 1/n(X1+...+Xn)® EX in L2-norm.
Proof. Compute the variance, and use the Cesaro summability result (Theorem
).
[¯]
The covariance argument can be rewritten in spectral notation. Suppose
EX0Xk = (2p)-1ò-pp eiks f(s) ds. Then
E(X1+...+Xn)2 = (2p)-1ò-pp |åk = 1neiks|2 f(s) ds,
so
Var(1/n(X1+...+Xn)) = (2p)-1[1/( n2)]ò-pp [(sin2 (1/2ns))/( sin2 (1/2s))] f(s) ds® 0 as n®¥.
In the linear prediction problem we deal with linear estimators
a0+åjajXj. The quadratic error involves variances, covariances and averages only.
Thus it is appropriate to handle this in through the second order processes, and
the solution should depend on the density of Q - the so called spectral
density only.
The general Hilbert space theory tells us that the best linear prediction of
Xt+1 based on the past is åj = 0t ajXj with coefficients aj
such that EXj(Xt+1- åj = 0t ajXj) = 0 for all 0leq j £ t. This is a system of t+1 linear
equations for t+1 unknown coefficients.
Gramm-Schmidt orthogonalization allows to replace Xj by orthonormal xj.
Optimal prediction uses åj = 0t aj xj with aj = EX0xt-j.
Suppose xk are i. i. d. A stochastic process Xt is an autoregressive
process, if it satisfies a linear difference equation
Example 82
A moving average Xt+1 = 1/dåj = 1dxt-j is an autoregressive
process. What is the corresponding difference equation (101)?
For autoregressive process the optimal one-step prediction of Xt+1 is
More general autoregressive sequences are defined as solutions of
These generalize simultaneously autoregressive and moving average processes.
First we describe the population at a single instance. We consider the
model in which each hereditary character is carried by a pair of genes.
For simplicity, we assume only one (pair) of a gene, and only one hereditary
character corresponding to one locus. There are several possible
alleles (categories) in a locus - we assume there are only two, denoted by
a,A, of which A is dominant. Each individual thus is described by
one of the pairs AA, Aa, aa, the so called phenotype
. However, a is obstructed from the view
by A, thus for an outside observer individuals AA and Aa are
indistinguishable. A statistical study of such a population can
only yield the proportion PA of individuals with "A-feature", and not
the actual probabilities of the three possible phenotype.
Now we turn to the modelling of the generation change. Under simple assumptions
we shall be able to find out what are the frequencies of phenotype and
genotypes. Usually this information is not directly available. We assume that the
next generation occurs by random mating. Let
pAA(0),pAa(0),paa(0)
be the actual (and as yet unknown) probabilities of the phenotype.
Under random mating with independent selection of parents, the probability that an offspring
has phenotype AA is pAA(1) = (pAA(0)+1/2pAa(0))2.
Denoting by pA(t) = pAA(t)+1/2pAa(t)
we get the Hardy-Weinberg equilibrium: after one generation the proportions
pA(t) of genotypes stabilize and the phenotype frequencies
become
This determines the actual proportions of the phenotype from the proportion of
observed A-carriers and a-carriers.
Problem 76
Show that PAa = 2(Ö{Paa}-Paa).
The next question is to study the effects of the selection, where, say phenotype
aa has different chance of survival. This leads to total probability formula and
Markov chains.
Theorem 40
If A,B,C are n×n matrices such that AB ¹ C then Pr(ABX = CX) £ 1/2.
Proof. For a non-zero v we have Pr( |v·X| > 0) < 1/2
[¯]
The following beautiful theorem due to B. de Finetti points
out the role of exchangeability as a substitute for independence.
Theorem 41
Suppose that X1, X2, ¼ is an infinite exchangeable sequence.
Then there exist a s-field \cal N such that X1, X2, ¼ are
\cal N-conditionally i. i. d., that is
We will use the following (weak) version of the martingale32
convergence
theorem.
Theorem 42 Suppose \cal Fn is a decreasing
family of s-fields, ie.
\cal Fn+1 Ì \cal Fn for all n ³ 1. If X is integrable,
then E{X|\cal Fn}® E{X|\cal F} in L1-norm,
where \cal F is the intersection of all \cal Fn.
Proof. Suppose first that X is square integrable.
Subtracting m = EX if necessary, we can reduce the convergence
question to the centered case EX = 0. Denote Xn = E{X|\cal Fn}.
Since \cal Fn+1 Ì \cal Fn, by Jensen's inequality
EXn2 ³ 0 is a decreasing non-negative sequence.
In particular, EXn2 converges.
Let m < n be fixed. Then E(Xn-Xm)2 = EXn2+EXm2-2EXnXm.
Since \cal Fn Ì \cal Fm, by Theorem we have
If X is not square integrable, then for every e > 0 there is a square
integrable Y such that E|X-Y| < e. By Jensen's inequality
E{X|\calFn} and E{Y|\cal Fn} differ by at most e in L1-norm; this
holds uniformly in n. Since by the first part of the proof
E{Y|\calFn} is convergent, it satisfies the Cauchy condition in L2 and hence in
L1. Therefore for each e > 0 we can find N such that for all n, m > N we
have E{|E{X|\cal Fn}-E{X|\cal Fm}|} < 3e. This shows that
E{X|\cal Fn} satisfies the Cauchy condition and hence converges in
L1.
The fact that the limit is
X¥ = E{X|\cal F} can be seen as follows.
Clearly X¥ is \cal Fn-measurable
for all n, ie. it is \cal F-measurable.
For A Î \cal F (hence also in \cal Fn),
we have EXIA = EXnIA. Since
|EXnIA-EX¥ IA| £ E|Xn-X¥ |IA £ E|Xn-X¥ |® 0,
therefore EXnIA® EX¥ IA.
This shows that EXIA = EX¥ IA and by definition,
X¥ = E{X|\cal F}.
[¯]
A string is a sequence of letters, or symbols from the finite alphabet. For
the purpose of computer modelling, we can assume that a string is a sequence of
natural numbers {1,... d}, parameter d being the size of the alphabet.
Three simple examples of strings are the words, sentences in, say, English, and
DNA molecules. Here d = 26 (for lower-case words), d = 94 for sentences, and
d = 6 for the DNA (there are only four proteins, but extra symbols are used to
mark various undecided cases).
The question of comparing two strings for similarities arises in molecular biology
and in designing a spell-checker, or a speech recognition system. Accordingly, one
would like to say which strings are similar, and how likely it is that they are
similar due to chance only. Additional complications arise from the fact that two strings
compared do not necessarily have the same length.
A simple way to compare two strings is to measure the number of symbols that don't
match (the hamming distance). For instance abbacd and babacd would
then have distance 2. But abbacd and bbacda would have distance
6, even though they differ just by one transposition.
A less obvious way to compare two strings is to measure the edit
distance: how many editing operations are needed to
transform one of the strings into another. Usually the editing operations are:
These are suitable
for spell-checkers, where it is known that about 80% of typing errors are of the
above form, thus most of mistyped words have edit-distance 1 from the original.
Accordingly, the edit distance is set to 0, if the words are identical, 1 if they differ
by a single error of one of the listed types, 2 if there were two such errors, etc.
Formally, it is defined as the smallest number of elementary ``TE" transformations
required to transform one of the words into another.
The method of computation is based on recurrence.
It is easy to see that a number
of transformations between and empty word and another one is exactly
the length33 of the word.
If the two words U, W are formed from shorter ones U0,W0 by adding letters
at the end,
then the distance Dist(U,W) is the smallest of the numbers
Here is a complete VB-listing:
The main problem with using the edit distance to analyze DNA sequences is
processing time. (Speed-ups are known.)
[ 40
Write a program computing the edit distance between strings, and
another one,
which does the editing by simulation. (Try the randomization based on
random
number of edit operations from the currently best candidate)
A cell in its growth goes through several phases, which have different probabilistic
characteristics. In a simple model, the cell doubles after a random time, which is
the sum of the exponential and deterministic portion. The average of the exponential phase can
be assumed to depend on the external circumstances.
For a fixed vector y we have thus the joint probability mass
function f(x) = Pr(X = x). The average information H(X)
contained in X is defined
as
Problem 77
Prove Gibbs' inequality åj pjlogpj ³ åpj logqj for all qj > 0, åqj = 1.
Notice that H(X) ³ 0 and H(X) £ logk
Problem 78
Suppose you have 15 identical in appearance coins, except that one of them has a
different weight. Find the optimal34 weighting strategy to
identify the odd coin by using a scale.
Modeling of the spread of disease is complicated due to multiple factors
that influence its development. The birth-and-death process does not
seem to be a good model for the spread of an epidemic in a finite
population, since when a large proportion of the population has been
infected, we cannot suppose that the rate of infections is independent
of past history.
The Lp norm is
Notice that ||X-EX||2 is just the standard deviation.
We say that Xn converges to X in Lp, if ||Xn-X||p® 0 as
n® ¥. If Xn converges to X in L2, we shall also use the phrase
sequence Xn converges to X in mean-square.
An example of the latter is Theorem 5.2.
Several useful inequalities are collected in the following.
Theorem 43
For the proof of Theorem A.1 we need the following elementary inequality.
[ 3 For a,b > 0, 1 < p < ¥ and 1/p+1/q = 1 we have
Proof.
Function t® tp/p+t-q/q has the derivative
tp-1-t-q-1. The derivative is positive for t > 1 and negative
for 0 < t < 1. Hence the maximum value of the function for t > 0 is
attained at t = 1, giving
The next theorem lists useful properties of conditional expectations.
Theorem 44
The abstract proof uses the following35.
[ 4 If Y1 and Y2 are
\cal F-measurable and òAY1 dP £ òA Y2 dP for all
A Î \cal F, then Y1 £ Y2 almost surely. If
òAY1 dP = òA Y2 dP for all A Î \cal F, then Y1 = Y2.
Proof. Let Ae = {Y1 > Y2+e} Î \cal F. Since
òAeY1 dP ³ òAeY2 dP + eP(Ae), thus
P(Ae) > 0 is impossible. Event
{Y1 > Y2} is the countable union of the events
Ae (with e
rational); thus it has probability 0 and Y1 £ Y2 with probability
one.
The second part follows from the first by symmetry.
[¯]
(i) This is verified first for Y = IB (the indicator function
of an event B Î \cal F). Let
Y1 = E{XY|\cal F}, Y2 = YE{X|\cal F}. From the definition one
can easily see that both òAY1 dP and òA Y2 dP are equal
to òA ÇB X dP. Therefore Y1 = Y2 by the Lemma A.2.
For the general case, approximate Y by simple random variables and
use (vi).
(ii) This follows from Lemma A.2: random variables
Y1 = E{X|\cal F},
Y2 = E{X|\cal G} are \cal G-measurable and for
A in \cal G both òAY1 dP and òA Y2 dP are equal to
òAX dP.
(iii) Let Y1 = E{X|\cal NÚ\cal F}, Y2 = E{X|\cal F}.
We check first that
(iv) Here we need the first part of
Lemma A.2. We also need to know that each convex
function g(x) can be written as the supremum of a family of
affine functions fa, b (x) = ax+b.
Let Y1 = E{g(X)|\cal F}, Y2 = fa, b(E{X|\cal F}), A Î \cal F.
By (vi) we have
(v), (vi), (vii) These proofs are left as exercises.
[¯]
Problem 80 Prove part (vi) of Theorem A.2.
Problem 81 Prove part (vii) of Theorem A.2.
Problem 82
Prove the following conditional version of Chebyshev's
inequality: if E|X| < ¥,
then
Problem 83
Show that if (U,V,X) are such that in distribution (U,X) @ (V,X)
then E{U|X} = E{V|X} almost surely.
Problem 84
Show that if X, Y are integrable non-degenerate random variables,
such that
Problem 85
Show that if X, Y are integrable such that
E{X|Y} = Y and E{Y|X} = X, then X = Y a. s.
They give immediately the expansions
In particular for x > 0 we have ln1+x < x and ln1+x > x-x2/2.
Suppose x = x(u,v), y = y(u,v).
The change of variables formula in multivariate case is
The solution of the linear differential equation
y¢+a y = g(x) with y(0) = y0 is given by
y(x) = y0 e-ax ò0x eat g(t)/, dt.
Second order linear equation y¢¢+a y¢+by = g(x) is often solved by the
method of varying a constant.
The first step is to solve the homogenous equation y¢¢+a y¢+by = 0 first
using
y = erx. The general solution of homogenous equation is
y = C1er1x+C2er2x, or y = C1erx+C2xerx if the characteristic
equation has double root.
Definition 27
A scalar product of vectors in \sf V is a bilinear, positive definite, non-degenerate
mapping á·|·ñ: \sf V× \sf V® IR.
Definition 28
Vectors x,y are orthogonal with respect to scalar product á|ñ, if
áx|yñ = 0.
The length of a vector is ||x|| = Ö{áx|xñ}. Orthogonal
vectors of unit length are
called orthonormal.. If ej are the
orthonormal vectors and x is in their linear span, then the coefficients in
the expansion x = åj ajej are given by
aj = áx|ejñ.
Example 83 Let \sf V be the vector space of all continuous functions on the
interval [-p,p]. In the scalar product
áf|gñ = ò-ppf(t)g(t) dt the functions {sink}k Î IN, {coskt}k Î IN, 1 are
orthogonal.
This implies that An = a0(n)I+a1(n)A+...+ad-1(n)Ad-1, where
xn = D(x)Q(x)+a0(n)+a1(n)x+...+ad-1(n)xd-1. If A has n distinct
eigenvalues lj, the coefficients aj(n) can be found by solving the system of
equations ljn = a0(n)+a1(n)lj+...+ad-1(n)ljd-1.
A quick method that finds a characteristic polynomial due to K... is to solve the
system of linear equations for the coefficients: pick a vector X at random and
solve37 for a0,..., ad-1 the equations
a0X+a1 AX+...+ad-1Ad-1X = 0. If the resulting matrix is singular,
re-sample X until a non-singular matrix is found.
To approximate the integral òab f(x) dx in calculus we use left and right
sums: Ln = [(b-a)/ n]åk = 0n-1 f(a+[(b-a)/ n]k),
Rn = [(b-a)/ n]åk = 1n f(a+[(b-a)/ n]k)
The more exact trapezoidal rule uses
òabf(x) dx » Sn = 1/2(Ln+Rn).
Still more powerful and easy to program integration method is the Simpson
rule: òab f(x) dx » 4/3S2n-1/3Sn.
The Simpson rule is exact for cubic polynomials. Typical way to program it
is to call the subroutine performing trapezoid method integration twice. Many
calculators have Simpson rule build in. Before you use it, be sure to check if it
is reliable enough. A simple test that catches some poorly written routines is
ò0500e-x2 dx » Ö{p/2}.
The fast and powerful bisection method requires
correct end-points, and finds one solution only. But it has virtually no
assumptions, except the continuity: If f(x) is continuous and f(a) < 0, f(b) > 0
then there is a < x < b such that f(x) = 0 and the interval length can be halved by
computing f([(a+b)/ 2]).
Three difficulties arise in real applications.
The second point has satisfactory answer for polynomial equations. The third point
can be tackled through search for minimum. Namely, if equations are
f(x,y) = 0,g(x,y) = 0 then we need to find minima of f2(x,y)+g2(x,y).
Any DOS-running PC comes also with BASIC. You can start it with the command
QBASIC
from the DOS command line, or from Windows File Manager
(click on an entry QBASIC.EXE, usually located in). If you are a devoted
Windows user, you may install an Icon in Program Manager
to run BASIC with a click of the mouse.
To stop the execution
of any program, press simultaneously Ctrl+Break.
There is no possibility that by running these programs you will do any harm to your equipment.
Experiment, and if something goes realy wrong, you can always turn OFF or restart the computer.
will print the decimal value of the expression ln(1+Ö2).
This probably is the simplest program to begin with. It is so short that there is
no point in saving
it.
The real power comes from repetitive operations explained in Section
.
BASIC programs are actually text files. The instructions are read consecutively by
the BASIC interpreter. The QBASIC interpreter that comes with DOS is sufficient
for our
purposes. Sample programs below introduce certain more exotic build-in functions.
Example 85 The following
simple program prints current date & time
PRINT "HELLO"
'Prints current date & time
Within QBasic the
SUBs are separated, each in each own text window, but the final program is saved
as one file with SUBs following the main module.
To create a SUB, choose New SUB from the Edit menu. More details follow in Section
.
Larger
projects often use several modules40 that are compiled and run
together.
Example 87 Here is the revision of the previous program with a simple but useful
SUB.
'Prints current date & time in the middle of the screen
New things: CLS, concatenation of strings, calling SUB, screen positioning by TAB,
LEN("hello")
Every subprogram should begin with a
(commented) short description of its purpose, and the meaning of parameters.
FOR j=1 to 10 STEP .5 ... NEXT j.
STEP is optional, and if none is given, then 1 is used by default.
Example 88 Program
computes and prints the total of the first 100 integers (5050).
Conditional action is accomplished by
IF ... THEN ... ELSE ... END IF
END IF is required only when multiple lines of instructions are to be executed.
Selection from several cases is perhaps easiest through
SELECT var ...
Conditional loops (the ones that last indefinitely, unless special
circumstance is encountered)
can be programmed through
WHILE cond ... WEND
or through
DO
....
IF cond THEN EXIT DO
...
LOOP
Other ways of breaking out of DO ... LOOP are available, too.
Example 89 The following program illustrates several conditional
instructions. It finds consecutive max = 30000 prime numbers larger than n = 1
To stop the execution of any program, press simultaneously Ctrl+Break.
To have user supply a value for variable X, write
INPUT "Real number=";X
Note: instruction INPUT stops the program. The user has to hit Enter for the program
to continue.
To scan for the key pressed by the user without stopping the program, use
INKEY$ function instead of INPUT. This allows the user to
control the program by simple menu functions. For example, embed the following lines
within DO ... LOOP and stop the program by pressing KeyQ.
Key$=INKEY$
SELECT CASE Key$
CASE "Q"
END 'stop the program
CASE "q"
END 'stop the program
CASE "?"
PRINT "Did you ask for help?"
CASE ELSE
BEEP
END SELECT
A simple way to let the user know that something went wrong is to BEEP.
[ 42
Write a program that complains (beeps) when user presses any key.
[ 43
Write a program that will type the text provided by user from the keyboard to screen in
upper-case regardless what the user selected. (Hint: Function
T2$=UCASE$(T1$) converts to upper-case)
To print to the printer, use LPRINT instead of PRINT. In the latter
case, to force the page out of the printer, end every
printing job with LPRINT CHR$(12).
PRINT "New value is X=";X
which outputs the string (in quotation marks) and the value New value is X=1.234
Occasionally we may want to do minimal ``format" through the semi-colon, or
TAB().
PRINT ".";
To see the effect of the semi-colon, run the above statement in a
loop41. Then
delete the semicolon, and run it
again.
For professional formatting of output, look up the instruction PRINT USING
"###.##" in any BASIC
textbook (your public library is a good source!).
These can be combined into more complex user defined data structures through
Type declaration.
If variables are not declared, and the name isn't appended with the symbol
indicating type, the variables are treated as single precision. This is
admissible42 in small programs where speed and memory aren't of major concern.
User defined functions are listed as separate windows within QBASIC.
Select View to switch between various functions.
Each user=defined function
starts with BEGIN FUNCTION FunctName (x, y,z)
and ends with END FUNCTION. The code between these two lines
is executed whenever the function
is envoked from main program, from another function, or SUB, or from itself.
FunctName is a name for your function (choose a descriptive one). Arguments
(x,y,z) are the variables passed to the function.
Example 90
The following function soves the linear equation ax+b = 0
FUNCTION Solution(a, b)
[ 44
Write a function that solves quadratic equation a x2+bx +c = 0.
Once a function is created, you can use it from within a program in the same way
as the build in functions. The name of the function carries its value, eg
x=FunctName(23)+FunctName(24). Notice that the function can also change values
of the variables in its arguments, and perform any other tasks - this behavior is
like that of any program.
SUBs act like functions, except that the name doesn't carry ``value" and the
only change (if any) is to the variables passed. To invoke SUB from the program, use
CALL SubName(Parameters). This
method is
recommended as it is more resistant to typographical errors.
You can pass a whole array as an argument of a function. The arrays are recognized by the parentheses:
FuncName(A()) is a function of array A().
[ 45
Write functions SimulateBinomial (p,n), SimulateNormal(n,sigma), SimulateExponential(m)
that simulate a single instance of the Binomial, Normal, Exponential distribution
with given parameters.
Graphic commands are available only on computers with a graphics card (all
PC's that run Windows have a graphics card).
It is nice to be able to make simple graphs, but the full topic is beyond the scope
of this introduction.
Program LIMTHS.BAS draws lines and boxes of various colors,
in the display rectangle covering a portion of the
screen.
If you are seriously interested in graphing the results of your computations,
and printing the outcomes,
you may want to switch to Visual Basic in Windows.
Long integers can be used to represent subsets of an n-element set, for n up
to n = 15 (Why?). Single number k corresponds to long integer 2k, the set {j, k}
is
represented by 2j+2k, etc.
If S, T are integers representing sets, then S AND T represents the intersection of sets,
S OR T is the union, and S AND NOT T is the difference of sets.
To check if j Î S, verify if 2^j AND T is non-zero.
If you want to save your printout to file, print to the ASCII file sequentially.
This is slower and less versatile than binary, but easier to master.
The syntax is quite rigid. The following example contains the basic idea:
OPEN FileName FOR OUTPUT as #17
PRINT #17, "HELLO"
CLOSE #17
You can print to file text, numbers, etc. If the file with the sam name as
output file already exists, it is replaced by the new one (overwritten). To add rows to
an existing
file
without loosing
its
contents use OPEN FileName FOR APPEND
as
#17
You can read input from programs back into the program by the corresponding INPUT
statements. Beware that the file has to have the format expected by the INPUT.
The following simple program reads entries from the file FileName and
prints it onto the screen, one by one.
WHILE NOT EOF(17)
INPUT #17, H$
PRINT H$
WEND
CLOSE #17
PRINT H$
Change INPUT #17, H$ to LINE INPUT #17, H$ to get full text
rather than words.
If you wrote a program that uses GOTO statement(s), it is a good exercise to
re-write it without a single GOTO instruction!
Modern BASIC is a structural language.
The objective of this example is to show how various features of BASIC interplay
in a design of a card-dealing application.
Here is the description of the situation.
Card games require shuffling the deck, and dealing some number of cards
from the top of the deck.
The objective of this example is to write a program that will print out the number of requested cards twice.
That is, the first player will get as many cards as she requests, and then the second player
will get as many cards as he requests (from the rest of the deck!
What remains is only to decide what each step should do,
and how the information about the
cards will be stored. Since storing the information determines how it is passed between
subprograms, we begin with determining this part.
We may want to use an array Deck(52) which will be of ``user defined" type.
The advantage of this approach is that we may modify the information we w ``store" with each card
with minimal changes in the program itself.
DEF TYPE Cards
Afterwards we may declare two shared arrays
Dim Shared Deck(52) as cards
Shared means that every SUB in the program can access values of Order(j), and
Deck(j). The first card dealt will be Deck(Order(1)). The definition of type says that
its value is
Deck(Order(1)).value and its suit is Deck(Order(1)).suit.
The simplest way to begin designing SUBS is to
describe the purpose of each function/SUB with no code.
The next routine is perhaps not easy to write for a beginner, but we have a good example
in the book.
SUB ShuffleDeck
The next routine interacts with the user. User interaction should ALWAYS be
implemented as a separate SUB. As straightforward as it seems,
reliable coding requires extensive checking for errors resulting from
unpredicted user reactions.
'ask first player
The following routine is in charge of giving out cards. Since the order of cards was already
determined, it seems straightforward. But again complications arise, and this portion will
be much easier to handle if coded as a separate SUB
SUB DealCards(n)
Rather than beginning to code the actual functions, we may want to double check that
the ``flow" of our program is as we expect it.
We may write a ``dummy" versions of the more difficult parts of the program,
and test its operation. Only after we are sure that the program behaves as expected,
we can invest more time into coding more difficult parts.
Here is a test program. It was produced from the previously described code; all
newly added parts are clearly marked so that they can be removed once not needed.
With this ``skeleton" program we can check the following things:
fractional numbers will go through, but strings at least will be stopped.
1 Disjoint, or exclusive
events A, B Ì W are
such that
AÇB = Æ is
empty.
2 The price for flexibility is that now we have to decide how to
chose pn.
3 Similar instruction
on TI-85 is Display rand.
4 In BASIC, to avoid repetitions use instruction RANDOMIZE
TIMER at the beginning of a simulation
program.
5 You need to find out how to handle
graphic in BASIC. Otherwise, make a table of values
instead.
6 A related method is brute force -
checking all possible cases.
7 If you want to include
more cities, you have to type the distances in a suitable format. If you embark
on this project, try first to implement a method for selecting an arbitrary subset
of cities to visit.
8 Trivial events are those with probabilities 0, or 1.
9 This terminology arises under the interpretation that
X represents failure time.
10 A possible heuristic argument may
argue that
Tp is on average 1.
11 The precise definition is in Section 1.6.1.
12 Actually, we need only
a right-inverse, i.e a function such that F(f(u)) = u.
13 Provided that
limited accuracy is admissible
14 Theorem
gives a more general statement.
15
See eg. W. Feller, An Introduction to Probability Theory,
Vol II, Wiley, New York 1966.
16 For the multivariate analog, see Theorem
17 J. F. W. Herschel,
Quetelet on Probabilities, Edinburgh Rev. 92 (1850) pp. 1-57
18 J. C. Maxwell,
Illustrations of the Dynamical Theory of Gases,
Phil. Mag. 19 (1860), pp. 19-32.
Reprinted in The Scientific Papers of James Clerk Maxwell,
Vol. I, Edited by W. D. Niven, Cambridge, University Press
1890, pp. 377-409.
19 This is all we can hope
for in the finite computer arithmetics. Whenever Uj Î (0,1) is
selected and finite approximation is chosen, Nj = NUj is an integer,
where N = 2b is the characteristic of the operating system.
20 Actually, we need only
a right-inverse, i.e a function such that F(f(u)) = u.
21 Why variance?
22 The smallest possible variance is 0! Why doesn't this
happen in the actual application??
23
See
J. S. Sadowski, IEEE Trans. Inf. Th. IT-39 (1993) pp. 119-128, and the
references therein.
24 Such as y0 = 0.
25 Following good
mathematical practise,
we simplify the real world and assume that a
single mature
rabbit will produce one offspring every month.
26 Most likely, you know the solution to this
one, but the method is useful for other problems as well.
27 Clearly, we
request y0 = 0.
28 Notice that this is a mathematical
simplification that might be not worth pursuing if the actual state
space has some convenient interpretation.
29 Source: football in A. Hornby, Oxford
Advanced Learner's Dictionary of Current English , Oxford University Press 1974
30 Variable aliasing is the mistake of
assigning the same location to two or more variables in a program.
31 This is a very imprecise statement!
32
A martingale with respect to a family of increasing s-fields
\cal Fn is and integrable sequence Xn such that E(Xn+1|\cal Fn) = Xn.
The sequence
Xn = E(X|\cal Fn) is a martingale. The sequence in the theorem is of the
same form, except that the s-fields are decreasing rather than increasing.
33 Only deletions are required.
34 That is, find the strategy that costs
the least if you have to pay for each use of the scale.
35 Readers not familiar with measure
theory should skip the proofs.
36 See p-l Theorem
(Theorem 3.3) P. Billingsley, Probability and measure,
Wiley, New York 1986.
37 Use Gaussian elimination.
38 This is golden section. The explanation why it occurs here can
be found in numerical methods books.
39 The actual sample programs on the disk also contain
commented LATEX typesetting instructions at the beginning and at the end. Their
sole purpose
is
to insert the listings into this text.
40 For example, in many applied math
programs there is a need for integration routines. These perhaps would be
collected in a separate module
to
facilitate repeated usage.
41 FOR j=1 TO 1000 42 Use DEFINT I-L to over-ride the default and force all
the variables with names that begin with I though L to be of
Integer type.
[x/ 18]
if 0 £ x £ 6
0
otherwise
2.2.2 Histograms
Simulations and experiments do not give direct access to the density, but often a
histogram will approximate it reasonably well. Histograms are graphical
representations of empirical data. A sample histogram is drawn on the side of
the square in Figure on page .
2.2.3 Simulations of continuous r. v.
2.3 Expected values
Expected values are perhaps the single most important numerical characterization
of a random phenomenon.
Definition 6 For discrete random variable
X the expected value EX is given by EX = åv vPr(X = v), provided
the series converges.
Name Probability distribution EX
Normal f(x) = [1/(Ö{2p}s)]exp-[((x-m)2)/( 2s2)] m
Exponential f(x) = le-lx [1/( l)] > 0
Uniform f(x) = [1/( b-a)] 1/2(a+b) real
Gamma
Weibull
Binomial Pr(X = k) = (nk)pk(1-p)n-k np
Poisson Pr(X = k) = e-l[(lk)/ k!] l
Geometric Pr(X = k) = p(1-p)k-1 1/p
Hypergeometric
Negative Binomial 2.3.1 Tail integration formulas
The following tail integration formula is of considerable convenience in
theoretical analysis.
Theorem 2 For
non-negative random variables, both in the discrete, and in the
continuous case
EX =
ó
õ
¥
0
Pr
(X > t) dt (21)
If X is discrete integer valued X Î {0,1,...}, then (21) can be
written as
EX =
¥
å
n = 0
Pr
(X > n). (22)
X
if X £ 3
3
if X ³ 3
2.3.2 Chebyshev-Markov inequality
The following inequality is known as Markov's, or
Chebyshev's inequality. Despite its simplicity it has numerous
non-trivial applications,
see eg. Theorem .
P(U > t) £
EU
t
(23)
2.4 Expected values and simulations
'
'declarations
DECLARE FUNCTION Integrand! (X!, Y!)
DECLARE FUNCTION InDomain! (X!, Y!)
CONST True = -1
' simulation loop
NumTrials = 10000
FOR j = 1 TO NumTrials
'select random points from the square [-1,1]x[-1,1]
X = 2 * RND(1) - 1
Y = 2 * RND(1) - 1
'check if this is in the domain
IF InDomain(X, Y) THEN
NumTested = NumTested + 1
Sum = Sum + Integrand(X, Y)
Var = Var + Integrand(X, Y) ^ 2
END IF
NEXT j
'Print the answer
PRINT "Examined "; NumTested; " random points"
IF NumTested = 0 THEN END 'nothing found
N = NumTested
PRINT "The integral is approximately "; Sum / N
PRINT "With 95% confidence the error is less than "; 1.96 * SQR(Var / N - (Sum / N) ^ 2) / SQR(N)
END
FUNCTION InDomain (X, Y)
'This function checks if $x,y$ is in the integration domain
'The definition of the domain can be easily modified here, including
'more complicated domains
IF X ^ 2 + Y ^ 2 < 1 THEN InDomain = True
END FUNCTION
FUNCTION Integrand (X, Y)
'This is the function to be integrated
Integrand = COS(10 * X + 20 * Y)
'
END FUNCTION
2.5 Joint distributions
Pr
(X = 1,Y = 1) =
1
4
+e (24)
Pr
(X = -1,Y = -1) =
1
4
+e
Pr
(X = -1,Y = 1) =
1
4
-e
Pr
(X = 1,Y = -1) =
1
4
-e
Pr
(X Î U) =
ó
õ
...
ó
õ
U
f(x1,... xn) dx1 ... dxn
f(a,b) =
¶2
¶a ¶b
Pr
(X £ a, Y £ b). (25) 2.5.1 Independent random variables
Independence of random variables is defined in terms of joint distributions. The
intuition behind this definition is that the events that random variables may
generate should be independent. Notice however that the actual definition is simpler
than Definition 1.6. Can you explain why?
Definition 9 Random variables
X1,...,Xn are
independent,
or
stochastically independent, if
Pr
(X1 Î U1, ..., Xn Î Un) = Pr
(X1 Î U1) ... Pr
(Xn Î Un) (26) 2.6 Functions of r. v.
Some random variables are obtained by taking functions of another
ones, possibly multi-dimensional. In the notes we often limit our attention to
a single random variable Z given by a function Z = f(X,Y) of two arguments;
this is convenient for notation
and exhibits most of the interesting techniques.
2.7 Moments of functions
If Z = f(X,Y) has the expected value, then EZ can be computed directly without
computing the density, or probability mass function of Z. The following
identity is useful.
Ef(X,Y) =
å
x,y
f(x,y) Pr
(X = x, Y = y). (27)
Ef(X,Y) =
ó
õ
ó
õ
IR2
f(x,y)f(x,y) dx dy, (28)
E(aX+bY+c) = aEX+bEY+c. (29)
Var(aX+b) = a2Var(X). (30)
Var(X) = EX2-(EX)2 = E(X-EX)2. Name Probability distribution Var(X)
Normal f(x) = [1/(Ö{2p}s)]exp-[((x-m)2)/( 2s2)] s2
Exponential f(x) = le-lx [1/( l2)]
Uniform f(x) = [1/( b-a)] [1/ 12](b-a)2
Gamma
Weibull
Binomial Pr(X = k) = (nk)pk(1-p)n-k np(1-p)
Poisson Pr(X = k) = e-l[(lk)/ k!] l
Geometric Pr(X = k) = p(1-p)k-1 [1/( p2)]
Hypergeometric
Negative Binomial
(EX)2 £ EX2 (31) Tail integration formula revisited
Chebyshev-Markov inequality
Pr
(|X| > t) <
1
t
E|X| (32)
Pr
(|X-m| > t) <
1
t2
Var(X) (33)
Pr
(|X| > t) < e-atEeaX (34) 2.7.1 Simulations
The unknown variance s2 of a sequence of simulated random variables Xj
can be approximated by 1/n åj = 1n (Xj-[`X])2, where [`X] is
the arithmetic mean of X1,..., Xn. Thus can also use (33) and Theorem 2.7 to asses errors in
estimating variances. Another more accurate method is presented later on
in Chapter , but it also requires estimating variances.
2.8 Application:
scheduling
A critical path analysis involves estimating time of completing a project
consisting of many tasks of varying lengths. Some of the tasks can be done
concurrently, while other may begin only after other preliminary tasks are
completed. This is modeled by the dependency graph together with the estimated
times.
2.8.1 Deterministic scheduling problem
2.8.2 Stochastic scheduling problem
In some projects, the actual numbers are only the averages, and the actual
completion times of the projects may be random. The distribution of the actual
completion time, or even its average may be difficult to compute analytically.
Nevertheless, simulations let us estimate the average and analyze the
probabilities of events.
2.8.3 More scheduling questions
In more realistic analysis of production processes we also have
to decide how to split the tasks between available personnel.
Example 32 This exercise refers to the tasks presented in Example
2.8.1. On average, how fast can a single person bake chocolate-chip
cookies? What if there are two people?
2.9 Distributions of functions
Distributions of functions of random variables are often difficult to compute
explicitly. Special methods deal with more frequent cases.
Sums of discrete r. v.
Sums can be handled directly, but a more efficient method uses generating functions of
Chapter .
fZ(z) =
å
x
f(x,z-x) (35)
fZ(z) =
å
x
fX(x)fZ(z-x) (36) Sums of continuous r. v.
f(z) =
ó
õ
¥
-¥
fX(u)fY(z-u) du (37)
z
if 0 £ z £ 1
2-z
if 1 £ z £ 2 .
Minima, maxima
Minima and maxima occur for instance if we are waiting for one of the independent
events, and then we follow the first one (minimum), or the last one (maximum).
Embedded Markov chains construction in Section are based on minima of independent exponential r. v.
Order statistics
Order statistics generalize minima and maxima. Their main use is in (robust)
statistics, and this section can be safely skipped. Let X1,..., Xn be
independent continuous random variables with the cumulative distribution function
G and density g = G¢.
Pr
(Rk > x) =
n
å
j = k
(jn)(1-G(x))j(G(x))n-j (38) 2.10 L2-spaces
Inequalities related to expected values are best stated in the geometric language of
norms and normed spaces. We say that X Î L2, if X is square
integrable, ie. EX2 < ¥.
||X||2 =
_____
ÖE|X|2
.
EXY £ ||X||2||Y||2. (39)
E|X| £ E||X||2. (40)
||X+Y||2 £ ||X||2+||Y||2. (41)
2.11 Correlation coefficient
Correlation is a concept deeply rooted in statistics.
The correlation coefficient corr(X,Y) is defined for
square-integrable non-degenerate r. v. X, Y by the formula
r = corr(X,Y) =
EXY-EXEY
||X-EX||2||Y-EY||2
. 2.11.1 Best linear approximation
¶H
¶m
= 0 (42)
¶H
¶b
= 0.
(1-r2)Var(Y). (43) 2.12 Application: length of a random chain
2.13 Conditional expectations
2.13.1 Conditional distributions
For discrete X,Y the conditional distribution of variable Y given the value of
X is just the conditional probability, Pr(Y = y|X = x). In jointly continuous
case, define the conditional density
f(y|X = x) =
f(x,y)
fX(x)
. 2.13.2 Conditional expectations as random
variables
Since E(X|Y = y) depends on the actual value of Y, and Y is random, the
conditional expectation is a random variable itself. We shall write
E{X|Y} or EYX for the random variable defined by the conditional expectation
E{X|Y = y}.
E{X|Y}(w) =
n
å
k = 1
mk IAk(w),
E{X|\cal F} =
å
xj P(Bj|Ak) for w Î Ak.
EY = E(E(Y|X)) (44) 2.13.3 Conditional expectations (continued)
In discrete case conditional expectations of functions are given by,
E(g(X)|Y = y) =
å
x
g(x) Pr
(X = x|Y = y) (45)
Eg(X) = E(E(g(X)|Y)) (46) 2.14 Best non-linear approximations
This section explains the relevance of conditional expectations to the problem of
best mean-square approximation.
2.15 Lack of memory
Conditional probabilities help us to arrive at important classes of densities
in modeling.
In this section we want to analyze an non-aging device, which characteristics do
not change with time.
Pr
(T > t+s|T > t) = Pr
(T > s) (47)
N(x+y) = N(x)N(y) (48)
N(nx) = N(x)n. (49)
N(q) = rq. (50)
Remark 1 Geometric distribution also has the lack of memory property.
If equation (48) is assumed to hold for integer values of x,y only,
and T > 0 is integer valued, then T is
geometric.
2.16 Intensity of failures
The intuitive lack-of-memory, or non-aging property of the exponential
distribution can be generalized to include simple models of aging. We may want to
assume that a component analyzed becomes less reliable, or more reliable with
time. An example of the first one is perhaps a brand new car. An example of the
latter is perhaps a software operating system when updates are installed promptly.
l(t) =
lim
h® 0
1
h
Pr
(T < t+h|T > t) (51)
f(t) = C ta-1e-bta for t > 0. (52) 2.17 Poisson approximation
Of the discrete distributions, the formula for the Poisson distribution is perhaps
mysterious. Poisson distribution is often called the law of rare events.
Theorem 10
Suppose Xn are Bin(n,pn) and npn® l. Then
Pr(Xn = k)®e-llk/k!.
Example 45 How many raisins should a cookie have on average so that no more
than one cookie in a hundred has no raisins? So that no more
than one cookie in a thousand has no raisins?
2.18 Questions
Problem 20 The performance of the algorithm for selecting a random permutation
in GetPermutation SUB of RANDTOUR.BAS
can be estimated by the following.
More theoretical questions
Problem 24 [Hoeffding]
Show that if XY, X, Y are
discrete, then
EXY-EXEY =
ó
õ
¥
-¥
ó
õ
¥
-¥
(P(X ³ t, Y ³ s)-P(X ³ t)P(Y ³ s)) dt ds.
P(X > 2t) £ q P(X > t) for all t > T. Chapter 3
Moment generating functions3.1 Generating functions
3.2 Properties
The moment generating function of a real-valued random variable X is
defined by MX(t) = Eexp(tX). If X > 0 has the density f(x), the moment
generating function is its Laplace
transform: M(t) = ò0¥ etx f(x) dx.
Figure 3.1: Graph of the moment generating function M(t)
=
3/4+1/4et.
MaX+b(t) = etbMX(at). (53) Name Distribution Moment generating function
Normal N(0,1) f(x) = [1/(Ö{2p})]exp-[(x2)/ 2] M(t) = et2/2
Exponential f(x) = le-lx M(t) = [(l)/( l-t)]
Uniform U(-1,1) f(x) = 1/2 for -1 £ x £ 1 M(t) = 1/tsinht
Gamma f(x) = 1/G(a)b-a xa-1exp-x/b
M(t) = (1-bt)-a
Binomial Pr(X = k) = (nk)pk(1-p)n-k M(t) = (1-p+pet)n
Poisson Pr(X = k) = e-l[(lk)/ k!] M(t) = expl(et-1)
Geometric Pr(X = k) = p(1-p)k-1 M(t) = [(pet)/( 1-(1-p)et)]
MX+Y(t) = MX(t) MY(t) 3.2.1 Probability generating functions
For ZZ\scriptscriptstyle + -valued random variables it is convenient to consider the so called
generating function G(z) = M(lnz). In this
case Theorem 3.2(i) is elementary, as G(z) = åk = 0¥ pkzk
determines uniquely its Taylor series coefficients
pk = Pr(X = k).
3.3 Characteristic functions
eix = cosx+isinx.
fX(t) = EeitX. (54) 3.4 Questions
Problem 30
Let Sn = X1+... + Xn be the sum of mutually independent random variables
each assuming the values 1,2,...,a with probability 1/a.
Pr
(Sn = k) = a-n
¥
å
j = 0
(-1)j(jn)(n-1k-aj-1)
Hint: The answer is at first as the infinite series. To find its sum, use
generating function, or negative binomial expansion:
(1+x)-k = 1+[(k+1)/ 1!]x+[((k+1)k+2)/ 2!]x2+....
P(X > 2t) £ 10 (P(X > t))2 for all t > 0,
Eexp(tX) £ Cexp(
t2
2l
)
Chapter 4
Normal distributionNext to a stream in a forest you see a small tree with tiny, bell-shaped,
white flowers in dropping clusters.
The Auborn Society Field Guide to North
American
Trees.
f(x) =
1
æ
Ö
2p
s
e -[((x-m)2)/(2s2)].
M(t) = e[(t2)/ 2].
M(t) = exp(tm+
1
2
s2t2),
X = sZ + m, (55) If X has given mean m = 123 and given variance s2 = 456,
for what values of a we have Pr(|X| > a) = .8?
with the help of tabularized values of the
cumulative distribution function of standard normal Z.
4.1 Herschel's law of errors
The following narrative comes from J. F. W. Herschel17.
``Suppose a ball is dropped from a given height, with the
intention that it shall fall on a given mark. Fall as it may, its deviation from
the mark is error, and the probability of that error is the unknown function of
its square, ie. of the sum of the squares of its deviations in any two rectangular
directions. Now, the probability of any deviation depending solely on its
magnitude, and not on its direction, it follows that the probability of each of
these rectangular deviations must be the same function of its square. And since
the observed oblique deviation is equivalent to the two rectangular ones,
supposed concurrent, and which are essentially independent of one another, and
is, therefore, a compound event of which they are the simple independent
constituents, therefore its probability will be the product of their separate
probabilities. Thus the form of our unknown function comes to be determined from
this condition...''
N(2x) £ 8(N(x - x0))2 (56)
P(|X1| ³ 2x) £ P(|X1| ³ 2x)P(|X2| ³ 2x0)
+ P(|X1+X2| ³ 2(x - x0)) P(|X1- X2| ³ 2(x - x0) )
£
1
2
N(2x)+4N2(x - x0)
Proof of Theorem 4.1. Let M(u) = EuX be the moment generating
function of X. Since E eu(X+Y)+v(X-Y) = E e(u+v)X+(u-v)Y
M(Ö2 u)M(Ö2v) = M(u+v)M(u-v) (57)
Q(Ö2 u)+ Q(Ö2v) = Q(u+v)+Q(u-v) (58)
Q¢¢(u+v) = Q¢¢(u-v)
4.2 Bivariate Normal distribution
Definition 14 We say that X,Y have jointly normal distribution (bivariate
normal), if aX+bY is normal for all a,b Î IR.
f(x,y) =
1
æ
Ö
2p(1-r2)
s1s2
exp-
x2-y2-2rs1s2 xy
2s12s22(1-r2)
. (59) 4.3 Questions
Problem 36
Show that if X,Y are independent and normal, then X+Y is normal.
(Hint: moment generating functions are easier than convolution formula (36).)
Chapter 5
Limit theorems5.1 Stochastic Analysis
There are several different concepts of convergence of random variables.
0
if U > 1/n
1
otherwise . Then Xn® 0 almost surely.
5.2 Law of large numbers
Each law of large numbers (there are many of them) states that empirical averages converge to
the expected value. In statistical physics the law of large numbers impies that trajectory
averages and population averages are asymptoticaly the same.
In simulations, it provides a theoretical foundation, and
connects the frequency with the probability of an event.
[ 1
If Xn is Binomial Bin(n,p) then 1/n Xn ® p in probability.
5.2.1 Strong law of large numbers
5.3 Convergence of distributions
5.4 Central Limit Theorem
A lesser known aspect of the central limit theorem is that one can actually
simulate paths of certain continuous time processes by taking
Xn(t) = [1/( Ön)]åk £ nt xk, where xk are independent
mean-zero r. v.
'
'This program simulates random walk paths (with uniform incerments)
'reflected at the boundaries of a region
'
'declarations of subs
DECLARE SUB CenterPrint (Text$)
' minimal error handling - graphics card is required
ON ERROR GOTO ErrTrap
CLS
'request good graphics (some cards support SCREEN 7, etc)
SCREEN 9
LOCATE 1, 1 'title
CenterPrint "Path of reflected random walk"
LOCATE 9, 1 'timer location
PRINT "Timer"
scale = 10 '
WINDOW (0, 0)-(scale, scale): VIEW (150, 100)-(600, 300)
LINE (0, 0)-(scale, scale), 10, B': LINE (scale, scale)-(2 * scale, 0), 11, B
FOR j = -4 TO 4
LINE (0, scale / 2 + j)-(scale / 50, scale / 2 + j), 2
NEXT j
X = scale / 2
Y = scale / 2
dead = 0
T = 0
col = 14 * RND(1)
speed = scale / 100
WHILE INKEY$ = "" 'infinite loop until a key is pressed
T = T + 1
X0 = X
Y0 = Y
X = X + (RND(1) - 1 / 2) * speed
Y = Y + (RND(1) - 1 / 2) * speed
IF X < 0 THEN X = -X: col = 14 * RND(1)
IF Y < 0 THEN Y = -Y: col = 14 * RND(1)
IF X > scale THEN X = 2 * scale - X: col = 14 * RND(1)
IF Y > scale THEN Y = 2 * scale - Y: col = 14 * RND(1)
IF X > 2 * scale THEN X = 4 * scale - X: col = 14 * RND(1)
LINE (X0, Y0)-(X, Y), col
LOCATE 10, 1
PRINT T
WEND
END
ErrTrap: 'if there are errors then quit
CLS
PRINT "This error requires graphics card VGA"
PRINT "Error running program"
PRINT "Press any key ...'"
WHILE INKEY$ = ""
WEND
END
SUB CenterPrint (Text$)
' Print text centered in 80 column screen
offset = 41 - LEN(Text$) \ 2
IF offset < 1 THEN offset = 1
LOCATE , offset
PRINT Text$
'
END SUB
5.5 Limit theorems and simulations
5.6 Large deviation bounds
Pr
(
1
n
Xn ³ p¢) £ expn H(p¢|p) (60)
[ 2 If p = 1/2 then
Pr
(|
1
n
Xn-
1
2
| > t) £ e-nt2/2 (61)
5.7 Conditional limit theorems
5.8 Questions
[ 25
A 400-point multiple choice test has four possible responses, one of
which is correct.
Part 2
Stochastic processes
Chapter 6
Simulations6.1 Generating random numbers
Suppose x0 is an arbitrary number
between 0 and 1 with 5 decimal places or more. Let x1 = {147x0},
and xn+1 = {147xn}, where {a} = a-[a] denotes the fractional
part. Here are the questions: Is xn a random sequence? Does it have
``enough" propertries of a random sequence to be used for
simulations?
6.1.1 Random digits
0
if x < 0
x
if 0 £ x £ 1
1
if x > 1
Coefficient 2 does not play any special role. The same fact holds true in any number system -
if N > 1 is fixed, and U = åXj N-j is expanded in pase N, then
Xj are independent uniform {0,1,..., N-1}-valued discrete random variables.
Uj+1 = {UjN} (62) 6.1.2 Overview of random number generators
nj+1 = a nj mod N (63)
nj+1 = anj+b mod N (64) 6.2 Simulating discrete r. v.
6.2.1 Generic Method - discrete case
Section 2.1.2 described the generic method for
simulating a discrete random variables with finite number
of values.
Namely, take X = f(U), where f is a suitable piecewise constant functions on
the interval (0,1). Let Pr(X = vk) = pk. Then f(x) = vk for
x Î (åj = 1kpj,åj = 1k+1pj).
6.2.2 Geometric
A simple method for simulating geometric distribution is to simulate
independent binomial trials
until the first success.
6.2.3 Binomial
A simple method for simulating Binomial Bin(n,p) random variable is
to simulate
binomial trials with the prescribed probability of success. For a
sample program illustrating this method,
see TOSSCOIN.BAS page pageref.
6.2.4 Poisson
6.3 Simulating continuous r. v.
6.3.1 Generic Method - continuous case
Section 2.2.3 described the generic method for
simulating a continuous random variable, similar to the method used in the
discrete case.
Namely, take
X = f(U) where f is the inverse20 of the
cumulative distribution functions F(x) = Pr(X £ x).
6.3.2 Randomization
6.3.3 Simulating normal distribution
By the central limit theorem (Theorem 5.4), if U1,...,Un are i. i. d.
uniform U(0,1) then Ö{[12/ n]}åk = 1n (Uk-1/2) is
asymptotically normal
N(0,1).
In particular a computer-efficient approximation to normal distribution is given
by
12
å
k = 1
Uk - 6.
X1 = RcosQ (65)
X2 = RsinQ (66) 6.4 Rejection sampling
The idea of rejection method is very simple. In order to simulate
a random variable with the density f(x), select a point X,Y at random uniformly from the region
{(x,y): y < g(x)}. Then Pr(X < x) = ò-¥x f(t) dt is the cumulative distribution function,
which we do not have to know analytically.
6.5 Simulating discrete experiments
6.5.1 Random subsets
To simulate uniformly selected random subsets of {1,..., n},
define sets by a sequences S(j) Î {0,1} with the interpretation j Î S if
S(j) = 1. Now select independently 0 or 1 with probablity 1/2 for each
of the values of S(j), j = 1,..., n.
SUB RandomSubset( S())
n = UBOUND(S) ' read out the size of array
FOR j = 1 TO n
'select entries at random
S(j) = INT(RND(1) + 1)
NEXT j
6.5.2 Random Permutations
SUB Rearrange( A())
n = UBOUND(A) ' read out the size of array
ns = n 'initial size of randomization
FOR j = 1 TO n
'take a card at random and put it away
k = INT(RND(1) * ns + 1)
SWAP A(k), A(ns)
ns = ns - 1 'select from remaining a(j)
NEXT j
6.6 Integration by simulations
Var
æ
ç
è
1
N
N
å
j = 1
f(Xj)/g(Xj)
ö
÷
ø
=
1
ÖN
ó
õ
b
a
æ
ç
è
f(x)
g(x)
-J
ö
÷
ø
2
dx. 6.6.1 Stratified
sampling
6.7 Monte Carlo estimation of small
probabilities
Unlikely events happen too rarely to have any reasonable hope of
simulating
them directly. Under such circumstances a special method of
selective sampling23
was developed.
Example
Suppose we want to find Pr(X1+...+Xn > an) for large n and a given
density f(x) of independent r. v. X. Consider instead independent random
variables Yj with the ``tilted density" Celx f(x), where C is the
normalizer, and l is such that EY = a. By the law of large numbers (Theorem 5.2), the event
Y1+...+Yn > an has large probability, and
'
'Simulating N fair coins
' declarations
DECLARE FUNCTION NumHeads% (p!, n%)
DEFINT I-N ' declare integer variables
'prepare screen
CLS
PRINT "Simulating toss of n fair coins"
'get users input
n = 10 'number of trials
INPUT "Number of coins n=", n
pr = .5 ' fairness of a coin
frac = .8 ' percentage of heads seeked
' get the frac from the user
PRINT " How often we get more than f heads? (where 0<f<"; n; ")"
INPUT "f=", frac
IF frac >= 1 THEN frac = frac / n 'rescale if too large
'tell user what is going on
PRINT "Hit any key to see the final answer"
LOCATE 20, 10
PRINT "With some patience you may see digits stabilize"
DO 'main loop
T = T + 1
IF NumHeads(pr, n) > frac * n THEN s = s + 1
IF INKEY$ > "" THEN EXIT DO
LOCATE 10, 10
PRINT "Trial #"; T
PRINT "Current estimate"; USING "##.#####"; s / T
LOOP
'print the answer
PRINT
PRINT "Prob of more then "; INT(frac * n); " heads in "; n; " trials is about "; s / T
END
DEFINT A-H, O-Z
FUNCTION NumHeads (p!, n)
'simulate a run of n coins
h = 0
FOR k = 1 TO n
IF RND(1) < p! THEN h = h + 1
NEXT k
NumHeads = h
'
END FUNCTION
Chapter 7
Introduction to stochastic processesstochastic, a. conjectural; able to conjecture
Webster's New Universal Unabridged Dictionary
7.1 Difference Equations
A difference equation determines an unknown sequence (yn) through a
relation that specifies the pattern. The
relation often has the form of an equation that ties the
consecutive values of the sequence. The technical
term for such an equation is recurrence.
yn+1 = f(n,yn,yn-1,...,yn-k+1). 7.1.1 Examples
yn+1 = cos(yn), (67) y0 y1 y2 y3 y4 y5 y6 y7 y8
.5403023 .8575532 .6542898 .7934803 .7013688 .7639596 .7221025 .7504177
-.5 .8775826 .6390125 .8026851 .694778 .7681958 .7191654 .7523558 .7300811
0 1 .5403023 .8575532 .6542898 .7934803 .7013688 .7639596 .7221025
.5 .8775826 .6390125 .8026851 .694778 .7681958 .7191654 .7523558 .7300811
1 .5403023 .8575532 .6542898 .7934803 .7013688 .7639596 .7221025 .7504177
1.5 .0707372 .9974992 .542405 .8564697 .6551088 .7929816 .7017242 .7637303
2 -.4161468 .9146533 .6100653 .8196106 .6825058 .7759946 .7137247 .7559287 Example
Some difference equations are simpler than others.
Suppose a sequence yn is to satisfy
yn+1 = yn +d, (68)
yn = y0+dn. (69)
yn+1 = ryn , (70)
yn = y0rn (71) Example
The following equation defines the Fibonacci sequence
yn+1 = yn+yn-1. (72)
1, 1, 2, 3, 5, 8, 13, 21, ...
Without much difficulty this can be converted to a computer program and used to answer
questions like When will the population of rabbits exceed 1 million? The general expression (solution) of
the equation corresponding to this situation is given by the
following formula.
yn =
1
Ö5
æ
ç
è
1+Ö5
2
ö
÷
ø
n
-
1
Ö5
æ
ç
è
1-Ö5
2
ö
÷
ø
n
7.2 Linear difference equations
yn+1+a yn = g(n).
yn+2+ayn+1+b yn = g(n).
Example
Here is an example of the applied problem that leads to a natural,
but not obviously solvable difference equation.
yn+1 = (1+r)yn - p . (73)
yn+1 = yn +n+1 (74)
un+1 = un +n+
3
2
. (75)
vn+1 = vn+
1
2
. 7.2.1 Problems
7.3 Recursive equations, chaos, randomness
xn+1 = cos(xn)
xn+1 = {2xn}
xn+1 = 4xn(1-xn)
For
instance, we can ask and get reliable answers to questions like:
Theory of stochastic processes uses descriptive rather than casual
models. Its primary goal is to
isolate methods that answer questions that can be answered - about the
averages and chances of events. It is setup in the form that makes it
more natural to ask the "correct" questions. But in real life, and in
simulations, we do have access to aspects of the phenomenon than what the theory
does not expose. In analyzing simulations it is
important to keep in mind the examples above. Avoid collecting data
that deal with instances rather than statistical phenomena. Print out
well defined statistics of the simulation only. Do not clutter your simulations
with irrelevant details.
7.4 Modeling and simulation
7.5 Random walks
7.5.1 Stopping times
The stopping times are random variables that describe phenomena which depend on
the trajectory of a random walk. The definition captures the intuition that
their values are determined by the history of a Markov chain.
EXT = mET (76)
Theorem 22 If xj are i.i.d., Ex = m, Var(x) = s2 < ¥, and
ET < ¥ then
E(ST-Tm)2 = s2ET (77) 7.5.2 Example: chromatography
Chromatography is a technique of
separation mixtures into compounds. One of its uses is to produce the
DNA bands.
7.5.3 Ruining a gambler
The following model is a reasonable approximation to some of the
examples in Section 7.4.
7.5.4 Random growth model
The following models various growth phenomena like the spread of a disease,
where the infected individual may either die, or infect a number of
other individuals. Here we concentrate on bacteria which have
simple
reproduction mechanism, and all spatial relations are neglected.
Chapter 8
Markov processes
Webster's New Universal Unabridged Dictionary
8.1 Markov chains
Pr
(Xk+1 Î U|X0,...,Xk) = Pr
(Xk+1 Î U|Xk)
8.1.1 Finite state space
If a Markov chain has a finite state
space, we can always assume28 it consists of integers.
Pr
(Xk+1 = x|X0,..., Xk) = Pr
(Xk+1 = x|Xk) (78)
Powers of moderately sized matrices are easy to compute on the computer.
Section indicates a mathematical method of
computing Pn for small dimensions using the Cayley-Hamilton theorem. Under certain conditions the
powers converge.
[ 28
Find limn®¥Pn for the matrix from Problem 8.1.1.
8.1.1 Stationary Markov processes
å
k
pk = 1 (79)
[p1,..., pd] = [p1,..., pd]×P (80)
(81) 8.1.2 Markov processes and graphs
The states of a Markov chain may be represented by vertices of a graph, and one step
transitions may be described by directed edges with weights. Such representation of a markov chain
aids in visualizing a Markov chain.
8.1.2 Classification of states
Graph notions have bearing on properties of the Markov chain. In particular,
Markov chain is irreducible, if the corresponding graph is connected.
Markov chain is periodic, if there is N > 1 (the period) such that all cycles of the graph
are multiples of N. If there is no such N then Markov chain is called
aperiodic.
8.1.2 Trajectory averages
8.1.2 Asymptotic probabilities
Let pk(n) = Pr(Xn = k), and suppose that the limit
pk(¥) = limn®¥pk(n) exists. Since pk(n+1) = åj = 1dpj(n)P(j,k),
therefore the limit probabilities satisfy the stationarity
equations (79)
1/4
3/4
3/4
1/4 ].
0
1
1
0 ].
1
0
0
1 ].
8.1.2 Example: two-state Markov chain
1-a
a
b
1-b ]. Then
Pn =
1
a+b
é
ê
ë
b
a
b
a
ù
ú
û
+
(1-a-b)n
a+b
é
ê
ë
a
-a
-b
b
ù
ú
û
.
Pn®
é
ê
ê
ê
ê
ë
b
a+b
a
a+b
b
a+b
a
a+b
ù
ú
ú
ú
ú
û
Then Yn = (Xn,Xn+1) is also a Markov process.
Find its transition matrix and the stationary distribution.
P = [
a
1-a
1-b
b
]. 8.2 Simulating Markov chains
Homogeneous and non-homogeneous Markov chains with finite state space
are fairly straightforward to simulate. A generic prescription is to
simulate Xn+1 using the conditional distribution determined by
Xn. This simulation differs very little from the generic method
described in Section 2.1.2
8.2 Example: how long a committee should discuss a topic?
This
example involves simulation of a Markov chain. For Markov chains many
theoretical results are available, but simulation is often the most
expedient way to study it.
8.2.1 Example: soccer
The soccer game29 is played by two teams, each with 10 players in the field and a goalkeeper. A
modern line-up split the players into between the zones of defence, center, and
attack. Thus a configuration (3-4-4) means 3 defenders (backs), 4 midfield link
men and 4 strikers (forwards). In the Markov model of soccer we will just watch
the position of the ball, and we assume it can be only in one of the five
positions: left goal, left defence, midfield, right defence, right goal. Wee
shall assume that at every unit of time the ball has to move left or right with
chances proportional to the number of players lined-up for a given position.
8.2.1 A brief review of game theory
In game theory, a rational player is supposed to minimize
her losses against the best choice of the opponent. The terminology uses
minimax= minimize maximal loss,
maximin= maximize minimal gain; the amazing fact is that these are equal and
often the optimal strategies are randomized (mixed).
8.2.1 Modifications
As given above, the model of the soccer game is simple enough to be analyzed
without computer. In a more realistic model one could consider additional factors.
8.3 One step analysis
For homogeneous Markov chains a surprising number of quantities of
interest can be computed using the so called one step analysis.
The idea is to find out what happens to the quantity of interest within
one step of the evolution, and solve the resulting difference equation.
8.3.1 Example: vehicle insurance claims
C(s1,s2,s3,s4) =
å
pj(Pj+mj), (82) 8.3.2 Example: a game of piggy
In a game of piggy, each player tosses two dice, and has an option of
adding the outcome to his
score, or rolling again. The game ends when the first player exceeds 100 points.
8.4 Recurrence
A state x of Markov chain is recurrent, if with probability one the
chain returns back to x. Otherwise it is called transient.
If Xt is irreducible then all states are
simultaneously either recurrent, or transient.
Interesting fact: simple random walks in Rd are recurrent when
d < 3, transient when d ³ 3. However the return times have infinite
average.
8.5 Simulated annealing
P(u,v) = C(u)e-q(U(v)-U(u))+, (83)
An efficient realization of the above is to
pick v Î \cal Nu at random and accept it with
probability
1 if U(v) £ U(u)
e-QU(v) otherwise.
8.5.1 Program listing
The program is available online, or on the disk.
'
'Declarations of subprograms
DECLARE SUB Modify (Path!(), i!)
DECLARE SUB GetNeighbor (P!(), T!)
DECLARE SUB GetRandomPair (i0!, P!(), n!)
DECLARE SUB display ()
DECLARE SUB AssignDistances (n!)
DECLARE SUB GetPermutation (P!())
DECLARE FUNCTION PathLen! (P!())
'
'Clear display and write a title
CLS
LOCATE 2, 1
INPUT ; "Shortest distance between how many cities?", n
CLS
nMax = 19 'only 19 cities in DATA statement
IF n > nMax THEN n = nMax
'declare matrices required
DIM P(n), BestP(n) ' permutation and best permutation P(i)=# of i-th city visited
'shared matrices
DIM SHARED dist(nMax, nMax) 'distances between cities i, j
DIM SHARED city(nMax) AS STRING 'names of cities
'read distances
CALL AssignDistances(nMax)
'initial permutation
FOR j = 1 TO n
P(j) = j + 1
BestP(j) = j + 1
NEXT j
P(n) = 1
BestP(n) = 1
MinLen = PathLen(P())
T = 1 / 10000
DO 'infinite loop
'theta =1/temperature
T = T * 1.1
PRINT "T="; 1 / T
FOR ct = 1 TO 100 * n
'check if key pressed
k$ = INKEY$
IF k$ > "" THEN EXIT DO 'exit infinite loop
CALL GetNeighbor(P(), T)
display
x = PathLen(P())
IF x < MinLen THEN 'memorize
MinLen = x
FOR j = 1 TO n
BestP(j) = P(j)
PRINT city(P(j)); "->";
NEXT j
PRINT city(P(1))
PRINT "Best so far: "; MinLen
END IF
NEXT ct
LOOP
'Print final message
CLS
PRINT "ALPHABETIC TOUR OF FIRST "; n; " CITIES in the USA"
PRINT "Simulated Annealing Recommended Route ";
FOR j = 1 TO n - 1
PRINT city(BestP(j)); "-->";
NEXT j
PRINT city(BestP(n)); "-->"; city(BestP(1))
PRINT "Total distance: "; MinLen
PRINT "Can you find a better one?"
LOCATE 22, 40
PRINT "(Distances subject to change)"
END
cities: 'FIRST 19 cities in alphabetic order (Source: AAA map)
DATA " Albany", "Albuquerque", "Atlanta", "Atlantic City", "Augusta"
DATA "Austin", "Baltimore","Baton Rouge", "Billings", "Birmingham", "Bismarck", "Boise", "Boston"
DATA "Buffalo", "Calgary", "Charlotte", "Cheyenne", "Chicago", "Cincinnati"
distances: 'from city number RowNumber to RowNumber+k, k=0 to nMax
DATA 0,2046,1067,315,336,1985,358,1585,2071,1167,1654,2615,174,285,2485,808,1789,823,724
DATA 0,1430,1986,2384,849,1949,1110,1006,1298,1423,1205,2222,1761,1712,1663,541,1289,1347
DATA 0,872,1287,1011,671,580,2019,158,1571,2451,1125,926,2425,258,1625,740,487
DATA 0,535,1790,163,1390,2085,968,1668,2629,373,455,2522,614,1803,837,664
DATA 0,2205,578,1805,2407,1383,1990,2951,162,621,2677,1029,2125,1159,1060
DATA 0,1627,436,1650,853,1558,2011,2043,1591,2356,1269,1204,1155,1152
DATA 0,1227,1948,805,1531,2492,416,459,2385,451,1666,700,527
DATA 0,1878,422,1589,2239,1643,1309,2443,838,1432,976,870
DATA 0,1861,417,732,2245,1786,706,2019,465,1248,1542
DATA 0, 1521,2293,1221,922,2375,416,1467,690,483
DATA 0,1149,1828,1369,927,1602,882,831,1125
DATA 0,2789,2330,875,2495,826,1792,2040
DATA 0,459,2515,867,1963,997,898
DATA 0,2200,693,1504,538,439
DATA 0,2456,1171,1685,1979
DATA 0,1669,771,490
DATA 0,966,1214
DATA 0,294
DATA 0
SUB AssignDistances (n)
'This reads all distances to dist(i,j)
RESTORE cities
FOR j = 1 TO n
READ city(j)
NEXT j
RESTORE distances
REDIM dist(n, n)
FOR j = 1 TO n
FOR i = j TO n
READ x
dist(i, j) = x
dist(j, i) = x
'minimal error handling
IF x = 0 AND i <> j THEN PRINT "Error in DATA before row "; j; "col"; i: STOP
NEXT i
NEXT j
END SUB
SUB display
'show information about what the program does
STATIC tst
tst = tst + 1
y = CSRLIN
x = POS(1)
LOCATE 1, 45
PRINT " Hit any key to stop"
LOCATE 2, 60
PRINT " "
LOCATE 2, 60
PRINT "Trying path #"; tst
LOCATE y, x
END SUB
SUB GetNeighbor (P(), T)
'Move at random to another nearby permutation P
n = UBOUND(P)
CP = PathLen(P())
DIM Q(n)
DO
FOR k = 1 TO n: Q(k) = P(k): NEXT
i = INT(n * RND(1)) + 1
j = INT(i * RND(1)) + 2
IF j > n THEN j = 1
SWAP Q(i), Q(j)
Prob = EXP(-T * (CP - PathLen(Q())))
'IF Prob > 1 THEN EXIT DO
IF RND(1) < Prob THEN EXIT DO
LOOP
SWAP P(i), P(j)
END SUB
FUNCTION PathLen (P())
'Compute length of path P according to distances in Dist(i,j)
n = UBOUND(P)
s = dist(P(1), P(n))
FOR j = 1 TO n - 1
s = s + dist(P(j), P(j + 1))
NEXT j
PathLen = s
'
END FUNCTION
8.6 Solving difference equations through simulations
u(x,t+1) = u(x,t)+A
1
2d
å
v = ±ek
u(x+v,t) (84)
u(x,o) = u0(x) (85) 8.7 Markov Autoregressive processes
8.8 Sorting at random
Efficiency of sorting algorithms is often
measured by the number of comparisons required.
'
SUB QuickSort (Index(), Aux(), First%, Last%)
'sorts two matrices in increasing order by the valuies of Index() from pocz to kon
' Note: mixes order of indices corresponding to equal Index(j)
'** Quick-sort (ascending) the fields in Array$(), from field First% thru Field Last%
IF First% >= Last% THEN EXIT SUB
CONST max = 30
DIM Lft%(max + 1), Rght%(max + 1)
Temp% = 1
Lft%(1) = First%
Rght%(1) = Last%
DO
Start% = Lft%(Temp%)
Ende% = Rght%(Temp%)
Temp% = Temp% - 1
DO
IndexLft% = Start%
IndexRght% = Ende%
x = Index((Start% + Ende%) \ 2)
DO
WHILE Index(IndexLft%) < x AND IndexLft% < Ende%
IndexLft% = IndexLft% + 1
WEND
WHILE x < Index(IndexRght%) AND IndexRght% > Start%
IndexRght% = IndexRght% - 1
WEND
IF IndexLft% > IndexRght% THEN EXIT DO
SWAP Index(IndexLft%), Index(IndexRght%) '** switch elements
SWAP Aux(IndexLft%), Aux(IndexRght%)
IndexLft% = IndexLft% - (IndexLft% < Ende%)
IndexRght% = IndexRght% + (IndexRght% > Start%)
LOOP
IF IndexRght% - Start% >= Ende% - IndexLft% THEN
IF Start% < IndexRght% THEN
Temp% = Temp% + 1
Lft%(Temp%) = Start%
Rght%(Temp%) = IndexRght%
END IF
Start% = IndexLft%
ELSE
IF IndexLft% < Ende% THEN
Temp% = Temp% + 1
Lft%(Temp%) = IndexLft%
Rght%(Temp%) = Ende%
END IF
Ende% = IndexRght%
END IF
LOOP WHILE Start% < Ende%
IF Temp% > max THEN Temp% = 0
LOOP WHILE Temp%
'
END SUB
8.9 An application: find k-th largest number
The following theorem occasionally helps to estimate the average time of
accomplishing a numerical task.
As an application we consider the following algorithm to pick the k-th number
in order from a set S of n
numbers.
Chapter 9
Branching processes
un+1 =
¥
å
k = 0
pk (un)k. (86) 9.1 The mean and the variance
mn = mn (87)
Vn = s2mn-1(1-mn)/(1-m). (88) 9.2 Generating functions of Branching processes
Let g(z) = åk = 0¥ pkzk be the
generating function.
Clearly g(z) » p0+mz for small z.
Equation (86) for probabilities un of extinction by
the n-th generation is
un+1 = g(un).
In particular, EXn = [d/ dz] g°(n)(z)|z = 1 = mn and
un = Pr(Xn = 0) = g°(n)(0).
9.2.1 Extinction
g(s) = s. (89)
9.3 Two-valued case
Below we re-analyzes the growth model presented
in Section 7.5.4. Suppose that the probabilities of
offspring are p0 = q, p2 = 1-q.
9.4 Geometric case
Suppose that the
probabilities of offspring are
p0 = 1-rq,pk = q(1-r)rk.
Chapter 10
Multivariate normal distribution10.1 Multivariate moment generating
function
a1
:
ak ].
By AT we denote the transpose of a matrix A.
M(t) = exp( t·m+
st2
2
). 10.2 Bivariate normal distribution
In this section we consider a pair of
(jointly) normal random variables X1, X2. For simplicity of
notation we suppose EX1 = 0, EX2 = 0. Let
Var(X1) = s12, Var(X2) = s2 2 and denote corr(X1, X2) = r.
Then the covariance matrix is
s12
r
r
s22 ] and the joint moment generating function is
M(t1, t2) = exp(
1
2
t12s12+
1
2
t22 s2 2+t1t2r).
1
r
r
1 ];
it generates scalar product given by
á
é
ê
ë
x1
x2
ù
ú
û
,
é
ê
ë
y1
y2
ù
ú
û
ñ
= x1y1+x2y2+rx1y2+rx2y1
x1
x2 ] ||| = (x12+x22+2rx1x2)1/2.
Notice that when r = ±1 the semi-norm is degenerate
and equals |x1±x2|.
Y1 = g1cosq+ g2sinq,
Y2 = g1sinq+ g2cosq
f(x, y) =
1
2pcos2q
exp(-
1
2cos22q
( x2+ y2-2xysin2q)) (90)
Y1 = g1,
Y2 = rg1+
æ
Ö
1-r2
g2 10.2.1 Example: normal
random walk
In this example we analyze a discrete
time Gaussian random walk {Xk}0 £ k £ T.
Let x1,x2,¼ be i. i. d. N(0,1).
We are interested in explicit formulas for the moment generating function
and
for the density of the IRT-valued random variable
X = (X1, X2, ¼, XT), where
Xk =
k
å
j = 1
xj (91)
A =
é
ê
ê
ê
ê
ê
ë
1
0
¼
0
1
1
¼
0
:
···
:
1
1
¼
1
ù
ú
ú
ú
ú
ú
û
.
M(t) = exp
1
2
(t12+(t1+t2)2+¼+(t1+t2+¼+tT)2).
A-1 =
é
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ë
1
0
0
¼
¼
0
-1
1
0
¼
¼
0
0
-1
1
¼
¼
0
0
0
-1
¼
¼
0
:
···
···
:
0
¼
0
¼
-1
1
ù
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
û
.
f(x) = (2p)-n/2exp-
1
2
(x12+(x2-x1)2+¼+(xT-xT-1)2). (92) 10.3 Simulating a multivariate normal distribution
10.3.1 General multivariate normal law
M(t) = exp( t·m+
1
2
Ct·t). (93) 10.4 Covariance matrix
Theorem 3.2 identifies m Î IRd as the mean of
the normal random variable Z = (Z1, ¼, Zd); similarly,
double differentiation M(t) at t = 0 shows that
C = [ci, j]
is given by
ci, j = Cov(Zi, Zj). This establishes the following.
M(t) = exp( t·m+
1
2
(At)·(At)). (94)
10.4.1 Multivariate normal density
Now we consider the multivariate normal density.
The density of independent g1,..., in Theorem
10.4 is the product of the one-dimensional standard normal densities, ie.
fg (x) = (2p)-d/2exp(-
1
2
|| x||2).
fZ (x) = (2p)-d/2 ( det
A)-1exp(-
1
2
|| A-1x||2),
fZ (x) = (2p)-d/2 ( det
C)-1/2exp(-
1
2
C-1x·x),
Eexp(e|| Z||2) < ¥. 10.4.2 Linear regression
For general multivariate normal random variables X and Y we have the following
linearity of regression result.
E{X|Y} = a+QY; (95)
CX
R
RT
CY ].
10.5 Gaussian Markov processes
Chapter 11
Continuous time processes are the families of random variables
Xt, with t ³ 0 interpreted as time.
Continuous time processes11.1 Poisson process
Pr
(Nt = k+1) = Pr
(T1+...+Tk+1 > t) =
ó
õ
¥
t
1/G(k+1)xke-x dx.
Pr
(Nt = k+1) =
t
k+1
Pr
(Nt = k).
Similar
processes are considered in reliability theory, and in queueing theory,
also for non-exponential sojourn times Tj.
l =
lim
h® 0
Pr
(Nt+h-Nt = 1)
h
(96) 11.1.1 The law of rare events
Let N(I) denote the number of
events that occur in interval I. We make the following postulates.
In the limit, what fraction of memory locations has two or
more variables assigned?
11.1.2 Compound Poisson process
11.1.2 Examples
11.2 Continuous time Markov processes
11.2.1 Examples of continuous time Markov processes
Despite many
similarities, continuous models differ from discrete ones in their
predictions.
11.2.1 Growth model
In Section 7.5.4 we considered discrete-time growth model
which assumed that bacteria divide at fixed time intervals. This assumption
is well known not to be satisfied - mitosis is a process that
consists of several
stages of various lengths, of which the longest may perhaps be
considered to have exponential density. In this section we shall assume that a
colony of bacteria consists
of
X(t) cells which divide at exponential moments of time, or die. We
assume individual cells multiply at a rate a and die at a rate b.
One way to interpret these numbers is to assume that there are two
competing effects: extinction or division. When there are k such
cells, the population grows one cell at a time, rate of growth is ak,
rate of death is kb. We assume a > b.
11.3 Gaussian processes
11.4 The Wiener process
W0
=
0; (97)
EWt
=
0 for all t ³ 0; (98)
EWtWs
=
min
{t, s} for all t, s ³ 0. (99) Chapter 12
Time Series12.1 Second order stationary processes
12.1.1 Positive definite functions
K(t) = K(0)Ecos(tQ). (100) 12.2 Trajectory averages
When a time series Xt is observed, 1/n(X1+...+Xn) is the
``trajectory average". It is interesting how does this compare to the
``probabilistic" average EX.
12.3 The general prediction problem
The basic problem in filtering and prediction is as follows. Given variables
X1,...,Xn, find the estimator of the value of Y with smallest quadratic
error. Case n = 1 is presented is Sections 2.11.1 and
2.14 for the linear and non-linear case.
12.4 Autoregressive processes
Xt+1 =
d
å
j = 0
aj Xt-j + xt+1 (101)
d
å
j = 0
aj
a0
Xt-j. (102)
d
å
j = 0
aj Xt-j =
å
i
bi xt-i (103) Chapter 13
Additional topics13.1 A simple probabilistic modeling in Genetics
PAA = pA2 (104)
PAa = 2papA (105)
Paa = pa2 (106)
(107) 13.2 Application: verifying matrix multiplication
Suppose one has an algorithm to multiply large matrices and we want to check if the
output is correct. A possible method is to pick the vector X of 0,1 and
check if AB X = CX. This is the so called Freivalds
technique
13.3 Exchangeability
Definition 26 A sequence (Xk) of random variables is
exchangeable, if the joint distribution of
Xs(1),Xs(2), ¼, Xs(n) is the same as the joint
distribution of X1, X2, ¼, Xn for
all n ³ 1 and for all permutations s of
{1, 2, ¼,n}.
P(X1 < a1, X2 < a2, ¼, Xn < an |\cal N)
= P(X1 < a1|\cal N) P(X1 < a2 |\cal N)¼P(X1 < an |\cal N)
EXnXm = EE{XnXm|\cal Fn} = EXnE{Xm|\cal Fn}
= EXnE{E{X|\cal Fm}|\cal Fn} = EXnE{X|\cal Fn} = EXn2.
Proof of Theorem 13.3. Let \cal N be the tail s-field, ie.
\cal N =
¥
Ç
k = 1
s(Xk, Xk+1, ¼)
Fn = f(X1, ¼, Xn);
G n, m = g(Xn+1, ¼, Xm+n) ;
H n, m, N = h(Xm+n+N+1, Xm+n+N+2, ¼),
EFnGn, mHn, m, N = EFnGn+r, mHn, m, N
E{FnGn, m|\cal Nm+n+N+1} = E{FnGn+r, m|\cal Nm+n+N+1}.
E{FnGn, m|\cal N} = E{FnGn+r, m|\cal N}.
E{FnGn, m|\cal N} = E{Gn+r, mE{Fn|\cal Nn+r+1}|\cal N}.
E{Gn+r, mE{Fn|\cal Nn+r+1}|\cal N}
E{Gn+r, mE{Fn |\cal N}|\cal N} = E{Fn |\cal N} E{Gn+r, m |\cal N}
E{FnGn, m|\cal N} = E{Fn|\cal N} E{Gn, m |\cal N}.
13.4 Distances between strings
Function Dist (U$, V$) As Integer
'Returns the edit distance
'(number of elementary changes: replacement, deletion, insertion,transposition)
'that are required to transform word U$ into V$
'
'Declare auxiliary variables
Dim m As Integer, n As Integer, j As Integer, i As Integer
Dim x As Integer, y As Integer, z As Integer, A$, B$
If Len(U$) < Len(V$) Then A$ = U$: B$ = V$: Else A$ = V$: B$ = U$
m = Len(A$)
n = Len(B$)
'Declare matrix of distances between substrings of i,j characters
ReDim D(m, n) As Integer
'Assign boundary values: distances from empty string
For i = 0 To m: D(i, 0) = i: Next i
For j = 1 To n: D(0, j) = j: Next j
'
'Main recurrence: Compute next distance D(i,j) from previously found values
For i = 1 To m
For j = 1 To n
x = D(i - 1, j) + 1 'delete character
y = D(i, j - 1) + 1 'inserte character
x = Intmin(x, y) 'choose better (Integer Minimum)
y = D(i - 1, j - 1) - (Mid$(A$, i, 1) <> Mid$(B$, j, 1)) 'swap characters i,j
x = Intmin(x, y) ' choose better
z = 0
If i > 1 And j > 1 Then
'If Mid$(A$, i, 1) <> Mid$(B$, j, 1) Then
z = (Mid$(A$, i, 1) = Mid$(B$, j - 1, 1)) * (Mid$(A$, i - 1, 1) = Mid$(B$, j, 1))
'End If
y = (1 + D(i - 2, j - 2)) * z + x * (1 - z)
x = Intmin(x, y)
End If
D(i, j) = x
Next j
Next i
Dist = D(m, n)'current value
End Function
13.5 A model of cell growth
13.5 Questions of interest
How does the growth of cells affect other cells? How to control mixed populations
of cells to stay within prescribed limits?
13.6 Shannon's Entropy
Let X1,... Xn be independent identically distributed
(i. i. d.) discrete r. v with k values, say {v1,..., vk}. Put
X = (X1,...,Xk).
H(X) = -Elogf(X) = -
å
x
f(x)logf(x) (108) 13.6.1 Optimal search
Coding
Huffman's code
13.7 Application: spread of epidemic
Appendix A
Theoretical complementsA.1 Lp-spaces
Inequalities related to expected values are best stated in geometric language of
norms and normed spaces. We say that X Î Lp, if X is p-integrable, i.e.
E|X|p < ¥. In particular, X is square integrable if EX2 < ¥.
||X||p =
ì
ï
ï
í
ï
ï
î
_____
pÖE|X|p
if p ³ 1;
ess sup|X|
if p = ¥.
||X||p £ ||X||q. (109)
EXY £ ||X||p||Y||q. (110)
||X+Y||p £ ||X||p+||Y||p. (111)
ab £ ap/p+bq/q. (112)
tp/p+t-q/q ³ 1.
Proof of Theorem A.1 (ii).
If either ||X||p = 0 or ||Y||q = 0, then XY = 0 a. s. Therefore we
consider only the case ||X||p||Y||q > 0 and after rescaling we assume
||X||p = ||Y||q = 1. Furthermore, the case p = 1, q = ¥ is trivial
as
|XY| £ |X| ||Y||¥. For 1 < p < ¥ by (112) we have
|XY| £ |X|p/p+|Y|q/q.
Proof of Theorem A.1 (i).
For p = 1 this is just Jensen's inequality; for a more general
version see Theorem . For 1 < p < ¥
by Hölder's inequality applied to the product of 1 and |X|p
we have
||X||pp = E{|X|p ·1} £ (E|X|q)p/q (E1r)1/r = ||X||qp,
Proof of Theorem A.1 (iii).
The inequality is trivial if p = 1 or if ||X+Y||p = 0. In the remaining
cases
||X+Y||pp £ E{(|X|+|Y|)|X+Y|p-1} = E{|X||X+Y|p-1}+ E{|Y||X+Y|p-1}.
||X+Y||pp £ ||X||p||X+Y||pp/q+||Y||p||X+Y||pp/q.
A.2 Properties of conditional expectations
In more advanced courses conditional expectation E(X|Y) is defined as a random
variable f(Y) that satisfies EXf(Y) = Ef(Y)f(Y) for all bounded measurable
(or just continuous) functions f.
Proof of Theorem A.2.
ó
õ
A
Y1 dP =
ó
õ
A
Y2 dP
ó
õ
A
Y1 dP =
ó
õ
A
g(X) dP ³ fa, b(
ó
õ
A
X) dP = fa, b(
ó
õ
A
E{X|\cal F}) dP =
ó
õ
A
Y2 dP.
Problem 79 Prove part (v) of Theorem A.2.
P(|X| > t |Y) £ E{|X| |Y}/t
E{X|Y} = aY, E{Y|X} = bX, Appendix B
The following sections are short reference on material from general math
(calculus, linear algebra, etc).
Math backgroundB.1 Interactive links
The following links are operational as of Mar 13, 1998
. Please note that these may
change at any time.
B.2 Elementary combinatorics
The art of counting is called combinatorics.
Here is a short listing of the formulas. All are the consequences of the
product rule of counting.
Permutation is an arrangement (ordering) of k out of n distinct objects
without repetitions. The number of permutations is [n!/((n-k)!)]. In
particular, the number of ways to order n objects is n! B.3 Limits
The following limit can be computed by L'Hospital rule.
lim
n® ¥
(1+a/n)n = ea (113) B.4 Power series expansions
The following power series expansions are of interest in this course.
ex =
¥
å
k = 0
xk/k! (114)
1
1-x
=
¥
å
k = 0
xk (115)
ln1+x =
¥
å
k = 0
(-1)k
xk+1
k+1
ex2/2 =
¥
å
k = 0
ó
õ
x
0
e-t2/2 dt =
¥
å
k = 0
(116) B.5 Multivariate integration
ó
õ
ó
õ
U
f(x,y) dxdy =
ó
õ
ó
õ
V
f(x(u,v), y(u,v)) |J(u,v)| dudv, (117)
J = det
é
ê
ê
ê
ê
ë
¶x
¶u
¶x
¶v
¶y
¶u
¶y
¶v
ù
ú
ú
ú
ú
û
(118) B.6 Differential equations
B.7 Linear algebra
B.8 Fourier series
The Fourier series for a function f(x) is the expansion
f(x) = ån an sinpnx+ bn cospnx. Every periodic continuous function f can be expanded in the Fourier
series. The coefficients are
b0 =
1
p
ó
õ
p
-p
f(x) dx (119)
bn =
1
2p
ó
õ
p
-p
f(x)cosnpx dx for n ¹ 0 (120)
an =
1
2p
ó
õ
p
-p
f(x)sinnpx dx (121)
(122) B.9 Powers of matrices
The Cayley - Hamilton theorem asserts that each d×d matrix A
satisfies the polynomial equation Qd(A) = 0, where Qd(x) = det(A-xI) is the
characteristic polynomial of degree d.
Appendix C
Calculators and more sophisticated math-geared software have efficient numerical
methods built-in. Here are short prescriptions that may be used by general
programmer. A more complete source is e.g.Numerical Recipes: The Art of
Scientific Computing series.
Numerical MethodsC.1 Numerical integration
C.2 Solving equations
C.3 Searching for minimum
The analog of the bisection method for finding minima is to begin with three
points a < b < c such that f(b) is the smallest of the three. The next triple is
produced by partitioning the larger of the two segments in
proportion38 [(3-Ö5)/ 2] » 0.38197, and
comparing the resulting four values to pick the narrower triplet.
Appendix D
Programming help is grouped by the type of software.
Currently available (as of Mar 13, 1998
) are preliminary versions of:
Programming Help
D.1 Introducing BASIC
Correctly written programs halt at the end. But not all programs do that, so an
``emergency exit" is build in.
D.1 how to use this chapter
This text was written for a complete novice to BASIC. If you fit the description,
read the pages below, and type in each of the sample programs. Once you have them
in the QBASIC, run them to see the effects of various instrunctions.
D.2 Computing
The mathematical conventions in QBASIC are
2^(1/2) for 21/2 , SQR(13) for [Ö13],
LOG(2) for the natural logarithm ln2, etc.
With these, one can use QBASIC as a calculator.
For instance the instruction
PRINT LOG(1+2^(1/2))
D.3 The structure of a program
PRINT "Today is "; DATE$
PRINT TIME$
PRINT "HELLO" 'greet the user
PRINT "Today is "; DATE$ 'print date
PRINT TIME$ ' print time could have printer TIMER=# seconds since 1980
CLS 'clear display
CenterPrint "HELLO" 'greet the user
CenterPrint "Today is " & DATE$ 'print date - string has to be concatenated
CenterPrint TIME$ ' print time could have printer TIMER=# seconds since 1980
SUB CenterPrint (Text$)
'**** print text Text$ centered on screen
l=41-LEN(Text$)/2
if l<0 then l=0 'too long text cannot be centered
print TAB(l); Text$
D.4 Conditionals and loops
Loops of fixed size are best handled by
FOR j=1 TO 100
S=S+j
NEXT J
PRINT S
CASE 0 ...
CASE .5 ...
CASE ELSE
END SELECT
n=1
max = 30000
nc = n - 1
WHILE k < max
DO
nc = nc + 1
prime = -1 'TRUE
FOR j = 2 TO SQR(nc)
IF (nc MOD j) = 0 THEN prime = 0: EXIT FOR
NEXT j
IF prime THEN EXIT DO
LOOP
k = k + 1
Print k;"-th prime is "; nc
WEND
D.5 User input
D.6 More about printing
Instruction PRINT is used to print to the screen, and in a slightly
modified version, to the file on the disk.
QBASIC provides sophisticated ways of controlling the text output by format,
colors, location on the screen. In addition it does have graphic statements, as
long as the computer has graphics card. But the only thing needed for us is the
regular
D.7 Arrays and matrices
Vectors and matrices are handled as arrays. They need to be declared to reserve room in memory. This is accomplished by the dimensioning statement
tt DIM ArrayName (Size1, Size 2, Size 3).
For
instance DIM A(100) specifies a vector, DIM A(20,50) defines a
20×50 matrix. Arrays use up a lot of memory, so don't declare arrays
larger than what you need, and pay attention to .
D.8 Data types
QBASIC supports many built in data types. Use declaration like DIM n AS INTEGER,
or append the name by the corresponding marker n%.
D.9 User defined FUNCTIONs and SUBs
To create a SUB, choose New SUB from the Edit menu.
To create a FUNCTION, choose New FUNCTION from the Edit menu.
Solution=-b/a
END FUNCTION
D.10 Graphics
D.11 Binary operations
D.12 File Operations
Beginners in BASIC need no file operations to solve the exercises.
OPEN FileName FOR INPUT as #17
D.13 Good programming
Adhering to good programming principles
pays in clarity of programs, and facilitates debugging (finding errors).
Adhering
to the principles below is not a guarantee that the programs will work. It is also possible
to write programs that execute correctly without any of the below. Nevertheless,
it is a good habit to follow these recommendation. The gain is in clarity of the program,
readability of its portions. Consequently, you will be able to design more complex
programs that execute as expected. You will also re-use components easier.
D.14 Example: designing an automatic card dealer
A deck of cards consists of 52 cards. Each card has two attributes:
D.14.1 First Iteration
Once we realize what are the
natural steps the real person would go through, the program is very easy to design. Here is the program(!)
'PROGRAM: GIVECARD.BAS
InitializeDeck
ShuffleDeck
'ask first player
n=HowManyCards
DealCards(n)
'ask second player
n=HowManyCards
DealCards(n)
Suite as string
Value as integer ' we may want string here, too!
End type
DIM Shared Order(52)
D.14.1 SUBS
SUB InitializeDeck
'Initialize cards to their values.
'
END SUB
'Make a random permutation
'Store it in shared array Order()
'Order(1), Order(2) are distinct random numbers range 1,...n,
END SUB
FUNCTION HowManyCards
' Ask user how many cards (s)he requests
' store in variable.
' Check for possible errors in input
' return value if enough cards are left.
END FUNCTION
' Remember how many cards are left
' Check how many cards are left
' Print out Error message if not enough cards
'Print next n cards
' You have to decide here HOW the cards will appear on screen:
'words? pictures? numbers?
' (In this example, it will be numbers)
END SUB
D.14.2 Second Iteration
DEF TYPE Cards
'***test***
REM Suite as string
Suite as integer
'*** end test ***
Value as integer ' we may want string here, too!
End type
Dim Shared Deck(52) as cards
DIM Shared Order(52)
InitializeDeck
ShuffleDeck
'ask first player
n=HowManyCards
DealCards(n)
'ask second player
n=HowManyCards
DealCards(n)
SUB InitializeDeck
'Initialize cards to their values.
'*** test ***
for j = 1 to 52
Deck(j).value=j mod 13 +1
Deck(j).suit=j mod 4 +1
next j
'*** end test ***
END SUB
SUB ShuffleDeck
'Make a random permutation
'Store it in Order()
'Order(1), Order(2) are distinct random numbers range 1,...n,
'*** test ***
'Factory order
For j= 1 to 52
Order(j)=j
NEXT j
'*** end test ***
END SUB
FUNCTION HowManyCards%
' Ask user how many cards (s)he requests
' store in variable.
' Check how many cards are left
' Print out Error message if not enough cards
' return value if enough cards are left.
'*** test ***
INPUT ``How many cards"; x
' should check for "crazy" answers here
HowManyCards=x
'*** end test ***
END FUNCTION
SUB DealCards(n)
' Remember how many cards are left
'Print next n cards
'*** test ***
' just print first n cards for now
Print "Your cards are:"
FOR j=1 TO n
Print Deck(Order(j)).Value ; " of suit No" ;Deck(Order(j)).Suit
NEXT j
'*** end test ***
END SUB
D.14.3 Third iteration
Now we are ready to design/code each SUB. This is left for the reader to do.
Here are some hints.
Bibliography
Footnotes:
PRINT ".";
NEXT J
File translated from TEX by TTH, version 1.31.