This book is a concise presentation of the normal distribution on the real line and its counterparts on more abstract spaces, which we shall call the Gaussian distributions. The material is selected towards presenting characteristic properties, or characterizations, of the normal distribution. There are many such properties and there are numerous relevant works in the literature. In this book special attention is given to characterizations generated by the so called Maxwell's Theorem of statistical mechanics, which is stated in the introduction as Theorem . These characterizations are of interest both intrinsically, and as techniques that are worth being aware of. The book may also serve as a good introduction to diverse analytic methods of probability theory. We use characteristic functions, tail estimates, and occasionally dive into complex analysis.
In the book we also show how the characteristic properties can be used to prove important results about the Gaussian processes and the abstract Gaussian vectors. For instance, in Section we present Fernique's beautiful proofs of the zero-one law and of the integrability of abstract Gaussian vectors. The central limit theorem is obtained via characterizations in Section .
The excellent book by Kagan, Linnik & Rao [] overlaps with ours in the coverage of the classical characterization results. Our presentation of these is sometimes less general, but in return we often give simpler proofs. On the other hand, we are more selective in the choice of characterizations we want to present, and we also point out some applications. Characterization results that are not included in [] can be found in numerous places of the book, see Section , Chapter and Chapter .
We have tried to make this book accessible to readers with various backgrounds. If possible, we give elementary proofs of important theorems, even if they are special cases of more advanced results. Proofs of several difficult classic results have been simplified. We have managed to avoid functional equations for non-differentiable functions; in many proofs in the literature lack of differentiability is a major technical difficulty.
The book is primarily aimed at graduate students in mathematical statistics and probability theory who would like to expand their bag of tools, to understand the inner workings of the normal distribution, and to explore the connections with other fields. Characterization aspects sometimes show up in unexpected places, cf. Diaconis & Ylvisaker []. More generally, when fitting any statistical model to the data, it is inevitable to refer to relevant properties of the population in question; otherwise several different models may fit the same set of empirical data, cf. W. Feller []. Monograph [] by Prakasa Rao is written from such perspective and for a statistician our book may only serve as a complementary source. On the other hand results presented in Sections and are quite recent and virtually unknown among statisticians. Their modeling aspects remain to be explored, see Section . We hope that this book will popularize the interesting and difficult area of conditional moment descriptions of random fields. Of course it is possible that such characterizations will finally end up far from real life like many other branches of applied mathematics. It is up to the readers of this book to see if the following sentence applies to characterizations as well as to trigonometric series.
``Thinking of the extent and refinement reached by the theory of trigonometric series in its long development one sometimes wonders why only relatively few of these advanced achievements find an application.''
(A. Zygmund, Trigonometric Series, Vol. 1, Cambridge Univ. Press, Second Edition, 1959, page xii)
There is more than one way to use this book. Parts of it have been used in a graduate one-quarter course Topics in statistics. The reader may also skim through it to find results that he needs, or look up the techniques that might be useful in his own research. The author of this book would be most happy if the reader treats this book as an adventure into the unknown - picks a piece of his liking and follows through and beyond the references. With this in mind, the book has a number of references and digressions. We have tried to point out the historical perspective, but also to get close to current research.
An appropriate background for reading the book is a one year course in real analysis including measure theory and abstract normed spaces, and a one-year course in complex analysis. Familiarity with conditional expectations would also help. Topics from probability theory are reviewed in Chapter , frequently with proofs and exercises. Exercise problems are at the end of the chapters; solutions or hints are in Appendix .
The book benefited from the comments of Chris Burdzy, Abram Kagan, Samuel Kotz, Wodek Smoleński, Paweł Szabowski, and Jacek Wesołowski. They read portions of the first draft, generously shared their criticism, and pointed out relevant references and errors. My colleagues at the University of Cincinnati also provided comments, criticism and encouragement. The final version of the book was prepared at the Institute for Applied Mathematics of the University of Minnesota in fall quarter of 1993 and at the Center for Stochastic Processes in Chapel Hill in Spring 1994. Support by C. P. Taft Memorial Fund in the summer of 1987 and in the spring of 1994 helped to begin and to conclude this endeavor.
The following narrative comes from J. F. W. Herschel [].
``Suppose a ball is dropped from a given height, with the intention that it shall fall on a given mark. Fall as it may, its deviation from the mark is error, and the probability of that error is the unknown function of its square, ie. of the sum of the squares of its deviations in any two rectangular directions. Now, the probability of any deviation depending solely on its magnitude, and not on its direction, it follows that the probability of each of these rectangular deviations must be the same function of its square. And since the observed oblique deviation is equivalent to the two rectangular ones, supposed concurrent, and which are essentially independent of one another, and is, therefore, a compound event of which they are the simple independent constituents, therefore its probability will be the product of their separate probabilities. Thus the form of our unknown function comes to be determined from this condition...''
Ten years after Herschel, the reasoning was repeated by J. C. Maxwell []. In his theory of gases he assumed that gas consists of small elastic spheres bumping each other; this led to intricate mechanical considerations to analyze the velocities before and after the encounters. However, Maxwell answered the question of his Proposition IV: What is the distribution of velocities of the gas particles? without using the details of the interaction between the particles; it led to the emergence of the trivariate normal distribution. The result that velocities are normally distributed is sometimes called Maxwell's theorem. At the time of discovery, probability theory was in its beginnings and the proof was considered ``controversial'' by leading mathematicians.
The beauty of the reasoning lies in the fact that the interplay of two very natural assumptions: of independence and of rotation invariance, gives rise to the normal law of errors - the most important distribution in statistics. This interplay of independence and invariance shows up in many of the theorems presented below.
Here we state the Herschel-Maxwell theorem in modern notation but without proof; for one of the early proofs, see []. The reader will see several proofs that use various, usually weaker, assumptions in Theorems , , , , and .
Theorem 1
Suppose random variables
$X, Y$ have joint probability distribution $\mu(dx, dy)$
such that
(i) $\mu(\cdot)$ is invariant under the rotations of $\mathbb{R}^2$;
(ii) $X, Y$ are independent.
Then $X, Y$ are normally distributed.
This theorem has generated a vast literature. Here is a quick preview of pertinent results in this book.
Polya's theorem [] presented in Section says that if just two rotations, by angles $\pi/2$ and $\pi/4$, preserve the distribution of $X$, then the distribution is normal. Generalizations to characterizations by the equality of distributions of more general linear forms are given in Chapter . One of the most interesting results here is Marcinkiewicz's theorem [], see Theorem .
An interesting modification of Theorem *, discovered by M. Sh. Braverman [] and presented in Section below, considers three i. i. d. random variables X, Y, Z with the rotation-invariance assumption (i) replaced by the requirement that only some absolute moments are rotation invariant.
Another insight is obtained if one notices that assumption (i) of Maxwell's theorem implies that rotations preserve the independence of the original random variables $X, Y$. In this approach we consider a pair $X, Y$ of independent random variables such that the rotation by an angle $\alpha$ produces two independent random variables $X\cos\alpha + Y\sin\alpha$ and $X\sin\alpha - Y\cos\alpha$. Assuming this for all angles $\alpha$, M. Kac [] showed that the distribution in question has to be normal. Moreover, careful inspection of Kac's proof reveals that the only essential property he had used was that $X, Y$ are independent and that just one $\pi/4$-rotation: $(X+Y)/\sqrt{2}$, $(X-Y)/\sqrt{2}$ produces the independent pair. The result explicitly assuming the latter was found independently by Bernstein []. Bernstein's theorem and its extensions are considered in Chapter ; Bernstein's theorem also motivates the assumptions in Chapter .
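The role of the $\pi/4$-rotation can be illustrated by an exact enumeration for a non-normal law. The sketch below is only an illustration, not part of the original argument: the function name and the choice of finitely supported uniform laws are assumptions made here. It checks whether $X+Y$ and $X-Y$ are independent when $X, Y$ are i. i. d.; the common factor $1/\sqrt{2}$ is dropped because it does not affect independence.

```python
from fractions import Fraction
from itertools import product

def rotated_pair_is_independent(values):
    """Check independence of (X+Y, X-Y) for i.i.d. X, Y uniform on `values`."""
    n = len(values)
    weight = Fraction(1, n * n)          # each pair (x, y) is equally likely
    joint, law_u, law_v = {}, {}, {}
    for x, y in product(values, repeat=2):
        u, v = x + y, x - y              # rescaling by 1/sqrt(2) would not change independence
        joint[(u, v)] = joint.get((u, v), 0) + weight
        law_u[u] = law_u.get(u, 0) + weight
        law_v[v] = law_v.get(v, 0) + weight
    # Independence means the joint law is the product of the marginal laws.
    return all(joint.get((u, v), 0) == law_u[u] * law_v[v]
               for u in law_u for v in law_v)

print(rotated_pair_is_independent([-1, 1]))   # random signs
print(rotated_pair_is_independent([5]))       # degenerate (constant) variable
```

For random signs the rotated pair fails to be independent, while the constant variable (a degenerate normal law) passes, in agreement with Bernstein's theorem.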
The following is a more technical description of the contents of the book.
Chapter collects
probabilistic prerequisites. The emphasis is on analytic aspects; in particular
elementary but useful tail estimates collected in Section . In Chapter
we
approach multivariate normal distributions through characteristic functions.
This is a less intuitive but powerful method. It leads rapidly to several
fundamental facts, and to associated Reproducing Kernel Hilbert Spaces
(RKHS). As an illustration, we prove the large deviation estimates on $\mathbb{R}^d$
which use the conjugate RKHS norm. In Chapter
the reader is introduced to
stability and equidistribution of linear forms in independent random
variables. Stability is directly related to the CLT. We show that in the
abstract setup stability is also responsible for the zero-one law. Chapter
presents the analysis of rotation invariant distributions on $\mathbb{R}^d$ and on
$\mathbb{R}^\infty$. We study when a rotation invariant distribution has to be
normal. In the process we analyze structural properties of rotation
invariant laws and introduce the relevant techniques. In this chapter we
also present surprising results on rotation invariance of the absolute
moments.
We conclude with a short proof of
de Finetti's theorem and point out its implications for infinite spherically symmetric
sequences. Chapter parallels Chapter in analyzing the
role of independence of linear forms. We show that independence of certain
linear forms, a characteristic property of the normal distribution, leads to
the zero-one law, and it is also responsible for exponential moments. Chapter
is a short introduction to measures of dependence and stability
issues. Theorem establishes integrability under conditions of
interest, eg. in polynomial biorthogonality as studied by Lancaster
[]. In Chapter we extend results in Chapter
to conditional moments. Three interesting aspects emerge here. First,
normality can frequently be recognized from the conditional moments of linear
combinations of independent random variables; we illustrate this by a simple
proof of the well known fact that the independence of the sample mean and the
sample variance characterizes normal populations, and by
the proof of the central limit theorem. Secondly, we show that for infinite
sequences, conditional moments determine normality without any reference to
independence. This part has its natural continuation in Chapter .
Thirdly, in the exercises we point out
the versatility of conditional moments in handling other infinitely
divisible distributions. Chapter is a short introduction to
continuous parameter random fields, analyzed through their conditional
moments. We also present a self-contained analytic construction of the
Wiener process.
Most of the contents of this section is fairly standard probability theory. The reader shouldn't be under the impression that this chapter is a substitute for a systematic course in probability theory. We will skip important topics such as limit theorems. The emphasis here is on analytic methods; in particular characteristic functions will be extensively used throughout.
Let $(\Omega, \mathcal{M}, P)$ be the probability space, ie. $\Omega$ is a set, $\mathcal{M}$ is a $\sigma$-field of its subsets and $P$ is the probability measure on $(\Omega, \mathcal{M})$. We follow the usual conventions: $X, Y, Z$ stand for real random variables; boldface $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$ denote vector-valued random variables. Throughout the book $EX = \int_\Omega X(\omega)\,dP$ (Lebesgue integral) denotes the expected value of a random variable $X$. We write $X \cong Y$ to denote the equality of distributions, ie. $P(X \in A) = P(Y \in A)$ for all measurable sets $A$. Equalities and inequalities between random variables are to be interpreted almost surely (a. s.). For instance $X \le Y+1$ means $P(X \le Y+1) = 1$; the latter is a shortcut that we use for the expression $P(\{\omega \in \Omega: X(\omega) \le Y(\omega)+1\}) = 1$.
Boldface $\mathbf{A}, \mathbf{B}, \mathbf{C}$ will denote matrices. For a complex $z = x+iy \in \mathbb{C}$ by $x = \Re z$ and $y = \Im z$ we denote the real and the imaginary part of $z$. Unless otherwise stated, $\log a = \log_e a$ denotes the natural logarithm of number $a$.
Given a real number $r \ge 0$, the absolute moment of order $r$ is defined by $E|X|^r$; the ordinary moment of order $r = 0, 1, \dots$ is defined as $EX^r$. Clearly, not every sequence of numbers is the sequence of moments of a random variable $X$; it may also happen that two random variables with different distributions have the same moments. However, in Corollary below we will show that the latter cannot happen for normal distributions.
The following inequality is known as Chebyshev's inequality. Despite its simplicity it has numerous non-trivial applications, see eg. Theorem or [].
[ 1
If $f: \mathbb{R}_+ \to \mathbb{R}_+$ is
a non-decreasing function and $Ef(|X|) = C < \infty$, then
for all $t > 0$ such that
$f(t) \ne 0$ we have
$P(|X| > t) \le C/f(t)$. (1)
Indeed, $Ef(|X|) = \int_\Omega f(|X|)\,dP \ge \int_{|X| \ge t} f(|X|)\,dP \ge \int_{|X| \ge t} f(t)\,dP = f(t)P(|X| \ge t) \ge f(t)P(|X| > t)$.
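A small exact check of inequality (1) may help fix the notation. The sketch below is illustrative only: the four-point law of $X$ and the choice $f(u) = u^2$ are assumptions made here, not taken from the text.

```python
from fractions import Fraction

# Hypothetical example law: X takes the values -2, 0, 1, 3 with these probabilities.
law = {-2: Fraction(1, 8), 0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 8)}

def expectation(g):
    """E g(X) for the finite law above."""
    return sum(p * g(x) for x, p in law.items())

def tail(t):
    """P(|X| > t)."""
    return sum(p for x, p in law.items() if abs(x) > t)

def f(u):
    """A non-decreasing function on [0, infinity); here f(u) = u^2."""
    return u * u

C = expectation(lambda x: f(abs(x)))      # C = E f(|X|)
for t in (Fraction(1, 2), 1, 2, 3):
    assert tail(t) <= C / f(t)            # inequality (1)
    print(t, tail(t), C / f(t))
```

Exact rational arithmetic is used so that the comparison is not affected by rounding.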
It follows immediately from Chebyshev's inequality that if $E|X|^p = C < \infty$, then $P(|X| > t) \le C/t^p$, $t > 0$. An implication in converse direction is also well known: if $P(|X| > t) \le C/t^{p+\epsilon}$ for some $\epsilon > 0$ and for all $t > 0$, then $E|X|^p < \infty$, see () below.
The following formula will often be useful.
[ 2
If $f: \mathbb{R}_+ \to \mathbb{R}$ is a function such
that
$f(x) = f(0) + \int_0^x g(t)\,dt$, $E\{|f(X)|\} < \infty$
and $X \ge 0$, then
$Ef(X) = f(0) + \int_0^\infty g(t)P(X \ge t)\,dt$. (2)
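Proof. The formula follows from Fubini's theorem, since for $X \ge 0$
$Ef(X) = f(0) + E\int_0^\infty g(t) I_{\{t \le X\}}\,dt = f(0) + \int_0^\infty g(t)\,E I_{\{t \le X\}}\,dt = f(0) + \int_0^\infty g(t)P(X \ge t)\,dt$.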
[ 1
If $E|X|^r < \infty$ for an integer $r > 0$, then
$EX^r = r\int_0^\infty t^{r-1}P(X \ge t)\,dt + (-1)^r\, r\int_0^\infty t^{r-1}P(-X \ge t)\,dt$. (3)
$E|X|^r = r\int_0^\infty t^{r-1}P(|X| \ge t)\,dt$. (4)
Proof. Formula (4) follows directly from Proposition 1.1 (with $f(x) = x^r$ and $g(t) = \frac{d}{dt}f(t) = rt^{r-1}$), applied to $|X|$.
Since $EX^r = E(X^+)^r + (-1)^r E(X^-)^r$, where $X^+ = \max\{X, 0\}$ and
$X^- = \max\{-X, 0\}$, applying Proposition 1.1
separately to each of these expectations we get (3).
[¯]
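Formula (4) can be checked exactly for a finitely supported law, because the tail $P(|X| \ge t)$ is then a step function and the integral reduces to a finite sum. The following is a minimal sketch under that assumption; the particular law and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical example law of X: {value: probability}.
law = {-2: Fraction(1, 8), 0: Fraction(1, 2), 1: Fraction(1, 4), 3: Fraction(1, 8)}

def abs_moment(r):
    """E|X|^r computed directly from the law."""
    return sum(p * abs(x) ** r for x, p in law.items())

def tail_integral(r):
    """Exact value of r * integral_0^infinity t^(r-1) P(|X| >= t) dt."""
    total, previous = Fraction(0), 0
    for v in sorted({abs(x) for x in law if x != 0}):
        tail = sum(p for x, p in law.items() if abs(x) >= v)   # constant on (previous, v]
        total += (v ** r - previous ** r) * tail               # integral of r t^(r-1) over (previous, v]
        previous = v
    return total

for r in (1, 2, 3):
    print(r, abs_moment(r), tail_integral(r))
```

The two columns printed for each $r$ agree, as (4) predicts.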
| |||||||||||||||
Several useful inequalities are collected in the following.
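Theorem 2
Let $\|X\|_p = (E|X|^p)^{1/p}$.
(i) For $0 < p \le q$ we have $\|X\|_p \le \|X\|_q$. (5)
(ii) For $1 < p < \infty$ and $1/p+1/q = 1$ we have $EXY \le \|X\|_p\|Y\|_q$. (6)
(iii) For $1 \le p < \infty$ we have $\|X+Y\|_p \le \|X\|_p + \|Y\|_p$. (7)
For $1 \le p < \infty$ the conjugate space to $L_p$ (ie. the space of all bounded linear functionals on $L_p$) is usually identified with $L_q$, where $1/p+1/q = 1$. The identification is by the duality $\langle f, g\rangle = \int f(\omega)g(\omega)\,dP$.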
For the proof of Theorem 1.2 we need the following elementary inequality.
[ 1 For $a, b > 0$, $1 < p < \infty$ and $1/p+1/q = 1$ we have
$ab \le a^p/p + b^q/q$. (8)
Proof. Function $t \mapsto t^p/p + t^{-q}/q$ has the derivative $t^{p-1} - t^{-q-1}$. The derivative is positive for $t > 1$ and negative for $0 < t < 1$. Hence the minimum value of the function for $t > 0$ is attained at $t = 1$, giving
|
|
|
|
|
By Var(X) we shall denote the variance of a square integrable r. v. X
|
|
Theorem 3
If there are $C > 1$, $0 < q < 1$, $x_0 \ge 0$ such that
for all $x > x_0$
$N(Cx) \le q\,N(x-x_0)$, (9)
Proof. Let $a_n$ be such that when $a_n = x_n - x_0$ then $a_{n+1} = Cx_n$. Solving the resulting recurrence we get $a_n = C^n - b$, where $b = Cx_0(C-1)^{-1}$. Equation (9) says $N(a_{n+1}) \le q\,N(a_n)$. Therefore
|
|
[ 2 If there is $0 < q < 1$ and $x_0 \ge 0$ such that $N(2x) \le q\,N(x-x_0)$ for all $x > x_0$, then $E|X|^\beta < \infty$ for all $\beta < \log_2 1/q$.
[ 3
Suppose there is $C > 1$ such that
for every $0 < q < 1$ one can find $x_0 \ge 0$ such that
$N(Cx) \le q\,N(x)$ (10)
As a special case of Corollary 1.3 we have the following.
[ 4
Suppose there are $C > 1$, $K < \infty$ such that
$N(Cx) \le K\,\frac{N(x)}{x^2}$ (11)
The next result deals with exponentially small tails.
Theorem 4
If there are $C > 1$, $1 < K < \infty$, $x_0 \ge 0$ such that
$N(Cx) \le K N^2(x-x_0)$ (12)
$N(x) \le M\exp(-\beta x^\alpha)$,
Proof. As in the proof of Theorem 1.3, let $a_n = C^n - b$, $b = Cx_0/(C-1)$. Put $q_n = \log_K N(a_n)$. Then (12) gives
|
| (13) |
| (14) |
Since $a_n \to \infty$, we have $N(a_n) \to 0$ and $q_n \to -\infty$. Choose $m$ large enough to have $1+q_m < 0$. Then (14) implies
|
|
[¯]
[ 5
If there are $C < \infty$, $x_0 \ge 0$ such that
$N(\sqrt{2}\,x) \le C N^2(x-x_0)$,
[ 6
If there are $C < \infty$, $x_0 \ge 0$ such that
$N(2x) \le C N^2(x-x_0)$,
Below we recall the definition of the conditional expectation of a r. v. with respect to a $\sigma$-field and we state several results that we need for future reference. The definition is as old as axiomatic probability theory itself, see []. The reader not familiar with conditional expectations should consult textbooks, eg. Billingsley [], Durrett [], or Neveu [].
Definition 1 Let $(\Omega, \mathcal{M}, P)$ be a probability space. If $\mathcal{F} \subset \mathcal{M}$ is a $\sigma$-field and $X$ is an integrable random variable, then the conditional expectation of $X$ given $\mathcal{F}$ is an integrable $\mathcal{F}$-measurable random variable $Z$ such that $\int_A X\,dP = \int_A Z\,dP$ for all $A \in \mathcal{F}$.
Conditional expectation of an integrable random variable $X$ with respect to a $\sigma$-field $\mathcal{F} \subset \mathcal{M}$ will be denoted interchangeably by $E\{X|\mathcal{F}\}$ and $E^{\mathcal{F}}X$. We shall also write $E\{X|Y\}$ or $E^Y X$ for the conditional expectation $E\{X|\mathcal{F}\}$ when $\mathcal{F} = \sigma(Y)$ is the $\sigma$-field generated by a random variable $Y$.
Existence and almost sure uniqueness of the conditional expectation $E\{X|\mathcal{F}\}$ follows from the Radon-Nikodym theorem, applied to the finite signed measures $\mu(A) = \int_A X\,dP$ and $P|_{\mathcal{F}}$, both defined on the measurable space $(\Omega, \mathcal{F})$. In some simple situations more explicit expressions can also be found.
Example. Suppose $\mathcal{F}$ is a $\sigma$-field generated by the events $A_1, A_2, \dots, A_n$ which form a non-degenerate disjoint partition of the probability space $\Omega$. Then it is easy to check that
|
|
Example. Suppose that $f(x, y)$ is the joint density with respect to the Lebesgue measure on $\mathbb{R}^2$ of the bivariate random variable $(X, Y)$ and let $f_Y(y) \ne 0$ be the (marginal) density of $Y$. Put $f(x|y) = f(x, y)/f_Y(y)$. Then $E\{X|Y\} = h(Y)$, where $h(y) = \int_{-\infty}^\infty x f(x|y)\,dx$.
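For a finite partition, Definition 1 can be verified by direct computation. The sketch below assumes the standard block-average form of the conditional expectation (on each block $A_k$ it equals $\int_{A_k} X\,dP / P(A_k)$); the six-point probability space, the variable $X$ and the partition are hypothetical choices made for illustration.

```python
from fractions import Fraction

# Hypothetical finite probability space: six equally likely outcomes.
omega = range(6)
prob = {w: Fraction(1, 6) for w in omega}
X = {w: w * w for w in omega}                 # a random variable on omega
partition = [{0, 1}, {2, 3, 4}, {5}]          # events generating the sigma-field F

def integral(Z, event):
    """Integral of Z over `event` with respect to P."""
    return sum(Z[w] * prob[w] for w in event)

def cond_exp(Z):
    """E{Z|F}: on each block it equals the P-average of Z over that block (assumed standard form)."""
    result = {}
    for block in partition:
        p_block = sum(prob[w] for w in block)
        average = integral(Z, block) / p_block
        for w in block:
            result[w] = average
    return result

Z = cond_exp(X)
print([Z[w] for w in omega])
# Defining property: the integrals of X and Z agree on every generating event.
print(all(integral(X, A) == integral(Z, A) for A in partition))
```

Since every event of $\mathcal{F}$ is a union of blocks, agreement on the blocks gives agreement on all of $\mathcal{F}$.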
The next theorem lists properties of conditional expectations that will be used without further mention.
Theorem 5
The proof uses the following.
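[ 2 If $Y_1$ and $Y_2$ are $\mathcal{F}$-measurable and $\int_A Y_1\,dP \le \int_A Y_2\,dP$ for all $A \in \mathcal{F}$, then $Y_1 \le Y_2$ almost surely. If $\int_A Y_1\,dP = \int_A Y_2\,dP$ for all $A \in \mathcal{F}$, then $Y_1 = Y_2$.
Proof. Let $A_\epsilon = \{Y_1 > Y_2+\epsilon\} \in \mathcal{F}$. Since $\int_{A_\epsilon} Y_1\,dP \ge \int_{A_\epsilon} Y_2\,dP + \epsilon P(A_\epsilon)$, thus $P(A_\epsilon) > 0$ is impossible. Event $\{Y_1 > Y_2\}$ is the countable union of the events $A_\epsilon$ (with $\epsilon$ rational); thus it has probability 0 and $Y_1 \le Y_2$ with probability one.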
The second part follows from the first by symmetry.
[¯]
Proof of Theorem 1.4.
(i) This is verified first for $Y = I_B$ (the indicator function of an event $B \in \mathcal{F}$). Let $Y_1 = E\{XY|\mathcal{F}\}$, $Y_2 = YE\{X|\mathcal{F}\}$. From the definition one can easily see that both $\int_A Y_1\,dP$ and $\int_A Y_2\,dP$ are equal to $\int_{A \cap B} X\,dP$. Therefore $Y_1 = Y_2$ by Lemma 1.4.
For the general case, approximate Y by simple random variables and use (vi).
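(ii) This follows from Lemma 1.4: random variables $Y_1 = E\{X|\mathcal{F}\}$, $Y_2 = E\{X|\mathcal{G}\}$ are $\mathcal{G}$-measurable and for $A$ in $\mathcal{G}$ both $\int_A Y_1\,dP$ and $\int_A Y_2\,dP$ are equal to $\int_A X\,dP$.
(iii) Let $Y_1 = E\{X|\mathcal{N} \vee \mathcal{F}\}$, $Y_2 = E\{X|\mathcal{F}\}$. We check first that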
|
(iv) Here we need the first part of Lemma 1.4. We also need to know that each convex function $g(x)$ can be written as the supremum of a family of affine functions $f_{a,b}(x) = ax+b$. Let $Y_1 = E\{g(X)|\mathcal{F}\}$, $Y_2 = f_{a,b}(E\{X|\mathcal{F}\})$, $A \in \mathcal{F}$. By (vi) we have
|
(v), (vi), (vii) These proofs are left as exercises.
[¯]
Theorem 1.4 gives geometric interpretation of the conditional expectation $E\{\cdot|\mathcal{F}\}$ as the projection of the Banach space $L_p(\Omega, \mathcal{M}, P)$ onto its closed subspace $L_p(\Omega, \mathcal{F}, P)$, consisting of all $p$-integrable $\mathcal{F}$-measurable random variables, $p \ge 1$. This projection is ``self adjoint'' in the sense that the adjoint operator is given by the same ``conditional expectation'' formula, although the adjoint operator acts on $L_q$ rather than on $L_p$; for square integrable functions $E\{\cdot|\mathcal{F}\}$ is just the orthogonal projection onto $L_2(\Omega, \mathcal{F}, P)$. Monograph [] considers conditional expectation from this angle.
We will use the following (weak) version of the martingale convergence theorem.
Theorem 6 Suppose $\mathcal{F}_n$ is a decreasing family of $\sigma$-fields, ie. $\mathcal{F}_{n+1} \subset \mathcal{F}_n$ for all $n \ge 1$. If $X$ is integrable, then $E\{X|\mathcal{F}_n\} \to E\{X|\mathcal{F}\}$ in $L_1$-norm, where $\mathcal{F}$ is the intersection of all $\mathcal{F}_n$.
Proof. Suppose first that $X$ is square integrable. Subtracting $m = EX$ if necessary, we can reduce the convergence question to the centered case $EX = 0$. Denote $X_n = E\{X|\mathcal{F}_n\}$. Since $\mathcal{F}_{n+1} \subset \mathcal{F}_n$, by Jensen's inequality $EX_n^2 \ge 0$ is a decreasing non-negative sequence. In particular, $EX_n^2$ converges.
Let $m < n$ be fixed. Then $E(X_n-X_m)^2 = EX_n^2 + EX_m^2 - 2EX_nX_m$. Since $\mathcal{F}_n \subset \mathcal{F}_m$, by Theorem 1.4 we have
|
|
If $X$ is not square integrable, then for every $\epsilon > 0$ there is a square integrable $Y$ such that $E|X-Y| < \epsilon$. By Jensen's inequality $E\{X|\mathcal{F}_n\}$ and $E\{Y|\mathcal{F}_n\}$ differ by at most $\epsilon$ in $L_1$-norm; this holds uniformly in $n$. Since by the first part of the proof $E\{Y|\mathcal{F}_n\}$ is convergent, it satisfies the Cauchy condition in $L_2$ and hence in $L_1$. Therefore for each $\epsilon > 0$ we can find $N$ such that for all $n, m > N$ we have $E\{|E\{X|\mathcal{F}_n\} - E\{X|\mathcal{F}_m\}|\} < 3\epsilon$. This shows that $E\{X|\mathcal{F}_n\}$ satisfies the Cauchy condition and hence converges in $L_1$.
The fact that the limit is $X_\infty = E\{X|\mathcal{F}\}$ can be seen as follows. Clearly $X_\infty$ is $\mathcal{F}_n$-measurable for all $n$, ie. it is $\mathcal{F}$-measurable. For $A \in \mathcal{F}$ (hence also in $\mathcal{F}_n$), we have $EXI_A = EX_nI_A$. Since
$|EX_nI_A - EX_\infty I_A| \le E|X_n - X_\infty|I_A \le E|X_n - X_\infty| \to 0$,
therefore $EX_nI_A \to EX_\infty I_A$.
This shows that $EXI_A = EX_\infty I_A$ and by definition,
$X_\infty = E\{X|\mathcal{F}\}$.
[¯]
| (15) |
|
Example 1
The following gives an example of
characteristic function that has finite support.
Let $\phi(t) = 1-|t|$ for $|t| < 1$ and $0$ otherwise. Then
$f(x) = \frac{1}{2\pi}\int_{-1}^{1} e^{-itx}(1-|t|)\,dt = \frac{1}{\pi}\int_0^1 (1-t)\cos tx\,dt = \frac{1}{\pi}\,\frac{1-\cos x}{x^2}$.
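The closed form in Example 1 can be compared with a direct numerical evaluation of the inversion integral. The sketch below uses a plain midpoint rule; the step count and the test points are arbitrary choices made here.

```python
import math

def phi(t):
    """Triangular characteristic function from Example 1."""
    return max(0.0, 1.0 - abs(t))

def density_by_inversion(x, steps=20000):
    """Midpoint-rule approximation of (1/(2 pi)) * integral_{-1}^{1} exp(-itx) phi(t) dt."""
    h = 2.0 / steps
    total = 0.0
    for k in range(steps):
        t = -1.0 + (k + 0.5) * h
        total += math.cos(t * x) * phi(t) * h     # the imaginary part cancels by symmetry
    return total / (2.0 * math.pi)

def density_closed_form(x):
    """(1 - cos x) / (pi x^2), valid for x != 0."""
    return (1.0 - math.cos(x)) / (math.pi * x * x)

for x in (0.5, 1.0, 3.0):
    print(x, density_by_inversion(x), density_closed_form(x))
```

An even number of steps places the kink of $\phi$ at $t = 0$ on a cell boundary, which keeps the midpoint rule second-order accurate.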
The following properties of characteristic functions are proved in any standard probability course, see eg. [,].
Theorem 7
(i) The distribution of $X$ is determined uniquely by its characteristic
function $\phi(t)$.
(ii) If $E|X|^r < \infty$ for some $r = 0, 1, \dots$, then $\phi(t)$
is $r$-times differentiable, the derivative is uniformly continuous
and for $0 \le k \le r$
$EX^k = (-i)^k \frac{d^k}{dt^k}\phi(t)\Big|_{t=0}$.
(iii) If $\phi(t)$ is $2r$-times differentiable for some natural $r$,
then $EX^{2r} < \infty$.
(iv) If $X, Y$ are independent random variables, then
$\phi_{X+Y}(t) = \phi_X(t)\phi_Y(t)$ for all $t \in \mathbb{R}$.
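The moment formula $EX^k = (-i)^k \phi^{(k)}(0)$ can be illustrated numerically with central differences. The sketch below assumes the characteristic function $\exp(itm - \sigma^2t^2/2)$ with the illustrative values $m = 1.5$, $\sigma = 2$; the step size $h$ is also an arbitrary choice.

```python
import cmath

def phi(t, m=1.5, sigma=2.0):
    """Characteristic function exp(itm - sigma^2 t^2 / 2); m and sigma are illustrative values."""
    return cmath.exp(1j * t * m - 0.5 * sigma ** 2 * t ** 2)

h = 1e-4
first = (phi(h) - phi(-h)) / (2 * h)                  # central difference for phi'(0)
second = (phi(h) - 2 * phi(0) + phi(-h)) / h ** 2     # central difference for phi''(0)

mean = ((-1j) * first).real                  # EX   = (-i)   phi'(0)
second_moment = ((-1j) ** 2 * second).real   # EX^2 = (-i)^2 phi''(0)
variance = second_moment - mean ** 2
print(mean, variance)
```

Up to discretization error the output recovers the mean $m$ and the variance $\sigma^2$.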
For a $d$-dimensional random variable $\mathbf{X} = (X_1, \dots, X_d)$ the characteristic function $\phi_{\mathbf{X}}: \mathbb{R}^d \to \mathbb{C}$ is defined by $\phi_{\mathbf{X}}(\mathbf{t}) = E\exp(i\,\mathbf{t}\cdot\mathbf{X})$, where the dot denotes the dot (scalar) product, ie. $\mathbf{x}\cdot\mathbf{y} = \sum x_ky_k$. For a pair of real valued random variables $X, Y$, we also write $\phi(t, s) = \phi_{(X,Y)}((t, s))$ and we call $\phi(t, s)$ the joint characteristic function of $X$ and $Y$.
The following is the multi-dimensional version of Theorem 1.5.
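Theorem 8
(i) The distribution of $\mathbf{X}$ is determined uniquely by its
characteristic function $\phi(\mathbf{t})$.
(ii) If $E\|\mathbf{X}\|^r < \infty$, then $\phi(\mathbf{t})$ is $r$-times
differentiable and for $0 \le k \le r$
$EX_{j_1}\cdots X_{j_k} = (-i)^k \frac{\partial^k}{\partial t_{j_1}\cdots\partial t_{j_k}}\phi(\mathbf{t})\Big|_{\mathbf{t}=0}$.
(iii) If $\mathbf{X}, \mathbf{Y}$ are independent $\mathbb{R}^d$-valued random variables,
then
$\phi_{\mathbf{X}+\mathbf{Y}}(\mathbf{t}) = \phi_{\mathbf{X}}(\mathbf{t})\phi_{\mathbf{Y}}(\mathbf{t})$.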
The next result seems to be less known although it is both easy to prove and to apply. We shall use it on several occasions in Chapter . The converse is also true if we assume that the integer parameter r in the proof below is even or that joint characteristic function f(t, s) is differentiable; to prove the converse, one can follow the usual proof of the inversion formula for characteristic functions, see, eg. []. Kagan, Linnik & Rao [] state explicitly several most frequently used variants of ().
Theorem 9
Suppose real valued random variables
$X, Y$ have the joint characteristic function $\phi(t, s)$.
Assume that $E|X|^m < \infty$ for some $m \in \mathbb{N}$. Let
$g(y)$ be such that
$E\{X^m|Y\} = g(Y)$.
Then
$(-i)^m \frac{\partial^m}{\partial t^m}\phi(t, s)\Big|_{t=0} = Eg(Y)\exp(isY)$. (16)
If in addition $g(y) = \sum_k c_ky^k$ is a polynomial, then
$(-i)^m \frac{\partial^m}{\partial t^m}\phi(t, s)\Big|_{t=0} = \sum_k (-i)^k c_k \frac{d^k}{ds^k}\phi(0, s)$. (17)
Proof. Since by assumption $E|X|^m < \infty$, the joint characteristic function $\phi(t, s) = E\exp(itX+isY)$ can be differentiated $m$ times with respect to $t$ and
|
In order to prove (17), we need to show first that $E|Y|^r < \infty$, where $r$ is the degree of the polynomial $g(y)$. By Jensen's inequality $E|g(Y)| \le E|X|^m < \infty$, and since $|g(y)/y^r| \to \mathrm{const} \ne 0$ as $|y| \to \infty$, therefore there is $C > 0$ such that $|y|^r \le C|g(y)|$ for all $y$. Hence $E|Y|^r < \infty$ follows.
Formula (17) is
now a simple consequence of (16); indeed, for $0 \le k \le r$
we have $EY^k\exp(isY) = (-i)^k \frac{d^k}{ds^k}\phi(0, s)$; this formula is obtained by
differentiating $k$-times $E\exp(isY)$ under the integral sign.
[¯]
Definition 2 A random variable X (also: a vector valued random variable X) is symmetric if X and -X have the same distribution.
Symmetrization techniques deal with comparison of properties of an arbitrary variable $X$ with some symmetric variable $X_{sym}$. Symmetric variables are usually easier to deal with, and proofs of many theorems (not only characterization theorems, see eg. []) become simpler when reduced to the symmetric case.
There are two natural ways to obtain a symmetric random variable $X_{sym}$ from an arbitrary random variable $X$. The first one is to multiply $X$ by an independent random sign $\pm 1$; in terms of the characteristic functions this amounts to replacing the characteristic function $\phi$ of $X$ by its symmetrization $\frac{1}{2}(\phi(t) + \phi(-t))$. This approach has the advantage that if $X$ is symmetric, then its symmetrization $X_{sym}$ has the same distribution as $X$. Integrability properties are also easy to compare, because $|X| = |X_{sym}|$.
The other symmetrization, which has perhaps less obvious properties but is frequently found more useful, is defined as follows. Let $X'$ be an independent copy of $X$. The symmetrization $\tilde{X}$ of $X$ is defined by $\tilde{X} = X - X'$. In terms of the characteristic functions this corresponds to replacing the characteristic function $\phi(t)$ of $X$ by the characteristic function $|\phi(t)|^2$. This procedure is easily seen to change the distribution of $X$, except when $X = 0$.
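Both symmetrizations can be checked against their characteristic functions on a small discrete law. The sketch below is illustrative only; the three-point law and the helper `char_fn` are assumptions made here.

```python
import cmath
from fractions import Fraction
from itertools import product

law = {0: Fraction(1, 2), 1: Fraction(1, 3), 3: Fraction(1, 6)}   # hypothetical law of X

def char_fn(dist, t):
    """E exp(itX) for a finite discrete law given as {value: probability}."""
    return sum(float(p) * cmath.exp(1j * t * x) for x, p in dist.items())

# Law of X multiplied by an independent random sign.
sign_sym = {}
for x, p in law.items():
    for s in (-1, 1):
        sign_sym[s * x] = sign_sym.get(s * x, 0) + p / 2

# Law of X - X', where X' is an independent copy of X.
diff_sym = {}
for (x, p), (y, q) in product(law.items(), repeat=2):
    diff_sym[x - y] = diff_sym.get(x - y, 0) + p * q

for t in (0.3, 1.0, 2.5):
    phi = char_fn(law, t)
    # random-sign symmetrization: (phi(t) + phi(-t)) / 2
    assert abs(char_fn(sign_sym, t) - (phi + char_fn(law, -t)) / 2) < 1e-12
    # difference symmetrization: |phi(t)|^2
    assert abs(char_fn(diff_sym, t) - abs(phi) ** 2) < 1e-12
print(sorted(diff_sym.items()))
```

The printed law of $X - X'$ is visibly symmetric about zero, even though the law of $X$ is not.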
Theorem 10
(i) If the symmetrization $\tilde{X}$ of a random variable $X$
has a finite moment of order $p \ge 1$, then $E|X|^p < \infty$.
(ii) If the symmetrization $\tilde{X}$ of a random variable
$X$ has finite exponential moment $E\exp(\lambda|\tilde{X}|)$,
then $E\exp\lambda|X| < \infty$, $\lambda > 0$.
(iii) If the symmetrization $\tilde{X}$ of a random variable $X$
satisfies $E\exp\lambda|\tilde{X}|^2 < \infty$, then
$E\exp\lambda|X|^2 < \infty$, $\lambda > 0$.
The usual approach to Theorem 1.6 uses the symmetrization inequality, which is of independent interest (see Problem ) and formula (2). Our proof requires extra assumptions, but instead is short, does not require $X$ and $X'$ to have the same distribution, and it also gives a more accurate bound (within its domain of applicability).
Proof in the case, when $E|X| < \infty$ and $EX = 0$:
Let $g(x) \ge 0$ be a convex function, such that $Eg(\tilde{X}) < \infty$
and let $X, X'$ be the independent copies of $X$, so that conditional
expectation $E^X X' = EX = 0$. Then $Eg(X) = Eg(X - E^X X') = Eg(E^X\{X-X'\})$.
Since by Jensen's inequality, see Theorem 1.4 (iv) we have
$Eg(E^X\{X-X'\}) \le Eg(X-X')$, therefore
$Eg(X) \le Eg(X-X') = Eg(\tilde{X}) < \infty$. To end the proof,
consider three convex functions $g(x) = |x|^p$, $g(x) = \exp(\lambda x)$ and
$g(x) = \exp(\lambda x^2)$.
|
Uniform integrability is often used in conjunction with weak convergence to verify the convergence of moments. Namely, if Xn is uniformly integrable and converges in distribution to Y, then Y is integrable and
| (18) |
[ 3 If $X_1, X_2, \dots$ are centered i. i. d. random variables with finite second moments and $S_n = \sum_{j=1}^n X_j$ then $\{\frac{1}{n}S_n^2\}_{n \ge 1}$ is uniformly integrable.
The following lemma is a special case of the celebrated Khinchin inequality.
[ 3
If $\epsilon_j$ are $\pm 1$ valued symmetric independent r. v., then for all real
numbers $a_j$
$E\left(\sum_{j=1}^n a_j\epsilon_j\right)^4 \le 3\left(\sum_{j=1}^n a_j^2\right)^2$ (19)
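For small $n$, inequality (19) can be verified exactly by averaging over all $2^n$ sign patterns. The sketch below does this by brute force; the coefficient vectors are arbitrary examples chosen here.

```python
from fractions import Fraction
from itertools import product

def fourth_moment_of_sign_sum(a):
    """E (sum_j a_j eps_j)^4, averaging over all 2^n equally likely sign patterns."""
    n = len(a)
    total = sum(sum(aj * ej for aj, ej in zip(a, signs)) ** 4
                for signs in product((-1, 1), repeat=n))
    return Fraction(total, 2 ** n)

def khinchin_bound(a):
    """Right hand side of (19): 3 (sum_j a_j^2)^2."""
    return 3 * sum(aj * aj for aj in a) ** 2

for a in ([1], [1, 1], [1, -2, 3], [2, 2, 2, 2, 2]):
    assert fourth_moment_of_sign_sum(a) <= khinchin_bound(a)
    print(a, fourth_moment_of_sign_sum(a), khinchin_bound(a))
```

The enumeration is exponential in $n$, so it is meant only as a sanity check for short coefficient vectors.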
Proof. By independence and symmetry we have
|
[ 4
If $X_k$ are i. i. d. centered
with fourth moments, then there is a constant $C < \infty$ such that
$ES_n^4 \le C n^2 EX_1^4$ (20)
Proof. As in the proof of Theorem 1.6 we can estimate the fourth moments of a centered r. v. by the fourth moment of its symmetrization, $ES_n^4 \le E\tilde{S}_n^4$.
Let $\epsilon_j$ be independent of $\tilde{X}_k$'s as in Lemma 1.7. Then in distribution $\tilde{S}_n \cong \sum_{j=1}^n \epsilon_j\tilde{X}_j$. Therefore, integrating with respect to the distribution of $\epsilon_j$ first, from (19) we get
|
[ 5
If $U, V \ge 0$ then
$\int_{U+V > 2t}(U+V)^2\,dP \le 4\left(\int_{U > t}U^2\,dP + \int_{V > t}V^2\,dP\right)$.
Proof. By (2) applied to $f(x) = x^2 I_{x > 2t}$ we have
|
|
Let $\epsilon > 0$ and choose $M > 0$ such that $\int_{\{|X| > M\}}|X|\,dP < \epsilon$. Split $X_k = X_k' + X_k''$, where $X_k' = X_kI_{\{|X_k| \le M\}} - E\{X_kI_{\{|X_k| \le M\}}\}$ and let $S', S''$ denote the corresponding sums.
Notice that for any $U \ge 0$ we have $UI_{\{|U| > m\}} \le U^2/m$. Therefore $\frac{1}{n}\int_{|S_n'| > t\sqrt{n}}(S_n')^2\,dP \le t^{-2}n^{-2}E(S_n')^4$, which by Lemma 1.7 gives
| (21) |
Now we use orthogonality to estimate the second term:
| (22) |
|
$\limsup_{t\to\infty}\sup_n \frac{1}{n}\int_{\{|S_n| > 2t\sqrt{n}\}}S_n^2\,dP \le \epsilon$.
Since e > 0 is arbitrary, this ends the proof.
[¯]
Definition 3 The Mellin transform of a random variable $X \ge 0$ is defined for all complex $s$ such that $EX^{\Re s - 1} < \infty$ by the formula $M(s) = EX^{s-1}$.
The definition is consistent with the usual definition of the Mellin transform of an integrable function: if $X$ has a probability density function $f(x)$, then the Mellin transform of $X$ is given by $M(s) = \int_0^\infty x^{s-1}f(x)\,dx$.
Theorem 11 If $X \ge 0$ is a random variable such that $EX^{a-1} < \infty$ for some $a \ge 1$, then the Mellin transform $M(s) = EX^{s-1}$, considered for $s \in \mathbb{C}$ such that $\Re s = a$, determines the distribution of $X$ uniquely.
Proof. The easiest case is when $a = 1$ and $X > 0$. Then $M(s)$ is just the characteristic function of $\log(X)$; thus the distribution of $\log(X)$, and hence the distribution of $X$, is determined uniquely.
In general consider finite non-negative measure m defined on (IR+, B) by
|
Theorem 12 If $X \ge 0$ and $EX^a < \infty$ for some $a > 0$, then the Mellin transform of $X$ is analytic in the strip $1 < \Re s < 1+a$.
Proof. Since for every $s$ with $0 < \Re s < a$ the modulus
of the function $\omega \mapsto X^s\log(X)$ is bounded by an integrable function
$C_1 + C_2|X|^a$, therefore $EX^s$ can be differentiated with
respect
to $s$ under the expectation sign at each point $s$, $0 < \Re s < a$.
[¯]
Problem 1 [[]]
Use Fubini's theorem to show that if $XY, X, Y$ are
integrable, then
$EXY - EXEY = \int_{-\infty}^\infty\int_{-\infty}^\infty \left(P(X \ge t, Y \ge s) - P(X \ge t)P(Y \ge s)\right)dt\,ds$.
Problem 2
Let $X \ge 0$ be a random variable and suppose that
for every $0 < q < 1$ there is $T = T(q)$ such that
$P(X > 2t) \le q\,P(X > t)$ for all $t > T$.
Problem 3
Show that if $X \ge 0$ is a random variable such that
$P(X > 2t) \le (P(X > t))^2$ for all $t > 0$,
Problem 4
Show that if $E\exp(\lambda X^2) = C < \infty$ for some
$\lambda > 0$, then
$E\exp(tX) \le C\exp\left(\frac{t^2}{2\lambda}\right)$
Problem 5 Show that (11) implies $E\{|X|^{|X|}\} < \infty$.
Problem 6 Prove part (v) of Theorem 1.4.
Problem 7 Prove part (vi) of Theorem 1.4.
Problem 8 Prove part (vii) of Theorem 1.4.
Problem 9
Prove the following conditional version of Chebyshev's
inequality: if $\mathcal{F}$ is a $\sigma$-field, and $E|X| < \infty$,
then
$P(|X| > t \mid \mathcal{F}) \le E\{|X| \mid \mathcal{F}\}/t$
Problem 10 Show that if (X, Y) is uniformly distributed on a circle centered at (0, 0), then for every a, b there is a non-random constant C = C(a, b) such that E{X|aX+bY} = C(a,b)(aX+bY).
Problem 11 Show that if $(U,V,X)$ are such that in distribution $(U,X) \cong (V,X)$, then $E\{U|X\} = E\{V|X\}$ almost surely.
Problem 12
Show that if X, Y are integrable non-degenerate random variables,
such that
E{X|Y} = aY, E{Y|X} = bX,
Problem 13
Suppose that X, Y are square-integrable random variables such
that
E{X|Y} = Y, E{Y|X} = 0.
Problem 14 Show that if X, Y are integrable such that E{X|Y} = Y and E{Y|X} = X, then X = Y a. s.
Problem 15 Prove that if $X \ge 0$, then the function $f(t) := EX^{it}$, where $t \in \mathbb{R}$, determines the distribution of $X$ uniquely.
Problem 16
Prove that the function $f(t) := E\max\{X, t\}$ determines uniquely the distribution of an integrable random variable $X$ in each of the following cases:
Problem 18 Let $p > 2$ be fixed. Show that $\exp(-|t|^p)$ is not a characteristic function.
Problem 19
Let $Q(t,s) = \log f(t,s)$, where $f(t,s)$ is the joint characteristic function of square-integrable random variables $X, Y$.
$$\frac{\partial}{\partial t}Q(t,s)\Big|_{t=0} = r\,\frac{d}{ds}Q(0,s).$$
$$\frac{\partial^2}{\partial t^2}Q(t,s)\Big|_{t=0} + \left(\frac{\partial}{\partial t}Q(t,s)\right)^2\Big|_{t=0} = -a + ib\,\frac{d}{ds}Q(0,s) + c\,\frac{d^2}{ds^2}Q(0,s) + c\left(\frac{d}{ds}Q(0,s)\right)^2.$$
$$P(|X| \ge t) \le 2P(|\tilde{X}| \ge t - |a|)$$
Problem 22 Prove (18).
In the multivariate case it is more convenient to use characteristic functions directly. Besides, characteristic functions are our main technical tool and it does not hurt to start using them as soon as possible. We shall therefore begin with the following definition.
Definition 4 A real-valued random variable $X$ has the normal $N(m, \sigma)$ distribution if its characteristic function has the form
$$f(t) = \exp\left(itm - \tfrac{1}{2}\sigma^2t^2\right).$$
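The formula in Definition 4 can be checked numerically. The sketch below assumes the familiar bell-shaped density of the $N(m, \sigma)$ distribution, which is not stated in this section, integrates $e^{itx}$ against it by the midpoint rule, and compares the result with $\exp(itm - \frac12\sigma^2t^2)$. The function names and the integration window are illustrative choices.

```python
import cmath
import math

def normal_density(x, m, sigma):
    """Usual density of N(m, sigma); assumed here, not derived in the text."""
    z = (x - m) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def char_function_numeric(t, m, sigma, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation of E exp(itX) over m +/- half_width * sigma."""
    left = m - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    total = 0j
    for k in range(steps):
        x = left + (k + 0.5) * h
        total += cmath.exp(1j * t * x) * normal_density(x, m, sigma)
    return total * h

def char_function_formula(t, m, sigma):
    """Closed form from Definition 4."""
    return cmath.exp(1j * t * m - 0.5 * sigma ** 2 * t ** 2)

for t in (0.0, 0.7, -1.3):
    print(t, char_function_numeric(t, 1.5, 2.0), char_function_formula(t, 1.5, 2.0))
```

At $t = 0$ both values equal 1 up to numerical error, reflecting the fact that every characteristic function takes the value 1 at the origin.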
From Theorem 1.5 it is easy to check by direct differentiation that $m = EX$ and $\sigma^2 = \mathrm{Var}(X)$. Using (15) it is easy to see that every univariate normal $X$ can be written as
| (23) |
The following properties of the standard normal distribution $N(0,1)$ are self-evident:
|
For future reference we state the following simple but useful observation. Computing $EX^k$ for $k = 0, 1, 2$ from Theorem 1.5 we immediately get the following.
[ 4 A characteristic function which can be expressed in the form $f(t) = \exp(at^2+bt+c)$ for some complex constants $a, b, c$ corresponds to a normal random variable, i.e. $a \in \mathbb{R}$ and $a < 0$, $b \in i\mathbb{R}$ is imaginary, and $c = 0$.
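We follow the usual linear algebra notation. Vectors are denoted by small bold letters $\mathbf{x}, \mathbf{v}, \mathbf{t}$, matrices by capital bold initial letters $\mathbf{A}, \mathbf{B}, \mathbf{C}$ and vector-valued random variables by capital boldface $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$; by the dot we denote the usual dot product in $\mathbb{R}^d$, i.e. $\mathbf{x}\cdot\mathbf{y} := \sum_{j=1}^d x_jy_j$; $\|\mathbf{x}\| = (\mathbf{x}\cdot\mathbf{x})^{1/2}$ denotes the usual Euclidean norm. For typographical convenience we sometimes write $(a_1,\dots,a_k)$ for the column vector $\begin{bmatrix} a_1 \\ \vdots \\ a_k \end{bmatrix}$.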
Below we shall also consider another scalar product $\langle\cdot,\cdot\rangle$ associated with the normal distribution; the corresponding semi-norm will be denoted by the triple bar $|||\cdot|||$.
Definition 5 An $\mathbb{R}^d$-valued random variable $\mathbf{Z}$ is multivariate normal, or Gaussian (we shall use both terms interchangeably; the second term will be preferred in abstract situations) if for every $\mathbf{t} \in \mathbb{R}^d$ the real-valued random variable $\mathbf{t}\cdot\mathbf{Z}$ is normal.
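Clearly the distribution of the univariate $\mathbf{t}\cdot\mathbf{Z}$ is determined uniquely by its mean $m = m_{\mathbf{t}}$ and its standard deviation $\sigma = \sigma_{\mathbf{t}}$. It is easy to see that $m_{\mathbf{t}} = \mathbf{t}\cdot\mathbf{m}$, where $\mathbf{m} = E\mathbf{Z}$. Indeed, by linearity of the expected value $m_{\mathbf{t}} = E\,\mathbf{t}\cdot\mathbf{Z} = \mathbf{t}\cdot E\mathbf{Z}$. Evaluating the characteristic function $f(s)$ of the real-valued random variable $\mathbf{t}\cdot\mathbf{Z}$ at $s = 1$ we see that the characteristic function of $\mathbf{Z}$ can be written as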
|
|
The following observations are easy to check.
We shall need the following well-known linear algebra fact (the proofs are explained below; an explicit reference is, e.g., []).
[ 6 Each bilinear form $B$ has the dot product representation
$$B(\mathbf{x}, \mathbf{y}) = \mathbf{C}\mathbf{x}\cdot\mathbf{y},$$
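Indeed, expand $\mathbf{x}$ and $\mathbf{y}$ with respect to the standard orthogonal basis $\mathbf{e}_1,\dots,\mathbf{e}_d$. By bilinearity we have $B(\mathbf{x}, \mathbf{y}) = \sum_{i,j}x_iy_j B(\mathbf{e}_i, \mathbf{e}_j)$, which gives the dot product representation with $c_{j,i} = B(\mathbf{e}_i, \mathbf{e}_j)$, since $\mathbf{C}\mathbf{x}\cdot\mathbf{y} = \sum_{i,j} c_{j,i}x_iy_j$. Clearly, for symmetric $B(\cdot, \cdot)$ we get $c_{i,j} = c_{j,i}$; hence $\mathbf{C}$ is symmetric.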
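The expansion argument can be traced on a small example. The sketch below builds the matrix of a bilinear form from its values on the standard basis and checks the dot product representation at one pair of vectors. The form `example_form` and all helper names are hypothetical. The value $B(\mathbf{e}_i, \mathbf{e}_j)$ is stored in row $j$, column $i$, so that the identity $B(\mathbf{x},\mathbf{y}) = \mathbf{C}\mathbf{x}\cdot\mathbf{y}$ also holds for a non-symmetric form; for symmetric forms the two index orders coincide.

```python
def dot(u, v):
    """Usual dot product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def mat_vec(C, x):
    """Matrix-vector product C x for a matrix stored as a list of rows."""
    return [dot(row, x) for row in C]

def matrix_of_form(B, d):
    """Return C with C[j][i] = B(e_i, e_j), so that B(x, y) = (C x) . y."""
    basis = [[1.0 if k == i else 0.0 for k in range(d)] for i in range(d)]
    return [[B(basis[i], basis[j]) for i in range(d)] for j in range(d)]

def example_form(x, y):
    """A hypothetical bilinear form on R^2, deliberately not symmetric."""
    return 2.0 * x[0] * y[0] + 3.0 * x[0] * y[1] - x[1] * y[0] + 5.0 * x[1] * y[1]

C = matrix_of_form(example_form, 2)
x, y = [1.0, -2.0], [0.5, 4.0]
# Both expressions evaluate the same bilinear form; here they equal -26.0.
print(example_form(x, y), dot(mat_vec(C, x), y))
```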
[ 7 If in addition $B(\cdot, \cdot)$ is positive definite then
$$\mathbf{C} = \mathbf{A}\mathbf{A}^T \qquad (24)$$
for a $d\times d$ matrix $\mathbf{A}$. Moreover, $\mathbf{A}$ can be chosen to be symmetric.
The easiest way to see the last fact is to diagonalize $\mathbf{C}$ (this is always possible, as $\mathbf{C}$ is symmetric). The eigenvalues of $\mathbf{C}$ are real and, since $B(\cdot, \cdot)$ is positive definite, they are non-negative. If $\mathbf{\Lambda}$ denotes the diagonal matrix of eigenvalues of $\mathbf{C}$ in the diagonal representation $\mathbf{C} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T$ and $\mathbf{D}$ is the diagonal matrix formed by the square roots of the eigenvalues, then $\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{U}^T$. Moreover, this construction gives symmetric $\mathbf{A} = \mathbf{A}^T$. In general, there is no unique choice of $\mathbf{A}$ and we shall sometimes find it more convenient to use non-symmetric $\mathbf{A}$, see Example below.
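To illustrate the remark that $\mathbf{A}$ is not unique, the sketch below computes a non-symmetric factor by the Cholesky algorithm. This is a different technique from the diagonalization described above, and it assumes that all eigenvalues of $\mathbf{C}$ are strictly positive. The matrix `C` and the helper names are illustrative.

```python
import math

def cholesky(C):
    """Lower-triangular A with A A^T = C.

    Assumes C is symmetric with strictly positive eigenvalues; otherwise the
    square root or the division below may fail.
    """
    d = len(C)
    A = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            partial = sum(A[i][k] * A[j][k] for k in range(j))
            if i == j:
                A[i][j] = math.sqrt(C[i][i] - partial)
            else:
                A[i][j] = (C[i][j] - partial) / A[j][j]
    return A

def times_transpose(A):
    """Return the product A A^T."""
    d = len(A)
    return [[sum(A[i][k] * A[j][k] for k in range(d)) for j in range(d)]
            for i in range(d)]

C = [[4.0, 2.0], [2.0, 5.0]]
A = cholesky(C)
print(A)                   # [[2.0, 0.0], [1.0, 2.0]], which is not symmetric
print(times_transpose(A))  # [[4.0, 2.0], [2.0, 5.0]], which recovers C
```

A triangular factor is often convenient in computations because each coordinate of $\mathbf{A}\mathbf{x}$ then depends only on the leading coordinates of $\mathbf{x}$.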
The linear algebra results imply that the characteristic function corresponding to a normal distribution on $\mathbb{R}^d$ can be written in the form
| (25) |
Theorem 13 The characteristic function corresponding to a normal random variable $\mathbf{Z} = (Z_1, \dots, Z_d)$ is given by (25), where $\mathbf{m} = E\mathbf{Z}$ and $\mathbf{C} = [c_{i,j}]$, $c_{i,j} = \mathrm{Cov}(Z_i, Z_j)$, is the covariance matrix.
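The following sketch indicates how Definitions 4 and 5 lead to a formula of the type described in the theorem. It assumes that (25) is the customary expression $\exp(i\,\mathbf{t}\cdot\mathbf{m} - \frac12\mathbf{C}\mathbf{t}\cdot\mathbf{t})$ and that $\mathbf{Z}$ is square-integrable; it is meant as an illustration under those assumptions rather than as the proof.

```latex
% For fixed t in R^d, the variable t.Z is univariate normal (Definition 5)
% with mean t.m and variance
\[
  \operatorname{Var}(\mathbf{t}\cdot\mathbf{Z})
    = \sum_{i,j=1}^{d} t_i t_j \operatorname{Cov}(Z_i, Z_j)
    = \mathbf{C}\mathbf{t}\cdot\mathbf{t}.
\]
% Definition 4, evaluated at argument 1, then gives
\[
  E\exp(i\,\mathbf{t}\cdot\mathbf{Z})
    = \exp\!\left(i\,\mathbf{t}\cdot\mathbf{m}
        - \tfrac{1}{2}\,\mathbf{C}\mathbf{t}\cdot\mathbf{t}\right).
\]
% If C = A A^T as in (24), the quadratic form is a squared Euclidean norm:
\[
  \mathbf{C}\mathbf{t}\cdot\mathbf{t}
    = \mathbf{A}^{T}\mathbf{t}\cdot\mathbf{A}^{T}\mathbf{t}
    = \|\mathbf{A}^{T}\mathbf{t}\|^{2}.
\]
```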
From (24) and (25) we get also
|