
Normal Distribution
characterizations with applications

Włodzimierz Bryc
Department of Mathematics
University of Cincinnati
P O Box 210025
Cincinnati, OH 45221-0025
bryc@ucbeh.san.uc.edu

December 22, 1994

Preface

This book is a concise presentation of the normal distribution on the real line and its counterparts on more abstract spaces, which we shall call the Gaussian distributions. The material is selected towards presenting characteristic properties, or characterizations, of the normal distribution. There are many such properties and there are numerous relevant works in the literature. In this book special attention is given to characterizations generated by the so-called Maxwell's Theorem of statistical mechanics, which is stated in the Introduction as Theorem 1. These characterizations are of interest both intrinsically, and as techniques that are worth being aware of. The book may also serve as a good introduction to diverse analytic methods of probability theory. We use characteristic functions, tail estimates, and occasionally dive into complex analysis.

In the book we also show how the characteristic properties can be used to prove important results about the Gaussian processes and the abstract Gaussian vectors. For instance, in Section we present Fernique's beautiful proofs of the zero-one law and of the integrability of abstract Gaussian vectors. The central limit theorem is obtained via characterizations in Section .

The excellent book by Kagan, Linnik & Rao [] overlaps with ours in the coverage of the classical characterization results. Our presentation of these is sometimes less general, but in return we often give simpler proofs. On the other hand, we are more selective in the choice of characterizations we want to present, and we also point out some applications. Characterization results that are not included in [] can be found in numerous places of the book, see Section , Chapter and Chapter .

We have tried to make this book accessible to readers with various backgrounds. If possible, we give elementary proofs of important theorems, even if they are special cases of more advanced results. Proofs of several difficult classic results have been simplified. We have managed to avoid functional equations for non-differentiable functions; in many proofs in the literature lack of differentiability is a major technical difficulty.

The book is primarily aimed at graduate students in mathematical statistics and probability theory who would like to expand their bag of tools, to understand the inner workings of the normal distribution, and to explore the connections with other fields. Characterization aspects sometimes show up in unexpected places, cf. Diaconis & Ylvisaker []. More generally, when fitting any statistical model to the data, it is inevitable to refer to relevant properties of the population in question; otherwise several different models may fit the same set of empirical data, cf. W. Feller []. Monograph [] by Prakasa Rao is written from such perspective and for a statistician our book may only serve as a complementary source. On the other hand results presented in Sections and are quite recent and virtually unknown among statisticians. Their modeling aspects remain to be explored, see Section . We hope that this book will popularize the interesting and difficult area of conditional moment descriptions of random fields. Of course it is possible that such characterizations will finally end up far from real life like many other branches of applied mathematics. It is up to the readers of this book to see if the following sentence applies to characterizations as well as to trigonometric series.

``Thinking of the extent and refinement reached by the theory of trigonometric series in its long development one sometimes wonders why only relatively few of these advanced achievements find an application.''
(A. Zygmund, Trigonometric Series, Vol. 1, Cambridge Univ. Press, Second Edition, 1959, page xii)

There is more than one way to use this book. Parts of it have been used in a graduate one-quarter course Topics in statistics. The reader may also skim through it to find results that he needs; or look up the techniques that might be useful in his own research. The author of this book would be most happy if the reader treats this book as an adventure into the unknown - picks a piece of his liking and follows through and beyond the references. With this in mind, the book has a number of references and digressions. We have tried to point out the historical perspective, but also to get close to current research.

An appropriate background for reading the book is a one year course in real analysis including measure theory and abstract normed spaces, and a one-year course in complex analysis. Familiarity with conditional expectations would also help. Topics from probability theory are reviewed in Chapter , frequently with proofs and exercises. Exercise problems are at the end of the chapters; solutions or hints are in Appendix .

The book benefited from the comments of Chris Burdzy, Abram Kagan, Samuel Kotz, Wodek Smoleński, Paweł Szabłowski, and Jacek Wesołowski. They read portions of the first draft, generously shared their criticism, and pointed out relevant references and errors. My colleagues at the University of Cincinnati also provided comments, criticism and encouragement. The final version of the book was prepared at the Institute for Applied Mathematics of the University of Minnesota in fall quarter of 1993 and at the Center for Stochastic Processes in Chapel Hill in Spring 1994. Support by C. P. Taft Memorial Fund in the summer of 1987 and in the spring of 1994 helped to begin and to conclude this endeavor.

Introduction

The following narrative comes from J. F. W. Herschel [].

``Suppose a ball is dropped from a given height, with the intention that it shall fall on a given mark. Fall as it may, its deviation from the mark is error, and the probability of that error is the unknown function of its square, ie. of the sum of the squares of its deviations in any two rectangular directions. Now, the probability of any deviation depending solely on its magnitude, and not on its direction, it follows that the probability of each of these rectangular deviations must be the same function of its square. And since the observed oblique deviation is equivalent to the two rectangular ones, supposed concurrent, and which are essentially independent of one another, and is, therefore, a compound event of which they are the simple independent constituents, therefore its probability will be the product of their separate probabilities. Thus the form of our unknown function comes to be determined from this condition...''

Ten years after Herschel, the reasoning was repeated by J. C. Maxwell []. In his theory of gases he assumed that gas consists of small elastic spheres bumping each other; this led to intricate mechanical considerations to analyze the velocities before and after the encounters. However, Maxwell answered the question of his Proposition IV: What is the distribution of velocities of the gas particles? without using the details of the interaction between the particles; it led to the emergence of the trivariate normal distribution. The result that velocities are normally distributed is sometimes called Maxwell's theorem. At the time of discovery, probability theory was in its beginnings and the proof was considered ``controversial'' by leading mathematicians.

The beauty of the reasoning lies in the fact that the interplay of two very natural assumptions: of independence and of rotation invariance, gives rise to the normal law of errors - the most important distribution in statistics. This interplay of independence and invariance shows up in many of the theorems presented below.

Here we state the Herschel-Maxwell theorem in modern notation but without proof; for one of the early proofs, see []. The reader will see several proofs that use various, usually weaker, assumptions in Theorems , , , , and .

Theorem 1 Suppose random variables $X, Y$ have joint probability distribution $\mu(dx, dy)$ such that

(i) $\mu(\cdot)$ is invariant under the rotations of $\mathbb{R}^2$;

(ii) X, Y are independent.

Then X, Y are normally distributed.

This theorem has generated a vast literature. Here is a quick preview of pertinent results in this book.

Pólya's theorem [] presented in Section says that if just two rotations, by angles $\pi/2$ and $\pi/4$, preserve the distribution of $X$, then the distribution is normal. Generalizations to characterizations by the equality of distributions of more general linear forms are given in Chapter . One of the most interesting results here is Marcinkiewicz's theorem [], see Theorem .

An interesting modification of Theorem *, discovered by M. Sh. Braverman [] and presented in Section below, considers three i. i. d. random variables X, Y, Z with the rotation-invariance assumption (i) replaced by the requirement that only some absolute moments are rotation invariant.

Another insight is obtained if one notices that assumption (i) of Maxwell's theorem implies that rotations preserve the independence of the original random variables $X, Y$. In this approach we consider a pair $X, Y$ of independent random variables such that the rotation by an angle $\alpha$ produces two independent random variables $X\cos\alpha + Y\sin\alpha$ and $X\sin\alpha - Y\cos\alpha$. Assuming this for all angles $\alpha$, M. Kac [] showed that the distribution in question has to be normal. Moreover, careful inspection of Kac's proof reveals that the only essential property he had used was that $X, Y$ are independent and that just one $\pi/4$-rotation: $(X+Y)/\sqrt{2}$, $(X-Y)/\sqrt{2}$ produces the independent pair. The result explicitly assuming the latter was found independently by Bernstein []. Bernstein's theorem and its extensions are considered in Chapter ; Bernstein's theorem also motivates the assumptions in Chapter .


The following is a more technical description of the contents of the book. Chapter collects probabilistic prerequisites. The emphasis is on analytic aspects; in particular elementary but useful tail estimates are collected in Section . In Chapter we approach multivariate normal distributions through characteristic functions. This is a less intuitive but powerful method. It leads rapidly to several fundamental facts, and to associated Reproducing Kernel Hilbert Spaces (RKHS). As an illustration, we prove the large deviation estimates on $\mathbb{R}^d$ which use the conjugate RKHS norm. In Chapter the reader is introduced to stability and equidistribution of linear forms in independent random variables. Stability is directly related to the CLT. We show that in the abstract setup stability is also responsible for the zero-one law. Chapter presents the analysis of rotation invariant distributions on $\mathbb{R}^d$ and on $\mathbb{R}^\infty$. We study when a rotation invariant distribution has to be normal. In the process we analyze structural properties of rotation invariant laws and introduce the relevant techniques. In this chapter we also present surprising results on rotation invariance of the absolute moments. We conclude with a short proof of de Finetti's theorem and point out its implications for infinite spherically symmetric sequences. Chapter parallels Chapter in analyzing the role of independence of linear forms. We show that independence of certain linear forms, a characteristic property of the normal distribution, leads to the zero-one law, and it is also responsible for exponential moments. Chapter is a short introduction to measures of dependence and stability issues. Theorem establishes integrability under conditions of interest, eg. in polynomial biorthogonality as studied by Lancaster []. In Chapter we extend results in Chapter to conditional moments. Three interesting aspects emerge here.
First, normality can frequently be recognized from the conditional moments of linear combinations of independent random variables; we illustrate this by a simple proof of the well known fact that the independence of the sample mean and the sample variance characterizes normal populations, and by the proof of the central limit theorem. Secondly, we show that for infinite sequences, conditional moments determine normality without any reference to independence. This part has its natural continuation in Chapter . Thirdly, in the exercises we point out the versatility of conditional moments in handling other infinitely divisible distributions. Chapter is a short introduction to continuous parameter random fields, analyzed through their conditional moments. We also present a self-contained analytic construction of the Wiener process.

Chapter 1
Probability tools

Most of the content of this chapter is fairly standard probability theory. The reader should not be under the impression that this chapter is a substitute for a systematic course in probability theory. We will skip important topics such as limit theorems. The emphasis here is on analytic methods; in particular characteristic functions will be extensively used throughout.

Let $(\Omega, \mathcal{M}, P)$ be the probability space, ie. $\Omega$ is a set, $\mathcal{M}$ is a $\sigma$-field of its subsets and $P$ is the probability measure on $(\Omega, \mathcal{M})$. We follow the usual conventions: $X, Y, Z$ stand for real random variables; boldface $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$ denote vector-valued random variables. Throughout the book $EX = \int_\Omega X(\omega)\,dP$ (Lebesgue integral) denotes the expected value of a random variable $X$. We write $X \cong Y$ to denote the equality of distributions, ie. $P(X \in A) = P(Y \in A)$ for all measurable sets $A$. Equalities and inequalities between random variables are to be interpreted almost surely (a. s.). For instance $X \le Y+1$ means $P(X \le Y+1) = 1$; the latter is a shortcut that we use for the expression $P(\{\omega \in \Omega: X(\omega) \le Y(\omega)+1\}) = 1$.

Boldface $\mathbf{A}, \mathbf{B}, \mathbf{C}$ will denote matrices. For a complex $z = x+iy \in \mathbb{C}$, by $x = \Re z$ and $y = \Im z$ we denote the real and the imaginary part of $z$. Unless otherwise stated, $\log a = \log_e a$ denotes the natural logarithm of a number $a$.

1.1  Moments

Given a real number $r \ge 0$, the absolute moment of order $r$ is defined by $E|X|^r$; the ordinary moment of order $r = 0, 1, \dots$ is defined as $EX^r$. Clearly, not every sequence of numbers is the sequence of moments of a random variable $X$; it may also happen that two random variables with different distributions have the same moments. However, in Corollary below we will show that the latter cannot happen for normal distributions.

The following inequality is known as Chebyshev's inequality. Despite its simplicity it has numerous non-trivial applications, see eg. Theorem or [].

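Proposition 1 If $f: \mathbb{R}_+ \to \mathbb{R}_+$ is a non-decreasing function and $Ef(|X|) = C < \infty$, then for all $t > 0$ such that $f(t) \ne 0$ we have

$$P(|X| > t) \le C/f(t). \tag{1}$$

Indeed, $Ef(|X|) = \int_\Omega f(|X|)\,dP \ge \int_{|X| \ge t} f(|X|)\,dP \ge \int_{|X| \ge t} f(t)\,dP = f(t)P(|X| \ge t) \ge f(t)P(|X| > t)$.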
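
Inequality (1) is easy to check on a concrete example. The following Python sketch does so exactly (with rational arithmetic) for a small finite distribution and $f(x) = x^2$; the distribution and the names `values`, `probs`, `tail` and `chebyshev_bound` are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction

# Hypothetical finite distribution of X (illustrative values and probabilities).
values = [-3, -1, 0, 2, 5]
probs = [Fraction(1, 10), Fraction(3, 10), Fraction(2, 10),
         Fraction(3, 10), Fraction(1, 10)]


def f(x):
    """A non-decreasing function on [0, infinity); here f(x) = x**2."""
    return x * x


def expectation(h):
    """E h(X) for the finite distribution above."""
    return sum(p * h(v) for v, p in zip(values, probs))


def tail(t):
    """P(|X| > t)."""
    return sum(p for v, p in zip(values, probs) if abs(v) > t)


def chebyshev_bound(t):
    """Right-hand side C / f(t) of inequality (1), with C = E f(|X|)."""
    return expectation(lambda v: f(abs(v))) / f(t)


for t in [1, 2, 3, 4]:
    # Each printed row shows t, the exact tail P(|X| > t), and the bound C / f(t).
    print(t, tail(t), chebyshev_bound(t))
```

In every printed row the tail probability does not exceed the bound; the bound is crude for small $t$ and becomes more informative as $t$ grows.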

It follows immediately from Chebyshev's inequality that if $E|X|^p = C < \infty$, then $P(|X| > t) \le C/t^p$, $t > 0$. An implication in the converse direction is also well known: if $P(|X| > t) \le C/t^{p+\varepsilon}$ for some $\varepsilon > 0$ and for all $t > 0$, then $E|X|^p < \infty$, see () below.

The following formula will often be useful.

Proposition 2 If $f: \mathbb{R}_+ \to \mathbb{R}$ is a function such that $f(x) = f(0) + \int_0^x g(t)\,dt$, $E\{|f(X)|\} < \infty$ and $X \ge 0$, then

$$Ef(X) = f(0) + \int_0^\infty g(t)P(X \ge t)\,dt. \tag{2}$$

Moreover, if $g \ge 0$ and if the right hand side of (2) is finite, then $Ef(X) < \infty$.

Proof. The formula follows from Fubini's theorem, since for $X \ge 0$

$$\int_\Omega f(X)\,dP = \int_\Omega \left( f(0) + \int_0^\infty 1_{t \le X}\, g(t)\,dt \right) dP = f(0) + \int_0^\infty g(t) \left( \int_\Omega 1_{t \le X}\,dP \right) dt = f(0) + \int_0^\infty g(t)P(X \ge t)\,dt.$$
□

Corollary 1 If $E|X|^r < \infty$ for an integer $r > 0$, then

$$EX^r = r\int_0^\infty t^{r-1}P(X \ge t)\,dt + (-1)^r\, r\int_0^\infty t^{r-1}P(-X \ge t)\,dt. \tag{3}$$

If $E|X|^r < \infty$ for real $r > 0$ then

$$E|X|^r = r\int_0^\infty t^{r-1}P(|X| \ge t)\,dt. \tag{4}$$

Moreover, the left hand side of (4) is finite if and only if the right hand side is finite.

Proof. Formula (4) follows directly from Proposition 1.1 applied to $|X|$ (with $f(x) = x^r$ and $g(t) = \frac{d}{dt}f(t) = rt^{r-1}$).

Since $X^r = (X^+)^r + (-1)^r (X^-)^r$, where $X^+ = \max\{X, 0\}$ and $X^- = \max\{-X, 0\}$, applying (4) separately to $X^+$ and to $X^-$ we get (3). For odd $r$ the second term enters with the minus sign; in particular $EX = EX^+ - EX^-$. □
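
Formula (4) can be checked numerically. The sketch below assumes, purely for illustration, a standard exponential random variable, for which $P(X \ge t) = e^{-t}$ and $EX^r = \Gamma(r+1)$; the helper names `simpson` and `moment_from_tail` and the truncation point `upper` are hypothetical choices.

```python
import math


def simpson(h, a, b, n=20000):
    """Composite Simpson rule for the integral of h over [a, b]; n must be even."""
    step = (b - a) / n
    total = h(a) + h(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * h(a + k * step)
    return total * step / 3


def moment_from_tail(r, tail, upper=60.0):
    """Right-hand side of (4), with the integral truncated at `upper`."""
    return r * simpson(lambda t: t ** (r - 1) * tail(t), 0.0, upper)


def exp_tail(t):
    """P(X >= t) for a standard exponential variable (an illustrative choice)."""
    return math.exp(-t)


for r in (1, 2, 3, 2.5):
    # Compare the tail integral with the known moment Gamma(r + 1).
    print(r, moment_from_tail(r, exp_tail), math.gamma(r + 1))
```

The two printed columns agree up to the quadrature error, which is small here because the exponential tail makes the truncation at `upper` negligible.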

1.2  Lp-spaces

By $L_p(\Omega, \mathcal{M}, P)$, or $L_p$ if no misunderstanding may result, we denote the Banach space of a. s. equivalence classes of $p$-integrable $\mathcal{M}$-measurable random variables $X$ with the norm

$$\|X\|_p = \begin{cases} \sqrt[p]{E|X|^p} & \text{if } 1 \le p < \infty; \\ \operatorname{ess\,sup}|X| & \text{if } p = \infty. \end{cases}$$

If $X \in L_p$, we shall say that $X$ is $p$-integrable; in particular, $X$ is square integrable if $EX^2 < \infty$. We say that $X_n$ converges to $X$ in $L_p$, if $\|X_n - X\|_p \to 0$ as $n \to \infty$. If $X_n$ converges to $X$ in $L_2$, we shall also use the phrase sequence $X_n$ converges to $X$ in mean-square.

Several useful inequalities are collected in the following.

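Theorem 2 (i) For $1 \le p \le q \le \infty$ we have

$$\|X\|_p \le \|X\|_q. \tag{5}$$

(ii) If $1/p + 1/q = 1$, then Hölder's inequality holds:

$$EXY \le \|X\|_p \|Y\|_q. \tag{6}$$

(iii) For $1 \le p \le \infty$ we have

$$\|X+Y\|_p \le \|X\|_p + \|Y\|_p. \tag{7}$$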

Special case $p = q = 2$ of Hölder's inequality (6) reads $EXY \le \sqrt{EX^2\,EY^2}$. It is frequently used and is known as the Cauchy-Schwarz inequality.

For $1 \le p < \infty$ the conjugate space to $L_p$ (ie. the space of all bounded linear functionals on $L_p$) is usually identified with $L_q$, where $1/p + 1/q = 1$. The identification is by the duality $\langle f, g\rangle = \int f(\omega)g(\omega)\,dP$.

For the proof of Theorem 1.2 we need the following elementary inequality.

Lemma 1 For $a, b > 0$, $1 < p < \infty$ and $1/p + 1/q = 1$ we have

$$ab \le a^p/p + b^q/q. \tag{8}$$

Proof. Function $t \mapsto t^p/p + t^{-q}/q$ has the derivative $t^{p-1} - t^{-q-1}$. The derivative is positive for $t > 1$ and negative for $0 < t < 1$. Hence the minimum value of the function for $t > 0$ is attained at $t = 1$, giving

$$t^p/p + t^{-q}/q \ge 1.$$

Substituting $t = a^{1/q} b^{-1/p}$ we get (8). □
Proof of Theorem 1.2 (ii). If either $\|X\|_p = 0$ or $\|Y\|_q = 0$, then $XY = 0$ a. s. Therefore we consider only the case $\|X\|_p\|Y\|_q > 0$ and after rescaling we assume $\|X\|_p = \|Y\|_q = 1$. Furthermore, the case $p = 1$, $q = \infty$ is trivial as $|XY| \le |X|\,\|Y\|_\infty$. For $1 < p < \infty$ by (8) we have

$$|XY| \le |X|^p/p + |Y|^q/q.$$

Integrating this inequality we get $|EXY| \le E|XY| \le 1 = \|X\|_p\|Y\|_q$. □

Proof of Theorem 1.2 (i). For $p = 1$ this is just Jensen's inequality; for a more general version see Theorem . For $1 < p < \infty$ by Hölder's inequality applied to the product of $1$ and $|X|^p$ we have

$$\|X\|_p^p = E\{|X|^p \cdot 1\} \le (E|X|^q)^{p/q}(E1^r)^{1/r} = \|X\|_q^p,$$

where $r$ is computed from the equation $1/r + p/q = 1$. (This proof works also for $p = 1$ with obvious changes in the write-up.) □

Proof of Theorem 1.2 (iii). The inequality is trivial if $p = 1$ or if $\|X+Y\|_p = 0$. In the remaining cases

$$\|X+Y\|_p^p \le E\{(|X|+|Y|)|X+Y|^{p-1}\} = E\{|X||X+Y|^{p-1}\} + E\{|Y||X+Y|^{p-1}\}.$$

By Hölder's inequality

$$\|X+Y\|_p^p \le \|X\|_p\|X+Y\|_p^{p/q} + \|Y\|_p\|X+Y\|_p^{p/q}.$$

Since $p/q = p-1$, dividing both sides by $\|X+Y\|_p^{p/q}$ we get the conclusion. □

By $\mathrm{Var}(X)$ we shall denote the variance of a square integrable r. v. $X$

$$\mathrm{Var}(X) = EX^2 - (EX)^2 = E(X - EX)^2.$$

The correlation coefficient $\mathrm{corr}(X, Y)$ is defined for square-integrable non-degenerate r. v. $X, Y$ by the formula

$$\mathrm{corr}(X, Y) = \frac{EXY - EX\,EY}{\|X - EX\|_2\,\|Y - EY\|_2}.$$

The Cauchy-Schwarz inequality implies that $-1 \le \mathrm{corr}(X, Y) \le 1$.

1.3  Tail estimates

The function $N(x) = P(|X| \ge x)$ describes tail behavior of a r. v. $X$. Inequalities involving $N(\cdot)$ similar to Problems and are sometimes easy to prove. Integrability that follows is of considerable interest. Below we give two rather technical tail estimates and we state several corollaries for future reference. The proofs use only the fact that $N: [0, \infty) \to [0, 1]$ is a non-increasing function such that $\lim_{x \to \infty} N(x) = 0$.

Theorem 3 If there are $C > 1$, $0 < q < 1$, $x_0 \ge 0$ such that for all $x > x_0$

$$N(Cx) \le q N(x - x_0), \tag{9}$$

then there is $M < \infty$ such that $N(x) \le M/x^\beta$, where $\beta = -\log_C q$.

Proof. Let $a_n$ be such that when $a_n = x_n - x_0$ then $a_{n+1} = Cx_n$. Solving the resulting recurrence we get $a_n = C^n - b$, where $b = Cx_0(C-1)^{-1}$. Equation (9) says $N(a_{n+1}) \le qN(a_n)$. Therefore

$$N(a_n) \le N(a_0)q^n.$$

This implies the tail estimate for arbitrary $x > 0$. Namely, given $x > 0$ choose $n$ such that $a_n \le x < a_{n+1}$. Then

$$N(x) \le N(a_n) \le Kq^n = \frac{K}{q}\, q^{\log_C(a_{n+1}+b)} \le M(x+b)^{-\beta}.$$
□
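
The closed form $a_n = C^n - b$ used in the proof can be confirmed by iterating the recurrence $a_{n+1} = C(a_n + x_0)$ directly. The following Python sketch does this with exact rational arithmetic; the constants `C` and `x0` and the function names are illustrative assumptions.

```python
from fractions import Fraction


def closed_form(n, C, x0):
    """a_n = C**n - b with b = C*x0/(C - 1), as in the proof."""
    b = C * x0 / (C - 1)
    return C ** n - b


def iterate(n, C, x0):
    """Run the recurrence a_{k+1} = C*(a_k + x0), starting from a_0 = 1 - b."""
    a = 1 - C * x0 / (C - 1)
    for _ in range(n):
        a = C * (a + x0)
    return a


C, x0 = Fraction(3, 2), Fraction(2)  # illustrative constants with C > 1, x0 >= 0
for n in range(6):
    # The two columns coincide for every n.
    print(n, iterate(n, C, x0), closed_form(n, C, x0))
```

Since $a_{n+1} + b = C^{n+1}$, each step of the recurrence multiplies the tail bound by $q$ while multiplying $a_n + b$ by $C$, which is where the exponent $\beta = -\log_C q$ comes from.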
The next results follow from Theorem 1.3 and (4) and are stated for future reference.

Corollary 2 If there is $0 < q < 1$ and $x_0 \ge 0$ such that $N(2x) \le qN(x - x_0)$ for all $x > x_0$, then $E|X|^\beta < \infty$ for all $\beta < \log_2 1/q$.

Corollary 3 Suppose there is $C > 1$ such that for every $0 < q < 1$ one can find $x_0 \ge 0$ such that

$$N(Cx) \le qN(x) \tag{10}$$

for all $x > x_0$. Then $E|X|^p < \infty$ for all $p$.

As a special case of Corollary 1.3 we have the following.

Corollary 4 Suppose there are $C > 1$, $K < \infty$ such that

$$N(Cx) \le K\,\frac{N(x)}{x^2} \tag{11}$$

for all $x$ large enough. Then $E|X|^p < \infty$ for all $p$.

The next result deals with exponentially small tails.

Theorem 4 If there are $C > 1$, $1 < K < \infty$, $x_0 \ge 0$ such that

$$N(Cx) \le KN^2(x - x_0) \tag{12}$$

for all $x > x_0$, then there are $M < \infty$, $\beta > 0$ such that

$$N(x) \le M\exp(-\beta x^\alpha),$$

where $\alpha = \log_C 2$.

Proof. As in the proof of Theorem 1.3, let $a_n = C^n - b$, $b = Cx_0/(C-1)$. Put $q_n = \log_K N(a_n)$. Then (12) gives

$$N(a_{n+1}) \le KN^2(a_n),$$

which implies

$$q_{n+1} \le 2q_n + 1. \tag{13}$$

Therefore by induction we get

$$q_{m+n} \le 2^n(1 + q_m) - 1. \tag{14}$$

Indeed, (14) becomes equality for $n = 0$. If it holds for $n = k$, then $q_{m+k+1} \le 2q_{m+k} + 1 \le 2(2^k(1+q_m) - 1) + 1 = 2^{k+1}(1+q_m) - 1$. This proves (14) by induction.

Since $a_n \to \infty$, we have $N(a_n) \to 0$ and $q_n \to -\infty$. Choose $m$ large enough to have $1 + q_m < 0$. Then (14) implies

$$N(a_{n+m}) \le K^{2^n(1+q_m)} = \exp(-\beta 2^n).$$

The proof is now concluded by the standard argument. Selecting large enough $M$ we have $N(x) \le 1 \le M\exp(-\beta x^\alpha)$ for all $x \le a_m$. Given $x > a_m$ choose $n \ge 0$ such that $a_{n+m} \le x < a_{n+m+1}$. Then

$$N(x) \le N(a_{n+m}) \le \exp(-\beta 2^n) \le M\exp(-\beta 2^{\log_C a_{n+m+1}}) \le M\exp(-\beta x^\alpha).$$
□

Corollary 5 If there are $C < \infty$, $x_0 \ge 0$ such that

$$N(\sqrt{2}\,x) \le CN^2(x - x_0),$$

then there is $\beta > 0$ such that $E\exp(\beta|X|^2) < \infty$.

Corollary 6 If there are $C < \infty$, $x_0 \ge 0$ such that

$$N(2x) \le CN^2(x - x_0),$$

then there is $\beta > 0$ such that $E\exp(\beta|X|) < \infty$.

1.4  Conditional expectations

Below we recall the definition of the conditional expectation of a r. v. with respect to a $\sigma$-field and we state several results that we need for future reference. The definition is as old as axiomatic probability theory itself, see []. The reader not familiar with conditional expectations should consult textbooks, eg. Billingsley [], Durrett [], or Neveu [].

Definition 1 Let $(\Omega, \mathcal{M}, P)$ be a probability space. If $\mathcal{F} \subset \mathcal{M}$ is a $\sigma$-field and $X$ is an integrable random variable, then the conditional expectation of $X$ given $\mathcal{F}$ is an integrable $\mathcal{F}$-measurable random variable $Z$ such that $\int_A X\,dP = \int_A Z\,dP$ for all $A \in \mathcal{F}$.

Conditional expectation of an integrable random variable $X$ with respect to a $\sigma$-field $\mathcal{F} \subset \mathcal{M}$ will be denoted interchangeably by $E\{X|\mathcal{F}\}$ and $E^{\mathcal{F}}X$. We shall also write $E\{X|Y\}$ or $E^Y X$ for the conditional expectation $E\{X|\mathcal{F}\}$ when $\mathcal{F} = \sigma(Y)$ is the $\sigma$-field generated by a random variable $Y$.

Existence and almost sure uniqueness of the conditional expectation $E\{X|\mathcal{F}\}$ follows from the Radon-Nikodym theorem, applied to the finite signed measures $\mu(A) = \int_A X\,dP$ and $P|_{\mathcal{F}}$, both defined on the measurable space $(\Omega, \mathcal{F})$. In some simple situations more explicit expressions can also be found.

Example. Suppose $\mathcal{F}$ is a $\sigma$-field generated by the events $A_1, A_2, \dots, A_n$ which form a non-degenerate disjoint partition of the probability space $\Omega$. Then it is easy to check that

$$E\{X|\mathcal{F}\}(\omega) = \sum_{k=1}^n m_k I_{A_k}(\omega),$$

where $m_k = \int_{A_k} X\,dP / P(A_k)$. In other words, on $A_k$ we have $E\{X|\mathcal{F}\} = \int_{A_k} X\,dP / P(A_k)$. In particular, if $X$ is discrete and $X = \sum x_j I_{B_j}$, then we get the intuitive expression

$$E\{X|\mathcal{F}\} = \sum_j x_j P(B_j|A_k) \quad \text{for } \omega \in A_k.$$
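
The partition formula is simple to compute by hand or by machine. The Python sketch below evaluates $E\{X|\mathcal{F}\}$ on a six-point probability space; the weights, the values of $X$, the partition, and the function name `conditional_expectation` are all illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical finite probability space: outcomes 0..5 with the weights below.
prob = [Fraction(1, 12), Fraction(3, 12), Fraction(2, 12),
        Fraction(2, 12), Fraction(1, 12), Fraction(3, 12)]
X = [4, 0, 2, 6, 1, 3]                # values X(omega)
partition = [{0, 1}, {2, 3, 4}, {5}]  # disjoint events A_1, A_2, A_3 covering Omega


def conditional_expectation(X, prob, partition):
    """Return E{X|F} as a list indexed by outcome, F generated by the partition."""
    result = [None] * len(X)
    for A in partition:
        pA = sum(prob[w] for w in A)
        mk = sum(X[w] * prob[w] for w in A) / pA  # m_k = int_{A_k} X dP / P(A_k)
        for w in A:
            result[w] = mk  # E{X|F} is constant on each A_k
    return result


Z = conditional_expectation(X, prob, partition)
print(Z)
```

By construction $Z$ is constant on each $A_k$ and satisfies $\int_{A_k} Z\,dP = \int_{A_k} X\,dP$, which is the defining property from Definition 1 checked on the generators of $\mathcal{F}$.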

Example. Suppose that $f(x, y)$ is the joint density with respect to the Lebesgue measure on $\mathbb{R}^2$ of the bivariate random variable $(X, Y)$ and let $f_Y(y) \ne 0$ be the (marginal) density of $Y$. Put $f(x|y) = f(x, y)/f_Y(y)$. Then $E\{X|Y\} = h(Y)$, where $h(y) = \int_{-\infty}^\infty x f(x|y)\,dx$.

The next theorem lists properties of conditional expectations that will be used without further mention.

Theorem 5

Remark: Inequality (iv) is known as Jensen's inequality and this is how we shall refer to it.

The proof uses the following.

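Lemma 2 If $Y_1$ and $Y_2$ are $\mathcal{F}$-measurable and $\int_A Y_1\,dP \le \int_A Y_2\,dP$ for all $A \in \mathcal{F}$, then $Y_1 \le Y_2$ almost surely. If $\int_A Y_1\,dP = \int_A Y_2\,dP$ for all $A \in \mathcal{F}$, then $Y_1 = Y_2$.

Proof. Let $A_\varepsilon = \{Y_1 > Y_2 + \varepsilon\} \in \mathcal{F}$. Since $\int_{A_\varepsilon} Y_1\,dP \ge \int_{A_\varepsilon} Y_2\,dP + \varepsilon P(A_\varepsilon)$, thus $P(A_\varepsilon) > 0$ is impossible. Event $\{Y_1 > Y_2\}$ is the countable union of the events $A_\varepsilon$ (with $\varepsilon > 0$ rational); thus it has probability 0 and $Y_1 \le Y_2$ with probability one.

The second part follows from the first by symmetry. □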
Proof of Theorem 1.4.

(i) This is verified first for $Y = I_B$ (the indicator function of an event $B \in \mathcal{F}$). Let $Y_1 = E\{XY|\mathcal{F}\}$, $Y_2 = YE\{X|\mathcal{F}\}$. From the definition one can easily see that both $\int_A Y_1\,dP$ and $\int_A Y_2\,dP$ are equal to $\int_{A \cap B} X\,dP$. Therefore $Y_1 = Y_2$ by Lemma 1.4.

For the general case, approximate $Y$ by simple random variables and use (vi).

(ii) This follows from Lemma 1.4: random variables $Y_1 = E\{X|\mathcal{F}\}$, $Y_2 = E\{X|\mathcal{G}\}$ are $\mathcal{G}$-measurable and for $A$ in $\mathcal{G}$ both $\int_A Y_1\,dP$ and $\int_A Y_2\,dP$ are equal to $\int_A X\,dP$.

(iii) Let $Y_1 = E\{X|\mathcal{N} \vee \mathcal{F}\}$, $Y_2 = E\{X|\mathcal{F}\}$. We check first that

$$\int_A Y_1\,dP = \int_A Y_2\,dP$$

for all $A = B \cap C$, where $B \in \mathcal{N}$ and $C \in \mathcal{F}$. This holds true, as both sides of the equation are equal to $P(B)\int_C X\,dP$. Once equality $\int_A Y_1\,dP = \int_A Y_2\,dP$ is established for the generators of the $\sigma$-field, it holds true for the whole $\sigma$-field $\mathcal{N} \vee \mathcal{F}$; this is standard measure theory, see the $\pi$-$\lambda$ Theorem [].

(iv) Here we need the first part of Lemma 1.4. We also need to know that each convex function $g(x)$ can be written as the supremum of a family of affine functions $f_{a,b}(x) = ax + b$. Let $Y_1 = E\{g(X)|\mathcal{F}\}$, $Y_2 = f_{a,b}(E\{X|\mathcal{F}\})$, $A \in \mathcal{F}$. By (vi) we have

$$\int_A Y_1\,dP = \int_A g(X)\,dP \ge \int_A f_{a,b}(X)\,dP = \int_A f_{a,b}(E\{X|\mathcal{F}\})\,dP = \int_A Y_2\,dP.$$

Hence $f_{a,b}(E\{X|\mathcal{F}\}) \le E\{g(X)|\mathcal{F}\}$; taking the supremum (over suitable $a, b$) ends the proof.

(v), (vi), (vii) These proofs are left as exercises. □

Theorem 1.4 gives a geometric interpretation of the conditional expectation $E\{\cdot|\mathcal{F}\}$ as the projection of the Banach space $L_p(\Omega, \mathcal{M}, P)$ onto its closed subspace $L_p(\Omega, \mathcal{F}, P)$, consisting of all $p$-integrable $\mathcal{F}$-measurable random variables, $p \ge 1$. This projection is ``self adjoint'' in the sense that the adjoint operator is given by the same ``conditional expectation'' formula, although the adjoint operator acts on $L_q$ rather than on $L_p$; for square integrable functions $E\{\cdot|\mathcal{F}\}$ is just the orthogonal projection onto $L_2(\Omega, \mathcal{F}, P)$. Monograph [] considers conditional expectation from this angle.

We will use the following (weak) version of the martingale convergence theorem.

Theorem 6 Suppose $\mathcal{F}_n$ is a decreasing family of $\sigma$-fields, ie. $\mathcal{F}_{n+1} \subset \mathcal{F}_n$ for all $n \ge 1$. If $X$ is integrable, then $E\{X|\mathcal{F}_n\} \to E\{X|\mathcal{F}\}$ in $L_1$-norm, where $\mathcal{F}$ is the intersection of all $\mathcal{F}_n$.

Proof. Suppose first that $X$ is square integrable. Subtracting $m = EX$ if necessary, we can reduce the convergence question to the centered case $EX = 0$. Denote $X_n = E\{X|\mathcal{F}_n\}$. Since $\mathcal{F}_{n+1} \subset \mathcal{F}_n$, by Jensen's inequality $EX_n^2 \ge 0$ is a decreasing non-negative sequence. In particular, $EX_n^2$ converges.

Let $m < n$ be fixed. Then $E(X_n - X_m)^2 = EX_n^2 + EX_m^2 - 2EX_nX_m$. Since $\mathcal{F}_n \subset \mathcal{F}_m$, by Theorem 1.4 we have

$$EX_nX_m = EE\{X_nX_m|\mathcal{F}_n\} = EX_nE\{X_m|\mathcal{F}_n\} = EX_nE\{E\{X|\mathcal{F}_m\}|\mathcal{F}_n\} = EX_nE\{X|\mathcal{F}_n\} = EX_n^2.$$

Therefore $E(X_n - X_m)^2 = EX_m^2 - EX_n^2$. Since $EX_n^2$ converges, $X_n$ satisfies the Cauchy condition for convergence in $L_2$ norm. This shows that for square integrable $X$, sequence $\{X_n\}$ converges in $L_2$.

If $X$ is not square integrable, then for every $\varepsilon > 0$ there is a square integrable $Y$ such that $E|X - Y| < \varepsilon$. By Jensen's inequality $E\{X|\mathcal{F}_n\}$ and $E\{Y|\mathcal{F}_n\}$ differ by at most $\varepsilon$ in $L_1$-norm; this holds uniformly in $n$. Since by the first part of the proof $E\{Y|\mathcal{F}_n\}$ is convergent, it satisfies the Cauchy condition in $L_2$ and hence in $L_1$. Therefore for each $\varepsilon > 0$ we can find $N$ such that for all $n, m > N$ we have $E\{|E\{X|\mathcal{F}_n\} - E\{X|\mathcal{F}_m\}|\} < 3\varepsilon$. This shows that $E\{X|\mathcal{F}_n\}$ satisfies the Cauchy condition and hence converges in $L_1$.

The fact that the limit is $X_\infty = E\{X|\mathcal{F}\}$ can be seen as follows. Clearly $X_\infty$ is $\mathcal{F}_n$-measurable for all $n$, ie. it is $\mathcal{F}$-measurable. For $A \in \mathcal{F}$ (hence also in $\mathcal{F}_n$), we have $EXI_A = EX_nI_A$. Since

$$|EX_nI_A - EX_\infty I_A| \le E|X_n - X_\infty|I_A \le E|X_n - X_\infty| \to 0,$$

therefore $EX_nI_A \to EX_\infty I_A$. This shows that $EXI_A = EX_\infty I_A$ and by definition, $X_\infty = E\{X|\mathcal{F}\}$. □

1.5  Characteristic functions

The characteristic function of a real-valued random variable $X$ is defined by $\varphi_X(t) = E\exp(itX)$, where $i$ is the imaginary unit ($i^2 = -1$). It is easily seen that

$$\varphi_{aX+b}(t) = e^{itb}\varphi_X(at). \tag{15}$$

If $X$ has the density $f(x)$, the characteristic function is just its Fourier transform: $\varphi(t) = \int_{-\infty}^\infty e^{itx}f(x)\,dx$. If $\varphi(t)$ is integrable, then the inverse Fourier transform gives

$$f(x) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{-itx}\varphi(t)\,dt.$$

This is occasionally useful in verifying whether the specific $\varphi(t)$ is a characteristic function as in the following example.

Example 1 The following gives an example of a characteristic function that has finite support. Let $\varphi(t) = 1 - |t|$ for $|t| < 1$ and $0$ otherwise. Then

$$f(x) = \frac{1}{2\pi}\int_{-1}^1 e^{-itx}(1 - |t|)\,dt = \frac{1}{\pi}\int_0^1 (1 - t)\cos tx\,dt = \frac{1}{\pi}\,\frac{1 - \cos x}{x^2}.$$

Since $f(x) = \frac{1}{\pi}\frac{1 - \cos x}{x^2}$ is non-negative and integrable, $\varphi(t)$ is indeed a characteristic function.
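
The inversion integral in Example 1 can be verified numerically. The following Python sketch compares a Simpson-rule evaluation of $\frac{1}{\pi}\int_0^1 (1-t)\cos tx\,dt$ with the closed form $\frac{1}{\pi}\frac{1-\cos x}{x^2}$; the function names and the number of quadrature nodes are illustrative assumptions.

```python
import math


def phi(t):
    """Triangular characteristic function: 1 - |t| on (-1, 1), zero elsewhere."""
    return max(0.0, 1.0 - abs(t))


def density_closed_form(x):
    """f(x) = (1 - cos x) / (pi x^2), with the limiting value 1/(2 pi) at x = 0."""
    if x == 0:
        return 1.0 / (2.0 * math.pi)
    return (1.0 - math.cos(x)) / (math.pi * x * x)


def density_by_inversion(x, n=2000):
    """(1/pi) * integral over [0, 1] of (1 - t) cos(t x) dt, composite Simpson rule."""
    step = 1.0 / n
    total = phi(0.0) * math.cos(0.0) + phi(1.0) * math.cos(x)
    for k in range(1, n):
        t = k * step
        total += (4 if k % 2 else 2) * phi(t) * math.cos(t * x)
    return total * step / (3.0 * math.pi)


for x in (0.0, 0.5, 1.0, 3.0, 10.0):
    # The numerical inversion and the closed form agree to quadrature accuracy.
    print(x, density_by_inversion(x), density_closed_form(x))
```

The symmetry of $\varphi$ is what reduces the complex integral over $[-1, 1]$ to a real cosine integral over $[0, 1]$, so no complex arithmetic is needed here.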

The following properties of characteristic functions are proved in any standard probability course, see eg. [,].

Theorem 7 (i) The distribution of $X$ is determined uniquely by its characteristic function $\varphi(t)$.

(ii) If $E|X|^r < \infty$ for some $r = 0, 1, \dots$, then $\varphi(t)$ is $r$-times differentiable, the derivative is uniformly continuous and

$$EX^k = (-i)^k \frac{d^k}{dt^k}\varphi(t)\Big|_{t=0}$$

for all $0 \le k \le r$.

(iii) If $\varphi(t)$ is $2r$-times differentiable for some natural $r$, then $EX^{2r} < \infty$.

(iv) If $X, Y$ are independent random variables, then $\varphi_{X+Y}(t) = \varphi_X(t)\varphi_Y(t)$ for all $t \in \mathbb{R}$.

For a $d$-dimensional random variable $\mathbf{X} = (X_1, \dots, X_d)$ the characteristic function $\varphi_{\mathbf{X}}: \mathbb{R}^d \to \mathbb{C}$ is defined by $\varphi_{\mathbf{X}}(\mathbf{t}) = E\exp(i\,\mathbf{t}\cdot\mathbf{X})$, where the dot denotes the dot (scalar) product, ie. $\mathbf{x}\cdot\mathbf{y} = \sum x_k y_k$. For a pair of real valued random variables $X, Y$, we also write $\varphi(t, s) = \varphi_{(X,Y)}((t, s))$ and we call $\varphi(t, s)$ the joint characteristic function of $X$ and $Y$.

The following is the multi-dimensional version of Theorem 1.5.

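Theorem 8 (i) The distribution of $\mathbf{X}$ is determined uniquely by its characteristic function $\varphi(\mathbf{t})$.

(ii) If $E\|\mathbf{X}\|^r < \infty$, then $\varphi(\mathbf{t})$ is $r$-times differentiable and

$$EX_{j_1}\dots X_{j_k} = (-i)^k \frac{\partial^k}{\partial t_{j_1}\dots\partial t_{j_k}}\varphi(\mathbf{t})\Big|_{\mathbf{t}=0}$$

for all $0 \le k \le r$.

(iii) If $\mathbf{X}, \mathbf{Y}$ are independent $\mathbb{R}^d$-valued random variables, then

$$\varphi_{\mathbf{X}+\mathbf{Y}}(\mathbf{t}) = \varphi_{\mathbf{X}}(\mathbf{t})\varphi_{\mathbf{Y}}(\mathbf{t})$$

for all $\mathbf{t}$ in $\mathbb{R}^d$.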

The next result seems to be less known although it is both easy to prove and to apply. We shall use it on several occasions in Chapter . The converse is also true if we assume that the integer parameter $r$ in the proof below is even or that the joint characteristic function $\varphi(t, s)$ is differentiable; to prove the converse, one can follow the usual proof of the inversion formula for characteristic functions, see, eg. []. Kagan, Linnik & Rao [] state explicitly several most frequently used variants of ().

Theorem 9 Suppose real valued random variables $X, Y$ have the joint characteristic function $\varphi(t, s)$. Assume that $E|X|^m < \infty$ for some $m \in \mathbb{N}$. Let $g(y)$ be such that

$$E\{X^m|Y\} = g(Y).$$

Then for all real $s$

$$(-i)^m \frac{\partial^m}{\partial t^m}\varphi(t, s)\Big|_{t=0} = Eg(Y)\exp(isY). \tag{16}$$

In particular, if $g(y) = \sum c_k y^k$ is a polynomial, then

$$(-i)^m \frac{\partial^m}{\partial t^m}\varphi(t, s)\Big|_{t=0} = \sum_k (-i)^k c_k \frac{d^k}{ds^k}\varphi(0, s). \tag{17}$$

Proof. Since by assumption $E|X|^m < \infty$, the joint characteristic function $\varphi(t, s) = E\exp(itX + isY)$ can be differentiated $m$ times with respect to $t$ and

$$\frac{\partial^m}{\partial t^m}\varphi(t, s) = i^m EX^m\exp(itX + isY).$$

Putting $t = 0$ establishes (16), see Theorem 1.4(i).

In order to prove (17), we need to show first that $E|Y|^r < \infty$, where $r$ is the degree of the polynomial $g(y)$. By Jensen's inequality $E|g(Y)| \le E|X|^m < \infty$, and since $|g(y)/y^r| \to \mathrm{const} \ne 0$ as $|y| \to \infty$, therefore there is $C > 0$ such that $|y|^r \le C|g(y)|$ for all $|y|$ large enough. Hence $E|Y|^r < \infty$ follows.

Formula (17) is now a simple consequence of (16); indeed, for $0 \le k \le r$ we have $EY^k\exp(isY) = (-i)^k \frac{d^k}{ds^k}\varphi(0, s)$; this formula is obtained by differentiating $k$-times $E\exp(isY)$ under the integral sign. □

1.6  Symmetrization

Definition 2 A random variable X (also: a vector valued random variable X) is symmetric if X and -X have the same distribution.

Symmetrization techniques deal with comparison of properties of an arbitrary variable X with some symmetric variable Xsym. Symmetric variables are usually easier to deal with, and proofs of many theorems (not only characterization theorems, see eg. []) become simpler when reduced to the symmetric case.

There are two natural ways to obtain a symmetric random variable $X_{sym}$ from an arbitrary random variable $X$. The first one is to multiply $X$ by an independent random sign $\pm 1$; in terms of the characteristic functions this amounts to replacing the characteristic function $f$ of $X$ by its symmetrization $\frac12(f(t)+f(-t))$. This approach has the advantage that if $X$ is symmetric, then its symmetrization $X_{sym}$ has the same distribution as $X$. Integrability properties are also easy to compare, because $|X| = |X_{sym}|$.

The other symmetrization, which has perhaps less obvious properties but is frequently found more useful, is defined as follows. Let $X'$ be an independent copy of $X$. The symmetrization $\tilde{X}$ of $X$ is defined by $\tilde{X} = X - X'$. In terms of the characteristic functions this corresponds to replacing the characteristic function $f(t)$ of $X$ by the characteristic function $|f(t)|^2$. This procedure is easily seen to change the distribution of $X$, except when $X = 0$.
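The correspondence between the second symmetrization and the modulus squared of the characteristic function can be checked numerically. The following is a minimal sketch, assuming a small discrete distribution `dist` chosen only for illustration; the helper names are hypothetical and not part of the text.

```python
import cmath
from itertools import product

# Hypothetical discrete distribution of X: value -> probability.
dist = {0.0: 0.5, 1.0: 0.3, 3.0: 0.2}

def char_fn(d, t):
    """Characteristic function E exp(itX) of a discrete distribution."""
    return sum(p * cmath.exp(1j * t * x) for x, p in d.items())

def symmetrized(d):
    """Distribution of X - X' for an independent copy X' of X."""
    out = {}
    for (x, p), (y, q) in product(d.items(), repeat=2):
        out[x - y] = out.get(x - y, 0.0) + p * q
    return out

def max_gap(d, ts):
    """Largest deviation between the ch.f. of X - X' and |f(t)|^2 over ts."""
    sym = symmetrized(d)
    return max(abs(char_fn(sym, t) - abs(char_fn(d, t)) ** 2) for t in ts)

gap = max_gap(dist, [0.1 * k for k in range(-50, 51)])
```

Up to rounding, `gap` should vanish, since $E\exp(it(X-X')) = f(t)\overline{f(t)}$ by independence.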

Theorem 10 (i) If the symmetrization $\tilde{X}$ of a random variable $X$ has a finite moment of order $p \ge 1$, then $E|X|^p < \infty$.

(ii) If the symmetrization $\tilde{X}$ of a random variable $X$ has finite exponential moment $E\exp(\lambda|\tilde{X}|)$, then $E\exp(\lambda|X|) < \infty$, $\lambda > 0$.

(iii) If the symmetrization $\tilde{X}$ of a random variable $X$ satisfies $E\exp(\lambda|\tilde{X}|^2) < \infty$, then $E\exp(\lambda|X|^2) < \infty$, $\lambda > 0$.

The usual approach to Theorem 1.6 uses the symmetrization inequality, which is of independent interest (see Problem ), and formula (2). Our proof requires extra assumptions, but in return it is short, does not require $X$ and $X'$ to have the same distribution, and gives a more accurate bound (within its domain of applicability).


Proof in the case when $E|X| < \infty$ and $EX = 0$: Let $g(x) \ge 0$ be a convex function such that $Eg(\tilde{X}) < \infty$, and let $X, X'$ be independent copies of $X$, so that the conditional expectation (given $X$) satisfies $E^X X' = EX' = 0$. Then $Eg(X) = Eg(X - E^X X') = Eg(E^X\{X - X'\})$. By Jensen's inequality, see Theorem 1.4(iv), we have $Eg(E^X\{X-X'\}) \le Eg(X-X')$; therefore $Eg(X) \le Eg(X-X') = Eg(\tilde{X}) < \infty$. To end the proof, consider three convex functions $g(x) = |x|^p$, $g(x) = \exp(\lambda|x|)$ and $g(x) = \exp(\lambda x^2)$.

1.7  Uniform integrability

Recall that a sequence $\{X_n\}_{n \ge 1}$ is uniformly integrable if

\[ \lim_{t\to\infty}\ \sup_{n \ge 1} \int_{\{|X_n| > t\}} |X_n|\, dP = 0. \]

Uniform integrability is often used in conjunction with weak convergence to verify the convergence of moments. Namely, if $X_n$ is uniformly integrable and converges in distribution to $Y$, then $Y$ is integrable and

\[ EY = \lim_{n\to\infty} EX_n. \tag{18} \]
The following result will be used in the proof of the Central Limit Theorem in Section .

Proposition 3 If $X_1, X_2, \dots$ are centered i. i. d. random variables with finite second moments and $S_n = \sum_{j=1}^n X_j$, then $\{\frac{1}{n}S_n^2\}_{n \ge 1}$ is uniformly integrable.

The following lemma is a special case of the celebrated Khinchin inequality.

Lemma 3 If $\epsilon_j$ are $\pm 1$-valued symmetric independent r. v., then for all real numbers $a_j$

\[ E\Big(\sum_{j=1}^n a_j\epsilon_j\Big)^4 \le 3\Big(\sum_{j=1}^n a_j^2\Big)^2. \tag{19} \]

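Proof. By independence and symmetry we have

\[ E\Big(\sum_{j=1}^n a_j\epsilon_j\Big)^4 = \sum_{j=1}^n a_j^4 + 6\sum_{i<j} a_i^2a_j^2, \]
where the second sum runs over unordered pairs of distinct indices. The right-hand side is at most $3\big(\sum_{j=1}^n a_j^4 + 2\sum_{i<j}a_i^2a_j^2\big) = 3\big(\sum_{j=1}^n a_j^2\big)^2$. $\Box$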
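For small $n$ the fourth moment can be computed exactly by averaging over all $2^n$ sign patterns, which gives a direct check of the expansion and of the bound (19). The sketch below is illustrative only; the coefficient vector `a` is a hypothetical choice, and the cross term is summed over unordered pairs $i<j$.

```python
from itertools import product

def fourth_moment(a):
    """E (sum_j a_j eps_j)^4, averaged exactly over all 2^n sign patterns."""
    n = len(a)
    total = sum(sum(x * e for x, e in zip(a, signs)) ** 4
                for signs in product((-1, 1), repeat=n))
    return total / 2 ** n

def closed_form(a):
    """sum_j a_j^4 + 6 * sum_{i<j} a_i^2 a_j^2."""
    n = len(a)
    s4 = sum(x ** 4 for x in a)
    cross = sum(a[i] ** 2 * a[j] ** 2
                for i in range(n) for j in range(i + 1, n))
    return s4 + 6 * cross

def khinchin_bound(a):
    """Right-hand side of (19): 3 * (sum_j a_j^2)^2."""
    return 3 * sum(x ** 2 for x in a) ** 2

a = [1, -2, 3, 0.5]  # hypothetical coefficients
```

Here `fourth_moment(a)` and `closed_form(a)` agree, and both stay below `khinchin_bound(a)`.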
The next lemma gives the Marcinkiewicz-Zygmund inequality in the special case needed below.

Lemma 4 If $X_k$ are i. i. d. centered with finite fourth moments, then there is a constant $C < \infty$ such that

\[ E S_n^4 \le C n^2 E X_1^4. \tag{20} \]

Proof. As in the proof of Theorem 1.6 we can estimate the fourth moment of a centered r. v. by the fourth moment of its symmetrization, $E S_n^4 \le E \tilde{S}_n^4$.

Let $\epsilon_j$ be independent of the $\tilde{X}_k$'s, as in Lemma 1.7. Then in distribution $\tilde{S}_n \cong \sum_{j=1}^n \epsilon_j\tilde{X}_j$. Therefore, integrating with respect to the distribution of $\epsilon_j$ first, from (19) we get

\[ E S_n^4 \le 3 E\Big(\sum_{j=1}^n \tilde{X}_j^2\Big)^2 = 3 E\sum_{i,j=1}^n \tilde{X}_i^2\tilde{X}_j^2 \le 3 n^2 E\tilde{X}_1^4. \]
Since $\|X - X'\|_4 \le 2\|X\|_4$ by the triangle inequality (7), this ends the proof with $C = 3\cdot 2^4$. $\Box$
We shall also need the following inequality.

Lemma 5 If $U, V \ge 0$ then

\[ \int_{\{U+V > 2t\}} (U+V)^2\, dP \le 4\Big(\int_{\{U > t\}} U^2\, dP + \int_{\{V > t\}} V^2\, dP\Big). \]

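Proof. By (2) applied to $f(x) = x^2 I_{\{x > 2t\}}$ (the jump of $f$ at $x = 2t$ contributes the first term) we have

\[ \int_{\{U+V > 2t\}} (U+V)^2\, dP = 4t^2 P(U+V > 2t) + \int_{2t}^{\infty} 2x\, P(U+V > x)\, dx. \]
Since $P(U+V > x) \le P(U > x/2)+P(V > x/2)$, the substitution $x = 2y$ gives
\[ \int_{\{U+V > 2t\}} (U+V)^2\, dP \le 4t^2\big(P(U > t)+P(V > t)\big) + 4\int_{t}^{\infty} \big(2y\, P(U > y)+2y\, P(V > y)\big)\, dy = 4\int_{\{U > t\}} U^2\, dP + 4\int_{\{V > t\}} V^2\, dP, \]
where the last equality is (2) applied to $x^2 I_{\{x > t\}}$ for $U$ and for $V$ separately.
$\Box$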
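The inequality of the lemma holds for every probability measure, in particular for the empirical measure of a finite sample, so it can be tested without any sampling error. The sketch below assumes two arbitrary nonnegative samples generated with a fixed seed; the distributions and levels are hypothetical choices made only for illustration.

```python
import random

def truncated_second_moment(values, level):
    """Empirical version of the integral of W^2 over the event {W > level}."""
    return sum(w * w for w in values if w > level) / len(values)

def lemma_gap(us, vs, t):
    """Right-hand side minus left-hand side of the inequality (expected >= 0)."""
    sums = [u + v for u, v in zip(us, vs)]
    lhs = truncated_second_moment(sums, 2 * t)
    rhs = 4 * (truncated_second_moment(us, t) + truncated_second_moment(vs, t))
    return rhs - lhs

rng = random.Random(0)  # fixed seed so the run is reproducible
us = [rng.expovariate(1.0) for _ in range(2000)]
vs = [abs(rng.gauss(0.0, 2.0)) for _ in range(2000)]
gaps = [lemma_gap(us, vs, t) for t in (0.0, 0.5, 1.0, 2.0, 5.0)]
```

Every entry of `gaps` is nonnegative, because on the event $\{U+V > 2t\}$ the larger of $U, V$ exceeds $t$ and $(U+V)^2 \le 4\max(U,V)^2$.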
Proof of Proposition 1.7. We follow Billingsley [].

Let $\epsilon > 0$ and choose $M > 0$ such that $\int_{\{|X_1| > M\}}|X_1|^2\, dP < \epsilon$. Split $X_k = X_k'+X_k''$, where $X_k' = X_kI_{\{|X_k| \le M\}}-E\{X_kI_{\{|X_k| \le M\}}\}$, and let $S_n'$, $S_n''$ denote the corresponding sums.

Notice that for any $U \ge 0$ we have $U I_{\{U > m\}} \le U^2/m$. Therefore $\frac{1}{n}\int_{\{|S_n'| > t\sqrt{n}\}}(S_n')^2\, dP \le t^{-2}n^{-2}E(S_n')^4$, which by Lemma 1.7 gives

\[ \frac{1}{n}\int_{\{|S_n'| > t\sqrt{n}\}} (S_n')^2\, dP \le C M^4/t^2. \tag{21} \]

Now we use orthogonality to estimate the second term:

\[ \frac{1}{n}\int_{\{|S_n''| > t\sqrt{n}\}} (S_n'')^2\, dP \le \frac{1}{n}E(S_n'')^2 \le E|X_1''|^2 < \epsilon. \tag{22} \]
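To end the proof notice that by Lemma 1.7 and inequalities (21), (22) we have
\[ \frac{1}{n}\int_{\{|S_n| > 2t\sqrt{n}\}} S_n^2\, dP \le \frac{1}{n}\int_{\{|S_n'|+|S_n''| > 2t\sqrt{n}\}} (|S_n'|+|S_n''|)^2\, dP \le \frac{CM^4}{t^2}+\epsilon. \]
Therefore

$\limsup_{t\to\infty}\sup_n \frac{1}{n}\int_{\{|S_n| > 2t\sqrt{n}\}}S_n^2\, dP \le \epsilon$. Since $\epsilon > 0$ is arbitrary, this ends the proof. $\Box$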

1.8  The Mellin transform

Definition 3 The Mellin transform of a random variable $X \ge 0$ is defined for all complex $s$ such that $EX^{\Re s - 1} < \infty$ by the formula $M(s) = EX^{s-1}$.

The definition is consistent with the usual definition of the Mellin transform of an integrable function: if $X$ has a probability density function $f(x)$, then the Mellin transform of $X$ is given by $M(s) = \int_0^\infty x^{s-1}f(x)\, dx$.

Theorem 11 If $X \ge 0$ is a random variable such that $EX^{a-1} < \infty$ for some $a \ge 1$, then the Mellin transform $M(s) = EX^{s-1}$, considered for $s \in \mathbb{C}$ such that $\Re s = a$, determines the distribution of $X$ uniquely.

Proof. The easiest case is when $a = 1$ and $X > 0$. Then $M(s)$ is just the characteristic function of $\log(X)$; thus the distribution of $\log(X)$, and hence the distribution of $X$, is determined uniquely.
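As a concrete instance of the density formula, take the standard exponential density $e^{-x}$ on $(0,\infty)$ (an example chosen here for illustration, not taken from the text); its Mellin transform is the Gamma function, $M(s) = \Gamma(s)$ for real $s > 0$. The sketch below approximates the integral by a plain midpoint rule; the truncation point `upper` and the number of steps are ad hoc assumptions.

```python
import math

def mellin_exponential(s, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of x^(s-1) e^(-x) on (0, upper)."""
    h = upper / steps
    return h * sum(((k + 0.5) * h) ** (s - 1) * math.exp(-(k + 0.5) * h)
                   for k in range(steps))

# Compare with math.gamma at a few real arguments.
values = {s: mellin_exponential(s) for s in (1.0, 2.5, 4.0)}
```

Each entry of `values` should be close to `math.gamma(s)`; only real $s$ is treated, so this does not exercise the complex strip used in the theorems of this section.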

In general consider the finite non-negative measure $\mu$ defined on $(\mathbb{R}_+, \mathcal{B})$ by

\[ \mu(A) = \int_{X^{-1}(A)} X^{a-1}\, dP. \]
Then $M(s)/M(a)$ is the characteristic function of a random variable $\xi: x \to \log(x)$ defined on the probability space $(\mathbb{R}_+, \mathcal{B}, P')$ with the probability distribution $P'(\cdot) = \mu(\cdot)/\mu(\mathbb{R}_+)$. Thus the distribution of $\xi$ is determined uniquely by $M(s)$. Since $e^{\xi}$ has distribution $P'(\cdot)$, $\mu$ is determined uniquely by $M(\cdot)$. It remains to notice that if $F$ is the distribution of our original random variable $X$, then $dF = x^{1-a}\mu(dx)+\mu(\mathbb{R}_+)\delta_0(dx)$, so $F(\cdot)$ is determined uniquely, too. $\Box$

Theorem 12 If $X \ge 0$ and $EX^a < \infty$ for some $a > 0$, then the Mellin transform of $X$ is analytic in the strip $1 < \Re s < 1+a$.

Proof. Since for every $s$ with $0 < \Re s < a$ the modulus of the function $\omega \to X^s\log(X)$ is bounded by an integrable function $C_1+C_2|X|^a$, $EX^s$ can be differentiated with respect to $s$ under the expectation sign at each point $s$, $0 < \Re s < a$. $\Box$

1.9  Problems

Problem 1 [[]] Use Fubini's theorem to show that if $XY$, $X$, $Y$ are integrable, then

\[ EXY - EX\,EY = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \big(P(X \ge t, Y \ge s)-P(X \ge t)P(Y \ge s)\big)\, dt\, ds. \]

Problem 2 Let $X \ge 0$ be a random variable and suppose that for every $0 < q < 1$ there is $T = T(q)$ such that

\[ P(X > 2t) \le q\, P(X > t) \text{ for all } t > T. \]
Show that all the moments of $X$ are finite.

Problem 3 Show that if $X \ge 0$ is a random variable such that

\[ P(X > 2t) \le (P(X > t))^2 \text{ for all } t > 0, \]
then $E\exp(\lambda|X|) < \infty$ for some $\lambda > 0$.

Problem 4 Show that if $E\exp(\lambda X^2) = C < \infty$ for some $\lambda > 0$, then

\[ E\exp(tX) \le C\exp\Big(\frac{t^2}{2\lambda}\Big) \]
for all real $t$.

Problem 5 Show that (11) implies $E\{|X|^{|X|}\} < \infty$.

Problem 6 Prove part (v) of Theorem 1.4.

Problem 7 Prove part (vi) of Theorem 1.4.

Problem 8 Prove part (vii) of Theorem 1.4.

Problem 9 Prove the following conditional version of Chebyshev's inequality: if $\mathcal{F}$ is a $\sigma$-field, and $E|X| < \infty$, then

\[ P(|X| > t \mid \mathcal{F}) \le E\{|X| \mid \mathcal{F}\}/t \]
almost surely.

Problem 10 Show that if (X, Y) is uniformly distributed on a circle centered at (0, 0), then for every a, b there is a non-random constant C = C(a, b) such that E{X|aX+bY} = C(a,b)(aX+bY).

Problem 11 Show that if $(U,V,X)$ are such that in distribution $(U,X) \cong (V,X)$, then $E\{U|X\} = E\{V|X\}$ almost surely.

Problem 12 Show that if $X, Y$ are integrable non-degenerate random variables such that

\[ E\{X|Y\} = aY, \quad E\{Y|X\} = bX, \]
then $|ab| \le 1$.

Problem 13 Suppose that $X, Y$ are square-integrable random variables such that

\[ E\{X|Y\} = Y, \quad E\{Y|X\} = 0. \]
Show that $Y = 0$ almost surely.

Problem 14 Show that if X, Y are integrable such that E{X|Y} = Y and E{Y|X} = X, then X = Y a. s.

Problem 15 Prove that if $X \ge 0$, then the function $f(t) := EX^{it}$, where $t \in \mathbb{R}$, determines the distribution of $X$ uniquely.

Problem 16 Prove that the function $f(t) := E\max\{X, t\}$ determines uniquely the distribution of an integrable random variable $X$ in each of the following cases:

Problem 17 Prove that, if $E|X| < \infty$, then the function $f(t) := E|X-t|$ determines uniquely the distribution of $X$.

Problem 18 Let $p > 2$ be fixed. Show that $\exp(-|t|^p)$ is not a characteristic function.

Problem 19 Let Q(t,s) = logf(t,s), where f(t,s) is the joint characteristic function of square-integrable r. v. X,Y.

Problem 20 [see eg. []] Suppose a Î IR is the median of X.

Problem 21 Suppose $(X_n,Y_n)$ converge to $(X,Y)$ in distribution and $\{X_n\}$, $\{Y_n\}$ are uniformly integrable. If $E(X_n|Y_n) = \rho Y_n$ for all $n$, show that $E(X|Y) = \rho Y$.

Problem 22 Prove (18).

Chapter 2
Normal distributions

In this chapter we use linear algebra and characteristic functions to analyze the multivariate normal random variables. More information and other approaches can be found, eg. in [,,]. In Section we give criteria for normality which will be used often in proofs in subsequent chapters.

2.1  Univariate normal distributions

The usual definition of the standard normal variable $Z$ specifies its density $f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. In general, the so-called $N(\mu, \sigma)$ density is given by
\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}. \]
By completing the square one can check that the characteristic function $Ee^{itZ} = \int_{-\infty}^{\infty} e^{itx}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\, dx$ of the standard normal r. v. $Z$ is given by
\[ Ee^{itZ} = e^{-t^2/2}, \]
see Problem .

In multivariate case it is more convenient to use characteristic functions directly. Besides, characteristic functions are our main technical tool and it doesn't hurt to start using them as soon as possible. We shall therefore begin with the following definition.

Definition 4 A real-valued random variable $X$ has the normal $N(\mu, \sigma)$ distribution if its characteristic function has the form

\[ f(t) = \exp\big(it\mu - \tfrac{1}{2}\sigma^2t^2\big), \]
where $\mu, \sigma$ are real numbers.

From Theorem 1.5 it is easy to check by direct differentiation that $\mu = EX$ and $\sigma^2 = Var(X)$. Using (15) it is easy to see that every univariate normal $X$ can be written as

\[ X = \sigma Z + \mu, \tag{23} \]
where $Z$ is the standard $N(0,1)$ random variable with the characteristic function $e^{-t^2/2}$.
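The identification of the parameters with the mean and the variance can be illustrated numerically: the first two moments are recovered from central differences of the characteristic function at $0$, using $EX = -i f'(0)$ and $EX^2 = -f''(0)$. This is only a rough sketch; the parameter values and the step `h` are hypothetical, and finite differences are accurate only up to a small discretization error.

```python
import cmath

def normal_cf(t, mu, sigma):
    """Characteristic function exp(i t mu - sigma^2 t^2 / 2)."""
    return cmath.exp(1j * t * mu - 0.5 * sigma ** 2 * t ** 2)

def moments_from_cf(cf, h=1e-3):
    """Approximate EX and EX^2 by central differences of the ch.f. at 0."""
    d1 = (cf(h) - cf(-h)) / (2 * h)                # approximates f'(0)  = i EX
    d2 = (cf(h) + cf(-h) - 2 * cf(0.0)) / h ** 2   # approximates f''(0) = -EX^2
    return (-1j * d1).real, (-d2).real

mu, sigma = 1.5, 2.0  # hypothetical parameters
m1, m2 = moments_from_cf(lambda t: normal_cf(t, mu, sigma))
variance = m2 - m1 ** 2
```

With these values `m1` is close to `mu` and `variance` is close to `sigma ** 2`.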

The following properties of standard normal distribution N(0,1) are self-evident:

  1. The characteristic function $e^{-t^2/2}$ has analytic extension $e^{-z^2/2}$ to all complex $z \in \mathbb{C}$. Moreover, $e^{-z^2/2} \ne 0$.
  2. Standard normal random variable $Z$ has finite exponential moments $E\exp(\lambda|Z|) < \infty$ for all $\lambda$; moreover, $E\exp(\lambda Z^2) < \infty$ for all $\lambda < 1/2$ (compare Problem 1.9).

Relation (23) translates the above properties to the general $N(\mu,\sigma)$ distributions. Namely, if $X$ is normal, then its characteristic function has non-vanishing analytic extension to $\mathbb{C}$ and
\[ E\exp(\lambda X^2) < \infty \]
for some $\lambda > 0$.

For future reference we state the following simple but useful observation. Computing $EX^k$ for $k = 0, 1, 2$ from Theorem 1.5 we immediately get the following.

Proposition 4 A characteristic function which can be expressed in the form $f(t) = \exp(at^2+bt+c)$ for some complex constants $a, b, c$ corresponds to the normal random variable, i.e. $a \in \mathbb{R}$ and $a < 0$, $b \in i\mathbb{R}$ is imaginary and $c = 0$.

2.2  Multivariate normal distributions

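We follow the usual linear algebra notation. Vectors are denoted by small bold letters $\mathbf{x}, \mathbf{v}, \mathbf{t}$, matrices by capital bold initial letters $\mathbf{A}, \mathbf{B}, \mathbf{C}$ and vector-valued random variables by capital boldface $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$; by the dot we denote the usual dot product in $\mathbb{R}^d$, i.e. $\mathbf{x}\cdot\mathbf{y} := \sum_{j=1}^d x_jy_j$; $\|\mathbf{x}\| = (\mathbf{x}\cdot\mathbf{x})^{1/2}$ denotes the usual Euclidean norm. For typographical convenience we sometimes write $(a_1,\dots,a_k)$ for the column vector $\begin{bmatrix} a_1 \\ \vdots \\ a_k \end{bmatrix}$. By $\mathbf{A}^T$ we denote the transpose of a matrix $\mathbf{A}$.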

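Below we shall also consider another scalar product $\langle\cdot,\cdot\rangle$ associated with the normal distribution; the corresponding semi-norm will be denoted by the triple bar $|||\cdot|||$.

Definition 5 An $\mathbb{R}^d$-valued random variable $\mathbf{Z}$ is multivariate normal, or Gaussian (we shall use both terms interchangeably; the second term will be preferred in abstract situations) if for every $\mathbf{t} \in \mathbb{R}^d$ the real-valued random variable $\mathbf{t}\cdot\mathbf{Z}$ is normal.

Clearly the distribution of the univariate $\mathbf{t}\cdot\mathbf{Z}$ is determined uniquely by its mean $m = m_{\mathbf{t}}$ and its standard deviation $\sigma = \sigma_{\mathbf{t}}$. It is easy to see that $m_{\mathbf{t}} = \mathbf{t}\cdot\mathbf{m}$, where $\mathbf{m} = E\mathbf{Z}$. Indeed, by linearity of the expected value $m_{\mathbf{t}} = E\,\mathbf{t}\cdot\mathbf{Z} = \mathbf{t}\cdot E\mathbf{Z}$. Evaluating the characteristic function $f(s)$ of the real-valued random variable $\mathbf{t}\cdot\mathbf{Z}$ at $s = 1$ we see that the characteristic function of $\mathbf{Z}$ can be written as

\[ f(\mathbf{t}) = \exp\Big(i\,\mathbf{t}\cdot\mathbf{m} - \frac{\sigma_{\mathbf{t}}^2}{2}\Big). \]
In order to rewrite this formula in a more useful form, consider the function $B(\mathbf{x}, \mathbf{y})$ of two arguments $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ defined by
\[ B(\mathbf{x}, \mathbf{y}) = E\{(\mathbf{x}\cdot\mathbf{Z})(\mathbf{y}\cdot\mathbf{Z})\} - m_{\mathbf{x}}m_{\mathbf{y}}. \]
That is, $B(\mathbf{x}, \mathbf{y})$ is the covariance of two real-valued (and jointly Gaussian) random variables $\mathbf{x}\cdot\mathbf{Z}$ and $\mathbf{y}\cdot\mathbf{Z}$.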

The following observations are easy to check.

We shall need the following well known linear algebra fact (the proofs are explained below; an explicit reference is, e.g., []).

Lemma 6 Each bilinear form $B$ has the dot product representation

\[ B(\mathbf{x}, \mathbf{y}) = \mathbf{C}\mathbf{x}\cdot\mathbf{y}, \]
where $\mathbf{C}$ is a linear mapping, represented by a $d\times d$ matrix $\mathbf{C} = [c_{i,j}]$. Furthermore, if $B(\cdot, \cdot)$ is symmetric then $\mathbf{C}$ is symmetric, i.e. we have $\mathbf{C} = \mathbf{C}^T$.

Indeed, expand $\mathbf{x}$ and $\mathbf{y}$ with respect to the standard orthonormal basis $\mathbf{e}_1,\dots,\mathbf{e}_d$. By bilinearity we have $B(\mathbf{x}, \mathbf{y}) = \sum_{i,j}x_iy_j B(\mathbf{e}_i, \mathbf{e}_j)$, which gives the dot product representation with $c_{i,j} = B(\mathbf{e}_i, \mathbf{e}_j)$. Clearly, for symmetric $B(\cdot, \cdot)$ we get $c_{i,j} = c_{j,i}$; hence $\mathbf{C}$ is symmetric.

Lemma 7 If in addition $B(\cdot, \cdot)$ is positive definite then

\[ \mathbf{C} = \mathbf{A}\mathbf{A}^T \tag{24} \]

for a $d\times d$ matrix $\mathbf{A}$. Moreover, $\mathbf{A}$ can be chosen to be symmetric.

The easiest way to see the last fact is to diagonalize $\mathbf{C}$ (this is always possible, as $\mathbf{C}$ is symmetric). The eigenvalues of $\mathbf{C}$ are real and, since $B(\cdot, \cdot)$ is positive definite, they are non-negative. If $\Lambda$ denotes the diagonal matrix of eigenvalues of $\mathbf{C}$ in the diagonal representation $\mathbf{C} = \mathbf{U}\Lambda\mathbf{U}^T$ and $\mathbf{D}$ is the diagonal matrix formed by the square roots of the eigenvalues, then $\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{U}^T$. Moreover, this construction gives symmetric $\mathbf{A} = \mathbf{A}^T$. In general, there is no unique choice of $\mathbf{A}$ and we shall sometimes find it more convenient to use non-symmetric $\mathbf{A}$, see Example below.

The linear algebra results imply that the characteristic function corresponding to a normal distribution on $\mathbb{R}^d$ can be written in the form

\[ f(\mathbf{t}) = \exp\big(i\,\mathbf{t}\cdot\mathbf{m} - \tfrac{1}{2}\mathbf{C}\mathbf{t}\cdot\mathbf{t}\big). \tag{25} \]
Theorem 1.5 identifies $\mathbf{m} \in \mathbb{R}^d$ as the mean of the normal random variable $\mathbf{Z} = (Z_1, \dots, Z_d)$; similarly, double differentiation of $f(\mathbf{t})$ at $\mathbf{t} = 0$ shows that $\mathbf{C} = [c_{i,j}]$ is given by $c_{i,j} = Cov(Z_i, Z_j)$. This establishes the following.

Theorem 13 The characteristic function corresponding to a normal random variable $\mathbf{Z} = (Z_1, \dots, Z_d)$ is given by (25), where $\mathbf{m} = E\mathbf{Z}$ and $\mathbf{C} = [c_{i,j}]$, $c_{i,j} = Cov(Z_i, Z_j)$, is the covariance matrix.

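From (24) and (25) we get also

\[ f(\mathbf{t}) = \exp\big(i\,\mathbf{t}\cdot\mathbf{m} - \tfrac{1}{2}(\mathbf{A}\mathbf{t})\cdot(\mathbf{A}\mathbf{t})\big). \tag{26} \]
In the centered case it is perhaps more intuitive to write $B(\mathbf{x},\mathbf{y}) = \langle\mathbf{x},\mathbf{y}\rangle$; this bilinear product might (in degenerate cases) turn out to be 0 on some non-zero vectors. In this notation (26) can be written as
\[ E\exp(i\,\mathbf{t}\cdot\mathbf{Z}) = \exp\big(-\tfrac{1}{2}\langle\mathbf{t}, \mathbf{t}\rangle\big). \tag{27} \]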

From the above discussion, we have the following multivariate generalization of (23).

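Theorem 14 Each $d$-dimensional normal random variable $\mathbf{Z}$ has the same distribution as $\mathbf{m}+\mathbf{A}\vec{\gamma}$, where $\mathbf{m} \in \mathbb{R}^d$ is deterministic, $\mathbf{A}$ is a (symmetric) $d\times d$ matrix and $\vec{\gamma} = (\gamma_1, \dots, \gamma_d)$ is a random vector such that the components $\gamma_1, \dots, \gamma_d$ are independent $N(0, 1)$ random variables.

Proof. Clearly, $E\exp(i\,\mathbf{t}\cdot(\mathbf{m}+\mathbf{A}\vec{\gamma})) = \exp(i\,\mathbf{t}\cdot\mathbf{m})\, E\exp(i\,\mathbf{t}\cdot(\mathbf{A}\vec{\gamma}))$. Since the characteristic function of $\vec{\gamma}$ is $E\exp(i\,\mathbf{x}\cdot\vec{\gamma}) = \exp(-\frac12\|\mathbf{x}\|^2)$ and $\mathbf{t}\cdot(\mathbf{A}\vec{\gamma}) = (\mathbf{A}^T\mathbf{t})\cdot\vec{\gamma}$, we get

$E\exp(i\,\mathbf{t}\cdot(\mathbf{m}+\mathbf{A}\vec{\gamma})) = \exp(i\,\mathbf{t}\cdot\mathbf{m})\exp(-\frac12\|\mathbf{A}^T\mathbf{t}\|^2)$, which is another form of (26). $\Box$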

Theorem 2.2 can actually be interpreted as an almost sure representation. However, if $\mathbf{A}$ is not of full rank, the number of independent $N(0,1)$ r. v. can be reduced. In addition, the representation $\mathbf{Z} \cong \mathbf{m}+\mathbf{A}\vec{\gamma}$ from Theorem 2.2 is not unique if the symmetry condition is dropped. Theorem gives the same representation with non-symmetric $\mathbf{A} = [\mathbf{e}_1,\dots,\mathbf{e}_k]$. The argument given below has a more geometric flavor. Infinite dimensional generalizations are also known, see () and the comment preceding Lemma .

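Theorem 15 Each $d$-dimensional normal random variable $\mathbf{Z}$ can be written as

\[ \mathbf{Z} = \mathbf{m}+\sum_{j=1}^k \gamma_j\mathbf{e}_j, \tag{28} \]
where $k \le d$, $\mathbf{m} \in \mathbb{R}^d$, $\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_k$ are deterministic linearly independent vectors in $\mathbb{R}^d$ and $\gamma_1, \dots, \gamma_k$ are independent identically distributed normal $N(0, 1)$ random variables.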

Proof. Without loss of generality we may assume EZ = 0 and establish the representation with m = 0.

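Let $\mathbb{H}$ denote the linear span of the columns of $\mathbf{A}$ in $\mathbb{R}^d$, where $\mathbf{A}$ is the matrix from (26). From Theorem 2.2 it follows that with probability one $\mathbf{Z} \in \mathbb{H}$. Consider now $\mathbb{H}$ as a Hilbert space with a scalar product $\langle\mathbf{x}, \mathbf{y}\rangle$, given by $\langle\mathbf{x}, \mathbf{y}\rangle = (\mathbf{A}\mathbf{x})\cdot(\mathbf{A}\mathbf{y})$. Since the null space of $\mathbf{A}$ and the column space of $\mathbf{A}$ have only the zero vector in common, this scalar product is non-degenerate, i.e. $\langle\mathbf{x}, \mathbf{x}\rangle \ne 0$ for $\mathbb{H} \ni \mathbf{x} \ne 0$.

Let $\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_k$ be an orthonormal (with respect to $\langle\cdot,\cdot\rangle$) basis of $\mathbb{H}$, where $k = \dim\mathbb{H}$. By Theorem 2.2 $\mathbf{Z}$ is $\mathbb{H}$-valued. Therefore with probability one we can write $\mathbf{Z} = \sum_{j=1}^k \gamma_j\mathbf{e}_j$, where $\gamma_j = \langle\mathbf{e}_j,\mathbf{Z}\rangle$ are random coefficients in the orthogonal expansion. It remains to verify that $\gamma_1, \dots, \gamma_k$ are i. i. d. normal $N(0, 1)$ r. v. With this in mind, we use (26) to compute their joint characteristic function:

\[ E\exp\Big(i\sum_{j=1}^k t_j\gamma_j\Big) = E\exp\Big(i\sum_{j=1}^k t_j\langle\mathbf{e}_j, \mathbf{Z}\rangle\Big) = E\exp\Big(i\Big\langle\sum_{j=1}^k t_j\mathbf{e}_j, \mathbf{Z}\Big\rangle\Big). \]
By (27)
\[ E\exp\Big(i\Big\langle\sum_{j=1}^k t_j\mathbf{e}_j, \mathbf{Z}\Big\rangle\Big) = \exp\Big(-\frac{1}{2}\Big\langle\sum_{j=1}^k t_j\mathbf{e}_j, \sum_{j=1}^k t_j\mathbf{e}_j\Big\rangle\Big) = \exp\Big(-\frac{1}{2}\sum_{j=1}^k t_j^2\Big). \]
The last equality is a consequence of orthonormality of vectors $\mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_k$ with respect to the scalar product $\langle\cdot, \cdot\rangle$. $\Box$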

The next theorem lists two important properties of the normal distribution that can be easily verified by writing the joint characteristic function. The second property is a consequence of the parallelogram (polarization) identity

\[ |||\mathbf{t}+\mathbf{s}|||^2+ |||\mathbf{t}-\mathbf{s}|||^2 = 2|||\mathbf{t}|||^2+ 2|||\mathbf{s}|||^2, \]
where
\[ |||\mathbf{x}|||^2 := \langle\mathbf{x}, \mathbf{x}\rangle := \|\mathbf{A}\mathbf{x}\|^2; \tag{29} \]
the proof is left as an exercise.

Theorem 16 If X, Y are independent with the same centered normal distribution, then

a) $\frac{X+Y}{\sqrt{2}}$ has the same distribution as $X$;

b) X+Y and X-Y are independent.

Now we consider the multivariate normal density. The density of $\vec{\gamma}$ in Theorem 2.2 is the product of the one-dimensional standard normal densities, i.e.

\[ f_{\vec{\gamma}}(\mathbf{x}) = (2\pi)^{-d/2}\exp\big(-\tfrac{1}{2}\|\mathbf{x}\|^2\big). \]
Suppose that $\det\mathbf{C} \ne 0$, which ensures that $\mathbf{A}$ is nonsingular. By the change of variable formula, from Theorem 2.2 we get the following expression for the multivariate normal density.

Theorem 17 If $\mathbf{Z}$ is centered normal with the nonsingular covariance matrix $\mathbf{C}$, then the density of $\mathbf{Z}$ is given by

\[ f_{\mathbf{Z}}(\mathbf{x}) = (2\pi)^{-d/2}(\det\mathbf{A})^{-1}\exp\big(-\tfrac{1}{2}\|\mathbf{A}^{-1}\mathbf{x}\|^2\big), \]
or
\[ f_{\mathbf{Z}}(\mathbf{x}) = (2\pi)^{-d/2}(\det\mathbf{C})^{-1/2}\exp\big(-\tfrac{1}{2}\mathbf{C}^{-1}\mathbf{x}\cdot\mathbf{x}\big), \]
where matrices $\mathbf{A}$ and $\mathbf{C}$ are related by (24).

In the nonsingular case this immediately implies strong integrability.

Theorem 18 If $\mathbf{Z}$ is normal, then there is $\epsilon > 0$ such that

\[ E\exp(\epsilon\|\mathbf{Z}\|^2) < \infty. \]

Remark: Theorem 2.2 holds true also in the singular case and for Gaussian random variables with values in infinite dimensional spaces; for the proof based on Theorem 2.2, see Theorem below.

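The Hilbert space $\mathbb{H}$ introduced in the proof of Theorem 2.2 is called the Reproducing Kernel Hilbert Space (RKHS) of a normal distribution, cf. [,]. It can be defined also in more general settings. Suppose we want to consider jointly two independent normal r. v. $\mathbf{X}$ and $\mathbf{Y}$, taking values in $\mathbb{R}^{d_1}$ and $\mathbb{R}^{d_2}$ respectively, with corresponding reproducing kernel Hilbert spaces $\mathbb{H}_1$, $\mathbb{H}_2$ and the corresponding dot products $\langle\cdot,\cdot\rangle_1$ and $\langle\cdot,\cdot\rangle_2$. Then the $\mathbb{R}^{d_1+d_2}$-valued random variable $(\mathbf{X}, \mathbf{Y})$ has the orthogonal sum $\mathbb{H}_1\oplus\mathbb{H}_2$ as the Reproducing Kernel Hilbert Space.

This method shows further geometric aspects of jointly normal random variables. Suppose an $\mathbb{R}^{d_1+d_2}$-valued random variable $(\mathbf{X}, \mathbf{Y})$ is (jointly) normal and has $\mathbb{H}$ as the reproducing kernel Hilbert space (with the scalar product $\langle\cdot,\cdot\rangle$). Recall that

$\mathbb{H} = \Big\{\mathbf{A}\begin{bmatrix}\mathbf{x}\\ \mathbf{y}\end{bmatrix}: \begin{bmatrix}\mathbf{x}\\ \mathbf{y}\end{bmatrix} \in \mathbb{R}^{d_1+d_2}\Big\}$. Let $\mathbb{H}_Y$ be the subspace of $\mathbb{H}$ spanned by the vectors $\Big\{\begin{bmatrix}0\\ \mathbf{y}\end{bmatrix}: \mathbf{y} \in \mathbb{R}^{d_2}\Big\}$; similarly let $\mathbb{H}_X$ be the subspace of $\mathbb{H}$ spanned by the vectors $\begin{bmatrix}\mathbf{x}\\ 0\end{bmatrix}$. Let $\mathbf{P}$ be (the matrix of) the linear transformation $\mathbb{H}_X \to \mathbb{H}_Y$ obtained from the $\langle\cdot,\cdot\rangle$-orthogonal projection $\mathbb{H} \to \mathbb{H}_Y$ by narrowing its domain to $\mathbb{H}_X$. Denote $\mathbf{Q} = \mathbf{P}^T$; $\mathbf{Q}$ represents the orthogonal projection in the dual norm defined in Section below.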

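Theorem 19 If $(\mathbf{X}, \mathbf{Y})$ has jointly normal distribution on $\mathbb{R}^{d_1+d_2}$, then random vectors $\mathbf{X}-\mathbf{Q}\mathbf{Y}$ and $\mathbf{Y}$ are stochastically independent.

Proof. The joint characteristic function of $\mathbf{X}-\mathbf{Q}\mathbf{Y}$ and $\mathbf{Y}$ factors as follows:

\begin{align*}
f(\mathbf{t}, \mathbf{s}) &= E\exp\big(i\,\mathbf{t}\cdot(\mathbf{X}-\mathbf{Q}\mathbf{Y})+i\,\mathbf{s}\cdot\mathbf{Y}\big) \\
&= E\exp\big(i\,\mathbf{t}\cdot\mathbf{X}-i\,\mathbf{P}\mathbf{t}\cdot\mathbf{Y}+i\,\mathbf{s}\cdot\mathbf{Y}\big) \\
&= \exp\Big(-\frac{1}{2}\Big|\Big|\Big|\begin{bmatrix}\mathbf{t}\\ \mathbf{s}-\mathbf{P}\mathbf{t}\end{bmatrix}\Big|\Big|\Big|^2\Big) = \exp\Big(-\frac{1}{2}\Big|\Big|\Big|\begin{bmatrix}\mathbf{t}\\ -\mathbf{P}\mathbf{t}\end{bmatrix}\Big|\Big|\Big|^2\Big)\exp\Big(-\frac{1}{2}\Big|\Big|\Big|\begin{bmatrix}0\\ \mathbf{s}\end{bmatrix}\Big|\Big|\Big|^2\Big).
\end{align*}
The last identity holds because by our choice of $\mathbf{P}$, vectors $\begin{bmatrix}0\\ \mathbf{s}\end{bmatrix}$ and $\begin{bmatrix}\mathbf{t}\\ -\mathbf{P}\mathbf{t}\end{bmatrix}$ are orthogonal with respect to the scalar product $\langle\cdot,\cdot\rangle$. $\Box$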
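In particular, since $E\{\mathbf{X}|\mathbf{Y}\} = E\{\mathbf{X}-\mathbf{Q}\mathbf{Y}|\mathbf{Y}\}+\mathbf{Q}\mathbf{Y}$, we get

Corollary 7 If both $\mathbf{X}$ and $\mathbf{Y}$ have mean zero, then

\[ E\{\mathbf{X}|\mathbf{Y}\} = \mathbf{Q}\mathbf{Y}. \]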

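For general multivariate normal random variables $\mathbf{X}$ and $\mathbf{Y}$, applying the above to centered normal random variables $\mathbf{X}-\mathbf{m}_X$ and $\mathbf{Y}-\mathbf{m}_Y$ respectively, we get

\[ E\{\mathbf{X}|\mathbf{Y}\} = \mathbf{a}+\mathbf{Q}\mathbf{Y}; \tag{30} \]
vector $\mathbf{a} = \mathbf{m}_X-\mathbf{Q}\mathbf{m}_Y$ and matrix $\mathbf{Q}$ are determined by the expected values $\mathbf{m}_X$, $\mathbf{m}_Y$ and by the (joint) covariance matrix $\mathbf{C}$ (uniquely if the covariance $\mathbf{C}_Y$ of $\mathbf{Y}$ is non-singular). To find $\mathbf{Q}$, multiply (30) (as a column vector) from the right by $(\mathbf{Y}-E\mathbf{Y})^T$ and take the expected value. By Theorem 1.4(i) we get $\mathbf{Q} = \mathbf{R}\mathbf{C}_Y^{-1}$, where we have written $\mathbf{C}$ as the (suitable) block matrix

$\mathbf{C} = \begin{bmatrix}\mathbf{C}_X & \mathbf{R}\\ \mathbf{R}^T & \mathbf{C}_Y\end{bmatrix}$. An alternative proof of (30) (and of Corollary 2.2) is to use the converse to Theorem 1.5.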

Equality (30) is usually referred to as linearity of regression. For the bivariate normal distribution it takes the form E{X|Y} = a+bY and it can be established by direct integration; for more than two variables computations become more difficult and the characteristic functions are quite handy.

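Corollary 8 Suppose $(\mathbf{X}, \mathbf{Y})$ has a (joint) normal distribution on $\mathbb{R}^{d_1+d_2}$ and $\mathbb{H}_X$, $\mathbb{H}_Y$ are $\langle\cdot,\cdot\rangle$-orthogonal, i.e. every component of $\mathbf{X}$ is uncorrelated with all components of $\mathbf{Y}$. Then $\mathbf{X}$, $\mathbf{Y}$ are independent.

Indeed, in this case $\mathbf{Q}$ is the zero matrix; the conclusion follows from Theorem 2.2.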

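Example 2 In this example we consider a pair of (jointly) normal random variables $X_1, X_2$. For simplicity of notation we suppose $EX_1 = 0$, $EX_2 = 0$. Let $Var(X_1) = \sigma_1^2$, $Var(X_2) = \sigma_2^2$ and denote $corr(X_1, X_2) = \rho$. Then

$\mathbf{C} = \begin{bmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{bmatrix}$ and the joint characteristic function is

\[ f(t_1, t_2) = \exp\big(-\tfrac{1}{2}t_1^2\sigma_1^2-\tfrac{1}{2}t_2^2\sigma_2^2-t_1t_2\rho\sigma_1\sigma_2\big). \]
If $\sigma_1\sigma_2 \ne 0$ we can normalize the variables and consider the pair $Y_1 = X_1/\sigma_1$ and $Y_2 = X_2/\sigma_2$. The covariance matrix of the last pair is

$\mathbf{C}_Y = \begin{bmatrix}1 & \rho\\ \rho & 1\end{bmatrix}$; the corresponding scalar product is given by

\[ \Big\langle\begin{bmatrix}x_1\\ x_2\end{bmatrix}, \begin{bmatrix}y_1\\ y_2\end{bmatrix}\Big\rangle = x_1y_1+x_2y_2+\rho x_1y_2+\rho x_2y_1 \]
and the corresponding RKHS norm is

$\Big|\Big|\Big|\begin{bmatrix}x_1\\ x_2\end{bmatrix}\Big|\Big|\Big| = (x_1^2+x_2^2+2\rho x_1x_2)^{1/2}$. Notice that when $\rho = \pm 1$ the RKHS norm is degenerate and equals $|x_1\pm x_2|$.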

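Denoting $\rho = \sin 2\theta$, it is easy to check that

$\mathbf{A}_Y = \begin{bmatrix}\cos\theta & \sin\theta\\ \sin\theta & \cos\theta\end{bmatrix}$ and its inverse

$\mathbf{A}_Y^{-1} = \frac{1}{\cos 2\theta}\begin{bmatrix}\cos\theta & -\sin\theta\\ -\sin\theta & \cos\theta\end{bmatrix}$ exists if $\theta \ne \pm\pi/4$, i.e. when $\rho^2 \ne 1$. This implies that the joint density of $Y_1$ and $Y_2$ is given by

\[ f(x, y) = \frac{1}{2\pi\cos 2\theta}\exp\Big(-\frac{1}{2\cos^2 2\theta}\big(x^2+y^2-2xy\sin 2\theta\big)\Big). \tag{31} \]
We can easily verify that in this case Theorem 2.2 gives
\[ Y_1 = \gamma_1\cos\theta+\gamma_2\sin\theta, \qquad Y_2 = \gamma_1\sin\theta+\gamma_2\cos\theta \]
for some i. i. d. normal $N(0, 1)$ r. v. $\gamma_1, \gamma_2$. One way to see this is to compare the variances and the covariances of both sides. Another representation $Y_1 = \gamma_1$, $Y_2 = \rho\gamma_1+\sqrt{1-\rho^2}\,\gamma_2$ illustrates non-uniqueness and makes Theorem 2.2 obvious in the bivariate case.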

Returning to our original random variables $X_1, X_2$, we have $X_1 = \gamma_1\sigma_1\cos\theta+\gamma_2\sigma_1\sin\theta$ and $X_2 = \gamma_1\sigma_2\sin\theta+\gamma_2\sigma_2\cos\theta$; this representation holds true also in the degenerate case.

To illustrate previous theorems, notice that Corollary 2.2 in the bivariate case follows immediately from (31). Theorem 2.2 says in this case that $Y_1-\rho Y_2$ and $Y_2$ are independent; this can also be easily checked either by using density (31) directly, or (easier) by verifying that $Y_1-\rho Y_2$ and $Y_2$ are uncorrelated.
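The two representations in this example can be compared by computing the product of each coefficient matrix with its transpose and checking that both reproduce the covariance matrix with unit diagonal and off-diagonal entry $\sin 2\theta$. The sketch below is illustrative only; the angle `theta` is a hypothetical value with $|\theta| < \pi/4$.

```python
import math

def matmul_t(a):
    """Return A A^T for a 2x2 matrix given as nested lists."""
    return [[sum(a[i][k] * a[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

theta = 0.3                      # hypothetical angle, |theta| < pi/4
rho = math.sin(2 * theta)

# Symmetric choice: Y1 = g1 cos(theta) + g2 sin(theta), Y2 = g1 sin(theta) + g2 cos(theta).
a_sym = [[math.cos(theta), math.sin(theta)],
         [math.sin(theta), math.cos(theta)]]
# Non-symmetric choice: Y1 = g1, Y2 = rho*g1 + sqrt(1 - rho^2)*g2.
a_tri = [[1.0, 0.0],
         [rho, math.sqrt(1 - rho ** 2)]]

cov_sym = matmul_t(a_sym)
cov_tri = matmul_t(a_tri)
# Cov(Y1 - rho*Y2, Y2) = Cov(Y1, Y2) - rho * Var(Y2), which should vanish.
cov_resid = cov_sym[0][1] - rho * cov_sym[1][1]
```

Both `cov_sym` and `cov_tri` equal the same covariance matrix up to rounding, and `cov_resid` is zero, in line with the uncorrelatedness check mentioned above.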

Example 3 In this example we analyze a discrete time Gaussian random walk $\{X_k\}_{0 \le k \le T}$. Let $\xi_1,\xi_2,\dots$ be i. i. d. $N(0,1)$. We are interested in explicit formulas for the characteristic function and for the density of the $\mathbb{R}^T$-valued random variable $\mathbf{X} = (X_1, X_2, \dots, X_T)$, where

\[ X_k = \sum_{j=1}^k \xi_j \tag{32} \]
are partial sums.

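Clearly, $\mathbf{m} = 0$. Comparing (32) with (28) we observe that

\[ \mathbf{A} = \begin{bmatrix} 1 & 0 & \dots & 0\\ 1 & 1 & \dots & 0\\ \vdots & & \ddots & \vdots\\ 1 & 1 & \dots & 1 \end{bmatrix}. \]
Since this $\mathbf{A}$ is not symmetric, (26) has to be used in the form $\exp(-\frac12\|\mathbf{A}^T\mathbf{t}\|^2)$ obtained in the proof of Theorem 2.2, which gives
\[ f(\mathbf{t}) = \exp\Big(-\frac{1}{2}\big((t_1+t_2+\dots+t_T)^2+(t_2+\dots+t_T)^2+\dots+t_T^2\big)\Big). \]
To find the formula for the joint density, notice that $\mathbf{A}$ is the matrix representation of the linear operator which to a given sequence of numbers $(x_1,x_2,\dots,x_T)$ assigns the sequence of its partial sums $(x_1, x_1+x_2,\dots, x_1+x_2+\dots+x_T)$. Therefore, its inverse is the finite difference operator $\Delta: (x_1,x_2,\dots,x_T) \to (x_1, x_2-x_1,\dots, x_T-x_{T-1})$. This implies
\[ \mathbf{A}^{-1} = \begin{bmatrix} 1 & 0 & 0 & \dots & \dots & 0\\ -1 & 1 & 0 & \dots & \dots & 0\\ 0 & -1 & 1 & \dots & \dots & 0\\ 0 & 0 & -1 & \dots & \dots & 0\\ \vdots & & & \ddots & \ddots & \vdots\\ 0 & \dots & 0 & \dots & -1 & 1 \end{bmatrix}. \]
Since $\det\mathbf{A} = 1$, we get
\[ f(\mathbf{x}) = (2\pi)^{-T/2}\exp\Big(-\frac{1}{2}\big(x_1^2+(x_2-x_1)^2+\dots+(x_T-x_{T-1})^2\big)\Big). \tag{33} \]
Interpreting $\mathbf{X}$ as the discrete time process $X_1, X_2,\dots$, the probability density function for its trajectory $\mathbf{x}$ is given by $f(\mathbf{x}) = C\exp(-\frac12\|\Delta\mathbf{x}\|^2)$. The expression $\frac12\|\Delta\mathbf{x}\|^2$ can be interpreted as proportional to the kinetic energy of the motion described by the path $\mathbf{x}$; assigning probabilities by $Ce^{-\mathrm{Energy}/(kT)}$ is a well known practice in statistical physics. In continuous time, the derivative plays an analogous role, compare Schilder's theorem [].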
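The matrix identities used in this example are easy to confirm for a small horizon. The sketch below builds the partial-sum matrix and the finite difference matrix with exact integer arithmetic, checks that they are mutually inverse, and forms $\mathbf{A}\mathbf{A}^T$, whose entries are $\min(i,j)$, the covariance of the random walk. The horizon `T = 5` and the helper names are arbitrary illustrative choices.

```python
def partial_sum_matrix(T):
    """Lower-triangular matrix of ones: maps increments to partial sums."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

def difference_matrix(T):
    """Finite difference operator: 1 on the diagonal, -1 just below it."""
    return [[1 if j == i else -1 if j == i - 1 else 0 for j in range(T)]
            for i in range(T)]

def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(row) for row in zip(*a)]

T = 5
A = partial_sum_matrix(T)
D = difference_matrix(T)
identity = [[1 if i == j else 0 for j in range(T)] for i in range(T)]
C = matmul(A, transpose(A))   # covariance matrix of (X_1, ..., X_T)
```

With 0-based indices, `C[i][j]` equals `min(i, j) + 1`, i.e. $Cov(X_i, X_j) = \min(i,j)$ in the 1-based notation of the text.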

2.3  Analytic characteristic functions

The characteristic function $f(t)$ of the univariate normal distribution is a well defined differentiable function of complex argument $t$. That is, $f$ has analytic extension to the complex plane $\mathbb{C}$. The theory of functions of a complex variable provides a powerful tool; we shall use it to recognize the normal characteristic functions. Deeper theory of analytic characteristic functions and stronger versions of the theorems below can be found in monographs [,].

Definition 6 We shall say that a characteristic function $f(t)$ is analytic if it can be extended from the real line $\mathbb{R}$ to a function analytic in a domain in the complex plane $\mathbb{C}$.

Because of uniqueness we shall use the same symbol f to denote both.

Clearly, normal distribution has analytic characteristic function. Example 1.5 presents a non-analytic characteristic function.

We begin with the probabilistic (moment) condition for the existence of the analytic extension.

Theorem 20 If a random variable $X$ has finite exponential moment $E\exp(a|X|) < \infty$, where $a > 0$, then its characteristic function $f(s)$ is analytic in the strip $-a < \Im s < a$.

Proof. The analytic extension is given explicitly: $f(s) = E\exp(isX)$. It remains only to check that $f(s)$ is differentiable in the strip $-a < \Im s < a$. This follows either by differentiation with respect to $s$ under the expectation sign (the latter is allowed, since $E\{|X|\exp(|\Im s|\,|X|)\} < \infty$, provided $-a < \Im s < a$), or by writing directly the series expansion: $f(s) = \sum_{n=0}^\infty i^nEX^n s^n/n!$ (the last equality follows by switching the order of integration and summation, i.e. by Fubini's theorem). The series is easily seen to be absolutely convergent for all $|s| \le a$, since $\sum_n |s|^nE|X|^n/n! = E\exp(|s|\,|X|)$. $\Box$

Corollary 9 If $X$ is such that $E\exp(a|X|) < \infty$ for every real $a > 0$, then its characteristic function $f(s)$ is analytic in $\mathbb{C}$.

The next result says that normal distribution is determined uniquely by its moments. For more information on the moment problem, the reader is referred to the beautiful book by N. I. Akhiezer [].

Corollary 10 If $X$ is a random variable with finite moments of all orders and such that $EX^k = EZ^k$, $k = 1, 2,\dots$, where $Z$ is normal, then $X$ is normal.

Proof. By the Taylor expansion

\[ E\exp(a|X|) = \sum_k a^kE|X|^k/k! = E\exp(a|Z|) < \infty \]
for all real $a > 0$. Therefore by Corollary 2.3 the characteristic function of $X$ is analytic in $\mathbb{C}$ and it is determined uniquely by its Taylor expansion coefficients at 0. However, by Theorem 1.5(ii) the coefficients are determined uniquely by the moments of $X$. Since those are the same as the corresponding moments of the normal r. v. $Z$, both characteristic functions are equal. $\Box$
We shall also need the following refinement of Corollary 2.3.

Corollary 11 Let $f(t)$ be a characteristic function, and suppose there is $\sigma^2 > 0$ and a sequence $\{t_k\}$ convergent to 0 such that $f(t_k) = \exp(-\sigma^2t_k^2)$ and $t_k \ne 0$ for all $k$. Then $f(t) = \exp(-\sigma^2t^2)$ for every $t \in \mathbb{R}$.

Proof. The idea of the proof is simply to calculate all the derivatives at 0 of $f(t)$ along the sequence $\{t_k\}$. Since the derivatives determine moments uniquely, by Corollary 2.3 we shall conclude that $f(t) = \exp(-\sigma^2t^2)$. The only nuisance is to establish that all the moments of the distribution are finite. This fact is established by modifying the usual proof of Theorem 1.5(iii). Let $\Delta_t^2$ be a symmetric second order difference operator, i.e.

\[ \Delta_t^2(g)(y) := \frac{g(y+t)+g(y-t)-2g(y)}{t^2}. \]
The assumption that $f(t)$ is differentiable $2n$ times along the sequence $\{t_k\}$ implies that

\[ \sup_k |\Delta_{t(k)}^{2n}(f)(0)| = \sup_k |\Delta_{t(k)}^2\Delta_{t(k)}^2\cdots\Delta_{t(k)}^2(f)(0)| < \infty. \]
Indeed, the assumption says that $\lim_{k\to\infty}\Delta_{t(k)}^{2n}(f)(0)$ exists for all $n$. Therefore to end the proof we need the following result.

Claim 1 If $f(t)$ is the characteristic function of a random variable $X$, $t(k) \to 0$ is a given sequence such that $t(k) \ne 0$ for all $k$ and

\[ \sup_k |\Delta_{t(k)}^{2n}(f)(0)| < \infty \]
for an integer $n$, then $EX^{2n} < \infty$.

The proof of the claim rests on the formula which can be verified by elementary calculations:

\[ \{\Delta_t^2\exp(iay)\}(y)\big|_{y=x} = -4t^{-2}\exp(iax)\sin^2(at/2). \]
This permits us to express recursively the higher order differences, giving
\[ \{\Delta_{t(k)}^{2n}\exp(iay)\}(y)\big|_{y=x} = (-4)^nt(k)^{-2n}\sin^{2n}(at(k)/2)\exp(iax). \]
Therefore
\[ |\Delta_{t(k)}^{2n}(f)(0)| = 4^nt(k)^{-2n}E\sin^{2n}(t(k)X/2) \ge 4^nt(k)^{-2n}E\,1_{\{|X| \le 2/|t(k)|\}}\sin^{2n}(t(k)X/2). \]
The graph of $\sin(x)$ shows that the inequality $|\sin(x)| \ge \frac{2}{\pi}|x|$ holds for all $|x| \le \frac{\pi}{2}$. Therefore
\[ |\Delta_{t(k)}^{2n}(f)(0)| \ge \Big(\frac{2}{\pi}\Big)^{2n}E\,1_{\{|X| \le 2/|t(k)|\}}X^{2n}. \]
By the monotone convergence theorem
\[ EX^{2n} \le \limsup_{k\to\infty} E\,1_{\{|X| \le 2/|t(k)|\}}X^{2n} < \infty, \]
which ends the proof. $\Box$
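The elementary identity for iterated second differences of $\exp(iay)$ can be confirmed numerically. The sketch below applies the difference operator recursively and compares the result with the closed form carrying the factor $(-4)^n$, which has the same modulus as $4^n$, so the moment estimate in the proof is unaffected by the sign. The values of `a`, `x`, `t` and `n` are hypothetical.

```python
import cmath
import math

def second_difference(g, y, t):
    """Symmetric second-order difference (g(y+t) + g(y-t) - 2 g(y)) / t^2."""
    return (g(y + t) + g(y - t) - 2 * g(y)) / t ** 2

def iterated_difference(g, y, t, n):
    """Apply the second-order difference operator n times, then evaluate at y."""
    if n == 0:
        return g(y)
    return second_difference(lambda u: iterated_difference(g, u, t, n - 1), y, t)

a, x, t, n = 1.3, 0.7, 0.5, 3          # hypothetical values

def wave(y):
    return cmath.exp(1j * a * y)

numeric = iterated_difference(wave, x, t, n)
closed = (-4) ** n * t ** (-2 * n) * math.sin(a * t / 2) ** (2 * n) * wave(x)
```

The recursion evaluates `wave` at $3^n$ points, which is fine for small `n`; `numeric` and `closed` agree up to rounding.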
The next result is converse to Theorem 2.3.

Theorem 21 If the characteristic function f(t) of a random variable X has the analytic extension in a neighborhood of 0 in \sf CC, and the extension is such that the Taylor expansion series at 0 has convergence radius R £ ¥, then Eexp(a|X|) < ¥ for all 0 £ a < R.

Proof. By assumption, f(s) has derivatives of all orders. Thus the moments of all orders are finite and

mk = EXk = (-i)k k
sk
f(s) ê
ê
ê


s = 0 
, k ³ 1.
Taylor's expansion of f(s) at s = 0 is given by f(s) = åk = 0¥ ikmksk/k!. The series has convergence radius R if and only if limsupk® ¥ (mk/k!)1/k = 1/R. This implies that for any 0 £ a < A < R, there is k0, such that mk £ A-kk! for all k ³ k0. Hence Eexp(a|X|) = åk = 0¥ akmk/k! < ¥, which ends the proof of the theorem. [¯]
Theorems 2.3 and 2.3 combined together imply the following.

Corollary 12 If a characteristic function $f(t)$ can be extended analytically to the disc $|s| < a$, then it has the analytic extension $f(s) = E\exp(isX)$ to the strip $-a < \Im s < a$.

2.4  Hermite expansions

A normal $N(0,1)$ r. v. $Z$ defines a dot product $\langle f, g\rangle = Ef(Z)g(Z)$, provided that $f(Z)$ and $g(Z)$ are square integrable functions on $\Omega$. In particular, the dot product is well defined for polynomials. One can apply the usual Gram-Schmidt orthogonalization algorithm to the functions $1, Z, Z^2, \dots$. This produces orthogonal polynomials in the variable $Z$ known as Hermite polynomials. They play an important role and can be equivalently defined by

\[
H_n(x) = (-1)^n \exp(x^2/2)\frac{d^n}{dx^n}\exp(-x^2/2).
\]

Hermite polynomials actually form an orthogonal basis of $L_2(Z)$. In particular, every function $f$ such that $f(Z)$ is square integrable can be expanded as $f(x) = \sum_{k=0}^\infty f_k H_k(x)$, where $f_k \in \mathbb{R}$ are the Fourier coefficients of $f(\cdot)$; the convergence is in $L_2(Z)$, i.e. in the weighted $L_2$ norm on the real line, $L_2(\mathbb{R}, e^{-x^2/2}dx)$.
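The orthogonality of the Hermite polynomials with respect to the dot product $Ef(Z)g(Z)$ can be verified exactly for small degrees. The sketch below assumes the standard three-term recurrence $H_{k+1}(x) = xH_k(x) - kH_{k-1}(x)$, which is equivalent to the derivative formula above, and the known moments $EZ^{2j} = (2j-1)!!$; the helper names are hypothetical.

```python
def hermite_coeffs(n):
    """Coefficients of H_n, lowest degree first, via H_{k+1} = x H_k - k H_{k-1}."""
    prev, cur = [1], [0, 1]
    if n == 0:
        return prev
    for k in range(1, n):
        shifted = [0] + cur                      # coefficients of x * H_k
        scaled = [k * c for c in prev] + [0, 0]  # coefficients of k * H_{k-1}, padded
        prev, cur = cur, [s - p for s, p in zip(shifted, scaled)]
    return cur

def normal_moment(k):
    """E Z^k for standard normal Z: 0 for odd k, (k-1)!! for even k."""
    if k % 2:
        return 0
    result = 1
    for j in range(1, k, 2):
        result *= j
    return result

def dot(m, n):
    """Exact value of E H_m(Z) H_n(Z), computed from the polynomial coefficients."""
    cm, cn = hermite_coeffs(m), hermite_coeffs(n)
    return sum(ci * dj * normal_moment(i + j)
               for i, ci in enumerate(cm) for j, dj in enumerate(cn))

gram = [[dot(m, n) for n in range(5)] for m in range(5)]
for row in gram:
    print(row)
```

The printed Gram matrix should be diagonal; its diagonal entries are the squared norms of $H_0, \dots, H_4$ in $L_2(Z)$.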

The following is the classical Mehler's formula.

Theorem 22 For a bivariate normal r. v. $X, Y$ with $EX = EY = 0$, $EX^2 = EY^2 = 1$, $EXY = \rho$, the joint density $q(x,y)$ of $X, Y$ is given by

\[
q(x,y) = \sum_{k=0}^\infty \frac{\rho^k}{k!} H_k(x)H_k(y)\, q(x)\, q(y),
\tag{34}
\]
where $q(x) = (2\pi)^{-1/2}\exp(-x^2/2)$ is the marginal density.

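Proof. By Fourier's inversion formula we have

\[
q(x,y) = \frac{1}{(2\pi)^2}\int\!\!\int \exp(itx+isy)\exp\left(-\tfrac{1}{2}t^2-\tfrac{1}{2}s^2\right)\exp(-\rho ts)\, dt\, ds.
\]
Since $(-1)^k t^k s^k \exp(itx+isy) = \frac{\partial^{2k}}{\partial x^k\partial y^k}\exp(itx+isy)$, expanding $e^{-\rho ts}$ into the Taylor series we get
\[
q(x,y) = \sum_{k=0}^\infty \frac{\rho^k}{k!}\frac{\partial^{2k}}{\partial x^k\partial y^k}\, q(x)\, q(y).
\]
By the definition of $H_k$ we have $\frac{d^k}{dx^k}q(x) = (-1)^k H_k(x)q(x)$, which gives (34). $\Box$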
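Mehler's formula (34) lends itself to a quick numerical sanity check: a truncated series can be compared with the closed-form density of the standardized bivariate normal distribution. The sketch below is only an illustration; the truncation level `kmax` and the sample point are arbitrary assumptions, and the Hermite values are computed by the recurrence $H_{k+1}(x) = xH_k(x) - kH_{k-1}(x)$.

```python
import math

def hermite_values(x, kmax):
    """Return [H_0(x), ..., H_kmax(x)] using H_{k+1} = x H_k - k H_{k-1}."""
    values = [1.0, x]
    for k in range(1, kmax):
        values.append(x * values[k] - k * values[k - 1])
    return values[:kmax + 1]

def q(x):
    """Standard normal marginal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def mehler_series(x, y, rho, kmax=80):
    """Truncated right-hand side of (34)."""
    hx, hy = hermite_values(x, kmax), hermite_values(y, kmax)
    total = sum(rho**k / math.factorial(k) * hx[k] * hy[k] for k in range(kmax + 1))
    return total * q(x) * q(y)

def bivariate_density(x, y, rho):
    """Closed-form density of the standardized bivariate normal distribution."""
    det = 1 - rho * rho
    return math.exp(-(x * x - 2 * rho * x * y + y * y) / (2 * det)) / (2 * math.pi * math.sqrt(det))

x, y, rho = 0.3, -0.7, 0.5  # hypothetical sample point
print(mehler_series(x, y, rho), bivariate_density(x, y, rho))
```

For $|\rho|$ close to 1 the series converges slowly, so a larger truncation level would be needed.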

2.5  Cramer and Marcinkiewicz theorems

The next lemma is a direct application of analytic functions theory.

Lemma 8 If $X$ is a random variable such that $E\exp(\lambda X^2) < \infty$ for some $\lambda > 0$, and the analytic extension $f(z)$ of the characteristic function of $X$ satisfies $f(z) \ne 0$ for all $z \in \mathbb{C}$, then $X$ is normal.

Proof. By the assumption, $g(z) = \log f(z)$ is well defined and analytic for all $z \in \mathbb{C}$. Furthermore if $z = x+iy$ is the decomposition of $z \in \mathbb{C}$ into its real and imaginary parts, then $\Re g(z) = \log|f(z)| \le \log(E\exp|yX|)$. Notice that $E\exp(tX) \le C\exp(\frac{t^2}{2\lambda})$ for all real $t$, see Problem 1.9. Indeed, since $\lambda X^2 + t^2/\lambda \ge 2tX$, we have $E\exp(tX) \le E\exp((\lambda X^2 + t^2/\lambda)/2) = C\exp(\frac{t^2}{2\lambda})$. These two facts together imply $\Re g(z) \le \mathrm{const} + \frac{y^2}{2\lambda}$. Therefore a variant of the Liouville theorem [] implies that $g(z)$ is a quadratic polynomial in the variable $z$, i.e. $g(z) = A + Bz + Cz^2$. It is easy to see that the coefficients are $A = 0$, $B = iE\{X\}$, $C = -\mathrm{Var}(X)/2$, compare Proposition 2.1. $\Box$
From Lemma 2.5 we obtain quickly the following important theorem, due to H. Cramer [].

Theorem 23 If X1 and X2 are independent random variables such that X1+X2 has a normal distribution, then each of the variables X1, X2 is normal.

Theorem 2.5 is the celebrated Cramer's decomposition theorem; for extensions, see []. Cramer's theorem complements nicely the Central Limit Theorem in the following sense. While the Central Limit Theorem asserts that the distribution of the sum of i. i. d. random variables with finite variances is close to normal, Cramer's theorem says that it cannot be exactly normal, except when we start with a normal sequence. This resembles the propagation of chaos phenomenon, where one proves that a dynamical system approaches chaotic behavior, but never reaches it except from initially chaotic configurations. We shall use Theorem 2.5 as a technical tool.


Proof of Theorem 2.5. Without loss of generality we may assume $EX_1 = EX_2 = 0$. The proof of Theorem 1.6 (iii) implies that $E\exp(aX_j^2) < \infty$, $j = 1, 2$. Therefore, by Theorem 2.3, the corresponding characteristic functions $f_1(\cdot)$, $f_2(\cdot)$ are analytic. By the uniqueness of the analytic extension, $f_1(s)f_2(s) = \exp(-s^2/2)$ for all $s \in \mathbb{C}$. Thus $f_j(z) \ne 0$ for all $z \in \mathbb{C}$, $j = 1, 2$, and by Lemma 2.5 both characteristic functions correspond to normal distributions. $\Box$
The next theorem is useful in recognizing the normal distribution from what at first sight seems to be incomplete information about a characteristic function. The result and the proof come from Marcinkiewicz [], cf. [].

Theorem 24 Let Q(t) be a polynomial, and suppose that a characteristic function f has the representation f(t) = expQ(t) for all t close enough to 0. Then Q is of degree at most 2 and f corresponds to a normal distribution.

Proof. First note that the formula $f(s) = \exp Q(s)$, $s \in \mathbb{C}$, defines the analytic extension of $f$. Thus, by Corollary 2.3, $f(s) = E\exp(isX)$, $s \in \mathbb{C}$. By Theorem 2.5, it suffices to show that $f(s)f(-s)$ corresponds to the normal distribution. Clearly $f(s)f(-s)$ also has the form $\exp(P(s))$, where $P(s)$ is a polynomial that has only even terms, i.e. $P(s) = \sum_{k=0}^n a_k s^{2k}$. Since $f(s)f(-s) = |f(s)|^2$ is a real number for all real $s$, the coefficients $a_1, \dots, a_n$ of the polynomial $P(\cdot)$ are real. Moreover, the $n$-th coefficient satisfies $a_n = -\gamma^2 < 0$, as the inequality $|f(t)| \le 1$ holds for arbitrarily large real $t$. In the remainder of the proof we write $f$ for the product $f(s)f(-s)$ and $X$ for the corresponding symmetric random variable. Taking $z = N\exp(i\pi/(2n))$, we obtain

\[
|f(z)| \ge \exp(N^{2n}(\gamma^2 - \epsilon(N)))
\tag{35}
\]
for large enough real $N$, where $\epsilon(N) \to 0$ as $N \to \infty$. On the other hand, using the explicit representation by expected value, we get
\[
|f(z)| = |E\exp(izX)| \le E\exp(N\sin(\pi/(2n))X) = f(-iN\sin(\pi/(2n))) = \exp(P(-iN\sin(\pi/(2n)))) \le \exp(N^{2n}\sin^{2n}(\pi/(2n))(\gamma^2+\epsilon(N))).
\]
As $N \to \infty$ the last inequality contradicts (35), unless $\sin(\pi/(2n)) = 1$, i.e. unless $n = 1$. This means that $P$ is of degree 2 and, since $P(0) = 0$, we have $P(t) = -\gamma^2 t^2$ for all $t$. $\Box$

2.6  Large deviations

Formula (25) shows that a multivariate normal distribution is uniquely determined by the vector $m$ of expected values and the covariance matrix $C$. However, computing probabilities of the events of interest might be quite difficult. As Theorem 2.2 shows, even writing the density explicitly is cumbersome in higher dimensions as it requires inverting large matrices. Additional difficulties arise in degenerate cases.

Here we shall present the logarithmic term in the asymptotic expansion for $P(X \in nA)$ as $n \to \infty$. This is the so called large deviation estimate; it becomes more accurate for less likely events. The main feature is that it has a relatively simple form and applies to all events. Higher order expansions are more accurate but work for fairly regular sets $A \subset \mathbb{R}^d$ only.

Let us first define the conjugate ``norm'' to the RKHS seminorm $|||\cdot|||$ defined by (29).

\[
|||y|||_* = \sup_{x \in \mathbb{R}^d,\ |||x||| = 1} x\cdot y.
\]

The conjugate norm has all the properties of the norm except that it can attain the value $\infty$. To see this, and also to have a more explicit expression, decompose $\mathbb{R}^d$ into the orthogonal sum of the null space of $A$ and the range of $A$: $\mathbb{R}^d = \mathcal{N}(A)\oplus\mathcal{R}(A)$; here $A$ is the symmetric matrix from (26). Since $A: \mathbb{R}^d \to \mathcal{R}(A)$ is onto, there is a right-inverse $A^{-1}: \mathcal{R}(A) \to \mathcal{R}(A) \subset \mathbb{R}^d$.

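For $y \in \mathcal{R}(A)$ we have

\[
\sup_{\|Ax\| = 1} x\cdot y = \sup_{\|Ax\| = 1} x\cdot AA^{-1}y = \sup_{\|Ax\| = 1} A^Tx\cdot A^{-1}y.
\tag{36}
\]
Since $A$ is symmetric and $A^{-1}y \in \mathcal{R}(A)$, for $y \in \mathcal{R}(A)$ we have by (36)
\[
|||y|||_* = \sup_{x \in \mathcal{R}(A),\ \|x\| = 1} x\cdot A^{-1}y = \|A^{-1}y\|.
\]

For $y \notin \mathcal{R}(A)$ we write $y = y_N + y_R$, where $0 \ne y_N \in \mathcal{N}(A)$. Then we have $\sup_{\|Ax\| = 1} x\cdot y \ge \sup_{x \in \mathcal{N}(A)} x\cdot y_N = \infty$. Since $C = A\times A$, we get

\[
|||y|||_*^2 = \begin{cases} y\cdot C^{-1}y & \text{if } y \in \mathcal{R}(C);\\ \infty & \text{if } y \notin \mathcal{R}(C), \end{cases}
\tag{37}
\]
where $C^{-1}$ is the right inverse of the covariance matrix $C$.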

In this notation, the multivariate normal density is

\[
f(x) = Ce^{-\frac{1}{2}|||x-m|||_*^2},
\tag{38}
\]
where $C$ is the normalizing constant and the integration has to be taken over the Lebesgue measure $\lambda$ on the support $\mathrm{supp}(X) = \{x: |||x-m|||_* < \infty\}$.

To state the Large Deviation Principle, by $A^\circ$ we denote the interior of a Borel subset $A \subset \mathbb{R}^d$.

Theorem 25 If $X$ is Gaussian $\mathbb{R}^d$-valued with the mean $m$ and the covariance matrix $C$, then for all measurable $A \subset \mathbb{R}^d$

\[
\limsup_{n\to\infty} \frac{1}{n^2}\log P(X \in nA) \le -\inf_{x \in A} \frac{1}{2}|||x-m|||_*^2
\tag{39}
\]
and

\[
\liminf_{n\to\infty} \frac{1}{n^2}\log P(X \in nA) \ge -\inf_{x \in A^\circ} \frac{1}{2}|||x-m|||_*^2.
\tag{40}
\]

The usual interpretation is that the dominant term in the asymptotic expansion for $P(\frac{1}{n}X \in A)$ as $n \to \infty$ is given by

\[
\exp\left(-\frac{n^2}{2}\inf_{x \in A}|||x-m|||_*^2\right).
\]

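Proof. Clearly, passing to $X-m$ we can easily reduce the question to the centered random vector $X$. Therefore we assume $m = 0$. Inequality (39) follows immediately from
\[
P(X \in nA) = \int_{\mathrm{supp}(X)\cap A} Cn^{k}e^{-\frac{n^2}{2}|||x|||_*^2}\, dx \le Cn^{k}\lambda(\mathrm{supp}(X)\cap A)\sup_{x \in A} e^{-\frac{n^2}{2}|||x|||_*^2},
\]
where $C = C(k)$ is the normalizing constant and $k \le d$ is the dimension of $\mathrm{supp}(X)$, cf. (38); the factor $n^k$ comes from the change of variable $x \mapsto nx$. Indeed,
\[
\frac{1}{n^2}\log P(X \in nA) \le \frac{\log C}{n^2} + \frac{k\log n}{n^2} + \frac{\log\lambda(\mathrm{supp}(X)\cap A)}{n^2} - \frac{1}{2}\inf_{x \in A}|||x|||_*^2.
\]

To prove inequality (40) without loss of generality we restrict our attention to open sets $A$. Let $x_0 \in A\cap\mathrm{supp}(X)$; if there is no such point, the right hand side of (40) is $-\infty$ and there is nothing to prove. Then for all $\epsilon > 0$ small enough, the balls $B(x_0,\epsilon) = \{x: \|x-x_0\| < \epsilon\}$ are in $A$. Therefore

\[
P(X \in nA) \ge P(X \in nD_\epsilon) = \int_{D_\epsilon} Cn^{k}e^{-\frac{n^2}{2}|||x|||_*^2}\, dx,
\tag{41}
\]
where $D_\epsilon = B(x_0,\epsilon)\cap\mathrm{supp}(X)$. On the support $\mathrm{supp}(X)$ the function $x \mapsto |||x|||_*$ is finite and convex; thus it is continuous. For every $\eta > 0$ one can find $\epsilon$ such that $|||x|||_*^2 \le |||x_0|||_*^2 + \eta$ for all $x \in D_\epsilon$. Therefore (41) gives
\[
P(X \in nA) \ge Cn^{k}\lambda(D_\epsilon)e^{-\frac{n^2}{2}(|||x_0|||_*^2+\eta)},
\]
which after passing to the logarithms ends the proof. $\Box$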
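In dimension one the large deviation bounds can be observed directly. The sketch below assumes a standard normal $X$ (so $m = 0$, $C = 1$ and the conjugate norm is the absolute value) and the half-line $A = [a, \infty)$, for which the infimum in (39) and (40) equals $a^2/2$; the function name and the values of $n$ are illustrative assumptions.

```python
import math

def scaled_log_tail(a, n):
    """(1/n^2) log P(X >= n a) for a standard normal X, via the complementary error function."""
    tail = 0.5 * math.erfc(n * a / math.sqrt(2))
    return math.log(tail) / n**2

a = 1.0
rate = -0.5 * a**2  # the limit predicted by (39) and (40) for A = [a, infinity)
values = [scaled_log_tail(a, n) for n in (5, 10, 20, 30)]
for n, v in zip((5, 10, 20, 30), values):
    print(n, v, rate)
```

The convergence is slow because of the neglected lower order terms of order $\log n / n^2$; for much larger $n$ the tail probability underflows in double precision, so an asymptotic expansion of the tail would have to be used instead.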

Large deviation bounds for Gaussian vectors valued in infinite dimensional spaces and for Gaussian stochastic processes have similar form and involve the conjugate RKHS norm; needless to say, the proof that uses the density cannot go through; for the general theory of large deviations the reader is referred to [].

2.6.1  A numerical example

Consider a bivariate normal $(X,Y)$ with the covariance matrix

\[
\begin{bmatrix} 1 & 1\\ 1 & 2 \end{bmatrix}.
\]
The conjugate RKHS norm is then given by

\[
\left|\left|\left| \begin{bmatrix} x\\ y \end{bmatrix} \right|\right|\right|_*^2 = 2x^2-2xy+y^2
\]
and the corresponding unit ball is bounded by the ellipse $2x^2-2xy+y^2 = 1$. Figure 2.1 illustrates the fact that one can actually see the conjugate RKHS norm. Asymptotic shapes in more complicated systems are more mysterious, see [].


Picture Omitted

Figure 2.1: A sample of N = 1500 points from bivariate normal distribution.
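The quadratic form $y\cdot C^{-1}y$ from (37) for the covariance matrix of this example can be recomputed directly. The following sketch inverts the $2\times 2$ matrix by hand; the helper names are hypothetical and serve only this illustration.

```python
def inverse_2x2(matrix):
    """Inverse of a non-singular 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def conjugate_quadratic_form(cov, v):
    """Value of v . C^{-1} v for a non-degenerate covariance matrix C."""
    inv = inverse_2x2(cov)
    x, y = v
    return x * (inv[0][0] * x + inv[0][1] * y) + y * (inv[1][0] * x + inv[1][1] * y)

C = [[1.0, 1.0], [1.0, 2.0]]
for point in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, -1.5)]:
    x, y = point
    print(point, conjugate_quadratic_form(C, point), 2 * x * x - 2 * x * y + y * y)
```

Points where this quadratic form equals 1 lie on the ellipse $2x^2-2xy+y^2 = 1$.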

2.7  Problems

Problem 23 If $Z$ is the standard normal $N(0,1)$ random variable, show by direct integration that its characteristic function is $f(z) = \exp(-\frac{1}{2}z^2)$ for all complex $z \in \mathbb{C}$.

Problem 24 Suppose $(X, Y) \in \mathbb{R}^{d_1+d_2}$ are jointly normal and have pairwise uncorrelated components, $\mathrm{corr}(X_i, Y_j) = 0$. Show that $X$, $Y$ are independent.

Problem 25 For standardized bivariate normal $X, Y$ with correlation coefficient $\rho$, show that $P(X > 0, Y > 0) = \frac{1}{4}+\frac{1}{2\pi}\arcsin\rho$.

Problem 26 Prove Theorem 2.2.

Problem 27 Prove that the ``moments'' $m_k = E\{X^k\exp(-X^2)\}$ are finite and determine the distribution of $X$ uniquely.

Problem 28 Show that the exponential distribution is determined uniquely by its moments.

Problem 29 If $f(s)$ is an analytic characteristic function, show that $\log f(ix)$ is a well defined convex function of the real argument $x$.

Problem 30 [deterministic analogue of Theorem 2.5] Suppose $f_1, f_2$ are characteristic functions such that $f_1(t)f_2(t) = \exp(it)$ for each $t \in \mathbb{R}$. Show that $f_k(t) = \exp(ita_k)$, $k = 1, 2$, where $a_1, a_2 \in \mathbb{R}$.

Problem 31 [exponential analogue of Theorem 2.5] If $X, Y$ are i. i. d. random variables such that $\min\{X, Y\}$ has an exponential distribution, then $X$ is exponential.

Chapter 3
Equidistributed linear forms

In Section 1.1 we present the classical characterization of the normal distribution by stability. Then we use this to define Gaussian measures on abstract spaces and we prove the zero-one law. In Section we return to the characterizations of normal distributions. We consider a more difficult problem of characterizations by the equality of distributions of two general linear forms.

3.1  Two-stability

The main result of this section is the theorem due to G. Polya []. Polya's result was obtained before the axiomatization of probability theory. It was stated in terms of positive integrable functions and part of the conclusion was that the integrals of those functions are one, so that indeed the probabilistic interpretation is valid.

Theorem 26 If $X_1, X_2$ are two i. i. d. random variables such that $X_1$ and $(X_1+X_2)/\sqrt{2}$ have the same distribution, then $X_1$ is normal.

It is easy to see that if $X_1$ and $X_2$ are i. i. d. random variables with the distribution corresponding to the characteristic function $\exp(-|t|^p)$, then the distributions of $X_1$ and $(X_1+X_2)/\sqrt[p]{2}$ are equal. In particular, if $X_1, X_2$ are normal $N(0,1)$, then so is $(X_1+X_2)/\sqrt{2}$. Theorem 3.1 says that the above trivial implication can be inverted for $p = 2$. Corresponding results are also known for $p < 2$, but in general there is no uniqueness, see [,,]. For $p \ne 2$ it is not obvious whether $\exp(-|t|^p)$ is indeed a characteristic function; in fact this is true only if $0 \le p \le 2$; the easier part of this statement was given as Problem 1.9. The distributions with this characteristic function are the so called (symmetric) stable distributions.

The following corollary shows that p-stable distributions with p < 2 cannot have finite second moments.

Corollary 13 Suppose $X_1, X_2$ are i. i. d. random variables with finite second moments and such that for some scale factor $\kappa$ and some location parameter $\alpha$ the distribution of $X_1+X_2$ is the same as the distribution of $\kappa(X_1+\alpha)$. Then $X_1$ is normal.

Indeed, subtracting the expected value if necessary, we may assume $EX_1 = 0$ and hence $\alpha = 0$. Then $\mathrm{Var}(X_1+X_2) = \mathrm{Var}(X_1)+\mathrm{Var}(X_2)$ gives $\kappa = 2^{1/2}$ (except if $X_1 = 0$; but this by definition is normal, so there is nothing to prove). By Theorem 3.1, $X_1$ (and also $X_2$) is normal.


Proof of Theorem 3.1. Clearly the assumption of Theorem 3.1 is not changed if we pass to the symmetrizations $\tilde{X}_1$, $\tilde{X}_2$ of $X_1$, $X_2$. By Theorem 2.5, to prove the theorem it remains to show that $\tilde{X}_1$ is normal. Let $f(t)$ be the characteristic function of $\tilde{X}_1$, $\tilde{X}_2$. Then

\[
f(\sqrt{2}t) = f^2(t)
\tag{42}
\]
for all real $t$. Therefore recurrently we get
\[
f(t2^{k/2}) = f(t)^{2^k}
\tag{43}
\]
for all real $t$. Take $t_0$ such that $f(t_0) \ne 0$; such $t_0$ can be found as $f$ is continuous and $f(0) = 1$. Let $\sigma^2 > 0$ be such that $f(t_0) = \exp(-\sigma^2)$. Then (43) implies $f(t_02^{-k/2}) = \exp(-\sigma^22^{-k})$ for all $k = 0, 1, \dots$. By Corollary 2.3 we have $f(t) = \exp(-\sigma^2t^2/t_0^2)$ for all $t$, and the theorem is proved. $\Box$
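The functional equations (42) and (43) are easy to test on explicit characteristic functions. The sketch below checks them for the normal case, checks the analogous scaling $f(2^{1/p}t) = f(t)^2$ for $\exp(-|t|^p)$, and shows that the characteristic function $\sin t/t$ of the uniform distribution on $[-1,1]$ violates (42); the function names and test points are illustrative assumptions.

```python
import math

def normal_cf(t, sigma2=1.0):
    """Characteristic function exp(-sigma^2 t^2) of a centered normal distribution."""
    return math.exp(-sigma2 * t * t)

def stable_cf(t, p):
    """Characteristic function exp(-|t|^p) of a symmetric p-stable distribution, 0 < p <= 2."""
    return math.exp(-abs(t) ** p)

def uniform_cf(t):
    """Characteristic function sin(t)/t of the uniform distribution on [-1, 1]."""
    return math.sin(t) / t if t != 0 else 1.0

t = 0.7
# Equation (42) and its iterate (43) for the normal characteristic function.
print(normal_cf(math.sqrt(2) * t), normal_cf(t) ** 2)
print(normal_cf(t * 2 ** (5 / 2)), normal_cf(t) ** (2 ** 5))
# The p-stable analogue with scaling 2^(1/p).
print(stable_cf(2 ** (1 / 1.5) * t, 1.5), stable_cf(t, 1.5) ** 2)
# A non-normal law with finite variance fails (42).
print(uniform_cf(math.sqrt(2) * 1.0), uniform_cf(1.0) ** 2)
```

The last line illustrates the content of the theorem: among distributions with finite variance only the normal ones can satisfy (42).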

3.2  Measures on linear spaces

Let $\mathsf{V}$ be a linear space over the field $\mathbb{R}$ of real numbers (we shall also call $\mathsf{V}$ a (real) vector space). Suppose $\mathsf{V}$ is equipped with a $\sigma$-field $\mathcal{F}$ such that the algebraic operations of scalar multiplication $(x, t) \mapsto tx$ and of vector addition $x, y \mapsto x+y$ are measurable transformations $\mathsf{V}\times\mathbb{R} \to \mathsf{V}$ and $\mathsf{V}\times\mathsf{V} \to \mathsf{V}$ with respect to the corresponding $\sigma$-fields $\mathcal{F}\otimes\mathcal{B}_{\mathbb{R}}$ and $\mathcal{F}\otimes\mathcal{F}$ respectively. Let $(\Omega, \mathcal{M}, P)$ be a probability space. A measurable function $X: \Omega \to \mathsf{V}$ is called a $\mathsf{V}$-valued random variable.

Example 4 Let $\mathsf{V} = \mathbb{R}^d$ be the vector space of all real $d$-tuples with the usual Borel $\sigma$-field $\mathcal{B}$. A $\mathsf{V}$-valued random variable is called a $d$-dimensional random vector. Clearly $X = (X_1, \dots, X_d)$ and if one prefers, one can consider the family $X_1, \dots, X_d$ rather than $X$.

Example 5 Let $\mathsf{V} = C[0, 1]$ be the vector space of all continuous functions $[0, 1] \to \mathbb{R}$ with the topology defined by the norm $\|f\| := \sup_{0 \le t \le 1}|f(t)|$ and with the $\sigma$-field $\mathcal{F}$ generated by all open sets. Then a $\mathsf{V}$-valued random variable $X$ is called a stochastic process with continuous trajectories with time $T = [0,1]$. The usual form is to write $X(t)$ for the random continuous function $X$ evaluated at a point $t \in [0,1]$.

Warning. Although it is known that every abstract random vector can be interpreted as a random process with the appropriate choice of time set $T$, the natural choice of $T$ (such as $T = 1, 2, \dots, d$ in Example 3.2 and $T = [0, 1]$ in Example 3.2) might sometimes fail. For instance, let $\mathsf{V} = L_2[0, 1]$ be the vector space of all (equivalence classes of) square integrable functions $[0, 1] \to \mathbb{R}$ with the usual $L_2$ norm $\|f\| = (\int f^2(t)\, dt)^{1/2}$. In general, a $\mathsf{V}$-valued random variable $X$ cannot be represented as a stochastic process with time $T = [0, 1]$, because evaluation at a point $t \in T$ is not a well defined mapping. Although $L_2[0, 1]$ is commonly thought of as the square integrable functions, we are actually dealing with equivalence classes rather than with genuine functions. For $\mathsf{V} = L_2[0, 1]$-valued Gaussian processes, one can show that $X_t$ exists almost surely as the limit in probability of continuous linear functionals; abstract variants of this result can be found in [] and in the references therein.

The following definition of an abstract Gaussian random variable is motivated by Theorem 3.1.

Definition 7 A $\mathsf{V}$-valued random variable $X$ is E-Gaussian (E stands for the equality of distributions) if the distribution of $\sqrt{2}X$ is equal to the distribution of $X+X'$, where $X'$ is an independent copy of $X$.

In Sections and we shall see that there are other equally natural candidates for the definitions of a Gaussian vector. To distinguish between them, we shall keep the longer name E-Gaussian instead of just calling it Gaussian. Fortunately, at least in familiar situations, it does not matter which definition we use. This occurs whenever we have plenty of measurable linear functionals. By Theorem 3.1 if $L: \mathsf{V} \to \mathbb{R}$ is a measurable linear functional, then the $\mathbb{R}$-valued random variable $L(X)$ is normal. When this specifies the probability measure on $\mathsf{V}$ uniquely, then all three definitions are equivalent. Let us see how this works in two simple but important cases.


Example 3.2 (continued) Suppose $X = (X(1), X(2), \dots, X(n))$ is an $\mathbb{R}^n$-valued E-Gaussian random variable. Consider linear functionals $L: \mathbb{R}^n \to \mathbb{R}$ given by $L: x \mapsto \sum a_ix_i$, where $a_1, a_2, \dots, a_n \in \mathbb{R}$. Then the one-dimensional random variable $a_1X(1)+a_2X(2)+\dots+a_nX(n)$ has the normal distribution. This means that $X$ is a Gaussian vector in the usual sense (i.e. it has multivariate normal distribution), as presented in Section 2.2.


Example 3.2 (continued) Suppose $X$ is a $C[0, 1]$-valued Gaussian random variable. Consider the set of all linear functionals $L: C[0, 1] \to \mathbb{R}$ that can be written in the form

\[
L = a_1E_{t(1)}+a_2E_{t(2)}+\dots+a_nE_{t(n)},
\]
where $a_1, \dots, a_n$ are real numbers and $E_t: C[0, 1] \to \mathbb{R}$ denotes the evaluation at point $t$ defined by $E_t(f) = f(t)$. Then

$L(X) = \sum a_iX(t_i)$ is normal. However, since the coefficients $a_1, \dots, a_n$ are arbitrary, this means that for each choice of $t_1, t_2, \dots, t_n \in [0, 1]$ the $n$-dimensional random variable $X(t_1), X(t_2), \dots, X(t_n)$ has a multivariate normal distribution, i.e. $X(t)$ is a Gaussian stochastic process in the usual sense$^8$.


The question that we want to address now is motivated by the following (false) intuition. Suppose a measurable linear subspace $\mathbb{L} \subset \mathsf{V}$ is given. Think for instance about $\mathbb{L} = C_1[0, 1]$, the space of all continuously differentiable functions, considered as a subspace of $C[0, 1] = \mathsf{V}$. In general, it seems plausible that some of the realizations of a $\mathsf{V}$-valued random variable $X$ may happen to fall in $\mathbb{L}$, while other realizations fail to be in $\mathbb{L}$. In other words, it seems plausible that with positive probability some of the trajectories of a stochastic process with continuous trajectories are smooth, while other trajectories are not. Strangely, this cannot happen for Gaussian vectors (and, more generally, for $\alpha$-stable vectors). The result is due to Dudley and Kanter and provides an example of the so called zero-one law. The most famous zero-one law is of course the one due to Kolmogorov, see e.g. []; see also the appendix to []. The proof given below follows []. Smoleński [] gives an elementary proof, which applies also to other classes of measures. Krakowiak [] proves the zero-one law when $\mathbb{L}$ is a measurable sub-group rather than a measurable linear subspace. Tortrat [] considers (among other issues) zero-one laws for Gaussian distributions on groups. Theorem and Theorem in the next chapter give the same conclusion under different definitions of the Gaussian random vector.

Theorem 27 If $X$ is a $\mathsf{V}$-valued E-Gaussian random variable and $\mathbb{L}$ is a linear measurable subspace of $\mathsf{V}$, then $P(X \in \mathbb{L})$ is either 0, or 1.

Proof. Let $X_1, X_2, \dots$ be independent copies of $X$. Also, let us choose them to be independent of $X$. By 2-stability and the linearity of $\mathbb{L}$ we have

\[
P(X_1+X_2 \in \mathbb{L}) = P(\sqrt{2}X \in \mathbb{L}) = P(X \in \mathbb{L}).
\tag{44}
\]
By induction, this gives
\[
P(X_1+X_2+\dots+X_{2^n} \in \mathbb{L}) = P(X \in \mathbb{L})
\tag{45}
\]
for all $n = 0, 1, \dots$.

Let Z = X1+X2. Clearly, Z is independent of X and 2-stability implies that X1+X2+¼+X2n+1 has the same distribution as Z+2n/2X. Therefore (45) gives

\[
P(Z+2^{n/2}X \in \mathbb{L}) = P(X \in \mathbb{L}).
\tag{46}
\]
Consider now events $A_n = \{Z \notin \mathbb{L}\}\cap\{Z+2^{n/2}X \in \mathbb{L}\}$. Since the event $\{Z \in \mathbb{L}\}\cap\{Z+2^{n/2}X \in \mathbb{L}\}$ is the same as $\{Z \in \mathbb{L}\}\cap\{X \in \mathbb{L}\}$, by (46) we have
\[
P(A_n) = P(Z+2^{n/2}X \in \mathbb{L})-P(Z \in \mathbb{L})P(X \in \mathbb{L}) = P(X \in \mathbb{L})-P(Z \in \mathbb{L})P(X \in \mathbb{L}).
\]
By (44) this says that $P(A_n) = P(X \in \mathbb{L})P(X \notin \mathbb{L})$ does not depend on $n$.

Now let us observe that if $m \ne n$, then the events $A_m$ and $A_n$ are disjoint. We shall prove this by contradiction. Suppose both vectors $Z+2^{n/2}X \in \mathbb{L}$ and $Z+2^{m/2}X \in \mathbb{L}$. Then their difference $(2^{n/2}-2^{m/2})X$ is in $\mathbb{L}$, too. For $m \ne n$ this implies $X \in \mathbb{L}$ and therefore $Z \in \mathbb{L}$. The latter contradicts the definition of $A_n$, proving that $A_m$ and $A_n$ are indeed disjoint.

The preceding two observations show that $\{A_n\}$ is an infinite sequence of disjoint events with the same fixed probability $P(A_n) = P(A_1)$. This can happen only if $P(A_n) = 0$, i.e. when $P(X \in \mathbb{L})P(X \notin \mathbb{L}) = 0$, which ends the proof. $\Box$
To make Theorem 3.2 more concrete, consider the following application.

Example 6 This example presents a simple-minded model of transmission of information. Suppose that we have a choice of one of the two signals $f(t)$, or $g(t)$, to be transmitted by a noisy channel within the unit time interval $0 \le t \le 1$. To simplify the situation even further, we assume $g(t) = 0$, i.e. $g$ represents ``no message sent''. The noise (which is always present) is a random and continuous function; we shall assume that it is represented by a $C[0, 1]$-valued Gaussian random variable $W = \{W(t)\}_{0 \le t \le 1}$. We also assume it is an ``additive'' noise.

Under these circumstances the signal received is given by a curve; it is either $\{f(t)+W(t)\}_{0 \le t \le 1}$, or $\{W(t)\}_{0 \le t \le 1}$, depending on which of the two signals, $f$ or $g$, was sent. The objective is to use the received signal to decide which of the two possible messages, $f(\cdot)$ or 0 (i.e. message, or no message), was sent.

Notice that, at least from the mathematical point of view, the task is trivial if f (·) is known to be discontinuous; then we only need to observe the trajectory of the received signal and check for discontinuities. There are of course numerous practical obstacles to collecting continuous data, which we are not going to discuss here.

If $f(\cdot)$ is continuous, then the above procedure does not apply. The problem requires more detailed analysis in this case. One may adopt the usual approach of testing the null hypothesis that no signal was sent. This amounts to choosing a suitable critical region $\mathbb{L} \subset C[0, 1]$. As usual in statistics, the decision is to be made according to whether the observed trajectory falls into $\mathbb{L}$ (in which case we decide $f(\cdot)$ was sent) or not (in which case we decide that 0 was sent and that what we have received was just the noise). Clearly, to get a sensible test we need $P(f(\cdot)+W(\cdot) \in \mathbb{L}) > 0$ and $P(W(\cdot) \in \mathbb{L}) < 1$.

Theorem 3.2 implies that perfect discrimination is achieved if we manage to pick the critical region in the form of a (measurable) linear subspace. Indeed, then by Theorem 3.2 $P(W(\cdot) \in \mathbb{L}) < 1$ implies $P(W(\cdot) \in \mathbb{L}) = 0$ and $P(f(\cdot)+W(\cdot) \in \mathbb{L}) > 0$ implies $P(f(\cdot)+W(\cdot) \in \mathbb{L}) = 1$.

Unfortunately, it is not true that a linear space can always be chosen for the critical region. For instance, if $W(\cdot)$ is the Wiener process (see Section ), it is known that such a subspace cannot be found if (and only if!) $f(\cdot)$ is differentiable for almost all $t$ and $\int(\frac{df}{dt})^2\, dt < \infty$. The proof of this theorem is beyond the scope of this book (cf. Cameron-Martin formula in []). The result, however, is surprising (at least for those readers who know that trajectories of the Wiener process are non-differentiable): it implies that, at least in principle, each (everywhere) non-differentiable signal $f(\cdot)$ can be recognized without errors despite having non-differentiable Wiener noise.

(Affine subspaces for centered noise E Wt = 0 do not work, see Problem )

For a recent work, see [].

3.3  Linear forms

It is easily seen that if $a_1, \dots, a_n$ and $b_1, \dots, b_n$ are real numbers such that the sets $A = \{|a_1|, \dots, |a_n|\}$ and $B = \{|b_1|, \dots, |b_n|\}$ are equal, then for any symmetric i. i. d. random variables $X_1, \dots, X_n$ the sums $\sum_{k=1}^n a_kX_k$ and $\sum_{k=1}^n b_kX_k$ have the same distribution. On the other hand, when $n = 2$, $A = \{1, 1\}$ and $B = \{0, \sqrt{2}\}$, Theorem 3.1 says that the equality of distributions of the linear forms $\sum_{k=1}^n a_kX_k$ and $\sum_{k=1}^n b_kX_k$ implies normality. In this section we shall consider two more characterizations of the normal distribution by the equality of distributions of linear combinations $\sum_{k=1}^n a_kX_k$ and $\sum_{k=1}^n b_kX_k$. The results are considerably less elementary than Theorem 3.1.

We shall begin with the following generalization of Corollary 3.1 which we learned from J. Wesołowski.

Theorem 28 Let $X_1, \dots, X_n$, $n \ge 2$, be i. i. d. square-integrable random variables and let $A = \{a_1, \dots, a_n\}$ be the set of real numbers such that $A \ne \{1,0,\dots,0\}$. If $X_1$ and $\sum_{k=1}^n a_kX_k$ have equal distributions, then $X_1$ is normal.

The next lemma is a variant of the result due to C. R. Rao, see [].

Lemma 9 Suppose $q(\cdot)$ is continuous in a neighborhood of 0, $q(0) = 0$, and in a neighborhood of 0 it satisfies the equation

\[
q(t) = \sum_{k=1}^n a_k^2\, q(a_kt),
\tag{47}
\]
where $a_1, \dots, a_n$ are given numbers such that $|a_k| \le \delta < 1$ and $\sum_{k=1}^n a_k^2 = 1$.

Then $q(t) = \mathrm{const}$ in some neighborhood of $t = 0$.

Proof. Suppose (47) holds for all $|t| < \epsilon$. Then $|a_jt| < \epsilon$ and from (47) we get $q(a_jt) = \sum_{k=1}^n a_k^2q(a_ja_kt)$ for every $1 \le j \le n$. Hence $q(t) = \sum_{j=1}^n\sum_{k=1}^n a_j^2a_k^2q(a_ja_kt)$ and we get recurrently

\[
q(t) = \sum_{j_1=1}^n \cdots \sum_{j_r=1}^n a_{j_1}^2\cdots a_{j_r}^2\, q(a_{j_1}\cdots a_{j_r}t)
\]
for all $r \ge 1$. This implies
\[
|q(t)-q(0)| \le \Big(\sum_{k=1}^n a_k^2\Big)^r \sup_{|\alpha| \le \delta^r}|q(\alpha t)-q(0)| \le \sup_{|x| \le \delta^r\epsilon}|q(x)-q(0)| \to 0
\]
as $r \to \infty$ for all $|t| < \epsilon$. $\Box$

Proof of Theorem 3.3. Without loss of generality we may assume $\mathrm{Var}(X_1) \ne 0$. Let $f$ be the characteristic function of $X_1$ and let $Q(t) = \log f(t)$. Clearly, $Q(t)$ is well defined for all $t$ close enough to 0. Equality of distributions gives

\[
Q(t) = Q(a_1t)+Q(a_2t)+\dots+Q(a_nt).
\]
The integrability assumption implies that $Q$ has two derivatives, and for all $t$ close enough to 0 the derivative $q(\cdot) = Q''(\cdot)$ satisfies equation (47).

Since $X_1$ and $\sum_{k=1}^n a_kX_k$ have equal variances, $\sum_{k=1}^n a_k^2 = 1$. Condition $|a_i| \ne 0, 1$ implies $|a_i| < 1$ for all $1 \le i \le n$. Lemma 3.3 shows that $q(\cdot)$ is constant in a neighborhood of $t = 0$ and ends the proof. $\Box$

Comparing Theorems 3.1 and 3.3 the pattern seems to be that the less information about coefficients, the more information about the moments is needed. The next result ([]) fits into this pattern, too; [] present the general theory of active exponents which permits to recognize (by examining the coefficients of linear forms), when the equality of distributions of linear forms implies normality; see also []. Variants of characterizations by equality of distributions are known for group-valued random variables, see []; [] is also pertinent.

Theorem 29 Suppose $A = \{|a_1|, \dots, |a_n|\}$ and $B = \{|b_1|, \dots, |b_n|\}$ are different sets of real numbers and $X_1, \dots, X_n$ are i. i. d. random variables with finite moments of all orders. If the linear forms $\sum_{k=1}^n a_kX_k$ and $\sum_{k=1}^n b_kX_k$ are identically distributed, then $X_1$ is normal.

We shall need the following elementary lemma.

Lemma 10 Suppose $A = \{|a_1|, \dots, |a_n|\}$ and $B = \{|b_1|, \dots, |b_n|\}$ are different sets of real numbers. Then

\[
\sum_{k=1}^n a_k^{2r} \ne \sum_{k=1}^n b_k^{2r}
\tag{48}
\]
for all $r \ge 1$ large enough.

Proof. Without loss of generality we may assume that the coefficients are arranged in increasing order $|a_1| \le \dots \le |a_n|$ and $|b_1| \le \dots \le |b_n|$. Let $M$ be the largest number $m \le n$ such that $|a_m| \ne |b_m|$. (Clearly, at least one such $m$ exists, because the sets $A$, $B$ consist of different numbers.) Then $|a_k| = |b_k|$ for $k > M$ and $\sum_{k=1}^n a_k^{2r} \ne \sum_{k=1}^n b_k^{2r}$ for all $r$ large enough. Indeed, by the definition of $M$ we have $\sum_{k>M} b_k^{2r} = \sum_{k>M} a_k^{2r}$, but the remaining portions of the sums are not equal, $\sum_{k \le M} b_k^{2r} \ne \sum_{k \le M} a_k^{2r}$ for $r$ large enough; the latter holds true because by our choice of $M$ the limits $\lim_{r\to\infty}(\sum_{k \le M} a_k^{2r})^{1/(2r)} = \max_{k \le M}|a_k| = |a_M|$ and $\lim_{r\to\infty}(\sum_{k \le M} b_k^{2r})^{1/(2r)} = \max_{k \le M}|b_k| = |b_M|$ are not equal. $\Box$
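A small example shows why (48) is claimed only for large $r$: the power sums of two different coefficient sets may coincide for small $r$. The sketch below uses the hypothetical coefficient sets $\{1, 7\}$ and $\{5, 5\}$, chosen so that the sums of squares agree; all arithmetic is exact integer arithmetic.

```python
def power_sum(coeffs, r):
    """Sum of |c|^(2r) over the coefficients."""
    return sum(abs(c) ** (2 * r) for c in coeffs)

a = (1, 7)   # hypothetical coefficients, 1 + 49 = 50
b = (5, 5)   # hypothetical coefficients, 25 + 25 = 50

for r in range(1, 5):
    print(r, power_sum(a, r), power_sum(b, r), power_sum(a, r) != power_sum(b, r))
```

Here the sums agree for $r = 1$ and differ for every $r \ge 2$, in line with the argument based on the largest coefficients $|a_M| \ne |b_M|$.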

We also need the following lemma$^9$ due to Marcinkiewicz [].

Lemma 11 Let $f$ be an infinitely differentiable characteristic function and let $Q(t) = \log f(t)$. If there is $r \ge 1$ such that $Q^{(k)}(0) = 0$ for all $k \ge r$, then $f$ is the characteristic function of a normal distribution.

Proof. Indeed, $F(z) = \exp(\sum_{k=0}^r \frac{z^k}{k!}Q^{(k)}(0))$ is an analytic function and all derivatives at 0 of the functions $\log F(\cdot)$ and $\log f(\cdot)$ are equal. Differentiating the (trivial) equality $fQ' = f'$, we get $f^{(n+1)} = \sum_{k=0}^n\binom{n}{k}f^{(n-k)}Q^{(k+1)}$, which shows that all derivatives at 0 of $F(\cdot)$ and of $f(\cdot)$ are equal. This means that $f(\cdot)$ is analytic in some neighborhood of 0 and $f(t) = F(t) = \exp P(t)$ for all small enough $t$, where $P$ is a polynomial of the degree (at most) $r$. Hence by Theorem 2.5, $f$ is normal. $\Box$
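The recursion obtained by differentiating $fQ' = f'$ determines the derivatives of $f$ at 0 from those of $Q$. Written in terms of moments $m_n$ and cumulants $\kappa_k$ (that is, $f^{(n)}(0) = i^nm_n$ and $Q^{(k)}(0) = i^k\kappa_k$, a standard normalization assumed here), the factors of $i$ cancel and the recursion reads $m_{n+1} = \sum_{k=0}^n\binom{n}{k}m_{n-k}\kappa_{k+1}$. The sketch below applies it to the standard normal case, where only $\kappa_2 = 1$ is non-zero; the function name is hypothetical.

```python
from math import comb

def moments_from_cumulants(cumulants, order):
    """Moments m_0..m_order from cumulants, where cumulants[k] = kappa_k for k >= 1.

    Cumulants beyond the end of the list are treated as zero, and cumulants[0] is unused.
    """
    def kappa(k):
        return cumulants[k] if k < len(cumulants) else 0

    m = [1]
    for n in range(order):
        m.append(sum(comb(n, k) * m[n - k] * kappa(k + 1) for k in range(n + 1)))
    return m

# Standard normal: kappa_1 = 0, kappa_2 = 1, all higher cumulants vanish.
print(moments_from_cumulants([0, 0, 1], 6))
```

When all cumulants of order $k \ge r$ vanish, every moment is a polynomial expression in finitely many cumulants, which is the mechanism behind the lemma.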

Proof of Theorem 3.3. Without loss of generality, we may assume that $X_1$ is symmetric. Indeed, if random variables $X_1, \dots, X_n$ satisfy the assumptions of the theorem, then so do their symmetrizations $\tilde{X}_1, \dots, \tilde{X}_n$, see Section 1.6. If we could prove the theorem for symmetric random variables, then $\tilde{X}_1$ would be normal. By Theorem 2.5, this would imply that $X_1$ is normal. Hence it suffices to prove the theorem under the additional symmetry assumption. Let $f$ be the characteristic function of the $X$'s and let $Q(t) = \log f(t)$; $Q$ is well defined for all $t$ close enough to 0. The assumption implies that $Q$ has derivatives of all orders and also that $Q(a_1t)+Q(a_2t)+\dots+Q(a_nt) = Q(b_1t)+Q(b_2t)+\dots+Q(b_nt)$. Differentiating the last equality $2r$ times at $t = 0$ we obtain

\[
\sum_{k=1}^n a_k^{2r}Q^{(2r)}(0) = \sum_{k=1}^n b_k^{2r}Q^{(2r)}(0), \quad r = 0, 1, \dots
\tag{49}
\]
Notice that by (48), equality (49) implies $Q^{(2r)}(0) = 0$ for all $r$ large enough. Thus by (49) (and by the symmetry assumption to handle the derivatives of odd order), $Q^{(k)}(0) = 0$ for all $k \ge 1$ large enough. Lemma 3.3 ends the proof. $\Box$

3.4  Exponential analogy

Characterizations of the normal distribution frequently lead to analogous characterizations of the exponential distribution. The idea behind this correspondence is that adding random variables is replaced by taking their minimum. This is explained by the well known fact that the minimum of independent exponential random variables is exponentially distributed; the observation is due to Linnik [], see []. Monographs [,], present such results as well as the characterizations of the exponential distribution by its intrinsic properties, such as lack of memory. In this book some of the exponential analogues serve as exercises.

The following result, written in the form analogous to Theorem *, illustrates how the exponential analogy works. The i. i. d. assumption can easily be weakened to independence of X and Y (the details of this modification are left to the reader as an exercise).

Theorem 30 Suppose $X$, $Y$ are non-negative random variables such that

(i) for all $a, b > 0$ such that $a+b = 1$, the random variable $\min\{X/a, Y/b\}$ has the same distribution as $X$;

(ii) X and Y are independent and identically distributed.

Then X and Y are exponential.

Proof. The following simple observation lies behind the proof.

If $X, Y$ are independent non-negative random variables, then the tail distribution function, defined for any $Z \ge 0$ by $N_Z(x) = P(Z \ge x)$, satisfies

\[ N_{\min\{X, Y\}}(x) = N_X(x) N_Y(x). \tag{50} \]

Using (50) and the assumption, we obtain $N(at)N(bt) = N(t)$ for all $a, b, t > 0$ such that $a+b = 1$, where $N = N_X = N_Y$. Writing $t = x+y$, $a = x/(x+y)$, $b = y/(x+y)$ for arbitrary $x, y > 0$, we get

\[ N(x+y) = N(x)N(y). \tag{51} \]

Therefore to prove the theorem, we need only solve the functional equation (51) for the unknown function $N(\cdot)$ such that $0 \le N(\cdot) \le 1$; $N(\cdot)$ is also right-continuous, non-increasing, and $N(x) \to 0$ as $x \to \infty$.

Formula (51) shows recurrently that for all integers $n \ge 1$ and all $x \ge 0$ we have

\[ N(nx) = N(x)^n. \tag{52} \]

Since $N(0) = 1$ and $N(\cdot)$ is right-continuous, it follows from (52) that $r = N(1) > 0$. Therefore (52) implies $N(n) = r^n$ and $N(1/n) = r^{1/n}$ (to see this, plug into (52) the values $x = 1$ and $x = 1/n$ respectively). Hence $N(n/m) = N(1/m)^n = r^{n/m}$ (by putting $x = 1/m$ in (52)), i.e. for each rational $q > 0$ we have

\[ N(q) = r^q. \tag{53} \]

Since $N(x)$ is right-continuous, $N(x) = \lim_{q \searrow x} N(q) = r^x$ for each $x \ge 0$. It remains to notice that $r < 1$, which follows from the fact that $N(x) \to 0$ as $x \to \infty$. Therefore $r = \exp(-\lambda)$ for some $\lambda > 0$, and $N(x) = \exp(-\lambda x)$, $x \ge 0$. $\Box$
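The role of the tail identity $N(at)N(bt) = N(t)$ can be illustrated numerically. The following is a minimal sketch, not part of the original argument: the helper names and the contrasting tail $(1+x)^{-2}$ are assumptions chosen only for illustration. It evaluates the defect $|N(at)N(bt) - N(t)|$ on a grid and shows that it vanishes (up to rounding) for an exponential tail but not for the other tail.

```python
import math

def exp_tail(lam):
    """Tail N(x) = P(X >= x) = exp(-lam * x) of an exponential variable."""
    return lambda x: math.exp(-lam * x)

def pareto_tail(x):
    """A non-exponential tail (1 + x)^(-2), used only for contrast (assumption)."""
    return (1.0 + x) ** -2

def min_tail(N, a, b, t):
    """P(min{X/a, Y/b} >= t) = N(a t) N(b t) for i.i.d. X, Y with tail N."""
    return N(a * t) * N(b * t)

def max_defect(N, grid):
    """Largest |N(at)N(bt) - N(t)| over pairs (a, t), with b = 1 - a."""
    return max(abs(min_tail(N, a, 1.0 - a, t) - N(t)) for a, t in grid)

# Grid of weights a in (0, 1) and arguments t in (0, 5].
grid = [(a / 10, t / 4) for a in range(1, 10) for t in range(1, 21)]

print(max_defect(exp_tail(1.7), grid))  # rounding error only
print(max_defect(pareto_tail, grid))    # visibly positive
```

The second tail fails the identity, which is consistent with the theorem: among tails of non-degenerate distributions only the exponential ones pass.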

3.5  Exponential distributions on lattices

The abstract notation of this section follows []. Let $\mathbb{L}$ be a vector space with norm $\|\cdot\|$. Suppose that $\mathbb{L}$ is also a lattice with the operations minimum $\wedge$ and maximum $\vee$ which are consistent with the vector operations and with the norm. The related order is then defined by $x \preceq y$ iff $x \vee y = y$ (or, alternatively: iff $x \wedge y = x$). By consistency with vector operations we mean that$^{10}$

\[ (x+y) \wedge (z+y) = y + (x \wedge z) \quad \text{for all } x, y, z \in \mathbb{L}, \]
\[ (ax) \wedge (ay) = a(x \wedge y) \quad \text{for all } x, y \in \mathbb{L},\ a \ge 0, \]
and
\[ -(x \wedge y) = (-x) \vee (-y). \]
Consistency with the norm means
\[ \|x\| \le \|y\| \quad \text{for all } 0 \preceq x \preceq y. \]

Moreover, we assume that there is a $\sigma$-field $\mathcal{F}$ such that all the operations considered are measurable.

The vector space $\mathbb{R}^d$ with

\[ x \wedge y = (\min\{x_j, y_j\})_{1 \le j \le d} \tag{54} \]
and with the norm $\|x\| = \max_j |x_j|$ satisfies the above requirements. Other examples are provided by function spaces with the usual norms; a familiar example is the space $C[0, 1]$ of all continuous functions with the standard supremum norm and the pointwise minimum of functions as the lattice operation.
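The consistency requirements are easy to check by hand for $\mathbb{R}^d$; the following small sketch does so on concrete vectors. The function names (`meet`, `join`, `add`, `scale`, `norm`) and the sample vectors are illustrative assumptions, not notation from the text.

```python
def meet(x, y):
    """Lattice minimum (54): coordinatewise min."""
    return tuple(min(a, b) for a, b in zip(x, y))

def join(x, y):
    """Lattice maximum: coordinatewise max."""
    return tuple(max(a, b) for a, b in zip(x, y))

def add(x, y):
    return tuple(a + b for a, b in zip(x, y))

def scale(c, x):
    return tuple(c * a for a in x)

def norm(x):
    """Maximum norm max_j |x_j|."""
    return max(abs(a) for a in x)

x, y, z = (1.0, -2.0, 3.5), (0.5, 4.0, -1.0), (2.0, 0.0, 0.25)

# (x + y) ^ (z + y) = y + (x ^ z)
print(meet(add(x, y), add(z, y)) == add(y, meet(x, z)))
# (a x) ^ (a y) = a (x ^ y) for a >= 0
print(meet(scale(3.0, x), scale(3.0, y)) == scale(3.0, meet(x, y)))
# -(x ^ y) = (-x) v (-y)
print(scale(-1.0, meet(x, y)) == join(scale(-1.0, x), scale(-1.0, y)))
# monotonicity of the norm on 0 <= u <= v (coordinatewise order)
u, v = (0.5, 0.0, 1.0), (1.0, 2.0, 1.0)
print(meet(u, v) == u and norm(u) <= norm(v))
```

The sample coordinates are dyadic rationals, so the floating-point comparisons above are exact.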

The following abstract definition complements [].

Definition 8 A random variable $X\colon \Omega \to \mathbb{L}$ has exponential distribution if the following two conditions are satisfied:

(i) $X \succeq 0$;

(ii) if $X'$ is an independent copy of $X$, then for any $0 < a < 1$ the random variables $X/a \wedge X'/(1-a)$ and $X$ have the same distribution.

Example 7 Let $\mathbb{L} = \mathbb{R}^d$ with $\wedge$ defined coordinatewise by (54) as in the above discussion. Then any $\mathbb{R}^d$-valued exponential random variable has the multivariate exponential distribution in the sense of Pickands, see []. This distribution is also known as the Marshall-Olkin distribution.

Using the definition above, it is easy to notice that if $(X_1, \dots, X_d)$ has the exponential distribution, then $\min\{X_1, \dots, X_d\}$ has the exponential distribution on the real line. The next result is attributed to Pickands, see [].

Proposition 5 Let $X = (X_1, \dots, X_d)$ be an $\mathbb{R}^d$-valued exponential random variable. Then the real random variable $\min\{X_1/a_1, \dots, X_d/a_d\}$ is exponential for all $a_1, \dots, a_d > 0$.

Proof. Let $Z = \min\{X_1/a_1, \dots, X_d/a_d\}$. Let $Z'$ be an independent copy of $Z$. By Theorem 30 it remains to show that

\[ \min\{Z/a, Z'/b\} \cong Z \tag{55} \]
for all $a, b > 0$ such that $a+b = 1$. It is easily seen that
\[ \min\{Z/a, Z'/b\} = \min\{Y_1/a_1, \dots, Y_d/a_d\}, \]
where $Y_i = \min\{X_i/a, X'_i/b\}$ and $X'$ is an independent copy of $X$. However, by the definition, $X$ has the same distribution as $(Y_1, \dots, Y_d)$, so (55) holds. $\Box$

Remark: By taking a limit as $a_j \to 0$ for all $j \ne i$, from Proposition 5 we obtain in particular that each component $X_i$ is exponential.

Example 8 Let $\mathbb{L} = C[0, 1]$ with $(f \wedge g)(x) := \min\{f(x), g(x)\}$. Then an exponential random variable $X$ defines a stochastic process $X(t)$ with continuous trajectories and such that $(X(t_1), X(t_2), \dots, X(t_n))$ has the $n$-dimensional Marshall-Olkin distribution for each integer $n$ and for all $t_1, \dots, t_n$ in $[0, 1]$.

The following result shows that the supremum $\sup_t |X(t)|$ of the exponential process from Example 8 has the moment generating function in a neighborhood of 0. The corresponding result for Gaussian processes will be proved in Sections and . Another result on infinite dimensional exponential distributions will be given in Theorem .

Proposition 6 If $\mathbb{L}$ is a lattice with the measurable norm $\|\cdot\|$ consistent with the algebraic operation $\wedge$, then for each exponential $\mathbb{L}$-valued random variable $X$ there is $\lambda > 0$ such that $E\exp(\lambda\|X\|) < \infty$.

Proof. The result follows easily from the trivial inequality

\[ P(\|X\| \ge 2x) = P(\|X \wedge X'\| \ge x) \le (P(\|X\| \ge x))^2 \]
and Corollary 1.3. $\Box$

3.6  Problems

Problem 32 (deterministic analogue of Theorem 3.1) Show that if $X, Y \ge 0$ are i.i.d. and $2X$ has the same distribution as $X+Y$, then $X, Y$ are non-random$^{11}$.

Problem 33 Suppose random variables $X_1, X_2$ satisfy the assumptions of Theorem 3.1 and have finite second moments. Use the Central Limit Theorem to prove that $X_1$ is normal.

Problem 34 Let $\mathsf{V}$ be a metric space with a measurable metric $d$. We shall say that a $\mathsf{V}$-valued sequence of random variables $S_n$ converges to $Y$ in distribution if there exists a sequence $\hat{S}_n$ convergent to $Y$ in probability (i.e. $P(d(\hat{S}_n, Y) > \varepsilon) \to 0$ as $n \to \infty$) and such that $S_n \cong \hat{S}_n$ (in distribution) for each $n$. Let $X_n$ be a sequence of $\mathsf{V}$-valued independent random variables and put $S_n = X_1 + \dots + X_n$. Show that if $S_n$ converges in distribution (in the above sense), then the limit is an E-Gaussian random variable$^{12}$.

Problem 35 For a separable Banach-space valued Gaussian vector $X$ define the mean $m = EX$ as the unique vector that satisfies $\lambda(m) = E\lambda(X)$ for all continuous linear functionals $\lambda \in \mathsf{V}^*$. It is also known that random vectors with equal characteristic functions $\phi(\lambda) = E\exp(i\lambda(X))$ have the same probability distribution.

Suppose $X$ is a Gaussian vector with the non-zero mean $m$. Show that for a measurable linear subspace $\mathbb{L} \subset \mathsf{V}$, if $m \notin \mathbb{L}$ then $P(X \in \mathbb{L}) = 0$.

Problem 36 (deterministic analogue of Theorem 3.3) Show that if i.i.d. random variables $X, Y$ have moments of all orders and $X+2Y \cong 3X$, then $X, Y$ are non-random.

Problem 37 Show that if $X, Y$ are independent and $X+Y \cong X$, then $Y = 0$ a.s.

Chapter 4
Rotation invariant distributions

4.1  Spherically symmetric vectors

Definition 9 A random vector $X = (X_1, X_2, \dots, X_n)$ is spherically symmetric if the distribution of every linear form

\[ a_1X_1 + a_2X_2 + \dots + a_nX_n \cong X_1 \tag{56} \]
is the same for all $a_1, a_2, \dots, a_n$, provided $a_1^2 + a_2^2 + \dots + a_n^2 = 1$.

A slightly more general class of the so-called elliptically contoured distributions has been studied from the point of view of applications to statistics in []. Elliptically contoured distributions are images of spherically symmetric random variables under a linear transformation of $\mathbb{R}^n$. Additional information can also be found in [], which is devoted to the characterization problems and overlaps slightly with the contents of this section.

Let $\phi(t)$ be the characteristic function of $X$. Then

\[ \phi(t) = \phi\bigl(\|t\|\,(1, 0, \dots, 0)^T\bigr), \tag{57} \]
i.e. the characteristic function at $t$ can be written as a function of $\|t\|$ only. Conversely, if $\phi(t)$ is a characteristic function of a real random variable, then $\phi(\|t\|)$ corresponds to an $\mathbb{R}^n$-valued random vector.

From the definition we also get the following.

Proposition 7 If $X = (X_1, \dots, X_n)$ is spherically symmetric, then each of its marginals $Y = (X_1, \dots, X_k)$, where $k \le n$, is spherically symmetric.

This fact is very simple; just consider linear forms (56) with $a_{k+1} = \dots = a_n = 0$.

Example 9 Suppose $\vec{\gamma} = (\gamma_1, \gamma_2, \dots, \gamma_n)$ is the sequence of independent identically distributed normal $N(0, 1)$ random variables. Then $\vec{\gamma}$ is spherically symmetric. Moreover, for any $m \ge 1$, $\vec{\gamma}$ can be extended to a longer spherically invariant sequence $(\gamma_1, \gamma_2, \dots, \gamma_{n+m})$. In Theorem we will see that up to a random scaling factor, this is essentially the only example of a spherically symmetric sequence with arbitrarily long spherically symmetric extensions$^{13}$.

In general a multivariate normal distribution is not spherically symmetric. But if $X$ is a centered non-degenerate Gaussian r.v., then $A^{-1}X$ is spherically symmetric, see Theorem 2.2. Spherical symmetry together with Theorem is sometimes useful in computations as illustrated in Problem .

Example 10 Suppose $X = (X_1, \dots, X_n)$ has the uniform distribution on the sphere $\|x\| = r$. Obviously, $X$ is spherically symmetric. For $k < n$, the vector $Y = (X_1, \dots, X_k)$ has the density

\[ f(y) = C(r^2 - \|y\|^2)^{(n-k)/2-1}, \tag{58} \]
where $C$ is the normalizing constant (see for instance []). In particular, $Y$ is spherically symmetric and absolutely continuous in $\mathbb{R}^k$.

The density of the real valued random variable $Z = \|Y\|$ at point $z$ has an additional factor coming from the area of the sphere of radius $z$ in $\mathbb{R}^k$, i.e.

\[ f_Z(z) = C z^{k-1}(r^2 - z^2)^{(n-k)/2-1}, \quad 0 \le z \le r. \tag{59} \]
Here $C = C(r, k, n)$ is again the normalizing constant. By rescaling, it is easy to see that $C = r^{2-n} C_1(k, n)$, where
\[ C_1(k, n) = \Bigl(\int_0^1 z^{k-1}(1 - z^2)^{(n-k)/2-1}\,dz\Bigr)^{-1} = \frac{2\Gamma(n/2)}{\Gamma(k/2)\,\Gamma((n-k)/2)} = \frac{2}{B(k/2, (n-k)/2)}. \]
Therefore
\[ f_Z(z) = C_1 r^{2-n} z^{k-1}(r^2 - z^2)^{(n-k)/2-1}. \tag{60} \]
Finally, let us point out that the conditional distribution of $\|(X_{k+1}, \dots, X_n)\|$ given $Y$ is concentrated at one point $(r^2 - \|Y\|^2)^{1/2}$.

From expression (58) it is easy to see that for fixed $k$, if $n \to \infty$ and the radius is $r = \sqrt{n}$, then the density of the corresponding $Y$ converges to the density of the i.i.d. normal sequence $(\gamma_1, \gamma_2, \dots, \gamma_k)$. (This well-known fact is usually attributed to H. Poincaré.)
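The convergence can be observed numerically for $k = 1$. The sketch below is an illustration under stated assumptions: it takes the normalizing constant in (58) for $k = 1$ to be $\Gamma(n/2)/(\sqrt{\pi}\,\Gamma((n-1)/2)\,r)$ once the density is written in terms of $y/r$, and the function names are hypothetical. It compares the marginal density on the sphere of radius $\sqrt{n}$ with the standard normal density at one point.

```python
import math

def sphere_marginal_density(y, n):
    """Density of the first coordinate of a uniform point on the sphere
    of radius sqrt(n) in R^n (formula (58) with k = 1, r = sqrt(n))."""
    r = math.sqrt(n)
    if abs(y) >= r:
        return 0.0
    # log of the normalizing constant, computed with lgamma to avoid overflow
    log_c = (math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
             - 0.5 * math.log(math.pi) - math.log(r))
    # log of (1 - y^2 / r^2)^((n - 3) / 2)
    log_kernel = ((n - 3) / 2) * math.log1p(-(y / r) ** 2)
    return math.exp(log_c + log_kernel)

def normal_density(y):
    """Standard normal N(0, 1) density."""
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

errors = [abs(sphere_marginal_density(0.5, n) - normal_density(0.5))
          for n in (5, 50, 5000)]
print(errors)  # the discrepancy shrinks as n grows
```

Working with logarithms keeps the ratio of Gamma functions stable for large $n$; the discrepancy decays roughly like $1/n$.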

The calculus formulas of Example 10 are important for the general spherically symmetric case because of the following representation.

Theorem 31 Suppose $X = (X_1, \dots, X_n)$ is spherically symmetric. Then $X = RU$, where the random variable $U$ is uniformly distributed on the unit sphere in $\mathbb{R}^n$, $R \ge 0$ is real valued with distribution $R \cong \|X\|$, and the random variables $R, U$ are stochastically independent.

Proof. The first step of the proof is to show that the distribution of $X$ is invariant under all rotations $\mathsf{U}\colon \mathbb{R}^n \to \mathbb{R}^n$. Indeed, since by definition

\[ \phi(t) = E\exp(it \cdot X) = E\exp(i\|t\|X_1), \]
the characteristic function $\phi(t)$ of $X$ is a function of $\|t\|$ only. Therefore the characteristic function $\psi$ of $\mathsf{U}X$ satisfies

\[ \psi(t) = E\exp(it \cdot \mathsf{U}X) = E\exp(i\mathsf{U}^Tt \cdot X) = E\exp(i\|t\|X_1) = \phi(t). \]
The group $O(n)$ of rotations of $\mathbb{R}^n$ (i.e. the group of orthogonal $n \times n$ matrices) is a compact group; by $\mu$ we denote the normalized Haar measure (cf. []). Let $G$ be an $O(n)$-valued random variable with the distribution $\mu$ and independent of $X$. ($G$ can actually be written down explicitly; for example, if $n = 2$, then

\[ G = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}, \]
where $\theta$ is uniformly distributed on $[0, 2\pi]$.) Clearly $X \cong GX \cong \|X\|\,GX/\|X\|$ conditionally on the event $\|X\| \ne 0$. To take care of the possibility that $X = 0$, let $\Theta$ be uniformly distributed on the unit sphere and put

\[ U = \begin{cases} \Theta & \text{if } X = 0, \\ GX/\|X\| & \text{if } X \ne 0. \end{cases} \]
It is easy to see that $U$ is uniformly distributed on the unit sphere in $\mathbb{R}^n$ and that $U, X$ are independent. This ends the proof, since $X \cong GX = \|X\|U$. $\Box$
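The representation $X = RU$ also gives a recipe for simulating spherically symmetric vectors. The following sketch is one possible implementation and not the construction used in the proof: it obtains $U$ by normalizing a standard Gaussian vector (which is spherically symmetric by Example 9), and the function names and the exponential choice of the radial law are assumptions made for illustration.

```python
import math
import random

def uniform_on_sphere(n, rng):
    """Uniform point on the unit sphere in R^n, obtained by normalizing
    a vector of independent N(0, 1) coordinates."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s = math.sqrt(sum(v * v for v in g))
        if s > 0.0:  # guard against the (probability zero) zero vector
            return [v / s for v in g]

def spherically_symmetric_sample(n, radial_sampler, rng):
    """One sample of X = R * U with R >= 0 drawn independently of U."""
    r = radial_sampler(rng)
    u = uniform_on_sphere(n, rng)
    return [r * v for v in u]

rng = random.Random(0)
# Hypothetical radial law: R exponential with mean 1.
x = spherically_symmetric_sample(4, lambda g: g.expovariate(1.0), rng)
print(x, math.sqrt(sum(v * v for v in x)))
```

Any distribution on $[0, \infty)$ can be plugged in for the radial part; by Theorem 31 every spherically symmetric law arises this way.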
The next result explains the connection between spherical symmetry and linearity of regression. Actually, condition () under additional assumptions characterizes elliptically contoured distributions, see [,].

Proposition 8 If $X$ is a spherically symmetric random vector with finite first moments, then

\[ E\{X_1 \mid a_1X_1 + \dots + a_nX_n\} = \rho \sum_{k=1}^{n} a_kX_k \tag{61} \]
for all real numbers $a_1, \dots, a_n$, where $\rho = a_1/(a_1^2 + \dots + a_n^2)$.


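The simplest approach here is to use the converse to Theorem 1.5. If $\phi(\|t\|^2)$ denotes the characteristic function of $X$ (see (57)), then the joint characteristic function $\psi(t, s) = E\exp(isX_1 + it(a_1X_1 + \dots + a_nX_n))$ of the pair $X_1$, $a_1X_1 + \dots + a_nX_n$ is $\psi(t, s) = \phi((s + a_1t)^2 + (a_2t)^2 + \dots + (a_nt)^2)$. Hence

\[ (a_1^2 + \dots + a_n^2)\,\frac{\partial}{\partial s}\psi(t, s)\Big|_{s=0} = a_1\frac{\partial}{\partial t}\psi(t, 0). \]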


Picture Omitted
Figure 4.1: Linear regression for the uniform distribution on a circle.

Another possible proof is to use Theorem 31 to reduce () to the uniform case. This can be done as follows. Using the well-known properties of conditional expectations, we have

\[ E\{X_1 \mid a_1X_1 + \dots + a_nX_n\} = E\{RU_1 \mid R(a_1U_1 + \dots + a_nU_n)\} \]
\[ = E\{E\{RU_1 \mid R,\ a_1U_1 + \dots + a_nU_n\} \mid R(a_1U_1 + \dots + a_nU_n)\}. \]
Clearly,
\[ E\{RU_1 \mid R,\ a_1U_1 + \dots + a_nU_n\} = R\,E\{U_1 \mid R,\ a_1U_1 + \dots + a_nU_n\} \]
and
\[ E\{U_1 \mid R,\ a_1U_1 + \dots + a_nU_n\} = E\{U_1 \mid a_1U_1 + \dots + a_nU_n\}, \]
see Theorem 1.4 (ii) and (iii). Therefore it suffices to establish () for the uniform distribution on the unit sphere. The last fact is quite obvious from symmetry considerations; for the 2-dimensional situation this can be illustrated on a picture. Namely, the hyper-plane $a_1x_1 + \dots + a_nx_n = \mathrm{const}$ intersects the unit sphere along a translation of a suitable $(n-1)$-dimensional sphere $S$; integrating $x_1$ over $S$ we get the same fraction (which depends on $a_1, \dots, a_n$) of const. $\Box$
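The 2-dimensional picture can be reproduced by a direct computation. In the sketch below (an illustration with hypothetical function names), the line $a_1x_1 + a_2x_2 = c$ meets the unit circle in two points which, by the reflection symmetry about the direction $(a_1, a_2)$, are equally likely given the value $c$ of the linear form; averaging their first coordinates gives the conditional mean of $U_1$.

```python
import math

def chord_points(a1, a2, c):
    """The two points where the line a1*x1 + a2*x2 = c meets the unit
    circle; requires c^2 < a1^2 + a2^2."""
    norm2 = a1 * a1 + a2 * a2
    # midpoint of the chord: the point of the line closest to the origin
    mx, my = c * a1 / norm2, c * a2 / norm2
    # half-length of the chord, expressed as a multiple of (-a2, a1)
    half = math.sqrt(norm2 - c * c) / norm2
    return (mx - half * a2, my + half * a1), (mx + half * a2, my - half * a1)

def conditional_mean_x1(a1, a2, c):
    """E{U1 | a1*U1 + a2*U2 = c} for U uniform on the unit circle."""
    p, q = chord_points(a1, a2, c)
    return 0.5 * (p[0] + q[0])

a1, a2, c = 3.0, 4.0, 2.0
rho = a1 / (a1 * a1 + a2 * a2)
print(conditional_mean_x1(a1, a2, c), rho * c)  # the two values agree
```

The agreement of the two printed values is the linear regression property for the uniform distribution on a circle, with the slope $a_1/(a_1^2 + a_2^2)$.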

The following theorem shows that spherical symmetry allows us to eliminate the assumption of independence in Theorem *, see also Theorem below. The result for rational a is due to S. Cambanis, S. Huang & G. Simons []; for related exponential results see [].

Theorem 32 Let $X = (X_1, \dots, X_n)$ be a spherically symmetric random vector such that $E\|X\|^a < \infty$ for some real $a > 0$. If

\[ E\{\|(X_1, \dots, X_m)\|^a \mid (X_{m+1}, \dots, X_n)\} = \mathrm{const} \]
for some $1 \le m < n$, then $X$ is Gaussian.

Our method of proof of Theorem 32 will also provide easy access to the following interesting result due to Szabowski [], see also [].

Theorem 33 Let $X = (X_1, \dots, X_n)$ be a spherically symmetric random vector such that $E\|X\|^2 < \infty$ and $P(X = 0) = 0$. Suppose $c(x)$ is a real function with the property that there is $0 \le U \le \infty$ such that $1/c(x)$ is integrable on each finite sub-interval of the interval $[0, U]$ and that $c(x) = 0$ for all $x > U$.

If for some $1 \le m < n$

\[ E\{\|(X_1, \dots, X_m)\|^2 \mid (X_{m+1}, \dots, X_n)\} = c(\|(X_{m+1}, \dots, X_n)\|), \]
then the distribution of $X$ is determined uniquely by $c(x)$.

To prove both theorems we shall need the following.

Lemma 12 Let $X = (X_1, \dots, X_n)$ be a spherically symmetric random vector such that $P(X = 0) = 0$ and let $H$ denote the distribution of $\|X\|$. Then we have the following.

(a) For $m < n$ the r.v. $\|(X_{m+1}, \dots, X_n)\|$ has the density function $g(x)$ given by

\[ g(x) = C x^{n-m-1} \int_x^\infty r^{-n+2}(r^2 - x^2)^{m/2-1}\,H(dr), \tag{62} \]
where $C = 2\Gamma(\tfrac12 n)\bigl(\Gamma(\tfrac12 m)\Gamma(\tfrac12(n-m))\bigr)^{-1}$ is a normalizing constant of no further importance below.

(b) The distribution of $X$ is determined uniquely by the distribution of its single component $X_1$.

(c) The conditional distribution of $\|(X_1, \dots, X_m)\|$ given $(X_{m+1}, \dots, X_n)$ depends only on the $\mathbb{R}^{n-m}$-norm $\|(X_{m+1}, \dots, X_n)\|$ and

\[ E\{\|(X_1, \dots, X_m)\|^a \mid (X_{m+1}, \dots, X_n)\} = h(\|(X_{m+1}, \dots, X_n)\|), \]
where
\[ h(x) = \frac{\int_x^\infty r^{-n+2}(r^2 - x^2)^{(m+a)/2-1}\,H(dr)}{\int_x^\infty r^{-n+2}(r^2 - x^2)^{m/2-1}\,H(dr)}. \tag{63} \]

Sketch of the proof.
Formulas (62) and (63) follow from Theorem 31 by conditioning on $R$, see Example 10. Fact (b) seems to be intuitively obvious; it says that from the distribution of the product $U_1R$ of independent random variables (where $U_1$ is the 1-dimensional marginal of the uniform distribution on the unit sphere in $\mathbb{R}^n$) we can recover the distribution of $R$. Indeed, this follows from Theorem 1.8 and (62) applied to $m = n-1$: multiplying $g(x) = C\int_x^\infty r^{-n+2}(r^2 - x^2)^{(n-1)/2-1}\,H(dr)$ by $x^{u-1}$ and integrating, we get the formula which shows that from $g(x)$ we can determine the integrals $\int_0^\infty r^{t-1}\,H(dr)$, cf. () below. $\Box$

Lemma 13 Suppose $c_a(\cdot)$ is a function such that

\[ E\{\|(X_1, \dots, X_m)\|^a \mid (X_{m+1}, \dots, X_n)\} = c_a(\|(X_{m+1}, \dots, X_n)\|^2). \]
Then the function $f(x) = x^{(m+1-n)/2}g(x^{1/2})$, where $g(\cdot)$ is defined by (62), satisfies
\[ c_a(x)f(x) = \frac{1}{B(a/2, m/2)} \int_x^\infty (y - x)^{a/2-1}f(y)\,dy. \tag{64} \]

Proof. As previously, let $H(dr)$ be the distribution of $\|X\|$. The following formula for the beta integral is well known, cf. []: for $0 \le x \le r$,

\[ (r^2 - x^2)^{(m+a)/2-1} = \frac{2}{B(a/2, m/2)} \int_x^r (t^2 - x^2)^{a/2-1}(r^2 - t^2)^{m/2-1}\,t\,dt. \tag{65} \]
Substituting (65) into (63) and changing the order of integration we get
\[ c_a(x^2)g(x) = C x^{n-m-1}\,\frac{2}{B(a/2, m/2)} \int_x^\infty (t^2 - x^2)^{a/2-1} \int_t^\infty r^{-n+2}(r^2 - t^2)^{m/2-1}\,H(dr)\,t\,dt. \]
Using (62) we have therefore
\[ c_a(x^2)g(x) = x^{n-m-1}\,\frac{2}{B(a/2, m/2)} \int_x^\infty (t^2 - x^2)^{a/2-1}\,t^{m+1-n}g(t)\,t\,dt. \]
Substituting $f(\cdot)$ and changing the variable of integration from $t$ to $t^2$ ends the proof of (64). $\Box$

Proof of Theorem 33. By Lemma 13 we need only show that for $a = 2$ equation (64) has a unique solution. Since $f(\cdot) \ge 0$, it follows from (64) that $f(y) = 0$ for all $y \ge U$. Therefore it suffices to show that $f(x)$ is determined uniquely for $x < U$. Since the right hand side of (64) is differentiable, from (64) we get $2\frac{d}{dx}(c(x)f(x)) = -mf(x)$. Thus $b(x) := c(x)f(x)$ satisfies the equation

\[ 2b'(x) = -m\,b(x)/c(x) \]
at each point $0 \le x < U$. Hence $b(x) = C\exp\bigl(-\tfrac12 m \int_0^x 1/c(t)\,dt\bigr)$. This shows that
\[ f(x) = \frac{C}{c(x)} \exp\Bigl(-\frac12 m \int_0^x \frac{1}{c(t)}\,dt\Bigr) \]
is determined uniquely (here $C > 0$ is a normalizing constant). $\Box$
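The closed-form solution is easy to evaluate numerically. The sketch below is an illustration only: `f_from_c`, the trapezoid rule, and the two test functions $c$ are assumptions introduced here. For the constant function $c \equiv m\sigma^2$, which is what one would expect for i.i.d. $N(0, \sigma^2)$ coordinates, the formula reduces to a multiple of $\exp(-x/(2\sigma^2))$; the second choice $c(t) = 1 + t$ is used purely to exercise the formula, where it gives $(1+x)^{-m/2-1}$.

```python
import math

def f_from_c(c, x, m, steps=2000, C=1.0):
    """Evaluate f(x) = (C / c(x)) * exp(-(m/2) * int_0^x dt / c(t)),
    with the integral approximated by the trapezoid rule."""
    h = x / steps
    integral = 0.0
    for i in range(steps):
        t0, t1 = i * h, (i + 1) * h
        integral += 0.5 * h * (1.0 / c(t0) + 1.0 / c(t1))
    return C / c(x) * math.exp(-0.5 * m * integral)

m, sigma2 = 3, 2.0

# Constant c = m * sigma^2: f(x) = exp(-x / (2 sigma^2)) / (m sigma^2).
gauss = f_from_c(lambda t: m * sigma2, 1.5, m)
print(gauss, math.exp(-1.5 / (2 * sigma2)) / (m * sigma2))

# c(t) = 1 + t: the integral is log(1 + x), so f(x) = (1 + x)^(-m/2 - 1).
other = f_from_c(lambda t: 1.0 + t, 1.5, m)
print(other, (1.0 + 1.5) ** (-m / 2 - 1))
```

In both cases the numerical value matches the closed form, which is a quick sanity check on the differential equation $2b' = -mb/c$.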

Lemma 14 If $p(s)$ is a periodic and analytic function of the complex argument $s$ with a real period, and for real $t$ the function $t \mapsto \log(p(t)\Gamma(t + C))$ is real valued and convex, then $p(s) = \mathrm{const}$.

Proof. For all positive $x$ we have

\[ \frac{d^2}{dx^2}\log p(x) + \frac{d^2}{dx^2}\log\Gamma(x) \ge 0. \tag{66} \]
However, it is known that $\frac{d^2}{dx^2}\log\Gamma(x) = \sum_{n \ge 0}(n + x)^{-2} \to 0$ as $x \to \infty$, see []. Therefore (66) and the periodicity of $p(\cdot)$ imply that $\frac{d^2}{dx^2}\log p(x) \ge 0$. This means that the first derivative $\frac{d}{dx}\log p(\cdot)$ is a continuous, real valued, periodic and non-decreasing function of the real argument. Hence $\frac{d}{dx}\log p(x) = B \in \mathbb{R}$ for all real $x$. Therefore $\log p(s) = A + Bs$ and, since $p(\cdot)$ is periodic with real period, this implies $B = 0$. This ends the proof. $\Box$
Proof of Theorem 32. There is nothing to prove if $X = 0$. If $P(X = 0) < 1$ then $P(X = 0) = 0$. Indeed, suppose, on the contrary, that $P(X = 0) > 0$. By Theorem 31 this means that $p = P(R = 0) > 0$ and that $E\{\|(X_1, \dots, X_m)\|^a \mid (X_{m+1}, \dots, X_n)\} = 0$ with positive probability $p > 0$. Since this conditional expectation is constant by assumption, $E\{\|(X_1, \dots, X_m)\|^a \mid (X_{m+1}, \dots, X_n)\} = 0$ with probability 1. Hence $R = 0$ and $X = 0$ a.s., a contradiction.

Throughout the rest of this proof we assume without loss of generality that $P(X = 0) = 0$. By Lemmas 12 and 13, it remains to show that the integral equation

\[ f(x) = K \int_x^\infty (y - x)^{b-1}f(y)\,dy \tag{67} \]
has a unique solution in the class of functions satisfying the conditions $f(\cdot) \ge 0$ and $\int_0^\infty x^{(n-m)/2-1}f(x)\,dx = 2$.

Let $M(s) = \int_0^\infty x^{s-1}f(x)\,dx$ be the Mellin transform of $f(\cdot)$, see Section 1.8. It can be checked that $M(s)$ is well defined and analytic for $s$ in the half-plane $\Re s > \tfrac12(n-m)$, see Theorem 1.8. This holds true because the moments of all orders are finite, a claim which can be recovered with the help of a variant of Theorem , see Problem ; for a stronger conclusion see also []. The Mellin transform applied to both sides of (67) gives

\[ M(s) = K\,\frac{\Gamma(b)\Gamma(s)}{\Gamma(b + s)}\,M(b + s). \]
Thus the Mellin transform $M_1(\cdot)$ of the function $f(Cx)$, where $C = (K\Gamma(b))^{-1/b}$, satisfies
\[ M_1(s) = M_1(b + s)\,\frac{\Gamma(s)}{\Gamma(b + s)}. \]
This shows that $M_1(s) = p(s)\Gamma(s)$, where $p(\cdot)$ is analytic and periodic with real period $b$. Indeed, since $\Gamma(s) \ne 0$ for $\Re s > 0$, the function $p(s) = M_1(s)/\Gamma(s)$ is well defined and analytic in the half-plane $\Re s > 0$. Now notice that $p(\cdot)$, being periodic, has an analytic extension to the whole complex plane.

Since $f(\cdot) \ge 0$, $\log M_1(x)$ is a well defined convex function of the real argument $x$. This follows from the Cauchy-Schwarz inequality, which says that $M_1((t + s)/2) \le (M_1(t)M_1(s))^{1/2}$. Hence by Lemma 14, $p(s) = \mathrm{const}$. $\Box$
Remark: Solutions of equation (67) have been found in []. Integral equations of similar, but more general form occurred in potential theory, see Deny [], see also Bochner [] for an early work; for another proof and recent literature, see [].

4.2  Rotation invariant absolute moments

The following beautiful theorem is due to M. S. Braverman []$^{14}$.

Theorem 34 Let $X, Y, Z$ be independent identically distributed random variables with finite moments of fixed order $p \in \mathbb{R}_+ \setminus 2\mathbb{N}$. Suppose that there is a constant $C$ such that for all real $a, b, c$

\[ E|aX + bY + cZ|^p = C(a^2 + b^2 + c^2)^{p/2}. \tag{68} \]
Then $X, Y, Z$ are normal.

Condition (68) says that the absolute moments of a fixed order p of any axis, no matter how rotated, are the same; this fits well into the framework of Theorem *.

Theorem 34 is a strictly 3-dimensional phenomenon, at least if no additional conditions on random variables are imposed. It does not hold for pairs of i.i.d. random variables, see Problem below$^{15}$. Theorem 34 cannot be extended to other values of exponent $p$; if $p$ is an even integer, then (68) is not strong enough to imply the normal distribution (the easiest case to see this is of course $p = 2$).

Following Braverman's argument, we obtain Theorem 34 as a corollary to Theorem 3.1. To this end, we shall use the following result of independent interest.

Theorem 35 If $p \in \mathbb{R}_+ \setminus 2\mathbb{N}$ and $X, Y, Z$ are independent symmetric $p$-integrable random variables such that $P(Z = 0) < 1$ and

\[ E|X + tZ|^p = E|Y + tZ|^p \quad \text{for all real } t, \tag{69} \]
then $X \cong Y$ in distribution.

Theorem 35 resembles Problem 1.9, and it seems to be related to potential theory, see [] and []. Similar results have functional analytic importance, see Rudin []; also Hall [] and Hardin [] might be worth seeing in this context. Koldobskii [,] gives Banach space versions of the results and relevant references.

Theorem 34 follows immediately from Theorem 35 by the following argument.

Proof of Theorem 34. Clearly there is nothing to prove if $C = 0$, see also Problem . Suppose therefore $C \ne 0$. It follows from the assumption that $E|X + Y + tZ|^p = E|\sqrt{2}X + tZ|^p$ for all real $t$. Note also that $E|Z|^p = C \ne 0$. Therefore Theorem 35 applied to $X + Y$, $X'$ and $Z$, where $X'$ is an independent copy of $\sqrt{2}X$, implies that $X + Y$ and $\sqrt{2}X$ have the same distribution. Since $X, Y$ are i.i.d., by Theorem 3.1 $X, Y, Z$ are normal. $\Box$

  A related result

The next result can be thought of as a version of Theorem 34 corresponding to $p = 0$. For the proof see [,,].

Theorem 36 If $X = (X_1, \dots, X_n)$ is an at least 3-dimensional random vector such that its components $X_1, \dots, X_n$ are independent, $P(X = 0) = 0$ and $X/\|X\|$ has the uniform distribution on the unit sphere in $\mathbb{R}^n$, then $X$ is Gaussian.

4.2.1  Proof of Theorem for p = 1

We shall first present a slightly simplified proof for $p = 1$ which is based on the elementary identity $\max\{x, y\} = \tfrac12(x + y + |x - y|)$. This proof leads directly to the exponential analogue of Theorem 4.2; the exponential version is given as Problem below.

We shall begin with the lemma which gives an analytic version of condition (69).

Lemma 15 Let $X_1, X_2, Y_1, Y_2$ be symmetric independent random variables such that $E|Y_i| < \infty$ and $E|X_i| < \infty$, $i = 1, 2$. Denote $N_i(t) = P(|X_i| \ge t)$, $M_i(t) = P(|Y_i| \ge t)$, $t \ge 0$, $i = 1, 2$. Then each of the conditions

\[ E|a_1X_1 + a_2X_2| = E|a_1Y_1 + a_2Y_2| \quad \text{for all } a_1, a_2 \in \mathbb{R}; \tag{70} \]
\[ \int_0^\infty N_1(t)N_2(xt)\,dt = \int_0^\infty M_1(t)M_2(xt)\,dt \quad \text{for all } x > 0; \tag{71} \]
\[ \int_0^\infty N_1(xt)N_2(yt)\,dt = \int_0^\infty M_1(xt)M_2(yt)\,dt \quad \text{for all } x, y \ge 0,\ |x| + |y| \ne 0; \tag{72} \]
implies the other two.

Proof. For all real numbers $x, y$ we have $|x - y| = 2\max\{x, y\} - (x + y)$. Therefore, taking into account the symmetry of the distributions, for all real $a, b$ we have

\[ E|aX_1 - bX_2| = 2E\max\{aX_1, bX_2\}. \tag{73} \]
For an integrable random variable $Z$ we have $EZ = \int_0^\infty P(Z \ge t)\,dt - \int_0^\infty P(-Z \ge t)\,dt$, see (3). This identity applied to $Z = \max\{aX_1, bX_2\}$, where $a, b \ge 0$ are fixed, gives
\[ E\max\{aX_1, bX_2\} = \int_0^\infty P(Z \ge t)\,dt - \int_0^\infty P(Z \le -t)\,dt \]
\[ = \int_0^\infty P(aX_1 \ge t)\,dt + \int_0^\infty P(bX_2 \ge t)\,dt - \int_0^\infty P(aX_1 \ge t)P(bX_2 \ge t)\,dt - \int_0^\infty P(aX_1 \le -t)P(bX_2 \le -t)\,dt. \]
Therefore, from (73) after taking the symmetry of distributions into account, we obtain
\[ E|aX_1 - bX_2| = 2aEX_1^+ + 2bEX_2^+ - 4\int_0^\infty P(aX_1 \ge t)P(bX_2 \ge t)\,dt, \]
where $X_i^+ = \max\{X_i, 0\}$, $i = 1, 2$. Since by symmetry $P(X_i \ge s) = \tfrac12 N_i(s)$ for $s > 0$, this gives
\[ E|aX_1 - bX_2| = 2aEX_1^+ + 2bEX_2^+ - \int_0^\infty N_1(t/a)N_2(t/b)\,dt. \tag{74} \]
Similarly
\[ E|aY_1 - bY_2| = 2aEY_1^+ + 2bEY_2^+ - \int_0^\infty M_1(t/a)M_2(t/b)\,dt. \tag{75} \]
Once formulas (74) and (75) are established, we are ready to prove the equivalence of conditions (70)-(72).

(70) $\Rightarrow$ (71): If condition (70) is satisfied, then $E|X_i| = E|Y_i|$, $i = 1, 2$, and thus by symmetry $EX_i^+ = EY_i^+$, $i = 1, 2$. Therefore (74) and (75) applied to $a = 1$, $b = 1/x$ imply (71) for any fixed $x > 0$.

(71) $\Rightarrow$ (72): Changing the variable in (71) we obtain (72) for all $x > 0$, $y > 0$. Since $E|Y_i| < \infty$ and $E|X_i| < \infty$, we can pass in (72) to the limit as $x \to 0$ while $y$ is fixed, or as $y \to 0$ while $x$ is fixed, and hence (72) is proved in its full generality.

(72) $\Rightarrow$ (70): If condition (72) is satisfied, then taking $x = 0$, $y = 1$ or $x = 1$, $y = 0$ we obtain $E|X_i| = E|Y_i|$, $i = 1, 2$, and thus by symmetry $EX_i^+ = EY_i^+$, $i = 1, 2$. Therefore identities (74) and (75) applied to $a = 1/x$, $b = 1/y$ imply (70) for any $a_1 > 0$, $a_2 < 0$. Since $E|Y_i| < \infty$ and $E|X_i| < \infty$, we can pass in (70) to the limit as $a_1 \to 0$, or as $a_2 \to 0$. This proves equality (70) for all $a_1 \ge 0$, $a_2 \le 0$. However, since $X_i, Y_i$, $i = 1, 2$, are symmetric, this proves (70) in its full generality. $\Box$
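The tail-integral expression for $E|aX_1 - bX_2|$ derived in the proof (the one written with $P(aX_1 \ge t)P(bX_2 \ge t)$) can be checked exactly in the simplest symmetric case. The sketch below assumes, purely for illustration, that $X_1, X_2$ are independent random signs taking the values $\pm 1$ with probability $1/2$; the function names are hypothetical.

```python
from itertools import product

def lhs(a, b):
    """E|a*X1 - b*X2| for independent symmetric signs X1, X2 in {-1, +1},
    computed by enumerating the four equally likely outcomes."""
    outcomes = product((-1, 1), repeat=2)
    return sum(abs(a * s1 - b * s2) for s1, s2 in outcomes) / 4.0

def rhs(a, b):
    """2a*E(X1^+) + 2b*E(X2^+) - 4 * int_0^inf P(aX1 >= t) P(bX2 >= t) dt
    for a, b > 0 and the same random signs."""
    e_plus = 0.5                # E max{X, 0} for a symmetric sign
    # P(aX1 >= t) = 1/2 for 0 < t <= a and 0 beyond, so the product of the
    # two tails equals 1/4 on (0, min(a, b)] and vanishes afterwards.
    overlap = 0.25 * min(a, b)
    return 2 * a * e_plus + 2 * b * e_plus - 4 * overlap

for a, b in [(1.0, 1.0), (1.5, 0.25), (0.5, 2.0)]:
    print(a, b, lhs(a, b), rhs(a, b))
```

In this example both sides reduce to $\max\{a, b\}$, in line with the identity $|x - y| = 2\max\{x, y\} - (x + y)$ that starts the proof.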
The next result translates (70) into the property of the Mellin transform. A similar analytical identity is used in the proof of Theorem 2.0.1.

Lemma 16 Let $X_1, X_2, Y_1, Y_2$ be symmetric independent random variables such that $E|Y_j| < \infty$ and $E|X_j| < \infty$, $j = 1, 2$. Let $0 < u < 1$ be fixed. Then condition (70) is equivalent to

\[ E|X_1|^{u+it}\,E|X_2|^{1-u-it} = E|Y_1|^{u+it}\,E|Y_2|^{1-u-it} \quad \text{for all } t \in \mathbb{R}. \tag{76} \]

Proof. By Lemma 15, it suffices to show that conditions (76) and (71) are equivalent.

Proof of (71) $\Rightarrow$ (76): Write $\tau$ for the variable of integration in (71). Multiplying both sides of (71) by $x^{-u-it}$, where $t \in \mathbb{R}$ is fixed, integrating with respect to $x$ in the limits from 0 to $\infty$ and changing the order of integration (which is allowed, since the integrals are absolutely convergent), then substituting $x = y/\tau$, we get

\[ \int_0^\infty \tau^{it+u-1}N_1(\tau)\,d\tau \int_0^\infty y^{-u-it}N_2(y)\,dy = \int_0^\infty \tau^{it+u-1}M_1(\tau)\,d\tau \int_0^\infty y^{-u-it}M_2(y)\,dy. \]
This clearly implies (76), since, e.g.,
\[ \int_0^\infty \tau^{it+u-1}N_j(\tau)\,d\tau = E|X_j|^{u+it}/(u + it), \quad j = 1, 2 \]
(this is just the tail integration formula (2)).

Proof of (76) $\Rightarrow$ (71): Notice that

\[ \phi_j(t) := \frac{u\,E|X_j|^{u+it}}{(u + it)\,E|X_j|^u}, \quad j = 1, 2, \]
is the characteristic function of a random variable with the probability density function $f_{j,u}(x) := C_j\exp(xu)N_j(\exp(x))$, $x \in \mathbb{R}$, $j = 1, 2$, where $C_j = C_j(u)$ is the normalizing constant. Indeed,
\[ \int_{-\infty}^\infty e^{ixt}\exp(xu)N_j(\exp(x))\,dx = \int_0^\infty y^{it}y^{u-1}N_j(y)\,dy = E|X_j|^{u+it}/(u + it) \]
and the normalizer $C_j(u) = u/E|X_j|^u$ is then chosen to have $\phi_j(0) = 1$, $j = 1, 2$. Similarly
\[ \psi_j(t) := \frac{u\,E|Y_j|^{u+it}}{(u + it)\,E|Y_j|^u} \]
is the characteristic function of a random variable with the probability density function $g_{j,u}(x) := K_j\exp(xu)M_j(\exp(x))$, $x \in \mathbb{R}$, where $K_j = u/E|Y_j|^u$, $j = 1, 2$. Therefore (76) implies that the following two convolutions are equal: $f_{1,u} * \bar{f}_{2,1-u} = g_{1,u} * \bar{g}_{2,1-u}$, where $\bar{f}_2(x) = f_2(-x)$, $\bar{g}_2(x) = g_2(-x)$. Since (76) implies $C_1(u)C_2(1-u) = K_1(u)K_2(1-u)$, a simple calculation shows that the equality of convolutions implies

\[ \int_{-\infty}^\infty e^xN_1(e^x)N_2(e^ye^x)\,dx = \int_{-\infty}^\infty e^xM_1(e^x)M_2(e^ye^x)\,dx \]
for all real $y$. The last equality differs from (71) by the change of variable only. $\Box$

Now we are ready to prove Theorem 35 for $p = 1$. The conclusion of Lemma 16 suggests using the Mellin transform $E|X|^{u+it}$, $t \in \mathbb{R}$. Recall from Section 1.8 that if for some fixed $u > 0$ we have $E|X|^u < \infty$, then the function $E|X|^{u+it}$, $t \in \mathbb{R}$, determines the distribution of $|X|$ uniquely. This and Lemma 16 are used in the proof below.

Proof of Theorem 35 (case $p = 1$). Lemma 16 implies that for each $0 < u < 1$, $-\infty < t < \infty$

\[ E|X|^{u+it}\,E|Z|^{1-u-it} = E|Y|^{u+it}\,E|Z|^{1-u-it}. \tag{77} \]
Since $E|Z|^s$ is an analytic function in the strip $0 < \Re s < 1$, see Theorem 1.8, and $E|Z| = C \ne 0$ by (68), the equation $E|Z|^{u+it} = 0$ has at most a countable number of solutions $(u, t)$ in the strip $0 < u < 1$ and $-\infty < t < \infty$. Indeed, the equation has at most a finite number of solutions in each compact set; otherwise we would have $Z = 0$ almost surely by the uniqueness of analytic extension. Therefore one can find $0 < u < 1$ such that $E|Z|^{u+it} \ne 0$ for all $t \in \mathbb{R}$. For this value of $u$ from (77) we obtain
\[ E|X|^{1-u-it} = E|Y|^{1-u-it} \tag{78} \]
for all real $t$, which by Theorem 1.8 proves that random variables $X$ and $Y$ have the same distribution. $\Box$

4.2.2  Proof of Theorem in the general case

The following lemma shows that under assumption (69) all even moments of order less than $p$ match.

Lemma 17 Let $k = [p/2]$. Then (69) implies

\[ E|X|^{2j} = E|Y|^{2j} \tag{79} \]
for $j = 0, 1, \dots, k$.

Proof. For $j \le k$ the derivatives $\frac{\partial^j}{\partial t^j}|tX + Z|^p$ are integrable. Therefore (79) follows by the consecutive differentiation (under the integral signs) of the equation $E|tX + Z|^p = E|tY + Z|^p$ at $t = 0$. $\Box$

The following is a general version of (76).

Lemma 18 Let $0 < u < p$ be fixed. Then condition (69) and

\[ E|X|^{u+it}\,E|Z|^{p-u-it} = E|Y|^{u+it}\,E|Z|^{p-u-it} \quad \text{for all } t \in \mathbb{R} \tag{80} \]
are equivalent.
Proof. We prove only the implication (69) $\Rightarrow$ (80); we will not use the other one.

Let $k = [p/2]$. The following elementary formula follows by the change of variable$^{16}$

\[ |a|^p = C_p \int_0^\infty \Bigl(\cos ax - \sum_{j=0}^{k} (-1)^j \frac{a^{2j}x^{2j}}{(2j)!}\Bigr)\frac{dx}{x^{p+1}} \tag{81} \]
for all $a$.

Since our variables are symmetric, applying (81) with $X + aZ$ and $Y + aZ$ in place of the argument, from (69) and Lemma 17 we get

\[ \int_0^\infty \frac{(\phi_X(x) - \phi_Y(x))\,\phi_Z(ax)}{x^{p+1}}\,dx = 0 \tag{82} \]
and the integral converges absolutely. Multiplying (82) by $a^{-p+u+it-1}$, integrating with respect to $a$ in the limits from 0 to $\infty$ and switching the order of integrals we get
\[ \int_0^\infty \frac{\phi_X(x) - \phi_Y(x)}{x^{p+1}} \int_0^\infty a^{-p+u+it-1}\phi_Z(ax)\,da\,dx = 0. \tag{83} \]
Notice that
\[ \int_0^\infty a^{-p+u+it-1}\phi_Z(ax)\,da = x^{p-u-it} \int_0^\infty b^{-p+u+it-1}\phi_Z(b)\,db = x^{p-u-it}\,\Gamma(-p+u+it)\,E|Z|^{p-u-it}. \]
Therefore (83) implies
\[ \Gamma(-p+u+it)\,\Gamma(-u-it)\,\bigl(E|X|^{u+it} - E|Y|^{u+it}\bigr)\,E|Z|^{p-u-it} = 0. \]
This shows that identity (80) holds for all values of $t$, except perhaps for a countable discrete set arising from the zeros of the Gamma function. Since $E|Y|^z$ is analytic in the strip $-1 < \Re z < p$, this implies (80) for all $t$. $\Box$

Proof of Theorem 35 (general case). The proof of the general case follows the previous argument for $p = 1$ with (80) replacing (76). $\Box$

4.2.3  Pairs of random variables

Although in general Theorem 34 does not hold for a pair of i.i.d. variables, it is possible to obtain a variant for pairs under additional assumptions. Braverman [] obtained the following result.

Theorem 37 Suppose $X, Y$ are i.i.d. and there are positive $p_1 \ne p_2$ such that $p_1, p_2 \notin 2\mathbb{N}$ and $E|aX + bY|^{p_j} = C_j(a^2 + b^2)^{p_j/2}$ for all $a, b \in \mathbb{R}$, $j = 1, 2$. Then $X$ is normal.

Proof. Suppose $0 < p_1 < p_2$. Denote by $Z$ the standard normal $N(0, 1)$ random variable and let

\[ \phi_p(s) = \frac{E|X|^{p/2+s}}{E|Z|^{p/2+s}}. \]
Clearly $\phi_p$ is analytic in the strip $-1 < p/2 + \Re s < p_2$.

For $-p_1/2 < \Re s < p_2/2$ by Lemma 18 we have

\[ \phi_{p_1}(s)\,\phi_{p_1}(-s) = C_1 \tag{84} \]
and
\[ \phi_{p_2}(s)\,\phi_{p_2}(-s) = C_2. \tag{85} \]
Put $r = \tfrac12(p_2 - p_1)$. Then $\phi_{p_2}(s) = \phi_{p_1}(s + r)$ in the strip $-p_1/2 < \Re s < p_1/2$. Therefore (85) implies
\[ \phi(r + s)\,\phi(r - s) = C_2, \]
where to simplify the notation we write $\phi = \phi_{p_1}$. Using now (84) we get
\[ \phi(r + s) = \frac{C_2}{\phi(r - s)} = \frac{C_2}{C_1}\,\phi(s - r). \tag{86} \]
Equation (86) shows that the function $p(s) := K^s\phi(s)$, where $K = (C_1/C_2)^{1/(2r)}$, is periodic with real period $2r$. Furthermore, since $p_1 > 0$, $p(s)$ is analytic in a strip of width strictly larger than $2r$; thus it extends analytically to $\mathbb{C}$. By Lemma 14 this determines uniquely the Mellin transform of $|X|$. Namely,
\[ E|X|^s = C K^s E|Z|^s. \]
Therefore in distribution we have the representation
\[ X \cong K Z \chi, \tag{87} \]
where $K$ is a constant, $Z$ is normal $N(0, 1)$, and $\chi$ is a $\{0, 1\}$-valued random variable, independent of $Z$, such that $P(\chi = 1) = C$.

Clearly, the proof is concluded if $C = 0$ ($X$ being degenerate normal). If $C \ne 0$ then by (87)

\[ E|tX + uY|^p = C^2K^p(t^2 + u^2)^{p/2}E|Z|^p + C(1 - C)K^p(|t|^p + |u|^p)E|Z|^p. \tag{88} \]
Therefore $C = 1$, which ends the proof. $\Box$

The next result comes from [] and uses stringent moment conditions; Braverman [] gives examples which imply that the condition on zeros of the Mellin transform cannot be dropped.

Theorem 38 Let $X, Y$ be symmetric i.i.d. random variables such that

\[ E\exp(\lambda|X|^2) < \infty \]
for some $\lambda > 0$, and $E|X|^s \ne 0$ for all $s \in \mathbb{C}$ such that $\Re s > 0$. Suppose there is a constant $C$ such that for all real $a, b$

\[ E|aX + bY| = C(a^2 + b^2)^{1/2}. \]
Then $X, Y$ are normal.

The rest of this section is devoted to the proof of Theorem 38.

The function $\phi(s) = E|X|^s$ is analytic in the half-plane $\Re s > 0$. Since $E|Z|^s = \pi^{-1/2}K^s\Gamma\bigl(\tfrac{s+1}{2}\bigr)$, where $K = \pi^{1/2}E|Z| > 0$ and $\Gamma(\cdot)$ is the Euler gamma function, (76) means that $\phi(s) = \pi^{-1/2}K^s a(s)\Gamma\bigl(\tfrac{s+1}{2}\bigr)$, where $a(s) := \pi^{1/2}K^{-s}\phi(s)/\Gamma\bigl(\tfrac{s+1}{2}\bigr)$ is analytic in the half-plane $\Re s > 0$, $a(\bar{s}) = \overline{a(s)}$, and satisfies

\[ a(s)\,a(1 - s) = 1 \quad \text{for } 0 < \Re s < 1. \tag{89} \]
We shall need the following estimate, in which without loss of generality we may assume 0 < lK < 1 (choose l > 0 small enough).

[ 19 There is a constant C > 0 such that |a(s)| £ C|s| (lK)-Âs for all s in the half-plane Âs ³ 1/2.

Proof. Since $E\exp(\lambda^2|X|^2) < \infty$ for some $\lambda > 0$, we have $P(|X| \ge t) \le Ce^{-\lambda^2t^2}$, where $C = E\exp(\lambda^2|X|^2)$, see Problem 1.9. This implies

\[ |f(s)| \le C_1|s|\,\lambda^{-\Re s}\,\Gamma(\tfrac12 \Re s), \quad \Re s > 0. \tag{90} \]
In particular $|a(s)| \le C\exp(o(|s|^2))$, where $o(x)/x \to 0$ as $x \to \infty$.

Consider now the function $u(s) = a(s)(\lambda K)^s/s$, which is analytic in $\Re s > 0$. Clearly $|u(s)| \le C\exp(o(|s|^2))$ as $|s| \to \infty$. Moreover $|u(1/2+it)| \le \text{const}$ for all real $t$ by (89); for all real $x$

\[ |u(x)| = \pi^{1/2}x^{-1}\lambda^x f(x)/\Gamma(\tfrac{x+1}{2}) \le C_1\,\Gamma(\tfrac12 x)/\Gamma(\tfrac{x+1}{2}) \le \pi^{1/2}C, \]
by (90). Therefore by the Phragmén-Lindelöf principle, see, e.g., [], applied twice to the angles $-\frac12\pi \le \arg s \le 0$ and $0 \le \arg s \le \frac12\pi$, the Lemma is proved. $\Box$

By Lemma 4.2.1 Theorem 4.2.3 follows from the next result.

Lemma 20 Suppose $X$ is a symmetric random variable satisfying

\[ E\exp(\lambda^2|X|^2) < \infty \]
for some $\lambda > 0$, and
\[ E|X|^s \ne 0 \]
for all $s \in \mathbb{C}$ such that $\Re s > 0$. Let $Z$ be a centered normal random variable such that
\[ E|X|^{1/2+it}\, E|X|^{1/2-it} = E|Z|^{1/2+it}\, E|Z|^{1/2-it} \tag{91} \]
for all $t \in \mathbb{R}$. Then $X$ is normal.

Proof.

We shall use Lemma 4.2.3 to show that $a(s) = C_1C_2^s$ for some real $C_1, C_2 > 0$. It is clear that $a(s) \ne 0$ if $\Re s > 0$. Therefore $b(s) = \log a(s)$ is a well defined function which is analytic in the half-plane $\Re s > 0$. The function $v(s) := \Re(b(\tfrac12 - is)) = \log|a(\tfrac12 - is)|$ is harmonic in the half-plane $\Im s > -1/2$ and $\limsup_{|s| \to \infty} v(s)/|s| < \infty$ by Lemma 4.2.3. Furthermore by (89) we have $v(t) = 0$ for real $t$. By the Nevanlinna integral representation, see [],

\[ v(x+iy) = \frac{y}{\pi}\int_{-\infty}^{\infty}\frac{v(t)}{(t-x)^2+y^2}\,dt + ky \]
for some real constant $k$ and for all real $x, y$ with $y > 0$. This in particular implies that $b(y+1/2) = \Re(b(y+1/2)) = v(iy) = ky$. Thus by the uniqueness of analytic extension we get $a(s) = C_1C_2^s$ and hence
\[ f(s) = \pi^{-1/2}K^sC_1C_2^s\,\Gamma(\tfrac{s+1}{2}) \tag{92} \]
for some constants $C_1, C_2$ such that $C_1^2C_2 = 1$ (the latter is a consequence of (89)). Formula (92) shows that the distribution of $X$ is given by (87). To exclude the possibility that $P(X = 0) \ne 0$ it remains to verify that $C_1 = 1$. This again follows from (88). By Theorem 1.8, the proof is completed. $\Box$

4.3  Infinite spherically symmetric sequences

In this section we present results that hold true for infinite sequences only and which might fail for finite sequences.

Definition 10 An infinite sequence $X_1, X_2, \dots$ is spherically symmetric if the finite sequence $X_1, X_2, \dots, X_n$ is spherically symmetric for all $n$.

The following provides considerably more information than Theorem 4.1.

Theorem 39 [[]] If an infinite sequence $X = (X_1, X_2, \dots)$ is spherically symmetric, then there is a sequence of independent identically distributed Gaussian random variables $\vec{\gamma} = (\gamma_1, \gamma_2, \dots)$ and a non-negative random variable $R$ independent of $\vec{\gamma}$ such that

\[ X = R\,\vec{\gamma}. \]

This result is based on exchangeability.

Definition 11 A sequence $(X_k)$ of random variables is exchangeable, if the joint distribution of $X_{\sigma(1)}, X_{\sigma(2)}, \dots, X_{\sigma(n)}$ is the same as the joint distribution of $X_1, X_2, \dots, X_n$ for all $n \ge 1$ and for all permutations $\sigma$ of $\{1, 2, \dots, n\}$.

Clearly, spherical symmetry implies exchangeability. The following beautiful theorem due to B. de Finetti [] points out the role of exchangeability in characterizations as a substitute for independence; for more information and the references see [].

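Theorem 40 Suppose that $X_1, X_2, \dots$ is an infinite exchangeable sequence. Then there exists a $\sigma$-field $\mathcal{N}$ such that $X_1, X_2, \dots$ are $\mathcal{N}$-conditionally i. i. d., that is

\[ P(X_1 < a_1, X_2 < a_2, \dots, X_n < a_n \mid \mathcal{N}) = P(X_1 < a_1 \mid \mathcal{N})\, P(X_1 < a_2 \mid \mathcal{N}) \cdots P(X_1 < a_n \mid \mathcal{N}) \]
for all $a_1, \dots, a_n \in \mathbb{R}$ and all $n \ge 1$.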

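Proof. Let $\mathcal{N}$ be the tail $\sigma$-field, i.e.

\[ \mathcal{N} = \bigcap_{k=1}^{\infty} \sigma(X_k, X_{k+1}, \dots), \]
and put $\mathcal{N}_k = \sigma(X_k, X_{k+1}, \dots)$. Fix bounded measurable functions $f, g, h$ and denote
\[ F_n = f(X_1, \dots, X_n); \]
\[ G_{n,m} = g(X_{n+1}, \dots, X_{m+n}); \]
\[ H_{n,m,N} = h(X_{m+n+N+1}, X_{m+n+N+2}, \dots), \]
where $n, m, N \ge 1$. Exchangeability implies that
\[ E F_n G_{n,m} H_{n,m,N} = E F_n G_{n+r,m} H_{n,m,N} \]
for all $r \le N$. Since $H_{n,m,N}$ is an arbitrary bounded $\mathcal{N}_{m+n+N+1}$-measurable function, this implies
\[ E\{F_n G_{n,m} \mid \mathcal{N}_{m+n+N+1}\} = E\{F_n G_{n+r,m} \mid \mathcal{N}_{m+n+N+1}\}. \]
Passing to the limit as $N \to \infty$, see Theorem 1.4, this gives
\[ E\{F_n G_{n,m} \mid \mathcal{N}\} = E\{F_n G_{n+r,m} \mid \mathcal{N}\}. \]
Therefore
\[ E\{F_n G_{n,m} \mid \mathcal{N}\} = E\{G_{n+r,m}\, E\{F_n \mid \mathcal{N}_{n+r+1}\} \mid \mathcal{N}\}. \]
Since $E\{F_n \mid \mathcal{N}_{n+r+1}\}$ converges in $L_1$ to $E\{F_n \mid \mathcal{N}\}$ as $r \to \infty$, and since $g$ is bounded,
\[ E\{G_{n+r,m}\, E\{F_n \mid \mathcal{N}_{n+r+1}\} \mid \mathcal{N}\} \]
is arbitrarily close (in the $L_1$ norm) to
\[ E\{G_{n+r,m}\, E\{F_n \mid \mathcal{N}\} \mid \mathcal{N}\} = E\{F_n \mid \mathcal{N}\}\, E\{G_{n+r,m} \mid \mathcal{N}\} \]
as $r \to \infty$. By exchangeability $E\{G_{n+r,m} \mid \mathcal{N}\} = E\{G_{n,m} \mid \mathcal{N}\}$ almost surely, which proves that
\[ E\{F_n G_{n,m} \mid \mathcal{N}\} = E\{F_n \mid \mathcal{N}\}\, E\{G_{n,m} \mid \mathcal{N}\}. \]
Since $f, g$ are arbitrary, this proves $\mathcal{N}$-conditional independence of the sequence. Using the exchangeability of the sequence once again, one can see that random variables $X_1, X_2, \dots$ have the same $\mathcal{N}$-conditional distribution and thus the theorem is proved. $\Box$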
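A finite toy illustration of de Finetti's theorem may help. The sketch below assumes a hypothetical mixing law (a success probability equal to 1/4 or 3/4 with equal chances) and computes exact joint probabilities of the resulting 0-1 sequence. The sequence is exchangeable and conditionally i. i. d. given the mixing variable, yet it is not independent. All names are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical mixing law: theta = 1/4 or 3/4, each with probability 1/2.
THETAS = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}

def joint_prob(xs):
    """P(X_1 = x_1, ..., X_n = x_n) for a mixture of i.i.d. Bernoulli(theta) sequences."""
    k, n = sum(xs), len(xs)
    return sum(w * th ** k * (1 - th) ** (n - k) for th, w in THETAS.items())

def is_exchangeable(n):
    """Check that every permutation of every 0-1 pattern of length n has the same probability."""
    return all(joint_prob(xs) == joint_prob(p)
               for xs in product((0, 1), repeat=n)
               for p in permutations(xs))

if __name__ == "__main__":
    print(is_exchangeable(4))                          # exchangeability for n = 4
    print(joint_prob((1, 1)), joint_prob((1,)) ** 2)   # joint law vs product of marginals
```

Here the conditioning on the mixing variable plays the role of the conditioning on the tail $\sigma$-field $\mathcal{N}$ in the theorem.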

Proof of Theorem 4.3. Let $\mathcal{N}$ be the tail $\sigma$-field as defined in the proof of Theorem 4.3. By assumption, sequences

\[ (X_1, X_2, \dots), \]
\[ (-X_1, X_2, \dots), \]
\[ (2^{-1/2}(X_1+X_2), X_3, \dots), \]
\[ (2^{-1/2}(X_1+X_2), 2^{-1/2}(X_1-X_2), X_3, X_4, \dots) \]
are all identically distributed and all have the same tail $\sigma$-field $\mathcal{N}$. Therefore, by Theorem 4.3 random variables $X_1, X_2$ are $\mathcal{N}$-conditionally independent and identically distributed; moreover, each variable has a symmetric $\mathcal{N}$-conditional distribution and $\mathcal{N}$-conditionally $X_1$ has the same distribution as $2^{-1/2}(X_1+X_2)$. The rest of the argument repeats the proof of Theorem 3.1. Namely, consider the conditional characteristic function $\phi(t) = E\{\exp(itX_1) \mid \mathcal{N}\}$. With probability one $\phi(1)$ is real by the $\mathcal{N}$-conditional symmetry of the distribution, and $\phi(t) = (\phi(2^{-1/2}t))^2$. This implies
\[ \phi(2^{-n/2}) = (\phi(1))^{1/2^n} \tag{93} \]
almost surely, $n = 0, 1, \dots$. Since $\phi(2^{-n/2}) \to \phi(0) = 1$ with probability 1, we have $\phi(1) \ne 0$ almost surely. Therefore on a subset $\Omega_0 \subset \Omega$ of probability $P(\Omega_0) = 1$, we have $\phi(1) = \exp(-R^2)$, where $R^2 \ge 0$ is an $\mathcal{N}$-measurable random variable. Applying Corollary 2.3 for each fixed $\omega \in \Omega_0$ we get that $\phi(t) = \exp(-t^2R^2)$ for all real $t$. $\Box$
The next corollary shows how much simpler the theory of infinite sequences is, compare Theorem 4.1.

Corollary 14 Let $X = (X_1, X_2, \dots)$ be an infinite spherically symmetric sequence such that $E|X_k|^\alpha < \infty$ for some real $\alpha > 0$ and all $k = 1, 2, \dots$. Suppose that for some $m \ge 1$

\[ E\{\|(X_1, \dots, X_m)\|^\alpha \mid (X_{m+1}, X_{m+2}, \dots)\} = \text{const}. \tag{94} \]
Then $X$ is Gaussian.

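Proof. From Theorem 4.3 it follows that

\[ E\{\|(X_1, \dots, X_m)\|^\alpha \mid (X_{m+1}, X_{m+2}, \dots)\} = E\{R^\alpha\|(\gamma_1, \dots, \gamma_m)\|^\alpha \mid (X_{m+1}, X_{m+2}, \dots)\}. \]
However, $R$ is measurable with respect to the tail $\sigma$-field, and hence it also is $\sigma(X_{m+1}, X_{m+2}, \dots)$-measurable for all $m$. Therefore
\[ \begin{aligned}
E\{\|(X_1, \dots, X_m)\|^\alpha \mid (X_{m+1}, X_{m+2}, \dots)\}
&= R^\alpha E\{\|(\gamma_1, \dots, \gamma_m)\|^\alpha \mid R(\gamma_{m+1}, \gamma_{m+2}, \dots)\} \\
&= R^\alpha E\{E\{\|(\gamma_1, \dots, \gamma_m)\|^\alpha \mid R, (\gamma_{m+1}, \gamma_{m+2}, \dots)\} \mid R(\gamma_{m+1}, \gamma_{m+2}, \dots)\}.
\end{aligned} \]
Since $R$ and $\vec{\gamma}$ are independent, we finally get
\[ E\{\|(X_1, \dots, X_m)\|^\alpha \mid (X_{m+1}, X_{m+2}, \dots)\} = R^\alpha E\{\|(\gamma_1, \dots, \gamma_m)\|^\alpha \mid (\gamma_{m+1}, \gamma_{m+2}, \dots)\} = C_\alpha R^\alpha. \]
Using now (94) we have $R = \text{const}$ almost surely and hence $X$ is Gaussian. $\Box$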
The following corollary of Theorem 4.3 deals with exponential distributions as defined in Section 3.5. Diaconis & Freedman [] give a dozen de Finetti-style results, including this one.

Theorem 41 If $X = (X_1, X_2, \dots)$ is an infinite sequence of non-negative random variables such that the random variable $\min\{X_1/a_1, \dots, X_n/a_n\}$ has the same distribution as $(a_1+\dots+a_n)^{-1}X_1$ for all $n$ and all $a_1, \dots, a_n > 0$, then $X = \Lambda\vec{\epsilon}$, where $\Lambda$ and $\vec{\epsilon}$ are independent random variables and $\vec{\epsilon} = (\epsilon_1, \epsilon_2, \dots)$ is a sequence of independent identically distributed exponential random variables.

Sketch of the proof: Combine Theorem 3.4 with Theorem 4.3 to get the result for the pair X1, X2. Use the reasoning from the proof of Theorem 3.4 to get the representation for any finite sequence X1, ¼, Xn, see also Proposition 3.5.

4.4  Problems

Problem 38 Prove the converse of (57). Namely, if $\phi(s)$ is the characteristic function of a one-dimensional random variable, then there is a spherically symmetric $(X_1, \dots, X_n)$ such that $\phi(\|t\|)$ is its characteristic function.

Problem 39 For centered bivariate normal r. v. $X, Y$ with variances 1 and correlation coefficient $\rho$ (see Example 2.2), show that $E\{|X|\,|Y|\} = \frac{2}{\pi}\left(\sqrt{1-\rho^2}+\rho\arcsin\rho\right)$.

Problem 40 Let $X, Y$ be i. i. d. random variables with the probability density function defined by $f(x) = C|x|^{-3}\exp(-1/x^2)$, where $C$ is a normalizing constant, and $x \in \mathbb{R}$. Show that for any choice of $a, b \in \mathbb{R}$ we have

\[ E|aX+bY| = K(a^2+b^2)^{1/2}, \]
where $K = E|X|$.

Problem 41 Using the methods used in the proof of Theorem 4.2 for p = 1 prove the following.

Theorem 42 Let $X, Y, Z \ge 0$ be i. i. d. and integrable random variables. Suppose that there is a constant $C \ne 0$ such that $E\min\{X/a, Y/b, Z/c\} = C/(a+b+c)$ for all $a, b, c > 0$. Then $X, Y, Z$ are exponential.

Problem 42 [deterministic analogue of Theorem 4.2] Show that if $X, Y$ are independent with the same distribution, and $E|aX+bY| = 0$ for some $a, b \ne 0$, then $X, Y$ are non-random.

Chapter 5
Independent linear forms

In this chapter the property of interest is the independence of linear forms in independent random variables. In Section we give a characterization result that is both simple to state and to prove; it is nevertheless of considerable interest. Section parallels Section 3.2. We use the characteristic property of the normal distribution to define abstract group-valued Gaussian random variables. In this broader context we again obtain the zero-one law; we also prove an important result about the existence of exponential moments. In Section we return to characterizations, generalizing Theorem . We show that the stochastic independence of arbitrary two linear forms characterizes the normal distribution. We conclude the chapter with abstract Gaussian results when all forces are joined.

5.1  Bernstein's theorem

The following result due to Bernstein [] characterizes the normal distribution by the independence of the sum and the difference of two independent random variables. A more general but also more difficult result is stated in Theorem below. An early precursor is Narumi [], who proves a variant of Problem . The elementary proof below is adapted from Feller [].

Theorem 43 If $X_1, X_2$ are independent random variables such that $X_1+X_2$ and $X_1-X_2$ are independent, then $X_1$ and $X_2$ are normal.

The next result is an elementary version of Theorem 2.5.

Lemma 21 If $X, Z$ are independent random variables such that $Z$ and $X+Z$ are normal, then $X$ is normal.

Indeed, the characteristic function $\phi$ of random variable $X$ satisfies

\[ \phi(t)\exp(-(t-m)^2/\sigma^2) = \exp(-(t-M)^2/S^2) \]
for some constants $m, M, \sigma, S$. Therefore $\phi(t) = \exp(at^2+bt+c)$, for some real constants $a, b, c$, and by Proposition 2.1, $\phi$ corresponds to the normal distribution.

Lemma 22 If $X, Z$ are independent random variables and $Z$ is normal, then $X+Z$ has a non-vanishing probability density function which has derivatives of all orders.

Proof. Assume for simplicity that $Z$ is $N(0, 2^{-1/2})$. Consider $f(x) = E\exp(-(x-X)^2)$. Then $f(x) \ne 0$ for each $x$, and since each derivative $\frac{d^k}{dy^k}\exp(-(y-X)^2)$ is bounded uniformly in variables $y, X$, the function $f(\cdot)$ has derivatives of all orders. It remains to observe that $\pi^{-1/2}f(\cdot)$ is the probability density function of $X+Z$. This is easily verified using the cumulative distribution function:

\[ \begin{aligned}
P(X+Z \le t) &= \pi^{-1/2}\int_{-\infty}^{\infty}\exp(-z^2)\int_{\Omega} I_{X \le t-z}\,dP\,dz \\
&= \pi^{-1/2}\int_{\Omega}\left\{\int_{-\infty}^{\infty}\exp(-z^2)\,I_{z+X \le t}\,dz\right\}dP \\
&= \pi^{-1/2}\int_{\Omega}\left\{\int_{-\infty}^{\infty}\exp(-(y-X)^2)\,I_{y \le t}\,dy\right\}dP \\
&= \pi^{-1/2}\int_{-\infty}^{t}E\exp(-(y-X)^2)\,dy.
\end{aligned} \]
$\Box$

Proof of Theorem 5.1. Let $Z_1, Z_2$ be i. i. d. normal random variables, independent of the $X$'s. Then random variables $Y_k = X_k+Z_k$, $k = 1, 2$, satisfy the assumptions of the theorem, cf. Theorem 2.2. Moreover, by Lemma 5.1, each of the $Y_k$'s has a smooth non-zero probability density function $f_k(x)$, $k = 1, 2$. The joint density of the pair $Y_1+Y_2$, $Y_1-Y_2$ is $\frac12 f_1(\frac{x+y}{2})f_2(\frac{x-y}{2})$ and by assumption it factors into the product of two functions, the first being a function of $x$ only, and the other being a function of $y$ only. Therefore the logarithms $Q_k(x) := \log f_k(\frac12 x)$, $k = 1, 2$, are twice differentiable and satisfy

\[ Q_1(x+y) + Q_2(x-y) = a(x)+b(y) \tag{95} \]
for some twice differentiable functions $a, b$ (actually $a = Q_1+Q_2$). Taking the mixed second order derivative of (95) we obtain
\[ Q_1''(x+y) = Q_2''(x-y). \tag{96} \]
Taking $x = y$ this shows that $Q_1''(x) = \text{const}$. Similarly taking $x = -y$ in (96) we get that $Q_2''(x) = \text{const}$. Therefore $Q_k(2x) = A_k+B_kx+C_kx^2$, and hence $f_k(x) = \exp(A_k+B_kx+C_kx^2)$, $k = 1, 2$. As a probability density function, $f_k$ has to be integrable, $k = 1, 2$. Thus $C_k < 0$, and then $A_k = \frac12\log(-C_k/\pi) + B_k^2/(4C_k)$ is determined uniquely by the condition that $\int f_k(x)\,dx = 1$. Thus $f_k(x)$ is a normal density and $Y_1, Y_2$ are normal. By Lemma 5.1 the theorem is proved. $\Box$
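Bernstein's theorem can be probed on a small non-normal example. The sketch below (function names are illustrative assumptions) computes the exact joint law of $(X_1+X_2, X_1-X_2)$ for i. i. d. variables with a finite law. For symmetric $\pm1$ variables the sum and the difference are uncorrelated but not independent, as the theorem predicts, while a constant (degenerate normal) variable passes the independence test trivially.

```python
from fractions import Fraction
from itertools import product

def joint_law_sum_diff(values, probs):
    """Exact joint law of (X1 + X2, X1 - X2) for i.i.d. X1, X2 with a finite law."""
    law = {}
    for (x1, p1), (x2, p2) in product(zip(values, probs), repeat=2):
        key = (x1 + x2, x1 - x2)
        law[key] = law.get(key, Fraction(0)) + p1 * p2
    return law

def is_independent(law):
    """True if a finite joint law equals the product of its marginals."""
    pu, pv = {}, {}
    for (u, v), p in law.items():
        pu[u] = pu.get(u, Fraction(0)) + p
        pv[v] = pv.get(v, Fraction(0)) + p
    return all(law.get((u, v), Fraction(0)) == pu[u] * pv[v]
               for u in pu for v in pv)

def covariance(law):
    """Covariance of the two coordinates of a finite joint law."""
    eu = sum(u * p for (u, v), p in law.items())
    ev = sum(v * p for (u, v), p in law.items())
    euv = sum(u * v * p for (u, v), p in law.items())
    return euv - eu * ev

if __name__ == "__main__":
    half = Fraction(1, 2)
    signs = joint_law_sum_diff([-1, 1], [half, half])
    print(covariance(signs), is_independent(signs))
    print(is_independent(joint_law_sum_diff([3], [Fraction(1)])))
```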

5.2  Gaussian distributions on groups

In this section we shall see that the conclusion of Theorem 5.1 is related to integrability just as the conclusion of Theorem 3.1 is related to the fact that the normal distribution is a limit distribution for sums of i. i. d. random variables, see Problem 3.6.

Let $\mathbb{G}$ be a group with a $\sigma$-field $\mathcal{F}$ such that the group operation $x, y \mapsto x+y$ is a measurable transformation $(\mathbb{G}\times\mathbb{G}, \mathcal{F}\otimes\mathcal{F}) \to (\mathbb{G}, \mathcal{F})$. Let $(\Omega, \mathcal{M}, P)$ be a probability space. A measurable function $X: (\Omega, \mathcal{M}) \to (\mathbb{G}, \mathcal{F})$ is called a $\mathbb{G}$-valued random variable and its distribution is called a probability measure on $\mathbb{G}$.

Example 11 Let $\mathbb{G} = \mathbb{R}^d$ be the vector space of all real $d$-tuples with vector addition as the group operation and with the usual Borel $\sigma$-field $\mathcal{B}$. Then a $\mathbb{G}$-valued random variable determines a probability distribution on $\mathbb{R}^d$.

Example 12 Let $\mathbb{G} = S^1$ be the group of all complex numbers $z$ such that $|z| = 1$ with multiplication as the group operation and with the usual Borel $\sigma$-field $\mathcal{F}$ generated by open sets. A distribution of a $\mathbb{G}$-valued random variable is called a probability measure on $S^1$.

Definition 12 A $\mathbb{G}$-valued random variable $X$ is $\mathcal{I}$-Gaussian (the letter $\mathcal{I}$ stands here for independence) if random variables $X+X'$ and $X-X'$, where $X'$ is an independent copy of $X$, are independent.

Clearly, any vector space is an Abelian group with vector addition as the group operation. In particular, we now have two possibly distinct notions of Gaussian vectors: the E-Gaussian vectors introduced in Section 3.2 and the $\mathcal{I}$-Gaussian vectors introduced in this section. In general, it seems not to be known when the two definitions coincide; [] gives related examples that satisfy suitable versions of the 2-stability condition (as in our definition of E-Gaussian) without being $\mathcal{I}$-Gaussian.

Let us first check that at least in some simple situations both definitions give the same result.


Example 5.2 (continued) If $\mathbb{G} = \mathbb{R}^d$ and $X$ is an $\mathbb{R}^d$-valued $\mathcal{I}$-Gaussian random variable, then for all $a_1, a_2, \dots, a_d \in \mathbb{R}$ the one-dimensional random variable $a_1X(1)+a_2X(2)+\dots+a_dX(d)$ has the normal distribution. This means that $X$ is a Gaussian vector in the usual sense, and in this case the definitions of $\mathcal{I}$-Gaussian and E-Gaussian random variables coincide. Indeed, by Theorem 5.1, if $L: \mathbb{G} \to \mathbb{R}$ is a measurable homomorphism, then the $\mathbb{R}$-valued random variable $L(X)$ is normal.


In many situations of interest the reasoning that we applied to IRd can be repeated and both the definitions are consistent with the usual interpretation of the Gaussian distribution. An important example is the vector space C[0, 1] of all continuous functions on the unit interval.

To some extent, the notion of $\mathcal{I}$-Gaussian variable is more versatile. It has wider applicability because less algebraic structure is required. Also there is some flexibility in the choice of the linear forms; the particular linear combination $X+X'$ and $X-X'$ seems to be quite arbitrary, although it might be a bit simpler for algebraic manipulations, compare the proofs of Theorem and Lemma below. This is quite different from Section 3.2; it is known, see [], that even in the real case not every pair of linear forms could be used to define an E-Gaussian random variable. Besides, $\mathcal{I}$-Gaussian variables satisfy the following variant of the E-condition. In analogy with Section 3.2, for any $\mathbb{G}$-valued random variable $X$ we may say that $X$ is E$'$-Gaussian, if $2X$ has the same distribution as $X_1+X_2+X_3+X_4$, where $X_1, X_2, X_3, X_4$ are four independent copies of $X$. Any symmetric $\mathcal{I}$-Gaussian random variable is always E$'$-Gaussian in the above sense, compare Problem . This observation allows us to repeat the proof of Theorem 3.2 in the $\mathcal{I}$-Gaussian case, proving the zero-one law. For simplicity, we chose to consider only random variables with values in a vector space $\mathbb{V}$; the notation $2^nx$ makes sense also for groups, and the reader may want to check what goes wrong with the argument below for non-Abelian groups.

Theorem 44 If $X$ is a $\mathbb{V}$-valued $\mathcal{I}$-Gaussian random variable and $\mathbb{L}$ is a linear measurable subspace of $\mathbb{V}$, then $P(X \in \mathbb{L})$ is either 0, or 1.

Indeed, let $X_1, \dots, X_n, \dots$ be independent copies of $X$, taken to be also independent from $X$. Recurrently we see that $X_1+\dots+X_{4^n}$ and $2^nX$ have the same distribution for all $n \ge 1$. Since $\mathbb{L}$ is a linear subspace of $\mathbb{V}$, we have $P(X_1+\dots+X_{4^n} \in \mathbb{L}) = P(X \in \mathbb{L})$. Put $Z = X_1+X_2+X_3+X_4$. Since $X_1+\dots+X_{4^{n+1}}$ has the same distribution as $Z+2^nX$, therefore $P(Z+2^nX \in \mathbb{L}) = P(X \in \mathbb{L})$ does not depend on $n$. As in the proof of Theorem 3.2, define events $A_n = \{Z \notin \mathbb{L}\}\cap\{Z+2^nX \in \mathbb{L}\}$. It is again easily verified that events $\{A_n\}_{n \ge 1}$ are disjoint; therefore $P(A_n) = P(X \in \mathbb{L})P(X \notin \mathbb{L}) = 0$. $\Box$

The main result of this section, Theorem , needs additional notation. This notation is natural for linear spaces. Let $\mathbb{G}$ be a group with a translation invariant metric $d(x, y)$, i.e. suppose $d(x+z, y+z) = d(x, y)$ for all $x, y, z \in \mathbb{G}$. Such a metric $d(\cdot, \cdot)$ is uniquely defined by the function $x \mapsto D(x) := d(x, 0)$. Moreover, it is easy to see that $D(x)$ has the following properties: $D(x) = D(-x)$ and $D(x+y) \le D(x)+D(y)$ for all $x, y \in \mathbb{G}$. Indeed, by translation invariance $D(-x) = d(-x, 0) = d(0, x) = d(x, 0)$ and

\[ D(x+y) = d(x+y, 0) \le d(x+y, y)+d(y, 0) = D(x)+D(y). \]

Theorem 45 Let $\mathbb{G}$ be a group with a measurable translation invariant metric $d(\cdot, \cdot)$. If $X$ is an $\mathcal{I}$-Gaussian $\mathbb{G}$-valued random variable, then $E\exp(\lambda d(X, 0)) < \infty$ for some $\lambda > 0$.

More information can be gained in concrete situations. To mention one such example of great importance, consider a $C[0, 1]$-valued $\mathcal{I}$-Gaussian random variable, i.e. a Gaussian stochastic process with continuous trajectories. Theorem 5.2 says that

\[ E\exp\Big(\lambda\sup_{0 \le t \le 1}|X(t)|\Big) < \infty \]
for some $\lambda > 0$. On the other hand, $C[0, 1]$ is a normed space and another (equivalent) definition applies; Theorem below implies a stronger integrability property
\[ E\exp\Big(\lambda\sup_{0 \le t \le 1}|X(t)|^2\Big) < \infty \]
for some $\lambda > 0$. However, even the weaker conclusion of Theorem 5.2 implies that the real random variable $\sup_{0 \le t \le 1}|X(t)|$ has a moment generating function and that all its moments are finite. Lemma below is another application of the same line of reasoning.

Proof of Theorem 5.2. Consider a real function $N(x) := P(D(X) \ge x)$, where as before $D(x) := d(x, 0)$. We shall show that there is $x_0$ such that

\[ N(2x) \le 8(N(x-x_0))^2 \tag{97} \]
for each $x \ge x_0$. By Corollary 1.3 this will end the proof.

Let $X_1, X_2$ be independent copies of $X$. Inequality (97) follows from the fact that the event $\{D(X_1) \ge 2x\}$ implies that either the event $\{D(X_1) \ge 2x\}\cap\{D(X_2) \ge 2x_0\}$, or the event

\[ \{D(X_1+X_2) \ge 2(x-x_0)\}\cap\{D(X_1-X_2) \ge 2(x-x_0)\} \]
occurs.

Indeed, let $x_0$ be such that $P(D(X_2) \ge 2x_0) \le 1/2$. If $D(X_1) \ge 2x$ and $D(X_2) < 2x_0$ then $D(X_1 \pm X_2) \ge D(X_1)-D(X_2) \ge 2(x-x_0)$. Therefore using independence and the trivial bound

\[ P(D(X_1+X_2) \ge 2a) \le P(D(X_1) \ge a)+P(D(X_2) \ge a), \]
we obtain

\[ \begin{aligned}
P(D(X_1) \ge 2x) &\le P(D(X_1) \ge 2x)P(D(X_2) \ge 2x_0) + P(D(X_1+X_2) \ge 2(x-x_0))\,P(D(X_1-X_2) \ge 2(x-x_0)) \\
&\le \tfrac12 N(2x)+4N^2(x-x_0)
\end{aligned} \]
for each $x \ge x_0$. $\Box$
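Inequality (97) can be checked numerically in the simplest case. The sketch below assumes $X$ is a real standard normal variable with $D(x) = |x|$, so that $N(x) = P(|X| \ge x)$ is given by the complementary error function. The value $x_0 = 0.34$ is an illustrative choice for which $P(|X| \ge 2x_0) \le 1/2$, and the grid of test points is arbitrary.

```python
import math

def tail(x):
    """N(x) = P(|Z| >= x) for a standard normal Z; equals 1 for x <= 0."""
    return 1.0 if x <= 0 else math.erfc(x / math.sqrt(2))

# Illustrative choice: P(|Z| >= 2*X0) = P(|Z| >= 0.68) is slightly below 1/2.
X0 = 0.34

def inequality_97_holds(x):
    """Check N(2x) <= 8 N(x - x0)^2 at a single point x >= x0."""
    return tail(2 * x) <= 8 * tail(x - X0) ** 2

if __name__ == "__main__":
    grid = [X0 + 0.01 * k for k in range(1000)]
    print(all(inequality_97_holds(x) for x in grid))
```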

More theory of Gaussian distributions on groups can be developed when more structure is available, although technical difficulties arise; for instance, the Cramér theorem (Theorem 2.5) fails on the torus, see Marcinkiewicz []. Series expansion questions (cf. Theorem 2.2 and the remark preceding Theorem ) are studied in [], see also references therein. One can also study Gaussian distributions on normed vector spaces. In Section below we shall see to what extent this extra structure is helpful for the integrability question; there are deep questions specific to this situation, such as what are the properties of the distribution of the real r. v. $\|X\|$; see []. Another research subject, entirely left out from this book, is Gaussian distributions on Lie groups; for more information see e.g. []. Further information about abstract Gaussian random variables can be found also in [,,,].

5.3  Independence of linear forms

The next result generalizes Theorem to more general linear forms of a given independent sequence $X_1, \dots, X_n$. An even more general result, which also admits zero coefficients in the linear forms, was obtained independently by Darmois [] and Skitovich []. Multi-dimensional variants of Theorem are also known, see []. A Banach space version of Theorem was proved in [].

Theorem 46 If $X_1, \dots, X_n$ is a sequence of independent random variables such that the linear forms $\sum_{k=1}^n a_kX_k$ and $\sum_{k=1}^n b_kX_k$ have all non-zero coefficients and are independent, then random variables $X_k$ are normal for all $1 \le k \le n$.

Our proof of Theorem 5.3 uses additional information about the existence of moments, which then allows us to use an argument from [] (see also []). Notice that we don't allow for vanishing coefficients; the latter case is covered by [] but the proof is considerably more involved.

We need a suitable generalization of Theorem 5.2, which for simplicity we state here for real valued random variables only. The method of proof seems also to work in more general context under the assumption of independence of certain nonlinear statistics, compare [], [] and Lemma below.

Lemma 23 Let $a_1, \dots, a_n$, $b_1, \dots, b_n$ be two sequences of non-zero real numbers. If $X_1, \dots, X_n$ is a sequence of independent random variables such that two linear forms $\sum_{k=1}^n a_kX_k$ and $\sum_{k=1}^n b_kX_k$ are independent, then random variables $X_k$, $k = 1, 2, \dots, n$ have finite moments of all orders.

Proof. We shall repeat the idea from the proof of Theorem 5.2 with suitable technical modifications. Suppose that $0 < \epsilon \le |a_k|, |b_k| \le K < \infty$ for $k = 1, 2, \dots, n$. For $x \ge 0$ denote $N(x) := \max_{j \le n}P(|X_j| \ge x)$ and let $C = 2nK/\epsilon$. For $1 \le j \le n$ we have trivially

\[ P(|X_j| \ge Cx) \le P(|X_j| \ge Cx,\ |X_k| \le x\ \ \forall k \ne j) + \sum_{k \ne j}P(|X_j| \ge x)P(|X_k| \ge x). \]
Notice that the event $A_j := \{|X_j| \ge Cx\}\cap\{|X_k| \le x\ \ \forall k \ne j\}$ implies that both $|\sum_{k=1}^n a_kX_k| \ge nKx$ and $|\sum_{k=1}^n b_kX_k| \ge nKx$. Indeed,
\[ \Big|\sum_{k=1}^n a_kX_k\Big| \ge |X_j|\,|a_j| - \sum_{k \ne j}|a_kX_k| \ge (\epsilon C - nK)x = nKx \]
and the second inclusion follows analogously. By independence of the linear forms this shows that
\[ P(|X_j| \ge Cx) \le P\Big(\Big|\sum_{k=1}^n a_kX_k\Big| \ge nKx\Big)P\Big(\Big|\sum_{k=1}^n b_kX_k\Big| \ge nKx\Big) + \sum_{k \ne j}P(|X_j| \ge x)P(|X_k| \ge x). \]
Therefore $N(Cx) \le P(|\sum_{k=1}^n a_kX_k| \ge nKx)P(|\sum_{k=1}^n b_kX_k| \ge nKx)+nN^2(x)$. Using the trivial bound
\[ P\Big(\Big|\sum_{k=1}^n a_kX_k\Big| \ge nKx\Big) \le nN(x), \]
we get
\[ N(Cx) \le 2n^2N^2(x). \]
Corollary 1.3 now ends the proof. $\Box$
Proof of Theorem 5.3. We shall begin with reducing the theorem to the case with more information about the coefficients of the linear forms. Namely, we shall reduce the proof to the case when all ak = 1, and all bk are different.

Since all $a_k$ are non-zero, normality of $X_k$ is equivalent to normality of $a_kX_k$; hence passing to $X_k' = a_kX_k$, we may assume that $a_k = 1$, $1 \le k \le n$. Then, as the second step of the reduction, without loss of generality we may assume that all $b_j$'s are different. Indeed, if, e.g., $b_1 = b_2$, then substituting $X_1' = X_1+X_2$ we get $(n-1)$ independent random variables $X_1', X_3, X_4, \dots, X_n$ which still satisfy the assumptions of Theorem 5.3; and if we manage to prove that $X_1'$ is normal, then by Theorem 2.5 the original random variables $X_1, X_2$ are normal, too.

The reduction argument allows us to assume without loss of generality that $a_k = 1$, $1 \le k \le n$, and that $b_1, b_2, \dots, b_n$ are non-zero and pairwise different. In particular, the coefficients of linear forms satisfy the assumption of Lemma 5.3. Therefore random variables $X_1, \dots, X_n$ have finite moments of all orders and linear forms $\sum_{k=1}^n X_k$ and $\sum_{k=1}^n b_kX_k$ are independent.

The joint characteristic function of $\sum_{k=1}^n X_k$, $\sum_{k=1}^n b_kX_k$ is

\[ \phi(t, s) = \prod_{k=1}^n \phi_k(t+b_ks), \]
where $\phi_k$ is the characteristic function of random variable $X_k$, $k = 1, \dots, n$. By independence of linear forms $\phi(t, s)$ factors
\[ \phi(t, s) = \Psi_1(t)\Psi_2(s). \]
Hence
\[ \prod_{k=1}^n \phi_k(t+b_ks) = \Psi_1(t)\Psi_2(s). \tag{98} \]
Passing to the logarithms $Q_k = \log\phi_k$ in a neighborhood of 0, from (98) we obtain
\[ \sum_{k=1}^n Q_k(t+b_ks) = w_1(t)+w_2(s). \tag{99} \]
By Lemma 5.3 functions $Q_k$ and $w_j$ have derivatives of all orders, see Theorem 1.5. Consecutive differentiation of (99) with respect to variable $s$ at $s = 0$ leads to the following system of equations
\[ \begin{aligned}
\sum_{k=1}^n b_kQ_k'(t) &= w_2'(0), \\
\sum_{k=1}^n b_k^2Q_k''(t) &= w_2''(0), \\
&\ \,\vdots \\
\sum_{k=1}^n b_k^nQ_k^{(n)}(t) &= w_2^{(n)}(0).
\end{aligned} \tag{100} \]
Differentiation with respect to $t$ gives now
\[ \begin{aligned}
\sum_{k=1}^n b_kQ_k^{(n)}(t) &= 0, \\
\sum_{k=1}^n b_k^2Q_k^{(n)}(t) &= 0, \\
&\ \,\vdots \\
\sum_{k=1}^n b_k^{n-1}Q_k^{(n)}(t) &= 0, \\
\sum_{k=1}^n b_k^nQ_k^{(n)}(t) &= \text{const}
\end{aligned} \tag{101} \]
(clearly, the last equation was not differentiated).

Equations (101) form a system of linear equations for the unknown values $Q_k^{(n)}(t)$, $1 \le k \le n$. Since all $b_j$ are non-zero and different, the determinant of the system is non-zero. The unique solution $Q_k^{(n)}(t)$ of the system is $Q_k^{(n)}(t) = \text{const}_k$ and does not depend on $t$. This means that in a neighborhood of 0 each of the characteristic functions $\phi_k(\cdot)$ can be written as $\phi_k(t) = \exp(P_k(t))$, where $P_k$ is a polynomial of at most $n$-th degree. Theorem 2.5 now concludes the proof. $\Box$
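The non-vanishing of the determinant of system (101) is a Vandermonde-type fact: the matrix with entries $b_k^j$, $1 \le j, k \le n$, has determinant $\prod_k b_k \prod_{i<j}(b_j-b_i)$. The sketch below checks this identity in exact rational arithmetic for sample coefficients; the helper names and the sample values of $b_k$ are illustrative assumptions.

```python
from fractions import Fraction
from math import prod

def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n, det = len(a), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def system_matrix(b):
    """Matrix of system (101): row j (j = 1..n) holds b_k^j for k = 1..n."""
    n = len(b)
    return [[Fraction(bk) ** j for bk in b] for j in range(1, n + 1)]

def product_formula(b):
    """prod_k b_k * prod_{i<j} (b_j - b_i)."""
    n = len(b)
    return prod(b) * prod(b[j] - b[i] for i in range(n) for j in range(i + 1, n))

if __name__ == "__main__":
    b = [1, 2, -3, Fraction(1, 2)]      # non-zero and pairwise different
    print(determinant(system_matrix(b)), product_formula(b))
    print(determinant(system_matrix([1, 2, 2])))   # repeated coefficient
```

With a repeated coefficient the determinant vanishes, which is why the reduction to pairwise different $b_k$ is made before solving the system.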

Remark: Additional integrability information was used to solve equation (99). In general equation (99) has the same solution but the proof is more difficult, see [].

5.4  Strongly Gaussian vectors

Following Fernique, we give yet another definition of a Gaussian random variable.

Let $\mathbb{V}$ be a linear space and let $X$ be a $\mathbb{V}$-valued random variable. Denote by $X'$ an independent copy of $X$.

Definition 13 $X$ is S-Gaussian (S stands here for strong) if for all real $\alpha$ the random variables $\cos(\alpha)X'+\sin(\alpha)X$ and $\sin(\alpha)X'-\cos(\alpha)X$ are independent and have the same distribution as $X$.

Clearly any S-Gaussian random vector is both $\mathcal{I}$-Gaussian and E-Gaussian, which motivates the adjective ``strong''. Let us quickly show how Theorems 3.2 and 5.2 can be obtained for S-Gaussian vectors. The proofs follow Fernique [].

Theorem 47 If $X$ is a $\mathbb{V}$-valued S-Gaussian random variable and $\mathbb{L}$ is a linear measurable subspace of $\mathbb{V}$, then $P(X \in \mathbb{L})$ is either equal to 0, or to 1.

Proof. Let $X, X'$ be independent copies of $X$. For each $0 < \alpha < \pi/2$, let $X_\alpha = \cos(\alpha)X+\sin(\alpha)X'$, and consider the event

\[ A(\alpha) = \{\omega: X_\alpha(\omega) \in \mathbb{L}\}\cap\{X_{\pi/2-\alpha}(\omega) \notin \mathbb{L}\}. \]
Clearly $P(A(\alpha)) = P(X \in \mathbb{L})P(X \notin \mathbb{L})$. Moreover, it is easily seen that $\{A(\alpha)\}_{0 < \alpha < \pi/2}$ are pairwise disjoint events. Indeed, if $A(\alpha)\cap A(\beta) \ne \emptyset$, then we would have vectors $v, w$ such that $\cos(\alpha)v+\sin(\alpha)w \in \mathbb{L}$, $\cos(\beta)v+\sin(\beta)w \in \mathbb{L}$, which for $\alpha \ne \beta$ implies that $v, w \in \mathbb{L}$. This contradicts $\cos(\pi/2-\alpha)v+\sin(\pi/2-\alpha)w \notin \mathbb{L}$. Therefore $P(A(\alpha)) = 0$ for each $\alpha$ and in particular $P(X \in \mathbb{L})P(X \notin \mathbb{L}) = 0$, which ends the proof. $\Box$
The next result is taken from Fernique []. It strengthens considerably the conclusion of Theorem 3.2.2.

Theorem 48 Let $\mathbb{V}$ be a normed linear space with the measurable norm $\|\cdot\|$. If $X$ is an S-Gaussian $\mathbb{V}$-valued random variable, then there is $\epsilon > 0$ such that $E\exp(\epsilon\|X\|^2) < \infty$.

Proof. As previously, let $N(x) := P(\|X\| \ge x)$. Let $X_1, X_2$ be independent copies of $X$. It follows from the definition that

\[ \|X_1\|,\ \|X_2\| \]
and
\[ 2^{-1/2}\|X_1+X_2\|,\ 2^{-1/2}\|X_1-X_2\| \]
are two pairs of independent copies of $\|X\|$. Therefore for any $0 \le y \le x$ we have the following estimate
\[ \begin{aligned}
N(x) &= P(\|X_1\| \ge x,\ \|X_2\| \ge y)+P(\|X_1\| \ge x,\ \|X_2\| < y) \\
&\le N(x)N(y)+P(\|X_1+X_2\| \ge x-y)\,P(\|X_1-X_2\| \ge x-y).
\end{aligned} \]
Thus
\[ N(x) \le N(x)N(y)+N^2(2^{-1/2}(x-y)). \tag{102} \]
Take $x_0$ such that $N(x_0) \le 1/2$. Putting $y = x_0$ and substituting $x = \sqrt{2}\,t$, $x_0 = \sqrt{2}\,t_0$ in (102) we get
\[ N(\sqrt{2}\,t) \le 2N^2(t-t_0) \tag{103} \]
for each $t \ge t_0$. This is similar to, but more precise than (97). Corollary 1.3 ends the proof. $\Box$
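As a plausibility check of (103), one can take the simplest S-Gaussian example, a real standard normal variable, for which $N(x) = P(|X| \ge x)$ is explicit. The sketch below assumes the illustrative value $x_0 = 0.68$ (so that $N(x_0) \le 1/2$), sets $t_0 = x_0/\sqrt{2}$, and tests the inequality on an arbitrary grid.

```python
import math

def tail(x):
    """N(x) = P(|Z| >= x) for a standard normal Z; equals 1 for x <= 0."""
    return 1.0 if x <= 0 else math.erfc(x / math.sqrt(2))

X0 = 0.68                    # illustrative choice with N(X0) slightly below 1/2
T0 = X0 / math.sqrt(2)       # the substitution x0 = sqrt(2) * t0

def inequality_103_holds(t):
    """Check N(sqrt(2) t) <= 2 N(t - t0)^2 at a single point t >= t0."""
    return tail(math.sqrt(2) * t) <= 2 * tail(t - T0) ** 2

if __name__ == "__main__":
    grid = [T0 + 0.01 * k for k in range(1000)]
    print(all(inequality_103_holds(t) for t in grid))
```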

5.5  Joint distributions

Suppose $X_1, \dots, X_n$, $n \ge 1$, are (possibly dependent) random variables such that the joint distribution of $n$ linear forms $L_1, L_2, \dots, L_n$ in variables $X_1, \dots, X_n$ is given. Then, except in the degenerate cases, the joint distribution of $(L_1, L_2, \dots, L_n)$ determines uniquely the joint distribution of $(X_1, \dots, X_n)$. The point to be made here is that if $X_1, \dots, X_n$ are independent, then even degenerate transformations provide a lot of information. This phenomenon is responsible for results in Chapters 3 and 5. More general results which have little to do with the Gaussian distribution are also known. For instance, if $X_1, X_2, X_3$ are independent, then the joint distribution $\mu(dx, dy)$ of the pair $X_1-X_2$, $X_2-X_3$ determines the distribution of $X_1, X_2, X_3$ up to a change of location, provided that the characteristic function of $\mu$ does not vanish, see []. This result was found independently by a number of authors, see [,,]; for related results see also [,]. Nonlinear functions were analyzed in [] and the references therein.

5.6  Problems

Problem 43 Let $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$ be two sequences of i. i. d. copies of random variables $X, Y$ respectively. Suppose $X, Y$ have finite second moments and are such that $U = X+Y$ and $V = X-Y$ are independent. Observe that in distribution $X = \frac12(U+V) \cong \frac12(X_1+Y_1+X_2-Y_2)$, etc. Use this observation and the Central Limit Theorem to prove Theorem 5.1 under the additional assumption of finiteness of second moments.

Problem 44 Let $X$ and $Y$ be two independent identically distributed random variables such that $U = X+Y$ and $V = X-Y$ are also independent. Observe that $2X = U+V$ and hence the characteristic function $\phi(\cdot)$ of $X$ satisfies the equation $\phi(2t) = (\phi(t))^3\phi(-t)$. Use this observation to prove Theorem 5.1 under the additional assumption that $X, Y$ are i. i. d.

Problem 45 [Deterministic version of Theorem 5.1] Suppose X,U,V are independent and X+U, X+V are independent. Show that X is non-random.

The next problem gives a one dimensional converse to Theorem 2.2.

Problem 46 [From []] Let $X, Y$ be (dependent) random variables such that for some number $r \ne 0, \pm1$ both $X-rY$ and $Y$ are independent and also $Y-rX$ and $X$ are independent. Show that $(X, Y)$ has bivariate normal distribution.

Chapter 6
Stability and weak stability

The stability problem is the question of to what extent the conclusion of a theorem is sensitive to small changes in the assumptions. Such description is, of course, vague until the questions of how to quantify the departures both from the conclusion and from the assumption are answered. The latter is to some extent arbitrary; in the characterization context, typically, stability reasoning depends on the ability to prove that small changes (measured with respect to some measure of smallness) in assumptions of a given characterization theorem result in small departures (measured with respect to one of the distances of distributions) from the normal distribution.

Below we present only one stability result; more about stability of characterizations can be found in [], see also []. In Section we also give two results that establish what one may call weak stability. Namely, we establish that moderate changes in assumptions still preserve some properties of the normal distribution. Theorem below is the only result of this chapter used later on.

6.1  Coefficients of dependence

In this section we introduce a class of measures of departure from independence, which we shall call coefficients of dependence. There is no natural measure of dependence between random variables; those defined below have been used to define strong mixing conditions in limit theorems; for the latter the reader is referred to []; see also [].

To make the definition look less arbitrary, we first consider an infinite parametric family of measures of dependence. For a pair of $\sigma$-fields $\mathcal{F}, \mathcal{G}$ let

\[
\alpha_{r,s}(\mathcal{F},\mathcal{G}) = \sup\left\{ \frac{|P(A\cap B)-P(A)P(B)|}{P(A)^r P(B)^s} : A\in\mathcal{F},\ B\in\mathcal{G} \text{ non-trivial} \right\}
\]
with the range of parameters $0\le r\le 1$, $0\le s\le 1$, $r+s\le 1$. Clearly, $\alpha_{r,s}$ is a number between 0 and 1. It is obvious that $\alpha_{r,s}=0$ if and only if the $\sigma$-fields $\mathcal{F}, \mathcal{G}$ are independent. Therefore one could use each of the coefficients $\alpha_{r,s}$ as a measure of departure from independence.
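For $\sigma$-fields generated by two finite-valued random variables, the supremum in the definition runs over finitely many pairs of events and can be evaluated by brute force. The sketch below is only an illustration under that assumption; the function names and the two example tables are hypothetical and do not come from the text.

```python
from itertools import combinations

def proper_subsets(n):
    """Index sets of all non-trivial events: non-empty, proper subsets of range(n)."""
    return [c for k in range(1, n) for c in combinations(range(n), k)]

def alpha(joint, r, s):
    """Brute-force alpha_{r,s} for the sigma-fields generated by two finite-valued variables.

    joint[i][j] = P(X = x_i, Y = y_j); all marginal probabilities are assumed positive.
    """
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    best = 0.0
    for A in proper_subsets(len(px)):
        pa = sum(px[i] for i in A)
        for B in proper_subsets(len(py)):
            pb = sum(py[j] for j in B)
            pab = sum(joint[i][j] for i in A for j in B)
            best = max(best, abs(pab - pa * pb) / (pa ** r * pb ** s))
    return best

# Hypothetical examples: a product measure and a dependent 2 x 2 table.
independent = [[0.2, 0.2], [0.3, 0.3]]   # marginals (0.4, 0.6) and (0.5, 0.5)
dependent = [[0.4, 0.1], [0.1, 0.4]]

print(alpha(independent, 0, 0))          # zero up to rounding
print(alpha(dependent, 0, 0), alpha(dependent, 1, 0), alpha(dependent, 0.5, 0.5))
```

The cost grows exponentially with the number of values, so this is meant for small tables only.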

Fortunately, among the infinite number of coefficients of dependence thus introduced, there are just four really distinct ones, namely $\alpha_{0,0}$, $\alpha_{0,1}$, $\alpha_{1,0}$, and $\alpha_{1/2,1/2}$. By this we mean that the convergence to zero of $\alpha_{r,s}$ (when the $\sigma$-fields $\mathcal{F}, \mathcal{G}$ vary) is equivalent to the convergence to 0 of one of the above four coefficients. And since $\alpha_{0,1}$ and $\alpha_{1,0}$ are mirror images of each other, we are actually left with three coefficients only.

The formal statement of this equivalence takes the form of the following inequalities.

Proposition 9 If $r+s<1$, then $\alpha_{r,s}\le(\alpha_{0,0})^{1-r-s}$.

If $r+s=1$ and $0<r\le 1/2\le s<1$, then $\alpha_{r,s}\le(\alpha_{1/2,1/2})^{2r}$.

Proof. The first inequality follows from the fact that

\[
\frac{|P(A\cap B)-P(A)P(B)|}{P(A)^rP(B)^s}
= |P(A\cap B)-P(A)P(B)|^{1-r-s}\,|P(B|A)-P(B)|^{r}\,|P(A|B)-P(A)|^{s}
\le |P(A\cap B)-P(A)P(B)|^{1-r-s}.
\]
The second one is a consequence of
\[
\frac{|P(A\cap B)-P(A)P(B)|}{P(A)^rP(B)^s}
= \left(\frac{|P(A\cap B)-P(A)P(B)|}{P(A)^{1/2}P(B)^{1/2}}\right)^{2r}|P(A|B)-P(A)|^{s-r}
\le (\alpha_{1/2,1/2})^{2r}.
\]
$\Box$
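The two inequalities can be checked numerically on small finite examples. The following sketch assumes $\sigma$-fields generated by two three-valued random variables with a randomly drawn joint distribution; all names are hypothetical, and a passing check only illustrates the statement on those examples.

```python
import random
from itertools import combinations

def alpha(joint, r, s):
    """Brute-force alpha_{r,s} for a finite joint pmf with positive marginals."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    def proper(n):
        return [c for k in range(1, n) for c in combinations(range(n), k)]
    best = 0.0
    for A in proper(len(px)):
        pa = sum(px[i] for i in A)
        for B in proper(len(py)):
            pb = sum(py[j] for j in B)
            pab = sum(joint[i][j] for i in A for j in B)
            best = max(best, abs(pab - pa * pb) / (pa ** r * pb ** s))
    return best

def random_joint(rows, cols, rng):
    """A random strictly positive joint pmf on a rows x cols grid."""
    w = [[rng.random() + 0.01 for _ in range(cols)] for _ in range(rows)]
    total = sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]

def proposition_holds(joint, steps=10, tol=1e-12):
    """Check both inequalities on the grid r = i/steps, s = j/steps."""
    a00 = alpha(joint, 0, 0)
    ahh = alpha(joint, 0.5, 0.5)
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            r, s = i / steps, j / steps
            if i + j < steps:                      # case r + s < 1
                ok = alpha(joint, r, s) <= a00 ** (1 - r - s) + tol
            elif 0 < 2 * i <= steps:               # r + s = 1 and 0 < r <= 1/2 <= s < 1
                ok = alpha(joint, r, s) <= ahh ** (2 * r) + tol
            else:
                ok = True                          # not covered by the statement
            if not ok:
                return False
    return True

rng = random.Random(0)
print(all(proposition_holds(random_joint(3, 3, rng)) for _ in range(20)))
```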
Coefficients $\alpha_{0,0}$ and $\alpha_{0,1}$, $\alpha_{1,0}$ are the basis for the definition of classes of stationary sequences called, in the limit theorems literature, strong mixing and uniform strong mixing (also called $\phi$-mixing); $\alpha_{1/2,1/2}$ is equivalent to the maximal correlation coefficient (106), which is the basis of the so-called $\rho$-mixing condition. Monograph [] gives a recent exposition and relevant references; see also [].

There is also a whole continuous spectrum of non-equivalent coefficients $\alpha_{r,s}$ when $r+s>1$. As those coefficients may attain the value $\infty$, they are less frequently used; one notable exception is $\alpha_{1,1}$, which is the basis of the so-called $\psi$-mixing condition and occurs occasionally in the assumptions of some limit theorems. A condition equivalent to $\alpha_{1,1}<\infty$ and conditions related to $\alpha_{r,s}$ with $r+s>1$ are also employed in large deviation theorems, see [].

The following bounds$^{20}$ for the covariances between random variables in $L_p(\mathcal{F})$ and in $L_q(\mathcal{G})$ will be used later on.

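Proposition 10 If $X$ is $\mathcal{F}$-measurable with finite $p$-th moment ($1\le p\le\infty$) and $Y$ is $\mathcal{G}$-measurable with finite $q$-th moment ($1\le q\le\infty$) and $1/p+1/q\le 1$, then

\[
|EXY-EXEY| \le 4(\alpha_{0,0})^{1-1/p-1/q}(\alpha_{1,0})^{1/p}(\alpha_{0,1})^{1/q}\,\|X\|_p\,\|Y\|_q,
\tag{104}
\]
where $\|X\|_p=(E|X|^p)^{1/p}$ if $p<\infty$ and $\|X\|_\infty=\operatorname{ess\,sup}|X|$.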

Proof. We shall prove the result for $p=1$, $q=\infty$ and for $p=q=\infty$ only; these are the only cases we shall actually need. For the general case, see e.g. [] or [].

Let $M=\operatorname{ess\,sup}|Y|$. Switching the order of integration (i.e. by Fubini's theorem) we get, see Problem 1.9,

\[
\begin{aligned}
|EXY-EXEY|
&= \left|\int_{-\infty}^{\infty}\int_{-M}^{M}\big(P(X\ge t, Y\ge s)-P(X\ge t)P(Y\ge s)\big)\,dt\,ds\right| \\
&\le \int_{-\infty}^{\infty}\int_{-M}^{M}\big|P(X\ge t, Y\ge s)-P(X\ge t)P(Y\ge s)\big|\,dt\,ds.
\end{aligned}
\tag{105}
\]
Since
\[
|P(X\ge t, Y\ge s)-P(X\ge t)P(Y\ge s)| \le \alpha_{1,0}P(X\ge t)
\]
(which is good for positive $t$) and
\[
|P(X\ge t, Y\ge s)-P(X\ge t)P(Y\ge s)| = |P(X<t, Y\ge s)-P(X<t)P(Y\ge s)| \le \alpha_{1,0}P(X\le t)
\]
(which works well for negative $t$), inequality (105) implies

\[
|EXY-EXEY| \le \alpha_{1,0}\int_{0}^{\infty}\int_{-M}^{M}P(X\ge t)\,dt\,ds
+\alpha_{1,0}\int_{0}^{\infty}\int_{-M}^{M}P(X\le -t)\,dt\,ds
= 2\alpha_{1,0}\,E|X|\,\|Y\|_\infty.
\]
A similar argument using $|P(X\ge t, Y\ge s)-P(X\ge t)P(Y\ge s)|\le\alpha_{0,0}$ gives
\[
|EXY-EXEY| \le 4\alpha_{0,0}\,\|X\|_\infty\,\|Y\|_\infty.
\]
$\Box$
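For two binary random variables both bounds from the proof can be evaluated in closed form, which gives a quick sanity check. The sketch below assumes two-valued $X$ and $Y$ with a hypothetical joint table; in this case every pair of non-trivial events yields the same value of $|P(A\cap B)-P(A)P(B)|$, so the coefficients need no search.

```python
def covariance_and_bounds(joint, xs, ys):
    """Covariance of two binary variables and the two bounds from the proof.

    joint[i][j] = P(X = xs[i], Y = ys[j]) with i, j in {0, 1}.
    Returns (cov, 2*alpha_10*E|X|*sup|Y|, 4*alpha_00*sup|X|*sup|Y|).
    """
    (p00, p01), (p10, p11) = joint
    px = [p00 + p01, p10 + p11]
    ex = sum(p * x for p, x in zip(px, xs))
    ey = (p00 + p10) * ys[0] + (p01 + p11) * ys[1]
    exy = sum(joint[i][j] * xs[i] * ys[j] for i in range(2) for j in range(2))
    cov = exy - ex * ey
    # For binary variables |P(A and B) - P(A)P(B)| = |p00*p11 - p01*p10|
    # for every choice of non-trivial events A, B.
    d = abs(p00 * p11 - p01 * p10)
    alpha_00 = d
    alpha_10 = d / min(px)            # the event A ranges over sigma(X)
    e_abs_x = sum(p * abs(x) for p, x in zip(px, xs))
    sup_x = max(abs(x) for x in xs)
    sup_y = max(abs(y) for y in ys)
    return cov, 2 * alpha_10 * e_abs_x * sup_y, 4 * alpha_00 * sup_x * sup_y

# Symmetric +-1 variables: here the bound with constant 4 holds with equality.
print(covariance_and_bounds([[0.4, 0.1], [0.1, 0.4]], (-1, 1), (-1, 1)))
# An asymmetric hypothetical example.
print(covariance_and_bounds([[0.5, 0.2], [0.1, 0.2]], (0, 3), (-1, 2)))
```

In the symmetric example the covariance equals $4\alpha_{0,0}$, so the constant 4 in the case $p=q=\infty$ cannot be lowered.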

6.1.1  Normal case

Here we review without proofs the relations between the dependence coefficients in the multivariate normal case. Ideas behind the proofs can be found in the solutions to the Problems , , and .

The first result points out that the coefficients $\alpha_{0,1}$ and $\alpha_{1,0}$ are of little interest in the normal case.

Theorem 49 Suppose $(X, Y)\in\mathbb{R}^{d_1+d_2}$ are jointly normal and $\alpha_{0,1}(X,Y)<1$. Then $X$, $Y$ are independent.

Denote by $\rho$ the maximal correlation coefficient

\[
\rho = \sup\{\operatorname{corr}(f(X), g(Y)) : f(X), g(Y)\in L_2\}.
\tag{106}
\]
The following estimate due to Kolmogorov & Rozanov [] shows that in the normal case the maximal correlation coefficient (106) can be estimated by $\alpha_{0,0}$. In particular, in the normal case we have
\[
\alpha_{1/2,1/2}\le 2\pi\alpha_{0,0}.
\]

Theorem 50 Suppose $(X, Y)\in\mathbb{R}^{d_1+d_2}$ are jointly normal. Then

\[
\operatorname{corr}(f(X),g(Y)) \le 2\pi\alpha_{0,0}(X,Y)
\]
for all square integrable $f$, $g$.
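For a bivariate normal pair with correlation $\rho\ge 0$, the estimate can be compared with one explicit pair of events. The sketch below relies on Sheppard's quadrant formula $P(X>0, Y>0)=1/4+\arcsin(\rho)/(2\pi)$ for standard normal marginals, a classical fact that is not stated in the text; the function name is hypothetical. The events $\{X>0\}$ and $\{Y>0\}$ give a lower bound for $\alpha_{0,0}(X,Y)$, and $2\pi$ times that bound already dominates the ordinary correlation $\rho$ (the case of linear $f$, $g$).

```python
import math

def quadrant_gap(rho):
    """P(X > 0, Y > 0) - P(X > 0) P(Y > 0) for a standard bivariate normal pair
    with correlation rho, computed from Sheppard's formula."""
    return math.asin(rho) / (2.0 * math.pi)

for rho in (0.1, 0.5, 0.9):
    lower = quadrant_gap(rho)            # a lower bound for alpha_{0,0}(X, Y)
    # 2*pi*lower = asin(rho), which is at least rho on [0, 1]
    print(rho, 2.0 * math.pi * lower)
```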

The next inequality is known as Nelson's hypercontractive estimate [] and is of importance in mathematical physics. It is also known in general that inequality (107) implies a bound for the maximal correlation, see [].

Theorem 51 Suppose $(X, Y)\in\mathbb{R}^{d_1+d_2}$ are jointly normal. Then

\[
Ef(X)g(Y) \le \|f(X)\|_p\,\|g(Y)\|_p
\tag{107}
\]
for all $p$-integrable $f$, $g$, provided $p\ge 1+\rho$, where $\rho$ is the maximal correlation coefficient (106).

6.2  Weak stability

A weak version of the stability problem allows relatively large departures from the assumptions of a given theorem. In return, only a selected part of the conclusion is to be preserved. In this section the part of the characterization conclusion that we want to preserve is integrability. This problem is of independent interest. Integrability results are often useful as a first step in some proofs, see the proof of Theorem 5.3, or the proof of Theorem below.

As a simple example of weak stability we first consider Theorem 5.1, which says that for independent r.v. $X$, $Y$ we have $\alpha_{1,0}(X+Y, X-Y)=0$ only in the normal case. We shall show that if the coefficient of dependence $\alpha_{1,0}(X+Y, X-Y)$ is small, then the distribution of $X$ still has some finite moments. The method of proof is an adaptation of the proof of Theorem 5.2.

Proposition 11 Suppose $X$, $Y$ are independent random variables such that the random variables $X+Y$ and $X-Y$ satisfy $\alpha_{1,0}(X+Y, X-Y)<1/2$. Then $X$ and $Y$ have finite moments $E|X|^\beta<\infty$ for $\beta<-\log_2(2\alpha_{1,0})$.

Proof. Let $N(x)=\max\{P(|X|\ge x), P(|Y|\ge x)\}$. Put $\alpha=\alpha_{1,0}$. We shall show that for each $r>2\alpha$, there is $x_0>0$ such that

\[
N(2x)\le rN(x-x_0)
\tag{108}
\]
for all $x\ge x_0$.

Inequality (108) follows from the fact that the event $\{|X|\ge 2x\}$ implies that either $\{|X|\ge 2x\}\cap\{|Y|\ge 2y\}$ or $\{|X+Y|\ge 2(x-y)\}\cap\{|X-Y|\ge 2(x-y)\}$ holds (make a picture). Therefore, using the independence of $X$, $Y$, the definition of $\alpha=\alpha_{1,0}(X+Y, X-Y)$ and the trivial bound $P(|X+Y|\ge a)\le P(|X|\ge\tfrac12 a)+P(|Y|\ge\tfrac12 a)$ we obtain

\[
\begin{aligned}
P(|X|\ge 2x) &\le P(|X|\ge 2x)P(|Y|\ge 2y) + P(|X+Y|\ge 2(x-y))\big(\alpha+P(|X-Y|\ge 2(x-y))\big) \\
&\le N(2x)N(2y)+2\alpha N(x-y)+4N^2(x-y).
\end{aligned}
\]
The same bound holds for $P(|Y|\ge 2x)$. For any $\varepsilon>0$ pick $y$ so that $N(2y)\le\varepsilon/(1+\varepsilon)$. This gives $N(2x)\le(1+\varepsilon)2\alpha N(x-y)+4(1+\varepsilon)N^2(x-y)$ for all $x>y$. Now pick $x_0\ge y$ such that $N(x-y)\le\varepsilon\alpha/(1+\varepsilon)$ for all $x\ge x_0$. Then
\[
N(2x)\le 2(1+3\varepsilon)\alpha N(x-y)\le 2(1+3\varepsilon)\alpha N(x-x_0)
\]
for all $x\ge x_0$. Since $\varepsilon>0$ is arbitrary, this ends the proof of (108).

By Theorem 1.3 inequality (108) concludes the proof, e.g. by formula (2).

$\Box$
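The exponent $-\log_2(2\alpha_{1,0})$ can be read off from a doubling bound of the form (108): ignoring the shift $x_0$, an estimate $N(2x)\le rN(x)$ gives tails of order $r^k$ at $x=2^k$, and a dyadic bound for $E|X|^\beta$ then leads to a geometric series with ratio $2^\beta r$. The sketch below is a heuristic illustration under that simplification; the function names and the sample values of $r$ and $\beta$ are hypothetical.

```python
import math

def critical_exponent(alpha):
    """Supremum of the orders beta for which the proposition guarantees E|X|^beta < infinity."""
    if not 0.0 < alpha < 0.5:
        raise ValueError("expected 0 < alpha_{1,0} < 1/2")
    return -math.log2(2.0 * alpha)

def dyadic_ratio(r, beta):
    """Ratio of the geometric series sum_k 2^(k*beta) * r^k obtained by inserting
    the tail bound N(2^k) <= const * r^k into a dyadic estimate of E|X|^beta."""
    return 2.0 ** beta * r

alpha = 1.0 / 16.0
print(critical_exponent(alpha))          # the guaranteed orders are beta < 3
r = 0.13                                 # any r > 2*alpha = 0.125 is admissible in (108)
print(dyadic_ratio(r, 2.9), dyadic_ratio(r, 3.0))
```

The series converges exactly when the ratio is below 1, that is when $\beta<-\log_2 r$; letting $r$ decrease to $2\alpha$ recovers the range stated in the proposition.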
In Chapter we shall consider assumptions about conditional moments. In Section we need the integrability result which we state below. The assumptions are motivated by the fact that a pair $X$, $Y$ with the bivariate normal distribution has linear regressions $E\{X|Y\}=a_0+a_1Y$ and $E\{Y|X\}=b_0+b_1X$, see (30); moreover, since $X-(a_0+a_1Y)$ and $Y$ are independent (and similarly $Y-(b_0+b_1X)$ and $X$ are independent), see Theorem 2.2, the conditional variances $\operatorname{Var}(X|Y)$ and $\operatorname{Var}(Y|X)$ are non-random. These two properties do not characterize the normal distribution, see Problem . However, the assumption that regressions are linear and conditional variances are constant might be considered as a departure from the assumptions of Theorem 5.1 on the one hand and from the assumptions of Theorem on the other. The following somewhat surprising fact comes from []. For similar implications see also [] and [].

Theorem 52 Let $X$, $Y$ be random variables with finite second moments and suppose that

\[
E\{|X-(a_0+a_1Y)|^2 \mid Y\}\le \mathrm{const}
\tag{109}
\]
and
\[
E\{|Y-(b_0+b_1X)|^2 \mid X\}\le \mathrm{const}
\tag{110}
\]
for some real numbers $a_0, a_1, b_0, b_1$ such that $a_1b_1\neq 0, 1, -1$. Then $X$, $Y$ have finite moments of all orders.

In the proof we use the conditional version of Chebyshev's inequality stated as Problem 1.9.

Lemma 24 If $\mathcal{F}$ is a $\sigma$-field and $E|X|<\infty$, then

\[
P(|X|>t \mid \mathcal{F}) \le E\{|X| \mid \mathcal{F}\}/t
\]
almost surely.

Proof. Fix $t>0$ and let $A\in\mathcal{F}$. By the definition of the conditional expectation

\[
\int_A P(|X|>t \mid \mathcal{F})\,dP = E\{I_A I_{|X|>t}\} \le E\{|X|/t\; I_A I_{|X|>t}\} \le t^{-1}E\{|X| I_A\}.
\]
This ends the proof by Lemma 1.4. $\Box$
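When the conditioning $\sigma$-field is generated by a finite-valued random variable $Y$, the inequality reduces to one elementary Chebyshev bound for each value of $Y$. The sketch below checks this on a hypothetical joint table; the function name and the numbers are illustrative assumptions.

```python
def chebyshev_gaps(joint, xs, t):
    """For each value y_j of Y return E(|X| | Y = y_j)/t - P(|X| > t | Y = y_j).

    joint[i][j] = P(X = xs[i], Y = y_j); every column sum is assumed positive.
    The lemma asserts that all returned gaps are non-negative.
    """
    gaps = []
    for col in zip(*joint):                       # one column per value of Y
        py = sum(col)
        cond_mean_abs = sum(p * abs(x) for p, x in zip(col, xs)) / py
        cond_tail = sum(p for p, x in zip(col, xs) if abs(x) > t) / py
        gaps.append(cond_mean_abs / t - cond_tail)
    return gaps

xs = (-2, 0, 1, 3)
joint = [[0.10, 0.05],
         [0.20, 0.15],
         [0.10, 0.20],
         [0.05, 0.15]]
for t in (0.5, 1.0, 2.5):
    print(t, chebyshev_gaps(joint, xs, t))
```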
Proof of Theorem 6.2. First let us observe that without losing generality we may assume $a_0=b_0=0$. Indeed, by the triangle inequality $(E\{|X-a_1Y|^2 \mid Y\})^{1/2}\le|a_0|+(E\{|X-(a_0+a_1Y)|^2 \mid Y\})^{1/2}\le\mathrm{const}$, and the analogous bound takes care of (110). Furthermore, by passing to $-X$ or $-Y$ if necessary, we may assume $a=a_1>0$ and $b=b_1>0$. Let $N(x)=P(|X|\ge x)+P(|Y|\ge x)$. We shall show that there are constants $K, C>0$ such that
\[
N(Kx)\le CN(x)/x^2.
\tag{111}
\]
This will end the proof by Corollary 1.3.

To prove (111) we shall proceed as in the proof of Theorem 5.2. Namely, the event $\{|X|\ge Kx\}$, where $x>0$ is fixed and $K$ will be chosen later, can be decomposed into the union of two disjoint events $\{|X|\ge Kx\}\cap\{|Y|\ge x\}$ and $\{|X|\ge Kx\}\cap\{|Y|<x\}$. Therefore trivially we have

\[
P(|X|\ge Kx) \le P(|X|\ge x, |Y|\ge x) + P(|X|\ge Kx, |Y|<x) = P_1+P_2 \text{ (say)}.
\tag{112}
\]
For $K$ large enough the second term on the right-hand side of (112) can be estimated by the conditional Chebyshev inequality from Lemma 6.2, applied to $|Y-bX|^2$ together with (110). Using the trivial estimate $|Y-bX|\ge b|X|-|Y|$ we get
\[
P_2 \le P(|Y-bX|\ge(Kb-1)x,\ |X|\ge Kx)
= \int_{|X|\ge Kx}P(|Y-bX|\ge(Kb-1)x \mid X)\,dP \le \mathrm{const}\; N(Kx)/x^2.
\tag{113}
\]
To estimate $P_1$ in (112), observe that the event $\{|X|\ge x\}$ implies that either $|X-aY|\ge Cx$, or $|Y-bX|\ge Cx$, where $C=|1-ab|/(1+a)$. Indeed, suppose both are not true, i.e. $|Y-bX|<Cx$ and $|X-aY|<Cx$. Then we obtain trivially
\[
|1-ab|\,|X| = |X-abX| \le |X-aY|+a|Y-bX| < C(1+a)x.
\]
By our choice of $C$, this contradicts $|X|\ge x$.
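The elementary implication used here is easy to test on random points. The sketch below draws hypothetical pairs of real numbers and checks that $|X|\ge x$ forces $|X-aY|\ge Cx$ or $|Y-bX|\ge Cx$; the slopes $a$, $b$ and the sampling range are arbitrary choices with $ab\neq 1$.

```python
import random

def dichotomy_holds(X, Y, a, b, x):
    """Check: |X| >= x implies |X - aY| >= Cx or |Y - bX| >= Cx, with C = |1 - ab|/(1 + a)."""
    C = abs(1.0 - a * b) / (1.0 + a)
    if abs(X) < x:
        return True                       # premise not satisfied, nothing to check
    slack = 1e-12 * x                     # guards against rounding at the boundary
    return abs(X - a * Y) >= C * x - slack or abs(Y - b * X) >= C * x - slack

rng = random.Random(1)
a, b, x = 0.7, 2.5, 1.0                   # hypothetical positive slopes with ab != 1
trials = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(10000)]
print(all(dichotomy_holds(X, Y, a, b, x) for X, Y in trials))
```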

Using the above observation and the conditional Chebyshev inequality we obtain

\[
P_1 \le P(|X-aY|\ge Cx,\ |Y|\ge x) + P(|Y-bX|\ge Cx,\ |X|\ge x) \le C_1N(x)/x^2.
\]
This, together with (112) and (113), implies $P(|X|\ge Kx)\le CN(x)/x^2$ for any $K>1/b$ with a constant $C$ depending on $K$ but not on $x$. Similarly $P(|Y|\ge Kx)\le CN(x)/x^2$ for any $K>1/a$, which proves (111). $\Box$

6.3  Stability

In this section we shall use the coefficient $\alpha_{0,0}$ to analyze the stability of a variant$^{21}$ of Theorem 5.1 which is based on the approach sketched in Problem 5.6.

Theorem 53 Suppose $X$, $Y$ are i.i.d. with the cumulative distribution function $F(\cdot)$. Assume that $EX=0$, $EX^2=1$ and $E|X|^3=K<\infty$, and let $\Phi(\cdot)$ denote the cumulative distribution function of the standard normal distribution. If $\alpha_{0,0}(X+Y; X-Y)<\varepsilon$, then

\[
\sup_x|F(x)-\Phi(x)| \le C(K)\varepsilon^{1/3}.
\tag{114}
\]

The following corollary is a consequence of Theorem 6.3 and Proposition 6.2.

Corollary 15 Suppose $X$, $Y$ are i.i.d. with the cumulative distribution function $F(\cdot)$. Assume that $EX=0$, $EX^2=1$. If $\alpha_{1,0}(X+Y; X-Y)<\varepsilon$, then there is $C<\infty$ such that (114) holds.

Indeed, by Proposition 6.2 the third moment exists if $\varepsilon<e^{-3}/2$; choosing $C$ large enough, inequality (114) holds true trivially for $\varepsilon\ge e^{-3}/2$.
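The threshold can be checked against the moment range $\beta<-\log_2(2\alpha_{1,0})$ from the integrability result of Section 6.2. The sketch below reads the threshold as $e^{-3}/2$, with $e$ the base of natural logarithms, which is an assumption about the intended constant; the function name is hypothetical.

```python
import math

def guaranteed_moment_order(eps):
    """Upper limit -log2(2*eps) of the moment orders guaranteed when alpha_{1,0} < eps."""
    return -math.log2(2.0 * eps)

threshold = math.exp(-3.0) / 2.0          # assumed reading of the bound on epsilon
# The resulting order equals 3*log2(e), which exceeds 3.
print(threshold, guaranteed_moment_order(threshold))
```

Under this reading every $\varepsilon$ below the threshold leaves room for a finite third moment, which is what the argument needs.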

The next lemma gives an estimate of the left-hand side of (114) in terms of characteristic functions. Inequality (115) is called the smoothing inequality, a name well motivated by the method of proof; it is due to Esseen [].

Lemma 25 Suppose $F$, $G$ are cumulative distribution functions with the characteristic functions $\phi$, $\psi$ respectively. If $G$ is differentiable, then for all $T>0$

\[
\sup_x|F(x)-G(x)| \le \frac{1}{\pi}\int_{-T}^{T}|\phi(t)-\psi(t)|\,dt/t + \frac{12}{\pi T}\sup_x|G'(x)|.
\tag{115}
\]
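A numerical comparison of the two sides of (115) is possible when both characteristic functions are explicit. The sketch below takes the hypothetical pair $F=N(0,\sigma^2)$ and $G=N(0,1)$, so that $\phi(t)=e^{-\sigma^2t^2/2}$, $\psi(t)=e^{-t^2/2}$ and $\sup_x|G'(x)|=1/\sqrt{2\pi}$; the integral is computed by a midpoint rule with $|t|$ in the denominator, and the supremum over $x$ is approximated on a grid. All names and parameter values are illustrative assumptions.

```python
import math

def norm_cdf(x, sigma=1.0):
    """Cumulative distribution function of N(0, sigma^2)."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def kolmogorov_distance(sigma, lo=-10.0, hi=10.0, n=20001):
    """Grid approximation of sup_x |F(x) - G(x)| for F = N(0, sigma^2), G = N(0, 1)."""
    step = (hi - lo) / (n - 1)
    return max(abs(norm_cdf(lo + k * step, sigma) - norm_cdf(lo + k * step))
               for k in range(n))

def esseen_bound(sigma, T, n=100000):
    """Right-hand side of the smoothing inequality for the same pair F, G."""
    h = T / n
    integral = 0.0
    for k in range(n):                    # midpoint rule on (0, T); the integrand is even
        t = (k + 0.5) * h
        integral += abs(math.exp(-0.5 * (sigma * t) ** 2) - math.exp(-0.5 * t * t)) / t * h
    sup_density = 1.0 / math.sqrt(2.0 * math.pi)   # sup of the standard normal density
    return (2.0 / math.pi) * integral + 12.0 / (math.pi * T) * sup_density

print(kolmogorov_distance(1.5), esseen_bound(1.5, 10.0))
```

The first term of the bound depends on how close the characteristic functions are on $[-T, T]$, while the second term decreases with $T$; choosing $T$ balances the two.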

Proof. By an approximation argument, it suffices to prove (115) for $F$, $G$ differentiable and with integrable characteristic functions only. Indeed, one can approximate $F$ uniformly by the cumulative distribution functions $F_\delta$, obtained by convolving $F$ with the normal $N(0, \delta)$ distribution, compare Lemma 5.1. The approximation, clearly, does not affect (115). That is, if (115) holds true for the approximants, then it holds true for the actual cdf's as well.

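Let $f$, $g$ be the densities of $F$ and $G$ respectively. The inversion formula for characteristic functions gives

\[
f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\phi(t)\,dt,
\qquad
g(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\psi(t)\,dt.
\]
From this we obtain
\[
F(x)-G(x) = \frac{i}{2\pi}\int_{-\infty}^{\infty}e^{-itx}\,\frac{\phi(t)-\psi(t)}{t}\,dt.
\]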

The latter formula can be checked, for instance, by verifying that both sides have the same derivative, so that they may differ by a constant only. The constant has to be 0, because the left-hand side has limit 0 at $\infty$ (a property of cdf's) and the right-hand side has limit 0 at $\infty$ (e.g. because we convolved with the normal distribution while doing our approximation step; another way of seeing the asymptotics at $\infty$ of the right-hand side is to use the Riemann-Lebesgue theorem, see e.g. []).

This clearly implies

\[
\sup_x|F(x)-G(x)| \le \frac{1}{2\pi}\int_{-\infty}^{\infty}|\phi(t)-\psi(t)|\,dt/t.
\tag{116}
\]
This inequality, while resembling (115), is not good enough; it is not preserved by our approximation procedure, and the right-hand side is useless when the density of $F$ does not exist. Nevertheless (116) would do, if one only knew that the characteristic functions vanish outside of a finite interval. To achieve this, one needs to consider one more convolution approximation; this time we shall use the density $h_T(x)=\frac{1}{\pi T}\,\frac{1-\cos(Tx)}{x^2}$. We shall need the fact that the characteristic function $\eta_T(t)$ of $h_T(x)$ vanishes for $|t|\ge T$ (and we shall not need the explicit formula $\eta_T(t)=1-|t|/T$ for $|t|\le T$, cf. Example 1.5). Denote by $F_T$ and $G_T$ the cumulative distribution functions corresponding to the convolutions $f*h_T$ and $g*h_T$ respectively. The corresponding characteristic functions are $\phi(t)\eta_T(t)$ and $\psi(t)\eta_T(t)$ respectively and both vanish for $|t|\ge T$. Therefore, inequality (116) applied to $F_T$ and $G_T$ gives

\[
\sup_x|F_T(x)-G_T(x)|
\le \frac{1}{2\pi}\int_{-T}^{T}|(\phi(t)-\psi(t))\eta_T(t)|\,dt/t
\le \frac{1}{2\pi}\int_{-T}^{T}|\phi(t)-\psi(t)|\,dt/t.
\tag{117}
\]
It remains to verify that $\sup_x|F_T(x)-G_T(x)|$ does not differ too much from $\sup_x|F(x)-G(x)|$. Namely, we shall show that

\[
\sup_x|F(x)-G(x)| \le 2\sup_x|F_T(x)-G_T(x)| + \frac{12}{\pi T}\sup_x|G'(x)|,
\tag{118}
\]
which together with (117) will end the proof of (115). To verify (118), put $M=\sup_x|G'(x)|$ and pick $x_0$ such that

\[
\sup_x|F(x)-G(x)| = |F(x_0)-G(x_0)|.
\]
Such $x_0$ can be found, because $F$ and $G$ are continuous and $F(x)-G(x)$ vanishes as $x\to\pm\infty$. Suppose $\sup_x|F(x)-G(x)|=G(x_0)-F(x_0)$. (The other case, $\sup_x|F(x)-G(x)|=F(x_0)-G(x_0)$, is handled similarly, and is done explicitly in [].) Since $F$ is non-decreasing, and the rate of growth of $G$ is bounded by $M$, for all $s\ge 0$ we get
\[
G(x_0-s)-F(x_0-s) \ge G(x_0)-F(x_0)-sM.
\]
Now put $a=(G(x_0)-F(x_0))/(2M)$, $t=x_0-a$, $x=s-a$. Then for all $|x|\le a$ we get
\[
G(t-x)-F(t-x) \ge \tfrac12(G(x_0)-F(x_0))-Mx.
\tag{119}
\]
Notice that
\[
\begin{aligned}
G_T(t)-F_T(t) &= \frac{1}{\pi T}\int_{-\infty}^{\infty}(G(t-x)-F(t-x))(1-\cos Tx)\,x^{-2}\,dx \\
&\ge \frac{1}{\pi T}\int_{-a}^{a}(G(t-x)-F(t-x))(1-\cos Tx)\,x^{-2}\,dx
- \sup_x|F(x)-G(x)|\,\frac{2}{\pi T}\int_{a}^{\infty}y^{-2}\,dy.
\end{aligned}
\]
Clearly,
\[
\sup_x|F(x)-G(x)|\,\frac{2}{\pi T}\int_{a}^{\infty}y^{-2}\,dy = (G(x_0)-F(x_0))\,\frac{2}{\pi T}\,a^{-1} = 4M/(\pi T)
\]
by our choice of $a$. On the other hand (119) gives
\[
\begin{aligned}
\frac{1}{\pi T}\int_{-a}^{a}(G(t-x)-F(t-x))(1-\cos Tx)\,x^{-2}\,dx
&\ge -\frac{1}{\pi T}\int_{-a}^{a}Mx(1-\cos Tx)\,x^{-2}\,dx
+ \tfrac12(G(x_0)-F(x_0))\left(1-\frac{2}{\pi T}\int_{a}^{\infty}y^{-2}\,dy\right) \\
&= \tfrac12(G(x_0)-F(x_0)) - 2M/(\pi T);
\end{aligned}
\]
here we used the fact that the first integral vanishes by symmetry. Therefore $G(x_0)-F(x_0)\le 2(G_T(x_0-a)-F_T(x_0-a))+12M/(\pi T)$, which clearly implies (118). $\Box$
Proof of Theorem 6.3. Clearly only small $\varepsilon>0$ are of interest. Throughout the proof $C$ will denote a constant depending on $K$ only, not always the same at each occurrence. Let $\phi(\cdot)$ be the characteristic function of $X$. We have $E\exp(it(X+Y))\exp(it(X-Y))=\phi(2t)$ and $E\exp(it(X+Y))\,E\exp(it(X-Y))=(\phi(t))^3\phi(-t)$. Therefore by a complex valued variant of (104) with $p=q=\infty$, see Problem , we have
\[
|\phi(2t)-(\phi(t))^3\phi(-t)| \le 16\varepsilon.
\tag{120}
\]
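The left-hand side of (120) can be evaluated directly for explicit characteristic functions. The sketch below compares the standard normal case, where the defect vanishes identically, with a hypothetical non-normal example, the symmetric $\pm 1$ distribution with $\phi(t)=\cos t$, which also has mean 0 and variance 1. Dividing the defect by 16 gives a lower bound for $\alpha_{0,0}(X+Y; X-Y)$ by (120). The function names are illustrative.

```python
import math

def defect(phi, t):
    """|phi(2t) - phi(t)^3 * phi(-t)|, the left-hand side of the inequality above."""
    return abs(phi(2.0 * t) - phi(t) ** 3 * phi(-t))

def phi_normal(t):
    """Characteristic function of the standard normal distribution."""
    return math.exp(-0.5 * t * t)

def phi_rademacher(t):
    """Characteristic function of P(X = 1) = P(X = -1) = 1/2."""
    return math.cos(t)

for t in (0.5, 1.0, 2.0):
    print(t, defect(phi_normal, t), defect(phi_rademacher, t))
```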
We shall use (115) with $T=\varepsilon^{-1/3}$ to show that (120) implies (114). To this end we need only to establish that for some $C>0$
\[
\frac{1}{\pi T}\int_{-T}^{T}|\phi(t)-e^{-t^2/2}|/t\,dt \le C\varepsilon^{1/3}.
\tag{121}
\]
Put $h(t)=\phi(t)-e^{-t^2/2}$. Since $EX=0$, $EX^2=1$ and $E|X|^3<\infty$, we can choose $\varepsilon>0$ small enough so that
\[
|h(t)| \le C_0|t|^3
\tag{122}
\]
for all $|t|\le\varepsilon^{1/3}$. From (120) we see that
\[
|h(2t)| = |\phi(2t)-\exp(-2t^2)| \le 16\varepsilon+|(\phi(t))^3\phi(-t)-\exp(-2t^2)|.
\]
Since $\phi(t)=\exp(-\tfrac12t^2)+h(t)$, we get
\[
|h(2t)| \le 16\varepsilon+\sum_{r=0}^{3}\binom{4}{r}\exp(-\tfrac12rt^2)\,|h(t)|^{4-r}.
\tag{123}
\]
Put $t_n=\varepsilon^{1/3}2^n$, where $n=0, 1, 2, \dots, [1-\tfrac23\log_2(\varepsilon)]$, and let $h_n=\max\{|h(t)|: t_{n-1}\le t\le t_n\}$. Then (123) implies
\[
h_{n+1} \le 16\varepsilon+4\exp(-\tfrac12t_n^2)\,h_n\left(1+\tfrac32h_n+h_n^2\right)+h_n^4.
\tag{124}
\]

Claim 2 Relation (124) implies that for all sufficiently small $\varepsilon>0$ we have

\[
h_n \le 2(C_0+44)\,\varepsilon\,4^n\exp(-t_0^2\,4^n/6),
\tag{125}
\]
\[
h_n^4 \le \varepsilon,
\tag{126}
\]
where $0\le n\le[1-\tfrac23\log_2(\varepsilon)]$, and $C_0$ is a constant from (122).

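Claim 6.3 now ends the proof. Indeed,

\[
\begin{aligned}
\int_{-T}^{T}|\phi(t)-e^{-t^2/2}|/t\,dt
&= 2\int_{0}^{t_0}|h(t)|/t\,dt+2\sum_{i=1}^{n}\int_{t_{i-1}}^{t_i}|h(t)|/t\,dt \\
&\le 2C_0\varepsilon+2\sum_{i=1}^{n}\frac{h_i}{t_{i-1}}\int_{t_{i-1}}^{t_i}1\,dt
\le 2C_0\varepsilon+4\sum_{i=1}^{n}(C_0+44)\,\varepsilon\,4^i e^{-t_0^2\,4^i/6} \\
&\le 2C_0\varepsilon+24(C_0+44)\,\frac{\varepsilon}{t_0^2}\int_{0}^{\infty}e^{-x}\,dx \le C\varepsilon^{1/3}.
\end{aligned}
\]
$\Box$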

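Proof of Claim 6.3. We shall prove (126) by induction, and (125) will be established in the induction step. By (122), inequality (126) is true for $n=1$, provided $\varepsilon<C_0^{-4/3}$. Suppose $m\ge 0$ is such that (126) holds for all $n\le m$. Since $\tfrac32h_n+h_n^2<3\varepsilon^{1/4}=\delta$, (124) implies

\[
\begin{aligned}
h_{m+1} &\le 32\varepsilon+4\exp(-\tfrac12t_n^2)\,h_m(1+\delta) \\
&\le 32\varepsilon\sum_{j=1}^{n-1}4^j(1+\delta)^j\exp\Big(-\tfrac12\sum_{k=1}^{j}t_{n-k}^2\Big)
+4^n(1+\delta)^n\exp\Big(-\tfrac12\sum_{k=1}^{n}t_{n-k}^2\Big)h_1 \\
&= 32\varepsilon\sum_{j=1}^{n-1}4^j(1+\delta)^j\exp\big(-t_0^2(4^n-4^{n-j})/6\big)
+4^n(1+\delta)^n\exp\big(-t_0^2(4^n-1)/6\big)h_1.
\end{aligned}
\]
Therefore
\[
h_{m+1} \le (h_1+44\varepsilon)(1+\delta)^n\,4^n e^{-t_0^2\,4^n/6}.
\tag{127}
\]
Since
\[
(1+\delta)^n \le (1+3\varepsilon^{1/4})^{2-\frac23\log_2(\varepsilon)} \le 2
\]
and
\[
4^n e^{-t_0^2\,4^n/6} \le 4\varepsilon^{-4/3}\exp(-\tfrac16\varepsilon^{-2/3}) \le \varepsilon^{-2/3}
\]
for all $\varepsilon>0$ small enough, taking (122) into account we get $h_{m+1}\le 2(44+C_0)\varepsilon^{1/3}\le\varepsilon^{1/4}$, provided $\varepsilon>0$ is small enough. This proves (126) by induction. Inequality (125) now follows from (127). $\Box$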

6.4  Problems

Problem 47 Show that for complex valued random variables $X$, $Y$

\[
|EXY-EXEY| \le 16\alpha_{0,0}\,\|X\|_\infty\,\|Y\|_\infty.
\]
(The constant is not sharp.)

Problem 48 Suppose $(X, Y)\in\mathbb{R}^2$ are jointly normal and $\alpha_{0,1}(X,Y)<1$. Show that $X$, $Y$ are independent.

Problem 49 Suppose $(X, Y)\in\mathbb{R}^2$ are jointly normal with correlation coefficient $\rho$. Show that $Ef(X)g(Y)\le\|f(X)\|_p\|g(Y)\|_p$ for all $p$-integrable $f(X)$, $g(Y)$, provided $p\ge 1+|\rho|$.
Hint: Use the expl