Normal Distribution
characterizations with applications
Włodzimierz Bryc
Department of Mathematics
University of Cincinnati
P O Box 210025
Cincinnati, OH 45221-0025
bryc@ucbeh.san.uc.edu
December 22, 1994
Preface
This book is a concise presentation of the normal distribution on the real line
and its counterparts on more abstract spaces, which we shall call the Gaussian
distributions. The material is selected towards presenting
characteristic properties, or characterizations, of the normal distribution.
There are many such
properties and there are numerous relevant works in the literature.
In this book special attention is given to characterizations generated
by the so-called Maxwell's Theorem of statistical mechanics, which is stated
in the introduction as Theorem .
These characterizations are of interest both intrinsically, and as
techniques that are worth being aware of. The book may also serve as a good
introduction to diverse analytic methods of probability theory. We use
characteristic functions, tail estimates, and occasionally dive
into complex analysis.
In the book we also show how the characteristic properties
can be used to prove important results about the Gaussian processes and the
abstract Gaussian vectors. For instance, in Section
we present Fernique's beautiful proofs of the zero-one
law and of the integrability of abstract Gaussian vectors.
The central limit theorem is obtained via characterizations in Section .
The excellent book by
Kagan, Linnik & Rao
[] overlaps with ours in the coverage of the classical characterization
results. Our presentation of these is sometimes less general,
but in return we often give simpler proofs. On the other hand, we are more
selective in the choice of characterizations we want to present, and we also
point out some applications.
Characterization results that are not included in
[] can be found in numerous places of the book, see Section , Chapter
and Chapter .
We have tried to make this book accessible to readers with various backgrounds.
If possible,
we give elementary proofs of important theorems,
even if they are special cases of more advanced
results. Proofs of several difficult classic
results have been simplified. We have managed to
avoid
functional equations for non-differentiable functions;
in many proofs in the literature
lack of differentiability is a major technical difficulty.
The book is primarily aimed at graduate students in
mathematical statistics and probability theory
who would like to expand their bag of tools, to understand the
inner workings of the normal distribution, and to explore
the connections with other fields.
Characterization aspects sometimes show up in unexpected places,
cf. Diaconis & Ylvisaker [].
More generally, when fitting
any statistical model to the data, it is inevitable
to refer to relevant properties of the population
in question; otherwise
several different models may fit the
same set of empirical data, cf. W. Feller
[]. The monograph
[] by Prakasa Rao is written from such a perspective, and for a
statistician our
book may only serve as a complementary source.
On the other hand, results presented in Sections
and are quite recent and virtually unknown among
statisticians. Their modeling aspects remain to be explored, see Section
. We hope that
this book
will popularize the interesting and difficult area of conditional moment
descriptions of random fields.
Of course it is possible that such characterizations will finally end up far
from
real life like many other branches of applied mathematics.
It is up to the readers of this book to see if the following sentence applies to
characterizations as well as to trigonometric series.
``Thinking of the extent and refinement reached by the
theory of trigonometric series in its long development one sometimes
wonders why only relatively
few of these advanced achievements find an application.''
(A. Zygmund, Trigonometric Series, Vol. 1, Cambridge Univ. Press,
Second Edition, 1959, page xii)
There is more than one way to use this book.
Parts of it have been used in a graduate one-quarter course
Topics in statistics. The reader may also skim through it
to find results that he needs; or look up the techniques
that might be useful in his own research. The author of this book would be most
happy
if the reader treats this book as an adventure into the unknown -
picks a piece of his liking and
follows through and beyond the references. With this in mind, the book has a number
of references and digressions.
We have tried to point out the historical perspective, but also to get close to current
research.
An appropriate background for reading the book is a one year course
in real analysis including measure theory and abstract normed spaces,
and a one-year course in complex analysis. Familiarity with conditional
expectations would also help.
Topics from probability theory are reviewed in Chapter ,
frequently with proofs and exercises.
Exercise problems are at the end of the chapters; solutions or hints are
in Appendix .
The book benefited from the comments of Chris Burdzy, Abram Kagan,
Samuel Kotz, Wodek Smoleński,
Paweł Szabowski, and Jacek Wesołowski. They read portions of the first
draft, generously shared their criticism, and pointed out
relevant references
and errors. My colleagues at the University of Cincinnati also provided
comments, criticism and encouragement.
The final version of the book was prepared at the
Institute for Applied Mathematics of the University of Minnesota in the fall quarter of 1993
and at the Center for Stochastic Processes in Chapel Hill in Spring 1994.
Support by C. P. Taft Memorial Fund
in the summer of 1987 and in the spring of 1994 helped to begin
and to conclude this endeavor.
Introduction
The following narrative comes from J. F. W. Herschel [].
``Suppose a ball is dropped from a given height,
with the intention that it shall fall on a given mark.
Fall as it may, its deviation from the mark is error,
and the probability of that error is the unknown function
of its square, ie. of the sum of the squares of its
deviations in any two rectangular directions. Now, the
probability of any deviation depending solely on its
magnitude, and not on its direction, it follows that
the probability of each of these rectangular deviations
must be the same function of its square.
And since the observed oblique deviation is
equivalent to the two rectangular ones, supposed
concurrent, and which are essentially independent
of one another, and is, therefore, a compound event of
which they are the simple independent constituents,
therefore its probability will be the product of their
separate probabilities. Thus the form of our unknown
function comes to be determined from this condition...''
Ten years after Herschel,
the reasoning was repeated by J. C. Maxwell
[]. In his theory of gases he
assumed that gas consists of small elastic
spheres bumping each other; this led to intricate mechanical
considerations to analyze the velocities before
and after the encounters. However, Maxwell answered the question of his
Proposition IV: What is the distribution of velocities
of the gas particles? without using the details
of the interaction between the particles; it
led to the emergence of the trivariate normal distribution.
The result that velocities are normally distributed is sometimes
called Maxwell's theorem. At the time of discovery, probability
theory was in its beginnings and the proof
was considered ``controversial'' by leading mathematicians.
The beauty of the reasoning lies in the fact
that the interplay of two very natural assumptions: of independence
and of rotation
invariance, gives rise to the normal law of errors - the most important
distribution in statistics.
This interplay of independence and invariance shows up in many
of the theorems presented below.
Here we state the Herschel-Maxwell
theorem in modern notation but without proof; for one of
the early proofs, see []. The reader will see several
proofs that use various, usually weaker, assumptions in
Theorems , , , , and .
Theorem 1
Suppose random variables
$X, Y$ have joint probability distribution $\mu(dx, dy)$
such that
(i) $\mu(\cdot)$ is invariant under the rotations of $\mathbb{R}^2$;
(ii) $X, Y$ are independent.
Then $X, Y$ are normally distributed.
This theorem has generated a vast literature.
Here is a quick preview of pertinent results in this book.
Polya's theorem [] presented in Section
says that if just two rotations, by angles $\pi/2$ and
$\pi/4$, preserve the distribution of $X$,
then the distribution is normal.
Generalizations to characterizations by the
equality of distributions of more general linear forms are given in
Chapter . One of the most interesting results here is
Marcinkiewicz's theorem [], see Theorem .
An interesting modification of the Herschel-Maxwell theorem,
discovered by M. Sh. Braverman
[] and presented in Section below,
considers three i.i.d. random variables $X, Y, Z$ with
the rotation-invariance assumption (i) replaced by the requirement
that only some absolute moments are rotation invariant.
Another insight is obtained, if one notices that assumption (i)
of Maxwell's theorem implies that rotations preserve the independence
of the original random variables $X, Y$. In this approach we consider a
pair $X, Y$ of independent random variables such that the
rotation by an angle $\alpha$ produces two independent random variables
$X\cos\alpha+Y\sin\alpha$ and $X\sin\alpha-Y\cos\alpha$.
Assuming this for
all angles $\alpha$, M. Kac [] showed that
the distribution in question has to be normal. Moreover,
careful inspection of Kac's proof reveals that the only essential
property he had used was that $X, Y$ are independent and that
just one $\pi/4$-rotation: $(X+Y)/\sqrt{2}$, $(X-Y)/\sqrt{2}$
produces the independent pair. The result explicitly assuming the
latter was found independently by
Bernstein []. Bernstein's theorem and its extensions are considered in
Chapter ; Bernstein's theorem also motivates the
assumptions in Chapter .
The following is a more technical description of the contents of the book.
Chapter collects
probabilistic prerequisites. The emphasis is on analytic aspects; in particular
elementary but useful tail estimates collected in Section . In Chapter
we
approach multivariate normal distributions through characteristic functions.
This is a less intuitive but powerful method. It leads rapidly to several
fundamental facts, and to associated Reproducing Kernel Hilbert Spaces
(RKHS). As an illustration, we prove the large deviation estimates on $\mathbb{R}^d$
which use the conjugate RKHS norm. In Chapter
the reader is introduced to
stability and equidistribution of linear forms in independent random
variables. Stability is directly related to the CLT. We show that in the
abstract setup stability is also responsible for the zero-one law. Chapter
presents the analysis of rotation invariant distributions on $\mathbb{R}^d$ and on
$\mathbb{R}^\infty$. We study when a rotation invariant distribution has to be
normal. In the process we analyze structural properties of rotation
invariant laws and introduce the relevant techniques. In this chapter we
also present surprising results on rotation invariance of the absolute
moments.
We conclude with a short proof of
de Finetti's theorem and point out its implications for infinite spherically symmetric
sequences. Chapter parallels Chapter in analyzing the
role of independence of linear forms. We show that independence of certain
linear forms, a characteristic property of the normal distribution, leads to
the zero-one law, and it is also responsible for exponential moments. Chapter
is a short introduction to measures of dependence and stability
issues. Theorem establishes integrability under conditions of
interest, eg. in polynomial biorthogonality as studied by Lancaster
[]. In Chapter we extend results in Chapter
to conditional moments. Three interesting aspects emerge here. First,
normality can frequently be recognized from the conditional moments of linear
combinations of independent random variables; we illustrate this by a simple
proof of the well known fact that the independence of the sample mean and the
sample variance characterizes normal populations, and by
the proof of the central limit theorem. Secondly, we show that for infinite
sequences, conditional moments determine normality without any reference to
independence. This part has its natural continuation in Chapter .
Thirdly, in the exercises we point out
the versatility of conditional moments in handling other infinitely
divisible distributions. Chapter is a short introduction to
continuous parameter random fields, analyzed through their conditional
moments. We also present a self-contained analytic construction of the
Wiener process.
Most of the contents of this section is
fairly standard probability theory. The reader shouldn't be under the impression that
this chapter is a substitute for a systematic course in
probability theory. We will skip important topics such as limit theorems.
The emphasis here is
on analytic methods; in particular characteristic functions will be extensively
used throughout.
Let $(\Omega, \mathcal{M}, P)$ be the probability space, ie. $\Omega$ is a
set, $\mathcal{M}$ is a $\sigma$-field of its subsets and $P$ is the probability
measure on $(\Omega, \mathcal{M})$.
We follow the usual conventions:
$X, Y, Z$ stand for real random variables; boldface $\mathbf{X}, \mathbf{Y}, \mathbf{Z}$
denote vector-valued random variables.
Throughout the book $EX = \int_\Omega X(\omega)\, dP$
(Lebesgue integral) denotes the expected value of a random variable $X$.
We write $X \cong Y$ to denote the equality of distributions, ie.
$P(X \in A) = P(Y \in A)$ for all measurable sets $A$. Equalities and inequalities
between random variables are to be interpreted almost surely (a. s.). For
instance $X \le Y+1$ means $P(X \le Y+1) = 1$; the latter is a shortcut that we
use for the expression $P(\{\omega \in \Omega: X(\omega) \le Y(\omega)+1\}) = 1$.
Boldface $\mathbf{A}, \mathbf{B}, \mathbf{C}$ will denote matrices. For a complex $z = x+iy \in \mathbb{C}$
by $x = \Re z$ and $y = \Im z$ we denote
the real and the imaginary part of $z$. Unless otherwise stated,
$\log a = \log_e a$ denotes the natural logarithm of number $a$.
1.1 Moments
Given a real number $r \ge 0$, the absolute moment of
order $r$ is defined by $E|X|^r$; the ordinary moment
of order $r = 0, 1, \dots$ is defined as
$EX^r$. Clearly, not every sequence of numbers
is the sequence of moments of a random
variable $X$; it may also happen that two random variables
with different distributions have the same moments.
However, in Corollary below we will show that the latter
cannot happen for normal distributions.
The following inequality is known as Chebyshev's
inequality. Despite its simplicity it has numerous non-trivial applications,
see eg. Theorem or [].
Proposition 1
If $f: \mathbb{R}_+ \to \mathbb{R}_+$ is
a non-decreasing function and $Ef(|X|) = C < \infty$, then
for all $t > 0$ such that
$f(t) \ne 0$ we have
\[ P(|X| > t) \le \frac{C}{f(t)}. \tag{1} \]
Indeed,
\[ Ef(|X|) = \int_\Omega f(|X|)\, dP \ge \int_{|X| \ge t} f(|X|)\, dP \ge \int_{|X| \ge t} f(t)\, dP = f(t) P(|X| \ge t) \ge f(t) P(|X| > t). \]
It follows immediately from Chebyshev's
inequality that if
$E|X|^p = C < \infty$, then $P(|X| > t) \le C/t^p$, $t > 0$.
An implication in converse direction is also well known:
if $P(|X| > t) \le C/t^{p+\epsilon}$ for some $\epsilon > 0$ and for all $t > 0$,
then $E|X|^p < \infty$, see () below.
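As a quick sanity check of Chebyshev's inequality, the following Python sketch evaluates both sides exactly for a small discrete distribution with $f(t) = t^2$. The distribution `dist` and all helper names are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

# Hypothetical law of X: value -> probability (illustration only).
dist = {-3: Fraction(1, 10), -1: Fraction(3, 10), 0: Fraction(1, 5),
        2: Fraction(3, 10), 4: Fraction(1, 10)}

def expectation(h, dist):
    """E h(X) for a discrete X."""
    return sum(p * h(x) for x, p in dist.items())

def tail(t, dist):
    """P(|X| > t)."""
    return sum(p for x, p in dist.items() if abs(x) > t)

def f(u):
    """A non-decreasing function on [0, infinity)."""
    return u * u

def chebyshev_bound(t, dist):
    """C / f(t), where C = E f(|X|)."""
    return expectation(lambda x: f(abs(x)), dist) / f(t)

# Triples (t, P(|X| > t), C / f(t)) for a few thresholds t.
checks = [(t, tail(t, dist), chebyshev_bound(t, dist)) for t in (1, 2, 3, 4)]
```

In every triple the tail probability should not exceed the bound; exact rational arithmetic is used so that the comparison is free of rounding effects.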
The following formula will often be useful.
Proposition 2
If $f: \mathbb{R}_+ \to \mathbb{R}$ is a function such
that
$f(x) = f(0)+ \int_0^x g(t)\, dt$, $E\{|f(X)|\} < \infty$
and $X \ge 0$, then
\[ Ef(X) = f(0) + \int_0^\infty g(t) P(X \ge t)\, dt. \tag{2} \]
Moreover, if $g \ge 0$ and if the right hand side of (2)
is finite, then $Ef(X) < \infty$.
Proof. The formula follows from Fubini's
theorem, since for
$X \ge 0$
\[ \int_\Omega f(X)\, dP = \int_\Omega \left( f(0)+ \int_0^\infty 1_{t \le X}\, g(t)\, dt \right) dP \]
\[ = f(0)+ \int_0^\infty g(t) \left( \int_\Omega 1_{t \le X}\, dP \right) dt = f(0)+ \int_0^\infty g(t) P(X \ge t)\, dt. \]
$\Box$
Corollary 1
If $E|X|^r < \infty$ for an integer $r > 0$, then
\[ EX^r = r \int_0^\infty t^{r-1} P(X \ge t)\, dt + (-1)^r\, r \int_0^\infty t^{r-1} P(-X \ge t)\, dt. \tag{3} \]
If $E|X|^r < \infty$ for real $r > 0$ then
\[ E|X|^r = r \int_0^\infty t^{r-1} P(|X| \ge t)\, dt. \tag{4} \]
Moreover, the left hand side of (4) is finite if and only if
the right hand side is finite.
Proof. Formula (4) follows directly from Proposition
1.1 (with $f(x) = x^r$ and $g(t) = \frac{d}{dt} f(t) = r t^{r-1}$).
Since $X^r = (X^+)^r + (-1)^r (X^-)^r$, where $X^+ = \max\{X, 0\}$ and
$X^- = \max\{-X, 0\}$, applying Proposition 1.1
separately to each of these expectations we get (3). For $r = 1$ this
reads $EX = EX^+ - EX^-$.
$\Box$
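The tail-integral formulas can be checked exactly on a discrete example. The sketch below is only an illustration under stated assumptions: the law `dist` and the helper names are hypothetical. It compares $E|X|^r$ with $r\int_0^\infty t^{r-1}P(|X| \ge t)\,dt$ for $r = 1, 2, 3$, and recovers the mean $EX$ from the two one-sided tails (the case $r = 1$ of formula (3)). For a discrete law the tail function is a step function, so the integral reduces to a finite sum.

```python
from fractions import Fraction

# Hypothetical law of X: value -> probability (illustration only).
dist = {-3: Fraction(1, 10), -1: Fraction(3, 10), 0: Fraction(1, 5),
        2: Fraction(3, 10), 4: Fraction(1, 10)}

def pushforward(g, dist):
    """Law of g(X) for a discrete X."""
    out = {}
    for x, p in dist.items():
        out[g(x)] = out.get(g(x), Fraction(0)) + p
    return out

def moment(dist, r):
    """E Z^r for a discrete law of Z."""
    return sum(p * Fraction(z) ** r for z, p in dist.items())

def tail_integral(dist, r):
    """Exact value of r * int_0^inf t^(r-1) P(Z >= t) dt.

    P(Z >= t) is constant on each interval (prev, v] between
    consecutive positive atoms, so the integral is a finite sum.
    """
    total, prev = Fraction(0), Fraction(0)
    for v in sorted(z for z in dist if z > 0):
        survival = sum(p for z, p in dist.items() if z >= v)
        total += survival * (Fraction(v) ** r - prev ** r)
        prev = Fraction(v)
    return total

abs_law = pushforward(abs, dist)
neg_law = pushforward(lambda x: -x, dist)

# Pairs (E|X|^r, tail integral) for formula (4).
formula_4 = {r: (moment(abs_law, r), tail_integral(abs_law, r)) for r in (1, 2, 3)}

# Mean of X recovered from the tails of X and of -X.
mean_via_tails = tail_integral(dist, 1) - tail_integral(neg_law, 1)
```

Both entries of each pair in `formula_4` should coincide, and `mean_via_tails` should equal `moment(dist, 1)`.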
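1.2 $L_p$-spaces
By $L_p(\Omega, \mathcal{M}, P)$, or $L_p$ if no misunderstanding may result,
we denote the Banach space of a. s. classes of equivalence of $p$-integrable
$\mathcal{M}$-measurable random variables $X$ with the norm
\[ \|X\|_p = \begin{cases} (E|X|^p)^{1/p} & \text{if } 1 \le p < \infty, \\ \operatorname{ess\,sup} |X| & \text{if } p = \infty. \end{cases} \]
If $X \in L_p$, we shall say that $X$ is $p$-integrable; in particular, $X$ is
square integrable if $EX^2 < \infty$.
We say that $X_n$ converges to $X$ in $L_p$, if $\|X_n-X\|_p \to 0$ as
$n \to \infty$. If $X_n$ converges to $X$ in $L_2$, we shall also use the phrase
sequence $X_n$ converges to $X$ in mean-square.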
Several useful inequalities are collected in the following.
Theorem 2
- (i) for $1 \le p \le q \le \infty$ we have Minkowski's inequality
\[ \|X\|_p \le \|X\|_q; \tag{5} \]
- (ii) for $1/p+1/q = 1$, $p \ge 1$ we have Hölder's inequality
\[ EXY \le \|X\|_p \|Y\|_q; \tag{6} \]
- (iii) for $1 \le p \le \infty$ we have triangle
inequality
\[ \|X+Y\|_p \le \|X\|_p+\|Y\|_p. \tag{7} \]
Special case $p = q = 2$ of Hölder's inequality (6) reads
$EXY \le \sqrt{EX^2 EY^2}$. It is frequently used and is known as
the Cauchy-Schwarz inequality.
For $1 \le p < \infty$ the conjugate space to $L_p$
(ie. the space of all bounded linear functionals on $L_p$) is usually
identified with $L_q$,
where $1/p+1/q = 1$. The identification is by the duality
$\langle f, g\rangle = \int f(\omega) g(\omega)\, dP$.
For the proof of Theorem 1.2 we need the following elementary inequality.
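Lemma 1 For $a, b > 0$, $1 < p < \infty$ and $1/p+1/q = 1$ we have
\[ ab \le \frac{a^p}{p} + \frac{b^q}{q}. \tag{8} \]
Proof.
Function $t \mapsto t^p/p+t^{-q}/q$ has the derivative
$t^{p-1}-t^{-q-1}$. The derivative is positive for $t > 1$ and negative
for $0 < t < 1$. Hence the minimum value of the function for $t > 0$ is
attained at $t = 1$, giving
\[ \frac{t^p}{p} + \frac{t^{-q}}{q} \ge \frac{1}{p}+\frac{1}{q} = 1. \]
Substituting $t = a^{1/q} b^{-1/p}$ and multiplying both sides by $ab$ we get (8).
$\Box$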
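The elementary inequality $ab \le a^p/p + b^q/q$ for conjugate exponents can be probed numerically. The following Python sketch is only an illustration (the grid and the function name `young_gap` are our own choices): it computes the gap between the two sides on a grid of positive $a, b$ and for several $p$, and also evaluates one case with $a^p = b^q$, where the two sides agree.

```python
def young_gap(a, b, p):
    """Return a**p/p + b**q/q - a*b, where 1/p + 1/q = 1."""
    q = p / (p - 1.0)
    return a**p / p + b**q / q - a * b

# Grid of positive values 0.1, 0.2, ..., 5.0 (illustration only).
grid = [0.1 * k for k in range(1, 51)]
min_gap = min(young_gap(a, b, p)
              for p in (1.5, 2.0, 3.0)
              for a in grid
              for b in grid)

# Equality case: a**p == b**q, e.g. p = 3, q = 1.5, a = 2, b = 4.
equality_gap = young_gap(2.0, 4.0, 3.0)
```

Up to floating-point rounding, `min_gap` should be non-negative and `equality_gap` should vanish.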
Proof of Theorem 1.2 (ii).
If either $\|X\|_p = 0$ or $\|Y\|_q = 0$, then $XY = 0$ a. s. Therefore we
consider only the case $\|X\|_p \|Y\|_q > 0$ and after rescaling we assume
$\|X\|_p = \|Y\|_q = 1$. Furthermore, the case $p = 1$, $q = \infty$ is trivial
as
$|XY| \le |X|\, \|Y\|_\infty$. For $1 < p < \infty$ by (8) we have
\[ |XY| \le \frac{|X|^p}{p} + \frac{|Y|^q}{q}. \]
Integrating this inequality we get $|EXY| \le E|XY| \le 1 = \|X\|_p \|Y\|_q$.
$\Box$
Proof of Theorem 1.2 (i).
For $p = 1$ this is just Jensen's inequality; for a more general
version see Theorem . For $1 < p < \infty$
by Hölder's inequality applied to the product of $1$ and $|X|^p$
we have
\[ \|X\|_p^p = E\{|X|^p \cdot 1\} \le (E|X|^q)^{p/q} (E1^r)^{1/r} = \|X\|_q^p, \]
where $r$ is computed from the equation $1/r+p/q = 1$. (This proof works also for
$p = 1$ with obvious changes in the write-up.)
$\Box$
Proof of Theorem 1.2 (iii).
The inequality is trivial if $p = 1$ or if $\|X+Y\|_p = 0$. In the remaining
cases
\[ \|X+Y\|_p^p \le E\{(|X|+|Y|)|X+Y|^{p-1}\} = E\{|X||X+Y|^{p-1}\}+ E\{|Y||X+Y|^{p-1}\}. \]
By Hölder's inequality
\[ \|X+Y\|_p^p \le \|X\|_p \|X+Y\|_p^{p/q}+\|Y\|_p \|X+Y\|_p^{p/q}. \]
Since $p/q = p-1$, dividing both sides by $\|X+Y\|_p^{p/q}$ we get the conclusion.
$\Box$
By $\mathrm{Var}(X)$ we shall denote the variance of a square integrable r. v. $X$
\[ \mathrm{Var}(X) = EX^2-(EX)^2 = E(X-EX)^2. \]
The correlation coefficient $\mathrm{corr}(X,Y)$ is defined for square-integrable
non-degenerate r. v. $X, Y$ by the formula
\[ \mathrm{corr}(X,Y) = \frac{EXY-EXEY}{\|X-EX\|_2 \|Y-EY\|_2}. \]
The Cauchy-Schwarz inequality implies that $-1 \le \mathrm{corr}(X,Y) \le 1$.
1.3 Tail estimates
The function $N(x) = P(|X| \ge x)$ describes tail behavior of a r. v. $X$.
Inequalities involving $N(\cdot)$ similar to Problems and
are sometimes easy to prove. Integrability that follows is of
considerable interest.
Below we give two rather technical tail estimates
and we state several corollaries for future reference. The proofs
use only the fact that $N: [0,\infty) \to [0,1]$ is a non-increasing function
such that $\lim_{x\to\infty} N(x) = 0$.
Theorem 3
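If there are $C > 1$, $0 < q < 1$, $x_0 \ge 0$ such that
for all $x > x_0$
\[ N(Cx) \le q N(x-x_0), \tag{9} \]
then there is $M < \infty$ such that $N(x) \le \frac{M}{x^\beta}$,
where $\beta = -\log_C q$.
Proof. Let $a_n$ be such that when $a_n = x_n-x_0$ then $a_{n+1} = Cx_n$.
Solving the resulting recurrence
we get $a_n = C^n-b$, where $b = Cx_0(C-1)^{-1}$.
Equation (9) says $N(a_{n+1}) \le q N(a_n)$. Therefore
\[ N(a_n) \le K q^n \]
for some constant $K < \infty$.
This implies the tail estimate
\[ N(x) \le M (x+b)^{-\beta} \]
for arbitrary $x > 0$. Namely, given $x > 0$ choose $n$ such that $a_n \le x < a_{n+1}$. Then
\[ N(x) \le N(a_n) \le K q^n = \frac{K}{q}\, q^{\log_C (a_{n+1}+b)} \le M (x+b)^{-\beta}. \]
$\Box$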
The next results follow from Theorem
1.3 and (4) and are stated for future reference.
Corollary 2
If there is $0 < q < 1$ and $x_0 \ge 0$ such that $N(2x) \le q N(x-x_0)$
for all $x > x_0$, then $E|X|^\beta < \infty$ for all $\beta < \log_2 1/q$.
Corollary 3
Suppose there is $C > 1$ such that
for every $0 < q < 1$ one can find $x_0 \ge 0$ such that
for all $x > x_0$. Then $E|X|^p < \infty$ for all $p$.
As a special case of Corollary 1.3 we have the following.
Corollary 4
Suppose there are $C > 1$, $K < \infty$ such that
for all $x$ large enough. Then $E|X|^p < \infty$ for all $p$.
The next result deals with exponentially small tails.
Theorem 4
If there are $C > 1$, $1 < K < \infty$, $x_0 \ge 0$ such that
\[ N(Cx) \le K N^2(x-x_0) \tag{12} \]
for all $x > x_0$, then there are $M < \infty$, $\beta > 0$ such that
\[ N(x) \le M \exp(-\beta x^\alpha), \]
where $\alpha = \log_C 2$.
Proof.
As in the proof of Theorem 1.3, let
$a_n = C^n-b$, $b = Cx_0/(C-1)$. Put $q_n = \log_K N(a_n)$.
Then (12) gives
\[ N(a_{n+1}) \le K N^2(a_n), \]
which implies
\[ q_{n+1} \le 2q_n+1. \]
Therefore by induction we get
\[ q_{m+n} \le 2^n(1+q_m)-1. \tag{14} \]
Indeed, (14) becomes equality for $n = 0$. If it holds for $n = k$, then
$q_{m+k+1} \le 2q_{m+k}+1 \le 2(2^k(1+q_m)-1)+1 = 2^{k+1}(1+q_m)-1$.
This proves (14) by induction.
Since $a_n \to \infty$, we have $N(a_n) \to 0$ and $q_n \to -\infty$.
Choose $m$ large enough to have $1+q_m < 0$. Then (14) implies
\[ N(a_{n+m}) \le K^{2^n(1+q_m)} = \exp(-\beta 2^n). \]
The proof is now concluded by the standard argument.
Selecting large enough $M$ we have
$N(x) \le 1 \le M\exp(-\beta x^\alpha)$
for all $x \le a_m$. Given $x > a_m$
choose $n \ge 0$ such that $a_{n+m} \le x < a_{n+m+1}$. Then
\[ N(x) \le N(a_{n+m}) \le \exp(-\beta 2^n) \le M \exp(-\beta 2^{\log_C a_{n+m+1}}) \le M \exp(-\beta x^\alpha). \]
$\Box$
Corollary 5
If there are $C < \infty$, $x_0 \ge 0$ such that
then there is $\beta > 0$ such that $E\exp(\beta|X|^2) < \infty$.
Corollary 6
If there are $C < \infty$, $x_0 \ge 0$ such that
then there is $\beta > 0$ such that $E\exp(\beta|X|) < \infty$.
1.4 Conditional expectations
Below we recall the definition of the conditional expectation of a r. v.
with respect to a $\sigma$-field and we state
several results that we need for future reference.
The definition is as old as axiomatic probability theory
itself, see
[].
The reader not familiar with conditional expectations should consult textbooks,
eg. Billingsley
[], Durrett [], or Neveu [].
Definition 1 Let $(\Omega, \mathcal{M}, P)$ be a probability space.
If $\mathcal{F} \subset \mathcal{M}$ is a $\sigma$-field and $X$ is an
integrable random variable, then the conditional expectation of $X$
given $\mathcal{F}$ is an integrable $\mathcal{F}$-measurable random
variable $Z$ such that $\int_A X\, dP = \int_A Z\, dP$ for all
$A \in \mathcal{F}$.
Conditional expectation
of an integrable random variable $X$
with respect to a $\sigma$-field $\mathcal{F} \subset \mathcal{M}$
will be denoted interchangeably by
$E\{X|\mathcal{F}\}$ and $E^{\mathcal{F}}X$. We shall also write
$E\{X|Y\}$ or $E^Y X$ for the conditional expectation
$E\{X|\mathcal{F}\}$ when $\mathcal{F} = \sigma(Y)$ is the $\sigma$-field
generated by a random variable $Y$.
Existence and almost sure uniqueness of the conditional
expectation $E\{X|\mathcal{F}\}$
follows from the Radon-Nikodym theorem,
applied to the finite signed measures $\mu(A) = \int_A X\, dP$ and
$P|_{\mathcal{F}}$, both defined on the measurable space
$(\Omega, \mathcal{F})$.
In some simple situations more explicit expressions can also be found.
Example. Suppose $\mathcal{F}$ is a $\sigma$-field generated
by the events $A_1, A_2, \dots, A_n$ which form a non-degenerate
disjoint partition of the probability space $\Omega$. Then it is easy
to check that
\[ E\{X|\mathcal{F}\}(\omega) = \sum_{k=1}^n m_k I_{A_k}(\omega), \]
where $m_k = \int_{A_k} X\, dP / P(A_k)$. In other words, on $A_k$ we
have $E\{X|\mathcal{F}\} = \int_{A_k} X\, dP / P(A_k)$.
In particular, if $X$ is discrete and $X = \sum x_j I_{B_j}$, then we
get intuitive expression
\[ E\{X|\mathcal{F}\} = \sum_j x_j P(B_j|A_k) \quad \text{for } \omega \in A_k. \]
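The partition formula is easy to experiment with on a finite probability space. The Python sketch below is a minimal illustration under our own assumptions: the six-point space, the values of `X`, and the two partitions are hypothetical. It computes $E\{X|\mathcal{F}\}$ as the block averages $m_k$, checks the defining property $\int_A X\,dP = \int_A Z\,dP$ on the generating events, and conditions once more on a coarser partition.

```python
from fractions import Fraction

# Hypothetical finite probability space with uniform probability.
omega = range(6)
P = {w: Fraction(1, 6) for w in omega}
X = {0: 1, 1: 4, 2: 2, 3: 2, 4: 0, 5: 9}   # hypothetical random variable

fine = [{0, 1}, {2, 3, 4}, {5}]             # partition generating F
coarse = [{0, 1, 2, 3, 4}, {5}]             # partition generating a smaller sigma-field G

def integral(Z, A):
    """Integral of Z over the event A with respect to P."""
    return sum(Z[w] * P[w] for w in A)

def cond_exp(Z, partition):
    """Conditional expectation given a partition: block-wise averages."""
    out = {}
    for A in partition:
        m = integral(Z, A) / sum(P[w] for w in A)
        for w in A:
            out[w] = m
    return out

Z = cond_exp(X, fine)          # E{X|F}
tower = cond_exp(Z, coarse)    # E{ E{X|F} | G }
direct = cond_exp(X, coarse)   # E{X|G}
```

Here `tower` and `direct` should agree, which is a finite instance of the smoothing property listed as part (ii) of the next theorem.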
Example. Suppose that $f(x, y)$ is the joint density with respect
to the Lebesgue measure on $\mathbb{R}^2$ of the bivariate random variable $(X, Y)$
and let $f_Y(y) \ne 0$ be the (marginal) density of $Y$. Put
$f(x|y) = f(x, y)/f_Y(y)$. Then $E\{X|Y\} = h(Y)$, where
$h(y) = \int_{-\infty}^\infty x f(x|y)\, dx$.
The next theorem lists properties of conditional expectations
that will be used without further mention.
Theorem 5
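- (i) If $Y$ is an $\mathcal{F}$-measurable random variable such that $X$
and $XY$ are integrable, then $E\{XY|\mathcal{F}\} = YE\{X|\mathcal{F}\}$;
- (ii) If $\mathcal{G} \subset \mathcal{F}$, then
$E^{\mathcal{G}} E^{\mathcal{F}} = E^{\mathcal{G}}$;
- (iii) If $\sigma(X, \mathcal{F})$ and $\mathcal{N}$ are independent
$\sigma$-fields, then
$E\{X|\mathcal{N} \vee \mathcal{F}\} = E\{X|\mathcal{F}\}$; here $\mathcal{N} \vee \mathcal{F}$
denotes the $\sigma$-field generated by the union
$\mathcal{N} \cup \mathcal{F}$;
- (iv) If $g(x)$ is a convex function and $E|g(X)| < \infty$,
then $g(E\{X|\mathcal{F}\}) \le E\{g(X)|\mathcal{F}\}$;
- (v) If $\mathcal{F}$ is the trivial $\sigma$-field consisting of
the events of probability $0$ or $1$ only, then $E\{X|\mathcal{F}\} = EX$;
- (vi) If $X, Y$ are integrable and $a, b \in \mathbb{R}$ then
$E\{aX+bY|\mathcal{F}\} = aE\{X|\mathcal{F}\}+bE\{Y|\mathcal{F}\}$;
- (vii) If $X$ and $\mathcal{F}$ are independent, then
$E\{X|\mathcal{F}\} = EX$.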
Remark: Inequality (iv) is known as Jensen's
inequality and this is how we shall refer to it.
The proof uses the following.
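Lemma 2
If $Y_1$ and $Y_2$ are
$\mathcal{F}$-measurable and $\int_A Y_1\, dP \le \int_A Y_2\, dP$ for all
$A \in \mathcal{F}$, then $Y_1 \le Y_2$ almost surely. If
$\int_A Y_1\, dP = \int_A Y_2\, dP$ for all $A \in \mathcal{F}$, then $Y_1 = Y_2$.
Proof. Let $A_\epsilon = \{Y_1 > Y_2+\epsilon\} \in \mathcal{F}$. Since
$\int_{A_\epsilon} Y_1\, dP \ge \int_{A_\epsilon} Y_2\, dP + \epsilon P(A_\epsilon)$, thus
$P(A_\epsilon) > 0$ is impossible. Event
$\{Y_1 > Y_2\}$ is the countable union of the events
$A_\epsilon$ (with $\epsilon$
rational); thus it has probability $0$ and $Y_1 \le Y_2$ with probability
one.
The second part follows from the first by symmetry.
$\Box$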
Proof of Theorem 1.4.
(i) This is verified first for $Y = I_B$ (the indicator function
of an event $B \in \mathcal{F}$). Let
$Y_1 = E\{XY|\mathcal{F}\}$, $Y_2 = YE\{X|\mathcal{F}\}$. From the definition one
can easily see that both $\int_A Y_1\, dP$ and $\int_A Y_2\, dP$ are equal
to $\int_{A \cap B} X\, dP$. Therefore $Y_1 = Y_2$ by the Lemma 1.4.
For the general case, approximate $Y$ by simple random variables and
use (vi).
(ii) This follows from Lemma 1.4: random variables
$Y_1 = E\{E\{X|\mathcal{F}\}|\mathcal{G}\}$,
$Y_2 = E\{X|\mathcal{G}\}$ are $\mathcal{G}$-measurable and for
$A$ in $\mathcal{G}$ both $\int_A Y_1\, dP$ and $\int_A Y_2\, dP$ are equal to
$\int_A X\, dP$.
(iii) Let $Y_1 = E\{X|\mathcal{N} \vee \mathcal{F}\}$, $Y_2 = E\{X|\mathcal{F}\}$.
We check first that
\[ \int_A Y_1\, dP = \int_A Y_2\, dP \]
for all $A = B \cap C$,
where $B \in \mathcal{N}$ and $C \in \mathcal{F}$. This holds true, as both sides of
the equation are
equal to
$P(B) \int_C X\, dP$.
Once equality $\int_A Y_1\, dP = \int_A Y_2\, dP$ is established for
the generators of the $\sigma$-field, it holds true for the whole
$\sigma$-field $\mathcal{N} \vee \mathcal{F}$; this is standard
measure theory, see $\pi$-$\lambda$ Theorem
[].
(iv) Here we need the first part of
Lemma 1.4. We also need to know that each convex
function $g(x)$ can be written as the supremum of a family of
affine functions $f_{a,b}(x) = ax+b$.
Let $Y_1 = E\{g(X)|\mathcal{F}\}$, $Y_2 = f_{a,b}(E\{X|\mathcal{F}\})$, $A \in \mathcal{F}$.
By (vi) we have
\[ \int_A Y_1\, dP = \int_A g(X)\, dP \ge \int_A f_{a,b}(X)\, dP = \int_A f_{a,b}(E\{X|\mathcal{F}\})\, dP = \int_A Y_2\, dP. \]
Hence
$f_{a,b}(E\{X|\mathcal{F}\}) \le E\{g(X)|\mathcal{F}\}$;
taking the supremum (over suitable $a, b$) ends the proof.
(v), (vi), (vii) These proofs are left as exercises.
$\Box$
Theorem 1.4 gives geometric interpretation of
the conditional expectation $E\{\cdot|\mathcal{F}\}$ as the projection of
the Banach space $L_p(\Omega, \mathcal{M}, P)$ onto its closed subspace
$L_p(\Omega, \mathcal{F}, P)$, consisting of all
$p$-integrable $\mathcal{F}$-measurable random variables, $p \ge 1$.
This projection is ``self adjoint'' in the sense that the adjoint
operator is given by the same ``conditional expectation'' formula,
although the adjoint operator acts on $L_q$ rather than on $L_p$; for
square integrable functions $E\{\cdot|\mathcal{F}\}$ is just the orthogonal
projection onto $L_2(\Omega, \mathcal{F}, P)$.
Monograph [] considers conditional expectation from this angle.
We will use the following (weak) version of the martingale
convergence
theorem.
Theorem 6
Suppose $\mathcal{F}_n$ is a decreasing
family of $\sigma$-fields, ie.
$\mathcal{F}_{n+1} \subset \mathcal{F}_n$ for all $n \ge 1$. If $X$ is integrable,
then $E\{X|\mathcal{F}_n\} \to E\{X|\mathcal{F}\}$ in $L_1$-norm,
where $\mathcal{F}$ is the intersection of all $\mathcal{F}_n$.
Proof. Suppose first that $X$ is square integrable.
Subtracting $m = EX$ if necessary, we can reduce the convergence
question to the centered case $EX = 0$. Denote $X_n = E\{X|\mathcal{F}_n\}$.
Since $\mathcal{F}_{n+1} \subset \mathcal{F}_n$, by Jensen's inequality
$EX_n^2 \ge 0$ is a decreasing non-negative sequence.
In particular, $EX_n^2$ converges.
Let $m < n$ be fixed. Then $E(X_n-X_m)^2 = EX_n^2+EX_m^2-2EX_nX_m$.
Since $\mathcal{F}_n \subset \mathcal{F}_m$, by Theorem 1.4 we have
\[ EX_nX_m = EE\{X_nX_m|\mathcal{F}_n\} = EX_nE\{X_m|\mathcal{F}_n\} \]
\[ = EX_nE\{E\{X|\mathcal{F}_m\}|\mathcal{F}_n\} = EX_nE\{X|\mathcal{F}_n\} = EX_n^2. \]
Therefore $E(X_n-X_m)^2 = EX_m^2-EX_n^2$. Since $EX_n^2$ converges,
$X_n$ satisfies the Cauchy condition for convergence in
$L_2$ norm.
This shows that for square integrable $X$, sequence $\{X_n\}$
converges in $L_2$.
If $X$ is not square integrable, then for every $\epsilon > 0$ there is a
square integrable $Y$
such that $E|X-Y| < \epsilon$. By Jensen's inequality $E\{X|\mathcal{F}_n\}$
and $E\{Y|\mathcal{F}_n\}$ differ by at most $\epsilon$ in $L_1$-norm; this holds
uniformly in $n$. Since by the first part of the proof $E\{Y|\mathcal{F}_n\}$ is convergent, it satisfies
the Cauchy condition in $L_2$ and hence in $L_1$. Therefore for each
$\epsilon > 0$ we can find $N$ such that for all $n, m > N$ we have
$E\{|E\{X|\mathcal{F}_n\}-E\{X|\mathcal{F}_m\}|\} < 3\epsilon$. This shows that
$E\{X|\mathcal{F}_n\}$
satisfies the Cauchy condition and hence converges in $L_1$.
The fact that the limit is
$X_\infty = E\{X|\mathcal{F}\}$ can be seen as follows.
Clearly $X_\infty$ is $\mathcal{F}_n$-measurable
for all $n$, ie. it is $\mathcal{F}$-measurable.
For $A \in \mathcal{F}$ (hence also in $\mathcal{F}_n$),
we have $EXI_A = EX_nI_A$. Since
$|EX_nI_A-EX_\infty I_A| \le E|X_n-X_\infty|I_A \le E|X_n-X_\infty| \to 0$,
therefore $EX_nI_A \to EX_\infty I_A$.
This shows that $EXI_A = EX_\infty I_A$ and by definition,
$X_\infty = E\{X|\mathcal{F}\}$.
$\Box$
1.5 Characteristic functions
The characteristic function of a real-valued random variable $X$ is
defined by $\phi_X(t) = E\exp(itX)$, where $i$ is the imaginary unit
($i^2 = -1$). It is easily seen that
If $X$ has the density $f(x)$, the characteristic function is just its Fourier
transform: $\phi(t) = \int_{-\infty}^\infty e^{itx} f(x)\, dx$. If $\phi(t)$ is
integrable, then the inverse Fourier transform gives
\[ f(x) = \frac{1}{2\pi} \int_{-\infty}^\infty e^{-itx} \phi(t)\, dt. \]
This is occasionally useful in verifying whether the specific f(t) is a
characteristic function as in the following example.
Example 1
The following gives an example of
a characteristic function that has finite support.
Let $\phi(t) = 1-|t|$ for $|t| < 1$ and $0$ otherwise. Then
\[ f(x) = \frac{1}{2\pi} \int_{-1}^{1} e^{-itx}(1-|t|)\, dt = \frac{1}{\pi} \int_0^1 (1-t)\cos tx\, dt = \frac{1}{\pi}\, \frac{1-\cos x}{x^2}. \]
Since $f(x) = \frac{1}{\pi} \frac{1-\cos x}{x^2}$ is non-negative and integrable,
$\phi(t)$ is indeed a characteristic function.
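The inversion integral in this example can be checked numerically. The following Python sketch is only an illustration: the function names and the midpoint-rule discretization are our own choices. It approximates $\frac{1}{2\pi}\int_{-1}^{1} e^{-itx}(1-|t|)\,dt$ and compares it with the closed form $\frac{1}{\pi}\frac{1-\cos x}{x^2}$ at a few points $x \ne 0$.

```python
import math

def phi(t):
    """Triangular characteristic function: 1 - |t| on (-1, 1), zero elsewhere."""
    return max(0.0, 1.0 - abs(t))

def density_closed_form(x):
    """(1/pi) * (1 - cos x) / x^2, for x != 0."""
    return (1.0 - math.cos(x)) / (math.pi * x * x)

def density_by_inversion(x, n=20000):
    """Midpoint rule for (1/(2*pi)) * int_{-1}^{1} exp(-i*t*x) * phi(t) dt.

    phi is even, so the imaginary part cancels and only cos(t*x) is integrated.
    With n even, the kink of phi at t = 0 falls on a cell boundary.
    """
    h = 2.0 / n
    s = 0.0
    for k in range(n):
        t = -1.0 + (k + 0.5) * h
        s += math.cos(t * x) * phi(t)
    return s * h / (2.0 * math.pi)

# Absolute errors of the numerical inversion at a few sample points.
errors = {x: abs(density_by_inversion(x) - density_closed_form(x))
          for x in (0.5, 1.0, 3.0)}
```

The entries of `errors` should be small, at the level of the discretization error of the midpoint rule.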
The following properties of characteristic functions are proved in any standard probability course,
see eg. [,].
Theorem 7
(i) The distribution of $X$ is determined uniquely by its characteristic
function $\phi(t)$.
(ii) If $E|X|^r < \infty$ for some $r = 0, 1, \dots$, then $\phi(t)$
is $r$-times differentiable, the derivative is uniformly continuous
and
\[ EX^k = (-i)^k \left. \frac{d^k}{dt^k} \phi(t) \right|_{t=0} \]
for all $0 \le k \le r$.
(iii) If $\phi(t)$ is $2r$-times differentiable for some natural $r$,
then $EX^{2r} < \infty$.
(iv) If $X, Y$ are independent random variables, then
$\phi_{X+Y}(t) = \phi_X(t) \phi_Y(t)$ for all $t \in \mathbb{R}$.
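For a $d$-dimensional random variable $\mathbf{X} = (X_1, \dots, X_d)$
the characteristic function
$\phi_{\mathbf{X}}: \mathbb{R}^d \to \mathbb{C}$ is defined by
$\phi_{\mathbf{X}}(\mathbf{t}) = E\exp(i\mathbf{t}\cdot\mathbf{X})$, where the dot denotes the
dot (scalar) product, ie. $\mathbf{x}\cdot\mathbf{y} = \sum x_k y_k$.
For a pair of real valued random variables $X, Y$, we also
write
$\phi(t, s) = \phi_{(X, Y)}((t, s))$ and we call $\phi(t, s)$
the joint characteristic
function
of $X$ and $Y$.
The following is the multi-dimensional version of Theorem 1.5.
Theorem 8
(i) The distribution of $\mathbf{X}$ is determined uniquely by its
characteristic function $\phi(\mathbf{t})$.
(ii) If $E\|\mathbf{X}\|^r < \infty$, then $\phi(\mathbf{t})$ is $r$-times
differentiable and
\[ EX_{j_1} \cdots X_{j_k} = (-i)^k \left. \frac{\partial^k}{\partial t_{j_1} \cdots \partial t_{j_k}} \phi(\mathbf{t}) \right|_{\mathbf{t} = 0} \]
for all $0 \le k \le r$.
(iii) If $\mathbf{X}, \mathbf{Y}$ are independent $\mathbb{R}^d$-valued random variables,
then
\[ \phi_{\mathbf{X}+\mathbf{Y}}(\mathbf{t}) = \phi_{\mathbf{X}}(\mathbf{t})\, \phi_{\mathbf{Y}}(\mathbf{t}) \]
for all
$\mathbf{t}$ in $\mathbb{R}^d$.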
The next result seems to be less known although it is both easy to
prove and to apply. We shall use it on several occasions in
Chapter . The converse is also true if we assume that the integer
parameter r in the proof below is even or that the joint characteristic function
$\phi(t, s)$ is differentiable; to prove the converse, one can follow the
usual proof of the inversion formula for characteristic functions, see, eg.
[].
Kagan, Linnik & Rao []
state explicitly several most frequently used variants of ().
Theorem 9
Suppose real valued random variables
$X, Y$ have the joint characteristic function $\phi(t, s)$.
Assume that $E|X|^m < \infty$ for some $m \in \mathbb{N}$. Let
$g(y)$ be such that
\[ E\{X^m|Y\} = g(Y). \]
Then for all real
$s$
\[ (-i)^m \left. \frac{\partial^m}{\partial t^m} \phi(t, s) \right|_{t=0} = Eg(Y)\exp(isY). \tag{16} \]
In particular, if $g(y) = \sum c_k y^k$ is a polynomial, then
\[ (-i)^m \left. \frac{\partial^m}{\partial t^m} \phi(t, s) \right|_{t=0} = \sum_k (-i)^k c_k \frac{d^k}{ds^k} \phi(0, s). \tag{17} \]
Proof. Since by assumption $E|X|^m < \infty$, the joint characteristic function
$\phi(t, s) = E\exp(itX+isY)$ can be differentiated $m$ times with
respect to $t$ and
\[ \frac{\partial^m}{\partial t^m} \phi(t, s) = i^m EX^m \exp(itX+isY). \]
Putting $t = 0$ establishes (16), see Theorem 1.4(i).
In order to prove (17), we need to show first that
$E|Y|^r < \infty$, where $r$ is the degree of the polynomial
$g(y)$. By Jensen's inequality $E|g(Y)| \le E|X|^m < \infty$, and since
$|g(y)/y^r| \to \mathrm{const} \ne 0$ as $|y| \to \infty$, therefore
there is $C > 0$ such that $|y|^r \le C|g(y)|$ for all $|y|$ large enough.
Hence $E|Y|^r < \infty$ follows.
Formula (17) is
now a simple consequence of (16); indeed, for $0 \le k \le r$
we have $EY^k\exp(isY) = (-i)^k \frac{d^k}{ds^k} \phi(0, s)$; this formula is obtained by
differentiating $k$-times $E\exp(isY)$ under the integral sign.
$\Box$
1.6 Symmetrization
Definition 2 A random variable X (also: a vector valued
random variable X) is symmetric if X and -X have
the same distribution.
Symmetrization
techniques deal with comparison of properties of
an arbitrary variable X with some symmetric variable $X_{sym}$.
Symmetric variables are usually easier to deal with, and proofs
of many theorems (not only characterization theorems, see e.g.
[]) become simpler when reduced to the
symmetric case.
There are two natural ways to obtain a symmetric random variable
$X_{sym}$ from an arbitrary random variable X. The first one
is to multiply X by an independent random sign ±1; in
terms of the characteristic functions this amounts to replacing
the characteristic function $\phi$ of X
by its symmetrization
$\frac12(\phi(t)+\phi(-t))$. This approach has the advantage
that if X is symmetric,
then its symmetrization $X_{sym}$ has the
same distribution as X. Integrability properties are also easy to
compare, because $|X| = |X_{sym}|$.
The other symmetrization,
which has perhaps less obvious properties but is
frequently found more useful, is defined as follows. Let $X'$ be
an independent copy of X. The symmetrization $\tilde X$ of X is
defined by $\tilde X = X - X'$. In terms of the characteristic functions
this corresponds to replacing the characteristic function $\phi(t)$ of X
by the characteristic function $|\phi(t)|^2$. This procedure is
easily
seen to change the distribution of X, except when X = 0.
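The relation between the second symmetrization and the squared modulus of the characteristic function can be checked numerically. The following is a minimal sketch, assuming a small discrete distribution whose values and probabilities are made up for illustration; the helper names are hypothetical. It computes the characteristic function of X - X' directly from the distribution of the difference and compares it with the squared modulus of the characteristic function of X.

```python
import cmath
from itertools import product

# Hypothetical discrete distribution of X (values and probabilities chosen for illustration).
values = [-1.0, 0.5, 2.0]
probs = [0.2, 0.5, 0.3]

def char_fn(t, vals, ps):
    """Characteristic function E exp(itX) of a discrete random variable."""
    return sum(p * cmath.exp(1j * t * v) for v, p in zip(vals, ps))

def char_fn_symmetrized(t):
    """Characteristic function of X - X', with X' an independent copy of X."""
    return sum(
        p1 * p2 * cmath.exp(1j * t * (v1 - v2))
        for (v1, p1), (v2, p2) in product(zip(values, probs), repeat=2)
    )

def max_gap(ts):
    """Largest deviation between the two expressions over the test points."""
    return max(abs(char_fn_symmetrized(t) - abs(char_fn(t, values, probs)) ** 2) for t in ts)

gap = max_gap([0.0, 0.3, 1.0, 2.5, -4.0])
print(gap)
```

The printed gap should be at the level of floating point rounding; the symmetrized characteristic function is also real valued, as it must be for a symmetric random variable.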
Theorem 10
(i) If the symmetrization $\tilde X$ of a random variable X
has a finite moment of order $p \ge 1$, then $E|X|^p < \infty$.
(ii) If the symmetrization $\tilde X$ of a random variable
X has finite exponential moment $E\exp(\lambda|\tilde X|)$,
then $E\exp(\lambda|X|) < \infty$, $\lambda > 0$.
(iii) If the symmetrization $\tilde X$ of a random variable X
satisfies $E\exp(\lambda|\tilde X|^2) < \infty$, then
$E\exp(\lambda|X|^2) < \infty$, $\lambda > 0$.
The usual approach to Theorem 1.6 uses the
symmetrization inequality,
which is of independent interest
(see Problem ) and formula (2).
Our proof requires extra assumptions, but instead is short,
does not require X and $X'$ to have the same distribution,
and it also gives a more accurate bound (within its domain of applicability).
Proof in the case when $E|X| < \infty$ and $EX = 0$:
Let $g(x) \ge 0$ be a convex function such that $Eg(\tilde X) < \infty$,
and let X, $X'$ be independent copies of X, so that the conditional
expectation given X satisfies $E^X X' = EX' = 0$. Then $Eg(X) = Eg(X - E^X X') = Eg(E^X\{X - X'\})$.
Since by Jensen's inequality, see Theorem 1.4 (iv), we have
$Eg(E^X\{X - X'\}) \le Eg(X - X')$, therefore
$Eg(X) \le Eg(X - X') = Eg(\tilde X) < \infty$. To end the proof,
consider three convex functions $g(x) = |x|^p$, $g(x) = \exp(\lambda|x|)$ and
$g(x) = \exp(\lambda x^2)$.
1.7 Uniform integrability
Recall that a sequence $\{X_n\}_{n \ge 1}$
is uniformly
integrable
if
$$\lim_{t\to\infty}\sup_{n \ge 1}\int_{\{|X_n| > t\}} |X_n|\, dP = 0.$$
Uniform integrability is often used in conjunction with weak convergence
to verify the convergence of moments. Namely, if $X_n$ is uniformly
integrable and converges in distribution to Y, then Y is integrable
and
$$EY = \lim_{n\to\infty} EX_n. \qquad (18)$$
The following result will be used in the proof of the Central Limit Theorem in
Section .
Proposition 3
If $X_1, X_2, \dots$ are centered i.i.d. random variables with
finite second moments and
$S_n = \sum_{j=1}^n X_j$, then $\{\frac1n S_n^2\}_{n \ge 1}$
is uniformly integrable.
The following lemma is a special case of the celebrated Khinchin inequality.
Lemma 3
If $\varepsilon_j$ are ±1 valued symmetric independent r. v., then for all real
numbers $a_j$
$$E\Big(\sum_{j=1}^n a_j\varepsilon_j\Big)^4 \le 3\Big(\sum_{j=1}^n a_j^2\Big)^2. \qquad (19)$$
Proof.
By independence and symmetry we have
$$E\Big(\sum_{j=1}^n a_j\varepsilon_j\Big)^4 = \sum_{j=1}^n a_j^4 + 6\sum_{i<j} a_i^2a_j^2,$$
which is less than $3\big(\sum_{j=1}^n a_j^4 + 2\sum_{i<j} a_i^2a_j^2\big) = 3\big(\sum_{j=1}^n a_j^2\big)^2$.
□
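For small n the fourth moment in (19) can be computed exactly by averaging over all sign patterns. The sketch below is an illustration only; the coefficients and function names are assumptions made for the example. It checks the identity $E(\sum_j a_j\varepsilon_j)^4 = \sum_j a_j^4 + 6\sum_{i<j} a_i^2a_j^2$, with the second sum taken over pairs i < j, and the bound (19).

```python
from itertools import product

def fourth_moment(a):
    """Exact E(sum_j a_j eps_j)^4, averaging over all 2^n equally likely sign patterns."""
    n = len(a)
    total = sum(
        sum(s * x for s, x in zip(signs, a)) ** 4
        for signs in product((-1, 1), repeat=n)
    )
    return total / 2 ** n

def closed_form(a):
    """sum_j a_j^4 + 6 * sum_{i<j} a_i^2 a_j^2."""
    n = len(a)
    diag = sum(x ** 4 for x in a)
    cross = sum(a[i] ** 2 * a[j] ** 2 for i in range(n) for j in range(i + 1, n))
    return diag + 6 * cross

a = [1, -2, 3, 0.5]  # arbitrary coefficients chosen for the example
lhs = fourth_moment(a)
bound = 3 * sum(x ** 2 for x in a) ** 2
print(lhs, closed_form(a), bound)
```

The enumeration has cost 2^n, so it is only meant as a sanity check for a handful of coefficients.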
The next lemma gives the Marcinkiewicz-Zygmund
inequality
in the special case needed below.
Lemma 4
If $X_k$ are i.i.d. centered
with fourth moments, then there is a constant $C < \infty$ such that
$$ES_n^4 \le C n^2 EX_1^4. \qquad (20)$$
Proof.
As in the proof of Theorem 1.6 we can estimate the fourth
moments of a centered r. v. by the fourth moment of its symmetrization,
$ES_n^4 \le E\tilde S_n^4$.
Let $\varepsilon_j$ be independent of $\tilde X_k$'s as in Lemma 1.7.
Then in distribution
$\tilde S_n \cong \sum_{j=1}^n \varepsilon_j\tilde X_j$.
Therefore, integrating with respect to the distribution of $\varepsilon_j$
first, from (19) we get
$$ES_n^4 \le 3E\Big(\sum_{j=1}^n \tilde X_j^2\Big)^2 = 3E\sum_{i,j=1}^n \tilde X_i^2\tilde X_j^2 \le 3n^2 E\tilde X_1^4.$$
Since $\|X - X'\|_4 \le 2\|X\|_4$ by triangle inequality
(7),
this ends the proof with $C = 3\cdot 2^4$.
□
We shall also need the following inequality.
Lemma 5
If $U, V \ge 0$ then
$$\int_{U+V > 2t} (U+V)^2\, dP \le 4\Big(\int_{U > t} U^2\, dP + \int_{V > t} V^2\, dP\Big).$$
Proof.
By (2) applied to $f(x) = x^2 I_{x > 2t}$ we have
$$\int_{U+V > 2t} (U+V)^2\, dP = 4t^2 P(U+V > 2t) + \int_{2t}^\infty 2x\, P(U+V > x)\, dx.$$
Since $P(U+V > x) \le P(U > x/2) + P(V > x/2)$, substituting x = 2y we get
$$\int_{U+V > 2t} (U+V)^2\, dP \le 4t^2\big(P(U > t) + P(V > t)\big) + 4\int_t^\infty \big(2y\, P(U > y) + 2y\, P(V > y)\big)\, dy = 4\int_{U > t} U^2\, dP + 4\int_{V > t} V^2\, dP.$$
□
Proof of Proposition 1.7. We follow Billingsley
[].
Let $\varepsilon > 0$ and choose $M > 0$ such that $\int_{\{|X| > M\}} |X|^2\, dP < \varepsilon$. Split
$X_k = X_k' + X_k''$,
where $X_k' = X_k I_{\{|X_k| \le M\}} - E\{X_k I_{\{|X_k| \le M\}}\}$,
and let $S_n'$, $S_n''$ denote the corresponding sums.
Notice that for any $U \ge 0$
we have $U I_{\{U > m\}} \le U^2/m$. Therefore
$\frac1n\int_{|S_n'| > t\sqrt n} (S_n')^2\, dP \le t^{-2}n^{-2}E(S_n')^4$,
which by Lemma 1.7 gives
$$\frac1n\int_{|S_n'| > t\sqrt n} (S_n')^2\, dP \le CM^4/t^2. \qquad (21)$$
Now we use orthogonality to estimate the second term:
$$\frac1n\int_{|S_n''| > t\sqrt n} (S_n'')^2\, dP \le \frac1n E(S_n'')^2 \le E|X_1''|^2 < \varepsilon. \qquad (22)$$
To end the proof notice that by Lemma 1.7 and inequalities (21), (22) we have
$$\frac1n\int_{\{|S_n| > 2t\sqrt n\}} S_n^2\, dP \le \frac1n\int_{\{|S_n'|+|S_n''| > 2t\sqrt n\}} (|S_n'|+|S_n''|)^2\, dP \le 4\Big(\frac{CM^4}{t^2} + \varepsilon\Big).$$
Therefore
$\limsup_{t\to\infty}\sup_n \frac1n\int_{\{|S_n| > 2t\sqrt n\}} S_n^2\, dP \le 4\varepsilon$.
Since $\varepsilon > 0$ is arbitrary, this ends the proof.
□
1.8 The Mellin transform
Definition 3
The Mellin transform of a random variable
$X \ge 0$ is defined for all complex s such that
$EX^{\Re s - 1} < \infty$ by the formula $M(s) = EX^{s-1}$.
The definition is consistent with the usual definition of the Mellin
transform of an integrable function: if X has a probability density
function f(x), then the Mellin transform of X is given by
$M(s) = \int_0^\infty x^{s-1}f(x)\, dx$.
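As an illustration of the density formula, take X with the exponential density $f(x) = e^{-x}$ on $(0, \infty)$; this example is not from the text. For real $s \ge 1$ the Mellin transform is then the Gamma function, $M(s) = \Gamma(s)$. The sketch below approximates the integral by a midpoint rule (the truncation point and step count are arbitrary choices) and compares it with `math.gamma`.

```python
import math

def mellin_exponential(s, upper=50.0, steps=200_000):
    """Midpoint-rule approximation of M(s) = int_0^inf x^(s-1) e^(-x) dx for real s >= 1.

    The integral is truncated at `upper`; the neglected tail is negligible for moderate s.
    """
    h = upper / steps
    return h * sum(
        ((k + 0.5) * h) ** (s - 1) * math.exp(-(k + 0.5) * h) for k in range(steps)
    )

approx = mellin_exponential(2.5)
exact = math.gamma(2.5)
print(approx, exact)
```

At s = 1 the same routine returns approximately 1, in agreement with M(1) = 1 for any probability distribution on the positive half-line.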
Theorem 11
If $X \ge 0$ is a random variable such that $EX^{a-1} < \infty$
for some
$a \ge 1$, then the Mellin transform $M(s) = EX^{s-1}$,
considered for $s \in \mathbb{C}$ such that $\Re s = a$, determines the
distribution of X uniquely.
Proof. The easiest case is when a = 1 and X > 0.
Then $M(s)$ is just the characteristic function of
log(X); thus the distribution of log(X), and
hence the distribution of X, is determined uniquely.
In general consider finite non-negative measure $\mu$ defined on
$(\mathbb{R}_+, \mathcal{B})$ by
$$\mu(A) = \int_{X^{-1}(A)} X^{a-1}\, dP.$$
Then $M(s)/M(a)$ is the characteristic function
of a random variable
$\xi: x \to \log(x)$ defined on the probability space
$(\mathbb{R}_+, \mathcal{B}, P')$ with the probability distribution
$P'(\cdot) = \mu(\cdot)/\mu(\mathbb{R}_+)$. Thus the distribution of $\xi$ is
determined uniquely by $M(s)$. Since $e^\xi$ has
distribution $P'(\cdot)$, $\mu$ is determined uniquely by $M(\cdot)$.
It remains to notice that if F is the distribution of our original
random variable X, then $dF = x^{1-a}\mu(dx) + \mu(\mathbb{R}_+)\delta_0(dx)$,
so $F(\cdot)$ is determined uniquely, too.
□
Theorem 12
If $X \ge 0$ and $EX^a < \infty$ for some a > 0,
then the Mellin transform of X is analytic in the strip
$1 < \Re s < 1+a$.
Proof. Since for every s with $0 < \Re s < a$ the modulus
of the function $\omega \to X^s\log(X)$ is bounded by an integrable function
$C_1 + C_2|X|^a$, therefore $EX^s$ can be differentiated with
respect
to s under the expectation sign at each point s, $0 < \Re s < a$.
□
1.9 Problems
Problem 1 [[]]
Use Fubini's theorem to show that if XY, X, Y are
integrable, then
$$EXY - EX\,EY = \int_{-\infty}^\infty\int_{-\infty}^\infty \big(P(X \ge t, Y \ge s) - P(X \ge t)P(Y \ge s)\big)\, dt\, ds.$$
Problem 2
Let $X \ge 0$ be a random variable and suppose that
for every 0 < q < 1 there is T = T(q) such that
$$P(X > 2t) \le q\,P(X > t) \quad \text{for all } t > T.$$
Show that all the moments of X are finite.
Problem 3
Show that if $X \ge 0$ is a random variable such that
$$P(X > 2t) \le (P(X > t))^2 \quad \text{for all } t > 0,$$
then $E\exp(\lambda|X|) < \infty$ for some $\lambda > 0$.
Problem 4
Show that if Eexp(lX2) = C < ¥ for some
a > 0, then
for all
real t.
Problem 5
Show that (11) implies $E\{|X|^{|X|}\} < \infty$.
Problem 6
Prove part (v) of Theorem 1.4.
Problem 7
Prove part (vi) of Theorem 1.4.
Problem 8
Prove part (vii) of Theorem 1.4.
Problem 9
Prove the following conditional version of Chebyshev's
inequality: if $\mathcal{F}$ is a $\sigma$-field, and $E|X| < \infty$,
then
$$P(|X| > t \mid \mathcal{F}) \le E\{|X| \mid \mathcal{F}\}/t$$
almost surely.
Problem 10
Show that if (X, Y) is uniformly distributed on a circle
centered at (0, 0), then for every a, b there is a non-random constant
C = C(a, b) such that E{X|aX+bY} = C(a,b)(aX+bY).
Problem 11
Show that if (U,V,X) are such that in distribution $(U,X) \cong (V,X)$
then $E\{U|X\} = E\{V|X\}$ almost surely.
Problem 12
Show that if X, Y are integrable non-degenerate random variables,
such that
|
E{X|Y} = aY, E{Y|X} = bX, |
|
then |ab| £ 1.
Problem 13
Suppose that X, Y are square-integrable random variables such
that
Show that Y = 0 almost
surely7.
Problem 14
Show that if X, Y are integrable such that
E{X|Y} = Y and E{Y|X} = X, then X = Y a. s.
Problem 15
Prove that if $X \ge 0$,
then function $f(t) := EX^{it}$, where $t \in \mathbb{R}$, determines the
distribution of X uniquely.
Problem 16
Prove that function f(t): = Emax{X, t} determines
uniquely the distribution of an integrable random variable X
in each of the following cases:
- [(a)] If X is discrete.
- [(b)] If X has continuous density.
Problem 17
Prove that, if $E|X| < \infty$, then function
$f(t) := E|X-t|$ determines uniquely the distribution of X.
Problem 18
Let p > 2 be fixed. Show that $\exp(-|t|^p)$ is
not a characteristic function.
Problem 19
Let $Q(t,s) = \log\phi(t,s)$, where $\phi(t,s)$ is the
joint characteristic function of square-integrable r. v. X,Y.
(i) Show that $E\{X|Y\} = \rho Y$ implies
$$\frac{\partial}{\partial t}Q(t,s)\Big|_{t=0} = \rho\frac{d}{ds}Q(0,s).$$
(ii) Show that $E\{X^2|Y\} = a+bY+cY^2$ implies
$$\frac{\partial^2}{\partial t^2}Q(t,s)\Big|_{t=0} + \Big(\frac{\partial}{\partial t}Q(t,s)\Big)^2\Big|_{t=0} = -a + ib\frac{d}{ds}Q(0,s) + c\frac{d^2}{ds^2}Q(0,s) + c\Big(\frac{d}{ds}Q(0,s)\Big)^2.$$
Problem 20 [see eg. []]
Suppose $a \in \mathbb{R}$ is the median of X.
(i) Show that the following symmetrization inequality
$$P(|X| \ge t) \le 2P(|\tilde X| \ge t - |a|)$$
holds for all t > |a|.
(ii) Use this inequality to prove Theorem 1.6 in the general case.
Problem 21
Suppose $(X_n, Y_n)$ converge to (X,Y) in distribution and $\{X_n\}$,
$\{Y_n\}$ are uniformly integrable. If $E(X_n|Y_n) = \rho Y_n$ for all n, show
that $E(X|Y) = \rho Y$.
Problem 22
Prove (18).
Chapter 2
Normal distributions
In this chapter we use linear algebra and characteristic
functions to analyze the multivariate
normal random variables. More information and other approaches can
be found, eg. in
[,,].
In Section we give criteria for normality which will be used
often in
proofs in subsequent chapters.
2.1 Univariate normal distributions
The usual definition of the standard normal variable Z specifies
its
density $f(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$. In general, the so
called $N(m, \sigma)$ density is given by
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-m)^2/(2\sigma^2)}.$$
By completing the
square one can check that the characteristic function
$\phi(t) = Ee^{itZ} = \int_{-\infty}^\infty e^{itx}f(x)\, dx$ of the
standard normal r. v. Z is given by
$$\phi(t) = e^{-t^2/2},$$
see Problem .
In multivariate case it is more convenient to
use characteristic functions directly.
Besides, characteristic functions are our main
technical tool and it doesn't hurt to start using them as soon as
possible. We shall therefore begin with the following definition.
Definition 4 A real valued random variable X has the
normal $N(m, \sigma)$
distribution if its characteristic
function has the form
$$\phi(t) = \exp\big(itm - \tfrac12\sigma^2t^2\big),$$
where m, $\sigma$ are real numbers.
From Theorem 1.5 it is easy to check by direct
differentiation that m = EX and $\sigma^2 = Var(X)$.
Using (15) it is
easy to see that every univariate normal X
can be written as
$$X = \sigma Z + m, \qquad (23)$$
where Z is the standard N(0,1) random variable
with the characteristic function $e^{-t^2/2}$.
The following properties of standard normal distribution N(0,1)
are self-evident:
- The characteristic function $e^{-t^2/2}$ has
analytic extension $e^{-z^2/2}$ to all complex $z \in \mathbb{C}$.
Moreover, $e^{-z^2/2} \ne 0$.
- Standard normal random variable Z has finite exponential moments
$E\exp(\lambda|Z|) < \infty$ for all $\lambda$; moreover,
$E\exp(\lambda Z^2) < \infty$ for all $\lambda < 1/2$ (compare Problem 1.9).
Relation (23) translates the above properties to
the general $N(m, \sigma)$ distributions. Namely, if X is normal, then
its characteristic function has non-vanishing analytic extension to $\mathbb{C}$
and
$$E\exp(\lambda X^2) < \infty$$
for some $\lambda > 0$.
For future reference we state the following simple but
useful observation. Computing $EX^k$ for k = 0, 1, 2 from Theorem 1.5
we immediately get the following.
Proposition 4
A characteristic function which can be expressed in the form
$\phi(t) = \exp(at^2+bt+c)$ for some complex constants
a, b, c, corresponds to the normal random variable, i.e.
$a \in \mathbb{R}$ and a < 0, $b \in i\mathbb{R}$ is imaginary and c = 0.
2.2 Multivariate normal distributions
We follow the usual linear algebra notation. Vectors are denoted
by small bold letters x, v, t, matrices by capital bold
initial letters
A, B, C
and vector-valued random variables by capital boldface X, Y, Z;
by the dot we denote the usual dot product in $\mathbb{R}^d$, i.e.
$x\cdot y := \sum_{j=1}^d x_jy_j$;
$\|x\| = (x\cdot x)^{1/2}$ denotes the usual Euclidean norm.
For typographical convenience we sometimes write $(a_1, \dots, a_k)$ for the
column vector with entries $a_1, \dots, a_k$.
By $A^T$ we denote the transpose of a matrix A.
Below we shall also consider another scalar product
$\langle\cdot,\cdot\rangle$ associated with the normal
distribution; the corresponding semi-norm will be denoted by the
triple bar $|||\cdot|||$.
Definition 5 An $\mathbb{R}^d$-valued random variable
Z is multivariate normal, or Gaussian (we shall use
both terms interchangeably; the second term will be preferred
in abstract situations) if for every $t \in \mathbb{R}^d$ the real valued
random variable $t\cdot Z$ is normal.
Clearly the distribution of univariate $t\cdot Z$ is determined uniquely by
its mean $m = m_t$ and its standard deviation $\sigma = \sigma_t$.
It is easy to see that $m_t = t\cdot m$, where m = EZ.
Indeed, by linearity of the expected value
$m_t = E\,t\cdot Z = t\cdot EZ$. Evaluating the characteristic
function $\phi(s)$ of the real-valued random variable $t\cdot Z$
at s = 1 we see that the characteristic function of Z
can be written as
$$\phi(t) = \exp\Big(it\cdot m - \frac{\sigma_t^2}{2}\Big).$$
In order to rewrite this formula in a more useful form, consider
the function B(x, y) of two arguments $x, y \in \mathbb{R}^d$ defined by
$$B(x, y) = E\{(x\cdot Z)(y\cdot Z)\} - (x\cdot m)(y\cdot m).$$
That is, B(x, y) is the covariance of two
real-valued (and jointly Gaussian) random variables
$x\cdot Z$ and $y\cdot Z$.
The following observations are easy to check.
- $B(\cdot, \cdot)$ is symmetric, i.e.
B(x, y) = B(y, x) for all x, y;
- $B(\cdot, \cdot)$ is a bilinear function, i.e.
$B(\cdot, y)$ is linear for every fixed y and
$B(x, \cdot)$ is linear for every fixed x;
- $B(\cdot, \cdot)$ is positive definite, i.e.
$B(x, x) \ge 0$ for all x.
We shall need the following well known linear algebra fact (the
proofs are explained below; explicit reference is, e.g.
[]).
Lemma 6 Each bilinear form B has the dot
product representation
$$B(x, y) = Cx\cdot y,$$
where C is a linear mapping, represented by a
d×d matrix $C = [c_{i,j}]$. Furthermore, if $B(\cdot, \cdot)$
is symmetric then C is symmetric, i.e. we have $C = C^T$.
Indeed, expand x and y with respect to the standard
orthogonal basis
$e_1, \dots, e_d$.
By bilinearity we have
$B(x, y) = \sum_{i,j} x_iy_j B(e_i, e_j)$,
which gives the dot product representation with
$c_{i,j} = B(e_i, e_j)$. Clearly, for symmetric $B(\cdot, \cdot)$
we get $c_{i,j} = c_{j,i}$; hence C is symmetric.
Lemma 7 If in addition $B(\cdot, \cdot)$ is positive definite
then
$$C = A^TA, \quad \text{i.e. } B(x, y) = (Ax)\cdot(Ay), \qquad (24)$$
for a d×d matrix A. Moreover, A can be chosen to be symmetric.
The easiest way to see the last fact is to diagonalize C (this
is always possible, as C is symmetric). The eigenvalues of C
are real and, since $B(\cdot, \cdot)$ is positive definite, they are
non-negative. If $\Lambda$ denotes a (diagonal) matrix (consisting of
eigenvalues of C) in the diagonal representation $C = U\Lambda U^T$
and $\Delta$ is the diagonal matrix formed by the square roots of
the eigenvalues, then $A = U\Delta U^T$. Moreover, this
construction gives symmetric $A = A^T$. In general, there is no unique
choice of A and we shall sometimes find it more convenient to use
non-symmetric A, see Example below.
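The diagonalization argument can be carried out by hand in dimension two. The sketch below is an illustration under that restriction; the function name and the test matrix are assumptions made for the example. It diagonalizes a symmetric non-negative definite 2×2 matrix C with a single rotation U, takes square roots of the eigenvalues, and forms the symmetric matrix A = U D U^T with A·A = C.

```python
import math

def symmetric_sqrt_2x2(c11, c12, c22):
    """Return A = U D U^T for the symmetric non-negative definite C = [[c11, c12], [c12, c22]]."""
    # Rotation angle that diagonalizes C (a single Jacobi rotation).
    theta = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Eigenvalues: diagonal entries of U^T C U with U = [[cos, -sin], [sin, cos]].
    lam1 = c11 * cos_t**2 + 2 * c12 * cos_t * sin_t + c22 * sin_t**2
    lam2 = c11 * sin_t**2 - 2 * c12 * cos_t * sin_t + c22 * cos_t**2
    # Clamp tiny negative rounding errors before taking square roots.
    d1, d2 = math.sqrt(max(lam1, 0.0)), math.sqrt(max(lam2, 0.0))
    # A = U diag(d1, d2) U^T, written out entrywise.
    a11 = d1 * cos_t**2 + d2 * sin_t**2
    a12 = (d1 - d2) * cos_t * sin_t
    a22 = d1 * sin_t**2 + d2 * cos_t**2
    return [[a11, a12], [a12, a22]]

A = symmetric_sqrt_2x2(2.0, 1.0, 2.0)
A_squared = [[sum(A[i][k] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
print(A_squared)
```

In higher dimensions one would use a general symmetric eigenvalue routine in place of the explicit rotation; the construction of A from the eigenvalues is the same.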
The linear algebra results imply that the characteristic function
corresponding to a normal distribution on $\mathbb{R}^d$ can be written in
the form
$$\phi(t) = \exp\big(it\cdot m - \tfrac12 Ct\cdot t\big). \qquad (25)$$
Theorem 1.5 identifies $m \in \mathbb{R}^d$ as the mean of
the normal random variable $Z = (Z_1, \dots, Z_d)$; similarly,
double differentiation of $\phi(t)$ at t = 0 shows that
$C = [c_{i,j}]$
is given by
$c_{i,j} = Cov(Z_i, Z_j)$. This establishes the following.
Theorem 13
The characteristic function corresponding to a normal random variable
$Z = (Z_1, \dots, Z_d)$ is given by (25),
where m = EZ and $C = [c_{i,j}]$, $c_{i,j} = Cov(Z_i, Z_j)$,
is the covariance matrix.
From (24) and (25) we get also
$$\phi(t) = \exp\big(it\cdot m - \tfrac12 (At)\cdot(At)\big). \qquad (26)$$
In the centered case it is perhaps more intuitive to write
$B(x,y) = \langle x, y\rangle$; this bilinear product
might (in degenerate cases) turn out to be 0 on some non-zero vectors.
In this notation (26) can be written as
$$E\exp(it\cdot Z) = \exp\big(-\tfrac12\langle t, t\rangle\big). \qquad (27)$$
From the above discussion, we have the following multivariate
generalization of (23).
Theorem 14
Each d-dimensional normal random variable Z has the
same distribution as $m + A\vec\gamma$,
where $m \in \mathbb{R}^d$ is
deterministic, A is a (symmetric) d×d matrix
and $\vec\gamma = (\gamma_1, \dots, \gamma_d)$ is a random vector
such that the components $\gamma_1, \dots, \gamma_d$ are
independent N(0, 1) random variables.
Proof. Clearly,
$E\exp(it\cdot(m + A\vec\gamma)) = \exp(it\cdot m)\,E\exp(it\cdot(A\vec\gamma))$. Since the
characteristic function of $\vec\gamma$ is
$E\exp(ix\cdot\vec\gamma) = \exp(-\frac12\|x\|^2)$ and
$t\cdot(A\vec\gamma) = (A^Tt)\cdot\vec\gamma$, therefore we get
$E\exp(it\cdot(m + A\vec\gamma)) = \exp(it\cdot m)\exp(-\frac12\|A^Tt\|^2)$, which is another form of
(26).
□
Theorem 2.2 can be actually interpreted as the almost sure
representation. However, if A is not of full rank,
the number of independent
N(0,1) r. v. can be reduced. In addition, the representation $Z \cong m + A\vec\gamma$ from Theorem 2.2
is not unique if the symmetry condition is dropped. Theorem gives the same
representation with non-symmetric $A = [e_1, \dots, e_k]$.
The argument given below has
more geometric flavor.
Infinite dimensional generalizations are also known, see ()
and the comment preceding Lemma .
Theorem 15
Each d-dimensional normal random variable Z can be written as
$$Z = m + \sum_{j=1}^k \gamma_j e_j, \qquad (28)$$
where
$k \le d$, $m \in \mathbb{R}^d$, $e_1, e_2, \dots, e_k$
are deterministic linearly independent vectors in $\mathbb{R}^d$ and
$\gamma_1, \dots, \gamma_k$ are independent identically distributed normal N(0, 1)
random variables.
Proof.
Without loss of generality we may assume EZ = 0 and establish the
representation with m = 0.
Let $\mathbb{H}$ denote the linear span of the columns of A in $\mathbb{R}^d$,
where A is the matrix from (26).
From Theorem 2.2 it follows that with probability one
$Z \in \mathbb{H}$. Consider now $\mathbb{H}$ as a Hilbert space with a
scalar product $\langle x, y\rangle$, given by
$\langle x, y\rangle = (Ax)\cdot(Ay)$. Since
the null space of A and the column space of A have only
zero vector in common,
this scalar product is non-degenerate, i.e. $\langle x, x\rangle \ne 0$
for $\mathbb{H} \ni x \ne 0$.
Let $e_1, e_2, \dots, e_k$ be the
orthonormal (with respect to $\langle\cdot,\cdot\rangle$) basis of
$\mathbb{H}$, where $k = \dim\mathbb{H}$. By Theorem 2.2
Z is $\mathbb{H}$-valued. Therefore with probability
one we can write $Z = \sum_{j=1}^k \gamma_j e_j$,
where
$\gamma_j = \langle e_j, Z\rangle$ are random coefficients
in the orthogonal expansion.
It remains to verify that $\gamma_1, \dots, \gamma_k$ are i.i.d.
normal N(0, 1) r. v. With this in mind, we use (26) to compute
their joint characteristic function:
$$E\exp\Big(i\sum_{j=1}^k t_j\gamma_j\Big) = E\exp\Big(i\sum_{j=1}^k t_j\langle e_j, Z\rangle\Big) = E\exp\Big(i\Big\langle\sum_{j=1}^k t_je_j, Z\Big\rangle\Big).$$
By (27)
$$E\exp\Big(i\Big\langle\sum_{j=1}^k t_je_j, Z\Big\rangle\Big) = \exp\Big(-\frac12\Big\langle\sum_{j=1}^k t_je_j, \sum_{j=1}^k t_je_j\Big\rangle\Big) = \exp\Big(-\frac12\sum_{j=1}^k t_j^2\Big).$$
The last equality is a consequence of orthonormality of vectors
$e_1, e_2, \dots, e_k$
with respect to the scalar product $\langle\cdot,\cdot\rangle$.
□
The next theorem lists two important properties of the normal
distribution that can be easily verified by writing the joint
characteristic function.
The second property is a consequence of the polarization
identity
$$|||t+s|||^2 + |||t-s|||^2 = 2|||t|||^2 + 2|||s|||^2,$$
where
$$|||x|||^2 := \langle x, x\rangle := \|Ax\|^2; \qquad (29)$$
the proof is left as an exercise.
Theorem 16
If X, Y are independent with the same centered normal distribution, then
a) $\frac{X+Y}{\sqrt2}$ has the same distribution as
X;
b) X+Y and X-Y are independent.
Now we consider the multivariate normal density. The density of $\vec\gamma$ in Theorem
2.2 is the product of the one-dimensional standard normal densities, i.e.
$$f_{\vec\gamma}(x) = (2\pi)^{-d/2}\exp\big(-\tfrac12\|x\|^2\big).$$
Suppose that $\det C \ne 0$, which ensures that A is nonsingular.
By the change of variable formula, from Theorem 2.2 we get the
following expression for the multivariate normal density.
Theorem 17
If Z is centered normal with the nonsingular
covariance matrix C, then the density of Z is given by
$$f_Z(x) = (2\pi)^{-d/2}(\det A)^{-1}\exp\big(-\tfrac12\|A^{-1}x\|^2\big),$$
or
$$f_Z(x) = (2\pi)^{-d/2}(\det C)^{-1/2}\exp\big(-\tfrac12 C^{-1}x\cdot x\big),$$
where matrices A and C are related by (24).
In the nonsingular case this immediately implies strong integrability.
Theorem 18
If Z is normal, then there
is $\varepsilon > 0$ such that
$$E\exp(\varepsilon\|Z\|^2) < \infty.$$
Remark: Theorem 2.2 holds true also in the singular
case and for Gaussian random variables with values in infinite dimensional
spaces; for the proof based on Theorem 2.2, see
Theorem below.
The Hilbert space IH introduced in the proof of
Theorem 2.2
is
called the
Reproducing Kernel Hilbert Space (RKHS)
of a normal distribution,
cf. [,].
It can be defined also in more general settings.
Suppose we want to consider jointly two independent normal r. v. X
and Y, taking values in $\mathbb{R}^{d_1}$ and $\mathbb{R}^{d_2}$ respectively, with
corresponding reproducing kernel Hilbert spaces $\mathbb{H}_1$, $\mathbb{H}_2$ and
the corresponding dot products $\langle\cdot,\cdot\rangle_1$ and
$\langle\cdot,\cdot\rangle_2$. Then the $\mathbb{R}^{d_1+d_2}$-valued random
variable (X, Y) has the orthogonal sum $\mathbb{H}_1\oplus\mathbb{H}_2$
as
the Reproducing Kernel Hilbert Space.
This method shows further geometric aspects of jointly normal
random variables. Suppose an $\mathbb{R}^{d_1+d_2}$-valued random variable
(X, Y) is (jointly) normal and has $\mathbb{H}$ as the reproducing
kernel Hilbert space (with the scalar product
$\langle\cdot,\cdot\rangle$). Recall that
$\mathbb{H} = \{A\begin{bmatrix} x \\ y \end{bmatrix}: \begin{bmatrix} x \\ y \end{bmatrix} \in \mathbb{R}^{d_1+d_2}\}$. Let $\mathbb{H}_Y$ be the subspace of $\mathbb{H}$
spanned by the vectors
$\{\begin{bmatrix} 0 \\ y \end{bmatrix}: y \in \mathbb{R}^{d_2}\}$; similarly let $\mathbb{H}_X$ be the subspace of $\mathbb{H}$
spanned by the vectors
$\begin{bmatrix} x \\ 0 \end{bmatrix}$. Let P be (the matrix of) the
linear transformation $\mathbb{H}_X \to \mathbb{H}_Y$ obtained from the
$\langle\cdot,\cdot\rangle$-orthogonal projection $\mathbb{H} \to \mathbb{H}_Y$
by narrowing its domain to $\mathbb{H}_X$.
Denote $Q = P^T$; Q represents the orthogonal projection in the
dual norm defined in Section below.
Theorem 19
If (X, Y) has jointly normal distribution on $\mathbb{R}^{d_1+d_2}$,
then random vectors X-QY and Y are stochastically
independent.
Proof.
The joint characteristic function of X-QY
and Y factors as follows:
$$\phi(t, s) = E\exp\big(it\cdot(X-QY) + is\cdot Y\big) = \exp\Big(-\frac12\Big|\Big|\Big|\begin{bmatrix} t \\ s - Q^Tt \end{bmatrix}\Big|\Big|\Big|^2\Big) = \exp\Big(-\frac12\Big|\Big|\Big|\begin{bmatrix} t \\ -Q^Tt \end{bmatrix}\Big|\Big|\Big|^2\Big)\exp\Big(-\frac12\Big|\Big|\Big|\begin{bmatrix} 0 \\ s \end{bmatrix}\Big|\Big|\Big|^2\Big).$$
The last identity holds because by our choice of P, vectors
$\begin{bmatrix} t \\ -Q^Tt \end{bmatrix}$ and
$\begin{bmatrix} 0 \\ s \end{bmatrix}$ are orthogonal with respect to scalar product
$\langle\cdot,\cdot\rangle$.
□
In particular, since $E\{X|Y\} = E\{X-QY|Y\} + QY$,
we get
Corollary 7
If both X and Y have mean zero, then
$$E\{X|Y\} = QY.$$
For general multivariate normal random variables X and Y applying the
above to centered normal random variables $X - m_X$ and
$Y - m_Y$ respectively, we get
$$E\{X|Y\} = a + QY; \qquad (30)$$
vector $a = m_X - Qm_Y$ and matrix Q are determined by the
expected values $m_X$, $m_Y$ and by the (joint) covariance
matrix C (uniquely if the covariance $C_Y$ of Y
is non-singular). To find Q, multiply (30)
(as a column vector) from the right by $(Y-EY)^T$ and take the expected value.
By Theorem 1.4(i) we get $Q = R\times C_Y^{-1}$,
where we have written C as the (suitable) block matrix
$C = \begin{bmatrix} C_X & R \\ R^T & C_Y \end{bmatrix}$.
An alternative
proof of (30) (and of Corollary 2.2) is to use
the converse to Theorem 1.5.
Equality (30) is usually referred to as linearity of
regression. For the
bivariate normal distribution it takes the form
$E\{X|Y\} = a + bY$ and it can be established by direct
integration; for more than two variables computations become
more difficult and the characteristic functions are quite handy.
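The formula for the regression matrix can be checked on a small example. The sketch below assumes a scalar X and a two-dimensional Y with a made-up joint covariance written in block form with blocks C_X, R, R^T, C_Y; all numbers and helper names are hypothetical. It computes Q as R times the inverse of C_Y and verifies that X - QY is uncorrelated with each component of Y, which for jointly normal variables is the independence statement used above.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

# Hypothetical joint covariance of (X, Y1, Y2), in block form [[C_X, R], [R^T, C_Y]].
C_X = 2.0
R = [0.5, -0.3]                      # Cov(X, Y1), Cov(X, Y2)
C_Y = [[1.0, 0.4], [0.4, 1.5]]       # covariance matrix of Y

C_Y_inv = inverse_2x2(C_Y)
# Q = R C_Y^{-1} (a 1x2 row, since X is scalar in this example).
Q = [sum(R[k] * C_Y_inv[k][j] for k in range(2)) for j in range(2)]

# Cov(X - QY, Y_j) = R_j - (Q C_Y)_j should vanish for each j.
residual_cov = [R[j] - sum(Q[k] * C_Y[k][j] for k in range(2)) for j in range(2)]
# Var(X - QY) = C_X - Q R^T is the residual variance.
residual_var = C_X - sum(Q[k] * R[k] for k in range(2))
print(Q, residual_cov, residual_var)
```

The residual variance stays strictly positive here because the example covariance matrix is non-singular.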
Corollary 8
Suppose (X, Y) has a (joint) normal distribution on $\mathbb{R}^{d_1+d_2}$
and $\mathbb{H}_X$, $\mathbb{H}_Y$ are $\langle\cdot,\cdot\rangle$-orthogonal, i.e.
every component of X is uncorrelated with all components of
Y. Then X, Y are independent.
Indeed, in this case Q is the zero matrix; the conclusion
follows from Theorem 2.2.
Example 2
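In this example we consider a pair of
(jointly) normal random variables $X_1$, $X_2$. For simplicity of
notation we suppose $EX_1 = 0$, $EX_2 = 0$. Let
$Var(X_1) = \sigma_1^2$, $Var(X_2) = \sigma_2^2$ and denote $corr(X_1, X_2) = \rho$.
Then
$C = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$ and the joint characteristic function is
$$\phi(t_1, t_2) = \exp\big(-\tfrac12 t_1^2\sigma_1^2 - \tfrac12 t_2^2\sigma_2^2 - t_1t_2\rho\sigma_1\sigma_2\big).$$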
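If $\sigma_1\sigma_2 \ne 0$ we can normalize the variables and
consider the pair $Y_1 = X_1/\sigma_1$ and $Y_2 = X_2/\sigma_2$.
The covariance matrix of the last pair is
$C_Y = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$;
the corresponding scalar product is given by
$$\Big\langle\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\Big\rangle = x_1y_1 + x_2y_2 + \rho x_1y_2 + \rho x_2y_1$$
and the corresponding RKHS norm is
$|||\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}||| = (x_1^2 + x_2^2 + 2\rho x_1x_2)^{1/2}$.
Notice that when $\rho = \pm1$ the RKHS norm is degenerate
and equals $|x_1 \pm x_2|$.
Denoting $\rho = \sin2\theta$, it is easy to check that
$A_Y = \begin{bmatrix} \cos\theta & \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$
and its inverse
$A_Y^{-1} = \frac{1}{\cos2\theta}\begin{bmatrix} \cos\theta & -\sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$
exists if $\theta \ne \pm\pi/4$, i.e. when $\rho^2 \ne 1$.
This implies that the joint density of $Y_1$ and $Y_2$ is given by
$$f(x, y) = \frac{1}{2\pi\cos2\theta}\exp\Big(-\frac{1}{2\cos^22\theta}\big(x^2 + y^2 - 2xy\sin2\theta\big)\Big). \qquad (31)$$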
We can easily verify that in this case Theorem 2.2
gives
$$Y_1 = \gamma_1\cos\theta + \gamma_2\sin\theta, \qquad Y_2 = \gamma_1\sin\theta + \gamma_2\cos\theta$$
for some i.i.d. normal N(0, 1) r. v. $\gamma_1$, $\gamma_2$.
One way to see this is to compare the variances and the
covariances of both sides. Another representation $Y_1 = \gamma_1$,
$Y_2 = \rho\gamma_1 + \sqrt{1-\rho^2}\,\gamma_2$ illustrates non-uniqueness and makes Theorem
2.2 obvious in bivariate case.
Returning back to our original random variables $X_1$, $X_2$, we have
$X_1 = \gamma_1\sigma_1\cos\theta + \gamma_2\sigma_1\sin\theta$
and $X_2 = \gamma_1\sigma_2\sin\theta + \gamma_2\sigma_2\cos\theta$;
this representation holds true also in the degenerate case.
To illustrate previous theorems, notice that Corollary 2.2 in
the bivariate case follows immediately from (31).
Theorem 2.2 says in this case that $Y_1 - \rho Y_2$ and
$Y_2$ are independent; this can also be easily checked either
by using density (31) directly, or (easier) by
verifying that $Y_1 - \rho Y_2$ and $Y_2$ are uncorrelated.
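The two representations of the normalized pair can be compared through their covariance matrices. The following sketch uses a hypothetical correlation value; the variable names are chosen for the example. It builds the symmetric matrix with entries cos θ and sin θ, where ρ = sin 2θ, and the non-symmetric triangular alternative, and checks that both reproduce the covariance matrix with unit variances and correlation ρ.

```python
import math

rho = 0.6                                # hypothetical correlation, |rho| < 1
theta = 0.5 * math.asin(rho)             # chosen so that rho = sin(2*theta)
c, s = math.cos(theta), math.sin(theta)

# Symmetric representation: Y1 = g1*cos + g2*sin, Y2 = g1*sin + g2*cos.
A_Y = [[c, s], [s, c]]
cov = [[sum(A_Y[i][k] * A_Y[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
det_A = c * c - s * s                    # equals cos(2*theta) = sqrt(1 - rho^2)

# Non-symmetric representation: Y1 = g1, Y2 = rho*g1 + sqrt(1 - rho^2)*g2.
B = [[1.0, 0.0], [rho, math.sqrt(1 - rho**2)]]
cov_B = [[sum(B[i][k] * B[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
print(cov, det_A, cov_B)
```

Both matrices have the same product with their transpose, which is all that the distribution of a centered normal vector depends on.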
Example 3
In this example we analyze a discrete
time Gaussian random walk $\{X_k\}_{0 \le k \le T}$.
Let $\xi_1, \xi_2, \dots$ be i.i.d. N(0,1).
We are interested in explicit formulas for the characteristic function
and for the density of the $\mathbb{R}^T$-valued random variable
$X = (X_1, X_2, \dots, X_T)$, where
$$X_k = \sum_{j=1}^k \xi_j \qquad (32)$$
are partial sums.
Clearly, m = 0. Comparing (32) with (28) we observe that
$$A = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix}.$$
Since this A is not symmetric, we use (26) in the form $\exp(-\frac12\|A^Tt\|^2)$ obtained in the proof of Theorem 2.2, and get
$$\phi(t) = \exp\Big(-\frac12\big((t_1+t_2+\dots+t_T)^2 + (t_2+\dots+t_T)^2 + \dots + t_T^2\big)\Big).$$
To find the formula for joint density, notice that A is the matrix
representation of the linear operator, which to a given sequence of numbers
$(x_1, x_2, \dots, x_T)$ assigns the sequence
of its partial sums $(x_1, x_1+x_2, \dots, x_1+x_2+\dots+x_T)$.
Therefore, its inverse is the finite difference
operator $\Delta: (x_1, x_2, \dots, x_T) \to (x_1, x_2-x_1, \dots, x_T-x_{T-1})$.
This implies
$$A^{-1} = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -1 & 1 & 0 & \cdots & 0 \\ 0 & -1 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & -1 & 1 \end{bmatrix}.$$
Since $\det A = 1$, we get
$$f(x) = (2\pi)^{-T/2}\exp\Big(-\frac12\big(x_1^2 + (x_2-x_1)^2 + \dots + (x_T-x_{T-1})^2\big)\Big). \qquad (33)$$
Interpreting X as the discrete time process $X_1, X_2, \dots$,
the probability density function
for its trajectory x is given by
$f(x) = C\exp(-\frac12\|\Delta x\|^2)$.
Expression
$\frac12\|\Delta x\|^2$ can be interpreted as proportional to the
kinetic energy of the motion described by the path x; assigning
probabilities by $Ce^{-\mathrm{Energy}/(kT)}$ is a well known practice in
statistical physics. In continuous time, the derivative plays analogous
role, compare Schilder's theorem
[].
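The matrix facts used in this example are easy to verify for a small number of steps. The sketch below takes a hypothetical T = 5 and uses plain nested lists; the helper `matmul` is written for the example. It checks that the finite difference operator inverts the partial-sum matrix A and that the covariance matrix, A times its transpose, has entries min(i, j), the covariance of the partial sums $X_i$ and $X_j$.

```python
T = 5  # hypothetical number of steps

# A maps increments (xi_1, ..., xi_T) to partial sums: lower-triangular matrix of ones.
A = [[1 if j <= i else 0 for j in range(T)] for i in range(T)]
# Candidate inverse: the finite difference operator.
D = [[1 if j == i else (-1 if j == i - 1 else 0) for j in range(T)] for i in range(T)]

def matmul(P, Q):
    """Product of two matrices given as nested lists."""
    return [
        [sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
        for i in range(len(P))
    ]

identity = [[1 if i == j else 0 for j in range(T)] for i in range(T)]
DA = matmul(D, A)

# Covariance matrix C = A A^T; with 1-based indices Cov(X_i, X_j) = min(i, j).
A_T = [list(row) for row in zip(*A)]
C = matmul(A, A_T)
print(DA == identity, C)
```

Since A is triangular with ones on the diagonal, its determinant is 1, which is why no normalizing factor besides the power of 2π appears in the density.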
2.3 Analytic characteristic functions
The characteristic function $\phi(t)$ of the univariate normal
distribution is a well defined differentiable function of complex argument t.
That is, $\phi$ has analytic extension to complex plane $\mathbb{C}$. The theory
of functions of complex variable provides a powerful tool; we shall
use it to recognize the normal characteristic functions.
Deeper theory of analytic characteristic functions and stronger versions of
theorems below can be found in monographs
[,].
Definition 6 We
shall say that a characteristic function $\phi(t)$ is analytic if it
can be extended from the real line $\mathbb{R}$ to the function analytic in a
domain in complex plane $\mathbb{C}$.
Because of uniqueness we shall use the same
symbol $\phi$ to denote both.
Clearly, normal distribution has analytic characteristic function. Example
1.5 presents a non-analytic characteristic function.
We begin with the probabilistic (moment) condition for the existence of the
analytic extension.
Theorem 20
If a random variable X has finite
exponential moment $E\exp(a|X|) < \infty$, where a > 0, then its characteristic
function $\phi(s)$ is analytic in the strip $-a < \Im s < a$.
Proof. The analytic extension is given explicitly:
$\phi(s) = E\exp(isX)$. It remains only to check that $\phi(s)$
is differentiable in the strip $-a < \Im s < a$. This follows either
by differentiation with respect to s under the expectation sign
(the latter is allowed, since $E\{|X|\exp(|sX|)\} < \infty$, provided
$-a < \Im s < a$), or by writing directly the series expansion:
$\phi(s) = \sum_{n=0}^\infty i^nEX^n s^n/n!$ (the last equality
follows by switching the order of integration and summation, i.e.
by Fubini's theorem). The series is easily seen to be absolutely
convergent for all $|s| \le a$; expanding in the same way around an arbitrary real point covers the rest of the strip.
□
Corollary 9
If X is such that
$E\exp(a|X|) < \infty$ for every real a > 0, then its
characteristic function $\phi(s)$ is analytic in $\mathbb{C}$.
The next result says that normal distribution is
determined uniquely by its moments.
For more information on the moment problem, the reader is referred to
the beautiful book by N. I. Akhiezer [].
Corollary 10
If X is a random variable
with finite moments of all orders and such that
$EX^k = EZ^k$, $k = 1, 2, \dots$, where Z is normal, then X is normal.
Proof. By the Taylor expansion
$$E\exp(a|X|) = \sum_k a^kE|X|^k/k! = E\exp(a|Z|) < \infty$$
for all real a > 0.
Therefore by Corollary 2.3 the characteristic function of X is
analytic in $\mathbb{C}$ and it is determined uniquely by its Taylor expansion
coefficients at 0. However, by Theorem 1.5(ii) the coefficients are
determined uniquely by the moments of X. Since those are the same as the
corresponding moments of the normal r. v. Z, both characteristic functions
are equal.
□
We shall also need the following refinement of Corollary 2.3.
Corollary 11
Let $\phi(t)$ be a characteristic
function, and suppose there is $\sigma^2 > 0$ and a sequence $\{t_k\}$
convergent to 0 such that $\phi(t_k) = \exp(-\sigma^2t_k^2)$ and
$t_k \ne 0$ for all k. Then $\phi(t) = \exp(-\sigma^2t^2)$ for
every $t \in \mathbb{R}$.
Proof. The idea of the proof is simply to calculate
all the derivatives at 0 of $\phi(t)$ along the sequence $\{t_k\}$.
Since the derivatives determine moments uniquely, by Corollary
2.3 we shall conclude that $\phi(t) = \exp(-\sigma^2t^2)$.
The only nuisance is to establish that all the moments of the
distribution are finite. This fact is established by modifying the usual proof
of Theorem 1.5(iii). Let $\Delta_t^2$ be a
symmetric second order difference operator, i.e.
$$\Delta_t^2(g)(y) := \frac{g(y+t) + g(y-t) - 2g(y)}{t^2}.$$
The assumption that $\phi(t)$ is differentiable 2n
times along the sequence $\{t_k\}$ implies that
$$\sup_k |\Delta_{t(k)}^{2n}(\phi)(0)| = \sup_k |\Delta_{t(k)}^2\Delta_{t(k)}^2\cdots\Delta_{t(k)}^2(\phi)(0)| < \infty.$$
Indeed, the assumption says that
$\lim_{k\to\infty}\Delta_{t(k)}^{2n}(\phi)(0)$ exists for all n.
Therefore to end the proof we need the following result.
Claim 1 If $\phi(t)$ is the characteristic function of
a random variable X, $t(k) \to 0$ is a given sequence such that
$t(k) \ne 0$ for all k and
$$\sup_k |\Delta_{t(k)}^{2n}(\phi)(0)| < \infty$$
for an integer n, then $EX^{2n} < \infty$.
The proof of the claim rests on the formula which
can be verified by elementary calculations:
$$\{\Delta_t^2\exp(iay)\}(y)\big|_{y=x} = -4t^{-2}\exp(iax)\sin^2(at/2).$$
This permits to express recurrently the higher order differences,
giving
$$\{\Delta_t^{2n}\exp(iay)\}(y)\big|_{y=x} = (-4)^nt^{-2n}\sin^{2n}(at/2)\exp(iax).$$
Therefore
$$|\Delta_{t(k)}^{2n}(\phi)(0)| = 4^nt(k)^{-2n}E\sin^{2n}(t(k)X/2) \ge 4^nt(k)^{-2n}E\,1_{|X| \le 2/|t(k)|}\sin^{2n}(t(k)X/2).$$
The graph of sin(x) shows that inequality
$|\sin(x)| \ge \frac{2}{\pi}|x|$ holds for all
$|x| \le \frac{\pi}{2}$. Therefore
$$|\Delta_{t(k)}^{2n}(\phi)(0)| \ge \Big(\frac{2}{\pi}\Big)^{2n}E\,1_{|X| \le 2/|t(k)|}X^{2n}.$$
By the monotone convergence theorem
$$EX^{2n} \le \limsup_{k\to\infty}E\,1_{|X| \le 2/|t(k)|}X^{2n} < \infty,$$
which ends the proof.
□
The next result is converse to Theorem 2.3.
Theorem 21
If the characteristic function
$f(t)$ of a random variable $X$ has an analytic extension in a
neighborhood of 0 in $\mathbb{C}$, and the extension is such that its
Taylor series at 0 has convergence radius
$R \le \infty$, then $E\exp(a|X|) < \infty$
for all $0 \le a < R$.
Proof. By assumption, $f(s)$ has derivatives of
all orders. Thus the moments of all orders are finite and
\[
m_k = EX^k = (-i)^k \frac{\partial^k}{\partial s^k} f(s)\Big|_{s=0},
\quad k \ge 1.
\]
Taylor's expansion of
$f(s)$ at $s = 0$ is given by
$f(s) = \sum_{k=0}^\infty i^k m_k s^k/k!$.
The series has convergence radius $R$ if and only if
$\limsup_{k\to\infty} (m_k/k!)^{1/k} = 1/R$.
This implies that for any $0 \le a < A < R$,
there is $k_0$ such that $m_k \le A^{-k}k!$
for all $k \ge k_0$.
Hence
$E\exp(a|X|) = \sum_{k=0}^\infty a^k m_k/k! < \infty$, which ends the proof of the theorem.
$\Box$
Theorems 2.3 and 2.3 combined
imply the following.
Corollary 12
If a characteristic function
$f(t)$ can be extended analytically to the disk $|s| < a$,
then it has the analytic extension $f(s) = E\exp(isX)$ to
the strip $-a < \Im s < a$.
2.4 Hermite expansions
A normal N(0,1) r. v. $Z$ defines a dot product
$\langle f, g\rangle = Ef(Z)g(Z)$, provided that $f(Z)$ and $g(Z)$ are
square integrable functions on $\Omega$. In particular, the dot product is well
defined for polynomials. One can apply the usual Gram-Schmidt
orthogonalization algorithm to the functions $1, Z, Z^2, \dots$. This produces
orthogonal polynomials in the variable $Z$ known as Hermite
polynomials. They play an important role and can be
equivalently defined by
\[
H_n(x) = (-1)^n \exp(x^2/2)\frac{d^n}{dx^n}\exp(-x^2/2).
\]
Hermite polynomials actually form an orthogonal basis of $L_2(Z)$. In
particular, every function $f$ such that $f(Z)$ is square integrable can be
expanded as $f(x) = \sum_{k=0}^\infty f_k H_k(x)$, where $f_k \in \mathbb{R}$ are the Fourier
coefficients of $f(\cdot)$; the convergence is in
$L_2(Z)$, i.e. in the weighted $L_2$ norm on the real line,
$L_2(\mathbb{R}, e^{-x^2/2}dx)$.
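The orthogonality of the Hermite polynomials can be verified exactly with integer arithmetic. The sketch below is an illustration under two standard facts that are not stated in the text above: the three-term recurrence $H_{k+1}(x) = xH_k(x) - kH_{k-1}(x)$, and the Gaussian moments $EZ^{2j} = (2j-1)!!$, $EZ^{2j+1} = 0$. The helper names are ours.

```python
from math import factorial


def hermite(n):
    """Coefficients (lowest degree first) of the Hermite polynomial H_n."""
    prev, cur = [0], [1]  # placeholder for H_{-1}, and H_0 = 1
    for k in range(n):
        nxt = [0] + cur  # multiply H_k by x
        for i, c in enumerate(prev):
            nxt[i] -= k * c  # subtract k * H_{k-1}
        prev, cur = cur, nxt
    return cur


def gaussian_moment(m):
    """E Z^m for a standard normal Z: 0 for odd m, (m-1)!! for even m."""
    if m % 2:
        return 0
    result = 1
    for j in range(1, m, 2):
        result *= j
    return result


def dot(p, q):
    """<p, q> = E p(Z) q(Z) for polynomials given by coefficient lists."""
    return sum(a * b * gaussian_moment(i + j)
               for i, a in enumerate(p) for j, b in enumerate(q))


for m in range(5):
    print([dot(hermite(m), hermite(n)) for n in range(5)])
```

The printed matrix is diagonal, with $n!$ in position $(n,n)$; thus the $H_n$ are orthogonal but not normalized.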
The following is the classical Mehler's formula.
Theorem 22
For a bivariate normal r. v. $X, Y$ with
$EX = EY = 0$, $EX^2 = EY^2 = 1$, $EXY = r$, the joint density $q(x,y)$
of $X, Y$ is given by
\[
q(x,y) = \sum_{k=0}^\infty \frac{r^k}{k!} H_k(x)H_k(y)\, q(x)\, q(y),
\qquad (34)
\]
where $q(x) = (2\pi)^{-1/2}\exp(-x^2/2)$ is the marginal
density.
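Proof.
By Fourier's inversion formula we have
\[
q(x,y) = \frac{1}{(2\pi)^2}\int\!\!\int \exp(itx+isy)
\exp(-\tfrac{1}{2}t^2-\tfrac{1}{2}s^2)\exp(-rts)\, dt\, ds.
\]
Since
$(-1)^k t^k s^k \exp(itx+isy) = \frac{\partial^{2k}}{\partial x^k \partial y^k}\exp(itx+isy)$,
expanding $e^{-rts}$ into the Taylor series we get
\[
q(x,y) = \sum_{k=0}^\infty \frac{r^k}{k!}
\frac{\partial^{2k}}{\partial x^k \partial y^k}\, q(x)\, q(y).
\]
Since $\frac{d^k}{dx^k}q(x) = (-1)^k H_k(x)q(x)$ by the definition of $H_k$,
this is (34).
$\Box$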
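Mehler's formula is easy to test numerically. The following sketch compares a truncated series with the closed-form standardized bivariate normal density $\frac{1}{2\pi\sqrt{1-r^2}}\exp\bigl(-\frac{x^2-2rxy+y^2}{2(1-r^2)}\bigr)$. The truncation at 60 terms, the test point, and the function names are our assumptions; the Hermite values are computed by the standard recurrence $H_{k+1}(x) = xH_k(x) - kH_{k-1}(x)$.

```python
from math import exp, factorial, pi, sqrt


def hermite_value(n, x):
    """Value H_n(x), computed by H_{k+1}(x) = x H_k(x) - k H_{k-1}(x)."""
    h_prev, h = 0.0, 1.0
    for k in range(n):
        h_prev, h = h, x * h - k * h_prev
    return h


def marginal_density(u):
    """Standard normal density q(u)."""
    return exp(-u * u / 2) / sqrt(2 * pi)


def mehler_series(x, y, r, terms=60):
    """Truncated right-hand side of Mehler's formula."""
    s = sum(r**k / factorial(k) * hermite_value(k, x) * hermite_value(k, y)
            for k in range(terms))
    return s * marginal_density(x) * marginal_density(y)


def bivariate_density(x, y, r):
    """Closed-form density of the standardized bivariate normal law."""
    d = 1 - r * r
    return exp(-(x * x - 2 * r * x * y + y * y) / (2 * d)) / (2 * pi * sqrt(d))


print(mehler_series(0.3, -0.7, 0.5), bivariate_density(0.3, -0.7, 0.5))
```

The series converges geometrically only for $|r| < 1$, so more terms are needed as $|r|$ approaches 1.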
2.5 Cramer and Marcinkiewicz theorems
The next lemma is a direct application of analytic
functions theory.
Lemma 8
If $X$ is a random variable
such that $E\exp(\lambda X^2) < \infty$ for some $\lambda > 0$, and
the analytic extension $f(z)$ of the characteristic
function of $X$ satisfies $f(z) \ne 0$ for all $z \in \mathbb{C}$,
then $X$ is normal.
Proof. By the assumption, $\log f(z)$ is
well defined and analytic for all $z \in \mathbb{C}$. Furthermore, if $z = x+iy$ is the decomposition of
$z \in \mathbb{C}$ into its real and imaginary parts, then
$\Re \log f(z) = \log|f(z)| \le \log(E\exp|yX|)$.
Notice that $E\exp(tX) \le C\exp(\frac{t^2}{2\lambda})$ for
all real $t$, see Problem 1.9. Indeed, since
$\lambda X^2+t^2/\lambda \ge 2tX$, we have
$E\exp(tX) \le E\exp((\lambda X^2+t^2/\lambda)/2) = C\exp(\frac{t^2}{2\lambda})$.
Those two facts together imply $\Re \log f(z) \le \mathrm{const}+\frac{y^2}{2\lambda}$. Therefore a
variant of the Liouville theorem []
implies that $\log f(z)$ is a quadratic polynomial in the variable
$z$, i.e. $\log f(z) = A+Bz+Cz^2$. It is easy to see that the
coefficients are $A = 0$, $B = iE\{X\}$, $C = -\mathrm{Var}(X)/2$, compare
Proposition 2.1.
$\Box$
From Lemma 2.5 we obtain quickly the following
important theorem, due to H. Cramer
[].
Theorem 23
If $X_1$ and $X_2$ are independent
random variables such that $X_1+X_2$ has a normal distribution,
then each of the variables $X_1$, $X_2$ is normal.
Theorem 2.5 is Cramer's celebrated decomposition theorem;
for extensions, see [].
Cramer's theorem
complements the Central Limit Theorem nicely in the following
sense. While the Central Limit Theorem
asserts that the distribution of the normalized sum of i. i. d. random variables
with finite variances is close to normal, Cramer's theorem says that it
cannot be exactly normal, except when we start with a normal sequence.
This resembles the propagation of chaos phenomenon, where one proves that a dynamical
system approaches chaotic behavior, but never reaches it except from
initially chaotic configurations.
We shall use Theorem 2.5 as a technical tool.
Proof of Theorem 2.5. Without loss of generality we may
assume $EX_1 = EX_2 = 0$. The proof of Theorem 1.6 (iii)
implies that $E\exp(aX_j^2) < \infty$, $j = 1, 2$. Therefore, by
Theorem 2.3, the corresponding characteristic functions
$f_1(\cdot)$, $f_2(\cdot)$ are analytic. By the
uniqueness of the analytic extension, $f_1(s)f_2(s) = \exp(-s^2/2)$
for all $s \in \mathbb{C}$. Thus $f_j(z) \ne 0$ for all
$z \in \mathbb{C}$, $j = 1, 2$, and by Lemma 2.5 both characteristic functions correspond to
normal distributions.
$\Box$
The next theorem is useful in recognizing the normal
distribution from what at first sight seems to be incomplete information about a
characteristic function. The result and the proof come
from Marcinkiewicz
[],
cf. [].
Theorem 24
Let $Q(t)$ be a
polynomial, and suppose that a characteristic
function $f$ has the representation
$f(t) = \exp Q(t)$ for all $t$ close enough to 0.
Then $Q$ is of degree at most 2 and $f$
corresponds to a normal distribution.
Proof. First note that the formula $f(s) = \exp Q(s)$,
$s \in \mathbb{C}$, defines the analytic extension of $f$. Thus,
by Corollary 2.3, $f(s) = E\exp(isX)$, $s \in \mathbb{C}$.
By Theorem 2.5, it suffices to show that
$f(s)f(-s)$ corresponds to the normal distribution.
Clearly $f(s)f(-s)$ also has the form $\exp(P(s))$, where
$P(s)$ is a polynomial that has only even terms,
i.e. $P(s) = \sum_{k=0}^n a_k s^{2k}$. Since
$f(s)f(-s) = |f(s)|^2$ is a real number for all real $s$, the
coefficients
$a_1, \dots, a_n$ of the polynomial $P(\cdot)$ are real.
Moreover, the $n$-th coefficient satisfies $a_n = -\gamma^2 < 0$,
as the inequality $|f(t)| \le 1$ holds for arbitrarily
large real $t$.
From now on we write $f$ for the symmetrized characteristic function
$\exp(P(\cdot))$ and $X$ for the corresponding symmetric random variable.
Taking $z = N\exp(i\pi/(2n))$, so that $z^{2n} = -N^{2n}$,
we obtain
\[
|f(z)| \ge \exp(N^{2n}(\gamma^2-\epsilon(N)))
\qquad (35)
\]
for large enough real $N$,
where $\epsilon(N)\to 0$ as $N\to\infty$.
On the other hand, using the explicit representation by
expected value, we get
\[
|f(z)| = |E\exp(izX)| \le E\exp(N\sin(\pi/(2n))X)
= f(-iN\sin(\pi/(2n))) = \exp(P(-iN\sin(\pi/(2n))))
\]
\[
\le \exp(N^{2n}\sin^{2n}(\pi/(2n))(\gamma^2+\epsilon(N))).
\]
As $N\to\infty$ the last inequality contradicts (35),
unless $\sin(\pi/(2n)) = 1$, i.e. unless $n = 1$. This means
that $P$ is of degree 2 and, since $P(0) = 0$, we have
$P(t) = -\gamma^2t^2$ for all $t$.
$\Box$
2.6 Large deviations
Formula (25) shows that a multivariate normal distribution is
uniquely determined by the vector m of expected values and the
covariance matrix C. However, to compute
probabilities of the events of interest might be quite difficult. As Theorem
2.2 shows, even writing explicitly the density is cumbersome in
higher dimensions as it requires inverting large matrices. Additional
difficulties arise in degenerate cases.
Here we shall present the logarithmic term in the asymptotic expansion for
$P(X \in nA)$ as $n\to\infty$. This is the so called large
deviation estimate; it becomes more accurate
for less likely events. The main feature is that it has a relatively simple
form and applies to all events. Higher order expansions are more accurate but
work for fairly regular sets $A \subset \mathbb{R}^d$ only.
Let us first define the conjugate ``norm'' to the RKHS
seminorm
$|||\cdot|||$ defined by (29):
\[
|||y|||_\star = \sup_{x \in \mathbb{R}^d,\ |||x||| = 1} x\cdot y.
\]
The conjugate norm has all the properties of a norm except that it can attain the value $\infty$. To see this, and also to have a more explicit expression,
decompose $\mathbb{R}^d$ into the orthogonal sum of the null
space of $A$ and the range of $A$:
$\mathbb{R}^d = N(A)\oplus\Re(A)$; here $A$ is the
symmetric matrix from (26). Since
$A:\mathbb{R}^d\to \Re(A)$ is onto, there is
a right-inverse $A^{-1}:\Re(A)\to \Re(A) \subset \mathbb{R}^d$.
For $y \in \Re(A)$ we have
\[
\sup_{\|Ax\| = 1} x\cdot y = \sup_{\|Ax\| = 1} x\cdot AA^{-1}y
= \sup_{\|Ax\| = 1} A^Tx\cdot A^{-1}y.
\qquad (36)
\]
Since $A$ is symmetric and
$A^{-1}y \in \Re(A)$,
for $y \in \Re(A)$ we have by (36)
\[
|||y|||_\star = \sup_{x \in \Re(A),\ \|x\| = 1} x\cdot A^{-1}y = \|A^{-1}y\|.
\]
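For $y \notin \Re(A)$ we write
$y = y_N+y_\Re$, where $0 \ne y_N \in N(A)$ and $y_\Re \in \Re(A)$.
Then we have $\sup_{\|Ax\| = 1}x\cdot y \ge \sup_{x \in N(A)}x\cdot y_N = \infty$. Since $C = A\times A$, we get
\[
|||y|||_\star = \begin{cases} \sqrt{y\cdot C^{-1}y} & \mbox{if } y \in \Re(A), \\ \infty & \mbox{if } y \notin \Re(A), \end{cases}
\qquad (37)
\]
where $C^{-1}$ is the right inverse of the covariance matrix $C$.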
In this notation, the multivariate normal density is
\[
f(x) = Ce^{-\frac{1}{2}|||x-m|||_\star^2},
\qquad (38)
\]
where $C$ is the normalizing constant and the integration has to be taken
over the Lebesgue measure $\lambda$ on the support
$\mathrm{supp}(X) = \{x: |||x|||_\star < \infty\}$.
To state the Large Deviation Principle, by $A^\circ$
we denote the interior of a Borel subset $A \subset \mathbb{R}^d$.
Theorem 25
If $X$ is Gaussian $\mathbb{R}^d$-valued with the mean $m$ and the covariance matrix $C$, then
for all measurable $A \subset \mathbb{R}^d$
\[
\limsup_{n\to\infty} \frac{1}{n^2}\log P(X \in nA)
\le -\inf_{x \in A} \frac{1}{2}|||x-m|||_\star^2
\qquad (39)
\]
and
\[
\liminf_{n\to\infty} \frac{1}{n^2}\log P(X \in nA)
\ge -\inf_{x \in A^\circ} \frac{1}{2}|||x-m|||_\star^2.
\qquad (40)
\]
The usual interpretation is that the dominant term in
the asymptotic expansion for
$P(\frac{1}{n}X \in A)$ as $n\to\infty$ is given by
\[
\exp\Bigl(-\frac{n^2}{2}\inf_{x \in A}|||x-m|||_\star^2\Bigr).
\]
Proof.
Clearly, passing to $X-m$ we can easily reduce the question to
the centered random vector $X$. Therefore we assume $m = 0$.
Inequality (39) follows immediately from
\[
P(X \in nA) = \int_{\mathrm{supp}(X)\cap A} C n^{k}e^{-\frac{n^2}{2}|||x|||_\star^2}\, dx
\le C n^{k}\lambda(\mathrm{supp}(X)\cap A)\sup_{x \in A} e^{-\frac{n^2}{2}|||x|||_\star^2},
\]
where $C = C(k)$ is the normalizing constant and $k \le d$
is the dimension of
$\mathrm{supp}(X)$, cf. (38); the factor $n^k$ comes from the change of
variables $x \mapsto nx$.
Indeed,
\[
\frac{1}{n^2}\log P(X \in nA) \le \frac{\log C}{n^2} + k\frac{\log n}{n^2}
+ \frac{\log\lambda(\mathrm{supp}(X)\cap A)}{n^2}
- \frac{1}{2}\inf_{x \in A}|||x|||_\star^2.
\]
To prove inequality (40), without loss of generality
we restrict our attention to open sets $A$. Let $x_0 \in A$.
Then for all $\epsilon > 0$ small enough, the balls
$B(x_0,\epsilon) = \{x: \|x-x_0\| < \epsilon\}$ are in $A$. Therefore
\[
P(X \in nA) \ge P(X \in nD_\epsilon)
= \int_{D_\epsilon} C n^{k}e^{-\frac{n^2}{2}|||x|||_\star^2}\, dx,
\qquad (41)
\]
where
$D_\epsilon = B(x_0,\epsilon)\cap\mathrm{supp}(X)$.
On the support $\mathrm{supp}(X)$ the function
$x\mapsto |||x|||_\star$ is finite and convex;
thus it is continuous. For every $\eta > 0$ one can find $\epsilon$
such that
$|||x|||_\star^2 \le |||x_0|||_\star^2+\eta$ for all $x \in D_\epsilon$.
Therefore (41) gives
\[
P(X \in nA) \ge C n^{k}\lambda(D_\epsilon)e^{-\frac{n^2}{2}(|||x_0|||_\star^2+\eta)},
\]
which after passing to the logarithms ends the proof.
$\Box$
Large deviation bounds for Gaussian vectors with values in infinite dimensional
spaces and for Gaussian stochastic processes have a similar form and involve
the conjugate RKHS norm; needless to say, the proof that uses the density
cannot go through; for the general theory of large deviations
the reader is referred to [].
2.6.1 A numerical example
Consider a bivariate normal $(X,Y)$ with the covariance matrix
\[
C = \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}.
\]
The squared conjugate RKHS norm is then
$|||(x,y)|||_\star^2 = 2x^2-2xy+y^2$ and the corresponding unit ball is bounded by the ellipse
$2x^2-2xy+y^2 = 1$.
Figure 2.1 illustrates the fact that one can actually see
the conjugate RKHS norm.
Asymptotic shapes in more complicated systems are more mysterious, see
[].
Figure 2.1 (picture omitted):
A sample of N = 1500 points from bivariate normal
distribution.
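The quadratic form in this example can be recomputed from a covariance matrix. The sketch below assumes the covariance matrix $\begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$, which is the matrix whose inverse yields $2x^2-2xy+y^2$; both this matrix and the helper names are our inference for illustration, not a quotation from the text.

```python
def inverse_2x2(m):
    """Inverse of a non-singular 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def conjugate_norm_sq(cov, v):
    """Squared conjugate RKHS norm v . C^{-1} v for a non-degenerate covariance C."""
    inv = inverse_2x2(cov)
    x, y = v
    return x * (inv[0][0] * x + inv[0][1] * y) + y * (inv[1][0] * x + inv[1][1] * y)


cov = [[1.0, 1.0], [1.0, 2.0]]  # assumed covariance matrix
print(inverse_2x2(cov))
for x, y in [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-0.5, 2.0)]:
    print((x, y), conjugate_norm_sq(cov, (x, y)), 2 * x * x - 2 * x * y + y * y)
```

Points with squared norm equal to 1 trace the ellipse that bounds the unit ball.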
2.7 Problems
Problem 23
If $Z$ is the standard normal N(0,1) random variable,
show by direct integration that its characteristic function is
$f(z) = \exp(-\frac{1}{2}z^2)$
for all complex $z \in \mathbb{C}$.
Problem 24
Suppose $(X, Y) \in \mathbb{R}^{d_1+d_2}$ are jointly normal and
have pairwise uncorrelated components, $\mathrm{corr}(X_i, Y_j) = 0$. Show that
$X$, $Y$ are independent.
Problem 25
For standardized bivariate normal $X, Y$ with correlation coefficient $r$,
show that $P(X > 0, Y > 0) = \frac{1}{4}+\frac{1}{2\pi}\arcsin r$.
Problem 26
Prove Theorem 2.2.
Problem 27
Prove that the ``moments'' $m_k = E\{X^k\exp(-X^2)\}$
are finite and determine the distribution of $X$ uniquely.
Problem 28
Show that the exponential distribution is
determined uniquely by its moments.
Problem 29
If $f(s)$ is an analytic characteristic
function, show that $\log f(ix)$ is a well defined
convex function of the real argument $x$.
Problem 30 [deterministic analogue of
Theorem 2.5]
Suppose $f_1$, $f_2$ are characteristic
functions such that $f_1(t)f_2(t) = \exp(it)$ for
each $t \in \mathbb{R}$. Show that $f_k(t) = \exp(ita_k)$, $k = 1, 2$,
where $a_1, a_2 \in \mathbb{R}$.
Problem 31 [exponential analogue of
Theorem 2.5]
If $X, Y$ are i. i. d. random variables
such that $\min\{X, Y\}$ has an exponential distribution, then
$X$ is exponential.
Chapter 3 Equidistributed linear forms
In Section 1.1 we present the classical
characterization of the normal distribution by stability.
Then we use this to define Gaussian measures
on abstract spaces and we prove the zero-one law.
In Section we return to the characterizations
of normal distributions. We consider a more difficult
problem of characterizations by the equality of distributions
of two general linear forms.
3.1 Two-stability
The main result of this section is the theorem
due to G. Polya []. Polya's result was obtained before
the axiomatization of probability theory. It was stated in terms
of positive integrable functions and part of the conclusion was that the
integrals of those functions are one, so that
indeed the probabilistic interpretation is valid.
Theorem 26
If $X_1, X_2$ are two i. i. d. random variables such
that $X_1$ and $(X_1+X_2)/\sqrt{2}$ have the same
distribution, then $X_1$ is normal.
It is easy to see that if $X_1$ and $X_2$ are i. i. d.
random variables with the distribution corresponding to
the characteristic function $\exp(-|t|^p)$, then the
distributions of $X_1$ and $(X_1+X_2)/\sqrt[p]{2}$ are
equal. In particular, if $X_1, X_2$ are normal N(0,1), then
so is $(X_1+X_2)/\sqrt{2}$. Theorem 3.1 says that the above
trivial implication can be inverted for $p = 2$. Corresponding results are
also known for $p < 2$, but in general there is no uniqueness, see
[,,].
For $p \ne 2$ it is not obvious whether
$\exp(-|t|^p)$ is indeed a characteristic function;
in fact this is true only if $0 \le p \le 2$; the
easier part of this statement was given as
Problem 1.9. The distributions with this characteristic
function are the so called (symmetric) stable distributions.
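The stability property is a one-line identity for characteristic functions: the characteristic function of $(X_1+X_2)/2^{1/p}$ is the square of the characteristic function of $X_1$ evaluated at $t/2^{1/p}$. The following sketch checks it for a few exponents; the function names and test values are ours.

```python
from math import exp


def stable_cf(t, p):
    """The function exp(-|t|^p); a characteristic function for 0 < p <= 2."""
    return exp(-abs(t) ** p)


def cf_of_scaled_sum(t, p):
    """Characteristic function of (X1 + X2) / 2**(1/p) for i.i.d. X1, X2."""
    return stable_cf(t / 2 ** (1.0 / p), p) ** 2


for p in (0.5, 1.0, 1.5, 2.0):
    print(p, [abs(cf_of_scaled_sum(t, p) - stable_cf(t, p)) for t in (-2.0, 0.3, 1.7)])
```

The check is purely algebraic; it says nothing about whether $\exp(-|t|^p)$ is a characteristic function for a given $p$.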
The following corollary shows that $p$-stable
distributions with $p < 2$ cannot have finite second moments.
Corollary 13
Suppose $X_1, X_2$ are i. i. d. random variables with finite second
moments and such that for some scale factor $k$ and some location
parameter $a$ the distribution of $X_1+X_2$ is the same as the
distribution of $k(X_1+a)$. Then $X_1$ is normal.
Indeed, subtracting the expected value if necessary, we may
assume $EX_1 = 0$ and hence $a = 0$. Then
$\mathrm{Var}(X_1+X_2) = \mathrm{Var}(X_1)+\mathrm{Var}(X_2)$ gives $k = 2^{1/2}$ (except if $X_1 = 0$;
but this by definition is normal, so there is nothing to prove).
By Theorem 3.1, $X_1$ (and also $X_2$) is normal.
Proof of Theorem 3.1.
Clearly the assumption of Theorem 3.1 is not changed
if we pass to the symmetrizations $\tilde{X}$, $\tilde{Y}$ of $X$, $Y$.
By Theorem 2.5, to prove the theorem it remains to show
that $\tilde{X}$ is normal. Let $f(t)$ be the characteristic
function of $\tilde{X}$, $\tilde{Y}$. Then
\[
f(t) = f(t/\sqrt{2})^2
\qquad (42)
\]
for all real $t$. Therefore recurrently we get
\[
f(t) = f(t2^{-k/2})^{2^k}
\qquad (43)
\]
for all real $t$.
Take $t_0$ such that $f(t_0) \ne 0$; such $t_0$
can be found as $f$ is continuous and $f(0) = 1$. Since $f$ is real and,
by (42), non-negative, we may choose
$\sigma^2 > 0$ such that
$f(t_0) = \exp(-\sigma^2t_0^2)$. Then (43)
implies $f(t_02^{-k/2}) = \exp(-\sigma^2t_0^22^{-k})$ for
all $k = 0, 1, \dots$. By Corollary 2.3 we have
$f(t) = \exp(-\sigma^2t^2)$ for all $t$, and the
theorem is proved.
$\Box$
3.2 Measures on linear spaces
Let $\mathsf{V}$ be a linear space over the field $\mathbb{R}$ of real numbers
(we shall also call $\mathsf{V}$ a (real) vector space). Suppose $\mathsf{V}$ is
equipped with a $\sigma$-field $\mathcal{F}$ such that the
algebraic operations
of scalar multiplication $(x, t)\to tx$ and of
vector addition $(x, y)\to x+y$ are measurable transformations
$\mathsf{V}\times\mathbb{R}\to \mathsf{V}$ and $\mathsf{V}\times \mathsf{V}\to \mathsf{V}$ with respect
to the corresponding $\sigma$-fields $\mathcal{F}\otimes \mathcal{B}_{\mathbb{R}}$ and
$\mathcal{F}\otimes \mathcal{F}$ respectively. Let $(\Omega, \mathcal{M}, P)$
be a probability space. A measurable function
$X: \Omega\to \mathsf{V}$ is called a $\mathsf{V}$-valued random variable.
Example 4
Let $\mathsf{V} = \mathbb{R}^d$ be the vector space of all real $d$-tuples with the
usual Borel $\sigma$-field $\mathcal{B}$. A $\mathsf{V}$-valued random
variable is called a $d$-dimensional random vector. Clearly
$X = (X_1, \dots, X_d)$ and if one prefers, one can consider
the family $X_1, \dots, X_d$ rather than $X$.
Example 5
Let $\mathsf{V} = C[0, 1]$ be the vector space of all continuous
functions $[0, 1] \to \mathbb{R}$ with the topology defined by
the norm $\|f\| := \sup_{0 \le t \le 1}|f(t)|$ and
with the $\sigma$-field $\mathcal{F}$ generated by all open sets.
Then a $\mathsf{V}$-valued random variable $X$ is called a
stochastic process with continuous trajectories with time $T = [0,1]$. The usual
form is to write $X(t)$ for the random continuous
function $X$ evaluated at a point $t \in [0,1]$.
Warning. Although it is known that every abstract
random vector can be interpreted as a random process with
the appropriate choice of time set $T$, the natural choice
of $T$ (such as $T = 1, 2, \dots, d$ in Example 3.2
and $T = [0, 1]$ in Example 3.2) might sometimes fail.
For instance, let $\mathsf{V} = L_2[0, 1]$ be the vector space of all (equivalence
classes of) square integrable functions $[0, 1] \to \mathbb{R}$ with
the usual $L_2$ norm $\|f\| = (\int f^2(t)\, dt)^{1/2}$. In general, a
$\mathsf{V}$-valued random variable $X$ cannot be represented as a stochastic
process with time $T = [0, 1]$, because evaluation at a point $t \in T$ is
not a well defined mapping. Although
$L_2[0, 1]$ is commonly thought of as the space of square integrable functions,
we are actually dealing with equivalence classes rather than with
genuine functions.
For $\mathsf{V} = L_2[0, 1]$-valued Gaussian processes, one can show that $X_t$
exists almost surely as the limit in probability of continuous linear
functionals; abstract variants of this result can be found in
[] and in the references therein.
The following definition of an abstract Gaussian random
variable is motivated by Theorem 3.1.
Definition 7
A $\mathsf{V}$-valued random
variable $X$ is E-Gaussian (E stands for the
equality of distributions) if the distribution of $\sqrt{2}X$
is equal to the distribution of $X+X'$, where $X'$ is
an independent copy of $X$.
In Sections and we shall see that there are
other equally natural candidates for the definitions of a Gaussian vector.
To distinguish between them, we shall keep the longer name
E-Gaussian instead of just calling it Gaussian.
Fortunately, at least in familiar situations,
it does not matter which definition we use. This occurs whenever we
have plenty of measurable linear functionals. By
Theorem 3.1, if $L: \mathsf{V}\to\mathbb{R}$ is a measurable
linear functional, then the $\mathbb{R}$-valued random variable
$L(X)$ is normal. When this specifies the probability measure on
$\mathsf{V}$ uniquely, then all three definitions are equivalent. Let us see how this
works in two simple but important cases.
Example 3.2 (continued) Suppose $X = (X(1), X(2), \dots, X(n))$
is an $\mathbb{R}^n$-valued E-Gaussian random variable.
Consider linear functionals $L: \mathbb{R}^n\to\mathbb{R}$ given by
$x\mapsto \sum a_ix_i$, where
$a_1, a_2, \dots, a_n \in \mathbb{R}$. Then the one-dimensional random
variable $a_1X(1)+a_2X(2)+\dots+a_nX(n)$ has the normal
distribution. This means that $X$ is a Gaussian vector in the
usual sense (i.e. it has a multivariate normal distribution), as presented in
Section 2.2.
Example 3.2 (continued) Suppose $X$ is a $C[0, 1]$-valued
Gaussian random variable. Consider the set of all linear functionals
$L: C[0, 1]\to\mathbb{R}$ that can be written in the form
\[
L = a_1E_{t(1)}+a_2E_{t(2)}+\dots+a_nE_{t(n)},
\]
where $a_1, \dots, a_n$ are real numbers and
$E_t: C[0, 1]\to\mathbb{R}$ denotes the evaluation at point $t$
defined by $E_t(f) = f(t)$. Then
$L(X) = \sum a_iX(t_i)$ is normal. However, since the coefficients
$a_1, \dots, a_n$ are arbitrary, this means that for each
choice of $t_1, t_2, \dots, t_n \in [0, 1]$ the $n$-dimensional
random variable $X(t_1), X(t_2), \dots, X(t_n)$ has a multivariate
normal distribution, i.e. $X(t)$ is a Gaussian stochastic process
in the usual sense.
The question that we want to address now is motivated by the following
(false) intuition. Suppose a measurable linear subspace
$\mathbb{L} \subset \mathsf{V}$ is given. Think for instance about
$\mathbb{L} = C_1[0, 1]$, the space of all continuously differentiable
functions, considered as a subspace of $C[0, 1] = \mathsf{V}$.
In general, it seems plausible that some of the realizations
of a $\mathsf{V}$-valued random variable $X$ may happen to fall
in $\mathbb{L}$, while other realizations fail to be in $\mathbb{L}$.
In other words, it seems plausible that with positive
probability some of the trajectories of a stochastic
process with continuous trajectories are smooth, while
other trajectories are not. Strangely, this cannot happen
for Gaussian vectors (and, more generally, for $\alpha$-stable vectors).
The result is due to Dudley and Kanter and provides an example of
the so called zero-one law. The most famous zero-one
law is of course the one due to Kolmogorov, see e.g.
[]; see also the
appendix to []. The proof
given below follows []. Smoleński
[] gives an elementary proof, which
applies also to other classes of measures. Krakowiak
[] proves the zero-one law when
$\mathbb{L}$ is a measurable sub-group rather than a measurable linear subspace.
Tortrat [] considers (among other
issues) zero-one laws for Gaussian distributions on groups.
Theorem and
Theorem in the next chapter give the same conclusion under
different definitions of the Gaussian random vector.
Theorem 27
If $X$ is a
$\mathsf{V}$-valued E-Gaussian random variable and $\mathbb{L}$ is a
linear measurable subspace of $\mathsf{V}$, then $P(X \in \mathbb{L})$ is
either 0 or 1.
Proof. Let $X_1, X_2, \dots$ be independent copies of
$X$. Also, let us choose them to be independent of $X$. By
2-stability and the linearity of $\mathbb{L}$ we have
\[
P(X_1+X_2 \in \mathbb{L}) = P(\sqrt{2}X \in \mathbb{L}) = P(X \in \mathbb{L}).
\qquad (44)
\]
By induction, this gives
\[
P(X_1+X_2+\dots+X_{2^n} \in \mathbb{L}) = P(X \in \mathbb{L})
\qquad (45)
\]
for all $n = 0, 1, \dots$.
Let Z = X1+X2. Clearly, Z is independent of X and
2-stability implies that
X1+X2+¼+X2n+1 has the same distribution as
Z+2n/2X. Therefore (45) gives
\[
P(Z+2^{n/2}X \in \mathbb{L}) = P(X \in \mathbb{L}).
\qquad (46)
\]
Consider now the events
$A_n = \{Z \notin \mathbb{L}\}\cap\{Z+2^{n/2}X \in \mathbb{L}\}$.
Since the event $\{Z \in \mathbb{L}\}\cap\{Z+2^{n/2}X \in \mathbb{L}\}$
is the same as $\{Z \in \mathbb{L}\}\cap\{X \in \mathbb{L}\}$,
by (46) and the independence of $Z$ and $X$
\[
P(A_n) = P(Z+2^{n/2}X \in \mathbb{L})-P(Z \in \mathbb{L})P(X \in \mathbb{L})
= P(X \in \mathbb{L})-P(Z \in \mathbb{L})P(X \in \mathbb{L}).
\]
By (44) this says that
$P(A_n) = P(X \in \mathbb{L})P(X \notin \mathbb{L})$ does not depend on $n$.
Now let us observe that if $m \ne n$, then the events $A_m$ and $A_n$
are disjoint. We shall prove this by contradiction. Suppose both vectors
$Z+2^{n/2}X \in \mathbb{L}$ and $Z+2^{m/2}X \in \mathbb{L}$.
Then their difference $(2^{n/2}-2^{m/2})X$ is in $\mathbb{L}$, too.
For $m \ne n$ this implies $X \in \mathbb{L}$ and therefore
$Z \in \mathbb{L}$. The latter contradicts the definition of
$A_n$, proving that $A_m$ and $A_n$ are indeed disjoint.
The preceding two observations show that $\{A_n\}$ is an infinite sequence of
disjoint events with the same probability $P(A_n) = P(A_1)$.
This can happen only if $P(A_n) = 0$, i.e. when
$P(X \in \mathbb{L})P(X \notin \mathbb{L}) = 0$, which ends the proof.
$\Box$
To make Theorem 3.2 more concrete, consider the following
application.
Example 6
This example presents a simple-minded model of transmission of information.
Suppose that we have a choice of one of two signals, $f(t)$ or $g(t)$,
to be transmitted by a noisy channel within the unit time interval $0 \le t \le 1$.
To simplify the situation even further, we assume $g(t) = 0$, i.e. $g$
represents ``no message sent''. The noise (which is always present) is a random
and continuous function; we shall assume that it is
represented by a $C[0, 1]$-valued Gaussian random variable
$W = \{W(t)\}_{0 \le t \le 1}$. We also assume it is an ``additive'' noise.
Under these circumstances the signal received is given by a curve; it is either
$\{f(t)+W(t)\}_{0 \le t \le 1}$, or
$\{W(t)\}_{0 \le t \le 1}$,
depending on which of the two signals, $f$ or $g$, was sent.
The objective is to use the received signal to decide which
of the two possible messages, $f(\cdot)$ or 0 (i.e. message, or no
message), was sent.
Notice that, at least from the mathematical
point of view, the task is trivial if $f(\cdot)$ is known to
be discontinuous; then we only need to observe the trajectory of
the received signal and check for discontinuities.
There are of course numerous practical
obstacles to collecting continuous data, which we are not going to discuss here.
If $f(\cdot)$ is
continuous, then the above procedure does not apply.
The problem requires a more detailed analysis in this case.
One may adopt the usual approach of testing the null hypothesis
that no signal was sent. This amounts to choosing a suitable critical
region
$\mathbb{L} \subset C[0, 1]$. As usual in statistics, the decision is
to be made according to whether the observed trajectory falls into $\mathbb{L}$
(in which case we decide $f(\cdot)$ was sent) or not
(in which case we decide that 0 was sent and that what we have
received was just the noise). Clearly, to get a sensible test
we need $P(f(\cdot)+W(\cdot) \in \mathbb{L}) > 0$ and
$P(W(\cdot) \in \mathbb{L}) < 1$.
Theorem 3.2 implies that perfect
discrimination is achieved if we manage to pick the critical region
in the form of a (measurable) linear subspace. Indeed, then by Theorem
3.2 $P(W(\cdot) \in \mathbb{L}) < 1$ implies
$P(W(\cdot) \in \mathbb{L}) = 0$ and $P(f(\cdot)+W(\cdot) \in \mathbb{L}) > 0$ implies
$P(f(\cdot)+W(\cdot) \in \mathbb{L}) = 1$.
Unfortunately, it is not true that a linear space can always be
chosen for the critical region. For instance, if $W(\cdot)$ is
the Wiener process (see Section ), it is known that
such a subspace cannot be found if (and only if!) $f(\cdot)$ is
differentiable for almost all $t$ and
$\int (\frac{df}{dt})^2\, dt < \infty$. The proof of this
theorem is beyond the scope of this book (cf. Cameron-Martin
formula in []). The result, however, is
surprising
(at least for those readers who know that trajectories of the
Wiener process are non-differentiable): it implies that, at
least in principle, each (everywhere) non-differentiable signal
$f(\cdot)$ can be recognized without errors despite having
non-differentiable Wiener noise.
(Affine subspaces for centered noise $EW_t = 0$ do not work, see Problem
)
For a recent work, see [].
3.3 Linear forms
It is easily seen that if $a_1, \dots, a_n$ and
$b_1, \dots, b_n$ are real numbers such that the sets
$A = \{|a_1|, \dots, |a_n|\}$ and $B = \{|b_1|, \dots, |b_n|\}$ are
equal, then for any symmetric i. i. d. random variables
$X_1, \dots, X_n$ the sums $\sum_{k=1}^n a_kX_k$ and
$\sum_{k=1}^n b_kX_k$ have the same distribution.
On the other hand, when $n = 2$, $A = \{1, 1\}$ and
$B = \{0, \sqrt{2}\}$, Theorem 3.1 says that
the equality of distributions
of linear forms $\sum_{k=1}^n a_kX_k$ and
$\sum_{k=1}^n b_kX_k$ implies normality.
In this section we shall consider two more characterizations
of the normal distribution by the equality of distributions
of linear combinations $\sum_{k=1}^n a_kX_k$ and
$\sum_{k=1}^n b_kX_k$. The results are considerably less
elementary than Theorem 3.1.
We shall begin with the following generalization of
Corollary 3.1, which we learned from J. Wesołowski.
Theorem 28
Let $X_1, \dots, X_n$, $n \ge 2$,
be i. i. d. square-integrable random variables
and let $A = \{a_1, \dots, a_n\}$ be the set of real numbers
such that $A \ne \{1,0,\dots,0\}$. If $X_1$ and
$\sum_{k=1}^n a_kX_k$ have equal distributions, then $X_1$ is normal.
The next lemma is a variant of the result due
to C. R. Rao, see [].
Lemma 9
Suppose $q(\cdot)$ is continuous in a
neighborhood of 0, $q(0) = 0$, and in a neighborhood of 0 it
satisfies the equation
\[
q(t) = \sum_{k=1}^n a_k^2 q(a_kt),
\qquad (47)
\]
where
$a_1, \dots, a_n$ are given numbers such that $|a_k| \le \delta < 1$ and
$\sum_{k=1}^n a_k^2 = 1$.
Then $q(t) = \mathrm{const}$ in some neighborhood of $t = 0$.
Proof. Suppose (47) holds for all $|t| < \epsilon$.
Then $|a_jt| < \epsilon$ and from (47) we get
$q(a_jt) = \sum_{k=1}^n a_k^2q(a_ja_kt)$ for every $1 \le j \le n$.
Hence $q(t) = \sum_{j=1}^n \sum_{k=1}^n a_j^2a_k^2q(a_ja_kt)$
and we get recurrently
\[
q(t) = \sum_{j_1=1}^n \cdots \sum_{j_r=1}^n
a_{j_1}^2\cdots a_{j_r}^2\, q(a_{j_1}\cdots a_{j_r}t)
\]
for all $r \ge 1$.
This implies
\[
|q(t)-q(0)| \le \Bigl(\sum_{k=1}^n a_k^2\Bigr)^r \sup_{|a| \le \delta^r}|q(at)-q(0)|
= \sup_{|x| \le \delta^r}|q(x)-q(0)|\to 0
\]
as $r\to\infty$ for all $|t| < \epsilon$.
$\Box$
Proof of Theorem 3.3. Without loss of generality
we may assume $\mathrm{Var}(X_1) \ne 0$.
Let $f$ be the characteristic function of $X_1$ and let
$Q(t) = \log f(t)$. Clearly, $Q(t)$ is well defined for
all $t$ close enough to 0. Equality of distributions gives
\[
Q(t) = Q(a_1t)+Q(a_2t)+\dots+Q(a_nt).
\]
The integrability assumption implies that $Q$ has two
derivatives, and for all $t$ close enough to 0 the derivative
$q(\cdot) = Q''(\cdot)$ satisfies equation (47).
Since $X_1$ and $\sum_{k=1}^n a_kX_k$ have equal variances,
$\sum_{k=1}^n a_k^2 = 1$. Condition $|a_i| \ne 0, 1$ implies
$|a_i| < 1$ for all $1 \le i \le n$.
Lemma 3.3 shows that $q(\cdot)$ is constant in a neighborhood
of $t = 0$ and ends the proof.
$\Box$
Comparing Theorems 3.1 and 3.3 the
pattern seems to be that the less information about coefficients,
the more information about the moments is needed. The next
result ([]) fits into this
pattern, too;
[] present the
general theory of active exponents which permits to recognize
(by examining the coefficients of linear forms), when the equality
of distributions of linear forms implies normality; see also
[]. Variants of characterizations by equality of
distributions are known for group-valued random variables,
see []; [] is also pertinent.
Theorem 29
Suppose
$A = \{|a_1|, \dots, |a_n|\}$ and
$B = \{|b_1|, \dots, |b_n|\}$ are different sets of real numbers and
$X_1, \dots, X_n$ are i. i. d. random variables with finite moments
of all orders. If the linear forms $\sum_{k=1}^n a_kX_k$ and
$\sum_{k=1}^n b_kX_k$ are identically distributed, then
$X_1$ is normal.
We shall need the following elementary lemma.
Lemma 10
Suppose
$A = \{|a_1|, \dots, |a_n|\}$ and
$B = \{|b_1|, \dots, |b_n|\}$ are different sets of real numbers.
Then
\[
\sum_{k=1}^n a_k^{2r} \ne \sum_{k=1}^n b_k^{2r}
\qquad (48)
\]
for all $r \ge 1$ large enough.
Proof.
Without loss of generality we may assume that the coefficients
are arranged in increasing order
$|a_1| \le \dots \le |a_n|$ and
$|b_1| \le \dots \le |b_n|$.
Let $M$ be the largest number $m \le n$ such that
$|a_m| \ne |b_m|$.
(Clearly, at least one such $m$ exists, because the sets $A$, $B$
consist of different numbers.) Then $|a_k| = |b_k|$ for $k > M$ and
$\sum_{k=1}^n a_k^{2r} \ne \sum_{k=1}^n b_k^{2r}$ for all $r$
large enough. Indeed, by the definition of $M$ we have
$\sum_{k > M}b_k^{2r} = \sum_{k > M}a_k^{2r}$ but the remaining portions of the sum
are not equal,
$\sum_{k \le M}b_k^{2r} \ne \sum_{k \le M}a_k^{2r}$ for $r$ large enough;
the latter holds true because by our choice of $M$
the limits
$\lim_{r\to\infty} (\sum_{k \le M}a_k^{2r})^{1/(2r)} = \max_{k \le M}|a_k| = |a_M|$ and
$\lim_{r\to\infty} (\sum_{k \le M}b_k^{2r})^{1/(2r)} = \max_{k \le M}|b_k| = |b_M|$ are not
equal. $\Box$
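A small numerical illustration shows why the lemma needs "$r$ large enough". In the sketch below the coefficient sets $(3, 4)$ and $(0, 5)$ are our own choice of example: their sums of squares coincide, yet all higher even power sums differ.

```python
def power_sum(coeffs, r):
    """Sum of |c|^(2r) over the coefficients."""
    return sum(abs(c) ** (2 * r) for c in coeffs)


a = (3, 4)
b = (0, 5)
for r in range(1, 6):
    print(r, power_sum(a, r), power_sum(b, r), power_sum(a, r) != power_sum(b, r))
```

For $r = 1$ both sums equal 25, so (48) fails; from $r = 2$ on the larger maximal coefficient, $|b_2| = 5$, dominates.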
We also need the following
lemma due to
Marcinkiewicz
[].
Lemma 11
Let $f$ be an infinitely differentiable
characteristic function and let $Q(t) = \log f(t)$. If there
is $r \ge 1$ such that $Q^{(k)}(0) = 0$ for all $k \ge r$, then
$f$ is the characteristic function of a normal distribution.
Proof. Indeed,
$F(z) = \exp(\sum_{k=0}^r \frac{z^k}{k!}Q^{(k)}(0))$ is an analytic function and all derivatives at 0 of the
functions $\log F(\cdot)$ and $\log f(\cdot)$ are equal.
Differentiating the (trivial) equality $fQ' = f'$, we get
$f^{(n+1)} = \sum_{k=0}^n \binom{n}{k} f^{(n-k)}Q^{(k+1)}$, which shows
that all derivatives at 0 of $F(\cdot)$ and of $f(\cdot)$
are equal. This means that $f(\cdot)$ is analytic in some
neighborhood of 0 and $f(t) = F(t) = \exp P(t)$ for all small
enough $t$, where $P$ is a polynomial of degree (at most) $r$.
Hence by Theorem 2.5, $f$ is normal.
$\Box$
Proof of Theorem 3.3.
Without loss of generality, we may assume that
$X_1$ is symmetric. Indeed, if random variables
$X_1, \dots, X_n$ satisfy the assumptions of the theorem,
then so do their symmetrizations $\tilde{X}_1, \dots, \tilde{X}_n$,
see Section 1.6. If we could prove the theorem
for symmetric random variables, then $\tilde{X}_1$ would
be normal. By Theorem 2.5, this would imply that
$X_1$ is normal. Hence it suffices to prove the theorem
under the additional symmetry assumption.
Let $f$ be the characteristic function of the $X$'s and let
$Q(t) = \log f(t)$; $Q$ is well defined for all $t$ close
enough to 0. The assumption implies that $Q$ has derivatives
of all orders and also that
$Q(a_1t)+Q(a_2t)+\dots+Q(a_nt) = Q(b_1t)+Q(b_2t)+\dots+Q(b_nt)$.
Differentiating the last equality $2r$ times at $t = 0$ we obtain
\[
\sum_{k=1}^n a_k^{2r}\, Q^{(2r)}(0) = \sum_{k=1}^n b_k^{2r}\, Q^{(2r)}(0),
\quad r = 0, 1, \dots
\qquad (49)
\]
Notice that by (48), equality (49)
implies $Q^{(2r)}(0) = 0$ for
all $r$ large enough. Thus by (49) (and by the symmetry
assumption to handle the derivatives of odd order), $Q^{(k)}(0) = 0$ for all
$k \ge 1$ large enough. Lemma 3.3 ends the proof.
$\Box$
3.4 Exponential analogy
Characterizations of the normal distribution frequently lead
to analogous characterizations of the exponential distribution.
The idea behind this correspondence is that adding random variables
is replaced by taking their minimum. This is explained by the
well known fact that the minimum of independent exponential
random variables is exponentially distributed; the observation
is due to Linnik [], see
[]. Monographs
[,]
present such results as well as the characterizations of the
exponential distribution by its intrinsic properties, such as lack
of memory. In this book some of the exponential analogues serve as
exercises.
The following result, written in the form analogous to
Theorem *, illustrates how the exponential analogy works.
The i. i. d. assumption can easily be weakened to independence
of X and Y (the details of this modification are left to
the reader as an exercise).
Theorem 30
Suppose $X, Y$ are non-negative
random variables such that
(i) for all $a, b > 0$ such that $a + b = 1$, the random variable
$\min\{X/a, Y/b\}$ has the same distribution as $X$;
(ii) $X$ and $Y$ are independent and identically distributed.
Then $X$ and $Y$ are exponential.
Proof. The following simple observation lies behind
the proof.
If $X, Y$ are independent non-negative random variables, then
the tail distribution function, defined for any $Z \ge 0$
by $N_Z(x) = P(Z \ge x)$, satisfies
\[ N_{\min\{X, Y\}}(x) = N_X(x) N_Y(x). \tag{50} \]
Using (50) and the assumption we obtain $N(at)N(bt) = N(t)$
for all $a, b, t > 0$ such that $a + b = 1$. Writing
$t = x + y$, $a = x/(x+y)$, $b = y/(x+y)$ for arbitrary $x, y > 0$ we get
\[ N(x + y) = N(x) N(y). \tag{51} \]
Therefore to prove the theorem, we need only to solve functional
equation (51) for the unknown function $N(\cdot)$
such that $0 \le N(\cdot) \le 1$; $N(\cdot)$ is also
right-continuous non-increasing and $N(x) \to 0$ as $x \to \infty$.
Formula (51) shows recurrently that for all integer $n$
and all $x \ge 0$ we have
\[ N(nx) = N(x)^n. \tag{52} \]
Since $N(0) = 1$ and $N(\cdot)$ is right continuous, it follows from
(52) that $r = N(1) > 0$. Therefore (52) implies
$N(n) = r^n$ and $N(1/n) = r^{1/n}$
(to see this, plug in (52) values
$x = 1$ and $x = 1/n$ respectively).
Hence $N(n/m) = N(1/m)^n = r^{n/m}$ (by putting $x = 1/m$ in (52)),
i.e. for each rational $q > 0$ we have
\[ N(q) = r^q. \tag{53} \]
Since $N(x)$ is right-continuous,
$N(x) = \lim_{q \searrow x} N(q) = r^x$ for each $x \ge 0$. It remains to notice that $r < 1$,
which follows from the fact that $N(x) \to 0$ as $x \to \infty$.
Therefore $r = \exp(-\lambda)$ for some $\lambda > 0$, and
$N(x) = \exp(-\lambda x)$, $x \ge 0$.
[¯]
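The tail identity $N(at)N(bt) = N(t)$ used in the proof can be checked numerically. The following is a minimal sketch, not part of the original argument: the rate 1.5, the grid of values, and the contrasting tail $\exp(-x^2)$ are illustrative assumptions chosen here.

```python
import math

def exponential_tail(x, rate=1.5):
    """Tail N(x) = P(X >= x) of an exponential variable; `rate` is an illustrative choice."""
    return math.exp(-rate * x) if x > 0 else 1.0

def squared_exponent_tail(x):
    """A non-exponential tail, N(x) = exp(-x^2), used only for contrast."""
    return math.exp(-x * x) if x > 0 else 1.0

def largest_defect(tail):
    """Largest |N(a t) N(b t) - N(t)| over a small grid of a in (0, 1), b = 1 - a, t > 0."""
    worst = 0.0
    for i in range(1, 10):
        a = i / 10.0
        b = 1.0 - a
        for t in (0.1, 0.5, 1.0, 2.0, 5.0):
            worst = max(worst, abs(tail(a * t) * tail(b * t) - tail(t)))
    return worst

defect_exponential = largest_defect(exponential_tail)
defect_other = largest_defect(squared_exponent_tail)
print(defect_exponential, defect_other)
```

The exponential tail satisfies the identity up to rounding, while the contrasting tail visibly fails it, in line with the conclusion of the theorem.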
3.5 Exponential distributions on lattices
The abstract notation of this section follows
[].
Let $\mathbb{L}$ be a vector space with norm $\|\cdot\|$. Suppose
that $\mathbb{L}$ is also a lattice with the operations minimum
$\wedge$ and maximum $\vee$ which are consistent with the
vector operations and with the norm. The related order is then
defined by
$x \preceq y$ iff $x \vee y = y$ (or, alternatively: iff
$x \wedge y = x$).
By consistency with vector operations we mean that$^{10}$
\[ (x + y) \wedge (z + y) = y + (x \wedge z) \quad \text{for all } x, y, z \in \mathbb{L}, \]
\[ (ax) \wedge (ay) = a (x \wedge y) \quad \text{for all } x, y \in \mathbb{L},\ a \ge 0 \]
and
Consistency with the norm means
\[ \|x\| \le \|y\| \quad \text{for all } 0 \preceq x \preceq y. \]
Moreover, we assume that there is a $\sigma$-field $\mathcal{F}$ such
that all the operations considered are measurable.
The vector space $\mathbb{R}^d$ with
\[ x \wedge y = (\min\{x_j; y_j\})_{1 \le j \le d} \tag{54} \]
and with the norm $\|x\| = \max_j |x_j|$ satisfies
the above requirements. Other examples are provided by the
function spaces with the usual norms; for instance, a familiar
example is the space $C[0, 1]$ of all continuous functions with
the standard supremum norm and the pointwise minimum of functions
as the lattice operation.
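The consistency requirements can be verified mechanically for the coordinatewise minimum on a small integer grid in the plane. This is only an illustrative sketch; the helper names (`meet`, `preceq`, `max_norm`) and the choice of grid are assumptions made here, and a finite check does not replace the easy general proof.

```python
from itertools import product

def meet(x, y):
    """Coordinatewise minimum, the lattice operation (54)."""
    return tuple(min(a, b) for a, b in zip(x, y))

def add(x, y):
    return tuple(a + b for a, b in zip(x, y))

def scale(c, x):
    return tuple(c * a for a in x)

def max_norm(x):
    return max(abs(a) for a in x)

def preceq(x, y):
    """x precedes y in the lattice order, i.e. meet(x, y) == x."""
    return meet(x, y) == x

points = list(product(range(-2, 3), repeat=2))   # small integer grid in R^2
zero = (0, 0)

# (x + y) ^ (z + y) = y + (x ^ z)
translation_ok = all(
    meet(add(x, y), add(z, y)) == add(y, meet(x, z))
    for x in points for y in points for z in points
)
# (c x) ^ (c y) = c (x ^ y) for c >= 0
scaling_ok = all(
    meet(scale(c, x), scale(c, y)) == scale(c, meet(x, y))
    for c in (0, 1, 3) for x in points for y in points
)
# ||x|| <= ||y|| whenever 0 <= x <= y in the lattice order
norm_ok = all(
    max_norm(x) <= max_norm(y)
    for x in points for y in points
    if preceq(zero, x) and preceq(x, y)
)
print(translation_ok, scaling_ok, norm_ok)
```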
The following abstract definition complements
[].
Definition 8 A random variable $X: \Omega \to \mathbb{L}$ has
exponential distribution if the following two conditions are satisfied:
(i) $X \succeq 0$;
(ii) if $X'$ is an independent copy of $X$ then for any
$0 < a < 1$ random variables $(X/a) \wedge (X'/(1-a))$ and $X$
have the same distribution.
Example 7
Let $\mathbb{L} = \mathbb{R}^d$ with $\wedge$
defined coordinatewise by (54) as in the above discussion.
Then any $\mathbb{R}^d$-valued exponential random variable has the
multivariate exponential distribution in the sense of
Pickands,
see []. This distribution is also
known as the Marshall-Olkin distribution.
Using the definition above, it is easy to notice that if
$(X_1, \dots, X_d)$ has the exponential distribution, then
$\min\{X_1, \dots, X_d\}$ has the exponential distribution on
the real line. The next result is attributed to Pickands,
see [].
Proposition 5
Let $X = (X_1, \dots, X_d)$ be
an $\mathbb{R}^d$-valued exponential random variable. Then
the real random variable
$\min\{X_1/a_1, \dots, X_d/a_d\}$
is exponential for all $a_1, \dots, a_d > 0$.
Proof. Let $Z = \min\{X_1/a_1, \dots, X_d/a_d\}$. Let
$Z'$ be an independent copy of $Z$. By Theorem 3.4 it remains to
show that
\[ \min\{Z/a; Z'/b\} \cong Z \tag{55} \]
for all $a, b > 0$ such that $a + b = 1$.
It is easily seen that
\[ \min\{Z/a; Z'/b\} = \min\{Y_1/a_1, \dots, Y_d/a_d\}, \]
where $Y_i = \min\{X_i/a; X'_i/b\}$ and $X'$ is an independent
copy of $X$. However by the definition, $X$ has the same
distribution as $(Y_1, \dots, Y_d)$, so (55) holds.
[¯]
Remark: By taking a limit as $a_j \to 0$ for all $j \ne i$, from
Proposition 3.5 we obtain in particular that each component
$X_i$ is exponential.
Example 8
Let $\mathbb{L} = C[0, 1]$ with
$\{f \wedge g\}(x) := \min\{f(x), g(x)\}$. Then an exponential
random variable $X$ defines the stochastic process
$X(t)$ with continuous trajectories and such that
$\{X(t_1), X(t_2), \dots, X(t_n)\}$ has the $n$-dimensional
Marshall-Olkin distribution
for each integer $n$ and for
all $t_1, \dots, t_n$ in $[0, 1]$.
The following result shows that the supremum $\sup_t |X(t)|$
of the
exponential process from Example 3.5 has
the moment generating function
in a neighborhood of 0. The corresponding result for Gaussian
processes will be proved in Sections and
. Another result on infinite dimensional
exponential distributions will be given in Theorem .
Proposition 6
If $\mathbb{L}$ is a lattice with
the measurable norm $\|\cdot\|$ consistent with the algebraic
operation $\wedge$, then for each exponential $\mathbb{L}$-valued
random variable $X$ there is $\lambda > 0$ such that
$E\exp(\lambda \|X\|) < \infty$.
Proof. The result follows easily from the trivial inequality
\[ P(\|X\| \ge 2x) = P(\|X \wedge X'\| \ge x) \le (P(\|X\| \ge x))^2 \]
and Corollary 1.3.
[¯]
3.6 Problems
Problem 32 (deterministic analogue of Theorem 3.1)
Show that if $X, Y \ge 0$ are i. i. d.
and $2X$ has the same distribution as $X + Y$, then $X, Y$ are non-random$^{11}$.
Problem 33
Suppose random variables X1, X2 satisfy
the assumptions of Theorem 3.1 and have finite second
moments. Use the Central Limit Theorem to prove that X1 is normal.
Problem 34
Let $\mathsf{V}$ be a metric space with
a measurable metric $d$. We shall say that a $\mathsf{V}$-valued sequence
of random variables $S_n$ converges to $Y$ in distribution, if
there exists a sequence $\hat{S}_n$ convergent to $Y$ in probability
(i.e. $P(d(\hat{S}_n, Y) > \epsilon) \to 0$ as $n \to \infty$) and
such that $S_n \cong \hat{S}_n$ (in distribution) for each $n$. Let $X_n$ be
a sequence of $\mathsf{V}$-valued independent random variables and put
$S_n = X_1 + \dots + X_n$. Show that if $S_n$ converges
in distribution (in the above sense), then the limit is
an E-Gaussian random variable$^{12}$.
Problem 35
For a separable Banach-space valued Gaussian vector $X$
define the mean $m = EX$ as the unique vector that satisfies
$l(m) = E\,l(X)$ for all continuous linear functionals $l \in \mathsf{V}^*$.
It is also known that random vectors with equal characteristic functions
$\phi(l) = E\exp(i\,l(X))$ have the same probability distribution.
Suppose $X$ is a Gaussian vector with the non-zero mean $m$.
Show that for a measurable linear subspace $\mathbb{L} \subset \mathsf{V}$,
if $m \notin \mathbb{L}$ then $P(X \in \mathbb{L}) = 0$.
Problem 36 (deterministic analogue of Theorem 3.3)
Show
that if i. i. d. random variables $X, Y$ have moments of all orders and
$X + 2Y \cong 3X$, then $X, Y$ are non-random.
Problem 37
Show that if $X, Y$ are independent and $X + Y \cong X$, then $Y = 0$ a. s.
Chapter 4 Rotation invariant distributions
4.1 Spherically symmetric vectors
Definition 9 A random vector $X = (X_1, X_2, \dots, X_n)$
is spherically symmetric if the distribution of every
linear form
\[ a_1 X_1 + a_2 X_2 + \dots + a_n X_n \tag{56} \]
is the same for all $a_1, a_2, \dots, a_n$, provided
$a_1^2 + a_2^2 + \dots + a_n^2 = 1$.
A slightly more general class of the so called elliptically
contoured distributions has
been studied from the point of view of applications to statistics in
[]. Elliptically
contoured distributions are images of spherically symmetric random
variables under a linear transformation of IRn.
Additional information can also be found in
[],
which is devoted to the characterization problems and overlaps
slightly with the contents of this section.
Let $\phi(t)$ be the characteristic function of $X$. Then
\[ \phi(t) = \phi\bigl(\|t\| \, (1, 0, \dots, 0)^T\bigr), \tag{57} \]
i.e. the characteristic function at $t$ can be written as a function
of $\|t\|$ only. Conversely, if $\phi(t)$ is a characteristic
function of a real random variable, then $\phi(\|t\|)$ corresponds to
an $\mathbb{R}^n$-valued random vector.
From the definition we also get the following.
Proposition 7
If $X = (X_1, \dots, X_n)$ is
spherically symmetric, then each of its marginals $Y = (X_1, \dots, X_k)$, where $k \le n$, is spherically symmetric.
This fact is very simple; just consider linear forms (56) with $a_{k+1} = \dots = a_n = 0$.
Example 9
Suppose
$\vec{\gamma} = (\gamma_1, \gamma_2, \dots, \gamma_n)$
is the sequence of independent identically distributed
normal $N(0, 1)$ random variables. Then $\vec{\gamma}$ is spherically symmetric.
Moreover, for any $m \ge 1$, $\vec{\gamma}$ can be extended to a longer
spherically invariant sequence
$(\gamma_1, \gamma_2, \dots, \gamma_{n+m})$.
In Theorem we will see that up to a random
scaling factor, this is essentially the only example of a spherically
symmetric sequence with arbitrarily long spherically symmetric
extensions$^{13}$.
In general a multivariate normal distribution is not spherically symmetric. But
if $X$ is a centered non-degenerate Gaussian r. v.,
then $A^{-1}X$ is spherically symmetric, see Theorem 2.2.
Spherical symmetry together with Theorem is sometimes
useful in computations as illustrated in Problem .
Example 10
Suppose $X = (X_1, \dots, X_n)$ has the
uniform distribution on the sphere $\|x\| = r$. Obviously, $X$ is
spherically symmetric. For $k < n$, vector $Y = (X_1, \dots, X_k)$ has the
density
\[ f(y) = C (r^2 - \|y\|^2)^{(n-k)/2 - 1}, \tag{58} \]
where $C$ is the normalizing constant (see for instance,
[]). In particular, $Y$ is spherically
symmetric and absolutely continuous in $\mathbb{R}^k$.
The density of real valued random variable $Z = \|Y\|$ at
point $z$ has an additional factor coming from the area of the
sphere of radius $z$ in $\mathbb{R}^k$, i.e.
\[ f_Z(z) = C z^{k-1} (r^2 - z^2)^{(n-k)/2 - 1}, \quad 0 \le z \le r. \tag{59} \]
Here $C = C(r, k, n)$ is again the normalizing constant. By rescaling,
it is easy to see that $C = r^{2-n} C_1(k, n)$, where
\[ C_1(k, n) = \Bigl( \int_0^1 z^{k-1} (1 - z^2)^{(n-k)/2 - 1} \, dz \Bigr)^{-1} = \frac{2\Gamma(n/2)}{\Gamma(k/2)\,\Gamma((n-k)/2)} = \frac{2}{B(k/2, (n-k)/2)}. \]
Therefore
\[ f_Z(z) = C_1 r^{2-n} z^{k-1} (r^2 - z^2)^{(n-k)/2 - 1}. \tag{60} \]
Finally, let us point out that the conditional distribution of
$\|(X_{k+1}, \dots, X_n)\|$ given $Y$ is concentrated at
one point $(r^2 - \|Y\|^2)^{1/2}$.
From expression (58) it is easy to see that for fixed
k, if n® ¥ and the
radius is r = Ön, then the
density of the corresponding Y converges to the density
of the i. i. d. normal sequence
(g1, g2, ¼, gk).
(This well known fact is usually attributed to H. Poincaré).
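The convergence attributed to Poincaré can be observed numerically for $k = 1$. The sketch below is an illustration added here, not taken from the text: it evaluates density (58) with $r = \sqrt{n}$, using the standard normalizing constant $\Gamma(n/2) / (\sqrt{\pi}\, \Gamma((n-1)/2)\, r^{n-2})$ for the one-dimensional marginal (an assumption written out by us), and compares it with the $N(0,1)$ density on a small grid.

```python
import math

def sphere_coordinate_density(y, n):
    """Density of one coordinate of a uniform point on the sphere of radius sqrt(n) in R^n.

    This is (58) with k = 1 and r = sqrt(n); the constant is evaluated through
    log-gamma to stay numerically stable for large n.
    """
    r2 = float(n)
    if y * y >= r2:
        return 0.0
    log_const = (math.lgamma(n / 2.0) - math.lgamma((n - 1) / 2.0)
                 - 0.5 * math.log(math.pi) - 0.5 * math.log(r2))
    return math.exp(log_const + 0.5 * (n - 3) * math.log1p(-y * y / r2))

def normal_density(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

grid = [i / 2.0 for i in range(-6, 7)]            # y = -3.0, -2.5, ..., 3.0
max_error = {
    n: max(abs(sphere_coordinate_density(y, n) - normal_density(y)) for y in grid)
    for n in (10, 100, 10_000)
}
print(max_error)
```

The largest discrepancy on the grid shrinks as $n$ grows, which is the behaviour described above.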
Calculus formulas of Example 4.1 are important
for the general spherically symmetric case because
of the following representation.
Theorem 31
Suppose $X = (X_1, \dots, X_n)$ is
spherically symmetric.
Then $X = RU$,
where random variable $U$ is uniformly
distributed on the unit sphere in
$\mathbb{R}^n$, $R \ge 0$ is real valued with distribution $R \cong \|X\|$, and
random variables $R, U$ are stochastically independent.
Proof. The first step of the proof is to show that the
distribution of $X$ is invariant under all rotations
$U_{\mathsf{J}}: \mathbb{R}^n \to \mathbb{R}^n$. Indeed, since by definition
$\phi(t) = E\exp(it \cdot X) = E\exp(i\|t\| X_1)$, the characteristic function $\phi(t)$
of $X$ is a function of $\|t\|$ only. Therefore the characteristic
function $\psi$ of
$U_{\mathsf{J}} X$ satisfies
\[ \psi(t) = E\exp(it \cdot U_{\mathsf{J}} X) = E\exp(i U_{\mathsf{J}}^T t \cdot X) = E\exp(i\|t\| X_1) = \phi(t). \]
The group $O(n)$ of rotations of $\mathbb{R}^n$ (i.e. the group of
orthogonal $n \times n$ matrices) is a compact group; by $\mu$
we denote the normalized Haar measure
(cf. []).
Let $G$ be an $O(n)$-valued random variable with the
distribution $\mu$ and independent of $X$ ($G$ can be actually
written down explicitly; for example if
$n = 2$, $G = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$,
where $\theta$ is uniformly distributed on $[0, 2\pi]$.)
Clearly $X \cong GX \cong \|X\| \, GX/\|X\|$
conditionally on the event $\|X\| \ne 0$. To take care of the
possibility that $X = 0$, let $\Theta$ be uniformly distributed
on the unit sphere and put
\[ U = \begin{cases} GX/\|X\| & \text{if } X \ne 0, \\ \Theta & \text{if } X = 0. \end{cases} \]
It is easy to see that $U$ is uniformly distributed on the unit
sphere in $\mathbb{R}^n$ and that $U, X$ are independent. This ends the
proof, since
$X \cong GX = \|X\| U$.
[¯]
The next result explains the connection between spherical symmetry
and linearity of regression. Actually, condition () under
additional assumptions characterizes elliptically contoured
distributions, see
[,].
Proposition 8
If $X$
is a spherically symmetric random vector with finite first moments,
then
\[ E\{X_1 | a_1 X_1 + \dots + a_n X_n\} = \rho \sum_{k=1}^{n} a_k X_k \tag{61} \]
for all real numbers $a_1, \dots, a_n$, where
$\rho = \frac{a_1}{a_1^2 + \dots + a_n^2}$.
The simplest approach
here is to use the converse to Theorem 1.5; if
$\phi(\|t\|^2)$ denotes the characteristic function of $X$ (see
(57)), then the characteristic function of the pair
$a_1 X_1 + \dots + a_n X_n$, $X_1$ evaluated at point $(t, s)$ is
$\psi(t, s) = \phi((s + a_1 t)^2 + (a_2 t)^2 + \dots + (a_n t)^2)$. Hence
\[ (a_1^2 + \dots + a_n^2) \frac{\partial}{\partial s} \psi(t, s) \Big|_{s = 0} = a_1 \frac{\partial}{\partial t} \psi(t, 0). \]
[Figure 4.1: Linear regression for the uniform distribution on a circle. Picture omitted.]
Another possible proof is to use Theorem 4.1 to reduce
() to the uniform case. This can be done as follows.
Using the well known properties of conditional expectations, we have
\[ E\{X_1 | a_1 X_1 + \dots + a_n X_n\} = E\{R U_1 | R(a_1 U_1 + \dots + a_n U_n)\} \]
\[ = E\{ E\{R U_1 | R, a_1 U_1 + \dots + a_n U_n\} \,|\, R(a_1 U_1 + \dots + a_n U_n)\}. \]
Clearly,
\[ E\{R U_1 | R, a_1 U_1 + \dots + a_n U_n\} = R \, E\{U_1 | R, a_1 U_1 + \dots + a_n U_n\} \]
and
\[ E\{U_1 | R, a_1 U_1 + \dots + a_n U_n\} = E\{U_1 | a_1 U_1 + \dots + a_n U_n\}, \]
see
Theorem 1.4 (ii) and (iii).
Therefore it suffices to establish () for the uniform
distribution on the unit sphere. The last fact is quite obvious from
symmetry considerations; for the 2-dimensional situation this
can be illustrated on a picture. Namely, the hyper-plane
$a_1 x_1 + \dots + a_n x_n = \text{const}$ intersects the unit sphere along a
translation of a suitable $(n-1)$-dimensional sphere $S$; integrating
$x_1$ over $S$ we get the same fraction (which depends on
$a_1, \dots, a_n$) of const. [¯]
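A quick Monte Carlo experiment makes the regression coefficient $a_1/(a_1^2 + \dots + a_n^2)$ tangible for the uniform distribution on the unit sphere. This is a hedged sketch: the coefficients $(1, 2, 2)$, the sample size, and the seed are arbitrary choices made here, and a least-squares slope through the origin only tests the linear-projection consequence of the conditional expectation formula, not the full statement.

```python
import math
import random

def uniform_on_sphere(n, rng):
    """Uniform point on the unit sphere in R^n, obtained by normalizing i.i.d. N(0,1) coordinates."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]

def regression_slope(a, samples, rng):
    """Least-squares slope (through the origin) of X_1 on S = a_1 X_1 + ... + a_n X_n."""
    num = 0.0
    den = 0.0
    for _ in range(samples):
        x = uniform_on_sphere(len(a), rng)
        s = sum(ai * xi for ai, xi in zip(a, x))
        num += x[0] * s
        den += s * s
    return num / den

rng = random.Random(12345)                   # fixed seed, chosen arbitrarily
a = [1.0, 2.0, 2.0]
rho = a[0] / sum(ai * ai for ai in a)        # predicted coefficient, here 1/9
slope = regression_slope(a, 50_000, rng)
print(rho, slope)
```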
The following theorem shows that spherical symmetry allows us to
eliminate the assumption of independence in Theorem *, see also
Theorem below.
The result for rational a is due to
S. Cambanis, S. Huang & G. Simons
[];
for related exponential results see
[].
Theorem 32
Let $X = (X_1, \dots, X_n)$ be a
spherically symmetric random vector such that
$E\|X\|^a < \infty$ for some real $a > 0$. If
\[ E\{\|(X_1, \dots, X_m)\|^a \,|\, (X_{m+1}, \dots, X_n)\} = \text{const} \]
for some $1 \le m < n$, then $X$ is Gaussian.
Our method of proof of Theorem 4.1 will also provide easy access
to the following interesting result due to Szabowski
[],
see also [].
Theorem 33
Let $X = (X_1, \dots, X_n)$ be a
spherically symmetric random vector such that $E\|X\|^2 < \infty$
and $P(X = 0) = 0$.
Suppose $c(x)$ is a real function with the property that there is
$0 \le U \le \infty$ such that $1/c(x)$ is integrable
on each finite sub-interval of the interval
$[0, U]$ and that $c(x) = 0$ for all $x > U$.
If for some $1 \le m < n$
\[ E\{\|(X_1, \dots, X_m)\|^2 \,|\, (X_{m+1}, \dots, X_n)\} = c(\|(X_{m+1}, \dots, X_n)\|), \]
then the
distribution of $X$ is determined uniquely by $c(x)$.
To prove both theorems we shall need the following.
Lemma 12
Let $X = (X_1, \dots, X_n)$ be a
spherically symmetric random vector such that $P(X = 0) = 0$ and let
$H$ denote the distribution of $\|X\|$. Then we have the following.
(a) For $m < n$ the r. v. $\|(X_{m+1}, \dots, X_n)\|$ has the density function $g(x)$ given
by
\[ g(x) = C x^{n-m-1} \int_x^\infty r^{-n+2} (r^2 - x^2)^{m/2 - 1} H(dr), \tag{62} \]
where
$C = 2\Gamma(\frac12 n) \bigl(\Gamma(\frac12 m)\Gamma(\frac12 (n-m))\bigr)^{-1}$ is a normalizing constant of no further importance
below.
(b) The distribution of $X$ is determined uniquely by the
distribution of its single component $X_1$.
(c) The conditional distribution of
$\|(X_1, \dots, X_m)\|$ given $(X_{m+1}, \dots, X_n)$
depends only on
the $\mathbb{R}^{n-m}$-norm $\|(X_{m+1}, \dots, X_n)\|$ and
\[ E\{\|(X_1, \dots, X_m)\|^a \,|\, (X_{m+1}, \dots, X_n)\} = h(\|(X_{m+1}, \dots, X_n)\|), \]
where
\[ h(x) = \frac{\int_x^\infty r^{-n+2} (r^2 - x^2)^{(m+a)/2 - 1} H(dr)}{\int_x^\infty r^{-n+2} (r^2 - x^2)^{m/2 - 1} H(dr)}. \tag{63} \]
Sketch of the proof. Formulas (62) and
(63) follow from Theorem 4.1 by conditioning on
$R$, see Example 4.1. Fact (b) seems to be intuitively obvious;
it says that from the distribution of the product $U_1 R$ of independent
random variables (where $U_1$ is the 1-dimensional marginal of the
uniform distribution on the unit sphere in $\mathbb{R}^n$) we can recover the
distribution of $R$. Indeed, this follows from Theorem 1.8
and (62) applied to $m = n-1$: multiplying
$g(x) = C \int_x^\infty r^{-n+2} (r^2 - x^2)^{(n-1)/2 - 1} H(dr)$ by
$x^{u-1}$ and integrating, we get the formula which shows that
from $g(x)$ we can determine the integrals
$\int_0^\infty r^{u-1} H(dr)$, cf. () below.[¯]
Lemma 13
Suppose $c_a(\cdot)$ is
a function such that
\[ E\{\|(X_1, \dots, X_m)\|^a \,|\, (X_{m+1}, \dots, X_n)\} = c_a(\|(X_{m+1}, \dots, X_n)\|^2). \]
Then the function $f(x) = x^{(m+1-n)/2} g(x^{1/2})$, where $g(\cdot)$ is
defined by (62), satisfies
\[ c_a(x) f(x) = \frac{1}{B(a/2, m/2)} \int_x^\infty (y - x)^{a/2 - 1} f(y) \, dy. \tag{64} \]
Proof. As previously, let $H(dr)$ be the distribution of
$\|X\|$. The following formula for the beta
integral is well known, cf.
[].
\[ (r^2 - x^2)^{(m+a)/2 - 1} = \frac{2}{B(a/2, m/2)} \int_x^r t \, (t^2 - x^2)^{a/2 - 1} (r^2 - t^2)^{m/2 - 1} \, dt. \tag{65} \]
Substituting (65) into (63) and changing the
order of integration we get
\[ c_a(x^2) g(x) = C x^{n-m-1} \frac{2}{B(a/2, m/2)} \int_x^\infty t \, (t^2 - x^2)^{a/2 - 1} \int_t^\infty r^{-n+2} (r^2 - t^2)^{m/2 - 1} H(dr) \, dt. \]
Using (62) we have therefore
\[ c_a(x^2) g(x) = x^{n-m-1} \frac{2}{B(a/2, m/2)} \int_x^\infty (t^2 - x^2)^{a/2 - 1} t^{m+1-n} g(t) \, t \, dt. \]
Substituting $f(\cdot)$ and changing the variable of integration
from $t$ to $t^2$ ends the proof of (64).
[¯]
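The classical beta integral behind this step can be checked numerically in the variable $s$, which plays the role of $t^2$ in the proof: $\int_{x^2}^{r^2} (s - x^2)^{a/2-1} (r^2 - s)^{m/2-1} \, ds = B(a/2, m/2)\,(r^2 - x^2)^{(m+a)/2-1}$. The sketch below is an illustration added here; the parameter values are arbitrary, chosen so that the integrand is a polynomial and Simpson's rule is essentially exact.

```python
import math

def beta(p, q):
    """Euler beta function B(p, q) via log-gamma."""
    return math.exp(math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q))

def simpson(func, lo, hi, steps=2000):
    """Composite Simpson rule; `steps` must be even."""
    h = (hi - lo) / steps
    total = func(lo) + func(hi)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * func(lo + i * h)
    return total * h / 3.0

def beta_integral_sides(a, m, x, r):
    """Return (integral, closed form) for the beta integral written in s = t^2."""
    def integrand(s):
        return (s - x * x) ** (a / 2.0 - 1.0) * (r * r - s) ** (m / 2.0 - 1.0)
    left = simpson(integrand, x * x, r * r)
    right = beta(a / 2.0, m / 2.0) * (r * r - x * x) ** ((m + a) / 2.0 - 1.0)
    return left, right

left, right = beta_integral_sides(a=4, m=6, x=1.0, r=2.0)
print(left, right)
```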
Proof of Theorem 4.1. By Lemma 4.1 we need
only to show that for $a = 2$ equation (64) has a
unique solution. Since $f(\cdot) \ge 0$, it follows from
(64) that $f(y) = 0$ for all $y \ge U$. Therefore it
suffices to show that $f(x)$ is determined uniquely for $x < U$.
Since the right hand side of (64) is differentiable,
from (64) we get $2 \frac{d}{dx}\bigl(c(x) f(x)\bigr) = -m f(x)$.
Thus $b(x) := c(x) f(x)$ satisfies the equation
\[ 2 b'(x) = -m \frac{b(x)}{c(x)} \]
at each point $0 \le x < U$. Hence
$b(x) = C \exp\bigl(-\frac12 m \int_0^x \frac{1}{c(t)} \, dt\bigr)$. This shows that
\[ f(x) = \frac{C}{c(x)} \exp\Bigl(-\frac12 m \int_0^x \frac{1}{c(t)} \, dt\Bigr) \]
is
determined uniquely (here $C > 0$ is a normalizing constant).
[¯]
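As a sanity check of the closed-form solution, consider the uniform distribution on the sphere of radius $R$. There the squared norms add up to $R^2$, so, taking $c$ as a function of the squared norm as in (64), one has $c(x) = R^2 - x$ for $x < R^2$, and (59) suggests $f(x)$ proportional to $(R^2 - x)^{m/2 - 1}$. The sketch below is our own illustration under these assumptions (with $R = 1$, $m = 4$, and the constant $C$ set to 1); it evaluates the formula from the proof by numerical integration and compares.

```python
import math

def simpson(func, lo, hi, steps=2000):
    """Composite Simpson rule; `steps` must be even."""
    h = (hi - lo) / steps
    total = func(lo) + func(hi)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * func(lo + i * h)
    return total * h / 3.0

def solution_from_c(c, x, m):
    """f(x) = (1 / c(x)) * exp(-(m / 2) * int_0^x dt / c(t)), with the constant C set to 1."""
    integral = simpson(lambda t: 1.0 / c(t), 0.0, x)
    return math.exp(-0.5 * m * integral) / c(x)

m = 4
radius_sq = 1.0

def c_sphere(x):
    """Conditional second moment for the uniform law on the sphere, as a function of the squared norm."""
    return radius_sq - x

ratios = [
    solution_from_c(c_sphere, x, m) / (radius_sq - x) ** (m / 2.0 - 1.0)
    for x in (0.2, 0.5, 0.8)
]
print(ratios)
```

The ratios are constant in $x$, so the formula reproduces the expected density up to normalization.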
Lemma 14
If $p(s)$ is a periodic and analytic function of complex
argument $s$ with the real
period, and for real $t$ the function $t \mapsto \log(p(t)\Gamma(t + C))$
is real valued and convex, then $p(s) = \text{const}$.
Proof.
For all positive $x$ we have
\[ \frac{d^2}{dx^2} \log p(x) + \frac{d^2}{dx^2} \log \Gamma(x) \ge 0. \tag{66} \]
However it is known that
$\frac{d^2}{dx^2} \log \Gamma(x) = \sum_{n \ge 0} (n + x)^{-2} \to 0$ as
$x \to \infty$, see [].
Therefore (66) and the periodicity of $p(\cdot)$ imply that
$\frac{d^2}{dx^2} \log p(x) \ge 0$.
This means that the first derivative $\frac{d}{dx} \log p(\cdot)$ is
a continuous, real valued, periodic and non-decreasing function of the
real argument. Hence $\frac{d}{dx} \log p(x) = B \in \mathbb{R}$ for all
real $x$. Therefore $\log p(s) = A + Bs$ and, since $p(\cdot)$ is
periodic with real period, this implies
$B = 0$. This ends the proof.
[¯]
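The series for the second derivative of $\log \Gamma$ and its decay can be checked numerically. This is a small sketch added for illustration: the finite-difference step, the number of terms, and the tail estimate $1/(N + x - 1/2)$ are choices made here, not part of the text.

```python
import math

def second_derivative_log_gamma(x, h=1e-3):
    """Central second difference of log Gamma at x, a numerical stand-in for the derivative."""
    return (math.lgamma(x + h) - 2.0 * math.lgamma(x) + math.lgamma(x - h)) / (h * h)

def trigamma_series(x, terms=200_000):
    """Partial sum of sum_{n >= 0} (n + x)^(-2) plus an integral-type estimate of the remainder."""
    partial = sum((n + x) ** -2 for n in range(terms))
    return partial + 1.0 / (terms + x - 0.5)

points = (1.0, 5.0, 50.0, 500.0)
series_values = [trigamma_series(x) for x in points]
difference_values = [second_derivative_log_gamma(x) for x in points]
print(series_values)
print(difference_values)
```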
Proof of Theorem 4.1. There is nothing to prove,
if $X = 0$. If $P(X = 0) < 1$ then $P(X = 0) = 0$. Indeed, suppose,
on the contrary, that $P(X = 0) > 0$. By Theorem 4.1 this
means that $p = P(R = 0) > 0$ and that
$E\{\|(X_1, \dots, X_m)\|^a \,|\, (X_{m+1}, \dots, X_n)\} = 0$
with positive probability $p > 0$. Therefore
$E\{\|(X_1, \dots, X_m)\|^a \,|\, (X_{m+1}, \dots, X_n)\} = 0$
with probability 1. Hence $R = 0$ and $X = 0$ a. s., a contradiction.
Throughout the rest of this proof we assume without loss of generality
that $P(X = 0) = 0$. By Lemmas 4.1 and 4.1, it remains
to show that the integral equation
\[ f(x) = K \int_x^\infty (y - x)^{b-1} f(y) \, dy \tag{67} \]
has the unique solution in the class of functions satisfying conditions
$f(\cdot) \ge 0$ and $\int_0^\infty x^{(n-m)/2 - 1} f(x) \, dx = 2$.
Let $M(s) = \int_0^\infty x^{s-1} f(x) \, dx$ be the Mellin transform of $f(\cdot)$,
see Section 1.8. It can be checked that $M(s)$ is well
defined and analytic for $s$ in the half-plane
$\operatorname{Re} s > \frac12 (n-m)$, see Theorem 1.8. This holds true because
the moments of all orders are finite, a claim which can be recovered with
the help of a variant of Theorem , see Problem ;
for a stronger conclusion see also
[].
The Mellin transform applied to both sides of (67) gives
\[ M(s) = K \frac{\Gamma(b)\Gamma(s)}{\Gamma(b + s)} M(b + s). \]
Thus the Mellin transform $M_1(\cdot)$
of the function $f(Cx)$, where $C = (K\Gamma(b))^{-1/b}$,
satisfies
\[ M_1(s) = M_1(b + s) \frac{\Gamma(s)}{\Gamma(b + s)}. \]
This shows that
$M_1(s) = p(s)\Gamma(s)$, where $p(\cdot)$ is analytic and periodic with real period
$b$. Indeed, since $\Gamma(s) \ne 0$ for $\operatorname{Re} s > 0$, function
$p(s) = M_1(s)/\Gamma(s)$ is well defined and analytic in the
half-plane $\operatorname{Re} s > 0$. Now notice that $p(\cdot)$, being periodic, has
analytic extension to the whole complex plane.
Since $f(\cdot) \ge 0$,
$\log M_1(x)$ is a well defined convex function of the
real argument $x$. This follows from the Cauchy-Schwarz
inequality, which says that
$M_1((t+s)/2) \le (M_1(t) M_1(s))^{1/2}$.
Hence by Lemma 4.1, $p(s) = \text{const}$.[¯]
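The conclusion $M_1(s) = \text{const} \cdot \Gamma(s)$ corresponds to an exponential function $f$. As a hedged illustration (the values $b = 1.5$ and the quadrature parameters are arbitrary choices made here), one can check numerically that $f(x) = e^{-x}$ solves the integral equation (67) when $K = 1/\Gamma(b)$.

```python
import math

def rhs_of_integral_equation(f, x, b, K, upper=60.0, steps=60_000):
    """K * int_x^inf (y - x)^(b-1) f(y) dy, by the midpoint rule in u = y - x on (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += u ** (b - 1.0) * f(x + u)
    return K * total * h

def candidate(x):
    """Candidate solution f(x) = exp(-x)."""
    return math.exp(-x)

b = 1.5
K = 1.0 / math.gamma(b)
relative_errors = [
    abs(rhs_of_integral_equation(candidate, x, b, K) - candidate(x)) / candidate(x)
    for x in (0.0, 1.0, 2.5)
]
print(relative_errors)
```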
Remark: Solutions of equation (67) have been found in
[]. Integral
equations of similar, but more general form occurred in potential
theory, see Deny [], see also Bochner
[] for an early work; for another proof
and recent literature, see
[].
4.2 Rotation invariant absolute moments
The following beautiful theorem is due to
M. S. Braverman []14.
Theorem 34
Let $X, Y, Z$ be independent identically distributed random variables
with finite moments of fixed order $p \in \mathbb{R}_+ \setminus 2\mathbb{N}$.
Suppose that there is constant $C$ such that for all real $a, b, c$
\[ E|aX + bY + cZ|^p = C (a^2 + b^2 + c^2)^{p/2}. \tag{68} \]
Then $X, Y, Z$ are normal.
Condition (68)
says that the absolute moment of a fixed order $p$ of
the projection onto any axis,
no matter how rotated, is the same; this fits well into the framework of
Theorem *.
Theorem 4.2
is a strictly 3-dimensional phenomenon,
at least if no additional conditions on random variables are
imposed. It does not hold for pairs of i. i. d. random
variables, see Problem below15.
Theorem 4.2 cannot be extended to other values of
exponent p; if p is an even integer, then (68)
is not strong enough to imply the normal distribution
(the easiest case to see this is of course p = 2).
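The case $p = 2$ can be made concrete with i. i. d. random signs, which are clearly not normal. The sketch below is an illustration added here (the unit vectors are arbitrary choices): by exact enumeration, condition (68) holds for $p = 2$ with $C = 1$, but already fails for $p = 1$.

```python
import math
from itertools import product

def abs_moment_of_signs(coeffs, p):
    """E|a X + b Y + c Z|^p for i.i.d. symmetric signs X, Y, Z = +1 or -1, by enumeration."""
    total = 0.0
    for signs in product((-1, 1), repeat=len(coeffs)):
        total += abs(sum(c * s for c, s in zip(coeffs, signs))) ** p
    return total / 2 ** len(coeffs)

unit_vectors = [
    (1.0, 0.0, 0.0),
    (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0), 0.0),
    (1.0 / math.sqrt(3.0),) * 3,
    (0.6, 0.8, 0.0),
]
second_moments = [abs_moment_of_signs(v, 2) for v in unit_vectors]
first_moments = [abs_moment_of_signs(v, 1) for v in unit_vectors]
spread_p2 = max(second_moments) - min(second_moments)
spread_p1 = max(first_moments) - min(first_moments)
print(spread_p2, spread_p1)
```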
Following Braverman's argument, we obtain
Theorem 4.2 as a corollary to Theorem 3.1.
To this end, we shall use the following result of independent interest.
Theorem 35
If $p \in \mathbb{R}_+ \setminus 2\mathbb{N}$ and
$X, Y, Z$ are independent symmetric
$p$-integrable random variables such that $P(Z = 0) < 1$ and
\[ E|X + tZ|^p = E|Y + tZ|^p \quad \text{for all real } t, \tag{69} \]
then $X \cong Y$ in distribution.
Theorem 4.2 resembles
Problem 1.9, and it seems to be related to potential theory, see
[] and
[].
Similar results have functional analytic importance, see
Rudin []; also Hall []
and Hardin [] might
be worth seeing in this context. Koldobskii
[,] gives Banach
space versions of the results and relevant references.
Theorem 4.2 follows immediately
from Theorem 4.2 by the following argument.
Proof of Theorem 4.2 . Clearly there is
nothing to prove, if $C = 0$, see also Problem . Suppose
therefore $C \ne 0$. It follows from the assumption that
$E|X + Y + tZ|^p = E|\sqrt{2} X + tZ|^p$ for all real $t$. Note also that
$E|Z|^p = C \ne 0$. Therefore Theorem 4.2 applied to $X + Y$, $X'$
and $Z$, where $X'$ is an independent copy of $\sqrt{2} X$, implies that
$X + Y$ and $\sqrt{2} X$ have the same distribution. Since $X, Y$ are
i. i. d., by Theorem 3.1 $X, Y, Z$ are normal.
[¯]
A related result
The next result can be thought of as a version of
Theorem 4.2 corresponding to $p = 0$.
For the proof see
[,,].
Theorem 36
If $X = (X_1, \dots, X_n)$ is an at
least 3-dimensional random vector such that its components
$X_1, \dots, X_n$ are independent, $P(X = 0) = 0$ and $X/\|X\|$
has the uniform distribution on the unit sphere in $\mathbb{R}^n$, then $X$
is Gaussian.
4.2.1 Proof of Theorem for p = 1
We shall first present a slightly simplified proof for $p = 1$ which is based on
the elementary identity $\max\{x, y\} = \frac12 (x + y + |x - y|)$.
This proof leads directly to the exponential analogue of
Theorem 4.2; the exponential version is given as Problem
below.
We shall begin with
the lemma which gives an analytic version of condition (69).
Lemma 15
Let $X_1, X_2, Y_1, Y_2$ be symmetric
independent random variables such that $E|Y_i| < \infty$ and
$E|X_i| < \infty$, $i = 1, 2$. Denote
$N_i(t) = P(|X_i| \ge t)$, $M_i(t) = P(|Y_i| \ge t)$, $t \ge 0$, $i = 1, 2$.
Then each of the conditions
\[ E|a_1 X_1 + a_2 X_2| = E|a_1 Y_1 + a_2 Y_2| \quad \text{for all } a_1, a_2 \in \mathbb{R}; \tag{70} \]
\[ \int_0^\infty N_1(t) N_2(xt) \, dt = \int_0^\infty M_1(t) M_2(xt) \, dt \quad \text{for all } x > 0; \tag{71} \]
\[ \int_0^\infty N_1(xt) N_2(yt) \, dt = \int_0^\infty M_1(xt) M_2(yt) \, dt \quad \text{for all } x, y \ge 0,\ |x| + |y| \ne 0; \tag{72} \]
implies the other two.
Proof. For all real numbers $x, y$ we have
$|x - y| = 2\max\{x, y\} - (x + y)$. Therefore, taking into account the symmetry of the distributions for
all real $a, b$ we have
\[ E|aX_1 - bX_2| = 2 E \max\{aX_1, bX_2\}. \tag{73} \]
For an integrable random variable $Z$ we have
$EZ = \int_0^\infty P(Z \ge t) \, dt - \int_0^\infty P(-Z \ge t) \, dt$, see (3).
This identity applied to
$Z = \max\{aX_1, bX_2\}$, where $a, b \ge 0$
are fixed, gives
\[ E \max\{aX_1, bX_2\} = \int_0^\infty P(Z \ge t) \, dt - \int_0^\infty P(Z \le -t) \, dt \]
\[ = \int_0^\infty P(aX_1 \ge t) \, dt + \int_0^\infty P(bX_2 \ge t) \, dt \]
\[ - \int_0^\infty P(aX_1 \ge t) P(bX_2 \ge t) \, dt - \int_0^\infty P(aX_1 \le -t) P(bX_2 \le -t) \, dt. \]
Therefore, from (73) after taking the symmetry of
distributions into account, we obtain
\[ E|aX_1 - bX_2| = 2aEX_1^+ + 2bEX_2^+ - 4 \int_0^\infty P(aX_1 \ge t) P(bX_2 \ge t) \, dt, \]
where $X_i^+ = \max\{X_i, 0\}$, $i = 1, 2$. Since by symmetry
$P(aX_1 \ge t) = \frac12 N_1(t/a)$ and $P(bX_2 \ge t) = \frac12 N_2(t/b)$ for $t > 0$ and $a, b > 0$, this gives
\[ E|aX_1 - bX_2| = 2aEX_1^+ + 2bEX_2^+ - \int_0^\infty N_1(t/a) N_2(t/b) \, dt. \tag{74} \]
Similarly
\[ E|aY_1 - bY_2| = 2aEY_1^+ + 2bEY_2^+ - \int_0^\infty M_1(t/a) M_2(t/b) \, dt. \tag{75} \]
Once formulas (74) and (75) are established, we
are ready to prove the equivalence of conditions
(70)-(72).
(70)$\Rightarrow$(71): If condition (70)
is satisfied, then $E|X_i| = E|Y_i|$, $i = 1, 2$ and thus by
symmetry
$EX_i^+ = EY_i^+$, $i = 1, 2$. Therefore (74) and
(75) applied to $a = 1$, $b = 1/x$ imply (71) for
any fixed $x > 0$.
(71)$\Rightarrow$(72): Changing the variable in
(71) we obtain (72) for all $x > 0$, $y > 0$.
Since $E|Y_i| < \infty$ and $E|X_i| < \infty$ we can pass in
(72) to the limit as $x \to 0$, while $y$ is fixed, or as
$y \to 0$, while $x$ is fixed, and hence (72) is proved
in its full generality.
(72)$\Rightarrow$(70): If condition (72)
is satisfied, then taking $x = 0$, $y = 1$ or $x = 1$, $y = 0$ we obtain
$E|X_i| = E|Y_i|$, $i = 1, 2$ and thus by symmetry
$EX_i^+ = EY_i^+$, $i = 1, 2$. Therefore identities (74) and (75)
applied to
$a = 1/x$, $b = 1/y$ imply (70) for any $a_1 > 0$, $a_2 < 0$. Since
$E|Y_i| < \infty$ and $E|X_i| < \infty$, we can pass in (70)
to the limit as $a_1 \to 0$, or as $a_2 \to 0$. This
proves equality (70) for all $a_1 \ge 0$, $a_2 \le 0$. However, since
$X_i, Y_i$, $i = 1, 2$, are symmetric, this proves
(70) in its full generality.
[¯]
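The identity $E|aX_1 - bX_2| = 2aEX_1^+ + 2bEX_2^+ - 4\int_0^\infty P(aX_1 \ge t) P(bX_2 \ge t)\, dt$ derived in the proof can be verified exactly for discrete symmetric laws, since the integrand is then a step function. The two laws and the coefficients below are illustrative assumptions chosen here; exact rational arithmetic avoids rounding issues.

```python
from fractions import Fraction as F

law_1 = {F(-2): F(1, 4), F(0): F(1, 2), F(2): F(1, 4)}   # symmetric, with an atom at 0
law_2 = {F(-3): F(1, 2), F(3): F(1, 2)}                   # symmetric two-point law

def upper_tail(law, scale, t):
    """P(scale * X >= t) for a discrete law given as {value: probability}."""
    return sum(p for v, p in law.items() if scale * v >= t)

def mean_positive_part(law):
    """E max{X, 0}."""
    return sum(p * v for v, p in law.items() if v > 0)

def tail_product_integral(law_x, a, law_y, b):
    """Exact value of int_0^inf P(aX >= t) P(bY >= t) dt for discrete laws and a, b > 0."""
    cuts = sorted({a * v for v in law_x if v > 0} | {b * v for v in law_y if v > 0})
    total, left = F(0), F(0)
    for right in cuts:
        # Both tails are constant on (left, right], so evaluate them at the right endpoint.
        total += (right - left) * upper_tail(law_x, a, right) * upper_tail(law_y, b, right)
        left = right
    return total

def left_side(a, b):
    return sum(p * q * abs(a * v - b * w)
               for v, p in law_1.items() for w, q in law_2.items())

def right_side(a, b):
    return (2 * a * mean_positive_part(law_1) + 2 * b * mean_positive_part(law_2)
            - 4 * tail_product_integral(law_1, a, law_2, b))

coefficients = [(F(1), F(1)), (F(3, 2), F(1, 3)), (F(5), F(2, 7))]
results = [(left_side(a, b), right_side(a, b)) for a, b in coefficients]
print(results)
```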
The next result translates (70) into the property
of the Mellin transform. A similar analytical identity is used in the proof of
Theorem 2.0.1.
Lemma 16
Let $X_1, X_2, Y_1, Y_2$ be symmetric
independent random variables such that $E|Y_j| < \infty$ and
$E|X_j| < \infty$, $j = 1, 2$. Let $0 < u < 1$ be fixed. Then condition
(70) is equivalent to
\[ E|X_1|^{u+it} E|X_2|^{1-u-it} = E|Y_1|^{u+it} E|Y_2|^{1-u-it} \quad \text{for all } t \in \mathbb{R}. \tag{76} \]
Proof. By Lemma 2.4.3, it suffices to show that conditions
(76) and (71) are equivalent.
Proof of (71)$\Rightarrow$(76): Multiplying both
sides of (71) by $x^{-u-it}$, where $t \in \mathbb{R}$ is fixed,
integrating with respect to $x$ in the limits from 0 to $\infty$
and changing the order of integration (which is allowed, since the
integrals are absolutely convergent), then substituting $x = y/t$, we
get
\[ \int_0^\infty t^{it+u-1} N_1(t) \, dt \int_0^\infty y^{-u-it} N_2(y) \, dy = \int_0^\infty t^{it+u-1} M_1(t) \, dt \int_0^\infty y^{-u-it} M_2(y) \, dy. \]
This clearly implies (76), since, e.g.
\[ \int_0^\infty t^{it+u-1} N_j(t) \, dt = E|X_j|^{u+it}/(u + it), \quad j = 1, 2 \]
(this is just tail integration formula (2)).
Proof of (76)$\Rightarrow$(71): Notice that
\[ \phi_j(t) := \frac{u E|X_j|^{u+it}}{(u + it) E|X_j|^u}, \quad j = 1, 2 \]
is the
characteristic function of a random variable with the probability density
function
$f_{j,u}(x) := C_j \exp(xu) N_j(\exp(x))$, $x \in \mathbb{R}$, $j = 1, 2$, where
$C_j = C_j(u)$ is the normalizing constant.
Indeed,
\[ \int_{-\infty}^\infty e^{ixt} \exp(xu) N_j(\exp(x)) \, dx = \int_0^\infty y^{it} y^{u-1} N_j(y) \, dy = E|X_j|^{u+it}/(u + it) \]
and the normalizer $C_j(u) = u/E|X_j|^u$ is then chosen to have $\phi_j(0) = 1$, $j = 1, 2$.
Similarly
\[ \psi_j(t) := \frac{u E|Y_j|^{u+it}}{(u + it) E|Y_j|^u} \]
is the characteristic function of a random variable with the probability
density function $g_{j,u}(x) := K_j \exp(xu) M_j(\exp(x))$, $x \in \mathbb{R}$,
where $K_j = u/E|Y_j|^u$, $j = 1, 2$. Therefore (76) implies that
the following two convolutions are equal $f_{1,u} * \bar{f}_{2,1-u} = g_{1,u} * \bar{g}_{2,1-u}$, where
$\bar{f}_2(x) = f_2(-x)$, $\bar{g}_2(x) = g_2(-x)$. Since (76)
implies
$C_1(u) C_2(1-u) = K_1(u) K_2(1-u)$, a simple calculation shows that the
equality of convolutions implies
\[ \int_{-\infty}^\infty e^x N_1(e^x) N_2(e^y e^x) \, dx = \int_{-\infty}^\infty e^x M_1(e^x) M_2(e^y e^x) \, dx \]
for all real $y$. The last equality differs from (71) by
the change of variable only.
[¯]
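The tail integration formula $\int_0^\infty t^{s-1} P(|X| \ge t) \, dt = E|X|^s / s$ with complex $s = u + it$ is easy to test on a discrete law, where the integral of the step function can be evaluated in closed form. The law and the exponent below are illustrative assumptions chosen here.

```python
def tail(law, t):
    """P(|X| >= t) for a discrete law given as {value: probability}."""
    return sum(p for v, p in law.items() if abs(v) >= t)

def mellin_of_tail(law, s):
    """int_0^inf t^(s-1) P(|X| >= t) dt for Re s > 0, integrating the step function exactly."""
    atoms = sorted({abs(v) for v in law if v != 0})
    total = 0j
    previous_power = 0j                 # value of t^s at t = 0, valid because Re s > 0
    for c in atoms:
        power = complex(c) ** s
        total += tail(law, c) * (power - previous_power) / s
        previous_power = power
    return total

def abs_moment(law, s):
    """E|X|^s for complex s, ignoring a possible atom at 0."""
    return sum(p * complex(abs(v)) ** s for v, p in law.items() if v != 0)

law = {-2.0: 0.25, -1.0: 0.25, 1.0: 0.25, 2.0: 0.25}
s = complex(0.3, 1.7)                   # u = 0.3, t = 1.7
lhs = mellin_of_tail(law, s)
rhs = abs_moment(law, s) / s
print(lhs, rhs)
```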
Now we are ready to prove Theorem 4.2. The conclusion of
Lemma 4.2.1 suggests using the Mellin transform
$E|X|^{u+it}$, $t \in \mathbb{R}$. Recall from Section 1.8 that
if for some fixed $u > 0$ we have $E|X|^u < \infty$, then the function
$E|X|^{u+it}$, $t \in \mathbb{R}$, determines the distribution of
$|X|$ uniquely.
This and Lemma 4.2.1 are used in the proof of Theorem 4.2.
Proof of Theorem 4.2.
Lemma 4.2.1 implies that for each $0 < u < 1$, $-\infty < t < \infty$
\[ E|X|^{u+it} E|Z|^{1-u-it} = E|Y|^{u+it} E|Z|^{1-u-it}. \tag{77} \]
Since $E|Z|^s$ is an analytic function in the strip $0 < \operatorname{Re} s < 1$,
see Theorem 1.8, and $E|Z| = C \ne 0$ by (68),
the equation $E|Z|^{u+it} = 0$ has at most a countable number
of solutions $(u, t)$ in the strip $0 < u < 1$ and $-\infty < t < \infty$.
Indeed, the equation has at most a finite number of solutions in
each compact set; otherwise we would have $Z = 0$ almost surely by the
uniqueness of analytic extension. Therefore one can find $0 < u < 1$ such
that $E|Z|^{u+it} \ne 0$ for all $t \in \mathbb{R}$. For this value of
$u$ from (77) we obtain
\[ E|X|^{u+it} = E|Y|^{u+it} \tag{78} \]
for all real $t$,
which by Theorem 1.8 proves that random variables $X$ and $Y$ have the
same distribution.
[¯]
4.2.2 Proof of Theorem in the general case
The following lemma shows that under assumption (69) all even moments of order less than $p$ match.
Lemma 17
Let $k = [p/2]$. Then (69) implies
\[ E X^{2j} = E Y^{2j} \tag{79} \]
for $j = 0, 1, \dots, k$.
Proof.
For $j \le k$ the derivatives $\frac{\partial^j}{\partial t^j} |tX + Z|^p$ are integrable.
Therefore (79) follows by the consecutive differentiation (under
the integral signs) of the equation $E|tX + Z|^p = E|tY + Z|^p$ at $t = 0$.
[¯] The following is a general version of (76).
Lemma 18
Let $0 < u < p$ be fixed. Then condition
(69) and
\[ E|X|^{u+it} E|Z|^{p-u-it} = E|Y|^{u+it} E|Z|^{p-u-it} \quad \text{for all } t \in \mathbb{R} \tag{80} \]
are equivalent.
Proof. We prove only the implication
(69)$\Rightarrow$(80); we will not use the other one.
Let $k = [p/2]$. The following elementary formula follows by the change of
variable$^{16}$
\[ |a|^p = C_p \int_0^\infty \Bigl( \cos ax - \sum_{j=0}^{k} (-1)^j \frac{a^{2j} x^{2j}}{(2j)!} \Bigr) \frac{dx}{x^{p+1}} \tag{81} \]
for all $a$.
Since our variables are symmetric,
applying (81) to $a = X + aZ$ and $a = Y + aZ$ from (69) and Lemma 4.2.2 we get
\[ \int_0^\infty \frac{(\phi_X(x) - \phi_Y(x)) \phi_Z(ax)}{x^{p+1}} \, dx = 0 \tag{82} \]
and the integral converges absolutely.
Multiplying (82)
by $a^{-p+u+it-1}$, integrating with respect to $a$ in the limits
from 0 to $\infty$ and switching the order of integrals we get
\[ \int_0^\infty \frac{\phi_X(x) - \phi_Y(x)}{x^{p+1}} \int_0^\infty a^{-p+u+it-1} \phi_Z(ax) \, da \, dx = 0. \tag{83} \]
Notice that
\[ \int_0^\infty a^{-p+u+it-1} \phi_Z(ax) \, da = x^{p-u-it} \int_0^\infty b^{-p+u+it-1} \phi_Z(b) \, db \]
\[ = x^{p-u-it} \, \Gamma(-p+u+it) \, E|Z|^{p-u-it}. \]
Therefore (83) implies
\[ \Gamma(-p+u+it) \, \Gamma(-u-it) \, (E|X|^{u+it} - E|Y|^{u+it}) \, E|Z|^{p-u-it} = 0. \]
This shows that identity (80) holds for all values of $t$, except perhaps
for a countable discrete set arising from the zeros of the Gamma function.
Since $E|Y|^z$ is analytic in the strip $-1 < \operatorname{Re} z < p$, this implies
(80) for all $t$.
[¯]
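For $p = 1$ (so that $k = 0$) the representation of $|a|^p$ as a cosine integral can be checked numerically. The following sketch is an illustration under assumptions made here: the integral is truncated at a finite upper limit, the non-oscillating part of the tail is added analytically, and the small oscillating remainder is dropped. The constant $C_1$ is recovered as the reciprocal of the integral at $a = 1$ and is expected to be close to $-2/\pi$.

```python
import math

def cosine_integral(a, p=1.0, upper=200.0, steps=20_000):
    """Approximate J(a) = int_0^inf (cos(a x) - 1) x^(-p-1) dx; intended here for p = 1."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h                       # midpoint rule avoids x = 0
        total += (math.cos(a * x) - 1.0) / x ** (p + 1.0)
    # Tail beyond `upper`: keep the non-oscillating part, drop the small cosine part.
    return total * h - upper ** (-p) / p

c_p = 1.0 / cosine_integral(1.0)                # constant C_p for p = 1
values = {a: c_p * cosine_integral(a) for a in (0.5, 2.0, 3.0)}
print(c_p, values)
```

Each entry of `values` approximates $|a|$ for the corresponding key, which is the scaling behaviour the change of variable predicts.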
Proof of Theorem 4.2 (general case).
The proof of the general case follows the previous argument for p = 1
with (80) replacing (76).
[¯]
4.2.3 Pairs of random variables
Although in general Theorem 4.2 doesn't hold for a pair of
i. i. d. variables, it is possible to obtain a variant for pairs under
additional assumptions.
Braverman [] obtained the
following result.
Theorem 37
Suppose $X, Y$ are i. i. d. and there are
positive $p_1 \ne p_2$ such that $p_1, p_2 \notin 2\mathbb{N}$ and
$E|aX + bY|^{p_j} = C_j (a^2 + b^2)^{p_j/2}$ for all $a, b \in \mathbb{R}$, $j = 1, 2$. Then
$X$ is normal.
Proof of Theorem .
Suppose $0 < p_1 < p_2$. Denote by $Z$ the standard normal $N(0,1)$ random variable
and let
\[ f_p(s) = \frac{E|X|^{p/2+s}}{E|Z|^{p/2+s}}. \]
Clearly $f_p$ is
analytic in the strip $-1 < p/2 + \operatorname{Re} s < p_2$.
For $-p_1/2 < \operatorname{Re} s < p_2/2$ by Lemma 4.2.2 we have
and
Put $r = \frac12 (p_2 - p_1)$. Then $f_{p_2}(s) = f_{p_1}(s + r)$ in the strip
$-p_1/2 < \operatorname{Re} s < p_1/2$. Therefore (85) implies
where to simplify the notation we write f = fp1.
Using now (84) we get
\[ f(r + s) = \frac{C_2}{f(r - s)} = \frac{C_2}{C_1} f(s - r). \tag{86} \]
Equation (86) shows that the function $p(s) := K^s f(s)$,
where $K = (C_1/C_2)^{1/(2r)}$, is periodic with real
period $2r$. Furthermore, since $p_1 > 0$,
$p(s)$ is analytic in the strip of the width strictly larger than $2r$;
thus it extends analytically to $\mathbb{C}$. By Lemma 4.1 this
determines uniquely the Mellin transform of $|X|$. Namely,
Therefore in distribution we have the representation
$$X \cong K\chi Z, \tag{87}$$
where $K$ is a constant, $Z$ is normal $N(0,1)$, and $\chi$ is a $\{0,1\}$-valued random variable, independent of $Z$, such that $P(\chi = 1) = C$.
Clearly, the proof is concluded if $C = 0$ ($X$ being degenerate normal).
If $C \ne 0$ then by (87)
$$[\ldots] = C(1-C)^2(t^2+u^2)^{p/2}E|Z|^p + C(1-C)\bigl(|t|^p+|u|^p\bigr)E|Z|^p. \tag{88}$$
Therefore $C = 1$, which ends the proof.
□
The next result comes from [] and uses stringent
moment conditions; Braverman []
gives examples which imply that the condition on zeros of
the Mellin transform cannot be dropped.
Theorem 38
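Let $X, Y$ be symmetric i. i. d. random variables such that
$$E\exp(\lambda^2|X|^2) < \infty$$
for some $\lambda > 0$, and $E|X|^s \ne 0$ for all $s \in \mathbb{C}$ such that $\Re s > 0$.
Suppose there is a constant $C$ such that for all real $a, b$
$$E|aX+bY| = C(a^2+b^2)^{1/2}.$$
Then $X, Y$ are normal.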
The rest of this section is devoted to the proof of Theorem 4.2.3.
The function $f(s) = E|X|^s$ is analytic in the half-plane $\Re s > 0$. Since $E|Z|^s = \pi^{-1/2}K^s\,\Gamma\bigl(\frac{s+1}{2}\bigr)$, where $K = \pi^{1/2}E|Z| > 0$ and $\Gamma(\cdot)$ is the Euler gamma function, (76) means that $f(s) = \pi^{-1/2}K^s a(s)\,\Gamma\bigl(\frac{s+1}{2}\bigr)$, where $a(s) := \pi^{1/2}K^{-s}f(s)/\Gamma\bigl(\frac{s+1}{2}\bigr)$ is analytic in the half-plane $\Re s > 0$, $a(\bar{s}) = \overline{a(s)}$, and satisfies
$$a(s)\,a(1-s) = 1 \quad \text{for } 0 < \Re s < 1. \tag{89}$$
We shall need the following estimate, in which without loss of generality we may assume $0 < \lambda K < 1$ (choose $\lambda > 0$ small enough).
Lemma 19
There is a constant $C > 0$ such that $|a(s)| \le C|s|\,(\lambda K)^{-\Re s}$ for all $s$ in the half-plane $\Re s \ge 1/2$.
Proof. Since $E\exp(\lambda^2|X|^2) < \infty$ for some $\lambda > 0$, we have $P(|X| \ge t) \le Ce^{-\lambda^2t^2}$, where $C = E\exp(\lambda^2|X|^2)$, see Problem 1.9. This implies
$$|f(s)| \le C_1|s|\,\lambda^{-\Re s}\,\Gamma\bigl(\tfrac12\Re s\bigr), \qquad \Re s > 0. \tag{90}$$
In particular $|a(s)| \le C\exp(o(|s|^2))$, where $o(x)/x \to 0$ as $x \to \infty$.
Consider now the function $u(s) = a(s)(\lambda K)^s/s$, which is analytic in $\Re s > 0$. Clearly $|u(s)| \le C\exp(o(|s|^2))$ as $|s| \to \infty$. Moreover $|u(1/2+it)| \le \text{const}$ for all real $t$ by (89); for all real $x$
$$|u(x)| = \pi^{1/2}x^{-1}\lambda^x f(x)\Big/\Gamma\Bigl(\frac{x+1}{2}\Bigr) \le C_1\,\Gamma\bigl(\tfrac12 x\bigr)\Big/\Gamma\Bigl(\frac{x+1}{2}\Bigr) \le \pi^{1/2}C,$$
by (90). Therefore by the Phragmén-Lindelöf principle, see, e.g., [], applied twice to the angles $-\frac12\pi \le \arg s \le 0$ and $0 \le \arg s \le \frac12\pi$, the Lemma is proved.
□
By Lemma 4.2.1 Theorem 4.2.3 follows from the next result.
Lemma 20
Suppose $X$ is a symmetric random variable satisfying
$$E\exp(\lambda^2|X|^2) < \infty$$
for some $\lambda > 0$, and
$$E|X|^s \ne 0$$
for all $s \in \mathbb{C}$ such that $\Re s > 0$. Let $Z$ be a centered normal random variable such that
$$E|X|^{1/2+it}\,E|X|^{1/2-it} = E|Z|^{1/2+it}\,E|Z|^{1/2-it} \tag{91}$$
for all $t \in \mathbb{R}$.
Then $X$ is normal.
Proof.
We shall use Lemma 4.2.3 to show that $a(s) = C_1C_2^s$ for some real $C_1, C_2 > 0$. It is clear that $a(s) \ne 0$ if $\Re s > 0$. Therefore $b(s) = \log a(s)$ is a well defined function which is analytic in the half-plane $\Re s > 0$. The function $v(s) := \Re(b(1/2-is)) = \log|a(1/2-is)|$ is harmonic in the half-plane $\Im s > -1/2$ and $\limsup_{|s| \to \infty} v(s)/|s| < \infty$ by Lemma 4.2.3. Furthermore by (89) we have $v(t) = 0$ for real $t$.
By the Nevanlinna integral representation, see [],
$$v(x+iy) = \frac{y}{\pi}\int_{-\infty}^{\infty}\frac{v(t)}{(t-x)^2+y^2}\,dt + ky$$
for some real constant $k$ and for all real $x, y$ with $y > 0$. This in particular implies that $b(y+1/2) = \Re(b(y+1/2)) = v(iy) = ky$. Thus by the uniqueness of analytic extension we get $a(s) = C_1C_2^s$ and hence
$$f(s) = \pi^{-1/2}K^sC_1C_2^s\,\Gamma\Bigl(\frac{s+1}{2}\Bigr) \tag{92}$$
for some constants $C_1, C_2$ such that $C_1^2C_2 = 1$ (the latter is a consequence of (89)).
Formula (92) shows that the distribution of $X$ is given by (87).
To exclude the possibility that $P(X = 0) \ne 0$ it remains to verify that $C_1 = 1$.
This again follows from (88).
By Theorem 1.8, the proof is completed.
□
4.3 Infinite spherically symmetric sequences
In this section we present results that hold true for infinite
sequences only and which might fail for finite sequences.
Definition 10
An infinite sequence $X_1, X_2, \dots$ is spherically symmetric if the finite sequence $X_1, X_2, \dots, X_n$ is spherically symmetric for all $n$.
The following provides considerably more information
than Theorem 4.1.
Theorem 39 [[]]
If an infinite sequence $X = (X_1, X_2, \dots)$ is spherically symmetric, then there is a sequence of independent identically distributed Gaussian random variables $\vec{\gamma} = (\gamma_1, \gamma_2, \dots)$ and a non-negative random variable $R$ independent of $\vec{\gamma}$ such that
$$X = R\vec{\gamma}.$$
This result is based on exchangeability.
Definition 11 A sequence $(X_k)$ of random variables is exchangeable if the joint distribution of $X_{\sigma(1)}, X_{\sigma(2)}, \dots, X_{\sigma(n)}$ is the same as the joint distribution of $X_1, X_2, \dots, X_n$ for all $n \ge 1$ and for all permutations $\sigma$ of $\{1, 2, \dots, n\}$.
Clearly, spherical symmetry implies exchangeability.
The following beautiful theorem due to
B. de Finetti []
points out the role of exchangeability in characterizations as a substitute for
independence; for more information and the references see
[].
Theorem 40
Suppose that $X_1, X_2, \dots$ is an infinite exchangeable sequence.
Then there exists a $\sigma$-field $\mathcal{N}$ such that $X_1, X_2, \dots$ are $\mathcal{N}$-conditionally i. i. d., that is
$$P(X_1 < a_1, X_2 < a_2, \dots, X_n < a_n \mid \mathcal{N}) = P(X_1 < a_1 \mid \mathcal{N})\,P(X_1 < a_2 \mid \mathcal{N})\cdots P(X_1 < a_n \mid \mathcal{N})$$
for all $a_1, \dots, a_n \in \mathbb{R}$ and all $n \ge 1$.
Proof. Let $\mathcal{N}$ be the tail $\sigma$-field, i.e.
$$\mathcal{N} = \bigcap_{k=1}^{\infty}\sigma(X_k, X_{k+1}, \dots),$$
and put $\mathcal{N}_k = \sigma(X_k, X_{k+1}, \dots)$. Fix bounded measurable functions $f, g, h$ and denote
$$F_n = f(X_1, \dots, X_n); \quad G_{n,m} = g(X_{n+1}, \dots, X_{m+n}); \quad H_{n,m,N} = h(X_{m+n+N+1}, X_{m+n+N+2}, \dots),$$
where $n, m, N \ge 1$.
Exchangeability implies that
$$E F_nG_{n,m}H_{n,m,N} = E F_nG_{n+r,m}H_{n,m,N}$$
for all $r \le N$.
Since $H_{n,m,N}$ is an arbitrary bounded $\mathcal{N}_{m+n+N+1}$-measurable function, this implies
$$E\{F_nG_{n,m} \mid \mathcal{N}_{m+n+N+1}\} = E\{F_nG_{n+r,m} \mid \mathcal{N}_{m+n+N+1}\}.$$
Passing to the limit as $N \to \infty$, see Theorem 1.4, this gives
$$E\{F_nG_{n,m} \mid \mathcal{N}\} = E\{F_nG_{n+r,m} \mid \mathcal{N}\}.$$
Therefore
$$E\{F_nG_{n,m} \mid \mathcal{N}\} = E\{G_{n+r,m}\,E\{F_n \mid \mathcal{N}_{n+r+1}\} \mid \mathcal{N}\}.$$
Since $E\{F_n \mid \mathcal{N}_{n+r+1}\}$ converges in $L_1$ to $E\{F_n \mid \mathcal{N}\}$ as $r \to \infty$, and since $g$ is bounded,
$$E\{G_{n+r,m}\,E\{F_n \mid \mathcal{N}_{n+r+1}\} \mid \mathcal{N}\}$$
is arbitrarily close (in the $L_1$ norm) to
$$E\{G_{n+r,m}\,E\{F_n \mid \mathcal{N}\} \mid \mathcal{N}\} = E\{F_n \mid \mathcal{N}\}\,E\{G_{n+r,m} \mid \mathcal{N}\}$$
as $r \to \infty$.
By exchangeability $E\{G_{n+r,m} \mid \mathcal{N}\} = E\{G_{n,m} \mid \mathcal{N}\}$ almost surely, which proves that
$$E\{F_nG_{n,m} \mid \mathcal{N}\} = E\{F_n \mid \mathcal{N}\}\,E\{G_{n,m} \mid \mathcal{N}\}.$$
Since $f, g$ are arbitrary, this proves $\mathcal{N}$-conditional independence of the sequence. Using the exchangeability of the sequence once again, one can see that the random variables $X_1, X_2, \dots$ have the same $\mathcal{N}$-conditional distribution and thus the theorem is proved. □
Proof of Theorem 4.3.
Let $\mathcal{N}$ be the tail $\sigma$-field as defined in the proof of Theorem 4.3. By assumption, sequences
$$(2^{-1/2}(X_1+X_2),\ 2^{-1/2}(X_1-X_2),\ X_3, X_4, \dots)$$
are all identically distributed and all have the same tail $\sigma$-field $\mathcal{N}$. Therefore, by Theorem 4.3 random variables $X_1, X_2, \dots$ are $\mathcal{N}$-conditionally independent and identically distributed; moreover, each variable has a symmetric $\mathcal{N}$-conditional distribution and $\mathcal{N}$-conditionally $X_1$ has the same distribution as $2^{-1/2}(X_1+X_2)$.
The rest of the argument repeats the proof of Theorem 3.1.
Namely, consider the conditional characteristic function $\phi(t) = E\{\exp(itX_1) \mid \mathcal{N}\}$.
With probability one $\phi(1)$ is real by $\mathcal{N}$-conditional symmetry of the distribution, and $\phi(t) = (\phi(2^{-1/2}t))^2$. This implies
$$\phi(1) = \bigl(\phi(2^{-n/2})\bigr)^{2^n}$$
almost surely, $n = 0, 1, \dots$.
Since $\phi(2^{-n/2}) \to \phi(0) = 1$ with probability 1, we have $\phi(1) \ne 0$ almost surely.
Therefore on a subset $\Omega_0 \subset \Omega$ of probability $P(\Omega_0) = 1$, we have $\phi(1) = \exp(-R^2)$, where $R^2 \ge 0$ is an $\mathcal{N}$-measurable random variable.
Applying Corollary 2.3 for each fixed $\omega \in \Omega_0$ we get that $\phi(t) = \exp(-t^2R^2)$ for all real $t$.
[¯]
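The functional equation used in this proof, namely that the conditional characteristic function at $t$ equals the square of its value at $2^{-1/2}t$, can be checked numerically. The following is only an illustrative sketch: the value chosen for $R^2$ and the Cauchy characteristic function used for contrast are assumptions made for the example, not part of the argument.

```python
import math

def phi_gauss(t, r2=0.7):
    """Candidate solution exp(-t^2 R^2); r2 is a hypothetical fixed value of R^2."""
    return math.exp(-t * t * r2)

def phi_cauchy(t):
    """Characteristic function exp(-|t|) of the Cauchy law, used only for contrast."""
    return math.exp(-abs(t))

def defect(phi, t):
    """Absolute error in the functional equation phi(t) = phi(t / sqrt(2))^2."""
    return abs(phi(t) - phi(t / math.sqrt(2.0)) ** 2)

ts = [0.1 * k for k in range(1, 31)]
max_gauss = max(defect(phi_gauss, t) for t in ts)
max_cauchy = max(defect(phi_cauchy, t) for t in ts)
print(max_gauss, max_cauchy)
```

The Gaussian form satisfies the equation up to rounding error, while the Cauchy characteristic function visibly violates it.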
The next corollary shows how much simpler the theory of infinite
sequences is, compare Theorem 4.1.
Corollary 14
Let $X = (X_1, X_2, \dots)$ be an infinite spherically symmetric sequence such that $E|X_k|^a < \infty$ for some real $a > 0$ and all $k = 1, 2, \dots$. Suppose that for some $m \ge 1$
$$E\{\|(X_1, \dots, X_m)\|^a \mid (X_{m+1}, X_{m+2}, \dots)\} = \text{const}. \tag{94}$$
Then $X$ is Gaussian.
Proof. From Theorem 4.3 it follows that
$$E\{\|(X_1, \dots, X_m)\|^a \mid (X_{m+1}, X_{m+2}, \dots)\} = E\{R^a\|(\gamma_1, \dots, \gamma_m)\|^a \mid (X_{m+1}, X_{m+2}, \dots)\}.$$
However, $R$ is measurable with respect to the tail $\sigma$-field, and hence it also is $\sigma(X_{m+1}, X_{m+2}, \dots)$-measurable for all $m$. Therefore
$$\begin{aligned}
E\{\|(X_1, \dots, X_m)\|^a \mid (X_{m+1}, X_{m+2}, \dots)\}
&= R^a E\{\|(\gamma_1, \dots, \gamma_m)\|^a \mid R(\gamma_{m+1}, \gamma_{m+2}, \dots)\} \\
&= R^a E\{E\{\|(\gamma_1, \dots, \gamma_m)\|^a \mid R, (\gamma_{m+1}, \gamma_{m+2}, \dots)\} \mid R(\gamma_{m+1}, \gamma_{m+2}, \dots)\}.
\end{aligned}$$
Since $R$ and $\vec{\gamma}$ are independent, we finally get
$$E\{\|(X_1, \dots, X_m)\|^a \mid (X_{m+1}, X_{m+2}, \dots)\} = R^a E\{\|(\gamma_1, \dots, \gamma_m)\|^a \mid (\gamma_{m+1}, \gamma_{m+2}, \dots)\} = C_aR^a.$$
Using now (94) we have $R = \text{const}$ almost surely and hence $X$ is Gaussian.
□
The following corollary of Theorem 4.3 deals with exponential distributions as defined in Section 3.5. Diaconis & Freedman [] have a dozen de Finetti-style results, including this one.
Theorem 41
If $X = (X_1, X_2, \dots)$ is an infinite sequence of non-negative random variables such that the random variable $\min\{X_1/a_1, \dots, X_n/a_n\}$ has the same distribution as $(a_1+\dots+a_n)^{-1}X_1$ for all $n$ and all $a_1, \dots, a_n > 0$, then $X = \Lambda\vec{\epsilon}$, where $\Lambda$ and $\vec{\epsilon}$ are independent random variables and $\vec{\epsilon} = (\epsilon_1, \epsilon_2, \dots)$ is a sequence of independent identically distributed exponential random variables.
Sketch of the proof:
Combine Theorem 3.4 with Theorem 4.3 to get the result for the pair $X_1, X_2$.
Use the reasoning from the proof of Theorem 3.4 to get the representation for any finite sequence $X_1, \dots, X_n$, see also Proposition 3.5.
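The hypothesis of the theorem is easy to observe for i. i. d. exponential variables, where the minimum of the rescaled variables is again exponential with the rates added. The sketch below is a Monte Carlo illustration only; the weights, sample size, and seed are hypothetical choices, and it compares just the means of the two distributions rather than the full laws.

```python
import random

def sample_min(weights, rng):
    """One draw of min{X_1/a_1, ..., X_n/a_n} for i.i.d. standard exponential X_i."""
    return min(rng.expovariate(1.0) / a for a in weights)

def mean_of_min(weights, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E min{X_1/a_1, ..., X_n/a_n}."""
    rng = random.Random(seed)
    return sum(sample_min(weights, rng) for _ in range(n_samples)) / n_samples

weights = [0.5, 1.5, 2.0]        # hypothetical a_1, a_2, a_3
estimate = mean_of_min(weights)
expected = 1.0 / sum(weights)    # mean of X_1 / (a_1 + a_2 + a_3) when E X_1 = 1
print(estimate, expected)
```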
4.4 Problems
Problem 38
Prove the converse of (57). Namely, if $\phi(s)$ is the characteristic function of a one-dimensional random variable, then there is a spherically symmetric $(X_1, \dots, X_n)$ such that $\phi(\|t\|)$ is its characteristic function.
Problem 39
For centered bivariate normal r. v. $X, Y$ with variances 1 and correlation coefficient $\rho$ (see Example 2.2), show that $E\{|X|\,|Y|\} = \frac{2}{\pi}\bigl(\sqrt{1-\rho^2}+\rho\arcsin\rho\bigr)$.
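Before attempting the proof, the closed form can be sanity-checked by simulation. The following sketch is not a solution of the problem; the correlation value, sample size, and seed are arbitrary assumptions, and the pair is generated as $Y = \rho X + \sqrt{1-\rho^2}\,W$ with $W$ an independent standard normal variable.

```python
import math
import random

def closed_form(rho):
    """Right-hand side (2/pi) * (sqrt(1 - rho^2) + rho * arcsin(rho))."""
    return (2.0 / math.pi) * (math.sqrt(1.0 - rho * rho) + rho * math.asin(rho))

def monte_carlo(rho, n_samples=200_000, seed=1):
    """Estimate E|X||Y| for a standard bivariate normal pair with correlation rho."""
    rng = random.Random(seed)
    c = math.sqrt(1.0 - rho * rho)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        w = rng.gauss(0.0, 1.0)
        y = rho * x + c * w      # Var(y) = 1 and Cov(x, y) = rho
        total += abs(x * y)
    return total / n_samples

rho = 0.6                        # hypothetical correlation for the check
print(monte_carlo(rho), closed_form(rho))
```

Two limiting cases are immediate: $\rho = 0$ gives $(E|X|)^2 = 2/\pi$, and $\rho = 1$ gives $EX^2 = 1$.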
Problem 40
Let $X, Y$ be i. i. d. random variables with the probability density function defined by $f(x) = C|x|^{-3}\exp(-1/x^2)$, where $C$ is a normalizing constant, and $x \in \mathbb{R}$. Show that for any choice of $a, b \in \mathbb{R}$ we have
$$E|aX+bY| = K(a^2+b^2)^{1/2},$$
where $K = E|X|$.
Problem 41
Using the methods used in
the proof of Theorem 4.2 for p = 1 prove the following.
Theorem 42 Let $X, Y, Z \ge 0$ be i. i. d. and integrable random variables. Suppose that there is a constant $C \ne 0$ such that $E\min\{X/a, Y/b, Z/c\} = C/(a+b+c)$ for all $a, b, c > 0$.
Then $X, Y, Z$ are exponential.
Problem 42 [deterministic analogue of Theorem 4.2]
Show that if $X, Y$ are independent with the same distribution, and $E|aX+bY| = 0$ for some $a, b \ne 0$, then $X, Y$ are non-random.
Chapter 5 Independent linear forms
In this chapter the property of interest is the independence
of linear forms in independent random variables. In Section
we give a characterization
result that is both simple to state and to prove;
it is nevertheless of considerable interest.
Section parallels Section 3.2.
We use
the characteristic property of the normal distribution
to define abstract group-valued Gaussian random variables.
In this broader context we again
obtain the zero-one law;
we also prove an important result about the existence of exponential
moments. In Section we return to characterizations,
generalizing Theorem . We
show that the stochastic independence of arbitrary two linear forms
characterizes the normal distribution.
We conclude the chapter with abstract Gaussian
results when all forces are joined.
5.1 Bernstein's theorem
The following result due to Bernstein
[]
characterizes normal distribution by the independence
of the sum and the difference of two independent random variables.
A more general but also more difficult result is stated in Theorem below.
An early precursor is Narumi [], who proves a variant of Problem . The elementary proof below is adapted from Feller [].
Theorem 43
If $X_1, X_2$ are independent random variables such that $X_1+X_2$ and $X_1-X_2$ are independent, then $X_1$ and $X_2$ are normal.
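A quick way to see why non-normal laws fail the hypothesis, at least for i. i. d. centered variables with finite fourth moment: writing $U = X_1+X_2$, $V = X_1-X_2$, one has $UV = X_1^2-X_2^2$, so $\mathrm{Cov}(U^2, V^2) = 2\mu_4 - 6\sigma^4$, where $\sigma^2$ and $\mu_4$ denote the variance and the fourth moment. This vanishes exactly when $\mu_4 = 3\sigma^4$. The sketch below evaluates the formula; the function name is hypothetical, and vanishing covariance is of course only a necessary condition for independence, not the theorem itself.

```python
from fractions import Fraction

def cov_of_squares(variance, fourth_moment):
    """Cov((X1+X2)^2, (X1-X2)^2) for i.i.d. centered X1, X2: 2*mu4 - 6*sigma^4."""
    return 2 * fourth_moment - 6 * variance ** 2

# Uniform law on (-1, 1): variance 1/3, fourth moment 1/5.
uniform_cov = cov_of_squares(Fraction(1, 3), Fraction(1, 5))
# Standard normal law: variance 1, fourth moment 3.
normal_cov = cov_of_squares(Fraction(1), Fraction(3))
print(uniform_cov, normal_cov)
```

For the uniform law the covariance is non-zero, so the sum and the difference cannot be independent; for the normal law the obstruction disappears.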
The next result
is an elementary version of Theorem 2.5.
Lemma 21
If $X, Z$ are independent random variables such that $Z$ and $X+Z$ are normal, then $X$ is normal.
Indeed, the characteristic function $\phi$ of random variable $X$ satisfies
$$\phi(t)\exp(-(t-m)^2/\sigma^2) = \exp(-(t-M)^2/S^2)$$
for some constants $m, M, \sigma, S$. Therefore $\phi(t) = \exp(at^2+bt+c)$, for some real constants $a, b, c$, and by Proposition 2.1, $\phi$ corresponds to the normal distribution.
Lemma 22
If $X, Z$ are independent random variables and $Z$ is normal, then $X+Z$ has a non-vanishing probability density function which has derivatives of all orders.
Proof. Assume for simplicity that $Z$ is $N(0, 2^{-1/2})$. Consider $f(x) = E\exp(-(x-X)^2)$.
Then $f(x) \ne 0$ for each $x$, and since each derivative $\frac{d^k}{dy^k}\exp(-(y-X)^2)$ is bounded uniformly in variables $y, X$, $f(\cdot)$ has derivatives of all orders. It remains to observe that $\pi^{-1/2}f(\cdot)$ is the probability density function of $X+Z$. This is easily verified using the cumulative distribution function:
$$\begin{aligned}
P(X+Z \le t) &= \pi^{-1/2}\int_{-\infty}^{\infty}\exp(-z^2)\int_\Omega I_{X \le t-z}\,dP\,dz \\
&= \pi^{-1/2}\int_\Omega\Bigl\{\int_{-\infty}^{\infty}\exp(-z^2)\,I_{z+X \le t}\,dz\Bigr\}\,dP \\
&= \pi^{-1/2}\int_\Omega\Bigl\{\int_{-\infty}^{\infty}\exp(-(y-X)^2)\,I_{y \le t}\,dy\Bigr\}\,dP \\
&= \pi^{-1/2}\int_{-\infty}^{t}E\exp(-(y-X)^2)\,dy.
\end{aligned}$$
□
Proof of Theorem 5.1. Let $Z_1, Z_2$ be i. i. d. normal random variables, independent of the $X$'s. Then random variables $Y_k = X_k+Z_k$, $k = 1, 2$, satisfy the assumptions of the theorem, cf. Theorem 2.2. Moreover, by Lemma 5.1, each of the $Y_k$'s has a smooth non-zero probability density function $f_k(x)$, $k = 1, 2$. The joint density of the pair $Y_1+Y_2$, $Y_1-Y_2$ is $\frac12 f_1\bigl(\frac{x+y}{2}\bigr)f_2\bigl(\frac{x-y}{2}\bigr)$ and by assumption it factors into the product of two functions, the first being a function of $x$ only, and the other being a function of $y$ only.
Therefore the logarithms $Q_k(x) := \log f_k(\frac12 x)$, $k = 1, 2$, are twice differentiable and satisfy
$$Q_1(x+y)+Q_2(x-y) = a(x)+b(y) \tag{95}$$
for some twice differentiable functions $a, b$ (actually $a = Q_1+Q_2$).
Taking the mixed second order derivative of (95) we obtain
$$Q_1''(x+y)-Q_2''(x-y) = 0. \tag{96}$$
Taking $x = y$ this shows that $Q_1''(x) = \text{const}$. Similarly taking $x = -y$ in (96) we get that $Q_2''(x) = \text{const}$.
Therefore $Q_k(2x) = A_k+B_kx+C_kx^2$, and hence $f_k(x) = \exp(A_k+B_kx+C_kx^2)$, $k = 1, 2$.
As a probability density function, $f_k$ has to be integrable, $k = 1, 2$.
Thus $C_k < 0$, and then $A_k = \frac12\log(-C_k/\pi)+B_k^2/(4C_k)$ is determined uniquely from the condition that $\int f_k(x)\,dx = 1$.
Thus $f_k(x)$ is a normal density and $Y_1, Y_2$ are normal.
By Lemma 5.1 the theorem is proved.
□
5.2 Gaussian distributions on groups
In this section we shall see that the conclusion of
Theorem 5.1 is related to integrability just as
the conclusion of
Theorem 3.1 is related to the fact that the normal
distribution is a limit distribution for sums of i. i. d.
random variables, see Problem 3.6.
Let $\mathbb{G}$ be a group with a $\sigma$-field $\mathcal{F}$ such that the group operation $x, y \to x+y$ is a measurable transformation $(\mathbb{G}\times\mathbb{G}, \mathcal{F}\otimes\mathcal{F}) \to (\mathbb{G}, \mathcal{F})$.
Let $(\Omega, \mathcal{M}, P)$ be a probability space.
A measurable function $X: (\Omega, \mathcal{M}) \to (\mathbb{G}, \mathcal{F})$ is called a $\mathbb{G}$-valued random variable and its distribution is called a probability measure on $\mathbb{G}$.
Example 11
Let $\mathbb{G} = \mathbb{R}^d$ be the vector space of all real $d$-tuples with vector addition as the group operation and with the usual Borel $\sigma$-field $\mathcal{B}$. Then a $\mathbb{G}$-valued random variable determines a probability distribution on $\mathbb{R}^d$.
Example 12
Let $\mathbb{G} = S^1$ be the group of all complex numbers $z$ such that $|z| = 1$ with multiplication as the group operation and with the usual Borel $\sigma$-field $\mathcal{F}$ generated by open sets. A distribution of a $\mathbb{G}$-valued random variable is called a probability measure on $S^1$.
Definition 12
A $\mathbb{G}$-valued random variable $X$ is $\mathcal{I}$-Gaussian (the letter $\mathcal{I}$ stands here for independence) if random variables $X+X'$ and $X-X'$, where $X'$ is an independent copy of $X$, are independent.
Clearly, any vector space is an Abelian group with vector addition as the group operation. In particular, we now have two possibly distinct notions of Gaussian vectors: the $\mathcal{E}$-Gaussian vectors introduced in Section 3.2 and the $\mathcal{I}$-Gaussian vectors introduced in this section. In general, it seems not to be known when the two definitions coincide; [] gives related examples that satisfy suitable versions of the 2-stability condition (as in our definition of $\mathcal{E}$-Gaussian) without being $\mathcal{I}$-Gaussian.
Let us first check that at least in some simple situations both definitions give the same result.
Example 5.2 (continued) If $\mathbb{G} = \mathbb{R}^d$ and $X$ is an $\mathbb{R}^d$-valued $\mathcal{I}$-Gaussian random variable, then for all $a_1, a_2, \dots, a_d \in \mathbb{R}$ the one-dimensional random variable $a_1X(1)+a_2X(2)+\dots+a_dX(d)$ has the normal distribution. This means that $X$ is a Gaussian vector in the usual sense, and in this case the definitions of $\mathcal{I}$-Gaussian and $\mathcal{E}$-Gaussian random variables coincide. Indeed, by Theorem 5.1, if $L: \mathbb{G} \to \mathbb{R}$ is a measurable homomorphism, then the $\mathbb{R}$-valued random variable $L(X)$ is normal.
In many situations of interest the reasoning that we applied to $\mathbb{R}^d$ can be repeated and both the definitions are consistent with the usual interpretation of the Gaussian distribution.
An important example is the vector space $C[0, 1]$ of all continuous functions on the unit interval.
To some extent, the notion of $\mathcal{I}$-Gaussian variable is more versatile. It has wider applicability because less algebraic structure is required. Also there is some flexibility in the choice of the linear forms; the particular linear combination $X+X'$ and $X-X'$ seems to be quite arbitrary, although it might be a bit simpler for algebraic manipulations, compare the proofs of Theorem and Lemma below.
This is quite different from Section 3.2; it is known, see [], that even in the real case not every pair of linear forms could be used to define an $\mathcal{E}$-Gaussian random variable.
Besides, $\mathcal{I}$-Gaussian variables satisfy the following variant of the $\mathcal{E}$-condition.
In analogy with Section 3.2, for any $\mathbb{G}$-valued random variable $X$ we may say that $X$ is $\mathcal{E}'$-Gaussian if $2X$ has the same distribution as $X_1+X_2+X_3+X_4$, where $X_1, X_2, X_3, X_4$ are four independent copies of $X$. Any symmetric $\mathcal{I}$-Gaussian random variable is always $\mathcal{E}'$-Gaussian in the above sense, compare Problem . This observation allows us to repeat the proof of Theorem 3.2 in the $\mathcal{I}$-Gaussian case, proving the zero-one law. For simplicity, we chose to consider only random variables with values in a vector space $\mathbb{V}$; the notation $2^nx$ makes sense also for groups, and the reader may want to check what goes wrong with the argument below for non-Abelian groups.
Theorem 44
If $X$ is a $\mathbb{V}$-valued $\mathcal{I}$-Gaussian random variable and $\mathbb{L}$ is a linear measurable subspace of $\mathbb{V}$, then $P(X \in \mathbb{L})$ is either 0, or 1.
Indeed, let X1, ¼, Xn, ¼ be independent copies of
X, taken to be also independent from X. Recurrently we see that
X1+¼+ X4n and 2nX have the same distribution
for all n ³ 1. Since IL is a linear subspace of \sf V,
we have P(X1+¼+ X4n Î IL) = P(X Î IL).
Put Z = X1+X2+X3+ X4. Since
X1+¼+ X4n+1 has the same distribution as
Z+2nX, therefore P(Z+2nX Î IL) = P(X Î IL)
does not depend on n. As in the proof of Theorem 3.2,
define events An = {Z\not Î IL}Ç{Z+2nX Î IL}.
It is again easily verified
that events {An}n ³ 1 are disjoint; therefore
P(An) = P(X Î IL)P(X\not Î IL) = 0.
[¯]
The main result of this section, Theorem , needs additional notation. This notation is natural for linear spaces. Let $\mathbb{G}$ be a group with a translation invariant metric $d(x, y)$, i.e. suppose $d(x+z, y+z) = d(x, y)$ for all $x, y, z \in \mathbb{G}$.
Such a metric $d(\cdot, \cdot)$ is uniquely defined by the function $x \to D(x) := d(x, 0)$. Moreover, it is easy to see that $D(x)$ has the following properties: $D(x) = D(-x)$ and $D(x+y) \le D(x)+D(y)$ for all $x, y \in \mathbb{G}$.
Indeed, by translation invariance $D(-x) = d(-x, 0) = d(0, x) = d(x, 0)$ and $D(x+y) = d(x+y, 0) \le d(x+y, y)+d(y, 0) = D(x)+D(y)$.
Theorem 45
Let $\mathbb{G}$ be a group with a measurable translation invariant metric $d(\cdot, \cdot)$. If $X$ is an $\mathcal{I}$-Gaussian $\mathbb{G}$-valued random variable, then $E\exp\lambda d(X, 0) < \infty$ for some $\lambda > 0$.
More information can be gained in concrete situations. To mention one such example of great importance, consider a $C[0, 1]$-valued $\mathcal{I}$-Gaussian random variable, i.e. a Gaussian stochastic process with continuous trajectories. Theorem 5.2 says that
$$E\exp\lambda\Bigl(\sup_{0 \le t \le 1}|X(t)|\Bigr) < \infty$$
for some $\lambda > 0$. On the other hand, $C[0, 1]$ is a normed space and another (equivalent) definition applies; Theorem below implies a stronger integrability property
$$E\exp\lambda\Bigl(\sup_{0 \le t \le 1}|X(t)|^2\Bigr) < \infty$$
for some $\lambda > 0$. However, even the weaker conclusion of Theorem 5.2 implies that the real random variable $\sup_{0 \le t \le 1}|X(t)|$ has a moment generating function and that all its moments are finite. Lemma below is another application of the same line of reasoning.
Proof of Theorem 5.2. Consider a real function $N(x) := P(D(X) \ge x)$, where as before $D(x) := d(x, 0)$.
We shall show that there is $x_0$ such that
$$N(2x) \le 8N^2(x-x_0) \tag{97}$$
for each $x \ge x_0$. By Corollary 1.3 this will end the proof.
Let $X_1, X_2$ be independent copies of $X$.
Inequality (97) follows from the fact that the event $\{D(X_1) \ge 2x\}$ implies that either the event $\{D(X_1) \ge 2x\}\cap\{D(X_2) \ge 2x_0\}$, or the event $\{D(X_1+X_2) \ge 2(x-x_0)\}\cap\{D(X_1-X_2) \ge 2(x-x_0)\}$ occurs.
Indeed, let $x_0$ be such that $P(D(X_2) \ge 2x_0) \le 1/2$.
If $D(X_1) \ge 2x$ and $D(X_2) < 2x_0$ then $D(X_1 \pm X_2) \ge D(X_1)-D(X_2) \ge 2(x-x_0)$.
Therefore using independence and the trivial bound $P(D(X_1+X_2) \ge 2a) \le P(D(X_1) \ge a)+P(D(X_2) \ge a)$, we obtain
$$\begin{aligned}
P(D(X_1) \ge 2x) &\le P(D(X_1) \ge 2x)\,P(D(X_2) \ge 2x_0) + P(D(X_1+X_2) \ge 2(x-x_0))\,P(D(X_1-X_2) \ge 2(x-x_0)) \\
&\le \tfrac12 N(2x) + 4N^2(x-x_0)
\end{aligned}$$
for each $x \ge x_0$.
□
More theory of Gaussian distributions on groups can be developed when more structure is available, although technical difficulties arise; for instance, the Cramér theorem (Theorem 2.5) fails on the torus, see Marcinkiewicz []. Series expansion questions (cf. Theorem 2.2 and the remark preceding Theorem ) are studied in [], see also references therein. One can also study Gaussian distributions on normed vector spaces. In Section below we shall see to what extent this extra structure is helpful for the integrability question; there are deep questions specific to this situation, such as what are the properties of the distribution of the real r. v. $\|X\|$; see [].
Another research subject, entirely left out from this book, is Gaussian distributions on Lie groups; for more information see e.g. [].
Further information about abstract Gaussian random variables can be found also in [,,,].
5.3 Independence of linear forms
The next result generalizes Theorem to more
general linear forms of a given independent sequence
X1, ¼, Xn. An even more general result that admits
also zero coefficients in linear forms,
was obtained
independently by Darmois [] and
Skitovich [].
Multi-dimensional variants of Theorem are also known,
see []. Banach space version of Theorem
was proved in [].
Theorem 46
If $X_1, \dots, X_n$ is a sequence of independent random variables such that the linear forms $\sum_{k=1}^n a_kX_k$ and $\sum_{k=1}^n b_kX_k$ have all non-zero coefficients and are independent, then random variables $X_k$ are normal for all $1 \le k \le n$.
Our proof of Theorem 5.3 uses additional information
about the existence of moments, which then allows us to use an
argument from [] (see also
[]).
Notice that we don't allow for vanishing coefficients; the latter case is covered by [] but the proof is considerably more involved.
We need a suitable generalization of Theorem 5.2,
which for simplicity we state here for real valued random variables only.
The method of proof seems also to work in more general context
under the assumption of independence of certain nonlinear statistics,
compare [],
[] and
Lemma below.
Lemma 23
Let $a_1, \dots, a_n$, $b_1, \dots, b_n$ be two sequences of non-zero real numbers. If $X_1, \dots, X_n$ is a sequence of independent random variables such that two linear forms $\sum_{k=1}^n a_kX_k$ and $\sum_{k=1}^n b_kX_k$ are independent, then random variables $X_k$, $k = 1, 2, \dots, n$, have finite moments of all orders.
Proof. We shall repeat the idea from the proof of Theorem 5.2 with suitable technical modifications. Suppose that $0 < \epsilon \le |a_k|, |b_k| \le K < \infty$ for $k = 1, 2, \dots, n$.
For $x \ge 0$ denote $N(x) := \max_{j \le n}P(|X_j| \ge x)$ and let $C = 2nK/\epsilon$. For $1 \le j \le n$ we have trivially
$$P(|X_j| \ge Cx) \le P(|X_j| \ge Cx,\ |X_k| \le x\ \forall k \ne j) + \sum_{k \ne j}P(|X_j| \ge x)\,P(|X_k| \ge x).$$
Notice that the event $A_j := \{|X_j| \ge Cx\}\cap\{|X_k| \le x\ \forall k \ne j\}$ implies that both $|\sum_{k=1}^n a_kX_k| \ge nKx$ and $|\sum_{k=1}^n b_kX_k| \ge nKx$. Indeed,
$$\Bigl|\sum_{k=1}^n a_kX_k\Bigr| \ge |X_j|\,|a_j| - \sum_{k \ne j}|a_kX_k| \ge (\epsilon C-nK)x = nKx$$
and the second inclusion follows analogously.
By independence of the linear forms this shows that
$$P(|X_j| \ge Cx) \le P\Bigl(\Bigl|\sum_{k=1}^n a_kX_k\Bigr| \ge nKx\Bigr)\,P\Bigl(\Bigl|\sum_{k=1}^n b_kX_k\Bigr| \ge nKx\Bigr) + \sum_{k \ne j}P(|X_j| \ge x)\,P(|X_k| \ge x).$$
Therefore $N(Cx) \le P(|\sum_{k=1}^n a_kX_k| \ge nKx)\,P(|\sum_{k=1}^n b_kX_k| \ge nKx)+nN^2(x)$.
Using the trivial bound
$$P\Bigl(\Bigl|\sum_{k=1}^n a_kX_k\Bigr| \ge nKx\Bigr) \le nN(x),$$
we get
$$N(Cx) \le (n^2+n)N^2(x).$$
Corollary 1.3 now ends the proof.
□
Proof of Theorem 5.3.
We shall begin with reducing the theorem to the case with more information about the coefficients of the linear forms. Namely, we shall reduce the proof to the case when all $a_k = 1$, and all $b_k$ are different.
Since all $a_k$ are non-zero, normality of $X_k$ is equivalent to normality of $a_kX_k$; hence passing to $X_k' = a_kX_k$, we may assume that $a_k = 1$, $1 \le k \le n$. Then, as the second step of the reduction, without loss of generality we may assume that all $b_j$'s are different. Indeed, if, e.g., $b_1 = b_2$, then substituting $X_1' = X_1+X_2$ we get $(n-1)$ independent random variables $X_1', X_3, X_4, \dots, X_n$ which still satisfy the assumptions of Theorem 5.3; and if we manage to prove that $X_1'$ is normal, then by Theorem 2.5 the original random variables $X_1, X_2$ are normal, too.
The reduction argument allows us to assume without loss of generality that $a_k = 1$, $1 \le k \le n$, and that $b_1, b_2, \dots, b_n$ are non-zero and pairwise different. In particular, the coefficients of linear forms satisfy the assumption of Lemma 5.3. Therefore random variables $X_1, \dots, X_n$ have finite moments of all orders and linear forms $\sum_{k=1}^n X_k$ and $\sum_{k=1}^n b_kX_k$ are independent.
The joint characteristic function of $\sum_{k=1}^n X_k$, $\sum_{k=1}^n b_kX_k$ is
$$\phi(t, s) = \prod_{k=1}^n \phi_k(t+b_ks),$$
where $\phi_k$ is the characteristic function of random variable $X_k$, $k = 1, \dots, n$.
By independence of linear forms $\phi(t, s)$ factors
$$\phi(t, s) = \Psi_1(t)\,\Psi_2(s).$$
Hence
$$\prod_{k=1}^n \phi_k(t+b_ks) = \Psi_1(t)\,\Psi_2(s). \tag{98}$$
Passing to the logarithms $Q_k = \log\phi_k$ in a neighborhood of 0, from (98) we obtain
$$\sum_{k=1}^n Q_k(t+b_ks) = \omega_1(t)+\omega_2(s). \tag{99}$$
By Lemma 5.3 functions $Q_k$ and $\omega_j$ have derivatives of all orders, see Theorem 1.5.
Consecutive differentiation of (99) with respect to variable $s$ at $s = 0$ leads to the following system of equations
$$\sum_{k=1}^n b_k^rQ_k^{(r)}(t) = \omega_2^{(r)}(0), \qquad r = 1, 2, \dots, n. \tag{100}$$
Differentiation with respect to $t$ gives now
$$\begin{aligned}
\sum_{k=1}^n b_k^rQ_k^{(n)}(t) &= 0, \qquad r = 1, 2, \dots, n-1, \\
\sum_{k=1}^n b_k^nQ_k^{(n)}(t) &= \text{const}
\end{aligned} \tag{101}$$
(clearly, the last equation was not differentiated).
Equations (101) form a system of linear equations for the unknown values $Q_k^{(n)}(t)$, $1 \le k \le n$.
Since all $b_j$ are non-zero and different, the determinant of the system is non-zero.
The unique solution $Q_k^{(n)}(t)$ of the system is $Q_k^{(n)}(t) = \text{const}_k$ and does not depend on $t$.
This means that in a neighborhood of 0 each of the characteristic functions $\phi_k(\cdot)$ can be written as $\phi_k(t) = \exp(P_k(t))$, where $P_k$ is a polynomial of at most $n$-th degree.
Theorem 2.5 now concludes the proof.
[¯]
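The non-vanishing of the determinant can be made concrete. The coefficient matrix of the system has entries $b_k^r$, $r = 1, \dots, n$, so factoring $b_k$ out of the $k$-th column leaves a Vandermonde matrix, and the determinant equals $\prod_k b_k \prod_{i<j}(b_j-b_i)$. The sketch below checks this identity in exact arithmetic; the helper names and the sample coefficients are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def system_matrix(b):
    """Matrix with entry b_k^r in row r = 1..n and column k."""
    n = len(b)
    return [[Fraction(bk) ** r for bk in b] for r in range(1, n + 1)]

def det(matrix):
    """Determinant by Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result          # a row swap flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result

def product_formula(b):
    """prod_k b_k * prod_{i<j} (b_j - b_i)."""
    value = Fraction(1)
    for bk in b:
        value *= Fraction(bk)
    for i in range(len(b)):
        for j in range(i + 1, len(b)):
            value *= Fraction(b[j]) - Fraction(b[i])
    return value

b = [1, -2, 3, Fraction(1, 2)]        # hypothetical non-zero, distinct coefficients
print(det(system_matrix(b)), product_formula(b))
```

A repeated or zero coefficient makes the product vanish, which is why the reduction to non-zero, pairwise different $b_k$ is carried out first.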
Remark: Additional integrability information was used to solve equation
(99). In general equation (99) has the same solution
but the proof is
more difficult, see
[].
5.4 Strongly Gaussian vectors
Following Fernique, we give yet another definition of a Gaussian random variable.
Let $\mathbb{V}$ be a linear space and let $X$ be a $\mathbb{V}$-valued random variable. Denote by $X'$ an independent copy of $X$.
Definition 13 $X$ is $\mathcal{S}$-Gaussian ($\mathcal{S}$ stands here for strong) if for all real $a$ random variables $\cos(a)X'+\sin(a)X$ and $\sin(a)X'-\cos(a)X$ are independent and have the same distribution as $X$.
Clearly any $\mathcal{S}$-Gaussian random vector is both $\mathcal{I}$-Gaussian and $\mathcal{E}$-Gaussian, which motivates the adjective ``strong''.
Let us quickly show how Theorems 3.2 and 5.2
can be obtained for S-Gaussian vectors. The proofs follow
Fernique [].
Theorem 47
If $X$ is a $\mathbb{V}$-valued $\mathcal{S}$-Gaussian random variable and $\mathbb{L}$ is a linear measurable subspace of $\mathbb{V}$, then $P(X \in \mathbb{L})$ is either equal to 0, or to 1.
Proof. Let $X, X'$ be independent copies of $X$.
For each $0 < a < \pi/2$, let $X_a = \cos(a)X+\sin(a)X'$, and consider the event
$$A(a) = \{\omega: X_a(\omega) \in \mathbb{L}\}\cap\{X_{\pi/2-a}(\omega) \notin \mathbb{L}\}.$$
Clearly $P(A(a)) = P(X \in \mathbb{L})\,P(X \notin \mathbb{L})$.
Moreover, it is easily seen that $\{A(a)\}_{0 < a < \pi/2}$ are pairwise disjoint events. Indeed, if $A(a)\cap A(b) \ne \emptyset$, then we would have vectors $v, w$ such that $\cos(a)v+\sin(a)w \in \mathbb{L}$, $\cos(b)v+\sin(b)w \in \mathbb{L}$, which for $a \ne b$ implies that $v, w \in \mathbb{L}$. This contradicts $\cos(\pi/2-a)v+\sin(\pi/2-a)w \notin \mathbb{L}$.
Therefore $P(A(a)) = 0$ for each $a$ and in particular $P(X \in \mathbb{L})\,P(X \notin \mathbb{L}) = 0$, which ends the proof.
□
The next result is taken from Fernique [].
It strengthens considerably the conclusion of Theorem 3.2.2.
Theorem 48
Let $\mathbb{V}$ be a normed linear space with the measurable norm $\|\cdot\|$. If $X$ is an $\mathcal{S}$-Gaussian $\mathbb{V}$-valued random variable, then there is $\epsilon > 0$ such that $E\exp(\epsilon\|X\|^2) < \infty$.
Proof. As previously, let $N(x) := P(\|X\| \ge x)$.
Let $X_1, X_2$ be independent copies of $X$.
It follows from the definition that
$$\|X_1\|,\ \|X_2\|$$
and
$$2^{-1/2}\|X_1+X_2\|,\ 2^{-1/2}\|X_1-X_2\|$$
are two pairs of independent copies of $\|X\|$.
Therefore for any $0 \le y \le x$ we have the following estimate
$$\begin{aligned}
N(x) &= P(\|X_1\| \ge x,\ \|X_2\| \ge y)+P(\|X_1\| \ge x,\ \|X_2\| < y) \\
&\le N(x)N(y)+P(\|X_1+X_2\| \ge x-y)\,P(\|X_1-X_2\| \ge x-y).
\end{aligned}$$
Thus
$$N(x) \le N(x)N(y)+N^2(2^{-1/2}(x-y)). \tag{102}$$
Take $x_0$ such that $N(x_0) \le 1/2$.
Substituting $t = \sqrt{2}\,x$ in (102) we get
for each $t \ge t_0$.
This is similar to, but more precise than (97).
Corollary 1.3 ends the proof.
[¯]
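Inequality (102) can be observed numerically in the simplest case $\mathbb{V} = \mathbb{R}$ with $X$ standard normal, where $N(x) = P(|X| \ge x)$ is given by the complementary error function. The sketch below evaluates the slack of (102) on a grid; the grid and the function names are assumptions made for the illustration, and a finite grid check is of course no substitute for the proof.

```python
import math

def tail(x):
    """N(x) = P(|Z| >= x) for a standard normal Z."""
    return math.erfc(x / math.sqrt(2.0))

def slack(x, y):
    """Right-hand side of (102) minus its left-hand side."""
    return tail(x) * tail(y) + tail((x - y) / math.sqrt(2.0)) ** 2 - tail(x)

grid = [0.1 * k for k in range(0, 61)]
worst = min(slack(x, y) for x in grid for y in grid if y <= x)
print(worst)
```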
5.5 Joint distributions
Suppose $X_1, \dots, X_n$, $n \ge 1$, are (possibly dependent) random variables such that the joint distribution of $n$ linear forms $L_1, L_2, \dots, L_n$ in variables $X_1, \dots, X_n$ is given.
Then, except in the degenerate cases, the joint distribution of $(L_1, L_2, \dots, L_n)$ determines uniquely the joint distribution of $(X_1, \dots, X_n)$. The point to be made here is that if $X_1, \dots, X_n$ are independent, then even degenerate transformations provide a lot of information. This phenomenon is responsible for results in Chapters 3 and 5.
More general results which have little to do with the Gaussian distribution are also known. For instance, if $X_1, X_2, X_3$ are independent, then the joint distribution $\mu(dx, dy)$ of the pair $X_1-X_2$, $X_2-X_3$ determines the distribution of $X_1, X_2, X_3$ up to a change of location, provided that the characteristic function of $\mu$ does not vanish, see []. This result was found independently by a number of authors, see [,,]; for related results see also [,].
Nonlinear functions were analyzed in [] and the references therein.
5.6 Problems
Problem 43
Let $X_1, X_2, \dots$ and $Y_1, Y_2, \dots$ be two sequences of i. i. d. copies of random variables $X, Y$ respectively. Suppose $X, Y$ have finite second moments and are such that $U = X+Y$ and $V = X-Y$ are independent. Observe that in distribution $X \cong X_1 = \frac12(U+V) \cong \frac12(X_1+Y_1+X_2-Y_2)$, etc.
Use this observation and the Central Limit Theorem to prove Theorem 5.1 under the additional assumption of finiteness of second moments.
Problem 44
Let $X$ and $Y$ be two independent identically distributed random variables such that $U = X+Y$ and $V = X-Y$ are also independent. Observe that $2X = U+V$ and hence the characteristic function $\phi(\cdot)$ of $X$ satisfies the equation $\phi(2t) = \phi(t)^3\phi(-t)$. Use this observation to prove Theorem 5.1 under the additional assumption that the variables are i. i. d.
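The identity behind this problem can be checked numerically. Since $2X = U+V$ with $U, V$ independent, the characteristic function of $2X$ is the product of those of $U = X+Y$ and $V = X-Y$, that is $\phi(t)^2 \cdot \phi(t)\phi(-t)$. The sketch below compares both sides for a normal law, where the identity holds, and for a symmetric $\pm 1$ law, where it fails; the mean, standard deviation, and grid are hypothetical choices.

```python
import cmath
import math

def phi_normal(t, mean=0.3, sd=1.2):
    """Characteristic function of a normal law; mean and sd are arbitrary choices."""
    return cmath.exp(1j * mean * t - 0.5 * (sd * t) ** 2)

def phi_sign(t):
    """Characteristic function cos(t) of a symmetric +/-1 random variable."""
    return complex(math.cos(t))

def gap(phi, t):
    """Absolute difference between phi(2t) and phi(t)^3 * phi(-t)."""
    return abs(phi(2 * t) - phi(t) ** 3 * phi(-t))

ts = [0.05 * k for k in range(1, 41)]
gap_normal = max(gap(phi_normal, t) for t in ts)
gap_sign = max(gap(phi_sign, t) for t in ts)
print(gap_normal, gap_sign)
```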
Problem 45 [Deterministic version of Theorem 5.1]
Suppose X,U,V are independent and X+U, X+V are independent. Show
that X is non-random.
The next problem gives a one dimensional converse to Theorem 2.2.
Problem 46 [From []]
Let $X, Y$ be (dependent) random variables such that for some number $\rho \ne 0, \pm 1$ both $X-\rho Y$ and $Y$ are independent and also $Y-\rho X$ and $X$ are independent.
Show that $(X, Y)$ has a bivariate normal distribution.
Chapter 6 Stability and weak stability
The stability problem is the question of to what extent the conclusion of a
theorem is sensitive to small changes in the assumptions. Such a description is,
of course, vague until one specifies how to quantify the departures both from
the conclusion and from the assumptions. The latter choice is to some extent
arbitrary; in the characterization context, stability reasoning typically
depends on the ability to prove that small changes (measured with respect to
some measure of smallness) in the assumptions of a given characterization
theorem result in small departures (measured with respect to one of the
distances between distributions) from the normal distribution.
Below we present only one stability result; more about stability of
characterizations can be found in
[], see also [].
In Section we also give two results that establish what
one may call weak stability. Namely, we establish that moderate
changes in assumptions still preserve some properties of the normal
distribution.
Theorem below is the only result of this chapter used later on.
6.1 Coefficients of dependence
In this section we introduce a class of measures of departure from independence,
which we shall call coefficients of dependence.
There is no natural measure of dependence between random variables; those
defined below have been used to define strong mixing conditions
in limit theorems; for the latter the reader is referred to
[]; see also
[].
To make the definition look less arbitrary, at first
we consider an infinite parametric family of measures of
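dependence. For a pair of
$\sigma$-fields $\mathcal{F}$, $\mathcal{G}$ let
\[
\alpha_{r,s}(\mathcal{F},\mathcal{G}) = \sup\left\{ \frac{|P(A\cap B)-P(A)P(B)|}{P(A)^r P(B)^s} : A \in \mathcal{F},\ B \in \mathcal{G} \text{ non-trivial} \right\}
\]
with the range of parameters
$0 \le r \le 1$, $0 \le s \le 1$, $r+s \le 1$.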
Clearly, $\alpha_{r,s}$ is a number between 0 and 1. It is obvious
that $\alpha_{r,s} = 0$
if and only if the $\sigma$-fields $\mathcal{F}$, $\mathcal{G}$ are independent.
Therefore
one could use each of the coefficients $\alpha_{r,s}$ as a measure of
departure from independence.
Fortunately, among the infinite number
of coefficients of dependence thus introduced, there are just four really
distinct ones, namely $\alpha_{0,0}$, $\alpha_{0,1}$, $\alpha_{1,0}$, and
$\alpha_{1/2,1/2}$. By this we mean that the convergence to zero of
$\alpha_{r,s}$ (when the $\sigma$-fields $\mathcal{F}$, $\mathcal{G}$ vary) is equivalent to the convergence to 0 of one
of the above four coefficients. And since $\alpha_{0,1}$ and $\alpha_{1,0}$
are mirror images of each other, we are actually left with three coefficients
only.
The formal statement of this equivalence takes the form of the following inequalities.
Proposition 9
If $r+s < 1$, then
$\alpha_{r,s} \le (\alpha_{0,0})^{1-r-s}$.
If $r+s = 1$ and $0 < r \le 1/2 \le s < 1$, then
$\alpha_{r,s} \le (\alpha_{1/2,1/2})^{2r}$.
Proof. The first inequality follows from the fact that
\[
\frac{|P(A\cap B)-P(A)P(B)|}{P(A)^r P(B)^s}
= |P(A\cap B)-P(A)P(B)|^{1-r-s}\,|P(B|A)-P(B)|^r\,|P(A|B)-P(A)|^s
\le |P(A\cap B)-P(A)P(B)|^{1-r-s}.
\]
The second one is a consequence of
\[
\frac{|P(A\cap B)-P(A)P(B)|}{P(A)^r P(B)^s}
= \left( \frac{|P(A\cap B)-P(A)P(B)|}{P(A)^{1/2}P(B)^{1/2}} \right)^{2r} |P(A|B)-P(A)|^{s-r}
\le (\alpha_{1/2,1/2})^{2r}.
\]
□
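On a finite probability space the coefficients can be computed by brute force, which gives a quick way to see the two inequalities at work. The following is a minimal sketch, not part of the original text: the joint probability table `JOINT` and the helper names are hypothetical, the two $\sigma$-fields are taken to be those generated by the row index and by the column index, and "non-trivial" is read as "of probability strictly between 0 and 1".

```python
from itertools import combinations

def proper_subsets(n):
    """Index tuples of all non-empty proper subsets of {0, ..., n-1}."""
    return [c for k in range(1, n) for c in combinations(range(n), k)]

def alpha(joint, r, s):
    """Brute-force alpha_{r,s} for the sigma-fields generated by the row
    index and by the column index of the joint probability table `joint`.
    All marginal probabilities are assumed to be positive."""
    nx, ny = len(joint), len(joint[0])
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(nx)) for j in range(ny)]
    best = 0.0
    for A in proper_subsets(nx):
        pa = sum(px[i] for i in A)
        for B in proper_subsets(ny):
            pb = sum(py[j] for j in B)
            pab = sum(joint[i][j] for i in A for j in B)
            best = max(best, abs(pab - pa * pb) / (pa ** r * pb ** s))
    return best

# Hypothetical 3x3 joint distribution (entries sum to 1).
JOINT = [[0.20, 0.05, 0.05],
         [0.05, 0.25, 0.05],
         [0.05, 0.05, 0.25]]

a00 = alpha(JOINT, 0.0, 0.0)
ahalf = alpha(JOINT, 0.5, 0.5)
print("alpha_{0,0} =", a00, " alpha_{1/2,1/2} =", ahalf)
```

For any admissible pair $(r,s)$ one can then compare `alpha(JOINT, r, s)` with `a00 ** (1 - r - s)` or, when $r+s = 1$, with `ahalf ** (2 * r)`.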
Coefficients $\alpha_{0,0}$ and $\alpha_{0,1}$, $\alpha_{1,0}$ are the basis for
the definition of classes of stationary sequences called in the limit theorems
literature strong-mixing and uniform strong mixing (also called
$\varphi$-mixing);
$\alpha_{1/2,1/2}$ is equivalent to the maximal correlation coefficient
(), which is the basis of the so called $\rho$-mixing condition.
Monograph [] gives recent exposition and relevant
references; see also [].
There is also a whole continuous spectrum of non-equivalent coefficients
$\alpha_{r,s}$ when $r+s > 1$. As those coefficients may attain value $\infty$,
they are less frequently used; one notable exception is $\alpha_{1,1}$, which
is the basis of the so called $\psi$-mixing condition and occurs occasionally
in the assumptions of some limit theorems. A condition equivalent to
$\alpha_{1,1} < \infty$ and conditions related to $\alpha_{r,s}$
with $r+s > 1$ are also employed in large deviation theorems, see
[].
The following bounds$^{20}$ for the covariances between random variables in
$L_p(\mathcal{F})$ and in $L_q(\mathcal{G})$ will be used later on.
Proposition 10
If $X$ is $\mathcal{F}$-measurable with $p$-th
moment finite ($1 \le p \le \infty$) and $Y$ is $\mathcal{G}$-measurable with
$q$-th moment finite ($1 \le q \le \infty$) and $1/p+1/q \le 1$, then
\[
|EXY-EX\,EY| \le 4(\alpha_{0,0})^{1-1/p-1/q}(\alpha_{1,0})^{1/p}(\alpha_{0,1})^{1/q}\|X\|_p\|Y\|_q,
\tag{104}
\]
where $\|X\|_p = (E|X|^p)^{1/p}$ if $p < \infty$ and $\|X\|_\infty = \mathrm{ess\,sup}|X|$.
Proof.
We shall prove the result for $p = 1$, $q = \infty$ and $p = q = \infty$ only;
these are the only cases we shall actually need; for the general case,
see e.g. [] or
[].
Let $M = \mathrm{ess\,sup}|Y|$. Switching the order of integration
(i.e. by Fubini's theorem) we get, see Problem 1.9,
\[
|EXY-EX\,EY| = \left| \int_{-\infty}^{\infty}\int_{-M}^{M} \bigl(P(X \ge t, Y \ge s)-P(X \ge t)P(Y \ge s)\bigr)\,dt\,ds \right|
\le \int_{-\infty}^{\infty}\int_{-M}^{M} |P(X \ge t, Y \ge s)-P(X \ge t)P(Y \ge s)|\,dt\,ds.
\tag{105}
\]
Since $|P(X \ge t, Y \ge s)-P(X \ge t)P(Y \ge s)| \le \alpha_{1,0} P(X \ge t)$
(which is good for positive $t$) and
$|P(X \ge t, Y \ge s)-P(X \ge t)P(Y \ge s)| = |P(X < t, Y \ge s)-P(X < t)P(Y \ge s)| \le \alpha_{1,0} P(X \le t)$ (which works well for negative
$t$), inequality (105) implies
\[
|EXY-EX\,EY| \le \alpha_{1,0}\int_0^\infty\int_{-M}^{M} P(X \ge t)\,dt\,ds
+ \alpha_{1,0}\int_0^\infty\int_{-M}^{M} P(X \le -t)\,dt\,ds = 2\alpha_{1,0}E|X|\,\|Y\|_\infty.
\]
A similar argument using
$|P(X \ge t, Y \ge s)-P(X \ge t)P(Y \ge s)| \le \alpha_{0,0}$ gives
\[
|EXY-EX\,EY| \le 4\alpha_{0,0}\|X\|_\infty \|Y\|_\infty.
\]
□
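The two cases of the covariance bound that were proved above can be checked numerically on a small discrete example. This is an illustrative sketch only: the values `XS`, `YS` and the table `JOINT` are hypothetical, $\mathcal{F}$ and $\mathcal{G}$ are taken to be the $\sigma$-fields generated by $X$ and by $Y$, and the suprema defining $\alpha_{0,0}$ and $\alpha_{1,0}$ are computed by enumerating all events.

```python
from itertools import combinations

def proper_subsets(n):
    """Index tuples of all non-empty proper subsets of {0, ..., n-1}."""
    return [c for k in range(1, n) for c in combinations(range(n), k)]

def covariance_report(xs, ys, joint):
    """Return |cov(X, Y)| together with the two bounds proved above.

    joint[i][j] = P(X = xs[i], Y = ys[j]); all marginal probabilities are
    assumed positive, so max|x| and max|y| are the essential suprema.
    """
    nx, ny = len(xs), len(ys)
    px = [sum(joint[i]) for i in range(nx)]
    py = [sum(joint[i][j] for i in range(nx)) for j in range(ny)]
    a00 = a10 = 0.0
    for A in proper_subsets(nx):          # A ranges over sigma(X)
        pa = sum(px[i] for i in A)
        for B in proper_subsets(ny):      # B ranges over sigma(Y)
            pb = sum(py[j] for j in B)
            pab = sum(joint[i][j] for i in A for j in B)
            gap = abs(pab - pa * pb)
            a00 = max(a00, gap)           # alpha_{0,0}
            a10 = max(a10, gap / pa)      # alpha_{1,0}
    ex = sum(x * p for x, p in zip(xs, px))
    ey = sum(y * p for y, p in zip(ys, py))
    exy = sum(xs[i] * ys[j] * joint[i][j]
              for i in range(nx) for j in range(ny))
    e_abs_x = sum(abs(x) * p for x, p in zip(xs, px))
    sup_x = max(abs(x) for x in xs)
    sup_y = max(abs(y) for y in ys)
    return {
        "abs_cov": abs(exy - ex * ey),
        "bound_p1_qinf": 2.0 * a10 * e_abs_x * sup_y,    # case p = 1, q = infinity
        "bound_pinf_qinf": 4.0 * a00 * sup_x * sup_y,    # case p = q = infinity
    }

# Hypothetical data for the illustration.
XS = [-1.0, 0.5, 2.0]
YS = [-2.0, 0.0, 1.0]
JOINT = [[0.20, 0.05, 0.05],
         [0.05, 0.25, 0.05],
         [0.05, 0.05, 0.25]]

REPORT = covariance_report(XS, YS, JOINT)
print(REPORT)
```

In this example the absolute covariance stays below both bounds, as the proposition predicts.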
6.1.1 Normal case
Here we review without proofs the relations between the
dependence coefficients in the multivariate normal case.
Ideas behind the proofs can be found in the solutions to
the Problems , , and
.
The first result points out that the coefficients $\alpha_{0,1}$
and $\alpha_{1,0}$ are of little interest in the normal case.
Theorem 49
Suppose $(X, Y) \in \mathbb{R}^{d_1+d_2}$ are jointly normal and
$\alpha_{0,1}(X,Y) < 1$. Then $X$, $Y$ are independent.
Denote by $\rho$ the maximal correlation coefficient
\[
\rho = \sup\{\mathrm{corr}(f(X), g(Y)): f(X), g(Y) \in L_2\}.
\tag{106}
\]
The following estimate due to Kolmogorov & Rozanov
[] shows that in the normal case the maximal correlation
coefficient (106) can be estimated by $\alpha_{0,0}$.
In particular, in the normal case we have
Theorem 50
Suppose $X, Y \in \mathbb{R}^{d_1+d_2}$ are jointly normal.
Then
\[
\mathrm{corr}(f(X),g(Y)) \le 2\pi\alpha_{0,0}(X,Y)
\]
for all square integrable $f$, $g$.
The next inequality is known as Nelson's hypercontractive estimate
[]
and is of importance in mathematical physics.
It is also known in general that inequality () implies
a bound for maximal correlation, see [].
Theorem 51
Suppose $(X, Y) \in \mathbb{R}^{d_1+d_2}$ are jointly normal.
Then
\[
Ef(X)g(Y) \le \|f(X)\|_p\|g(Y)\|_p
\tag{107}
\]
for all $p$-integrable $f$, $g$,
provided $p \ge 1+\rho$, where $\rho$ is the maximal correlation
coefficient (106).
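A one dimensional worked example may help to see why the threshold $p \ge 1+\rho$ appears in (107). The computation below is an illustration added here, not taken from the sources cited above. It assumes $d_1 = d_2 = 1$, standard normal $X$, $Y$ with correlation coefficient $r \ge 0$ (so that the maximal correlation (106) equals $r$, a classical fact for the bivariate normal law), and the hypothetical test functions $f(x) = e^{ax}$, $g(y) = e^{by}$ with real $a$, $b$.
\[
Ef(X)g(Y) = E e^{aX+bY} = \exp\bigl(\tfrac12(a^2+2rab+b^2)\bigr),
\qquad
\|f(X)\|_p = \bigl(E e^{paX}\bigr)^{1/p} = \exp\bigl(\tfrac12 p a^2\bigr),
\]
and similarly $\|g(Y)\|_p = \exp(\tfrac12 p b^2)$. Therefore for this pair of functions inequality (107) is equivalent to
\[
2rab \le (p-1)(a^2+b^2).
\]
Since $2|ab| \le a^2+b^2$, this holds for all $a$, $b$ as soon as $p-1 \ge r$; taking $a = b \ne 0$ shows that it fails when $p < 1+r$.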
6.2 Weak stability
A weak version of the stability problem may be described as allowing relatively
large departures from the assumptions of a given theorem. In return, only a
selected part of the conclusion is to be preserved. In this section the part of
the characterization conclusion that we want to preserve is integrability. This
problem is of independent interest. Integrability results are often useful
as a first step in some proofs, see the proof of Theorem 5.3, or the
proof of Theorem below.
As a simple example of weak stability we first consider Theorem 5.1, which
says that for independent r.v. $X$, $Y$ we have $\alpha_{1,0}(X+Y, X-Y) = 0$ only
in the normal case.
We shall show that if the coefficient of dependence $\alpha_{1,0}(X+Y, X-Y)$ is
small,
then the distribution of $X$ still has some finite moments.
The method of proof is an adaptation of the proof of Theorem 5.2.
Proposition 11
Suppose $X$, $Y$ are independent random variables such
that random variables $X+Y$ and $X-Y$ satisfy $\alpha_{1,0}(X+Y, X-Y) < 1/2$.
Then $X$ and $Y$ have finite moments $E|X|^\beta < \infty$ for
$\beta < -\log_2(2\alpha_{1,0})$.
Proof.
Let $N(x) = \max\{P(|X| \ge x), P(|Y| \ge x)\}$. Put $\alpha = \alpha_{1,0}$.
We shall show that for each $\rho > 2\alpha$, there is $x_0 > 0$ such that
\[
N(2x) \le \rho N(x-x_0)
\tag{108}
\]
for all $x \ge x_0$.
Inequality (108) follows from the fact that the event
$\{|X| \ge 2x\}$ implies that either $\{|X| \ge 2x\}\cap\{|Y| \ge 2y\}$ or
$\{|X+Y| \ge 2(x-y)\}\cap\{|X-Y| \ge 2(x-y)\}$ holds (make a picture).
Therefore, using the independence of $X$, $Y$, the
definition of $\alpha = \alpha_{1,0}(X+Y, X-Y)$ and the trivial bound
$P(|X+Y| \ge a) \le P(|X| \ge \frac12 a)+P(|Y| \ge \frac12 a)$ we obtain
\[
P(|X| \ge 2x) \le P(|X| \ge 2x)P(|Y| \ge 2y)
+ P(|X+Y| \ge 2(x-y))\bigl(\alpha+P(|X-Y| \ge 2(x-y))\bigr)
\]
\[
\le N(2x)N(2y)+2\alpha N(x-y)+4N^2(x-y).
\]
For any $\varepsilon > 0$ pick $y$ so that $N(2y) \le \varepsilon/(1+\varepsilon)$. This gives
$N(2x) \le (1+\varepsilon)2\alpha N(x-y)+4(1+\varepsilon)N^2(x-y)$ for all $x > y$.
Now pick $x_0 \ge y$ such that $N(x-y) \le \varepsilon\alpha/(1+\varepsilon)$ for all $x \ge x_0$.
Then
\[
N(2x) \le 2(1+2\varepsilon)\alpha N(x-y) \le 2(1+3\varepsilon)\alpha N(x-x_0)
\]
for all $x \ge x_0$. Since $\varepsilon > 0$ is arbitrary, this ends the proof of
(108).
By Theorem 1.3 inequality (108) concludes the
proof, e.g. by formula (2).
□
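To see where the exponent $-\log_2(2\alpha_{1,0})$ comes from, one can iterate a tail inequality of the form $N(2x) \le \rho N(x-x_0)$ along the sequence $x_{k+1} = 2(x_k+x_0)$, which gives $N(x_k) \le \rho^k N(x_0')$ for the starting point $x_0'$. The sketch below is only an illustration under that assumed form of inequality (108); the function names and the numerical values of `rho`, `x0` and the starting point are hypothetical. It shows that the resulting bound decays like the power $x^{\log_2\rho}$, so moments of order below $-\log_2\rho$ are the natural ones to expect.

```python
import math

def tail_bound_sequence(rho, x0, x_start, steps):
    """Iterate N(2x) <= rho * N(x - x0) starting from the trivial bound
    N(x_start) <= 1.  Returns the points x_k and the bounds rho**k."""
    xs, bounds = [x_start], [1.0]
    for _ in range(steps):
        xs.append(2.0 * (xs[-1] + x0))   # x_{k+1} = 2 (x_k + x0)
        bounds.append(rho * bounds[-1])  # N(x_{k+1}) <= rho * N(x_k)
    return xs, bounds

def decay_exponent(xs, bounds):
    """Slope of log(bound) against log(x) between the first and last point."""
    return ((math.log(bounds[-1]) - math.log(bounds[0]))
            / (math.log(xs[-1]) - math.log(xs[0])))

RHO = 0.5  # hypothetical value of rho
XS, BOUNDS = tail_bound_sequence(RHO, x0=1.0, x_start=1.0, steps=200)
SLOPE = decay_exponent(XS, BOUNDS)
print("empirical exponent:", SLOPE, " log2(rho):", math.log2(RHO))
```

The empirical exponent approaches $\log_2\rho$ from above as the number of steps grows.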
In Chapter we shall consider assumptions about conditional moments.
In Section we need the integrability result
which we state below. The assumptions are motivated by the fact that a pair
$X$, $Y$ with the bivariate normal distribution has linear regressions
$E\{X|Y\} = a_0+a_1Y$ and $E\{Y|X\} = b_0+b_1X$, see
(30); moreover, since $X-(a_0+a_1Y)$ and $Y$ are independent
(and similarly $Y-(b_0+b_1X)$ and $X$ are independent), see Theorem
2.2, the conditional variances $Var(X|Y)$ and $Var(Y|X)$ are
non-random. These two properties do not characterize the
normal distribution, see Problem . However, the assumption that regressions are linear and
conditional variances are constant might be considered as a departure from
the assumptions of Theorem 5.1 on the one hand and from the assumptions
of Theorem on the other. The following somewhat surprising fact
comes from []. For similar implications see also [] and [].
Theorem 52
Let $X$, $Y$ be random variables with finite second
moments and suppose that
\[
E\{|X-(a_0+a_1Y)|^2 \mid Y\} \le const
\tag{109}
\]
and
\[
E\{|Y-(b_0+b_1X)|^2 \mid X\} \le const
\tag{110}
\]
for some real numbers $a_0, a_1, b_0, b_1$ such that $a_1b_1 \ne 0, 1, -1$.
Then $X$, $Y$ have finite moments of all orders.
In the proof we use the conditional version of Chebyshev's
inequality stated as Problem 1.9.
Lemma 24
If $\mathcal{F}$ is a $\sigma$-field and $E|X| < \infty$, then
\[
P(|X| > t \mid \mathcal{F}) \le E\{|X| \mid \mathcal{F}\}/t
\]
almost surely.
Proof. Fix $t > 0$ and let $A \in \mathcal{F}$. By the definition of the
conditional expectation
\[
\int_A P(|X| > t \mid \mathcal{F})\,dP = E\{I_A I_{|X| > t}\} \le E\{(|X|/t)\, I_A I_{|X| > t}\} \le t^{-1}E\{|X| I_A\}.
\]
This ends the proof by Lemma 1.4.
□
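When the $\sigma$-field is generated by a discrete random variable $Y$, the conditional inequality reduces to an ordinary Chebyshev (Markov) inequality on each atom $\{Y = y_j\}$. The following minimal sketch checks this on a small example; the table `JOINT`, the values `XS` and the function name are hypothetical and serve only as an illustration.

```python
def conditional_markov_gaps(xs, joint, t):
    """For each atom {Y = y_j} (column j of `joint`) return
    E{|X| | Y = y_j}/t - P(|X| > t | Y = y_j), which should be >= 0.

    joint[i][j] = P(X = xs[i], Y = y_j); every column is assumed to have
    positive total probability.
    """
    gaps = []
    for j in range(len(joint[0])):
        pj = sum(joint[i][j] for i in range(len(xs)))
        tail = sum(joint[i][j] for i in range(len(xs)) if abs(xs[i]) > t) / pj
        mean_abs = sum(abs(xs[i]) * joint[i][j] for i in range(len(xs))) / pj
        gaps.append(mean_abs / t - tail)
    return gaps

# Hypothetical data for the illustration.
XS = [-1.0, 0.5, 2.0]
JOINT = [[0.20, 0.05, 0.05],
         [0.05, 0.25, 0.05],
         [0.05, 0.05, 0.25]]

for t in (0.25, 0.75, 1.5):
    print(t, conditional_markov_gaps(XS, JOINT, t))
```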
Proof of Theorem 6.2. First let us observe that without losing
generality we may assume $a_0 = b_0 = 0$. Indeed, by the triangle inequality
$(E\{|X-a_1Y|^2|Y\})^{1/2} \le |a_0|+(E\{|X-(a_0+a_1Y)|^2|Y\})^{1/2} \le const$,
and the analogous bound takes care of (110). Furthermore, by passing
to $-X$ or $-Y$ if necessary, we may assume $a = a_1 > 0$ and $b = b_1 > 0$.
Let $N(x) = P(|X| \ge x)+P(|Y| \ge x)$. We shall show that there are constants
$K, C > 0$ such that
\[
N(Kx) \le C N(x)/x^2.
\tag{111}
\]
This will end the proof by Corollary 1.3.
To prove (111) we shall proceed as in the proof of Theorem 5.2.
Namely,
the event $\{|X| \ge Kx\}$, where $x > 0$ is fixed and $K$ will be chosen later,
can be decomposed into the union of two disjoint events
$\{|X| \ge Kx\}\cap\{|Y| \ge x\}$
and $\{|X| \ge Kx\}\cap\{|Y| < x\}$. Therefore trivially we have
\[
P(|X| \ge Kx) \le P(|X| \ge x, |Y| \ge x) + P(|X| \ge Kx, |Y| < x) = P_1+P_2.
\tag{112}
\]
For $K$ large enough the second term on the right hand side of (112)
can be estimated by conditional Chebyshev's inequality from Lemma 6.2.
Using the trivial estimate $|Y-bX| \ge b|X|-|Y|$ we get
\[
P_2 \le P(|Y-bX| \ge (Kb-1)x, |X| \ge Kx)
= \int_{|X| \ge Kx} P(|Y-bX| \ge (Kb-1)x \mid X)\,dP \le const\, N(Kx)/x^2.
\tag{113}
\]
To estimate $P_1$ in (112), observe that the event $\{|X| \ge x\}$
implies that either $|X-aY| \ge Cx$, or $|Y-bX| \ge Cx$, where
$C = |1-ab|/(1+a)$. Indeed, suppose both are not true, i.e. $|Y-bX| < Cx$ and
$|X-aY| < Cx$. Then we obtain trivially
\[
|1-ab|\,|X| = |X-abX| \le |X-aY|+a|Y-bX| < C(1+a)x.
\]
By our choice of $C$, this contradicts $|X| \ge x$.
Using the above observation and conditional Chebyshev's inequality we obtain
\[
P_1 \le P(|X-aY| \ge Cx, |Y| \ge x)
+P(|Y-bX| \ge Cx, |X| \ge x) \le C_1 N(x)/x^2.
\]
This, together with (112) and (113)
implies $P(|X| \ge Kx) \le CN(x)/x^2$
for any $K > 1/b$ with constant $C$ depending on $K$ but not on $x$. Similarly
$P(|Y| \ge Kx) \le CN(x)/x^2$ for any $K > 1/a$, which proves (111).
□
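The elementary step used to estimate $P_1$, namely that $\max(|x-ay|, |y-bx|) \ge C|x|$ with $C = |1-ab|/(1+a)$ and $a > 0$, is easy to check on a grid of real numbers. The sketch below does exactly that; the grid and the sample values of $a$, $b$ are hypothetical choices made for the illustration.

```python
def min_ratio(a, b, grid):
    """Smallest value of max(|x - a*y|, |y - b*x|) / |x| over all grid
    points (x, y) with x != 0."""
    best = float("inf")
    for x in grid:
        if x == 0.0:
            continue
        for y in grid:
            ratio = max(abs(x - a * y), abs(y - b * x)) / abs(x)
            best = min(best, ratio)
    return best

def constant_c(a, b):
    """The constant C = |1 - ab| / (1 + a) from the proof (a > 0)."""
    return abs(1.0 - a * b) / (1.0 + a)

GRID = [k / 10.0 for k in range(-50, 51)]

for a, b in [(0.5, 0.5), (2.0, 3.0), (1.5, 0.2)]:
    print(a, b, min_ratio(a, b, GRID), constant_c(a, b))
```

For $a = b = 1/2$ the bound is attained at $x = y$, so the constant $C$ cannot be improved in general.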
6.3 Stability
In this section we shall use the coefficient $\alpha_{0,0}$ to analyze the
stability of a variant$^{21}$ of Theorem
5.1 which is based on the approach sketched in Problem 5.6.
Theorem 53
Suppose $X$, $Y$ are i.i.d. with the cumulative distribution
function $F(\cdot)$. Assume that $EX = 0$, $EX^2 = 1$ and $E|X|^3 = K < \infty$ and
let $\Phi(\cdot)$ denote the cumulative distribution function of the standard normal distribution.
If $\alpha_{0,0}(X+Y; X-Y) < \varepsilon$, then
\[
\sup_x |F(x)-\Phi(x)| \le C(K)\varepsilon^{1/3}.
\tag{114}
\]
The following corollary is a consequence of Theorem 6.3 and
Proposition 6.2.
Corollary 15
Suppose $X$, $Y$ are i.i.d. with the cumulative distribution
function $F(\cdot)$. Assume that $EX = 0$, $EX^2 = 1$.
If $\alpha_{1,0}(X+Y; X-Y) < \varepsilon$, then there is $C < \infty$
such that (114) holds.
Indeed, by Proposition 6.2 the third moment exists if $\varepsilon < e^{-3/2}$;
choosing $C$ large enough, inequality
(114) holds true trivially for $\varepsilon \ge e^{-3/2}$.
The next lemma gives the estimate of the left hand side of (114) in
terms of characteristic functions. Inequality () is called the
smoothing inequality,
a name well motivated by the method of proof; it is due to Esseen [].
Lemma 25
Suppose $F$, $G$ are cumulative distribution functions
with the characteristic functions $\varphi$, $\psi$ respectively. If $G$ is
differentiable, then for all $T > 0$
\[
\sup_x |F(x)-G(x)| \le \frac{1}{\pi}\int_{-T}^{T} |\varphi(t)-\psi(t)|\,dt/t + \frac{12}{\pi T}\sup_x |G'(x)|.
\tag{115}
\]
Proof. By the approximation argument, it suffices to
prove (115) for $F$, $G$ differentiable and with integrable
characteristic functions only. Indeed, one can approximate
$F$ uniformly by the cumulative distribution functions $F_\delta$, obtained by
convolving $F$ with the normal $N(0, \delta)$ distribution, compare Lemma
5.1. The approximation, clearly, does not affect (115). That
is, if (115) holds true for the approximants, then it holds true for
the actual cdf's as well.
Let $f$, $g$ be the densities of $F$ and $G$ respectively. The inversion formula
for characteristic functions gives
\[
f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi(t)\,dt,
\qquad
g(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\psi(t)\,dt.
\]
From this we obtain
\[
F(x)-G(x) = \frac{i}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\,\frac{\varphi(t)-\psi(t)}{t}\,dt.
\]
The latter formula can be checked, for instance, by verifying that both sides have
the same derivative, so that they may differ by a constant only. The constant
has to be 0, because the left hand side has limit 0 at $\infty$
(a property of cdf) and the right
hand side has limit 0 at $\infty$ (e.g. because we convolved with the
normal distribution while doing our approximation step; another way of seeing
what the asymptotic at $\infty$ of the right hand side is, is to use the
Riemann-Lebesgue theorem, see e.g. []).
This clearly implies
\[
\sup_x |F(x)-G(x)| \le \frac{1}{2\pi}\int_{-\infty}^{\infty} |\varphi(t)-\psi(t)|\,dt/t.
\tag{116}
\]
This inequality, while resembling (115), is not good enough; it is
not preserved by our approximation procedure, and the right hand side is
useless when the density of F doesn't exist. Nevertheless (116)
would do, if one only knew that the characteristic functions vanish outside
of a finite interval.
To achieve this, one needs to consider one more convolution approximation; this
time we shall use the density
$h_T(x) = \frac{1}{\pi T}\,\frac{1-\cos(Tx)}{x^2}$. We shall need the
fact that the characteristic function $\eta_T(t)$ of $h_T(x)$ vanishes for
$|t| \ge T$ (and we shall not need the explicit formula $\eta_T(t) = 1-|t|/T$ for
$|t| \le T$, cf. Example 1.5). Denote by $F_T$ and $G_T$ the cumulative distribution functions
corresponding to convolutions $f*h_T$ and $g*h_T$ respectively.
The corresponding characteristic functions are $\varphi(t)\eta_T(t)$ and
$\psi(t)\eta_T(t)$ respectively and both vanish for $|t| \ge T$. Therefore,
inequality (116) applied to $F_T$ and $G_T$ gives
\[
\sup_x |F_T(x)-G_T(x)| \le \frac{1}{2\pi}\int_{-T}^{T} |(\varphi(t)-\psi(t))\eta_T(t)|\,dt/t
\le \frac{1}{2\pi}\int_{-T}^{T} |\varphi(t)-\psi(t)|\,dt/t.
\tag{117}
\]
It remains to verify that $\sup_x|F_T(x)-G_T(x)|$ does not differ too much from
$\sup_x|F(x)-G(x)|$. Namely, we shall show that
\[
\sup_x |F(x)-G(x)| \le 2\sup_x |F_T(x)-G_T(x)| + \frac{12}{\pi T}\sup_x |G'(x)|,
\tag{118}
\]
which together with (117) will end the proof of (115).
To verify (118), put $M = \sup_x|G'(x)|$ and pick $x_0$ such that
\[
\sup_x |F(x)-G(x)| = |F(x_0)-G(x_0)|.
\]
Such $x_0$ can be found, because $F$ and $G$ are continuous and
$F(x)-G(x)$ vanishes as $x \to \pm\infty$.
Suppose $\sup_x|F(x)-G(x)| = G(x_0)-F(x_0)$. (The other case,
$\sup_x|F(x)-G(x)| = F(x_0)-G(x_0)$, is handled similarly, and is done explicitly
in []).
Since $F$ is non-decreasing, and the rate of growth of $G$ is bounded by $M$,
for all $s \ge 0$ we get
\[
G(x_0-s)-F(x_0-s) \ge G(x_0)-F(x_0)-sM.
\]
Now put
$a = (G(x_0)-F(x_0))/(2M)$, $t = x_0+a$, $x = a-s$. Then for all $|x| \le a$ we
get
\[
G(t-x)-F(t-x) \ge \frac12(G(x_0)-F(x_0))+Mx.
\tag{119}
\]
Notice that
\[
G_T(t)-F_T(t) = \frac{1}{\pi T}\int_{-\infty}^{\infty} (F(t-x)-G(t-x))(1-\cos Tx)x^{-2}\,dx
\]
\[
\ge \frac{1}{\pi T}\int_{-a}^{a} (F(t-x)-G(t-x))(1-\cos Tx)x^{-2}\,dx
- \sup_x |F(x)-G(x)|\,\frac{2}{\pi T}\int_a^\infty y^{-2}\,dy.
\]
Clearly,
\[
\sup_x |F(x)-G(x)|\,\frac{2}{\pi T}\int_a^\infty y^{-2}\,dy = (G(x_0)-F(x_0))\,\frac{2}{\pi T}\,a^{-1} = 4M/(\pi T)
\]
by our choice of $a$.
On the other hand (119) gives
\[
\frac{1}{\pi T}\int_{-a}^{a} (F(t-x)-G(t-x))(1-\cos Tx)x^{-2}\,dx
\ge \frac{1}{\pi T}\int_{-a}^{a} Mx(1-\cos Tx)x^{-2}\,dx
+ \frac12(G(x_0)-F(x_0))\Bigl(1-\frac{2}{\pi T}\int_a^\infty y^{-2}\,dy\Bigr)
\]
\[
= \frac12(G(x_0)-F(x_0))-2M/(\pi T);
\]
here we used
the fact that the first integral vanishes by symmetry. Therefore
$G(x_0)-F(x_0) \le 2(G_T(x_0+a)-F_T(x_0+a))+12M/(\pi T)$, which clearly implies
(118).
□
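The smoothing inequality (115) can be illustrated numerically on a concrete pair of distributions. The sketch below is only an illustration under stated assumptions: $F$ is taken to be the uniform distribution on $[-\sqrt3, \sqrt3]$ (mean 0, variance 1), $G$ is the standard normal distribution, the integrand in (115) is read as $|\varphi(t)-\psi(t)|/|t|$, the integral is approximated by a midpoint rule, and the supremum on the left hand side by a maximum over a grid. All function names are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)

def cdf_uniform(x):
    """Cdf of the uniform distribution on [-sqrt(3), sqrt(3)]."""
    return min(1.0, max(0.0, (x + SQRT3) / (2.0 * SQRT3)))

def cdf_normal(x):
    """Cdf of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def chf_uniform(t):
    """Characteristic function of the uniform law above."""
    return math.sin(SQRT3 * t) / (SQRT3 * t) if t != 0.0 else 1.0

def chf_normal(t):
    """Characteristic function of the standard normal law."""
    return math.exp(-0.5 * t * t)

def smoothing_bound(T, steps=20000):
    """Right hand side of (115): midpoint rule on (0, T), doubled because
    the integrand |phi - psi| / |t| is even."""
    h = T / steps
    integral = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        integral += abs(chf_uniform(t) - chf_normal(t)) / t
    integral *= 2.0 * h
    sup_density = 1.0 / math.sqrt(2.0 * math.pi)   # sup of the normal density
    return integral / math.pi + 12.0 * sup_density / (math.pi * T)

def sup_distance(points=12001, span=6.0):
    """Grid approximation of sup_x |F(x) - G(x)| on [-span, span]."""
    best = 0.0
    for k in range(points):
        x = -span + 2.0 * span * k / (points - 1)
        best = max(best, abs(cdf_uniform(x) - cdf_normal(x)))
    return best

LHS = sup_distance()
for T in (5.0, 20.0, 100.0):
    print(T, LHS, smoothing_bound(T))
```

The bound is far from tight in this example; its role in the proof of Theorem 6.3 is to transfer an estimate on characteristic functions to the distribution functions.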
Proof of Theorem 6.3.
Clearly only small e > 0 are of interest.
Throughout the proof C will denote a constant depending on K only,
not always the same at each occurrence.
Let $\varphi(\cdot)$ be the characteristic function of $X$. We have
$E\exp(it(X+Y))\exp(it(X-Y)) = \varphi(2t)$ and
$E\exp(it(X+Y))\,E\exp(it(X-Y)) = (\varphi(t))^3\varphi(-t)$. Therefore by a complex valued variant of (104)
with $p = q = \infty$, see
Problem , we have
\[
|\varphi(2t)-(\varphi(t))^3\varphi(-t)| \le 16\varepsilon.
\tag{120}
\]
We shall use (115) with $T = \varepsilon^{-1/3}$ to show that (120)
implies (114).
To this end we need only to establish that for some $C > 0$
\[
\frac{1}{\pi T}\int_{-T}^{T} |\varphi(t)-e^{-\frac12 t^2}|/t\,dt \le C\varepsilon^{1/3}.
\tag{121}
\]
Put $h(t) = \varphi(t)-e^{-\frac12 t^2}$. Since $EX = 0$, $EX^2 = 1$ and
$E|X|^3 < \infty$, we can choose $\varepsilon > 0$ small enough so that
for all $|t| \le \varepsilon^{1/3}$. From (120) we see that
\[
|h(2t)| = |\varphi(2t)-\exp(-2t^2)| \le 16\varepsilon+|(\varphi(t))^3\varphi(-t)-\exp(-2t^2)|.
\]
Since $\varphi(t) = \exp(-\frac12 t^2)+h(t)$, we get
\[
|h(2t)| \le 16\varepsilon+\sum_{r=0}^{3}\binom{4}{r}\exp(-\tfrac12 rt^2)|h(t)|^{4-r}.
\tag{123}
\]
Put $t_n = \varepsilon^{1/3}2^n$, where $n = 0, 1, 2, \dots, [1-\frac23\log_2(\varepsilon)]$,
and let $h_n = \max\{|h(t)|: t_{n-1} \le t \le t_n\}$. Then (123) implies
\[
h_{n+1} \le 16\varepsilon+4\exp(-\tfrac12 t_n^2)h_n(1+\tfrac32 h_n+h_n^2)+h_n^4.
\tag{124}
\]
Claim 2
Relation (124) implies that for all sufficiently small
$\varepsilon > 0$ we have
\[
h_n \le 2(C_0+44)\varepsilon 4^n \exp(-t_0^2 4^n/6),
\tag{125}
\]
where $0 \le n \le [1-\frac23\log_2(\varepsilon)]$, and $C_0$ is a constant from
(122).
Claim 6.3 now ends the proof. Indeed,
\[
\int_{-T}^{T} |\varphi(t)-e^{-\frac12 t^2}|/t\,dt = 2\int_0^{t_0} |h(t)|/t\,dt+2\sum_{i=1}^{n}\int_{t_{i-1}}^{t_i} |h(t)|/t\,dt
\]
\[
\le 2C_0\varepsilon+2\sum_{i=1}^{n} h_i/t_{i-1}\int_{t_{i-1}}^{t_i} 1\,dt \le 2C_0\varepsilon+4\sum_{i=1}^{n}(C_0+44)\varepsilon 4^i e^{-t_0^2 4^i/6}
\]
\[
\le 2C_0\varepsilon+24(C_0+44)\frac{\varepsilon}{t_0^2}\int_0^\infty e^{-x}\,dx \le C\varepsilon^{1/3}.
\]
□
Proof of Claim 6.3. We shall prove (126) by induction,
and (125) will be established in the induction step.
By (122), inequality (126) is true for $n = 1$, provided
$\varepsilon < C_0^{-4/3}$. Suppose $m \ge 0$ is such
that (126) holds for all $n \le m$.
Since $\frac32 h_n+h_n^2 < 3\varepsilon^{1/4} = \delta$, (124) implies
\[
h_{m+1} \le 32\varepsilon+4\exp(-\tfrac12 t_n^2)h_m(1+\delta)
\]
\[
\le 32\varepsilon\sum_{j=1}^{n-1} 4^j(1+\delta)^j\exp\Bigl(-\tfrac12\sum_{k=1}^{j} t_{n-k}^2\Bigr)+4^n(1+\delta)^n\exp\Bigl(-\tfrac12\sum_{k=1}^{n} t_{n-k}^2\Bigr)h_1
\]
\[
= 32\varepsilon\sum_{j=1}^{n-1} 4^j(1+\delta)^j\exp(-t_0^2(4^n-4^{n-j})/6)+4^n(1+\delta)^n\exp(-t_0^2(4^n-1)/6)h_1.
\]
Therefore
\[
h_{m+1} \le (h_1+44\varepsilon)(1+\delta)^n 4^n e^{-t_0^2 4^n/6}.
\tag{127}
\]
Since
\[
(1+\delta)^n \le (1+3\varepsilon^{1/4})^{2-\frac23\log_2(\varepsilon)} \le 2
\]
and
\[
4^n e^{-t_0^2 4^n/6} \le 4\varepsilon^{-4/3}\exp(-\tfrac16\varepsilon^{-2/3}) \le \varepsilon^{-2/3}
\]
for all $\varepsilon > 0$ small enough, taking (122) into account,
we get
$h_{m+1} \le 2(44+C_0)\varepsilon^{1/3} \le \varepsilon^{1/4}$, provided $\varepsilon > 0$ is small
enough. This proves (126) by induction. Inequality (125)
follows now from (127).
□
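Inequality (120), on which the proof of Theorem 6.3 rests, measures how far a characteristic function $\varphi$ is from solving the functional equation $\varphi(2t) = (\varphi(t))^3\varphi(-t)$. The short sketch below, added as an illustration with hypothetical function names, evaluates this defect for the standard normal law, where it vanishes, and for the symmetric distribution on $\{-1, 1\}$, where it reaches 1 at $t = \pi/2$, so that (120) cannot hold there with $\varepsilon < 1/16$.

```python
import math

def defect(phi, t):
    """|phi(2t) - phi(t)^3 * phi(-t)| for a characteristic function phi."""
    return abs(phi(2.0 * t) - phi(t) ** 3 * phi(-t))

def phi_normal(t):
    """Characteristic function of the standard normal distribution."""
    return complex(math.exp(-0.5 * t * t))

def phi_sign(t):
    """Characteristic function of a symmetric random sign (values +1, -1)."""
    return complex(math.cos(t))

GRID = [k / 100.0 for k in range(-500, 501)]
MAX_DEFECT_NORMAL = max(defect(phi_normal, t) for t in GRID)
DEFECT_SIGN = defect(phi_sign, math.pi / 2.0)
print(MAX_DEFECT_NORMAL, DEFECT_SIGN)
```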
6.4 Problems
Problem 47
Show that for complex valued random variables $X$, $Y$
\[
|EXY-EX\,EY| \le 16\alpha_{0,0}\|X\|_\infty\|Y\|_\infty.
\]
(The constant is not sharp.)
Problem 48
Suppose $(X, Y) \in \mathbb{R}^2$ are jointly normal and
$\alpha_{0,1}(X,Y) < 1$. Show that $X$, $Y$ are independent.
Problem 49
Suppose $(X, Y) \in \mathbb{R}^2$ are jointly normal with correlation coefficient
$\rho$.
Show that
$Ef(X)g(Y) \le \|f(X)\|_p\|g(Y)\|_p$ for all $p$-integrable $f(X)$, $g(Y)$,
provided $p \ge 1+|\rho|$.
Hint: Use the expl |