6/4/2020

Probability Basics

The sample space \(\Omega\) is the set of possible outcomes of an experiment.

Points \(\omega\) in \(\Omega\) are called sample outcomes, realizations, or elements.

Subsets of \(\Omega\) are events.

A function \(P\) that assigns a real number \(P(A)\) to every event \(A\) is a probability distribution if it satisfies three properties:

  • \(P(A)\geq 0\) for all events \(A\subseteq \Omega\)
  • \(P(\Omega)=1\)
  • If \(A_1, A_2, \dots\) are disjoint, then \(P\left(\cup_{i=1}^\infty A_i \right)=\sum_{i=1}^\infty P(A_i)\).
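The three properties can be checked by brute force on a small finite sample space. The sketch below assumes a fair six-sided die (a toy example, not part of the definitions above) and uses the hypothetical helper names `prob` and `all_events`; it checks only the finite version of the third property.

```python
from fractions import Fraction
from itertools import chain, combinations

# Assumed toy sample space: one roll of a fair six-sided die.
omega = frozenset(range(1, 7))

def prob(event):
    """P(A) = |A| / |Omega| when all outcomes are equally likely."""
    return Fraction(len(frozenset(event) & omega), len(omega))

def all_events(space):
    """Return every subset of the sample space, i.e. every event."""
    items = sorted(space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

events = all_events(omega)

# Property 1: P(A) >= 0 for every event A.
nonnegative = all(prob(a) >= 0 for a in events)

# Property 2: P(Omega) = 1.
total = prob(omega)

# Property 3 (finite case): additivity over two disjoint events.
evens, low_odds = frozenset({2, 4, 6}), frozenset({1, 3})
additive = prob(evens | low_odds) == prob(evens) + prob(low_odds)

print(nonnegative, total, additive)
```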

Independence

A set of events \(\{A_i:i\in I\}\) is independent if

\[P\left(\cap_{i\in J} A_i \right)=\prod_{i\in J} P(A_i ) \]

for every finite subset \(J\) of \(I\).
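The requirement "for every finite subset \(J\)" matters: the product rule can hold for every pair of events and still fail for three events together. The sketch below assumes two fair coin flips and three illustrative events (first flip is heads, second flip is heads, both flips agree); the names `events` and `product_rule_holds` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

# Assumed toy sample space: two fair coin flips, four equally likely outcomes.
omega = list(product("HT", repeat=2))

def prob(event):
    return Fraction(len(event), len(omega))

events = {
    "A": {w for w in omega if w[0] == "H"},   # first flip is heads
    "B": {w for w in omega if w[1] == "H"},   # second flip is heads
    "C": {w for w in omega if w[0] == w[1]},  # both flips agree
}

def product_rule_holds(names):
    """Check P(intersection of the named events) == product of their probabilities."""
    intersection = set(omega)
    product_of_probs = Fraction(1)
    for name in names:
        intersection &= events[name]
        product_of_probs *= prob(events[name])
    return prob(intersection) == product_of_probs

# Every pair satisfies the product rule ...
pairwise = all(product_rule_holds(pair) for pair in combinations(events, 2))
# ... but the triple does not, so {A, B, C} is not an independent set of events.
triple = product_rule_holds(("A", "B", "C"))

print(pairwise, triple)
```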

Random Variables

A random variable \(Y\) is a mapping \[Y:\Omega\to\mathbb{R}\] that assigns a real number \(Y(\omega)\) to each outcome \(\omega\).

The cumulative distribution function (CDF) of \(Y\) is a function \(F_Y:\mathbb{R}\to [0,1]\) defined by \(F_Y (y)=P(Y\leq y)\).

\(Y\) is a discrete random variable if it takes countably many values \(\{y_1,y_2,\dots \}\).

The probability mass function (pmf) for \(Y\) is \(f_Y (y)=P(Y=y)\).

\(Y\) is a continuous random variable if there exists a function \(f_Y (y)\) such that:

  • \(f_Y (y)\geq 0\) for all \(y\),
  • \(\int_{-\infty}^\infty f_Y (y) dy = 1\), and
  • for \(a\leq b\), \(P(a<Y<b)=\int_a^b f_Y (y) dy\).

The function \(f_Y\) is called the probability density function (pdf).

Additionally, \(F_Y (y)=\int_{-\infty}^y f_Y (t) dt\) and \(f_Y (y)=F'_Y(y)\) at any point \(y\) at which \(F_Y\) is differentiable.
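As a small worked illustration (the density here is an assumed example, not one given above), take \(f_Y (y)=3y^2\) for \(0\leq y\leq 1\) and \(f_Y (y)=0\) otherwise. It is non-negative and integrates to 1, and for \(0\leq y\leq 1\)

\[F_Y (y)=\int_{-\infty}^{y} f_Y (t)\,dt=\int_0^y 3t^2\,dt=y^3,\]

so that \(F'_Y(y)=3y^2=f_Y (y)\) and

\[P(0.25<Y<0.5)=F_Y (0.5)-F_Y (0.25)=\frac{1}{8}-\frac{1}{64}=\frac{7}{64}.\]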

Expected value and Variance

The expected value, mean, or first moment of \(Y\) is defined as

\[E(Y)= \begin{cases}\sum\limits_y yf(y) & \text{if }Y\text{ is discrete}\\ \int yf(y) dy &\text{if } Y\text{ is continuous}\end{cases}\] assuming the sum and integral are well-defined.

The variance of \(Y\) is defined by \[var(Y)=E\left[(Y-E(Y))^2\right]\] and the standard deviation of \(Y\) is \[SD(Y)=\sqrt{var(Y)}.\]
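For a discrete random variable these definitions reduce to finite sums. The sketch below assumes \(Y\) is the result of one roll of a fair six-sided die (an illustrative choice) and uses exact fractions; `expectation` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed pmf: Y is one roll of a fair six-sided die.
pmf = {y: Fraction(1, 6) for y in range(1, 7)}

def expectation(pmf, g=lambda y: y):
    """E[g(Y)] = sum over y of g(y) * f(y) for a discrete random variable."""
    return sum(g(y) * p for y, p in pmf.items())

mean = expectation(pmf)                                 # E(Y)
variance = expectation(pmf, lambda y: (y - mean) ** 2)  # E[(Y - E(Y))^2]
sd = float(variance) ** 0.5                             # SD(Y)

print(mean, variance, round(sd, 4))  # 7/2 35/12 1.7078
```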

Bivariate distributions

Consider two random variables \(X\) and \(Y\). Let \(S\) be the joint support of \(X\) and \(Y\) (all possible pairs of values \((x,y)\)).

The joint CDF of the random variables is \[F(x,y)=P(X\leq x,Y\leq y)\].

If \(X\) and \(Y\) are jointly discrete, the joint pmf \(f(x,y)\) specifies \(P(X=x,Y=y)\) and satisfies the following properties:

  • \(0\leq f(x,y)\leq 1\)
  • \(\sum\sum_{(x,y)\in S} f(x,y) =1\)
  • \(P((X,Y)\in A)=\sum\sum_{(x,y)\in A} f(x,y)\) .

Marginal Density of Discrete RV

If \(X\) and \(Y\) are jointly discrete with joint pmf \(f(x,y)\), then the marginal pmf of \(X\), \(f_X (x)\) is obtained via the formula

\[f_X (x)=\sum _{\text{all }y}f(x,y)\] and

\[E(XY)=\sum_{\text{all }x}\sum_{\text{all }y}xyf(x,y).\]
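The two sums above translate directly into loops over a joint pmf table. The sketch below uses an assumed \(2\times 2\) joint pmf chosen purely for illustration; `marginal` is a hypothetical helper name.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed joint pmf f(x, y) on the support {0, 1} x {0, 1}.
joint = {
    (0, 0): Fraction(3, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def marginal(joint, axis):
    """Sum the joint pmf over the other variable (axis 0 -> f_X, axis 1 -> f_Y)."""
    out = defaultdict(Fraction)
    for point, p in joint.items():
        out[point[axis]] += p
    return dict(out)

f_x = marginal(joint, 0)
f_y = marginal(joint, 1)

# E(XY) = sum over all x and y of x * y * f(x, y).
e_xy = sum(x * y * p for (x, y), p in joint.items())

print(f_x, f_y, e_xy)
```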

Joint Density of Continuous RV

If \(X\) and \(Y\) are jointly continuous, \(f(x,y)\) is the joint pdf if:

  • \(f(x,y)\geq 0\) for all \((x,y)\in S\)
  • \(\int_{-\infty}^\infty\int_{-\infty}^\infty f(x,y) dx dy=1\)
  • \(P\left((X,Y)\in A\right)=\int \int_{(x,y)\in A} f(x,y)dx dy\).

If \(X\) and \(Y\) are jointly continuous with joint pdf \(f(x,y)\), then the marginal density of \(X\), \(f_X (x)\) is obtained via the formula

\[f_X (x)=\int_{-\infty}^\infty f(x,y) dy\] and

\[E(XY)=\int_{-\infty}^\infty\int_{-\infty}^\infty xyf(x,y)dy dx.\]

Covariance

The covariance between random variables \(Y\) and \(Z\) is \[cov(Y,Z)=E[(Y-E(Y))(Z-E(Z))]=E(YZ)-E(Y)E(Z).\]
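The second equality follows by expanding the product and using linearity of expectation, treating \(E(Y)\) and \(E(Z)\) as constants:

\[\begin{aligned} E[(Y-E(Y))(Z-E(Z))] &= E[YZ-YE(Z)-E(Y)Z+E(Y)E(Z)]\\ &= E(YZ)-E(Y)E(Z)-E(Y)E(Z)+E(Y)E(Z)\\ &= E(YZ)-E(Y)E(Z).\end{aligned}\]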

Properties of Variance and Covariance

Let \(a\) and \(b\) be scalar constants. Then:

  • \(E(aY)=aE(Y)\)
  • \(E(a+Y)=a+E(Y)\)
  • \(E(aY+bZ)=aE(Y)+bE(Z)\)
  • \(var(aY)=a^2 var(Y)\)
  • \(var(a+Y)=var(Y)\)
  • \(cov(aY,bZ)=ab cov(Y,Z).\)
  • \(var(Y+Z)=var(Y)+var(Z)+2cov(Y,Z).\)
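The last two properties can be checked numerically on any joint pmf. The sketch below assumes a small joint pmf for \((Y,Z)\) and the constants \(a=2\), \(b=-3\), all chosen for illustration; the helper names `expect`, `var`, and `cov` are hypothetical.

```python
from fractions import Fraction

# Assumed joint pmf of (Y, Z) on {0, 1} x {0, 1}.
joint = {
    (0, 0): Fraction(3, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def expect(g):
    """E[g(Y, Z)] under the joint pmf."""
    return sum(g(y, z) * p for (y, z), p in joint.items())

def var(g):
    m = expect(g)
    return expect(lambda y, z: (g(y, z) - m) ** 2)

def cov(g, h):
    return expect(lambda y, z: g(y, z) * h(y, z)) - expect(g) * expect(h)

def Y(y, z):
    return y

def Z(y, z):
    return z

a, b = 2, -3

# var(Y + Z) = var(Y) + var(Z) + 2 cov(Y, Z)
lhs_sum = var(lambda y, z: y + z)
rhs_sum = var(Y) + var(Z) + 2 * cov(Y, Z)

# cov(aY, bZ) = ab cov(Y, Z)
lhs_scaled = cov(lambda y, z: a * y, lambda y, z: b * z)
rhs_scaled = a * b * cov(Y, Z)

print(lhs_sum, rhs_sum, lhs_scaled, rhs_scaled)
```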

Independence

\(Y\) and \(Z\) are independent if \(F(y,z)=F_Y (y) F_Z (z).\)

If \(Y\) and \(Z\) are independent, then:

  • \(E(YZ)=E(Y)E(Z)\)
  • \(cov(Y,Z)=0\).

Properties of random vectors

Let \(\mathbf{y}=(Y_1,Y_2,\dots,Y_n )^T\) be an \(n\times 1\) vector of random variables. \(\mathbf{y}\) is a random vector.

  • A vector is always defined to be a column vector, even if the notation is ambiguous.

\[E(\mathbf{y})=\begin{pmatrix}E(Y_1)\\E(Y_2)\\\vdots\\E(Y_n)\end{pmatrix}\]

The variance (covariance matrix) of \(\mathbf{y}\) is

\[\begin{aligned} var(\mathbf{y})= & E(\mathbf{y}\mathbf{y}^T )-E(\mathbf{y})E(\mathbf{y})^T\\ =& \begin{pmatrix}var(Y_1) & cov(Y_1,Y_2) &\dots &cov(Y_1,Y_n)\\cov(Y_2,Y_1 )&var(Y_2)&\dots&cov(Y_2,Y_n)\\\vdots&\vdots&\ddots&\vdots\\ cov(Y_n,Y_1)&cov(Y_n,Y_2)&\dots&var(Y_n)\end{pmatrix}.\end{aligned}\]

Important Properties

Define:

  • \(A\) to be an \(m\times n\) matrix of constants
  • \(\mathbf{x}=(X_1,X_2,\dots,X_n )^T\) and \(\mathbf{z}=(Z_1,Z_2,\dots,Z_n )^T\) to be \(n\times 1\) random vectors.

The covariance between the random vectors \(\mathbf{x}\) and \(\mathbf{y}\) is defined as \[cov(\mathbf{x},\mathbf{y})=E(\mathbf{x}\mathbf{y}^T )-E(\mathbf{x})E(\mathbf{y})^T.\]

Additionally:

  • \(E(A\mathbf{y})=AE(\mathbf{y}), E(\mathbf{y}A^T )=E(\mathbf{y}) A^T.\)
  • \(E(\mathbf{x}+\mathbf{y})=E(\mathbf{x})+E(\mathbf{y})\)
  • \(var(A\mathbf{y})=A\ var(\mathbf{y}) A^T\)
  • \(cov(\mathbf{x}+\mathbf{y},\mathbf{z})=cov(\mathbf{x},\mathbf{z})+cov(\mathbf{y},\mathbf{z})\)
  • \(cov(\mathbf{x},\mathbf{y}+\mathbf{z})=cov(\mathbf{x},\mathbf{y})+cov(\mathbf{x},\mathbf{z})\)
  • \(cov(A\mathbf{x},\mathbf{y})=A\ cov(\mathbf{x},\mathbf{y})\text{ and } cov(\mathbf{x},A\mathbf{y})=cov(\mathbf{x},\mathbf{y}) A^T.\)
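Two of these identities, \(E(A\mathbf{y})=AE(\mathbf{y})\) and \(var(A\mathbf{y})=A\ var(\mathbf{y}) A^T\), can be checked exactly on a small discrete random vector. In the sketch below the support points, probabilities, and the \(3\times 2\) matrix `A` are assumptions made for illustration, and the helper functions are hypothetical names written in plain Python rather than taken from a matrix library.

```python
from fractions import Fraction

# Assumed discrete random vector y = (Y1, Y2)^T: support points and probabilities.
support = [(0, 0), (1, 0), (1, 1), (2, 3)]
probs = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 3), Fraction(1, 6)]

A = [[1, -1], [2, 1], [0, 3]]  # assumed 3 x 2 matrix of constants

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

def transform(M, point):
    """Matrix-vector product M @ point, returned as a tuple."""
    return tuple(sum(m * v for m, v in zip(row, point)) for row in M)

def mean_vector(points):
    n = len(points[0])
    return [sum(p * pt[i] for pt, p in zip(points, probs)) for i in range(n)]

def var_matrix(points):
    """var(y) = E(y y^T) - E(y) E(y)^T, computed entry by entry."""
    mu = mean_vector(points)
    n = len(mu)
    return [
        [sum(p * pt[i] * pt[j] for pt, p in zip(points, probs)) - mu[i] * mu[j]
         for j in range(n)]
        for i in range(n)
    ]

transformed = [transform(A, pt) for pt in support]  # support of A y

mean_ok = mean_vector(transformed) == list(transform(A, mean_vector(support)))
lhs = var_matrix(transformed)                                # var(A y)
rhs = matmul(matmul(A, var_matrix(support)), transpose(A))   # A var(y) A^T

print(mean_ok, lhs == rhs)
```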

Let \(\mathbf{a}\) be an \(n\times 1\) vector of constants and \(\mathbf{0}_{n\times n}\) be an \(n\times n\) matrix of zeros. Then

\[var(\mathbf{a})=\mathbf{0}_{n\times n},\]

\[cov(\mathbf{a},\mathbf{y})=\mathbf{0}_{n\times n},\]

and

\[var(\mathbf{a}+\mathbf{y})=var(\mathbf{y}).\]

Multivariate normal (Gaussian) distribution

\(\mathbf{y}=(Y_1,\dots,Y_n )^T\) has a multivariate normal distribution with mean \(\mu\) (an \(n\times 1\) vector) and covariance \(\Sigma\) (an \(n\times n\) matrix) if the joint pdf is

\[f(\mathbf{y})=\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2} } \exp\left(-\frac{1}{2} (\mathbf{y}-\mathbf{\mu})^T \Sigma^{-1} (\mathbf{y}-\mathbf{\mu})\right).\]

Note that \(\Sigma\) must be symmetric and positive definite.

We denote this as \(\mathbf{y}\sim N(\mathbf{\mu},\Sigma)\).

Properties

Important fact: A linear function of a multivariate normal random vector (i.e., \(a+A\mathbf{y}\)) is also multivariate normal (though it could collapse to a single random variable).

Application: Suppose that \(\mathbf{y}\sim N(\mu,\Sigma)\). For an \(m\times n\) matrix of constants \(A\), \(A\mathbf{y}\sim N(A\mu,A\Sigma A^T)\).
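For instance, take \(n=2\) with \(\mu=(\mu_1,\mu_2)^T\), \(\Sigma=\begin{pmatrix}\sigma_1^2&\sigma_{12}\\\sigma_{12}&\sigma_2^2\end{pmatrix}\), and the \(1\times 2\) matrix \(A=(1,\ -1)\) (symbols introduced here only for illustration). Then

\[A\mu=\mu_1-\mu_2,\qquad A\Sigma A^T=\sigma_1^2+\sigma_2^2-2\sigma_{12},\]

so the difference \(Y_1-Y_2\) is a single normal random variable with mean \(\mu_1-\mu_2\) and variance \(\sigma_1^2+\sigma_2^2-2\sigma_{12}\). This agrees with the scalar rule \(var(Y+Z)=var(Y)+var(Z)+2cov(Y,Z)\) applied to \(Y=Y_1\) and \(Z=-Y_2\).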

Example

Gasoline is to be stocked in a bulk tank once at the beginning of each week and then sold to individual customers. Let \(Y_1\) denote the proportion of the capacity of the bulk tank that is available after the tank is stocked at the beginning of the week. Because of limited supplies, \(Y_1\) varies from week to week. Let \(Y_2\) denote the proportion of the capacity of the bulk tank that is sold during the week. Because \(Y_1\) and \(Y_2\) are both proportions, both variables are between 0 and 1. Further, the amount sold, \(y_2\), cannot exceed the amount available, \(y_1\). Suppose the joint density function for \(Y_1\) and \(Y_2\) is given by \[f(y_1,y_2 )=3y_1,\quad 0\leq y_2\leq y_1\leq 1,\] and \(f(y_1,y_2)=0\) elsewhere.

Problem 1

Determine \(P(0\leq Y_1\leq 0.5;\ 0.25\leq Y_2)\)

Problem 2

Determine \(f_{Y_1 }\) and \(f_{Y_2 }\)

Problem 3

Determine \(E(Y_1)\) and \(E(Y_2)\)

Problem 4

Determine \(var(Y_1)\) and \(var(Y_2)\)

Problem 5

Determine \(E(Y_1 Y_2)\)

Problem 6

Determine \(cov(Y_1,Y_2)\)

Problem 7

Determine the mean and variance of \(\mathbf{a}^T \mathbf{y}\), where \(\mathbf{a}=(1,-1)^T\) and \(\mathbf{y}=(Y_1,Y_2 )^T\). These are the expectation and variance of the difference between the amount of gas available and the amount of gas sold.
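One way to check hand-computed answers to Problems 3 through 7 is to note that, for this density, integrating \(y_2\) over \([0,y_1]\) and then \(y_1\) over \([0,1]\) gives \(E(Y_1^a Y_2^b)=\frac{3}{(b+1)(a+b+3)}\) for non-negative integers \(a\) and \(b\). That closed form is worked out here rather than taken from the problem statement, and `moment` is a hypothetical helper name; treat the sketch as a cross-check, not as the intended solution method.

```python
from fractions import Fraction

def moment(a, b):
    """E(Y1^a * Y2^b) for f(y1, y2) = 3*y1 on 0 <= y2 <= y1 <= 1.

    The inner integral over y2 in [0, y1] gives 3 * y1^(a+b+2) / (b+1);
    the outer integral over y1 in [0, 1] gives 3 / ((b+1) * (a+b+3)).
    """
    return Fraction(3, (b + 1) * (a + b + 3))

total_mass = moment(0, 0)                    # should be 1 for a valid density
mean = [moment(1, 0), moment(0, 1)]          # E(Y1), E(Y2)
var_y1 = moment(2, 0) - mean[0] ** 2         # var(Y1)
var_y2 = moment(0, 2) - mean[1] ** 2         # var(Y2)
e_y1y2 = moment(1, 1)                        # E(Y1 Y2)
cov_12 = e_y1y2 - mean[0] * mean[1]          # cov(Y1, Y2)
sigma = [[var_y1, cov_12], [cov_12, var_y2]]  # var(y)

# Mean and variance of a^T y with a = (1, -1)^T: a^T E(y) and a^T var(y) a.
a = [1, -1]
mean_diff = sum(ai * mi for ai, mi in zip(a, mean))
var_diff = sum(a[i] * sigma[i][j] * a[j] for i in range(2) for j in range(2))

print(mean, var_y1, var_y2, e_y1y2, cov_12, mean_diff, var_diff)
```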

Matrix Differentiation

Matrix Differentiation 1

Let \[\mathbf{y}=\mathbf{A}\mathbf{x},\] where \(\mathbf{y}\) is \(m\times 1\), \(\mathbf{x}\) is \(n\times 1\), \(\mathbf{A}\) is \(m\times n\), and \(\mathbf{A}\) does not depend on \(\mathbf{x}\). Then \[\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\mathbf{A}.\]

Matrix Differentiation 1 (Proof)

Since the \(i\)th element of \(\mathbf{y}\) is given by \[y_i=\sum\limits_{k=1}^{n}a_{ik}x_k,\] it follows that \[\frac{\partial y_i}{\partial x_j}=a_{ij}\] for all \(i=1,\dots ,m,\quad j=1,\dots ,n\). Hence \[\frac{\partial \mathbf{y}}{\partial \mathbf{x}}=\mathbf{A}.\]

Matrix Differentiation 2

Let the scalar \(\alpha\) be defined by \[\alpha =\mathbf{y}^T\mathbf{Ax},\] where \(\mathbf{y}\) is \(m\times 1\), \(\mathbf{x}\) is \(n\times 1\), \(\mathbf{A}\) is \(m\times n\), and \(\mathbf{A}\) does not depend on \(\mathbf{x}\) or \(\mathbf{y}\). Then \[\frac{\partial \alpha }{\partial \mathbf{x}}=\mathbf{y}^T\mathbf{A}\] and \[\frac{\partial \alpha }{\partial \mathbf{y}}=\mathbf{x}^T\mathbf{A}^T.\]

Matrix Differentiation 2 (Proof)

Define \(\mathbf{w}^T=\mathbf{y}^T\mathbf{A}\) and note that \(\alpha =\mathbf{w}^T\mathbf{x}\).

Hence, by Matrix Differentiation 1, \[\frac{\partial \alpha}{\partial \mathbf{x}}=\mathbf{w}^T=\mathbf{y}^T\mathbf{A}.\] Since \(\alpha\) is a scalar, we can write \[\alpha =\alpha^T=\mathbf{x}^T\mathbf{A}^T\mathbf{y},\] and hence \[\frac{\partial \alpha }{\partial \mathbf{y}}=\mathbf{x}^T\mathbf{A}^T.\]

Matrix Differentiation 3

Consider the special case in which the scalar \(\alpha\) is given by the quadratic form \[\alpha=\mathbf{x}^T\mathbf{A}\mathbf{x},\] where \(\mathbf{x}\) is \(n\times 1\), \(\mathbf{A}\) is \(n\times n\), and \(\mathbf{A}\) does not depend on \(\mathbf{x}\). Then \[\frac{\partial \alpha}{\partial \mathbf{x}}=\mathbf{x}^T(\mathbf{A}+\mathbf{A}^T).\]

Matrix Differentiation 3 (Proof)

By definition, \[\alpha =\sum\limits_{j=1}^{n}\sum\limits_{i=1}^{n}a_{ij}x_ix_j.\] Differentiating with respect to the \(k\)th element of \(\mathbf{x}\) we have \[\frac{\partial \alpha}{\partial x_k}=\sum\limits_{j=1}^{n}a_{kj}x_j+\sum\limits_{i=1}^{n}a_{ik}x_i\] for all \(k=1,\dots ,n\), and consequently, \[\frac{\partial \alpha }{\partial \mathbf{x}}=\mathbf{x}^T\mathbf{A}^T+\mathbf{x}^T\mathbf{A}=\mathbf{x}^T(\mathbf{A}^T+\mathbf{A}).\]
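The result can be sanity-checked against a finite-difference gradient. The sketch below assumes a small non-symmetric matrix `A` and a point `x` chosen for illustration. Because \(\alpha\) is quadratic in \(\mathbf{x}\), the central difference is exact, so exact fractions allow an equality check instead of a tolerance.

```python
from fractions import Fraction

# Assumed example: a deliberately non-symmetric 3 x 3 matrix and a test point.
A = [[2, -1, 0], [4, 3, 1], [5, 0, -2]]
x = [Fraction(1), Fraction(-2), Fraction(3)]
n = len(x)

def quad_form(v):
    """alpha = v^T A v = sum over i, j of a_ij * v_i * v_j."""
    return sum(A[i][j] * v[i] * v[j] for i in range(n) for j in range(n))

def central_difference(k, h=Fraction(1, 10)):
    """Approximate d(alpha)/d(x_k) at x; exact here because alpha is quadratic."""
    up = [xi + h if i == k else xi for i, xi in enumerate(x)]
    down = [xi - h if i == k else xi for i, xi in enumerate(x)]
    return (quad_form(up) - quad_form(down)) / (2 * h)

numeric = [central_difference(k) for k in range(n)]

# Entry k of x^T (A + A^T) is the sum over i of x_i * (a_ik + a_ki).
formula = [sum(x[i] * (A[i][k] + A[k][i]) for i in range(n)) for k in range(n)]

print(numeric == formula, formula)
```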

Matrix Differentiation 3.5

Consider the special case where \(\mathbf{A}\) is a symmetric matrix and \[\alpha=\mathbf{x}^T\mathbf{A}\mathbf{x},\] where \(\mathbf{x}\) is \(n\times 1\), \(\mathbf{A}\) is \(n\times n\), and \(\mathbf{A}\) does not depend on \(\mathbf{x}\). Since \(\mathbf{A}^T=\mathbf{A}\), Matrix Differentiation 3 gives \[\frac{\partial \alpha}{\partial \mathbf{x}}=2\mathbf{x}^T\mathbf{A}.\]

Matrix Differentiation 4

Let the scalar \(\alpha\) be defined by\[\alpha =\mathbf{y}^T\mathbf{x}\] where \(\mathbf{y}\) is \(n\times 1\), \(\mathbf{x}\) is \(n\times 1\), and both \(\mathbf{y}\) and \(\mathbf{x}\) are functions of the vector \(\mathbf{z}\). Then \[\frac{\partial \alpha}{\partial \mathbf{z}}=\mathbf{x}^T\frac{\partial \mathbf{y}}{\partial \mathbf{z}}+\mathbf{y}^T\frac{\partial \mathbf{x}}{\partial \mathbf{z}}\]

Matrix Differentiation 4.5

Let the scalar \(\alpha\) be defined by \[\alpha =\mathbf{x}^T\mathbf{x}\] where \(\mathbf{x}\) is \(n\times 1\), and \(\mathbf{x}\) is a function of the vector \(\mathbf{z}\). Then \[\frac{\partial \alpha }{\partial \mathbf{z}}=2\mathbf{x}^T\frac{\partial \mathbf{x}}{\partial \mathbf{z}}.\]
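This is Matrix Differentiation 4 with \(\mathbf{y}=\mathbf{x}\). Substituting into that rule gives

\[\frac{\partial \alpha}{\partial \mathbf{z}}=\mathbf{x}^T\frac{\partial \mathbf{x}}{\partial \mathbf{z}}+\mathbf{x}^T\frac{\partial \mathbf{x}}{\partial \mathbf{z}}=2\mathbf{x}^T\frac{\partial \mathbf{x}}{\partial \mathbf{z}}.\]

As a quick check, if \(\mathbf{x}=\mathbf{z}\) then \(\frac{\partial \mathbf{x}}{\partial \mathbf{z}}\) is the identity matrix and the formula reduces to \(2\mathbf{x}^T\), which matches Matrix Differentiation 3.5 with \(\mathbf{A}\) equal to the identity.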