In statistics, the matrix normal distribution or matrix Gaussian distribution is a probability distribution that is a generalization of the multivariate normal distribution to matrix-valued random variables.
Definition
The probability density function for the random $n \times p$ matrix $\mathbf{X}$ that follows the matrix normal distribution $\mathcal{MN}_{n,p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$ has the form:

$$p(\mathbf{X} \mid \mathbf{M}, \mathbf{U}, \mathbf{V}) = \frac{\exp\left(-\tfrac{1}{2}\operatorname{tr}\left[\mathbf{V}^{-1}(\mathbf{X}-\mathbf{M})^{T}\mathbf{U}^{-1}(\mathbf{X}-\mathbf{M})\right]\right)}{(2\pi)^{np/2}\,|\mathbf{V}|^{n/2}\,|\mathbf{U}|^{p/2}}$$

where $\operatorname{tr}$ denotes the trace, $\mathbf{M}$ is $n \times p$, $\mathbf{U}$ is $n \times n$, and $\mathbf{V}$ is $p \times p$. The density is understood as the probability density function with respect to the standard Lebesgue measure in $\mathbb{R}^{n \times p}$, i.e. the measure corresponding to integration with respect to $dx_{11}\,dx_{21}\dots dx_{n1}\,dx_{12}\dots dx_{n2}\dots dx_{np}$.
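As a quick numerical illustration, the density can be evaluated by transcribing the formula directly. A minimal sketch, assuming NumPy and SciPy are available (SciPy's `scipy.stats.matrix_normal` uses the same parameterization, with `rowcov` playing the role of $\mathbf{U}$ and `colcov` the role of $\mathbf{V}$); the helper name `matnorm_logpdf` is our own, not a library function:

```python
import numpy as np
from scipy.stats import matrix_normal

def matnorm_logpdf(X, M, U, V):
    """Direct transcription of the matrix normal density above, on the log scale."""
    n, p = X.shape
    D = X - M
    # tr[V^{-1} D^T U^{-1} D], computed with solves rather than explicit inverses
    quad = np.trace(np.linalg.solve(V, D.T) @ np.linalg.solve(U, D))
    logdet_U = np.linalg.slogdet(U)[1]
    logdet_V = np.linalg.slogdet(V)[1]
    return -0.5 * (quad + n * p * np.log(2 * np.pi)
                   + n * logdet_V + p * logdet_U)

rng = np.random.default_rng(0)
n, p = 4, 3
M = rng.standard_normal((n, p))
A = rng.standard_normal((n, n)); U = A @ A.T + n * np.eye(n)  # positive definite
B = rng.standard_normal((p, p)); V = B @ B.T + p * np.eye(p)
X = rng.standard_normal((n, p))

print(matnorm_logpdf(X, M, U, V))
print(matrix_normal.logpdf(X, mean=M, rowcov=U, colcov=V))  # should agree
```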
The matrix normal is related to the multivariate normal distribution in the following way:
$$\mathbf{X} \sim \mathcal{MN}_{n \times p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$$

if and only if

$$\operatorname{vec}(\mathbf{X}) \sim \mathcal{N}_{np}(\operatorname{vec}(\mathbf{M}), \mathbf{V} \otimes \mathbf{U})$$

where $\otimes$ denotes the Kronecker product and $\operatorname{vec}(\mathbf{M})$ denotes the vectorization of $\mathbf{M}$.
The equivalence between the above matrix normal and multivariate normal density functions can be shown using several properties of the trace and Kronecker product, as follows. We start with the argument of the exponent of the matrix normal PDF:
$$\begin{aligned}
&-\tfrac{1}{2}\operatorname{tr}\left[\mathbf{V}^{-1}(\mathbf{X}-\mathbf{M})^{T}\mathbf{U}^{-1}(\mathbf{X}-\mathbf{M})\right] \\
&= -\tfrac{1}{2}\operatorname{vec}(\mathbf{X}-\mathbf{M})^{T}\operatorname{vec}\left(\mathbf{U}^{-1}(\mathbf{X}-\mathbf{M})\mathbf{V}^{-1}\right) \\
&= -\tfrac{1}{2}\operatorname{vec}(\mathbf{X}-\mathbf{M})^{T}\left(\mathbf{V}^{-1}\otimes\mathbf{U}^{-1}\right)\operatorname{vec}(\mathbf{X}-\mathbf{M}) \\
&= -\tfrac{1}{2}\left[\operatorname{vec}(\mathbf{X})-\operatorname{vec}(\mathbf{M})\right]^{T}(\mathbf{V}\otimes\mathbf{U})^{-1}\left[\operatorname{vec}(\mathbf{X})-\operatorname{vec}(\mathbf{M})\right]
\end{aligned}$$

Here the second line uses the cyclic property of the trace together with the identity $\operatorname{tr}(\mathbf{A}^{T}\mathbf{B}) = \operatorname{vec}(\mathbf{A})^{T}\operatorname{vec}(\mathbf{B})$, and the third uses $\operatorname{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^{T}\otimes\mathbf{A})\operatorname{vec}(\mathbf{B})$. The result is the argument of the exponent of the multivariate normal PDF with respect to Lebesgue measure in $\mathbb{R}^{np}$. The proof is completed by using the determinant property $|\mathbf{V}\otimes\mathbf{U}| = |\mathbf{V}|^{n}|\mathbf{U}|^{p}$.
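The equivalence can also be checked numerically. A small sketch, assuming SciPy's `matrix_normal` and `multivariate_normal`; the column-stacking vectorization corresponds to flattening in Fortran order:

```python
import numpy as np
from scipy.stats import matrix_normal, multivariate_normal

rng = np.random.default_rng(1)
n, p = 3, 2
M = rng.standard_normal((n, p))
A = rng.standard_normal((n, n)); U = A @ A.T + n * np.eye(n)
B = rng.standard_normal((p, p)); V = B @ B.T + p * np.eye(p)
X = rng.standard_normal((n, p))

vec = lambda Z: Z.flatten(order="F")  # column-stacking vectorization
lp_matrix = matrix_normal.logpdf(X, mean=M, rowcov=U, colcov=V)
lp_vector = multivariate_normal.logpdf(vec(X), mean=vec(M), cov=np.kron(V, U))
print(np.isclose(lp_matrix, lp_vector))  # True
```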
Properties
If $\mathbf{X} \sim \mathcal{MN}_{n \times p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$, then we have the following properties:[1][2]
The mean, or expected value, is:

$$E[\mathbf{X}] = \mathbf{M}$$

and we have the following second-order expectations:

$$E[(\mathbf{X}-\mathbf{M})(\mathbf{X}-\mathbf{M})^{T}] = \mathbf{U}\operatorname{tr}(\mathbf{V})$$
$$E[(\mathbf{X}-\mathbf{M})^{T}(\mathbf{X}-\mathbf{M})] = \mathbf{V}\operatorname{tr}(\mathbf{U})$$

where $\operatorname{tr}$ denotes the trace.
More generally, for appropriately dimensioned matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$:

$$\begin{aligned}
E[\mathbf{X}\mathbf{A}\mathbf{X}^{T}] &= \mathbf{U}\operatorname{tr}(\mathbf{A}^{T}\mathbf{V}) + \mathbf{M}\mathbf{A}\mathbf{M}^{T} \\
E[\mathbf{X}^{T}\mathbf{B}\mathbf{X}] &= \mathbf{V}\operatorname{tr}(\mathbf{U}\mathbf{B}^{T}) + \mathbf{M}^{T}\mathbf{B}\mathbf{M} \\
E[\mathbf{X}\mathbf{C}\mathbf{X}] &= \mathbf{U}\mathbf{C}^{T}\mathbf{V} + \mathbf{M}\mathbf{C}\mathbf{M}
\end{aligned}$$

(In the last identity, $\mathbf{C}$ must be $p \times n$; note that $\mathbf{U}\mathbf{C}^{T}\mathbf{V}$ is the only ordering of these factors that conforms dimensionally.)
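These identities lend themselves to a Monte Carlo spot-check. A sketch, assuming `scipy.stats.matrix_normal` for sampling; the tolerances are deliberately loose since the averages are over a finite number of draws:

```python
import numpy as np
from scipy.stats import matrix_normal

rng = np.random.default_rng(2)
n, p, draws = 3, 2, 100_000
M = rng.standard_normal((n, p))
U = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
V = np.array([[1.0, 0.3], [0.3, 2.0]])
A = rng.standard_normal((p, p))  # for E[X A X^T]
C = rng.standard_normal((p, n))  # for E[X C X]

Xs = matrix_normal.rvs(mean=M, rowcov=U, colcov=V, size=draws, random_state=rng)
D = Xs - M

# E[(X - M)(X - M)^T] = U tr(V)
print(np.allclose(np.einsum('kij,klj->il', D, D) / draws,
                  U * np.trace(V), atol=0.15))
# E[X A X^T] = U tr(A^T V) + M A M^T
print(np.allclose(np.einsum('kij,jl,kml->im', Xs, A, Xs) / draws,
                  U * np.trace(A.T @ V) + M @ A @ M.T, atol=0.15))
# E[X C X] = U C^T V + M C M
print(np.allclose(np.einsum('kij,jl,klm->im', Xs, C, Xs) / draws,
                  U @ C.T @ V + M @ C @ M, atol=0.15))
```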
Transpose transform:

$$\mathbf{X}^{T} \sim \mathcal{MN}_{p \times n}(\mathbf{M}^{T}, \mathbf{V}, \mathbf{U})$$

Linear transform: let $\mathbf{D}$ ($r \times n$) be of full rank $r \leq n$ and $\mathbf{C}$ ($p \times s$) be of full rank $s \leq p$; then:

$$\mathbf{D}\mathbf{X}\mathbf{C} \sim \mathcal{MN}_{r \times s}(\mathbf{D}\mathbf{M}\mathbf{C}, \mathbf{D}\mathbf{U}\mathbf{D}^{T}, \mathbf{C}^{T}\mathbf{V}\mathbf{C})$$

Example
Consider a sample of $n$ independent $p$-dimensional random vectors, identically distributed according to a multivariate normal distribution:
$$\mathbf{Y}_i \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \quad \text{with} \quad i \in \{1, \ldots, n\}.$$

Defining the $n \times p$ matrix $\mathbf{X}$ whose $i$-th row is $\mathbf{Y}_i$, we obtain:

$$\mathbf{X} \sim \mathcal{MN}_{n \times p}(\mathbf{M}, \mathbf{U}, \mathbf{V})$$

where each row of $\mathbf{M}$ is equal to $\boldsymbol{\mu}$, that is $\mathbf{M} = \mathbf{1}_n \boldsymbol{\mu}^{T}$; $\mathbf{U}$ is the $n \times n$ identity matrix, reflecting the independence of the rows; and $\mathbf{V} = \boldsymbol{\Sigma}$.
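A numerical sanity check of this construction, assuming SciPy: the joint log-density of the $n$ independent rows should match the matrix normal log-density with $\mathbf{U} = \mathbf{I}_n$ and $\mathbf{V} = \boldsymbol{\Sigma}$.

```python
import numpy as np
from scipy.stats import matrix_normal, multivariate_normal

rng = np.random.default_rng(3)
n, p = 5, 3
mu = rng.standard_normal(p)
S = rng.standard_normal((p, p)); Sigma = S @ S.T + p * np.eye(p)

X = rng.standard_normal((n, p))  # arbitrary point at which to evaluate both densities

# Joint density of n i.i.d. N_p(mu, Sigma) rows...
lp_rows = sum(multivariate_normal.logpdf(X[i], mean=mu, cov=Sigma) for i in range(n))
# ...equals the matrix normal density with M = 1_n mu^T, U = I_n, V = Sigma.
M = np.ones((n, 1)) @ mu[None, :]
lp_mat = matrix_normal.logpdf(X, mean=M, rowcov=np.eye(n), colcov=Sigma)
print(np.isclose(lp_rows, lp_mat))  # True
```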
Maximum likelihood parameter estimation
Given $k$ matrices, each of size $n \times p$, denoted $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_k$, which we assume have been sampled i.i.d. from a matrix normal distribution, the maximum likelihood estimate of the parameters can be obtained by maximizing:
$$\prod_{i=1}^{k} \mathcal{MN}_{n \times p}(\mathbf{X}_i \mid \mathbf{M}, \mathbf{U}, \mathbf{V}).$$

The solution for the mean has a closed form, namely

$$\mathbf{M} = \frac{1}{k} \sum_{i=1}^{k} \mathbf{X}_i,$$

but the covariance parameters do not. However, these parameters can be iteratively maximized by zeroing their gradients at:
$$\mathbf{U} = \frac{1}{kp} \sum_{i=1}^{k} (\mathbf{X}_i - \mathbf{M}) \mathbf{V}^{-1} (\mathbf{X}_i - \mathbf{M})^{T}$$

and

$$\mathbf{V} = \frac{1}{kn} \sum_{i=1}^{k} (\mathbf{X}_i - \mathbf{M})^{T} \mathbf{U}^{-1} (\mathbf{X}_i - \mathbf{M}).$$

See for example [3] and references therein. The covariance parameters are non-identifiable in the sense that for any scale factor $s > 0$, we have:

$$\mathcal{MN}_{n \times p}(\mathbf{X} \mid \mathbf{M}, \mathbf{U}, \mathbf{V}) = \mathcal{MN}_{n \times p}(\mathbf{X} \mid \mathbf{M}, s\mathbf{U}, \tfrac{1}{s}\mathbf{V}).$$
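A minimal sketch of the resulting alternating ("flip-flop") iteration, assuming NumPy; the function name `matnorm_mle` and the trace normalization used to resolve the scale indeterminacy are our own choices, not a standard API:

```python
import numpy as np

def matnorm_mle(Xs, n_iter=100, tol=1e-8):
    """Flip-flop ML estimation for MN(M, U, V) from an array Xs of shape (k, n, p)."""
    k, n, p = Xs.shape
    M = Xs.mean(axis=0)                       # closed-form mean estimate
    D = Xs - M
    U, V = np.eye(n), np.eye(p)
    for _ in range(n_iter):
        # U update: (1/kp) sum_i (X_i - M) V^{-1} (X_i - M)^T
        U_new = sum(Di @ np.linalg.solve(V, Di.T) for Di in D) / (k * p)
        U_new *= n / np.trace(U_new)          # resolve the (sU, V/s) ambiguity
        # V update: (1/kn) sum_i (X_i - M)^T U^{-1} (X_i - M)
        V_new = sum(Di.T @ np.linalg.solve(U_new, Di) for Di in D) / (k * n)
        done = np.allclose(U_new, U, atol=tol) and np.allclose(V_new, V, atol=tol)
        U, V = U_new, V_new
        if done:
            break
    return M, U, V
```

Each sweep plugs the current estimate of one covariance factor into the stationarity condition for the other; since each update maximizes the likelihood in one block given the other, the likelihood is non-decreasing across sweeps.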
Drawing values from the distribution
Sampling from the matrix normal distribution is a special case of the sampling procedure for the multivariate normal distribution. Let $\mathbf{X}$ be an $n \times p$ matrix of $np$ independent samples from the standard normal distribution, so that
$$\mathbf{X} \sim \mathcal{MN}_{n \times p}(\mathbf{0}, \mathbf{I}, \mathbf{I}).$$

Then let

$$\mathbf{Y} = \mathbf{M} + \mathbf{A}\mathbf{X}\mathbf{B},$$

so that

$$\mathbf{Y} \sim \mathcal{MN}_{n \times p}(\mathbf{M}, \mathbf{A}\mathbf{A}^{T}, \mathbf{B}^{T}\mathbf{B}),$$

where $\mathbf{A}$ and $\mathbf{B}$ can be chosen by Cholesky decomposition or a similar matrix square root operation.
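A sketch of this recipe in NumPy (the helper name is ours); note that it is simply the linear transform property above with $\mathbf{D} = \mathbf{A}$ and $\mathbf{C} = \mathbf{B}$:

```python
import numpy as np

def sample_matrix_normal(M, U, V, rng):
    """Draw one sample from MN(M, U, V) via Y = M + A X B with A A^T = U, B^T B = V."""
    n, p = M.shape
    A = np.linalg.cholesky(U)        # lower triangular, A A^T = U
    B = np.linalg.cholesky(V).T      # Cholesky gives L L^T = V, so B = L^T has B^T B = V
    X = rng.standard_normal((n, p))  # X ~ MN(0, I_n, I_p)
    return M + A @ X @ B

rng = np.random.default_rng(4)
M = np.zeros((4, 3))
U = 0.5 * np.eye(4) + 0.5          # equicorrelated rows
V = np.diag([1.0, 2.0, 3.0])       # independent columns with different scales
Y = sample_matrix_normal(M, U, V, rng)
print(Y.shape)  # (4, 3)
```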
Relation to other distributions
Dawid (1981) provides a discussion of the relation of the matrix-valued normal distribution to other distributions, including the Wishart distribution, inverse-Wishart distribution and matrix t-distribution, but uses different notation from that employed here.
References
1. Gupta, A. K.; Nagar, D. K. (1999). "Chapter 2: Matrix Variate Normal Distribution". Matrix Variate Distributions. CRC Press. ISBN 978-1-58488-046-2.
2. Ding, Shanshan; Cook, R. Dennis (2014). "Dimension folding PCA and PFC for matrix-valued predictors". Statistica Sinica. 24 (1): 463–492.
3. Glanz, Hunter; Carvalho, Luis (2013). "An Expectation-Maximization Algorithm for the Matrix Normal Distribution". arXiv:1309.6609 [stat.ME].
Dawid, A. P. (1981). "Some matrix-variate distribution theory: notational considerations and a Bayesian application". Biometrika. 68 (1): 265–274.