UC Berkeley Study Abroad Journal

A record of my thinking at the University of California, Berkeley (UC Berkeley).

Bootstrap

The Nonparametric Bootstrap

Resampling Plans

The Parametric Bootstrap

Influence Functions and Robust Estimation

The Percentile Method

Bias-Corrected Confidence Intervals

Second-Order Accuracy

Bootstrap-t Intervals

Objective Bayes Intervals and the Confidence Distribution
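The headings above only list the topics. As a working reminder of the core idea, here is a minimal sketch of the nonparametric bootstrap in Python (numpy only); the simulated sample, the median as the statistic of interest, and B = 2000 resamples are illustrative choices rather than anything prescribed by the chapter list.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)   # illustrative observed sample
B = 2000                       # number of bootstrap resamples

# Resample the data with replacement and recompute the statistic each time.
boot_medians = np.array([
    np.median(rng.choice(x, size=x.size, replace=True)) for _ in range(B)
])

se_boot = boot_medians.std(ddof=1)             # bootstrap estimate of the standard error
ci = np.percentile(boot_medians, [2.5, 97.5])  # simple percentile interval
print(se_boot, ci)
```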

Maximum Likelihood Estimation

A family of probability densities

 \mathcal{F} = \left\{f_{\mu}(x); x \in \mathcal{X}, \mu \in \mathcal{\Omega}\right\}
 x, the observed data, is a point in the sample space  \mathcal{X}, while the unobserved parameter  \mu is a point in the parameter space  \mathcal{\Omega}.

Log likelihood functions

 l_{x}(\mu) = \log{f_\mu(x)}
The parameter vector  \mu varies while the observed data vector  x is held fixed.

Maximum Likelihood Estimate

 \hat{\mu} = \arg\max_{\mu \in \mathcal{\Omega}} l_x(\mu)
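A minimal numerical sketch of this definition, assuming scipy is available: the  \mathcal{N}(\mu, \sigma^{2}) model, the simulated data, and the helper name neg_log_lik are illustrative choices, and the MLE is obtained by minimizing the negative log likelihood.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # illustrative data

# Negative log likelihood -l_x(mu, sigma) for a N(mu, sigma^2) model
# (sigma parameterized on the log scale to keep it positive).
def neg_log_lik(params):
    mu, log_sigma = params
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

# hat{mu} = argmax l_x(mu): maximize the log likelihood by minimizing its negative.
res = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)
```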

Fisher Information

With a one-parameter family of densities
 \mathcal{F} = \left\{f_{\theta}(x); x \in \mathcal{X}, \theta \in \mathcal{\Omega}\right\}
The score function is the first derivative of the log likelihood  l_x(\theta) with respect to  \theta
 \dot{l}_x(\theta) = \frac{\dot{f}_\theta(x)}{f_\theta(x)}
Then, the Fisher information  \mathcal{I}_\theta is defined to be the variance of the score function,
 \mathcal{I}_{\theta} = \int_{\mathcal{X}} \dot{l}_{x}(\theta)^{2}f_{\theta}(x)dx
 \dot{l}_x(\theta) \sim (0, \mathcal{I}_{\theta})
The MLE  \hat{\theta} has an approximately normal distribution with mean  \theta and variance  1/\mathcal{I}_{\theta},
 \hat{\theta} \dot{\sim} \mathcal{N}(\theta, 1/\mathcal{I}_{\theta}).
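A quick hedged check of these definitions using the Poisson family, where the score of a single observation is  \dot{l}_x(\theta) = x/\theta - 1 and  \mathcal{I}_{\theta} = 1/\theta; the value  \theta = 3 and the simulation size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 3.0
x = rng.poisson(theta, size=200_000)

# Score of one Poisson observation: d/dtheta [x log(theta) - theta] = x/theta - 1.
score = x / theta - 1.0

print(score.mean())            # ~ 0: the score has mean zero
print(score.var(), 1 / theta)  # its variance ~ Fisher information I_theta = 1/theta
```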

Confidence Interval

An i.i.d. sample from a one-parameter family of densities,
 x_{i} \sim^{iid} f_{\theta}(\cdot), \ i = 1, 2, ..., n,
gives the approximation  \hat{\theta} \dot{\sim} \mathcal{N}(\theta, 1/I(x)),
where  I(x) is the observed Fisher information
 I(x) = -\ddot{l}_{x}(\hat{\theta}) = -\frac{\partial^{2}}{\partial\theta^{2}}l_{x}(\theta)\Big|_{\theta=\hat{\theta}}
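A small sketch of an approximate 95% interval built from the observed Fisher information, assuming a  \mathcal{N}(\theta, 1) model purely for illustration (numpy only); the helper log_lik is hypothetical, and the second derivative is taken numerically rather than in closed form (for this model  I(x) = n exactly).

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.0, 1.0, size=50)   # illustrative N(theta, 1) sample, true theta = 1

theta_hat = x.mean()                # MLE of theta for the N(theta, 1) model

# Log likelihood up to a constant, and a numerical second derivative at theta_hat.
def log_lik(theta):
    return -0.5 * np.sum((x - theta) ** 2)

h = 1e-4
second = (log_lik(theta_hat + h) - 2 * log_lik(theta_hat) + log_lik(theta_hat - h)) / h**2
I_obs = -second                     # observed Fisher information I(x)

# Approximate 95% interval: theta_hat +/- 1.96 / sqrt(I(x)).
ci = (theta_hat - 1.96 / np.sqrt(I_obs), theta_hat + 1.96 / np.sqrt(I_obs))
print(theta_hat, I_obs, ci)
```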

Permutation and Randomization


Linear Algebra 2

Set topology

Hull

Given a finite set of points  \mathcal{P} = \left\{ x^{(1)}, ..., x^{(n)} \right\}, each of the hulls below is the set of all combinations  x of the stated form (a convex-hull membership check is sketched after this list).
* Linear hull
 x = \sum \lambda_{i} x^{(i)}
* Affine hull
 x = \sum \lambda_{i} x^{(i)} , where  \sum \lambda_{i} =1
* Convex hull
 x = \sum \lambda_{i} x^{(i)} , where  \sum \lambda_{i} =1 and  \lambda_{i} \geq 0
* Conic hull
 x = \sum \lambda_{i} x^{(i)} , where  \lambda_{i} \geq 0
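As referenced above, deciding whether a point lies in the convex hull of  \mathcal{P} amounts to checking feasibility of  \lambda \geq 0,  \sum \lambda_{i} = 1,  \sum \lambda_{i} x^{(i)} = x, which is a linear program. Below is a hedged sketch using scipy.optimize.linprog; the triangle and the query point are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

# Points x^(1), ..., x^(n) as the columns of P (a triangle in R^2).
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
x = np.array([0.25, 0.25])          # query point

n = P.shape[1]
# Feasibility LP: find lambda >= 0 with P @ lambda = x and sum(lambda) = 1.
A_eq = np.vstack([P, np.ones((1, n))])
b_eq = np.append(x, 1.0)
res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)

print("in convex hull:", res.status == 0)   # status 0 means a feasible lambda was found
```

Dropping the  \sum \lambda_{i} = 1 row gives the conic-hull test, and allowing negative  \lambda while keeping the sum constraint gives the affine-hull test.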

Optimization 2

Optimality conditions

Unconstrained

 \nabla f_{0}=0

Equality-constrained

 Ax=b, and  \exists \nu \in \mathbb{R}^{m};  \nabla f_{0}+A^{T}\nu = 0

Inequality-constrained

 \lambda_{i} \geq 0,  i \in \mathcal{A}(x)
 \nabla f_{0} + \sum_{i \in \mathcal{A}(x)} \lambda_{i} \nabla f_{i}=0,
where  \mathcal{A}(x) is the set of active constraints at  x.
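A small numerical sketch of the equality-constrained condition  Ax=b,  \nabla f_{0}+A^{T}\nu = 0, for the toy quadratic objective  f_{0}(x) = \frac{1}{2}x^{T}Qx - c^{T}x (numpy only; the matrices are illustrative): the two conditions stack into one linear (KKT) system.

```python
import numpy as np

# Equality-constrained quadratic program: min 0.5 x^T Q x - c^T x  s.t.  A x = b.
Q = np.array([[2.0, 0.0], [0.0, 4.0]])
c = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

# Stationarity (Q x - c + A^T nu = 0) and feasibility (A x = b) as one linear system.
m, n = A.shape
KKT = np.block([[Q, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(KKT, np.concatenate([c, b]))
x, nu = sol[:n], sol[n:]

print(x, nu)
print(np.allclose(Q @ x - c + A.T @ nu, 0), np.allclose(A @ x, b))  # both conditions hold
```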

Duality

Consider a primal problem in standard form:  p^{*} = \min_{x \in \mathbb{R}^{n}} f_{0}(x), s.t.  f_{i}(x) \leq 0, i=1, ..., m,  h_{i}(x) = 0, i = 1, ..., q.

Lagrangian

 \mathcal{L}(x, \lambda, \nu) \doteq f_{0}(x) + \sum_{i=1}^{m}\lambda_{i}f_{i}(x) + \sum_{i=1}^{q}\nu_{i}h_{i}(x),
where  \lambda, \nu are called the Lagrange multipliers, or dual variables, of the problem.

Lagrange dual function

 g(\lambda, \nu) = \min_{x \in \mathcal{D}} \mathcal{L}(x, \lambda, \nu), where
 g(\lambda, \nu): \mathbb{R}^m \times \mathbb{R}^q \rightarrow \mathbb{R}.

Lower bound property of the dual function

The dual function  g(\lambda, \nu) is jointly concave in  (\lambda, \nu). Moreover, it holds that
 g(\lambda, \nu) \leq p^{*}, \ \forall \lambda \geq 0, \forall \nu.

Dual optimization problem and weak duality

 d^{*} = \max_{\lambda, \nu} g(\lambda, \nu) s.t.  \lambda \geq 0.
It is remarkable that the dual problem is always a convex optimization problem, even when the primal problem is not convex. The weak duality property of the dual problem is
 d^{*} \leq p^{*}.
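A hedged one-dimensional illustration of the dual function and weak duality, using the toy problem  p^{*} = \min_{x} x^{2} s.t.  1 - x \leq 0 (so  p^{*} = 1 at  x = 1); the helper g below minimizes the Lagrangian over  x in closed form.

```python
import numpy as np

# Toy primal: p* = min_x x^2  s.t.  1 - x <= 0,  so p* = 1 at x = 1.
# Lagrangian: L(x, lam) = x^2 + lam * (1 - x); argmin_x L is x = lam / 2.
def g(lam):
    x_star = lam / 2.0
    return x_star**2 + lam * (1 - x_star)   # dual function g(lam) = lam - lam^2 / 4

p_star = 1.0
lams = np.linspace(0, 5, 101)
vals = np.array([g(l) for l in lams])

print(np.all(vals <= p_star + 1e-12))   # weak duality: g(lam) <= p* for every lam >= 0
print(vals.max())                       # d* = 1 here, so strong duality also holds
```

At the maximizing  \lambda = 2 the constraint is active ( 1 - x = 0), consistent with the complementary slackness condition discussed below.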

Strong duality and Slater's condition for convex programs

Suppose the problem is convex and the first  k \leq m of the constraint functions  f_{i} are affine. If there exists a point  x \in \mathrm{relint}\, \mathcal{D} such that
 f_{1}(x) \leq 0, ..., f_{k}(x) \leq 0, f_{k+1}(x)<0, ..., f_{m}(x)<0,
 h_{1}(x) = 0, ..., h_{q}(x) = 0,
then strong duality holds between the primal and dual problems, that is,  p^{*}=d^{*}.

Complementary slackness

A primal and the corresponding dual inequality cannot be slack simultaneously.
 \lambda_{i} f_{i}(x)=0
If  f_{i}(x) < 0, then it must be that  \lambda_{i} = 0. If  \lambda_{i} > 0, then it must be that  f_{i}(x) = 0.


Linear Algebra 1

Properties of trace and rank

 tr(cA+dB) = ctr(A) + dtr(B)
 tr(AB) = tr(BA)
 x^{T}Ax = tr(x^{T}Ax)= tr(Axx^{T})
 tr(A) = \sum_{i=1}^{n} \lambda_i(A)
 rank(AB) \leq rank(A), rank(AB) \leq rank(B)
 rank(AA^{T}) = rank(A^{T}A) = rank(A)
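A quick numpy check of these identities on random matrices; the sizes and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

print(np.isclose(np.trace(2 * A + 3 * B), 2 * np.trace(A) + 3 * np.trace(B)))  # linearity
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))                            # tr(AB) = tr(BA)
print(np.isclose(x @ A @ x, np.trace(A @ np.outer(x, x))))                     # x^T A x = tr(A x x^T)
print(np.isclose(np.trace(A), np.linalg.eigvals(A).sum().real))                # tr(A) = sum of eigenvalues
print(np.linalg.matrix_rank(A @ A.T) == np.linalg.matrix_rank(A))              # rank(AA^T) = rank(A)
```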

Norm

  • Definition
     ||x|| \geq 0, \forall x \in \chi
     ||x|| = 0 iff  x = 0
     ||x+y|| \leq ||x|| + ||y||, \forall x, y \in \chi
     ||\alpha x|| = |\alpha| ||x|| for any scalar  \alpha and any  x \in \chi.
  • Examples
     ||x||_{1} = \sum_{k} |x_{k}|
     ||x||_{\infty} = \max_{k}{|x_{k}|}
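A small numpy check of the example norms and the triangle inequality; the vectors are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
x, y = rng.standard_normal(4), rng.standard_normal(4)

print(np.linalg.norm(x, 1), np.abs(x).sum())                              # ||x||_1
print(np.linalg.norm(x, np.inf), np.abs(x).max())                         # ||x||_inf
print(np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y))     # triangle inequality
```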

Matrix Norm

  • Frobenius Norm
     ||A||_{F} = \sqrt{trAA^{T}}= \sqrt{\sum \sum |a_{ij}|^{2}}=\sqrt{\sum \lambda_{i}(AA^{T})}
     \forall x \in \mathbb{R}^{n}, ||Ax||_{2} \leq ||A||_{F}||x||_{2}
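A quick numpy check of these Frobenius-norm identities and the bound  ||Ax||_{2} \leq ||A||_{F}||x||_{2}; the matrix size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 5))
x = rng.standard_normal(5)

fro = np.linalg.norm(A, 'fro')
print(np.isclose(fro, np.sqrt(np.trace(A @ A.T))))                  # ||A||_F = sqrt(tr(AA^T))
print(np.isclose(fro, np.sqrt(np.linalg.eigvalsh(A @ A.T).sum())))  # = sqrt(sum of eigenvalues of AA^T)
print(np.linalg.norm(A @ x) <= fro * np.linalg.norm(x))             # ||Ax||_2 <= ||A||_F ||x||_2
```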

Cauchy-Schwarz inequality and definition of angles

 \forall \ x, y, \in \mathbb{R}^n: \ x^{T}y \leq |x|_{2}|y|_{2}
We can define the corresponding angle as  \theta such that
 \cos \theta = \frac{x^{T}y}{|x|_{2}|y|_{2}}.

Range, nullspace, and rank

 \mathit{R}(A) := \left \{Ax: x \in \mathbb{R}^{n} \right \}
 \mathit{N}(A) := \left \{x \in \mathbb{R}^{n}: Ax = 0 \right \}
The rank of matrix A is the dimension of its range.

Matrix description of subspaces

Symmetric matrix

 A^{T} = A
Examples
* Diagonal matrix
* Dyad  uu^{T}
Any quadratic form can be written as
 q(x) = x^{T}Ax, where  A \in \mathit{S}^{n}

Congruence transformations

For any matrix  A \in \mathbb{R}^{m, n} it holds that:
*  A^{T}A \succeq 0, and  A A^{T} \succeq 0;
*  A^{T}A \succ 0 if and only if  A is full-column rank, i.e.,  rank A  = n;
*  AA^{T} \succ 0 if and only if  A is full-row rank, i.e.,  rank A = m.
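A hedged numpy illustration of these Gram-matrix facts; the random matrix and the deliberately rank-deficient variant are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 3))                        # full column rank with probability 1

print(np.all(np.linalg.eigvalsh(A.T @ A) > 0))         # A^T A is positive definite (rank A = n)
print(np.all(np.linalg.eigvalsh(A @ A.T) >= -1e-10))   # A A^T is p.s.d. (singular here, since rank A < m)

B = A[:, [0, 1, 0]]                                    # duplicate a column: B is not full column rank
print(np.isclose(np.linalg.eigvalsh(B.T @ B).min(), 0.0))  # B^T B is p.s.d. but not p.d.
```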

Eigenvalue and eigenvector

  •  \sum \lambda_i = tr(A)
  •  \prod \lambda_i = det(A)
  • Therefore  det(A) = 0 means at least one eigenvalue is 0. Also, A is invertible when  \lambda_i \neq 0 \ \forall \ i
  • Eigenvalues of symmetric matrices are real. If  \mathbf{A} is positive semi-definite, then  \lambda_{i} \geq 0
  • Eigenvalues of positive definite matrices are positive. If  \mathbf{A} is positive definite, then  \lambda_{i} > 0
  • From  \prod \lambda_i = det(A), matrix A is invertible if and only if all of its eigenvalues are nonzero.

Spectral decomposition (a.k.a. eigendecomposition) for symmetric matrix

Any symmetric matrix can be decomposed as a weighted sum of normalized dyads. If  A \in \mathbb{R}^{n, n} and  A^{T} = A, then A can be described by its eigenvalues and eigenvectors:
 A = U \Lambda U^{T} = \sum_{i=1}^{n} \lambda_{i} u_{i} u_{i}^{T}
Or, equivalently,  U^{T} A U = \Lambda, where  U is orthogonal and  \Lambda is diagonal.
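A minimal numpy sketch of the spectral decomposition via np.linalg.eigh (which assumes a symmetric input); the random symmetric matrix is illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                              # a symmetric matrix

lam, U = np.linalg.eigh(A)                     # eigenvalues and orthonormal eigenvectors

print(np.allclose(A, U @ np.diag(lam) @ U.T))  # A = U Lambda U^T
print(np.allclose(U.T @ A @ U, np.diag(lam)))  # U^T A U = Lambda

# The same matrix as a weighted sum of normalized dyads lambda_i * u_i u_i^T.
A_dyads = sum(l * np.outer(u, u) for l, u in zip(lam, U.T))
print(np.allclose(A, A_dyads))
```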

Singular value decomposition

In words, the singular value theorem states that any matrix can be written as a sum of simpler matrices (orthogonal dyads).
 A \in \mathbb{R}^{m, n}
 rank(A) = r
Then A can be described by the singular values of  A (i.e. the square roots of the eigenvalues of  A^{T}A) and the eigenvectors of  A^{T}A and  AA^{T} as follows.
 A = \tilde{U} \tilde{\Sigma} \tilde{V}^T
Or, in compact form,
 A = U \Sigma V^{T}=\sum_{i=1}^{r}\sigma_{i}u_{i}v_{i}^{T}.
 U \in \mathbb{R}^{m, r}, V \in \mathbb{R}^{n, r}, \Sigma \in \mathbb{R}^{r, r}
 \sigma_i^{2} = \lambda_i(AA^{T}) = \lambda_i(A^{T}A) \geq 0 (Because  AA^{T} and  A^{T}A are p.s.d..)
The columns of  U and  V are orthonormal; in the full SVD,  \tilde{U} and  \tilde{V} are orthogonal matrices.
* SVD, range, and nullspace
The first  r columns of  \tilde{U} form an orthonormal basis of the range (column space) of A.
The last  n-r columns of  \tilde{V} form an orthonormal basis of the nullspace of A.
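A minimal numpy sketch of these SVD facts on a deliberately rank-deficient matrix; the sizes and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))   # a 5x4 matrix of rank 2
r = np.linalg.matrix_rank(A)

U, s, Vt = np.linalg.svd(A)                             # full SVD: A = U diag(s) V^T
print(np.allclose(A, (U[:, :r] * s[:r]) @ Vt[:r, :]))   # compact form sum_i sigma_i u_i v_i^T

# First r columns of U span the range of A; last n - r columns of V span the nullspace.
null_basis = Vt[r:, :].T
print(np.allclose(A @ null_basis, 0))

# sigma_i^2 equals the i-th largest eigenvalue of A^T A.
print(np.allclose(s[:r]**2, np.linalg.eigvalsh(A.T @ A)[::-1][:r]))
```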

Cholesky decomposition of p.s.d. and p.d. matrices

If  \exists \mathbf{P} such that  \mathbf{A} = \mathbf{PP^{T}}, then  \mathbf{A} is positive semi-definite.
If  \mathbf{A} is positive semi-definite, then  \exists \mathbf{P} such that  \mathbf{A} = \mathbf{PP^{T}}. That is, any p.s.d. matrix can be written as a product  A = PP^{T}. P is not unique. If A is positive definite, then we can choose a lower triangular matrix for the decomposition,  A = LL^{T}, where L is invertible.
Example of p.s.d. matrix
The variance-covariance matrix is a notable example of a p.s.d. matrix.
 \Sigma = E[(x - \hat{x})(x - \hat{x})^{T}], where  \hat{x} = E(x).
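A hedged numpy sketch: the sample covariance matrix of some illustrative data (almost surely positive definite here) is factored with np.linalg.cholesky.

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.standard_normal((500, 3))
Sigma = np.cov(X, rowvar=False)               # sample variance-covariance matrix

L = np.linalg.cholesky(Sigma)                 # lower triangular L with Sigma = L L^T
print(np.allclose(Sigma, L @ L.T))
print(np.allclose(L, np.tril(L)))             # L is lower triangular
print(np.linalg.eigvalsh(Sigma).min() > 0)    # all eigenvalues positive, so Sigma is p.d.
```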

Rayleigh quotient

Given  A \in \mathbb{S}^{n}, it holds that
 \lambda_{min} \leq \frac{x^{T}Ax}{x^{T}x} \leq \lambda_{max}  \forall x \neq 0
Therefore we can solve optimization problems in quadratic form over the unit sphere by finding eigenvalues.  \lambda_{max}(A) = \max_{||x||_2 = 1} x^{T}Ax

 \lambda_{min}(A) = \min_{||x||_2 = 1} x^{T}Ax
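A quick numpy check of the Rayleigh-quotient bounds by sampling random unit vectors; the matrix and the number of samples are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(11)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                         # symmetric A

lam = np.linalg.eigvalsh(A)               # eigenvalues in ascending order

# Random unit vectors: the Rayleigh quotient stays between lambda_min and lambda_max.
X = rng.standard_normal((10_000, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = np.einsum('ij,jk,ik->i', X, A, X)     # x^T A x for each unit vector x

print(lam[0] <= q.min() and q.max() <= lam[-1])
print(q.max(), lam[-1])                   # the sampled maximum approaches lambda_max
```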

Properties of eigenvalues and eigenvectors

For each type of matrix, the corresponding eigenvalues and eigenvectors:
* Symmetric ( A^{T}=A): real  \lambda's; orthogonal eigenvectors ( x_{i}^{T}x_{j}=0)
* Orthogonal ( Q^{T} = Q^{-1}): all  |\lambda|=1; orthogonal eigenvectors ( x_{i}^{T}x_{j}=0)
* Positive definite ( x^{T}Ax > 0): all  \lambda > 0; orthogonal eigenvectors
* Similar matrices ( B=M^{-1}AM):  \lambda(B) = \lambda(A);  x(B) = M^{-1}x(A)
* Projections ( P=P^{2}=P^{T}):  \lambda = 1 and  0; eigenvectors in the column space and nullspace, respectively
* Every matrix ( A = U \Sigma V^{T}):  rank(A) = rank(\Sigma); eigenvectors of  A^{T}A and  AA^{T} in  V and  U