Lecture 3: Advanced Probabilistic Topics
- 1. Introduction to Operating Systems — Lecture 3: Advanced Probabilistic Topics. MING GAO, SE@ecnu, mgao@sei.ecnu.edu.cn (for course-related communications). Sep. 30, 2016.
- 2. Outline: 1. Markov chain and random walk; 2. Graphical models (Directed Model, Undirected Model); 3. Tail Bounds.
- 3. Markov chain and random walk — Markov chain. A stochastic process {X_t : t ∈ T} is a collection of random variables. The index t is often called time, as the process represents the value of a random variable changing over time. Let Ω be the set of values assumed by the random variables X_t. We call each element of Ω a state, as X_t represents the state of the process at time t. Definition of the Markov property: a process X_0, X_1, … satisfies the Markov property if P(X_{n+1} = x_{n+1} | X_0 = x_0, …, X_n = x_n) = P(X_{n+1} = x_{n+1} | X_n = x_n) for all n and all x_i ∈ Ω. Definition of Markov chain: a stochastic process X_0, X_1, … with discrete time and discrete state space that satisfies the Markov property is a Markov chain. A random walk on a graph can be modeled as a Markov chain.
- 4. Markov chain and random walk — Transition matrix. Definition: let a Markov chain have P^(t+1)_{x,y} = P[X_{t+1} = y | X_t = x], and let the finite state space be Ω = [n]. This gives us a transition matrix P^(t+1) at time t. The transition matrix is an n × n matrix of nonnegative entries such that each row sums to 1, since for all t and all x ∈ Ω, Σ_y P^(t+1)_{x,y} = Σ_y P[X_{t+1} = y | X_t = x] = 1. For example,
  P^(t+1) =
  [ 1/2  1/2  0    0   ]
  [ 1/3  2/3  0    0   ]
  [ 0    0    3/4  1/4 ]
  [ 0    0    1/4  3/4 ]
  Then P^(t+1)_{1,2} = P[X_{t+1} = 2 | X_t = 1] = 1/2, P^(t+1)_{1,3} = P[X_{t+1} = 3 | X_t = 1] = 0, and Σ_{i=1}^{4} P^(t+1)_{1,i} = 1.
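To make the example concrete, here is a minimal NumPy sketch (not part of the original slides) that builds this transition matrix and checks the row-sum property:

```python
import numpy as np

# The 4-state example chain from the slide (states 1..4 map to indices 0..3).
P = np.array([
    [1/2, 1/2, 0,   0  ],
    [1/3, 2/3, 0,   0  ],
    [0,   0,   3/4, 1/4],
    [0,   0,   1/4, 3/4],
])

assert np.allclose(P.sum(axis=1), 1.0)   # every row sums to 1
print(P[0, 1])                           # P_{1,2} = P[X_{t+1}=2 | X_t=1] = 0.5
print(P[0, 2])                           # P_{1,3} = P[X_{t+1}=3 | X_t=1] = 0.0
```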
- 5. Markov chain and random walk — State distribution. Definition: let π^(t) be the state distribution of the chain at time t, i.e., π^(t)_x = P[X_t = x]. For a finite chain, π^(t) is a vector of n nonnegative entries such that Σ_x π^(t)_x = 1. Then it holds that π^(t+1) = π^(t) P^(t+1); by the law of total probability, π^(t+1)_y = P[X_{t+1} = y] = Σ_x P[X_{t+1} = y | X_t = x] P[X_t = x] = Σ_x π^(t)_x P^(t+1)_{x,y}. For the example chain above: if π^(t) = (0.4, 0.6, 0, 0), then π^(t+1) = (0.4, 0.6, 0, 0); if π^(t) = (0, 0, 0.5, 0.5), then π^(t+1) = (0, 0, 0.5, 0.5); if π^(t) = (0.1, 0.9, 0, 0), then π^(t+1) = (0.35, 0.65, 0, 0).
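A small sketch (illustrative only) that applies π^(t+1) = π^(t) P to the three example distributions above and reproduces the stated results:

```python
import numpy as np

P = np.array([[1/2, 1/2, 0, 0],
              [1/3, 2/3, 0, 0],
              [0, 0, 3/4, 1/4],
              [0, 0, 1/4, 3/4]])

for pi_t in ([0.4, 0.6, 0, 0], [0, 0, 0.5, 0.5], [0.1, 0.9, 0, 0]):
    pi_next = np.array(pi_t) @ P     # law of total probability: pi^(t+1) = pi^(t) P
    print(pi_t, "->", np.round(pi_next, 4))
# [0.4, 0.6, 0, 0]  -> [0.4  0.6  0.   0. ]
# [0, 0, 0.5, 0.5]  -> [0.   0.   0.5  0.5]
# [0.1, 0.9, 0, 0]  -> [0.35 0.65 0.   0. ]
```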
- 6. Markov chain and random walk — Stationary distributions. Definition: a stationary distribution of a finite Markov chain with transition matrix P is a probability distribution π such that πP = π. For some Markov chains, no matter what the initial distribution is, after running the chain for a while the distribution of the chain approaches the stationary distribution. E.g., for the example chain,
  P^20 ≈
  [ 0.4  0.6  0    0   ]
  [ 0.4  0.6  0    0   ]
  [ 0    0    0.5  0.5 ]
  [ 0    0    0.5  0.5 ]
  The chain could converge to any distribution that is a convex combination of (0.4, 0.6, 0, 0) and (0, 0, 0.5, 0.5), depending on the initial distribution. We observe that the original chain P can be broken into two disjoint Markov chains, each with its own stationary distribution. We say that such a chain is reducible.
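The reducibility of the example chain can be checked numerically; the following illustrative snippet computes P^20 and verifies that (0.4, 0.6, 0, 0) is stationary:

```python
import numpy as np

P = np.array([[1/2, 1/2, 0, 0],
              [1/3, 2/3, 0, 0],
              [0, 0, 3/4, 1/4],
              [0, 0, 1/4, 3/4]])

P20 = np.linalg.matrix_power(P, 20)
print(np.round(P20, 4))
# Rows 1-2 converge to (0.4, 0.6, 0, 0) and rows 3-4 to (0, 0, 0.5, 0.5),
# so any convex combination of the two vectors is stationary: the chain is reducible.

pi = np.array([0.4, 0.6, 0, 0])
print(np.allclose(pi @ P, pi))       # True: pi P = pi
```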
- 7. Markov chain and random walk — Irreducibility. Definition: state y is accessible from state x if it is possible for the chain to visit state y when the chain starts in state x; in other words, P^n(x, y) > 0 for some n. State x communicates with state y if y is accessible from x and x is accessible from y. We say that the Markov chain is irreducible if all pairs of states communicate. "y is accessible from x" means that y is reachable from x in the transition graph, i.e., there is a directed path from x to y. "x communicates with y" means that x and y are strongly connected in the transition graph. A finite Markov chain is irreducible if and only if its transition graph is strongly connected. The Markov chain associated with the transition matrix P above is not irreducible.
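Irreducibility can be tested as strong connectivity of the transition graph. The sketch below (an illustration, not from the slides) checks all-pairs reachability with a boolean power of the adjacency matrix and confirms that the example chain is reducible:

```python
import numpy as np

def is_irreducible(P):
    """Irreducible <=> transition graph strongly connected <=> every state reaches every state."""
    n = len(P)
    A = (P > 0).astype(int)                                   # adjacency matrix of the transition graph
    reach = np.linalg.matrix_power(A + np.eye(n, dtype=int), n - 1) > 0
    return bool(reach.all())                                  # all pairs communicate

P = np.array([[1/2, 1/2, 0, 0],
              [1/3, 2/3, 0, 0],
              [0, 0, 3/4, 1/4],
              [0, 0, 1/4, 3/4]])
print(is_irreducible(P))   # False: the chain splits into the disjoint pieces {1,2} and {3,4}
```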
- 9. Markov chain and random walk — Aperiodicity. Definition: the period of a state x is the greatest common divisor d_x = gcd{n : (P^n)_{x,x} > 0}. A state is aperiodic if its period is 1. A Markov chain is aperiodic if all its states are aperiodic. For example, suppose that the period of state x is d_x = 3; then, starting from state x, the chain can be back at x only at times 3, 6, 9, …. In the transition graph of a finite Markov chain, (P^n)_{x,x} > 0 is equivalent to x lying on a cycle of length n, so the period of a state x is the greatest common divisor of the lengths of the cycles passing through x. Theorem: 1. If the states x and y communicate, then d_x = d_y. 2. (P^n)_{x,x} = 0 if n mod d_x ≠ 0.
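The definition translates directly into code. The helper below is a rough sketch that approximates d_x as the gcd of the return times found up to a cutoff max_n (the cutoff is a simplification for illustration):

```python
import numpy as np
from math import gcd
from functools import reduce

def period(P, x, max_n=50):
    """Period of state x: gcd of all n <= max_n with (P^n)_{x,x} > 0 (None if x is never revisited)."""
    return_times = [n for n in range(1, max_n + 1)
                    if np.linalg.matrix_power(P, n)[x, x] > 0]
    return reduce(gcd, return_times) if return_times else None

# A deterministic 3-cycle has period 3; the example chain's states have self-loops, so period 1.
C = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(period(C, 0))     # 3

P = np.array([[1/2, 1/2, 0, 0],
              [1/3, 2/3, 0, 0],
              [0, 0, 3/4, 1/4],
              [0, 0, 1/4, 3/4]])
print(period(P, 0))     # 1
```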
- 10. Markov chain and random walk — Convergence of Markov chains. Fundamental theorem of Markov chains: let X_0, X_1, … be an irreducible, aperiodic Markov chain with finite state space Ω, transition matrix P, and arbitrary initial distribution π^(0). Then there exists a unique stationary distribution π such that πP = π, and lim_{t→∞} π^(0) P^t = π. Existence: there exists a stationary distribution. Uniqueness: the stationary distribution is unique. Convergence: starting from any initial distribution, the chain converges to the stationary distribution. In fact, any finite Markov chain has a stationary distribution; irreducibility and aperiodicity guarantee the uniqueness and convergence of the stationary distribution.
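The theorem can be illustrated on the irreducible, aperiodic {1, 2} sub-chain of the earlier example: repeated multiplication by P drives any initial distribution to (0.4, 0.6). A minimal sketch:

```python
import numpy as np

# The {1,2} sub-chain is irreducible (strongly connected) and aperiodic (self-loops),
# so by the fundamental theorem any initial distribution converges to the unique
# stationary distribution, here (0.4, 0.6).
P = np.array([[1/2, 1/2],
              [1/3, 2/3]])

pi = np.array([1.0, 0.0])            # arbitrary initial distribution
for _ in range(50):
    pi = pi @ P
print(np.round(pi, 6))               # [0.4 0.6]
print(np.allclose(pi @ P, pi))       # True: it is stationary
```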
- 11. Markov chain and random walk — Google's PageRank. Problem definition: given n interlinked webpages, rank them in order of "importance" in terms of importance scores x_1, x_2, …, x_n ≥ 0. Key insight: use the existing link structure of the web to determine importance; a link to a page is like a vote for its importance. Given a web with n pages, construct an n × n matrix A with a_{ij} = 1/n_j if page j links to page i (n_j being the number of outgoing links of page j), and a_{ij} = 0 otherwise. The sum of each column of A is 1, so A is a Markov (column-stochastic) matrix. The ranking vector x solves Ax = x. Possible issues? Replace A with B = 0.85A + 0.15E, where E is the matrix with every entry 1/n, so that B again defines a Markov chain, now irreducible and aperiodic. A page's rank is the probability that a random web surfer ends up on that page, or, equivalently, the corresponding entry of the stationary distribution of the chain defined by B.
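The damped PageRank update can be sketched as power iteration on B. The code below is an illustration only: the 3-page link structure and the `pagerank` helper are invented for the example, not taken from the slides.

```python
import numpy as np

def pagerank(A, d=0.85, tol=1e-10):
    """Power iteration on B = d*A + (1-d)/n * E, where A is column-stochastic
    (A[i, j] = 1/n_j if page j links to page i) and E has every entry 1."""
    n = A.shape[0]
    B = d * A + (1 - d) / n * np.ones((n, n))
    x = np.full(n, 1.0 / n)                  # start from the uniform distribution
    while True:
        x_new = B @ x
        if np.linalg.norm(x_new - x, 1) < tol:
            return x_new
        x = x_new

# Hypothetical 3-page web: page 1 links to 2 and 3, page 2 links to 3, page 3 links to 1.
A = np.array([[0,   0,   1.0],
              [0.5, 0,   0  ],
              [0.5, 1.0, 0  ]])
print(pagerank(A))   # importance scores, summing to 1
```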
- 12. Graphical models — The curse of dimensionality. Modern machine learning is usually concerned with high-dimensional objects. Consider learning a distribution over x ∈ {0, 1}^N. If N = 100, p(x) has 2^100 − 1 = 1267650600228229401496703205375 free parameters.
- 13. Graphical models — Why do we need graphical models? Graphs are an intuitive way of representing and visualising the relationships between many variables (examples: family trees, electric circuit diagrams, neural networks). A graph allows us to abstract the conditional independence relationships between the variables away from the details of their parametric forms; thus we can answer questions like "Is A dependent on B given that we know the value of C?" just by looking at the graph. Graphical models allow us to define general message-passing algorithms that implement probabilistic inference efficiently; thus we can answer queries like "What is p(A | C = c)?" without enumerating all settings of all variables in the model. Graphical models = statistics × graph theory × computer science.
- 14. Graphical models — Conditional independence. The special structure graphical models assume is conditional independence: if you would like to guess the value of some variable x_i, then once you know the values of some "neighboring" variables x_{N(i)}, you get no additional benefit from knowing all the other variables. It turns out that this leads to factorized distributions. Conditional independence: X is independent of Y if "knowing Y doesn't help you to guess X": X ⊥ Y ⟺ P(X, Y) = P(X)P(Y). X is independent of Y given Z if "once you know Z, knowing Y doesn't help you to guess X": X ⊥ Y | Z ⟺ P(X, Y | Z) = P(X | Z)P(Y | Z).
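The second definition can be verified numerically: if a joint over X, Y, Z is built as P(Z)P(X | Z)P(Y | Z), then P(X, Y | Z) factorizes. A small NumPy check with made-up numbers (illustrative only):

```python
import numpy as np

# Toy joint over binary X, Y, Z constructed so that X and Y are conditionally independent given Z.
pZ = np.array([0.3, 0.7])
pX_given_Z = np.array([[0.9, 0.1],    # row z: P(X=0|z), P(X=1|z)
                       [0.2, 0.8]])
pY_given_Z = np.array([[0.6, 0.4],
                       [0.5, 0.5]])
pXYZ = np.einsum('z,zx,zy->xyz', pZ, pX_given_Z, pY_given_Z)

# Check P(X, Y | Z) = P(X | Z) P(Y | Z) for every z.
pXY_given_Z = pXYZ / pXYZ.sum(axis=(0, 1), keepdims=True)
for z in (0, 1):
    assert np.allclose(pXY_given_Z[:, :, z],
                       np.outer(pX_given_Z[z], pY_given_Z[z]))
```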
- 15. Graphical models — Representing knowledge through graphical models. A graphical model is a probability distribution written in a factorized form, for example p(x) ∝ ψ(x_1, x_3) ψ(x_2, x_3) ψ(x_3, x_4). The two most common forms of graphical model are directed graphical models and undirected graphical models, based on directed acyclic graphs and undirected graphs, respectively. Let G = (V, E) be a graph, where V and E represent the sets of vertices and edges, respectively. Vertices correspond to random variables; edges represent statistical dependencies between the variables.
- 16. Graphical models / Directed Model — Outline: 1. Markov chain and random walk; 2. Graphical models (Directed Model, Undirected Model); 3. Tail Bounds.
- 17. Graphical models / Directed Model — Directed acyclic graphical models. Bayesian network: a DAG model or Bayesian network corresponds to a factorization of the joint probability distribution, e.g. P(A, B, C, D, E) = P(A) P(B) P(C | A, B) P(D | B, C) P(E | C, D). In general, P(X_1, X_2, …, X_n) = Π_{i=1}^{n} P(X_i | X_{pa(i)}), where pa(i) denotes the parents of vertex i.
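As a sketch of how the factorization is used, the snippet below builds a hypothetical set of CPTs for binary A, …, E, multiplies them according to the factorization above, and checks that the result is a valid joint distribution (the random CPTs are placeholders, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(*shape):
    """Random conditional probability table; the last axis is the child variable and sums to 1."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# Hypothetical CPTs for binary A, B, C, D, E.
pA = random_cpt(2)             # P(A)
pB = random_cpt(2)             # P(B)
pC = random_cpt(2, 2, 2)       # P(C | A, B)
pD = random_cpt(2, 2, 2)       # P(D | B, C)
pE = random_cpt(2, 2, 2)       # P(E | C, D)

# Joint: P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B,C) P(E|C,D)
joint = np.einsum('a,b,abc,bcd,cde->abcde', pA, pB, pC, pD, pE)
print(joint.sum())             # 1.0 -- a valid joint distribution over 2^5 configurations
```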
- 18. Graphical models / Directed Model — How to do learning. Maximum likelihood: given a fixed graph, how do we learn the parameters? A natural criterion is arg max_θ L(θ), where L(θ) = (1/D) Σ_{d=1}^{D} log P(x_d | θ). The solution is given by the empirical conditionals: P(X_i = x_i | X_{π(i)} = x_{π(i)}, θ) = #[X_i = x_i, X_{π(i)} = x_{π(i)}] / #[X_{π(i)} = x_{π(i)}].
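The empirical-conditional solution is just counting. A toy illustration for a single node with one binary parent (the data below are made up):

```python
from collections import Counter

# Maximum-likelihood CPT by counting: P(X_i = x | X_pa(i) = u) = #[x, u] / #[u].
# Hypothetical data: (parent value, child value) pairs.
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]

joint_counts = Counter(data)                       # #[X_pa = u, X_i = x]
parent_counts = Counter(u for u, _ in data)        # #[X_pa = u]

cpt = {(u, x): joint_counts[(u, x)] / parent_counts[u]
       for (u, x) in joint_counts}
print(cpt)   # e.g. P(X_i=0 | X_pa=0) = 3/4, P(X_i=1 | X_pa=1) = 3/4
```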
- 19. Graphical models / Undirected Model — Outline: 1. Markov chain and random walk; 2. Graphical models (Directed Model, Undirected Model); 3. Tail Bounds.
- 20. Graphical models / Undirected Model — Undirected graphs. Which of the following are true? a. x_1 ⊥ x_3 | x_2; b. x_1 ⊥ x_3 | x_{2,4}; c. x_1 ⊥ x_3 | x_{2,5}; d. x_1 ⊥ x_6 | x_{2,3,4,5}; e. x_1 ⊥ x_6 | x_{2,4}; f. x_1 ⊥ x_6 | x_2; g. x_1 ⊥ x_6 | x_4; h. x_{1,6} ⊥ x_{3,5} | x_4; i. x_{1,6} ⊥ x_{3,5} | x_{2,4}. (The exercise refers to an undirected graph shown on the slide.)
- 21. Graphical models / Undirected Model — Undirected graphs (cont.). Equivalently (when P(x) > 0), a graph asserts that P(x_i | x_{−i}) = P(x_i | x_{N(i)}); but what is the formula for P(x)? Hammersley–Clifford theorem: a positive distribution P(x) > 0 obeys the conditional independencies of a graph G if and only if P(x) can be represented as P(x) = (1/Z) Π_{c∈C} ψ_c(x_c), where C is the set of all cliques and Z = Σ_x Π_{c∈C} ψ_c(x_c) is the "partition function". The "only if" direction is not obvious, and the potentials ψ_c have no direct probabilistic interpretation. It is easy to show that P(x) = (1/Z) Π_{c∈C} ψ_c(x_c) obeys the conditional independence assumptions of the graph.
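A brute-force illustration of the partition function for the earlier example p(x) ∝ ψ(x_1, x_3) ψ(x_2, x_3) ψ(x_3, x_4) with binary variables; the potential tables here are arbitrary positive placeholders, not values from the slides:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
# Positive 2x2 potential tables for the cliques {1,3}, {2,3}, {3,4}.
psi13, psi23, psi34 = (rng.random((2, 2)) + 0.1 for _ in range(3))

def unnorm(x1, x2, x3, x4):
    """Unnormalized probability: product of clique potentials."""
    return psi13[x1, x3] * psi23[x2, x3] * psi34[x3, x4]

Z = sum(unnorm(*x) for x in product([0, 1], repeat=4))      # partition function
p = lambda x1, x2, x3, x4: unnorm(x1, x2, x3, x4) / Z
print(sum(p(*x) for x in product([0, 1], repeat=4)))         # 1.0 -- a proper distribution
```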
- 22. Graphical models / Undirected Model — Exponential family. An exponential family is a set of distributions p(x; θ) = (1/Z(θ)) exp(θ^T φ(x)) = exp(θ^T φ(x) − A(θ)), parameterized by θ ∈ Θ ⊂ R^d, where Z(θ) = Σ_x exp(θ^T φ(x)) and A(θ) = log Z(θ) is the "log-partition function". We care because: (1) exponential families have many interesting properties; (2) undirected models form an exponential family.
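As an illustration of the definition (anticipating the examples slide), the Bernoulli distribution can be written in this form with natural parameter θ = log(p/(1−p)), sufficient statistic φ(x) = x, and A(θ) = log(1 + e^θ); a quick numerical check:

```python
import numpy as np

# Bernoulli(p) as an exponential family: p(x; theta) = exp(theta * x - A(theta)).
p = 0.3
theta = np.log(p / (1 - p))          # natural parameter
A = np.log(1 + np.exp(theta))        # log-partition function

for x in (0, 1):
    print(x, np.exp(theta * x - A))  # -> 0.7 and 0.3, matching Bernoulli(p)
```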
- 23. Graphical models / Undirected Model — Examples. Examples for the exponential family — Bernoulli distribution: