# an HDP-HMM for Systems with State Persistence

## Contents

## Introduction

### The Big Picture

Hidden Markov Model is one of the most effective and widely used probabilistic models for time series data. A serious limitation of this model is the needs to define the number of states for the model on prior. This number cannot easily determine in real life applications. The usual way to define it is by a trying several different numbers of states and choose the one that gives the best results (trial and error approach). This limitation can be overcome by types of probabilistic models call “Bayesian Nonparametric Models”. Bayesian Nonparametric Models approach this problem by defining distributions that can have infinite number of parameters. However, the assumption here is that even though the distribution could has infinite number of parameters, only finite number of them is required to explain the observed data.

A huge breakthrough in the possible applications of “Bayesian Nonparametric Methods” occurred after a paper published in 2005 by Yee Whye The, Michael I. Jordan, et.al. The title of the paper was “Hierarchical Dirichlet Processes (HDP)”. This paper describes a way to define a nonparametric Bayesian prior that allows atoms (which can be seen as components of mixture models) to be shared between groups of data (draws from mixture models). One application of this model was an extension to the Hidden Markov Model that allows the number of stats not to be defined in prior. The new model, which named HDP-HMM, allows the number of stats to be infinite.

The paper that I’m reviewing here proposes an augmented version of the HDP-HMM that solves the problem of state persistence. The problem is also arises in the original Hidden Markov Model. The stats in situations where this problem could occur have the tendency not to change their settings, i.e. the transition probability to a new state is less than the probability of staying at the same stats. The authors did not only provide a solution for this problem, but they also provided a full Bayesian treatment for it.

### Speaker Diarization Problem

The task of segmenting or annotating an audio recording into different temporal segments where each segment corresponds to a specific event or speaker is called Speaker Diarization. One example of where Speaker Diarization could be useful is when analyzing a few minutes recorded audio from a radio broadcast. This few minutes could consist of commercials, intro music, a speech of the host, and a speech of the guest. The Speaker Diarization problem on such an audio recording could be defined as identifying speech and non-speech segments. Audio recording of meetings with known or unknown number of participants is another example the uses of Speaker Diarization. In this case the task is to answer the question of “who spoke when”.

The most common technique for solving the Speaker Diarization problem, which has also showed better results than others, is using Hidden Markov Models. In this setting, each speaker is associated with a specified state in the HMM. The transition among these stats represents the transition among speakers. However, a model like this suffers from a serious limitation. It requires the number of speakers to be known in advanced, which is needed to design the structure of the model.

The authors of this paper proposed a solution to the problem of Speaker Diarization on situations where a prior knowledge of the number of participants is unavailable. Their solution is based on a slightly modified version of an interesting Bayesian non-parametric model called “Hierarchical Dirichlet Process–Hidden Markov Model (HDP-HMM)”. The proposed modified version, which named “Sticky HDP-HMM”, imposes state persistence on the model.

### Nonparametric Bayesian Models

The Nonparametric Bayesian Models that we are considering in this discussion can be defined (informally) as follows. The word nonparametric does not mean that those models have no parameters, but in contrary it means that those models have infinite parameters spaces. And the fact that those models are Bayesian means that we are defining probability measures over them.

Dirichlet Process is an example on Nonparametric Bayesian Models. It is a generalization of the Dirichlet Distribution, where the number of parameters can be infinite. The Dirichlet Distribution can be seen as a distribution over distributions of N outcomes. Thus, a draw from a Dirichlet Distribution is itself a probability distribution. Dirichlet Process has the same characteristic; it is a distribution over distributions, but the difference is that the outcomes N can grow and go to infinity.

## Background

### Dirichlet Process

A Dirichlet process [math]DP(H,\alpha)[/math] is defined using two parameters. The first parameter, [math]H[/math], is a base distribution. This parameter can be considered as the mean of the Dirichlet Process. The second parameter, [math]\alpha[/math], is called the concentration parameter.

As described above, a draw from a Dirichlet process is discrete probability measure, regardless of wither the base distribution is discrete or continuous. This fact is one of the most important properties of Dirichlet Process; it assures that draws from the Dirichlet Process can be repeated.

A stick breaking construction of the Dirichlet Process was introduced in (Sethuraman, 1994), as follows:

[math]\, \beta_k = \beta_{‘}^{k} \prod_{l=1}^{k-1}(1-\beta_{l}^{'})[/math]

[math]\, \beta_{l}^{'}[/math]~[math]\, Beta(1,\gamma)[/math]

[math]\, G_o = \sum_{k=1}^{\infty} \beta_k \delta(\theta - \theta_k)[/math]

[math]\, \theta_k[/math]~[math]\, H[/math]

This specific construction is usually denoted by: [math]\, \beta[/math]~[math]\, GEM(\gamma)[/math]

### Hierarchical Dirichlet Processes

Two Dirichlet Process can be used together in a recursion way such that a draw from the first one uses as the base distribution of the second one. Formally,

[math]\, G [/math] ~ [math]\, DP(\alpha, G_0) [/math]

[math]\, G_0 [/math] ~ [math]\, DP(\gamma, H)[/math].

Doing so would allow us to share atoms. The analogy here is that we are modeling a grouped data and each group of them was modeled using a mixture model.