Implementing a Hidden Markov Model Toolkit

In this assignment, you will implement the main algorithms associated with Hidden Markov Models and become comfortable with dynamic programming and expectation maximization. You will also apply your HMM to part-of-speech tagging, linguistic analysis, and decipherment.

Write all of your code in hmm.py. Functions handling the I/O essentials for interacting with the program are provided, as are headers for the methods you need to define. Read and understand the provided code.

Test your implementations by following the directions under the Verify headings.

You may work on the sections in any order as long as you complete the prerequisites, sketched in the dependency list below. (Sections b, c, d, and e depend on Section a; Section f depends on Section e; and Section g depends on all of the others.)

a: read and write model
b: supervised learning
c: random generation
d: viterbi
e: forward & backward
f: unsupervised learning
g: experimentation

Section a: Loading and Writing the Model Files

Models

We have provided the beginnings of a class called HMM to represent a Hidden Markov Model, parametrized by transition and emission probabilities.

HMM transition and emission parameters are specified in a pair of files, like models/two_english.trans and models/two_english.emit. Each row of a .trans file is of the form

fromstate tostate P(tostate|fromstate)

The special start state is denoted #. Nothing is emitted from the # state.

Each row of a .emit file is of the form

state output P(output|state)
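
For concreteness, here is a hypothetical two-state fragment in this format (purely illustrative; these lines are not taken from the provided model files):

# HOT 0.6
# COLD 0.4
HOT HOT 0.7
HOT COLD 0.3
COLD COLD 0.6
COLD HOT 0.4

for the .trans file, and

HOT 3 0.6
HOT 2 0.4
COLD 1 0.5
COLD 2 0.5

for the .emit file.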

Write a method, HMM.load, to load the parameters from the model files.

HMM.load should take the basename of the file pair (such as models/two_english) as an argument.

Additionally, define the inverse method, HMM.dump, which takes a basename of the output file pair and writes the parameters of the model to the corresponding .trans and .emit files. For brevity, transition or emission probabilities that are 0 should not be written.
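
As a rough sketch (not the required implementation), load and dump might look like the following, assuming the parameters are stored in nested dictionaries self.transitions[fromstate][tostate] and self.emissions[state][output]; the exact storage layout is up to you.

from collections import defaultdict

class HMM:
    def __init__(self):
        # transitions[fromstate][tostate] = P(tostate | fromstate)
        # emissions[state][output]        = P(output | state)
        self.transitions = defaultdict(dict)
        self.emissions = defaultdict(dict)

    def load(self, basename):
        """Read parameters from basename.trans and basename.emit."""
        with open(basename + '.trans') as f:
            for line in f:
                if line.strip():
                    fromstate, tostate, prob = line.split()
                    self.transitions[fromstate][tostate] = float(prob)
        with open(basename + '.emit') as f:
            for line in f:
                if line.strip():
                    state, output, prob = line.split()
                    self.emissions[state][output] = float(prob)

    def dump(self, basename):
        """Write parameters to basename.trans and basename.emit, skipping zeros."""
        with open(basename + '.trans', 'w') as f:
            for fromstate, row in self.transitions.items():
                for tostate, prob in row.items():
                    if prob > 0:
                        f.write('%s %s %g\n' % (fromstate, tostate, prob))
        with open(basename + '.emit', 'w') as f:
            for state, row in self.emissions.items():
                for output, prob in row.items():
                    if prob > 0:
                        f.write('%s %s %g\n' % (state, output, prob))

Skipping zero-probability entries in dump keeps the written files compact, as required above.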

Observations (no code needed)

Observations are represented in a .obs file such as data/browntags.obs, where each observation consists of two lines, in order:

state1 state2 state3 ...
output1 output2 output3 ...
The first line is the space-delimited state sequence, and the second is the space-delimited output sequence. The first line may be empty when the state sequence is unknown, as in data/armenian.obs.

Verify

Create a model by loading from models/two_english and then writing it to two_english_test, like so:

model = HMM()
model.load('models/two_english')
model.dump('two_english_test')

Verify that models/two_english.emit and two_english_test.emit are identical (other than line ordering), as are models/two_english.trans and two_english_test.trans.

We have provided a class called Observation to represent each observation, as well as a load_observations function to load a list of observations from a file.
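
Since Observation and load_observations are already provided in hmm.py, use those. The sketch below is only an illustration of the behavior described above (two lines per observation, with the state line possibly empty); the attribute names stateseq and outputseq are assumptions carried through the later sketches.

class Observation:
    def __init__(self, stateseq, outputseq):
        self.stateseq = stateseq        # list of states; empty if unknown
        self.outputseq = outputseq      # list of outputs

def load_observations(filename):
    """Illustrative loader: every two lines of a .obs file form one observation."""
    with open(filename) as f:
        lines = [line.rstrip('\n') for line in f]
    observations = []
    for i in range(0, len(lines) - 1, 2):
        observations.append(Observation(lines[i].split(), lines[i + 1].split()))
    return observations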

Section b: Supervised Learning

Define HMM.learn_supervised that takes a list of observations with known state sequences, such as the observations in data/browntags.obs, as well as booleans specifying whether the transition and emission parameters should be "locked" (unchanged during training). The method should estimate the parameters of the HMM using maximum likelihood estimation. No smoothing is required.
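
As a minimal sketch of the counting involved (assuming the nested-dictionary parameter layout and the Observation attributes from the earlier sketches, with # as the start state), learn_supervised might look roughly like this inside the HMM class:

from collections import defaultdict

def learn_supervised(self, observations, lock_transitions=False, lock_emissions=False):
    """Sketch: maximum-likelihood estimation from fully observed state sequences."""
    trans_counts = defaultdict(lambda: defaultdict(float))
    emit_counts = defaultdict(lambda: defaultdict(float))
    for obs in observations:
        prev = '#'                                   # special start state
        for state, output in zip(obs.stateseq, obs.outputseq):
            trans_counts[prev][state] += 1
            emit_counts[state][output] += 1
            prev = state
    if not lock_transitions:
        self.transitions = {}
        for s, row in trans_counts.items():
            total = sum(row.values())
            self.transitions[s] = {t: c / total for t, c in row.items()}
    if not lock_emissions:
        self.emissions = {}
        for s, row in emit_counts.items():
            total = sum(row.values())
            self.emissions[s] = {o: c / total for o, c in row.items()}

Normalizing each row of counts by its total is exactly the maximum-likelihood estimate when the state sequences are fully observed.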

Verify

python hmm.py models/partofspeech sup data/browntags.obs
which estimates the model using the state and output sequences in data/browntags.obs (a portion of the Brown corpus). The program writes the trained model to models/partofspeech.browntags.trained{.emit, .trans}. This model should be identical to gold/partofspeech.browntags.trained{.emit, .trans}.

Note: If you are skipping this section or leaving it for later, copy over partofspeech.browntags.trained{.trans, .emit} from the gold directory to the models directory, since you will need them for the following sections.

Section c: Random Observation Generation

Define a method, HMM.generate, that takes an integer n, and returns a random observation of length n, generated by sampling from the HMM.
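
As a sketch of one way to sample (again assuming the nested-dictionary layout, the # start state, and the Observation attributes from the earlier sketches), generate can walk the Markov chain and draw each state and output with random.choices:

import random

def generate(self, n):
    """Sketch: sample a random Observation of length n from the model."""
    states, outputs = [], []
    state = '#'                                      # start state; emits nothing
    for _ in range(n):
        row = self.transitions[state]
        state = random.choices(list(row.keys()), weights=list(row.values()))[0]
        row = self.emissions[state]
        output = random.choices(list(row.keys()), weights=list(row.values()))[0]
        states.append(state)
        outputs.append(output)
    return Observation(states, outputs)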

Test the output of this method for the part-of-speech model you learned in section b by running

python hmm.py models/partofspeech.browntags.trained g generated.obs
which writes 20 random observations generated by the HMM into generated.obs.

Here are two example observations generated by our run. (Of course, due to randomness, yours will differ.)

DET ADJ . ADV ADJ VERB ADP DET ADJ NOUN VERB ADJ NOUN
the semi-catatonic , quite several must of an western bridge cannot spectacular analyses
DET NOUN ADP NOUN CONJ DET VERB DET NOUN NOUN NOUN ADP DET NOUN
whose light for wall and the learned the hull postmaster trash in his peters

Section d: Viterbi Algorithm for the Best State Sequence

Define a method, HMM.viterbi, that implements the Viterbi algorithm to find the best state sequence for the output sequence of a given observation. The method should set the state sequence of the observation to be this Viterbi state sequence.

Your algorithm can break ties however it likes. (In practice, HMM parameter values tend to differ enough that you won't run into many ties.)
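
A sketch of the dynamic program follows, using raw probabilities for readability (a real implementation may prefer log probabilities to avoid underflow on long sequences); the attribute names match the earlier sketches.

def viterbi(self, observation):
    """Sketch: fill observation.stateseq with the most probable state sequence."""
    outputs = observation.outputseq
    states = list(self.emissions.keys())
    # v[t][s]: probability of the best path ending in state s after t+1 outputs
    v = [{} for _ in outputs]
    back = [{} for _ in outputs]
    for s in states:
        v[0][s] = (self.transitions['#'].get(s, 0.0) *
                   self.emissions[s].get(outputs[0], 0.0))
    for t in range(1, len(outputs)):
        for s in states:
            best_prev, best_prob = states[0], 0.0
            for p in states:
                prob = (v[t - 1][p] * self.transitions[p].get(s, 0.0) *
                        self.emissions[s].get(outputs[t], 0.0))
                if prob > best_prob:
                    best_prev, best_prob = p, prob
            v[t][s] = best_prob
            back[t][s] = best_prev
    # follow the backpointers from the most probable final state
    best = max(states, key=lambda s: v[-1][s])
    seq = [best]
    for t in range(len(outputs) - 1, 0, -1):
        seq.append(back[t][seq[-1]])
    observation.stateseq = list(reversed(seq))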

Verify

python hmm.py models/partofspeech.browntags.trained v data/ambiguous_sents.obs
This uses the HMM parameters in models/partofspeech.browntags.trained.{trans,emit} to compute the best sequence of part-of-speech tags for each sentence in data/ambiguous_sents.obs, and writes the tagged observations to data/ambiguous_sents.tagged.obs.

Compare the output file to gold/ambiguous_sents.tagged.obs. They should be identical.

Section e: Forward and Backward Algorithms

Define methods HMM.forward and HMM.backward to implement the forward and backward algorithms respectively. Both methods should take an output sequence, and return a data structure containing the values for \(\alpha_i(t)\) and \(\beta_i(t)\) respectively.
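
One standard convention, which the sketch below also assumes, is

\[ \alpha_i(t) = P(o_1, \ldots, o_t, q_t = i), \qquad \beta_i(t) = P(o_{t+1}, \ldots, o_T \mid q_t = i), \]

computed left to right and right to left respectively via

\[ \alpha_j(t) = b_j(o_t) \sum_i \alpha_i(t-1)\, a_{ij}, \qquad \beta_i(t) = \sum_j a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1), \]

with base cases \(\alpha_i(1) = P(i \mid \#)\, b_i(o_1)\) and \(\beta_i(T) = 1\), where \(a_{ij}\) is the transition probability from state \(i\) to state \(j\) and \(b_j(o)\) is the probability of emitting \(o\) from state \(j\). The total probability of the output sequence is then \(\sum_i \alpha_i(T)\), or equivalently \(\sum_i P(i \mid \#)\, b_i(o_1)\, \beta_i(1)\).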

Additionally, define HMM.forward_probability and HMM.backward_probability which return the probability of a given output sequence using the values computed by the above methods.
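
A sketch under the conventions above (whether the *_probability methods take the \(\alpha\)/\(\beta\) tables or the observation itself is up to you; here they take the tables, plus the output sequence where needed):

def forward(self, outputs):
    """Sketch: alpha[t][s] = P(outputs[0..t], state at time t is s)."""
    states = list(self.emissions.keys())
    alpha = [{} for _ in outputs]
    for s in states:
        alpha[0][s] = (self.transitions['#'].get(s, 0.0) *
                       self.emissions[s].get(outputs[0], 0.0))
    for t in range(1, len(outputs)):
        for s in states:
            alpha[t][s] = self.emissions[s].get(outputs[t], 0.0) * sum(
                alpha[t - 1][p] * self.transitions[p].get(s, 0.0) for p in states)
    return alpha

def forward_probability(self, alpha):
    """P(output sequence) = sum of the final alphas."""
    return sum(alpha[-1].values())

def backward(self, outputs):
    """Sketch: beta[t][s] = P(outputs[t+1..] | state at time t is s)."""
    states = list(self.emissions.keys())
    beta = [{} for _ in outputs]
    for s in states:
        beta[-1][s] = 1.0
    for t in range(len(outputs) - 2, -1, -1):
        for s in states:
            beta[t][s] = sum(self.transitions[s].get(nxt, 0.0) *
                             self.emissions[nxt].get(outputs[t + 1], 0.0) *
                             beta[t + 1][nxt] for nxt in states)
    return beta

def backward_probability(self, beta, outputs):
    """P(output sequence), computed from the betas at the first position."""
    return sum(self.transitions['#'].get(s, 0.0) *
               self.emissions[s].get(outputs[0], 0.0) * beta[0][s]
               for s in self.emissions)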

Verify

python hmm.py models/partofspeech.browntags.trained f data/ambiguous_sents.obs
computes the total probability of each sentence in data/ambiguous_sents.obs using HMM.forward, and writes it to data/ambiguous_sents.forwardprob.

Similarly,

python hmm.py models/partofspeech.browntags.trained b data/ambiguous_sents.obs
computes the total probability of each sentence using HMM.backward, and writes it to data/ambiguous_sents.backwardprob.

The numbers in both these files should be the same, and identical to gold/ambiguous_sents.prob.

Section f: Unsupervised Learning with Baum-Welch

Define HMM.learn_unsupervised that takes a list of observations where the state sequences may be unknown, a convergence threshold, booleans specifying whether the transition and emission parameters should be "locked", and the number of random restarts. This method should use the Baum-Welch EM algorithm (which draws upon HMM.forward and HMM.backward) to estimate the model parameters, starting from the current model.

The function should also return the log likelihood of the trained model.

If the number of restarts is greater than 0, re-initialize the model with random values (keeping the locked parameters the same) and run EM again. Keep the model with the best log likelihood from all the restarts.
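
As a sketch of the core EM loop only (random restarts and re-initialization are omitted for brevity), each iteration accumulates expected transition and emission counts, the gamma and xi quantities, from the forward and backward tables, re-normalizes them, and stops once the log likelihood improves by less than the threshold. The attribute names and the _normalize helper are assumptions:

import math
from collections import defaultdict

def _normalize(counts):
    """Hypothetical helper: turn nested counts into conditional probability rows."""
    probs = {}
    for s, row in counts.items():
        total = sum(row.values())
        if total > 0:
            probs[s] = {t: c / total for t, c in row.items()}
    return probs

def learn_unsupervised(self, observations, convergence=0.001,
                       lock_transitions=False, lock_emissions=False):
    """Sketch of a single Baum-Welch EM run (no restarts shown)."""
    old_ll = float('-inf')
    while True:
        trans_counts = defaultdict(lambda: defaultdict(float))
        emit_counts = defaultdict(lambda: defaultdict(float))
        ll = 0.0
        # E-step: expected counts from the forward and backward tables
        for obs in observations:
            outputs = obs.outputseq
            alpha, beta = self.forward(outputs), self.backward(outputs)
            prob = sum(alpha[-1].values())
            ll += math.log(prob)             # assumes P(outputs) > 0 under the model
            for t in range(len(outputs)):
                for s in alpha[t]:
                    gamma = alpha[t][s] * beta[t][s] / prob
                    emit_counts[s][outputs[t]] += gamma
                    if t == 0:
                        trans_counts['#'][s] += gamma
                    else:
                        for p in alpha[t - 1]:    # xi: expected p -> s transitions
                            trans_counts[p][s] += (alpha[t - 1][p] *
                                self.transitions[p].get(s, 0.0) *
                                self.emissions[s].get(outputs[t], 0.0) *
                                beta[t][s] / prob)
        # M-step: re-normalize the expected counts, unless locked
        if not lock_transitions:
            self.transitions = _normalize(trans_counts)
        if not lock_emissions:
            self.emissions = _normalize(emit_counts)
        if ll - old_ll < convergence:
            return ll
        old_ll = ll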

Verify

python hmm.py models/two_english unsup data/english_words.obs
which will print out the trained model's log likelihood:

The final model's log likelihood is -152860.669251
The program will also write the trained model to models/two_english.english_words.trained{.emit,.trans}.

Check them against gold/two_english.english_words.trained{.emit, .trans}.

Section g: Experimentation and Analysis

You're done! Now go play with your cool new toy.