Sunday, April 21, 2019

N-Gram ARPA Model

This post summarizes how probabilities are calculated from an ARPA language model.

Consider test.arpa and the following sequences of words:
look beyond
more looking on

Let's consider look beyond first. The log10 probability of seeing beyond conditioned on look, i.e., log10(P(beyond | look)), is -0.2922095. This comes directly from the test.arpa file, line 78.

What, then, is the probability of seeing look beyond? Well, by the chain rule of conditional probabilities,

log10(P(look beyond))
= log10(P(look) * P(beyond | look)) 
= log10(P(look)) + log10(P(beyond | look)) 
= -1.687872 + (-0.2922095) = -1.9800815 (kenlm reports -1.980081558227539),

which can be verified with the following Python code:

import kenlm
model = kenlm.LanguageModel('test.arpa')
print(model.score('look beyond', bos=False, eos=False))
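
kenlm can also break the score down word by word via full_scores, which yields a (log10 probability, matched n-gram length, OOV flag) tuple for each word. A short sketch, assuming the same test.arpa:

import kenlm
model = kenlm.LanguageModel('test.arpa')
# Each tuple is (log10 prob, length of the n-gram that matched, OOV flag)
for prob, ngram_length, oov in model.full_scores('look beyond', bos=False, eos=False):
    print(prob, ngram_length, oov)
# Expect roughly (-1.687872, 1, False) and (-0.2922095, 2, False)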


Let's try the next sequence, more looking on, starting again from the chain rule:

log10(P(more looking on))
= log10(P(more)) + log10(P(looking | more)) + log10(P(on | more looking))

The first term on the RHS is easy: log10(P(more)) = -1.206319, from line 34.

The second term is a bit tricky, because the bigram more looking does not appear in the model. Hence, we back off with the formula
P(looking | more) = P(looking) * BW(more)
where log10(P(looking)) = -1.285941 from line 33, and log10(BW(more)) = -0.544068 is the back-off weight of more, which can be read off from line 34.
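
We can sanity-check this with kenlm too: the score of the two-word sequence more looking should be log10(P(more)) + log10(P(looking | more)) = -1.206319 + (-1.285941 - 0.544068) = -3.036328.

import kenlm
model = kenlm.LanguageModel('test.arpa')
# kenlm applies the back-off internally when the bigram is missing
print(model.score('more looking', bos=False, eos=False))  # approx -3.036328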

Lastly, the trigram more looking on is not present in the model either, so we reduce the third term to
P(on | more looking) = P(on | looking) * BW(more looking)
where log10 of the first factor is -0.4638903 from line 80, and the back-off weight BW(more looking) is taken to be 1 (i.e., log10 weight 0), because the bigram more looking has no back-off entry in the model.
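
Putting the two cases together, the ARPA look-up is a simple recursion: use the stored probability if the n-gram is in the model; otherwise add the context's back-off weight (0 in log10 space when the context has no entry) and retry with the n-gram shortened from the left. Below is a minimal sketch in Python; the two tables are hypothetical stand-ins for a parsed test.arpa, filled only with the entries used in this post.

# Hypothetical tables standing in for a parsed ARPA file
logprob = {
    ('more',): -1.206319,
    ('looking',): -1.285941,
    ('looking', 'on'): -0.4638903,
}
backoff = {
    ('more',): -0.544068,  # absent entries count as 0.0 in log10 space
}

def score(ngram):
    # Use the stored probability when the model has the n-gram
    if ngram in logprob:
        return logprob[ngram]
    # Otherwise back off: context weight plus the shortened n-gram's score
    return backoff.get(ngram[:-1], 0.0) + score(ngram[1:])

print(score(('more', 'looking')))        # -0.544068 + -1.285941 = -1.830009
print(score(('more', 'looking', 'on')))  # 0.0 + log10(P(on | looking)) = -0.4638903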

Thus, we get log10(P(more looking on)) = -(1.206319 + 1.285941 + 0.544068 + 0.4638903) = -3.5002183 ≈ -3.5.
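
This, too, matches kenlm:

import kenlm
model = kenlm.LanguageModel('test.arpa')
print(model.score('more looking on', bos=False, eos=False))  # approx -3.5002183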

For more details, refer to this document. I also find this answer very helpful.