Package edu.berkeley.nlp.lm.io
Class KneserNeyLmReaderCallback<W>
- java.lang.Object
-
- edu.berkeley.nlp.lm.io.KneserNeyLmReaderCallback<W>
-
- Type Parameters:
W
-
- All Implemented Interfaces:
ArrayEncodedNgramLanguageModel<W>, LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>, LmReaderCallback<LongRef>, NgramOrderedLmReaderCallback<LongRef>, NgramLanguageModel<W>, java.io.Serializable
public class KneserNeyLmReaderCallback<W> extends java.lang.Object implements NgramOrderedLmReaderCallback<LongRef>, LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>, ArrayEncodedNgramLanguageModel<W>, java.io.Serializable
Class for producing a Kneser-Ney language model in ARPA format from raw text. Confusingly, this class is both an LmReaderCallback (called from TextReader, which reads plain text) and an LmReader, which "reads" counts, produces Kneser-Ney probabilities and backoffs, and passes them on to an ArpaLmReaderCallback.
- Author:
- adampauls
- See Also:
- Serialized Form
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel
ArrayEncodedNgramLanguageModel.DefaultImplementations
-
Nested classes/interfaces inherited from interface edu.berkeley.nlp.lm.NgramLanguageModel
NgramLanguageModel.StaticMethods
-
-
Field Summary
Fields
Modifier and Type, Field, Description
protected static float
DEFAULT_DISCOUNT
protected int
lmOrder
protected HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts>
ngrams
protected ConfigOptions
opts
protected static long
serialVersionUID
protected int
startIndex
protected WordIndexer<W>
wordIndexer
This array represents the discount used for each ngram order.
-
Constructor Summary
Constructors
Constructor, Description
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder)
KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder, ConfigOptions opts)
-
Method Summary
Modifier and Type, Method, Description
void
addNgram(int[] ngram, int startPos, int endPos, LongRef value, java.lang.String words, boolean justLastWord, long[][] scratch)
void
call(int[] ngram, int startPos, int endPos, LongRef value, java.lang.String words)
Called for each n-gram.
void
call(W[] ngram, LongRef value)
void
callJustLast(W[] ngram, LongRef value, long[][] scratch)
void
cleanup()
Called once all reading is done.
static double[]
defaultDiscounts()
static double[]
defaultMinCounts()
protected float
getDiscountForOrder(int ngramOrder)
protected float
getHighestOrderProb(int[] ngram, int startPos, int endPos)
int
getLmOrder()
Maximum size of n-grams stored by the model.
float
getLogProb(int[] ngram)
Equivalent to getLogProb(ngram, 0, ngram.length)
float
getLogProb(int[] ngram, int startPos, int endPos)
Calculate language model score of an n-gram.
float
getLogProb(java.util.List<W> ngram)
Scores an n-gram.
protected float
getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
protected float
getLowerOrderProb(int[] ngram, int startPos, int endPos)
long
getTotalSize()
WordIndexer<W>
getWordIndexer()
Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
void
handleNgramOrderFinished(int order)
Called when all n-grams of a given order are finished.
void
handleNgramOrderStarted(int order)
Called when n-grams of a given order are started.
protected float
interpolateProb(int[] ngram, int startPos, int endPos)
void
parse(ArpaLmReaderCallback<ProbBackoffPair> callback)
float
scoreSentence(java.util.List<W> sentence)
Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols.
void
setOovWordLogProb(float logProb)
Sets the (log) probability for an OOV word.
-
-
-
Field Detail
-
serialVersionUID
protected static final long serialVersionUID
- See Also:
- Constant Field Values
-
DEFAULT_DISCOUNT
protected static final float DEFAULT_DISCOUNT
- See Also:
- Constant Field Values
-
lmOrder
protected final int lmOrder
-
wordIndexer
protected final WordIndexer<W> wordIndexer
This array represents the discount used for each ngram order. The original Kneser-Ney discounting (-ukndiscount) uses one discounting constant for each N-gram order. These constants are estimated as D = n1 / (n1 + 2*n2) where n1 and n2 are the total number of N-grams with exactly one and two counts, respectively. For simplicity, our code just uses a constant discount for each order of 0.75. However, other discounts can be specified.
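The estimate D = n1 / (n1 + 2*n2) described above can be illustrated with a small standalone sketch. It is not the code of this class: the class name `KneserNeyDiscountSketch`, the `Map`-based input and the fallback rule are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of edu.berkeley.nlp.lm.
final class KneserNeyDiscountSketch {
    // Constant discount mentioned in the text above.
    static final double FALLBACK_DISCOUNT = 0.75;

    /**
     * Estimates D = n1 / (n1 + 2 * n2) from the counts of all n-grams of one order.
     * The Map-based input is an assumption made for this sketch.
     */
    static double estimateDiscount(Map<String, Long> countsForOrder) {
        long n1 = 0; // n-grams seen exactly once
        long n2 = 0; // n-grams seen exactly twice
        for (long count : countsForOrder.values()) {
            if (count == 1) {
                n1++;
            } else if (count == 2) {
                n2++;
            }
        }
        long denominator = n1 + 2 * n2;
        // Assumption: fall back to the constant when the estimate is undefined.
        return denominator == 0 ? FALLBACK_DISCOUNT : (double) n1 / denominator;
    }

    public static void main(String[] args) {
        Map<String, Long> bigramCounts = new LinkedHashMap<>();
        bigramCounts.put("the cat", 1L);
        bigramCounts.put("cat sat", 1L);
        bigramCounts.put("sat on", 2L);
        bigramCounts.put("on the", 5L);
        // n1 = 2, n2 = 1, so D = 2 / (2 + 2 * 1) = 0.5
        System.out.println(estimateDiscount(bigramCounts));
    }
}
```

In practice one such estimate would be computed per n-gram order, which is why the text speaks of one discounting constant for each order.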
-
ngrams
protected final HashNgramMap<KneserNeyCountValueContainer.KneserNeyCounts> ngrams
-
opts
protected final ConfigOptions opts
-
startIndex
protected final int startIndex
-
-
Constructor Detail
-
KneserNeyLmReaderCallback
public KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder)
- Parameters:
wordIndexer -
maxOrder -
inputIsSentences - If true, input n-grams are assumed to be sentences, and all sub-ngrams of up to order maxOrder are added. If false, input n-grams are assumed to be atomic.
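To make "all sub-ngrams of up to order maxOrder" concrete, here is a minimal standalone sketch that enumerates them for a sentence already encoded as integer word IDs. The class and method names are hypothetical, and the sketch ignores any start- or end-of-sentence padding the real class may apply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of edu.berkeley.nlp.lm.
final class SubNgramSketch {
    /** Returns every contiguous sub-ngram of length 1..maxOrder, grouped by end position. */
    static List<int[]> subNgrams(int[] sentence, int maxOrder) {
        List<int[]> result = new ArrayList<>();
        for (int end = 1; end <= sentence.length; end++) {
            // An n-gram ending at 'end' cannot be longer than the prefix available.
            for (int order = 1; order <= maxOrder && order <= end; order++) {
                result.add(Arrays.copyOfRange(sentence, end - order, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] ngram : subNgrams(new int[] {1, 2, 3}, 2)) {
            System.out.println(Arrays.toString(ngram));
        }
    }
}
```

When the input is treated as atomic instead, only the n-gram itself would be counted and no enumeration of this kind takes place.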
-
KneserNeyLmReaderCallback
public KneserNeyLmReaderCallback(WordIndexer<W> wordIndexer, int maxOrder, ConfigOptions opts)
-
-
Method Detail
-
call
public void call(int[] ngram, int startPos, int endPos, LongRef value, java.lang.String words)
Description copied from interface: LmReaderCallback
Called for each n-gram.
- Specified by:
call in interface LmReaderCallback<W>
- Parameters:
ngram - The integer representation of the words as given by the provided WordIndexer
value - The value of the n-gram
words - The string representation of the n-gram (space separated)
-
addNgram
public void addNgram(int[] ngram, int startPos, int endPos, LongRef value, java.lang.String words, boolean justLastWord, long[][] scratch)
- Parameters:
ngram -
startPos -
endPos -
value -
words -
-
-
interpolateProb
protected float interpolateProb(int[] ngram, int startPos, int endPos)
-
getHighestOrderProb
protected float getHighestOrderProb(int[] ngram, int startPos, int endPos)
-
getLowerOrderProb
protected float getLowerOrderProb(int[] ngram, int startPos, int endPos)
-
getLowerOrderBackoff
protected float getLowerOrderBackoff(int[] ngram, int startPos, int endPos)
-
getDiscountForOrder
protected float getDiscountForOrder(int ngramOrder)
-
cleanup
public void cleanup()
Description copied from interface: LmReaderCallback
Called once all reading is done.
- Specified by:
cleanup in interface LmReaderCallback<W>
-
defaultDiscounts
public static double[] defaultDiscounts()
-
defaultMinCounts
public static double[] defaultMinCounts()
-
parse
public void parse(ArpaLmReaderCallback<ProbBackoffPair> callback)
- Specified by:
parse in interface LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>>
-
getWordIndexer
public WordIndexer<W> getWordIndexer()
Description copied from interface: NgramLanguageModel
Each LM must have a WordIndexer which assigns integer IDs to each word W in the language.
- Specified by:
getWordIndexer in interface NgramLanguageModel<W>
- Returns:
-
handleNgramOrderFinished
public void handleNgramOrderFinished(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when all n-grams of a given order are finished.
- Specified by:
handleNgramOrderFinished in interface NgramOrderedLmReaderCallback<W>
-
handleNgramOrderStarted
public void handleNgramOrderStarted(int order)
Description copied from interface: NgramOrderedLmReaderCallback
Called when n-grams of a given order are started.
- Specified by:
handleNgramOrderStarted in interface NgramOrderedLmReaderCallback<W>
-
getLmOrder
public int getLmOrder()
Description copied from interface: NgramLanguageModel
Maximum size of n-grams stored by the model.
- Specified by:
getLmOrder in interface NgramLanguageModel<W>
- Returns:
-
scoreSentence
public float scoreSentence(java.util.List<W> sentence)
Description copied from interface: NgramLanguageModel
Scores a complete sentence, taking appropriate care with the start- and end-of-sentence symbols. This is a convenience method and will generally be inefficient.
- Specified by:
scoreSentence in interface NgramLanguageModel<W>
- Returns:
-
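One common way to take care of the start- and end-of-sentence symbols is sketched below. It is not this library's implementation. The markers `<s>` and `</s>`, the class name and the `ToDoubleFunction` scorer are all assumptions for illustration. The sketch pads the sentence, then sums the log probability of each word (including the end marker) given at most lmOrder - 1 preceding words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of edu.berkeley.nlp.lm.
final class SentenceScoringSketch {
    static double scoreSentence(List<String> sentence, int lmOrder,
                                ToDoubleFunction<List<String>> ngramLogProb) {
        // Assumed boundary markers; a real model takes them from its word indexer.
        List<String> padded = new ArrayList<>();
        padded.add("<s>");
        padded.addAll(sentence);
        padded.add("</s>");

        double total = 0.0;
        // Start at 1: the start marker is context only and is never scored itself.
        for (int i = 1; i < padded.size(); i++) {
            int from = Math.max(0, i + 1 - lmOrder);
            total += ngramLogProb.applyAsDouble(padded.subList(from, i + 1));
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy scorer: longer n-grams get a higher (less negative) log probability.
        ToDoubleFunction<List<String>> toyScorer = ngram -> -1.0 / ngram.size();
        System.out.println(scoreSentence(Arrays.asList("a", "b", "c"), 3, toyScorer));
    }
}
```

Building a new list and sub-lists for every call is part of why a convenience method of this shape tends to be slower than the array-based scoring methods.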
getLogProb
public float getLogProb(java.util.List<W> ngram)
Description copied from interface: NgramLanguageModel
Scores an n-gram. This is a convenience method and will generally be relatively inefficient. More efficient versions are available in ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int) and ContextEncodedNgramLanguageModel.getLogProb(long, int, int, edu.berkeley.nlp.lm.ContextEncodedNgramLanguageModel.LmContextInfo).
- Specified by:
getLogProb in interface NgramLanguageModel<W>
-
getLogProb
public float getLogProb(int[] ngram, int startPos, int endPos)
Description copied from interface: ArrayEncodedNgramLanguageModel
Calculate language model score of an n-gram. Warning: if you pass in an n-gram of length greater than getLmOrder(), this call will silently ignore the extra words of context. In other words, if you pass in a 5-gram (endPos - startPos == 5) to a 3-gram model, it will only score the words from startPos + 2 to endPos.
- Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
- Parameters:
ngram - array of words in integer representation
startPos - start of the portion of the array to be read
endPos - end of the portion of the array to be read.
- Returns:
-
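The index arithmetic behind the warning above can be checked with a small standalone sketch. The class and method names are hypothetical. The sketch only computes which part of the array would be scored and does not call the library.

```java
// Hypothetical helper, not part of edu.berkeley.nlp.lm.
final class ContextTruncationSketch {
    /**
     * Returns the first index that is actually used when the half-open range
     * [startPos, endPos) holds more words than the model order allows.
     */
    static int effectiveStart(int startPos, int endPos, int lmOrder) {
        int length = endPos - startPos;
        // Extra context on the left is dropped; the last lmOrder words are kept.
        return length > lmOrder ? endPos - lmOrder : startPos;
    }

    public static void main(String[] args) {
        // A 5-gram passed to a 3-gram model: scoring starts at startPos + 2.
        System.out.println(effectiveStart(0, 5, 3));
    }
}
```

Callers who need to detect over-long inputs would have to compare endPos - startPos against getLmOrder() themselves, since the truncation is silent.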
getLogProb
public float getLogProb(int[] ngram)
Description copied from interface: ArrayEncodedNgramLanguageModel
Equivalent to getLogProb(ngram, 0, ngram.length)
- Specified by:
getLogProb in interface ArrayEncodedNgramLanguageModel<W>
- See Also:
ArrayEncodedNgramLanguageModel.getLogProb(int[], int, int)
-
getTotalSize
public long getTotalSize()
-
setOovWordLogProb
public void setOovWordLogProb(float logProb)
Description copied from interface: NgramLanguageModel
Sets the (log) probability for an OOV word. Note that this is in general different from the log probability of the unk tag.
- Specified by:
setOovWordLogProb in interface NgramLanguageModel<W>
-
-