Class LmReaders
- java.lang.Object
  - edu.berkeley.nlp.lm.io.LmReaders
public class LmReaders extends java.lang.Object
This class contains a number of static methods for reading, writing, and estimating n-gram language models. Since most uses of this software will go through this class, I will use this space to document the software as a whole. This software provides three main pieces of functionality:
(a) estimation of language models from text inputs
(b) data structures for efficiently storing large collections of n-grams in memory
(c) an API for efficiently querying language models derived from n-gram collections.
Most of the techniques used here are described in "Faster and Smaller N-gram Language Models" (Pauls and Klein 2011).
This software supports the estimation of two types of language models: Kneser-Ney language models (Kneser and Ney, 1995) and Stupid Backoff language models (Brants et al. 2007). Kneser-Ney language models can be estimated from raw text by calling createKneserNeyLmFromTextFiles(List, WordIndexer, int, File, ConfigOptions). This can also be done from the command line by calling main() in MakeKneserNeyArpaFromText. See the examples folder for a script which demonstrates its use; a sketch of the programmatic call follows below.
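As a rough illustration, here is a minimal sketch of the programmatic route. StringWordIndexer, the default ConfigOptions constructor, the symbol setters, and the import package paths are assumptions about the accompanying library rather than anything documented on this page, and all file paths are placeholders.

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;

    import edu.berkeley.nlp.lm.ConfigOptions;      // package paths of the library classes are assumed
    import edu.berkeley.nlp.lm.StringWordIndexer;
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class EstimateKneserNeyExample {
        public static void main(String[] args) {
            // Placeholder inputs: newline-separated raw text files.
            final List<String> inputFiles = Arrays.asList("corpus.part1.txt", "corpus.part2.txt");

            // String implementation of WordIndexer; the setter names below are assumptions,
            // see MakeKneserNeyArpaFromText for the canonical setup.
            final StringWordIndexer wordIndexer = new StringWordIndexer();
            wordIndexer.setStartSymbol("<s>");
            wordIndexer.setEndSymbol("</s>");
            wordIndexer.setUnkSymbol("<unk>");

            // Estimate a 5-gram Kneser-Ney LM and write it in ARPA format (log base 10, as with SRILM).
            LmReaders.createKneserNeyLmFromTextFiles(inputFiles, wordIndexer, 5,
                    new File("corpus.5gram.arpa"), new ConfigOptions());
        }
    }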
A Stupid Backoff language model can be read from a directory containing n-gram counts in the format used by Google's Web1T corpus by calling readLmFromGoogleNgramDir(String, boolean, boolean); a sketch of that call follows below. Note that this software does not (yet) support building Google count directories from raw text, though this can be done using SRILM.
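A minimal sketch of that call with a placeholder directory path; the scoreSentence convenience method and the import package paths are assumptions about the accompanying library, not something documented on this page.

    import java.util.Arrays;

    import edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel; // package paths assumed
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class GoogleNgramDirExample {
        public static void main(String[] args) {
            // compress = false keeps the uncompressed (HASH) representation;
            // kneserNey = false presumably selects Stupid Backoff scoring (assumption based on the parameter name).
            final ArrayEncodedNgramLanguageModel<String> lm =
                    LmReaders.readLmFromGoogleNgramDir("google_ngrams_dir", false, false);

            // scoreSentence(List) is assumed from the NgramLanguageModel interface.
            System.out.println(lm.scoreSentence(Arrays.asList("the", "cat", "sat")));
        }
    }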
Loading or estimating language models from text files can be very slow. This software can use Java's built-in serialization to build language model binaries, which are both smaller and faster to load. MakeLmBinaryFromArpa and MakeLmBinaryFromGoogle provide main() methods for doing this. See the examples folder for scripts which demonstrate their use.
Language models can be read into memory from ARPA format using readArrayEncodedLmFromArpa(String, boolean) and readContextEncodedLmFromArpa(String). The "array encoding" versus "context encoding" distinction is discussed in Section 4.2 of Pauls and Klein (2011). Again, since loading language models from textual representations can be very slow, they can instead be read from binaries using readLmBinary(String); a sketch of the binary round trip follows below.
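For example, an ARPA file can be converted to a binary once and loaded quickly thereafter. A minimal sketch with placeholder paths; getLmOrder() and the import package paths are assumptions about the accompanying library, and the final cast follows the readLmBinary documentation below.

    import edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel; // package paths assumed
    import edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm;
    import edu.berkeley.nlp.lm.NgramLanguageModel;
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class BinaryRoundTripExample {
        public static void main(String[] args) {
            // Read an array-encoded LM from a (placeholder) ARPA file, uncompressed.
            final ArrayEncodedProbBackoffLm<String> lm =
                    LmReaders.readArrayEncodedLmFromArpa("corpus.5gram.arpa", false);

            // Serialize it; the binary is smaller and much faster to load than the ARPA file.
            LmReaders.writeLmBinary(lm, "corpus.5gram.binary");

            // Later: load the binary and cast it back down to a useful interface.
            final NgramLanguageModel<String> reloaded = LmReaders.readLmBinary("corpus.5gram.binary");
            final ArrayEncodedNgramLanguageModel<String> arrayLm =
                    (ArrayEncodedNgramLanguageModel<String>) reloaded;
            System.out.println("order = " + arrayLm.getLmOrder());
        }
    }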
The interfaces for these language models can be found in ArrayEncodedNgramLanguageModel and ContextEncodedNgramLanguageModel. For examples of these interfaces in action, have a look at PerplexityTest.
We implement the HASH, HASH+SCROLL, and COMPRESSED language model representations described in Pauls and Klein (2011) in this release. The SORTED implementation may be added later. See HashNgramMap and CompressedNgramMap for the implementations of the HASH and COMPRESSED representations.
To speed up queries, you can wrap language models with caches (ContextEncodedCachingLmWrapper and ArrayEncodedCachingLmWrapper). These caches are described in Section 4.1 of Pauls and Klein (2011). You should more or less always use these caches, since they are faster and have modest memory requirements; a sketch follows below.
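A hedged sketch of wrapping an array-encoded model in a cache. The static factory name wrapWithCacheNotThreadSafe, the getLogProb(List) query method, and the import package paths are assumptions about the accompanying library, not something documented on this page.

    import java.util.Arrays;

    import edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel;      // package paths assumed
    import edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm;
    import edu.berkeley.nlp.lm.cache.ArrayEncodedCachingLmWrapper;
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class CachingExample {
        public static void main(String[] args) {
            final ArrayEncodedProbBackoffLm<String> lm =
                    LmReaders.readArrayEncodedLmFromArpa("corpus.5gram.arpa", false);

            // Wrap the model in a query cache (assumed factory method name; a thread-safe
            // variant is also expected to exist). Route all queries through the wrapper.
            final ArrayEncodedNgramLanguageModel<String> cachedLm =
                    ArrayEncodedCachingLmWrapper.wrapWithCacheNotThreadSafe(lm);

            // getLogProb(List) is assumed from the NgramLanguageModel interface.
            System.out.println(cachedLm.getLogProb(Arrays.asList("the", "cat", "sat")));
        }
    }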
This software also supports a Java Map wrapper around an n-gram collection. You can read a map wrapper using readNgramMapFromGoogleNgramDir(String, boolean, WordIndexer); a sketch follows below.
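A minimal sketch using the String-keyed overload readNgramMapFromGoogleNgramDir(String, boolean); it assumes, per the description above, that the returned NgramMapWrapper behaves as a java.util.Map over the n-gram collection, and the directory path is a placeholder.

    import java.util.Map;

    import edu.berkeley.nlp.lm.io.LmReaders;

    public class NgramMapExample {
        public static void main(String[] args) {
            // Map view over the raw n-gram counts in a Web1T-style directory (compress = false).
            final Map<?, ?> counts =
                    LmReaders.readNgramMapFromGoogleNgramDir("google_ngrams_dir", false);
            System.out.println("distinct n-grams stored: " + counts.size());
        }
    }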
ComputeLogProbabilityOfTextStream provides a main() method for computing the log probability of raw text. Some example scripts can be found in the examples/ directory.
- Author:
- adampauls
-
-
Constructor Summary
LmReaders()
-
Method Summary
static <W> void createKneserNeyLmFromTextFiles(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, java.io.File arpaOutputFile, ConfigOptions opts)
    Estimates a Kneser-Ney language model from raw text, and writes a file (in ARPA format).
static <W> ArrayEncodedProbBackoffLm<W> readArrayEncodedLmFromArpa(LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>> lmFile, boolean compress, WordIndexer<W> wordIndexer, ConfigOptions opts)
    Reads an array-encoded language model from an ARPA lm file.
static ArrayEncodedProbBackoffLm<java.lang.String> readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress)
static <W> ArrayEncodedProbBackoffLm<W> readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress, WordIndexer<W> wordIndexer)
static <W> ArrayEncodedProbBackoffLm<W> readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress, WordIndexer<W> wordIndexer, ConfigOptions opts, int lmOrder)
static <W> ContextEncodedProbBackoffLm<W> readContextEncodedKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts)
    Builds a context-encoded LM from raw text.
static <W> ContextEncodedProbBackoffLm<W> readContextEncodedKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts, java.io.File tmpFile)
static <W> ContextEncodedProbBackoffLm<W> readContextEncodedLmFromArpa(LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>> lmFile, WordIndexer<W> wordIndexer, ConfigOptions opts)
static ContextEncodedProbBackoffLm<java.lang.String> readContextEncodedLmFromArpa(java.lang.String lmFile)
static <W> ContextEncodedProbBackoffLm<W> readContextEncodedLmFromArpa(java.lang.String lmFile, WordIndexer<W> wordIndexer)
static <W> ContextEncodedProbBackoffLm<W> readContextEncodedLmFromArpa(java.lang.String lmFile, WordIndexer<W> wordIndexer, ConfigOptions opts, int lmOrder)
    Reads a context-encoded language model from an ARPA lm file.
static <W> StupidBackoffLm<W> readGoogleLmBinary(java.lang.String file, WordIndexer<W> wordIndexer, java.lang.String sortedVocabFile)
    Reads in a pre-built Google n-gram binary.
static StupidBackoffLm<java.lang.String> readGoogleLmBinary(java.lang.String file, java.lang.String sortedVocabFile)
static <W> ArrayEncodedProbBackoffLm<W> readKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, boolean compress, ConfigOptions opts, java.io.File tmpFile)
static <W> ArrayEncodedProbBackoffLm<W> readKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts, boolean compress)
    Builds an array-encoded LM from raw text.
static <W> NgramLanguageModel<W> readLmBinary(java.lang.String file)
    Reads a binary file representing an LM.
static ArrayEncodedNgramLanguageModel<java.lang.String> readLmFromGoogleNgramDir(java.lang.String dir, boolean compress, boolean kneserNey)
static <W> ArrayEncodedNgramLanguageModel<W> readLmFromGoogleNgramDir(java.lang.String dir, boolean compress, boolean kneserNey, WordIndexer<W> wordIndexer, ConfigOptions opts)
    Reads a stupid backoff lm from a directory with n-gram counts in the format used by Google n-grams.
static NgramMapWrapper<java.lang.String,LongRef> readNgramMapFromBinary(java.lang.String binary, java.lang.String vocabFile)
static <W> NgramMapWrapper<W,LongRef> readNgramMapFromBinary(java.lang.String binary, java.lang.String sortedVocabFile, WordIndexer<W> wordIndexer)
static NgramMapWrapper<java.lang.String,LongRef> readNgramMapFromGoogleNgramDir(java.lang.String dir, boolean compress)
static <W> NgramMapWrapper<W,LongRef> readNgramMapFromGoogleNgramDir(java.lang.String dir, boolean compress, WordIndexer<W> wordIndexer)
static <W> void writeLmBinary(NgramLanguageModel<W> lm, java.lang.String file)
    Writes a binary file representing the LM using the built-in serialization.
-
-
-
Method Detail
-
readContextEncodedLmFromArpa
public static ContextEncodedProbBackoffLm<java.lang.String> readContextEncodedLmFromArpa(java.lang.String lmFile)
-
readContextEncodedLmFromArpa
public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedLmFromArpa(java.lang.String lmFile, WordIndexer<W> wordIndexer)
-
readContextEncodedLmFromArpa
public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedLmFromArpa(java.lang.String lmFile, WordIndexer<W> wordIndexer, ConfigOptions opts, int lmOrder)
Reads a context-encoded language model from an ARPA lm file. Context-encoded language models allow faster queries, but require an extra 4 bytes of storage per n-gram for the suffix offsets (as compared to array-encoded language models). A hedged sketch of a call follows below.
- Type Parameters: W
- Parameters: lmFile, wordIndexer, opts, lmOrder
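A minimal sketch of a call to this overload; StringWordIndexer, the default ConfigOptions constructor, and the import package paths are assumptions about the accompanying library, and the path is a placeholder.

    import edu.berkeley.nlp.lm.ConfigOptions;               // package paths assumed
    import edu.berkeley.nlp.lm.ContextEncodedProbBackoffLm;
    import edu.berkeley.nlp.lm.StringWordIndexer;
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class ContextEncodedArpaExample {
        public static void main(String[] args) {
            // lmOrder = 3 presumably caps the order of n-grams read from the ARPA file (assumption).
            final ContextEncodedProbBackoffLm<String> lm = LmReaders.readContextEncodedLmFromArpa(
                    "corpus.5gram.arpa", new StringWordIndexer(), new ConfigOptions(), 3);
            System.out.println("order = " + lm.getLmOrder()); // getLmOrder() assumed from NgramLanguageModel
        }
    }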
-
readContextEncodedLmFromArpa
public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedLmFromArpa(LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>> lmFile, WordIndexer<W> wordIndexer, ConfigOptions opts)
-
readArrayEncodedLmFromArpa
public static ArrayEncodedProbBackoffLm<java.lang.String> readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress)
-
readArrayEncodedLmFromArpa
public static <W> ArrayEncodedProbBackoffLm<W> readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress, WordIndexer<W> wordIndexer)
-
readArrayEncodedLmFromArpa
public static <W> ArrayEncodedProbBackoffLm<W> readArrayEncodedLmFromArpa(java.lang.String lmFile, boolean compress, WordIndexer<W> wordIndexer, ConfigOptions opts, int lmOrder)
-
readArrayEncodedLmFromArpa
public static <W> ArrayEncodedProbBackoffLm<W> readArrayEncodedLmFromArpa(LmReader<ProbBackoffPair,ArpaLmReaderCallback<ProbBackoffPair>> lmFile, boolean compress, WordIndexer<W> wordIndexer, ConfigOptions opts)
Reads an array-encoded language model from an ARPA lm file. A hedged sketch using the string-path overload follows below.
- Type Parameters: W
- Parameters:
  - lmFile
  - compress - Compress the LM using block compression. The compressed LM should be smaller but slower.
  - wordIndexer
  - opts
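A minimal sketch using the string-path overload with block compression enabled; the path is a placeholder, and getLmOrder() and the import package paths are assumptions about the accompanying library.

    import edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm; // package paths assumed
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class CompressedArpaExample {
        public static void main(String[] args) {
            // compress = true enables block compression (the COMPRESSED representation):
            // smaller in memory, but slower to query.
            final ArrayEncodedProbBackoffLm<String> lm =
                    LmReaders.readArrayEncodedLmFromArpa("corpus.5gram.arpa", true);
            System.out.println("order = " + lm.getLmOrder());
        }
    }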
-
readNgramMapFromGoogleNgramDir
public static NgramMapWrapper<java.lang.String,LongRef> readNgramMapFromGoogleNgramDir(java.lang.String dir, boolean compress)
-
readNgramMapFromGoogleNgramDir
public static <W> NgramMapWrapper<W,LongRef> readNgramMapFromGoogleNgramDir(java.lang.String dir, boolean compress, WordIndexer<W> wordIndexer)
-
readNgramMapFromBinary
public static NgramMapWrapper<java.lang.String,LongRef> readNgramMapFromBinary(java.lang.String binary, java.lang.String vocabFile)
-
readNgramMapFromBinary
public static <W> NgramMapWrapper<W,LongRef> readNgramMapFromBinary(java.lang.String binary, java.lang.String sortedVocabFile, WordIndexer<W> wordIndexer)
- Parameters:
  - sortedVocabFile - should be the vocab_cs.gz file from the Google n-gram corpus.
-
readLmFromGoogleNgramDir
public static ArrayEncodedNgramLanguageModel<java.lang.String> readLmFromGoogleNgramDir(java.lang.String dir, boolean compress, boolean kneserNey)
-
readLmFromGoogleNgramDir
public static <W> ArrayEncodedNgramLanguageModel<W> readLmFromGoogleNgramDir(java.lang.String dir, boolean compress, boolean kneserNey, WordIndexer<W> wordIndexer, ConfigOptions opts)
Reads a Stupid Backoff LM from a directory with n-gram counts in the format used by Google n-grams.
- Type Parameters: W
- Parameters: dir, compress, kneserNey, wordIndexer, opts
-
readContextEncodedKneserNeyLmFromTextFile
public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts)
Builds a context-encoded LM from raw text. This call first builds and writes a (temporary) ARPA file by calling createKneserNeyLmFromTextFiles(List, WordIndexer, int, File, ConfigOptions), and then reads the resulting file. Since the temp file can be quite large, it is important that the temp directory used by Java (java.io.tmpdir) has sufficient space.
- Type Parameters: W
- Parameters: files, wordIndexer, lmOrder, opts
-
readKneserNeyLmFromTextFile
public static <W> ArrayEncodedProbBackoffLm<W> readKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts, boolean compress)
Builds an array-encoded LM from raw text. This call first builds and writes a (temporary) ARPA file by calling createKneserNeyLmFromTextFiles(List, WordIndexer, int, File, ConfigOptions), and then reads the resulting file. Since the temp file can be quite large, it is important that the temp directory used by Java (java.io.tmpdir) has sufficient space. A hedged sketch of a call follows below.
- Type Parameters: W
- Parameters: files, wordIndexer, lmOrder, opts, compress
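A minimal sketch of a call, assuming the library's StringWordIndexer, a default ConfigOptions constructor, and the import package paths (none of which are documented on this page), with placeholder paths; remember that the intermediate ARPA file lands in java.io.tmpdir.

    import java.util.Arrays;
    import java.util.List;

    import edu.berkeley.nlp.lm.ArrayEncodedProbBackoffLm; // package paths assumed
    import edu.berkeley.nlp.lm.ConfigOptions;
    import edu.berkeley.nlp.lm.StringWordIndexer;
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class KneserNeyFromTextExample {
        public static void main(String[] args) {
            // Newline-separated raw text (placeholder path).
            final List<String> files = Arrays.asList("corpus.txt");

            // Estimate a 5-gram KN model in memory; compress = false keeps the HASH representation.
            final ArrayEncodedProbBackoffLm<String> lm = LmReaders.readKneserNeyLmFromTextFile(
                    files, new StringWordIndexer(), 5, new ConfigOptions(), false);
            System.out.println("order = " + lm.getLmOrder()); // getLmOrder() assumed
        }
    }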
-
readContextEncodedKneserNeyLmFromTextFile
public static <W> ContextEncodedProbBackoffLm<W> readContextEncodedKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, ConfigOptions opts, java.io.File tmpFile)
-
readKneserNeyLmFromTextFile
public static <W> ArrayEncodedProbBackoffLm<W> readKneserNeyLmFromTextFile(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, boolean compress, ConfigOptions opts, java.io.File tmpFile)
-
createKneserNeyLmFromTextFiles
public static <W> void createKneserNeyLmFromTextFiles(java.util.List<java.lang.String> files, WordIndexer<W> wordIndexer, int lmOrder, java.io.File arpaOutputFile, ConfigOptions opts)
Estimates a Kneser-Ney language model from raw text, and writes a file (in ARPA format). Probabilities are in log base 10 to match SRILM.
- Type Parameters: W
- Parameters:
  - files - Files of raw text (newline-separated).
  - wordIndexer
  - lmOrder
  - arpaOutputFile
  - opts
-
readGoogleLmBinary
public static StupidBackoffLm<java.lang.String> readGoogleLmBinary(java.lang.String file, java.lang.String sortedVocabFile)
-
readGoogleLmBinary
public static <W> StupidBackoffLm<W> readGoogleLmBinary(java.lang.String file, WordIndexer<W> wordIndexer, java.lang.String sortedVocabFile)
Reads in a pre-built Google n-gram binary. The user must supply the vocab_cs.gz file (so that the corpus cannot be reproduced unless the user has the rights to do so). A sketch of a call follows below.
- Type Parameters: W
- Parameters:
  - file - The binary.
  - wordIndexer
  - sortedVocabFile - the vocab_cs.gz vocabulary file.
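A minimal sketch with placeholder paths; the binary itself is assumed to have been produced beforehand by this library's Google binarization tools (e.g. the main() in MakeLmBinaryFromGoogle), and getLmOrder() and the import package paths are assumptions about the accompanying library.

    import edu.berkeley.nlp.lm.StupidBackoffLm; // package paths assumed
    import edu.berkeley.nlp.lm.io.LmReaders;

    public class GoogleBinaryExample {
        public static void main(String[] args) {
            // The vocab_cs.gz vocabulary file from the Google n-gram corpus must be supplied separately.
            final StupidBackoffLm<String> lm =
                    LmReaders.readGoogleLmBinary("google.lm.binary", "vocab_cs.gz");
            System.out.println("order = " + lm.getLmOrder());
        }
    }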
-
readLmBinary
public static <W> NgramLanguageModel<W> readLmBinary(java.lang.String file)
Reads a binary file representing an LM. The returned model will need to be cast down to either ContextEncodedNgramLanguageModel or ArrayEncodedNgramLanguageModel to be useful.
-
writeLmBinary
public static <W> void writeLmBinary(NgramLanguageModel<W> lm, java.lang.String file)
Writes a binary file representing the LM using the built-in serialization. These binaries should load much faster than ARPA files.
- Type Parameters: W
- Parameters: lm, file
-
-