Class NgramExtractor


  • public class NgramExtractor
    extends Object
    Class for extracting n-grams out of a text.
    Author:
    Fabian Kessler
    • Method Detail

      • gramLength

        public static NgramExtractor gramLength(int gramLength)
      • textPadding

        public NgramExtractor textPadding(char textPadding)
        To ensure that border grams are created, this character is added to the left and right of the text.

        Example: when textPadding is a space ' ' then a text input "foo" becomes " foo ", ensuring that n-grams like " f" are created.

        If the text already has this character at that position (e.g. it already starts with it), the character is not added there.

        Parameters:
        textPadding - for example a space ' '.
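The padding rule described above can be sketched as follows. This is a minimal standalone illustration, not the library's implementation: the class `PaddingSketch` and the method `pad` are hypothetical names, and returning an empty input unchanged is an assumption that the documentation does not confirm.

```java
// Hypothetical helper illustrating the documented padding rule.
class PaddingSketch {

    /**
     * Adds the padding character to the left and right of the text,
     * skipping a side that already carries that character.
     */
    static String pad(CharSequence text, char padding) {
        // Assumption: empty input is returned as-is (not stated in the docs).
        if (text.length() == 0) {
            return "";
        }
        StringBuilder sb = new StringBuilder(text.length() + 2);
        if (text.charAt(0) != padding) {
            sb.append(padding);
        }
        sb.append(text);
        if (text.charAt(text.length() - 1) != padding) {
            sb.append(padding);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make the leading and trailing spaces visible.
        System.out.println("[" + pad("foo", ' ') + "]");
        System.out.println("[" + pad(" foo", ' ') + "]");
    }
}
```

With a space as padding, both `"foo"` and `" foo"` come out as `" foo "`, so a 2-gram such as `" f"` exists in either case.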
      • getGramLengths

        public List<Integer> getGramLengths()
      • extractGrams

        public @NotNull List<String> extractGrams(@NotNull CharSequence text)
        Creates the n-grams for a given text in the order they occur.

        Example: with a gramLength of 2, the text "Foo bar" yields [Fo,oo,o , b,ba,ar]

        Parameters:
        text - the text to extract the n-grams from.
        Returns:
        The grams; an empty list if the input was empty or is too short to contain a gram of the configured length.
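The in-order extraction described above amounts to a sliding window over the text. The sketch below reproduces the documented example for a single gram length. It is a standalone illustration: `GramSketch` and its `extractGrams(text, gramLength)` signature are hypothetical and differ from the class's own one-argument method, and padding is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sliding-window sketch; not the library's code.
class GramSketch {

    /** Returns all grams of the given length, in the order they occur. */
    static List<String> extractGrams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be >= 1");
        }
        List<String> grams = new ArrayList<>();
        // The window [start, start + gramLength) must fit inside the text,
        // so a text shorter than gramLength yields an empty list.
        for (int start = 0; start + gramLength <= text.length(); start++) {
            grams.add(text.subSequence(start, start + gramLength).toString());
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(extractGrams("Foo bar", 2));
    }
}
```

A text of length n produces n - gramLength + 1 grams when n is at least gramLength, which is why "Foo bar" (7 characters) gives 6 bigrams.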
      • extractCountedGrams

        public @NotNull Map<String,Integer> extractCountedGrams(@NotNull CharSequence text)
        Returns:
        Key = n-gram, value = count. The entries are ordered by the first appearance of each n-gram in the text.
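One way to obtain counts in first-appearance order is an insertion-ordered map. The sketch below assumes a `LinkedHashMap` and a single gram length. Whether the class itself is built this way is not stated here, and `CountedGramSketch` and `countGrams` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical counting sketch; not the library's code.
class CountedGramSketch {

    /** Counts each gram; iteration order follows first appearance in the text. */
    static Map<String, Integer> countGrams(CharSequence text, int gramLength) {
        if (gramLength < 1) {
            throw new IllegalArgumentException("gramLength must be >= 1");
        }
        // LinkedHashMap keeps keys in insertion order, and re-inserting an
        // existing key does not change its position.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int start = 0; start + gramLength <= text.length(); start++) {
            String gram = text.subSequence(start, start + gramLength).toString();
            counts.merge(gram, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countGrams("abab", 2));
    }
}
```

For "abab" with a gram length of 2, the gram "ab" occurs twice and "ba" once, and "ab" is listed first because it appears first.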