Interface ISequenceEncoder

  • All Known Implementing Classes:
    NoEncoder, TrimInfixAndSuffixEncoder, TrimPrefixAndSuffixEncoder, TrimSuffixEncoder

    public interface ISequenceEncoder
    Defines how one sequence of bytes is encoded relative to another sequence of bytes. Typically, the "base" form is the stem of a word and the "derived" form is an inflected form of that word.

    Encoding the derived form makes the data stored in the automaton smaller and more repetitive, which results in higher compression rates.

    See the implementing classes listed above for example implementations.

    • Method Detail

      • encode

        ByteBuffer encode(ByteBuffer reuse,
                          ByteBuffer source,
                          ByteBuffer target)
        Encodes target relative to source, optionally reusing the provided ByteBuffer.
        Parameters:
        reuse - A ByteBuffer to reuse for the result. A new buffer is allocated if the provided one does not have enough space.
        source - The source byte sequence.
        target - The target byte sequence to encode relative to source.
        Returns:
        The ByteBuffer containing the encoded target.
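
        To make the idea of relative encoding concrete, the following is a minimal sketch of a suffix-trimming scheme. It is an illustration only, not the code of any of the implementing classes listed above. The names SequenceEncoderSketch, SuffixTrimSketch, decode, encodeUtf8 and decodeUtf8 are hypothetical, and the output layout (one byte holding 'A' plus the number of bytes to trim from the end of source, followed by the remaining bytes of target) is an assumption chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical local mirror of the encode signature documented above.
interface SequenceEncoderSketch {
  ByteBuffer encode(ByteBuffer reuse, ByteBuffer source, ByteBuffer target);
}

// Illustrative encoder (assumed layout): one control byte ('A' + number of bytes
// to trim from the end of source), then the bytes of target after the shared prefix.
final class SuffixTrimSketch implements SequenceEncoderSketch {
  @Override
  public ByteBuffer encode(ByteBuffer reuse, ByteBuffer source, ByteBuffer target) {
    int srcLen = source.remaining();
    int dstLen = target.remaining();

    // Length of the prefix shared by source and target.
    int shared = 0;
    int max = Math.min(srcLen, dstLen);
    while (shared < max
        && source.get(source.position() + shared) == target.get(target.position() + shared)) {
      shared++;
    }

    int trim = srcLen - shared;
    if ('A' + trim > 255) {
      // This sketch does not handle trims that do not fit in the control byte.
      throw new IllegalArgumentException("Too many bytes to trim: " + trim);
    }

    // Reuse the provided buffer if it is large enough, otherwise allocate a new one.
    int needed = 1 + (dstLen - shared);
    ByteBuffer out =
        (reuse != null && reuse.capacity() >= needed) ? reuse : ByteBuffer.allocate(needed);
    out.clear();
    out.put((byte) ('A' + trim));
    for (int i = shared; i < dstLen; i++) {
      out.put(target.get(target.position() + i));
    }
    out.flip();
    return out;
  }

  // Reverses the encoding above: drop `trim` bytes from source, then append the suffix.
  static byte[] decode(byte[] source, ByteBuffer encoded) {
    int trim = (encoded.get(encoded.position()) & 0xFF) - 'A';
    int keep = source.length - trim;
    int suffixLen = encoded.remaining() - 1;
    byte[] result = new byte[keep + suffixLen];
    System.arraycopy(source, 0, result, 0, keep);
    for (int i = 0; i < suffixLen; i++) {
      result[keep + i] = encoded.get(encoded.position() + 1 + i);
    }
    return result;
  }

  // Convenience wrappers for short ASCII examples only.
  static String encodeUtf8(String source, String target) {
    ByteBuffer out = new SuffixTrimSketch().encode(
        null,
        ByteBuffer.wrap(source.getBytes(StandardCharsets.UTF_8)),
        ByteBuffer.wrap(target.getBytes(StandardCharsets.UTF_8)));
    byte[] bytes = new byte[out.remaining()];
    out.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  static String decodeUtf8(String source, String encoded) {
    byte[] decoded = decode(
        source.getBytes(StandardCharsets.UTF_8),
        ByteBuffer.wrap(encoded.getBytes(StandardCharsets.UTF_8)));
    return new String(decoded, StandardCharsets.UTF_8);
  }
}
```

        Under these assumptions, encoding "foobar" relative to "foo" yields "Abar" (trim nothing, append "bar"), and encoding "walk" relative to "walking" yields "D" (trim three bytes, append nothing). Many different word pairs collapse to the same short code, which is the repetitiveness that benefits the automaton.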
      • prefixBytes

        @Deprecated
        int prefixBytes()
        Deprecated.
        The number of prefix bytes of the encoded form that should be ignored (needed for separator lookup). This is an ugly workaround for GH-85. The proper fix requires knowing in advance whether the dictionary contains tags, which would make it possible to scan for the separator right-to-left.
        See Also:
        "https://github.com/morfologik/morfologik-stemming/issues/85"
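
        As an illustration of why a caller might need this value, the sketch below scans an encoded entry for a separator byte while skipping a given number of leading bytes. It assumes the skipped bytes are encoder control bytes that could coincide with the separator. The class SeparatorScanSketch, the method indexOfSeparator and the sample data are hypothetical and are not part of the library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing a left-to-right separator scan that ignores
// the first `prefixBytes` bytes of the encoded form.
final class SeparatorScanSketch {
  // Returns the index of the first separator at or after prefixBytes, or -1 if none.
  static int indexOfSeparator(byte[] encoded, int prefixBytes, byte separator) {
    for (int i = prefixBytes; i < encoded.length; i++) {
      if (encoded[i] == separator) {
        return i;
      }
    }
    return -1;
  }

  // Convenience wrapper for short ASCII examples only.
  static int indexOfSeparator(String encoded, int prefixBytes, char separator) {
    return indexOfSeparator(
        encoded.getBytes(StandardCharsets.US_ASCII), prefixBytes, (byte) separator);
  }
}
```

        In this made-up example the entry "_bar_TAG" starts with a control byte that equals the separator '_'. Scanning with prefixBytes set to 1 finds the real separator at index 4, whereas scanning from index 0 would stop at the control byte.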