Class SrxTextIterator

  • All Implemented Interfaces:
    Iterator<String>, TextIterator

    public class SrxTextIterator
    extends AbstractTextIterator
    Text iterator that splits text according to the rules in an SRX file. The algorithm works as follows:
     1. A list of rule matchers is created from the SRX file and the language. 
        Each rule matcher is responsible for matching the before-break and 
        after-break regular expressions of one break rule.
     2. Each rule matcher is matched against the text. If its rule is not found, 
        the rule matcher is removed from the list. 
     3. The rule matcher with the earliest break position in the text is selected.
     4. The list of exception rules corresponding to that break rule is retrieved. 
     5. If none of the exception rules matches at the break position, the text 
        is split there and a new segment is created. In addition, all rule 
        matchers are moved so that they start after the end of the new segment 
        (which is the same as the break position of the matched rule). 
     6. All rule matchers whose break position lies before the break position 
        of the last matched rule are advanced until they pass it.
     7. If no segment was found, the whole process is repeated.
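
    The steps above can be sketched with plain java.util.regex. This is a simplified illustration, not the class's actual implementation: the Rule type and split method are hypothetical, every exception rule is applied to every break rule, and each rule is compiled into a single zero-width pattern (lookbehind for the before-break part, lookahead for the after-break part).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitSketch {
    /** Hypothetical stand-in for one SRX rule (break rule or exception rule). */
    static final class Rule {
        final boolean isBreak;
        final Pattern pattern;

        Rule(boolean isBreak, String before, String after) {
            this.isBreak = isBreak;
            // Zero-width match located exactly at the candidate break position.
            this.pattern = Pattern.compile("(?<=" + before + ")(?=" + after + ")");
        }
    }

    static List<String> split(String text, List<Rule> rules) {
        List<String> segments = new ArrayList<>();
        int segmentStart = 0;
        int searchFrom = 0;
        while (searchFrom <= text.length()) {
            // Steps 2-3: find the earliest break position among all break rules.
            int best = -1;
            for (Rule rule : rules) {
                if (!rule.isBreak) continue;
                Matcher m = rule.pattern.matcher(text);
                if (m.find(searchFrom) && (best < 0 || m.start() < best)) {
                    best = m.start();
                }
            }
            if (best < 0) break; // no break rule matches any more

            // Steps 4-5: an exception rule matching at the same position vetoes the break.
            boolean vetoed = false;
            for (Rule rule : rules) {
                if (rule.isBreak) continue;
                Matcher m = rule.pattern.matcher(text);
                if (m.find(best) && m.start() == best) {
                    vetoed = true;
                    break;
                }
            }
            if (!vetoed && best > segmentStart) {
                segments.add(text.substring(segmentStart, best));
                segmentStart = best;
            }
            // Steps 6-7: continue searching past the position just examined.
            searchFrom = best + 1;
        }
        if (segmentStart < text.length()) {
            segments.add(text.substring(segmentStart)); // trailing text is the last segment
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule(false, "Mr\\.", "\\s"), // exception: no break after "Mr."
                new Rule(true, "\\.", "\\s"));   // break after a period followed by whitespace
        for (String segment : split("Mr. Smith left. He was tired.", rules)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

    In this sketch the whitespace matched by the after-break expression stays at the start of the following segment, because the break position lies between the two expressions.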
     
    In the streaming version of this algorithm, a character buffer is searched. When the end of the buffer is reached, or the break position is in the margin (break position > buffer size - margin), and there is more text, the buffer is moved forward in the text until it starts after the last found segment. When this happens, the rule matchers are reinitialized and the text is searched again. The streaming version has the limitation that the read buffer must be at least as long as the longest segment in the text.
    
    This algorithm uses lookbehind extensively, but Java does not permit regular expressions of unbounded length in lookbehind, so some patterns are finitized. For example, the pattern a* is changed to something like a{0,100}.
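
    A rough idea of what finitizing a pattern could look like is sketched below. The finitize helper is hypothetical and deliberately naive (it ignores character classes, quoted sections, and lazy or possessive quantifiers); the way this class actually rewrites patterns may differ.

```java
import java.util.regex.Pattern;

class FinitizeSketch {
    /**
     * Naive finitization: replaces each unescaped '*' with {0,max} and each
     * unescaped '+' with {1,max}, so the pattern has a bounded length.
     */
    static String finitize(String pattern, int max) {
        StringBuilder out = new StringBuilder();
        boolean escaped = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (escaped) {
                out.append(c);          // character after a backslash is copied as-is
                escaped = false;
            } else if (c == '\\') {
                out.append(c);
                escaped = true;
            } else if (c == '*') {
                out.append("{0,").append(max).append('}');
            } else if (c == '+') {
                out.append("{1,").append(max).append('}');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bounded = finitize("a*", 100);
        System.out.println(bounded);
        // The bounded construct can be used inside a lookbehind.
        Pattern p = Pattern.compile("(?<=" + bounded + ")b");
        System.out.println(p.matcher("aab").find());
    }
}
```

    The upper bound plays the role that MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER plays for this class; the value 100 here is only illustrative.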
    Author:
    loomchild
    • Field Detail

      • MARGIN_PARAMETER

        public static final String MARGIN_PARAMETER
        Margin size. Used by the streaming splitter. If a rule is matched but its position is in the margin (position > bufferLength - margin), the match is ignored, more text is read, and the rule is matched again.
        See Also:
        Constant Field Values
      • BUFFER_LENGTH_PARAMETER

        public static final String BUFFER_LENGTH_PARAMETER
        Reader buffer size. Segments cannot be longer than this value.
        See Also:
        Constant Field Values
      • MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER

        public static final String MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER
        Maximum length of a regular expression construct that occurs in lookbehind.
        See Also:
        Constant Field Values
      • DEFAULT_MARGIN

        public static final int DEFAULT_MARGIN
        Default margin size.
        See Also:
        Constant Field Values
      • DEFAULT_BUFFER_LENGTH

        public static final int DEFAULT_BUFFER_LENGTH
        Default size of the read buffer when using the streaming version of this class. No segment can be longer than the buffer size.
        See Also:
        Constant Field Values
      • DEFAULT_MAX_LOOKBEHIND_CONSTRUCT_LENGTH

        public static final int DEFAULT_MAX_LOOKBEHIND_CONSTRUCT_LENGTH
        Default maximum length of a regular expression construct in lookbehind.
        See Also:
        Constant Field Values
    • Constructor Detail

      • SrxTextIterator

        public SrxTextIterator(SrxDocument document,
                               String languageCode,
                               CharSequence text,
                               Map<String,Object> parameterMap)
        Creates a text iterator that obtains language rules from the given document using the given language code. This constructor is not streaming because it receives the whole text at once. Supported parameters: MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER.
        Parameters:
        document - SRX document
        languageCode - language code of the text, used to retrieve the rules
        text - text to be segmented
        parameterMap - additional segmentation parameters
      • SrxTextIterator

        public SrxTextIterator(SrxDocument document,
                               String languageCode,
                               Reader reader,
                               Map<String,Object> parameterMap)
        Creates a text iterator that obtains language rules from the given document using the given language code. This is the streaming constructor: it reads text from the reader using a buffer with the given size and margin. A single segment cannot be longer than the buffer size. If a rule is matched but its position is in the margin (position > bufferLength - margin), the match is ignored, more text is read, and the rule is matched again. This is needed because otherwise a rule whose match is cut off by the end of the buffer would never be matched. Supported parameters: BUFFER_LENGTH_PARAMETER, MARGIN_PARAMETER, MAX_LOOKBEHIND_CONSTRUCT_LENGTH_PARAMETER.
        Parameters:
        document - SRX document
        languageCode - language code of the text, used to retrieve the rules
        reader - reader from which the text is read
        parameterMap - additional segmentation parameters
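
        The margin condition described above can be written out as a small predicate. This is only a sketch of the documented inequality; the class and method names are hypothetical, and the buffer length and margin values are illustrative rather than the library defaults.

```java
class MarginSketch {
    /** True if the break position falls inside the margin at the end of the buffer. */
    static boolean inMargin(int breakPosition, int bufferLength, int margin) {
        return breakPosition > bufferLength - margin;
    }

    /**
     * A match in the margin is only ignored when more text can still be read;
     * at the end of the input there is nothing left to refill the buffer with.
     */
    static boolean mustReadMore(int breakPosition, int bufferLength, int margin, boolean moreText) {
        return moreText && inMargin(breakPosition, bufferLength, margin);
    }

    public static void main(String[] args) {
        int bufferLength = 1000; // illustrative value
        int margin = 50;         // illustrative value
        System.out.println(mustReadMore(960, bufferLength, margin, true));  // in margin, more text
        System.out.println(mustReadMore(950, bufferLength, margin, true));  // exactly on the boundary
        System.out.println(mustReadMore(960, bufferLength, margin, false)); // in margin, end of input
    }
}
```

        A larger margin makes it less likely that a rule is cut off at the buffer end, at the cost of re-reading more text.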
      • SrxTextIterator

        public SrxTextIterator(SrxDocument document,
                               String languageCode,
                               Reader reader)
        Creates a streaming text iterator with no additional parameters.
        Parameters:
        document - SRX document
        languageCode - language code of the text, used to retrieve the rules
        reader - reader from which the text is read
        See Also:
        SrxTextIterator(SrxDocument, String, Reader, Map)
    • Method Detail

      • next

        public String next()
        Finds the next segment in the text and returns it.
        Returns:
        the next segment, or null if there are no more segments
        Throws:
        IllegalStateException - if the buffer is too small to hold the segment
        IORuntimeException - if an I/O error occurs while reading the text
      • hasNext

        public boolean hasNext()
        Returns:
        true if there are more segments
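
        Because this class implements Iterator<String>, segments can be consumed with an ordinary hasNext/next loop. The helper below is a hypothetical sketch that works with any Iterator<String>; the null check is a defensive assumption based on the documented return value of next().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class DrainSketch {
    /** Collects all remaining segments from the iterator into a list. */
    static List<String> drain(Iterator<String> segments) {
        List<String> result = new ArrayList<>();
        while (segments.hasNext()) {
            String segment = segments.next();
            if (segment == null) {
                break; // defensive: next() is documented to return null when no segment exists
            }
            result.add(segment);
        }
        return result;
    }

    public static void main(String[] args) {
        // A plain list iterator stands in for an SrxTextIterator here.
        Iterator<String> standIn = Arrays.asList("First sentence.", " Second sentence.").iterator();
        System.out.println(drain(standIn));
    }
}
```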