Class WordTokenizer

  • All Implemented Interfaces:
    Tokenizer

    public class WordTokenizer
    extends Object
    implements Tokenizer
    Tokenizes a sentence into words. Punctuation and whitespace get their own tokens. The tokenizer is a fairly simple character-based one, but it recognizes URLs and keeps each one in a single token if it is fully specified, including a protocol (for example, http://foobar.org).
    Author:
    Daniel Naber
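The character-based splitting described above can be illustrated with a minimal sketch. This is not the class's actual implementation: the class name SketchTokenizer, the method tokenize, and the DELIMITERS set are assumptions made up for this example, and the sketch deliberately omits the URL handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration only; not the real WordTokenizer code.
class SketchTokenizer {

  // Assumed delimiter set for this sketch. The real tokenizer's set is
  // whatever getTokenizingCharacters() returns.
  private static final String DELIMITERS = " \t\n.,;:!?()\"/";

  // Splits text at every delimiter character and keeps each delimiter
  // as its own one-character token.
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    // returnDelims = true makes StringTokenizer emit the delimiters too
    StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      tokens.add(st.nextToken());
    }
    return tokens;
  }
}
```

With this sketch, tokenize("Hello, world!") yields the five tokens "Hello", ",", " ", "world", "!". A purely character-based split like this one would also break http://foobar.org into seven pieces, which shows why a tokenizer of this kind needs separate handling to keep a URL in one token.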
    • Constructor Detail

      • WordTokenizer

        public WordTokenizer()
    • Method Detail

      • getProtocols

        public static List<String> getProtocols()
        Get the protocols that the tokenizer knows about.
        Returns:
        currently http, https, and ftp
        Since:
        2.1
      • isUrl

        public static boolean isUrl(String token)
        Since:
        3.0
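This page gives no description for isUrl. One plausible reading, based on the class description's requirement that a URL be "fully specified including a protocol", is a prefix check against the protocols listed for getProtocols. The sketch below shows that idea only; the class UrlCheckSketch and the method looksLikeUrl are hypothetical names, and the real method may apply different or stricter rules.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the real isUrl implementation.
class UrlCheckSketch {

  // The protocols named in the getProtocols() documentation
  static final List<String> PROTOCOLS = Arrays.asList("http", "https", "ftp");

  // Assumed rule: a token counts as a URL if it starts with a known
  // protocol followed by "://" and has at least one character after that.
  static boolean looksLikeUrl(String token) {
    for (String protocol : PROTOCOLS) {
      String prefix = protocol + "://";
      if (token.startsWith(prefix) && token.length() > prefix.length()) {
        return true;
      }
    }
    return false;
  }
}
```

Under this assumed rule, looksLikeUrl("http://foobar.org") is true, while looksLikeUrl("foobar.org") is false because the protocol is missing.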
      • isEMail

        public static boolean isEMail(String token)
        Since:
        3.5
      • getTokenizingCharacters

        public String getTokenizingCharacters()
        Returns:
        The string containing the characters used by the tokenizer to tokenize words.
        Since:
        2.5