Package org.languagetool.tokenizers
Class WordTokenizer
- java.lang.Object
- org.languagetool.tokenizers.WordTokenizer
- All Implemented Interfaces:
Tokenizer
public class WordTokenizer extends Object implements Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens. The tokenizer is a fairly simple character-based one, though it recognizes URLs and keeps each one in a single token if it is fully specified, including a protocol (like http://foobar.org).
- Author:
- Daniel Naber
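The character-based splitting described above can be illustrated with a minimal sketch. This is not the LanguageTool implementation: the class name `SimpleWordTokenizerSketch` and the delimiter set are assumptions made for illustration, and the sketch does not join URLs or e-mail addresses back into single tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only; hypothetical class, not part of LanguageTool.
final class SimpleWordTokenizerSketch {

  // Assumed delimiter set for this sketch. In WordTokenizer the real set
  // is whatever getTokenizingCharacters() returns.
  private static final String DELIMITERS = " \t\n\r.,;:!?()[]\"";

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    // returnDelims = true: every delimiter character is returned as its
    // own one-character token instead of being dropped.
    StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      tokens.add(st.nextToken());
    }
    return tokens;
  }
}
```

Under these assumptions, `SimpleWordTokenizerSketch.tokenize("Hi, you!")` yields the five tokens `Hi`, `,`, a single space, `you`, and `!`.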
Constructor Summary
WordTokenizer()
Method Summary
static List<String> getProtocols(): Get the protocols that the tokenizer knows about.
String getTokenizingCharacters()
static boolean isEMail(String token)
static boolean isUrl(String token)
protected List<String> joinEMails(List<String> list)
protected List<String> joinEMailsAndUrls(List<String> list)
protected List<String> joinUrls(List<String> l)
List<String> tokenize(String text)
Method Detail
getProtocols
public static List<String> getProtocols()
Get the protocols that the tokenizer knows about.
- Returns:
- currently http, https, and ftp
- Since:
- 2.1
-
isUrl
public static boolean isUrl(String token)
- Since:
- 3.0
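This page gives no description for isUrl. As a rough illustration of the rule stated in the class description (a URL counts only when it is fully specified, including a protocol), here is a minimal sketch. The helper name `looksLikeUrl` and the simple prefix test are assumptions made for illustration; the actual isUrl may apply different or stricter rules.

```java
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; hypothetical class, not part of LanguageTool.
final class UrlCheckSketch {

  // The protocols that the getProtocols() documentation lists.
  private static final List<String> PROTOCOLS = List.of("http", "https", "ftp");

  // Hypothetical helper: true if the token starts with a known protocol
  // followed by "://" and at least one further character.
  static boolean looksLikeUrl(String token) {
    String lower = token.toLowerCase(Locale.ROOT);
    for (String protocol : PROTOCOLS) {
      String prefix = protocol + "://";
      if (lower.startsWith(prefix) && lower.length() > prefix.length()) {
        return true;
      }
    }
    return false;
  }
}
```

With this sketch, `looksLikeUrl("http://foobar.org")` is true, while `looksLikeUrl("foobar.org")` is false because no protocol is given.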
-
isEMail
public static boolean isEMail(String token)
- Since:
- 3.5
-
getTokenizingCharacters
public String getTokenizingCharacters()
- Returns:
- The string containing the characters used by the tokenizer to tokenize words.
- Since:
- 2.5