Class LanguageIdentifier


  • public class LanguageIdentifier
    extends Object
    Identifies the language of a text. Some languages might never be detected because they are too close to another language. Language variants such as en-US or en-GB are not distinguished; the result for those is en. By default, only the first 1000 characters of a text are considered. Email signatures that use \n-- \n as a delimiter are ignored.
    Since:
    2.9
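    The truncation and signature handling described above can be pictured with a minimal, standalone sketch. It is not the library's implementation: the class SignatureTrimSketch, the method prepare, and the order of the two steps (signature removed first, then truncation) are assumptions made for illustration only.

```java
// Hypothetical illustration of the preprocessing described in the class comment.
// None of these names exist in LanguageTool; they are assumptions for this sketch.
final class SignatureTrimSketch {

    // Delimiter named in the class description.
    static final String SIGNATURE_DELIMITER = "\n-- \n";

    // Default limit named in the class description.
    static final int DEFAULT_MAX_LENGTH = 1000;

    // Drops everything from the first signature delimiter onward (assumed to
    // happen first), then keeps at most maxLength characters of the rest.
    static String prepare(String text, int maxLength) {
        int signatureStart = text.indexOf(SIGNATURE_DELIMITER);
        String body = signatureStart >= 0 ? text.substring(0, signatureStart) : text;
        return body.length() > maxLength ? body.substring(0, maxLength) : body;
    }

    public static void main(String[] args) {
        String mail = "Hello, how are you?\n-- \nJohn Doe\nExample Corp";
        // Only the text before the signature delimiter would be considered.
        System.out.println(prepare(mail, DEFAULT_MAX_LENGTH));
    }
}
```

    Under these assumptions, only "Hello, how are you?" would be passed on to the actual detection step.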
    • Constructor Detail

      • LanguageIdentifier

        public LanguageIdentifier()
      • LanguageIdentifier

        public LanguageIdentifier(int maxLength)
        Parameters:
        maxLength - the maximum number of characters that will be considered; a lower value can improve performance. Avoid values below 100, as they decrease accuracy.
        Throws:
        IllegalArgumentException - if maxLength is less than 10
        Since:
        4.2
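        The two thresholds above (an exception below 10, a recommendation of at least 100) can be illustrated with a small standalone sketch. The class MaxLengthCheckSketch and its methods are hypothetical and only mirror the documented contract; they are not part of LanguageTool.

```java
// Hypothetical illustration of the documented maxLength contract.
// The names below are assumptions for this sketch, not LanguageTool API.
final class MaxLengthCheckSketch {

    // Values below this are documented to throw IllegalArgumentException.
    static final int HARD_MINIMUM = 10;

    // Values below this are accepted but documented to decrease accuracy.
    static final int RECOMMENDED_MINIMUM = 100;

    // Returns maxLength unchanged if it satisfies the documented hard minimum.
    static int checkMaxLength(int maxLength) {
        if (maxLength < HARD_MINIMUM) {
            throw new IllegalArgumentException(
                "maxLength must be >= " + HARD_MINIMUM + ": " + maxLength);
        }
        return maxLength;
    }

    // True if the value also meets the recommended minimum for accuracy.
    static boolean isRecommended(int maxLength) {
        return maxLength >= RECOMMENDED_MINIMUM;
    }

    public static void main(String[] args) {
        System.out.println(checkMaxLength(500));
        System.out.println(isRecommended(50));
    }
}
```

        In this sketch a value such as 50 passes the check but is flagged as below the recommended minimum, which matches the documented distinction between the hard limit and the accuracy advice.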
    • Method Detail

      • enableFasttext

        public void enableFasttext(File fasttextBinary,
                                   File fasttextModel)
      • detectLanguage

        @Nullable
        public Language detectLanguage(String text)
        Returns:
        language or null if language could not be identified
      • detectLanguage

        @Nullable
        public DetectedLanguage detectLanguage(String text,
                                               List<String> noopLangsTmp,
                                               List<String> preferredLangsTmp)
        Parameters:
        noopLangsTmp - list of language codes that are detected but lead to the NoopLanguage, which has no rules
        Returns:
        language or null if language could not be identified
        Since:
        4.4 (new parameter noopLangs, changed return type to DetectedLanguage)