Lang Analyzer

A set of analyzers aimed at analyzing text in a specific language. The following types are supported: arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, italian, norwegian, persian, portuguese, romanian, russian, spanish, swedish, turkish, thai.

All analyzers support setting custom stopwords, either inline in the analyzer configuration or from an external stopwords file referenced by stopwords_path.
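
A minimal sketch of the inline variant, using an english analyzer (my_index and my_english are placeholder names, and the stopword list is illustrative):

```sh
# Define an english-type analyzer whose stopword list replaces the
# built-in defaults.
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": ["a", "an", "the", "of"]
        }
      }
    }
  }
}'
```

The file-based variant via stopwords_path is sketched under the czech analyzer below; both settings work the same way for every type in the list above.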

Arabic Analyzer

The arabic analyzer is built on top of the arabic_letter tokenizer, and the lowercase, stop, arabic_normalizer and arabic_stem filters.
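
When the stock behaviour needs tweaking, the same chain can be declared as a custom analyzer. A sketch, assuming the component names above are available as-is in your version (my_index, rebuilt_arabic and arabic_stop are placeholder names; _arabic_ is the predefined Arabic stopword list):

```sh
# Assemble the arabic analyzer's components by hand so individual
# filters can be reconfigured or swapped out.
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "arabic_stop": {
          "type": "stop",
          "stopwords": "_arabic_"
        }
      },
      "analyzer": {
        "rebuilt_arabic": {
          "type": "custom",
          "tokenizer": "arabic_letter",
          "filter": ["lowercase", "arabic_stop", "arabic_normalizer", "arabic_stem"]
        }
      }
    }
  }
}'
```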

Brazilian Analyzer

The brazilian analyzer is built on top of the standard tokenizer, and the lowercase, standard, stop and brazilian_stem filters.

Chinese Analyzer

The chinese analyzer is built on top of the chinese tokenizer and the chinese filter.

Cjk Analyzer

The cjk analyzer is built on top of the cjk tokenizer and the stop filter.
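
Since the composition says little about the output, the _analyze API is a quick way to inspect it. A sketch, assuming a node on localhost:9200 (older releases accept the text as the request body):

```sh
# Tokenize a short string with the cjk analyzer. Adjacent CJK characters
# are combined into overlapping bigrams, e.g. 日本語 -> 日本, 本語.
curl -XGET 'localhost:9200/_analyze?analyzer=cjk' -d '日本語'
```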

Czech Analyzer

The czech analyzer is built on top of the standard tokenizer, and the standard, lowercase, stop and czech_stem filters. It comes with a default set of stopwords, which can be overridden.
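
For example, the defaults could be swapped for a file-based list. A sketch where my_index, my_czech and the file name are placeholders (stopwords_path is resolved relative to the config directory, one stopword per line):

```sh
# Use the czech analyzer with stopwords read from an external file
# instead of the built-in defaults.
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_czech": {
          "type": "czech",
          "stopwords_path": "stopwords/czech_stopwords.txt"
        }
      }
    }
  }
}'
```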

Dutch Analyzer

The dutch analyzer is built on top of the standard tokenizer, and the standard, stop and dutch_stem filters.

French Analyzer

The french analyzer is built on top of the standard tokenizer, and the standard, stop, french_stem and lowercase filters.

German Analyzer

The german analyzer is built on top of the standard tokenizer, and the standard, lowercase, stop and german_stem filters.

Greek Analyzer

The greek analyzer is built on top of the standard tokenizer, and the greek_lowercase and stop filters.

Persian Analyzer

The persian analyzer is built on top of the arabic_letter tokenizer, and the lowercase, arabic_normalization, persian_normalization and stop filters.

Russian Analyzer

The russian analyzer is built on top of the russian_letter tokenizer, and the lowercase, stop and russian_stem filters. It comes with a default set of stopwords, which can be overridden.

Thai Analyzer

The thai analyzer is built on top of the standard tokenizer, and the standard, thai_word and stop filters.