A tokenizer of type `standard` provides a grammar-based tokenizer that is a good choice for most European-language documents. It splits words at punctuation characters, removing the punctuation; however, a dot that is not followed by whitespace is considered part of the token. It also splits words at hyphens, unless the token contains a number, in which case the whole token is interpreted as a product number and is not split. It recognizes email addresses and internet hostnames as single tokens.
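For example, assuming a local Elasticsearch node listening on `localhost:9200`, the `_analyze` API can show the tokens this tokenizer emits for a sample string. The sample text below is illustrative, and the exact token output can vary with the Elasticsearch/Lucene version in use:

```sh
# Inspect how the standard tokenizer splits a sample string.
# Assumes a local Elasticsearch node on port 9200; output varies by version.
curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "standard",
  "text": "jane.doe@example.com visited the wi-fi hotspot X-3000"
}'
```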
The following settings can be configured for a `standard` tokenizer type:
| Setting | Description |
|---|---|
| `max_token_length` | The maximum token length. If a token exceeds this length, it is discarded. Defaults to `255`. |
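As a sketch of how this setting might be applied, the following request creates an index whose analyzer uses a `standard` tokenizer with a lowered `max_token_length`. The index, analyzer, and tokenizer names (`my-index`, `my_analyzer`, `my_tokenizer`) are illustrative, not part of the setting itself:

```sh
# Create an index with a custom analyzer built on a standard tokenizer
# whose max_token_length is lowered to 5 for demonstration purposes.
curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}'
```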