An analyzer of type pattern that can flexibly separate text into terms via a regular expression.
The following settings can be set for a pattern analyzer type:
Setting | Description |
---|---|
lowercase | Should terms be lowercased or not. Defaults to true. |
pattern | The regular expression pattern. Defaults to \W+. |
flags | The regular expression flags. |
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS". Check the `Java Pattern API <http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#field_summary>`_ for more details about the available flags.
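To make the flag names concrete, here is a minimal sketch of how a pipe-separated flags string could be turned into the bitmask that java.util.regex.Pattern.compile expects. The class name PatternFlagsSketch and the helper parseFlags are hypothetical and only illustrate the idea; this is not the analyzer's own parsing code.

```java
import java.util.regex.Pattern;

public class PatternFlagsSketch {
    // Hypothetical helper: map names such as "CASE_INSENSITIVE|COMMENTS"
    // onto the int constants defined on java.util.regex.Pattern.
    static int parseFlags(String flags) {
        int mask = 0;
        for (String name : flags.split("\\|")) {
            switch (name.trim()) {
                case "CANON_EQ":         mask |= Pattern.CANON_EQ; break;
                case "CASE_INSENSITIVE": mask |= Pattern.CASE_INSENSITIVE; break;
                case "COMMENTS":         mask |= Pattern.COMMENTS; break;
                case "DOTALL":           mask |= Pattern.DOTALL; break;
                case "LITERAL":          mask |= Pattern.LITERAL; break;
                case "MULTILINE":        mask |= Pattern.MULTILINE; break;
                case "UNICODE_CASE":     mask |= Pattern.UNICODE_CASE; break;
                case "UNIX_LINES":       mask |= Pattern.UNIX_LINES; break;
                default:
                    throw new IllegalArgumentException("Unknown flag: " + name);
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        int mask = parseFlags("CASE_INSENSITIVE|COMMENTS");
        // COMMENTS lets the pattern carry whitespace and a trailing comment;
        // CASE_INSENSITIVE lets "x+" match upper-case input.
        Pattern p = Pattern.compile("x+ # one or more x", mask);
        System.out.println(p.matcher("XXX").matches()); // true
    }
}
```
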
In order to try out these examples, you should delete the test index before running each example:
curl -XDELETE localhost:9200/test
curl -XPUT 'localhost:9200/test' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "whitespace": {
                    "type": "pattern",
                    "pattern": "\\s+"
                }
            }
        }
    }
}'
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=whitespace' -d 'foo,bar baz'
# "foo,bar", "baz"
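The rule that the pattern matches separators rather than tokens can be checked locally with plain java.util.regex. The sketch below is a rough emulation only: the class SeparatorSplitSketch and its tokenize helper are hypothetical, and it assumes the analyzer drops empty pieces and lowercases terms when lowercase is true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class SeparatorSplitSketch {
    // Hypothetical helper: split the text wherever the separator regex matches,
    // skip empty pieces, and optionally lowercase what remains.
    static List<String> tokenize(String separatorRegex, String text, boolean lowercase) {
        List<String> terms = new ArrayList<>();
        for (String piece : Pattern.compile(separatorRegex).split(text)) {
            if (piece.isEmpty()) {
                continue;
            }
            terms.add(lowercase ? piece.toLowerCase(Locale.ROOT) : piece);
        }
        return terms;
    }

    public static void main(String[] args) {
        // The pattern describes the separators, so whitespace is removed
        // and the comma stays inside the first term.
        System.out.println(tokenize("\\s+", "foo,bar baz", true)); // [foo,bar, baz]

        // A pattern that matches the tokens themselves has the opposite effect:
        // the words are consumed and only the punctuation and spaces are left.
        System.out.println(tokenize("\\w+", "foo,bar baz", true));
    }
}
```

The second call is the mistake the IMPORTANT note warns about: writing the pattern for the words instead of for the gaps between them.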
curl -XPUT 'localhost:9200/test' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "nonword": {
                    "type": "pattern",
                    "pattern": "[^\\w]+"
                }
            }
        }
    }
}'
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'foo,bar baz'
# "foo,bar baz" becomes "foo", "bar", "baz"
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=nonword' -d 'type_1-type_4'
# "type_1","type_4"
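The reason type_1 survives as a single term is that \w includes the underscore and digits. The sketch below, using only java.util.regex (the class name NonWordSketch is hypothetical), compares the nonword pattern with a stricter separator class that treats the underscore as a separator.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class NonWordSketch {
    // Hypothetical helper: split the text on every match of the separator regex.
    static String[] split(String separatorRegex, String text) {
        return Pattern.compile(separatorRegex).split(text);
    }

    public static void main(String[] args) {
        // [^\w]+ only treats the hyphen as a separator here,
        // because letters, digits and "_" are all word characters.
        System.out.println(Arrays.toString(split("[^\\w]+", "type_1-type_4"))); // [type_1, type_4]

        // \W+ is the same character class written with the predefined shorthand.
        System.out.println(Arrays.toString(split("\\W+", "type_1-type_4"))); // [type_1, type_4]

        // Anything that is not a letter or digit separates terms, so "_" splits too.
        System.out.println(Arrays.toString(split("[^\\p{L}\\d]+", "type_1-type_4"))); // [type, 1, type, 4]
    }
}
```
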
curl -XPUT 'localhost:9200/test?pretty=1' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "camel": {
                    "type": "pattern",
                    "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
                }
            }
        }
    }
}'
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=camel' -d '
MooseX::FTPClass2_beta
'
# "moose","x","ftp","class","2","beta"
The regex above is easier to understand as:
  ([^\\p{L}\\d]+)                 # swallow non letters and numbers,
| (?<=\\D)(?=\\d)                 # or non-number followed by number,
| (?<=\\d)(?=\\D)                 # or number followed by non-number,
| (?<=[ \\p{L} && [^\\p{Lu}]])    # or lower case
  (?=\\p{Lu})                     # followed by upper case,
| (?<=\\p{Lu})                    # or upper case
  (?=\\p{Lu}                      # followed by upper case
    [\\p{L}&&[^\\p{Lu}]]          # then lower case
  )
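The camel-case pattern can also be tried outside the cluster. The following sketch builds the same regular expression from its five alternatives and splits the sample input with java.util.regex. The class CamelCaseSplitSketch and its tokenize helper are hypothetical, and dropping empty pieces and lowercasing are assumptions about what the analyzer does around the split.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class CamelCaseSplitSketch {
    // The five alternatives, in the same order as the annotated pattern.
    static final Pattern CAMEL = Pattern.compile(
          "([^\\p{L}\\d]+)"                                     // run of non-letters and non-digits
        + "|(?<=\\D)(?=\\d)"                                    // boundary: non-digit then digit
        + "|(?<=\\d)(?=\\D)"                                    // boundary: digit then non-digit
        + "|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})"               // boundary: lower case then upper case
        + "|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])");      // boundary: upper, then upper + lower

    // Hypothetical helper: split on the pattern, skip empty pieces, lowercase the rest.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : CAMEL.split(text)) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("MooseX::FTPClass2_beta"));
        // [moose, x, ftp, class, 2, beta]
    }
}
```

Most of the alternatives are zero-width lookarounds, so they mark a split point without consuming any characters; only the first alternative removes text.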