Existing filters already do this. For instance, the keep_types token filter does exactly that. If you use the <NUM> type, your custom token filter will let only numeric tokens through and drop all others.
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<NUM>" ]
    }
  ],
  "text": "1 quick fox 2 lazy dogs"
}
Result:
[1, 2]
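To use the same filter at index time rather than only through _analyze, you could register it in a custom analyzer. This is a minimal sketch; the index name my_index, the filter name only_numbers, and the analyzer name numbers_only are placeholders I made up for illustration, not anything from your setup.

```
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "only_numbers": {
          "type": "keep_types",
          "types": [ "<NUM>" ]
        }
      },
      "analyzer": {
        "numbers_only": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "only_numbers" ]
        }
      }
    }
  }
}
```

You can then check it with GET my_index/_analyze, passing "analyzer": "numbers_only" and the same sample text.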
You can achieve a similar result with the pattern_capture token filter as well.
But if you really want to go the Java way, then your best bet is to clone an existing analysis plugin and roll your own.