Combine multiple tokenizers in Solr

I’m trying to combine LetterTokenizerFactory with WhitespaceTokenizerFactory, but I can’t find a way to do it without copying content using copyField.

Let me describe my idea:

  • I have two entries in text, e.g. H&M and Hewlett-Packard
  • User should be able to find H&M entering h&m – I use WhitespaceTokenizerFactory for this purpose, no need to split tokens on special chars
  • User should be able to find Hewlett-Packard entering ‘packard’ – LetterTokenizerFactory serves this case, tokens are split on special characters
  • Now I want to combine both of these tokenizers

How can I achieve this without declaring two different field types with different tokenizer factories and then copying the value to a field of the second type?

You can use the WhitespaceTokenizerFactory as the main tokenizer, and then add the WordDelimiterGraphFilter to split the resulting tokens further into smaller tokens.
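As a rough sketch of what such a field type could look like in the schema: the name text_ws_wdg is made up for illustration, and the lowercase and flatten filters are assumptions added so that h&m can match H&M and so that the graph output can be indexed. On older Solr versions you would use WordDelimiterFilterFactory and drop the flatten step.

```xml
<!-- Sketch only: "text_ws_wdg" is a hypothetical field type name -->
<fieldType name="text_ws_wdg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so "H&M" and "Hewlett-Packard" each arrive as one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split each token further on non-alphanumeric characters -->
    <filter class="solr.WordDelimiterGraphFilterFactory"/>
    <!-- Graph filters should be flattened at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <!-- Assumed addition: case-insensitive matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

It is worth checking the resulting tokens for your own values on the Analysis page of the Solr admin UI before reindexing.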

From the example for the WordDelimiterGraphFilter (previously named WordDelimiterFilter, but that’s deprecated now – so the name will depend on which Solr version you’re using):

Non-alphanumeric characters (discarded): “hot-spot” -> “hot”, “spot”

That would allow packard to match Hewlett-Packard. Be advised that this will also allow ‘m’ to match H&M, since you’re splitting on non-alphanumeric characters. You can either use the protected setting for the filter to specify a list of words that should not be touched or, even better, if you want everything containing & to remain untouched, use the types parameter to redefine which character type & should be treated as.
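A minimal sketch of the types approach, assuming a mapping file named wdfftypes.txt (a hypothetical name) placed next to the schema. Mapping & to ALPHA should make the filter treat it like a letter, so H&M is no longer split, while the hyphen in Hewlett-Packard still is.

```xml
<!-- wdfftypes.txt (hypothetical file name) would contain the single line:
       & => ALPHA
-->
<filter class="solr.WordDelimiterGraphFilterFactory" types="wdfftypes.txt"/>

<!-- Alternative: list the exact tokens to leave untouched, one per line,
     in a file such as protwords.txt (also a hypothetical name here) -->
<filter class="solr.WordDelimiterGraphFilterFactory" protected="protwords.txt"/>
```

Use one of the two filter lines, not both, and apply the same setting in the index and query analyzers so that both sides produce the same tokens.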