NAME
KinoSearch::Docs::Tutorial::Analysis - How to choose and use Analyzers.
DESCRIPTION
Try swapping out the PolyAnalyzer in our Schema for a Tokenizer:
    my $tokenizer = KinoSearch::Analysis::Tokenizer->new;
    my $type = KinoSearch::Plan::FullTextType->new(
        analyzer => $tokenizer,
    );
Search for "senate", "Senate", and "Senator" before and after making
the change and re-indexing.
Under PolyAnalyzer, the results are identical for all three searches,
but under Tokenizer, searches are case-sensitive, and the result sets
for "Senate" and "Senator" are distinct.
PolyAnalyzer
What's happening is that PolyAnalyzer is performing more aggressive
processing than Tokenizer. In addition to tokenizing, it also converts
all text to lower case so that searches are case-insensitive, and it
uses a "stemming" algorithm to reduce related words to a common stem
("senat", in this case).
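To see why the two setups behave differently, here is a minimal sketch in plain Perl that does not use KinoSearch at all. The helpers case_fold, tokenize, and toy_stem are hypothetical stand-ins for CaseFolder, Tokenizer, and Stemmer, and the suffix-stripping rule is a crude assumption that happens to work for these three words; a real stemming algorithm is far more careful.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for CaseFolder, Tokenizer, and Stemmer.
# None of this is KinoSearch code.
sub case_fold { return map { lc $_ } @_ }
sub tokenize  { return map { /(\w+)/g } @_ }

# Crude suffix stripping, for illustration only.
sub toy_stem {
    return map { my $t = $_; $t =~ s/(?:e|or)s?$//; $t } @_;
}

# A Tokenizer on its own leaves case and word endings untouched.
sub tokenizer_only { return tokenize(@_) }

# The three-step chain: fold case, then tokenize, then stem.
sub poly_chain { return toy_stem( tokenize( case_fold(@_) ) ) }

for my $query (qw( senate Senate Senator )) {
    printf "%-8s tokenizer: %-8s chain: %s\n",
        $query, tokenizer_only($query), poly_chain($query);
}
```

With the full chain, all three queries reduce to the same term, "senat", so they hit the same postings. With the tokenizer alone, each query stays a distinct term.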
PolyAnalyzer is actually multiple Analyzers wrapped up in a single
package. In this case, it's three-in-one, since specifying a
PolyAnalyzer with "language => 'en'" is equivalent to this snippet:
    my $case_folder  = KinoSearch::Analysis::CaseFolder->new;
    my $tokenizer    = KinoSearch::Analysis::Tokenizer->new;
    my $stemmer      = KinoSearch::Analysis::Stemmer->new( language => 'en' );
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $case_folder, $tokenizer, $stemmer ],
    );
You can add or subtract Analyzers from there if you like. Try adding a
fourth Analyzer, a Stopalizer for suppressing "stopwords" like "the",
"if", and "maybe".
    my $stopalizer = KinoSearch::Analysis::Stopalizer->new(
        language => 'en',
    );
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $case_folder, $tokenizer, $stopalizer, $stemmer ],
    );
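The effect of the extra stage can be sketched in plain Perl, again without KinoSearch. The stoplist below is a hypothetical three-word list taken from the examples above (the list that Stopalizer uses for 'en' will differ), and analyze_with_stoplist is an illustrative helper, not a library function.

```perl
use strict;
use warnings;

# Hypothetical stoplist, for illustration only.
my %is_stopword = map { $_ => 1 } qw( the if maybe );

# Toy version of case folding + tokenizing + stopword removal.
sub analyze_with_stoplist {
    my ($text) = @_;
    my @tokens = lc($text) =~ /(\w+)/g;          # fold case, then split
    return grep { !$is_stopword{$_} } @tokens;   # drop stopwords
}

my @terms = analyze_with_stoplist('Maybe the Senate will vote if asked');
print "@terms\n";
```

Note that the filter runs on lower-cased tokens, which is why the stage sits after the case folder and tokenizer in the chain: "Maybe" is only recognized as a stopword once it has become "maybe".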
Also, try removing the Stemmer.
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $case_folder, $tokenizer ],
    );
The original choice of a stock English PolyAnalyzer probably still
yields the best results for this document collection, but you get the
idea: sometimes you want a different Analyzer.
When the best Analyzer is no Analyzer
Sometimes you don't want an Analyzer at all. That was true for our
"url" field because we didn't need it to be searchable, but it's also
true for certain types of searchable fields. For instance, "category"
fields are often set up to match exactly or not at all, as are fields
like "last_name" (because you may not want to conflate results for
"Humphrey" and "Humphries").
To specify that there should be no analysis performed at all, use
StringType:
    my $type = KinoSearch::Plan::StringType->new;
    $schema->spec_field( name => 'category', type => $type );
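As a rough mental model of matching without analysis, the plain-Perl sketch below stores each field value whole and looks it up verbatim. It is a toy under stated assumptions, not a description of KinoSearch internals: the index hash, the document IDs, and the exact_lookup helper are all made up for illustration.

```perl
use strict;
use warnings;

# Toy index: each unanalyzed value is stored whole as a single key.
# Values and document IDs are invented for this example.
my %last_name_index = (
    'Humphrey'  => [ 1, 4 ],
    'Humphries' => [ 2 ],
);

# Exact-match lookup: no case folding, tokenizing, or stemming.
sub exact_lookup {
    my ($value) = @_;
    return @{ $last_name_index{$value} || [] };
}

my @exact = exact_lookup('Humphrey');
my @lower = exact_lookup('humphrey');
print "Humphrey: @exact\n";
print "humphrey: ", scalar(@lower), " hits\n";
```

A query matches only when it equals the stored value character for character, so "Humphrey" and "Humphries" stay separate and a lower-cased query finds nothing.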
Highlighting up next
In our next tutorial chapter, KinoSearch::Docs::Tutorial::Highlighter,
we'll add highlighted excerpts from the "content" field to our search
results.
COPYRIGHT AND LICENSE
Copyright 2008-2010 Marvin Humphrey
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.