KinoSearch::Analysis::Tokenizer(3)   User Contributed Perl Documentation

NAME
KinoSearch::Analysis::Tokenizer - Split a string into tokens.
SYNOPSIS
my $whitespace_tokenizer
= KinoSearch::Analysis::Tokenizer->new( pattern => '\S+' );
# or...
my $word_char_tokenizer
= KinoSearch::Analysis::Tokenizer->new( pattern => '\w+' );
# or...
my $apostrophising_tokenizer = KinoSearch::Analysis::Tokenizer->new;
# Then... once you have a tokenizer, put it into a PolyAnalyzer:
my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
analyzers => [ $case_folder, $word_char_tokenizer, $stemmer ], );
DESCRIPTION
Generically, "tokenizing" is a process of breaking up a string into an
array of "tokens". For instance, the string "three blind mice" might
be tokenized into "three", "blind", "mice".
KinoSearch::Analysis::Tokenizer decides where it should break up the
text based on a regular expression compiled from a supplied "pattern"
matching one token. If our source string is...
"Eats, Shoots and Leaves."
... then a "whitespace tokenizer" with a "pattern" of "\S+"
produces...
Eats,
Shoots
and
Leaves.
... while a "word character tokenizer" with a "pattern" of "\w+"
produces...
Eats
Shoots
and
Leaves
... the difference being that the word character tokenizer skips over
punctuation as well as whitespace when determining token boundaries.
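The two token streams above can be reproduced with plain Perl regexes, since the tokenizer applies its compiled pattern as a global match against the text. A standalone sketch (no KinoSearch required):

```perl
# Plain-Perl sketch of the two strategies shown above; KinoSearch applies
# the compiled pattern in the same global-match fashion.
my $text = "Eats, Shoots and Leaves.";

# Whitespace tokenizer: a token is a run of non-whitespace characters.
my @whitespace_tokens = $text =~ /\S+/g;   # "Eats," "Shoots" "and" "Leaves."

# Word character tokenizer: punctuation is skipped along with whitespace.
my @word_char_tokens = $text =~ /\w+/g;    # "Eats" "Shoots" "and" "Leaves"
```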
CONSTRUCTORS
new( [labeled params] )
my $word_char_tokenizer = KinoSearch::Analysis::Tokenizer->new(
pattern => '\w+', # defaults to '\w+(?:[\x{2019}']\w+)*' if omitted
);
· pattern - A string specifying a Perl-syntax regular expression
which should match one token. The default value is
"\w+(?:[\x{2019}']\w+)*", which matches "it's" as well as "it" and
"O'Henry's" as well as "Henry".
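The effect of the default pattern can be illustrated in plain Perl (again without KinoSearch itself): the optional apostrophe clause keeps contractions and possessives intact as single tokens.

```perl
use utf8;

# The documented default pattern: a word, optionally continued by
# apostrophe-joined parts (both the ASCII ' and the U+2019 curly quote).
my $default = qr/\w+(?:[\x{2019}']\w+)*/;

my @tokens = "it's in O'Henry's stories" =~ /$default/g;
# "it's" and "O'Henry's" each survive as one token.
```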
INHERITANCE
KinoSearch::Analysis::Tokenizer isa KinoSearch::Analysis::Analyzer isa
KinoSearch::Object::Obj.
COPYRIGHT AND LICENSE
Copyright 2005-2010 Marvin Humphrey
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
perl v5.14.1                  2011-06-20    KinoSearch::Analysis::Tokenizer(3)