IXPARSE(1)                                                          IXPARSE(1)
NAME
ixparse - generate and convert text processing information files
SYNOPSIS
/usr/bin/ixparse [ -aAbChHfgnNprUvwWx ] [ -ttype ] [ -Dfile ]
[ -Ffile ] [ -Sfile ] [ -Llanguage ] [ -M# ] [ -P# ] [ -ystring ]
[ file ... ]
DESCRIPTION
Given a list of files, or a stream on standard input, ixparse generates
one of four types of profiling information on standard output. With
the -v option, ixparse can also generate profiling information for each
input file; the output is put into separate files named by adding an
extension to the input file's name (see below for the extensions).
The four types of profile are: weighting domain (a binary format
defined by the Indexing Kit's IXWeightingDomain class), histogram,
description, and Attribute Reader Format. The binary weighting domain
format is undocumented. A description is a short summary that can be
derived from some file formats, such as UNIX manual pages. Attribute
Reader Format is described in the Indexing Kit documentation in the
NEXTSTEP General Reference. Histogram format is described below.
Weighting domain files can be used with ixbuild(1) or again with
ixparse to alter the weighting of tokens in the index or profile. For
example, a weighting domain could be generated for all the source files
in a development project:
ixparse -W *.[cm] >project.weight
and that file could be used again with ixparse:
ixparse -Hp -Dproject.weight MyObject.m
The result would be a histogram in which word weights are skewed: if
two words occur the same number of times in MyObject.m, the one that
occurs less frequently in the entire set of source files (that is, in
the domain file project.weight) has the higher weight.
In addition to generating profiling information for text files, ixparse
can read existing profiles in weighting domain format, histogram
format, and NEXTSTEP Release 2 Word Frequency Table (WFTable) format,
converting that information to one of the other formats.
HISTOGRAM FORMAT
Each line of a file in histogram format has the form:
token weight rank
token is the token or word in the index, weight is its weight
(frequency) in the domain, and rank is its cardinal rank in the domain
(1 = most common, 2 = second most common, and so on). rank is only
present in histograms produced by converting from weighting domains.
The fields of the line are separated by single spaces; be sure to
search backward from the end of a line to find the token, as it is
possible for the token to contain embedded spaces or tabs.
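The advice to search backward from the end of the line can be sketched in Python as follows. This is a minimal illustration, not part of ixparse: the names HistogramEntry and parse_histogram_line are hypothetical, and the caller is assumed to know whether the rank field is present, since a token may itself end in something that looks like a number.

```python
from typing import NamedTuple, Optional


class HistogramEntry(NamedTuple):
    token: str
    weight: float
    rank: Optional[int]


def parse_histogram_line(line: str, has_rank: bool) -> HistogramEntry:
    """Split one histogram line from the right so that spaces or tabs
    embedded in the token are preserved."""
    line = line.rstrip("\n")
    if has_rank:
        # Two trailing fields: weight and rank.
        token, weight, rank = line.rsplit(" ", 2)
        return HistogramEntry(token, float(weight), int(rank))
    # One trailing field: weight only.
    token, weight = line.rsplit(" ", 1)
    return HistogramEntry(token, float(weight), None)
```

Splitting on a single space from the right relies on the fields being separated by single spaces, as stated above. The weight is parsed as a float here on the assumption that frequency weighting can produce fractional values.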
OPTIONS
The following options select input and output formats. Only one of the
input options -t, -h, -w, and -x and one of the output options -H, -g,
-W, and -b can be specified.
-ttype Interpret input as file type type (for example, -trtf for
Rich Text Format). By default, ixparse attempts to
determine the file type for each file automatically.
-w Interpret input as weighting domain format.
-h Interpret input as histogram format.
-x Interpret input as NEXTSTEP Release 2 WFTable format.
-H Generate output in histogram format. This is the default.
-g Generate output as descriptions of file contents.
-W Generate output in weighting domain format.
-b Generate output in Attribute Reader Format.
-v Vector mode. Generate an output file for each input file.
Histogram and Attribute Reader Format files have an
extension of .histogram (this is a bug; Attribute Reader
Format files should use .arf). Weighting domain format
files have an extension of .weight. Description files have
an extension of .description.
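The vector-mode naming rule can be sketched in Python as below. This is only an illustration: the function name and the format keys are hypothetical, and it assumes the extension is appended to the complete input file name rather than replacing an existing extension.

```python
# Extensions as listed for vector mode (-v). The "arf" entry mirrors the
# documented bug: Attribute Reader Format output is named .histogram.
VECTOR_EXTENSIONS = {
    "histogram": ".histogram",
    "arf": ".histogram",
    "weight": ".weight",
    "description": ".description",
}


def vector_output_name(input_name: str, output_format: str) -> str:
    """Return the assumed output file name for one input file."""
    return input_name + VECTOR_EXTENSIONS[output_format]
```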
The remaining options control other parsing switches and weighting
calculations.
-a Use absolute weighting. The weight of a token (word) is its
number of occurrences in the input.
-A Don't fold plural word forms. The default is to do plural
folding.
-C Don't fold case to lower case. The default is to fold case.
-Dfile Use the supplied weighting domain file (default
.index.domain). This is used for generating peculiarity
weighting.
-f Use frequency weighting (number of occurrences / total
tokens).
-Ffile Use the supplied file type table file (default
.index.ftype). See the ixbuild(1) manual page for more
information on file type tables.
-Llanguage Parse files as though they contain text in the language
language. If no language is specified, the system default
language is used.
-M# Use the supplied minimum weight; words below this weight are
dropped from the index. The default is no minimum weight.
This option excludes use of the -P option.
-n Sort histogram output by name rather than weight.
-N Do not sort histogram output.
-p Use peculiarity weighting in conjunction with a weighting
domain (see -D).
-P# Use the supplied percentage passed; words below this
percentage are dropped from the index. The default is 100%
passed. This option excludes use of the -M option.
-r Reduce words to stems (for example, writer -> write). The
default is not to do this.
-Sfile Use the supplied stop words file (default .index.swords).
See the ixbuild(1) manual page for more information on stop
words files.
-U Disable uniquing in Attribute Reader Format. See the
Attribute Reader Format documentation for more information.
-ystring Use the supplied punctuation string to delimit words; for
example, -y".,; ".
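The absolute (-a) and frequency (-f) weightings and the minimum-weight cutoff (-M) described above can be sketched in Python. This is a rough illustration under stated assumptions, not ixparse's implementation: the function names are hypothetical, the input is assumed to be already tokenized, and case folding, plural folding, and stop words are ignored.

```python
from collections import Counter
from typing import Dict, Iterable


def absolute_weights(tokens: Iterable[str]) -> Dict[str, int]:
    """Weight of a token = its number of occurrences in the input."""
    return dict(Counter(tokens))


def frequency_weights(tokens: Iterable[str]) -> Dict[str, float]:
    """Weight of a token = number of occurrences / total tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def drop_below(weights: Dict[str, float], minimum: float) -> Dict[str, float]:
    """Drop tokens whose weight is below the minimum, as -M is described."""
    return {t: w for t, w in weights.items() if w >= minimum}
```

For the tokens the, cat, the, hat, the absolute weight of "the" is 2 and its frequency weight is 0.5. Peculiarity weighting (-p) is left out because its formula is not given here.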
SEE ALSO
ixbuild(1), ixsearch(1), Indexing Kit documentation in the NEXTSTEP
General Reference
BUGS
ixparse doesn't read data in Attribute Reader Format.
ixparse filters files from various formats during parsing. It should
make the intermediate filtered formats available as output options.
Sorting options don't apply when converting from domain to histogram
formats.
Output files generated by vector mode in Attribute Reader Format should
use .arf as their extension, not .histogram.
NeXT Computer, Inc. August 24, 1993 IXPARSE(1)