DtSearchQuery(library call) DtSearchQuery(library call)
NAME
DtSearchQuery - Perform a DtSearch database search for a specified
query
SYNOPSIS
#include <Dt/Search.h>
int DtSearchQuery(
void *qry,
char *dbname,
int search_type,
char *date1,
char *date2,
DtSrResult **results,
long *resultscount,
char *stems,
int *stemscount);
DESCRIPTION
DtSearchQuery is the DtSearch API search function.
DtSearchQuery is passed a query string and some search options, per‐
forms the requested search, and if successful returns a linked list of
DtSrResult structures representing the documents satisfying the search.
The results list contains information about the documents that can be
used for subsequent retrievals, as well as information suitable for
display to an end user.
Search Types
DtSearchQuery supports three types of searches: P, W, and S.
Type P Search Query Strings
Query strings for search type P have the simplest syntax, namely a
sequence of words separated by ASCII whitespace. Punctuation and
invalid words are silently discarded by the search engine. The only
possible syntax error occurs when every query word is invalid in the
language of the database.
Search type P is often used to implement a limited Query-by-Example
(QBE) search paradigm. In this scenario, users typically paste document
text from whatever source into a query string text field. Their expec‐
tation is that the search engine will return the documents in the data‐
base that are "most similar" to the text of the query string, and the
statistical sort of the results list usually satisfies that expecta‐
tion.
Note that although search type P does not use boolean syntax, it is
actually implemented as a stemmed search (type S search) with implied
boolean ORs between words.
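The relationship between type P and type S queries can be pictured with a small sketch. The helper below is hypothetical and is not part of the DtSearch API; it only rewrites a whitespace-separated word list as the explicit OR expression that the text says a type P search implies. Punctuation removal, word validation, and stemming are left to the engine and are not modeled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rewrite a whitespace-separated type P query
 * as an explicit OR expression, e.g. "ice  cream" -> "ice | cream".
 * Returns 0 on success, -1 if 'out' is too small. */
int p_query_to_or(const char *query, char *out, size_t out_size)
{
    size_t used = 0;
    int first = 1;

    assert(query != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    while (*query != '\0') {
        size_t len = 0;
        size_t sep;

        /* Skip whitespace between words. */
        while (*query != '\0' && isspace((unsigned char)*query))
            query++;
        if (*query == '\0')
            break;

        /* Measure the next word. */
        while (query[len] != '\0' && !isspace((unsigned char)query[len]))
            len++;

        sep = first ? 0 : 3;                /* room for " | " */
        if (used + sep + len + 1 > out_size)
            return -1;
        if (!first) {
            memcpy(out + used, " | ", 3);
            used += 3;
        }
        memcpy(out + used, query, len);
        used += len;
        out[used] = '\0';

        first = 0;
        query += len;
    }
    return 0;
}
```

Under these assumptions, the resulting string could be passed as a type S query and would be expected to select the same documents as the original type P query.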
Types S and W Boolean Query Strings
Query strings for search types S (stemmed boolean) and W (exact word
boolean) must be syntactically valid boolean expressions as described
below. Any string that does not match a valid expression rule is
invalid and will fail with an error message.
Query words for all search types may be entered in any codeset for a
supported DtSearch language, including multibyte languages. The language
module of the database may identify a word as invalid for a number of
reasons, for example because the word would not have been indexed (it is
too short, too long, or on the stop list). With one exception,
linguistically invalid words result in a syntax error. The exception is
an "all ANDs" query, where invalid words and valid words that happen not
to be in the database are silently erased from the query string.
The boolean query operators are the ASCII metacharacters: '&' for AND,
'|' for OR, '~' for NOT, '(' and ')' for open and close parentheses
respectively, and '@nnn' for collocation expressions.
All expression tokens are separated by ASCII whitespace. Typically this
is one or more space or tab characters. Omitting whitespace separators is
legal if it can be done unambiguously. For example "word1&word2" is a
legal expression but "word1word2" would be interpreted as a single word
token.
The ASCII "at" sign (@) marks a special boolean collocation operator.
The collocation operator has the syntax "@n...", the ASCII "at" sign
followed by one or more ASCII numeric digits, representing an integer
with value greater than zero. Collocation is a variation of the AND
search where a user can specify the maximum distance in bytes between
any two words. In most languages a byte is equivalent to a character
position. For example to find "ice" and "cream" separated by no more
than five characters, the search query "ice @5 cream" may be used.
Unlike other boolean operators, the collocation operator can apply only
to naked word tokens, not other expressions. Searches including collo‐
cation operators are slower than searches without them, and can be much
slower for common words.
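As a rough illustration of what a collocation test means, the following sketch checks whether two words occur within a given number of bytes of each other in a text. It is not the engine's algorithm. It assumes that the distance is counted from the end of one word to the start of the other, that either order qualifies, and that plain substring matching is good enough; none of these details is stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if some occurrence of w1 and some occurrence of w2 in
 * 'text' are separated by at most max_gap bytes, in either order.
 * ASSUMPTION: the gap runs from the end of one word to the start of
 * the other, and matching is plain substring search. */
int within_distance(const char *text, const char *w1,
                    const char *w2, size_t max_gap)
{
    size_t len1 = strlen(w1), len2 = strlen(w2);
    const char *a, *b;

    assert(len1 > 0 && len2 > 0);
    for (a = strstr(text, w1); a != NULL; a = strstr(a + 1, w1)) {
        for (b = strstr(text, w2); b != NULL; b = strstr(b + 1, w2)) {
            if (b >= a + len1 && (size_t)(b - (a + len1)) <= max_gap)
                return 1;               /* w1 ... w2 */
            if (a >= b + len2 && (size_t)(a - (b + len2)) <= max_gap)
                return 1;               /* w2 ... w1 */
        }
    }
    return 0;
}
```

With this interpretation, within_distance(text, "ice", "cream", 5) plays the role of the query "ice @5 cream" for a single document. The nested scan also hints at why collocation is slow for common words: every pair of occurrences may have to be compared.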
A query may contain at most 8 distinct word tokens. Collocation
operators count toward the 8. There is no limit on the number of other
operators, as long as they match the syntax rules.
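A front end may want to check the token limit before submitting a query. The sketch below is a hypothetical pre-check, not the engine's tokenizer: it counts one slot for each word and one for each "@n" collocation operator, and none for the other operators. It counts every occurrence rather than distinct words, so it is a conservative upper bound.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_WORD_SLOTS 8    /* limit stated in the text above */

/* Return nonzero if c is one of the single-character operators. */
static int is_operator_char(int c)
{
    return c != '\0' && strchr("&|~()", c) != NULL;
}

/* Count the slots a query would use: each word and each "@n"
 * collocation operator counts as one; & | ~ ( ) count as none. */
int count_word_slots(const char *query)
{
    int slots = 0;

    assert(query != NULL);
    while (*query != '\0') {
        unsigned char c = (unsigned char)*query;

        if (isspace(c) || is_operator_char(c)) {
            query++;                    /* separator or operator */
            continue;
        }
        slots++;
        if (c == '@') {                 /* "@" followed by digits */
            query++;
            while (isdigit((unsigned char)*query))
                query++;
        } else {                        /* an ordinary word */
            while (*query != '\0' && *query != '@' &&
                   !isspace((unsigned char)*query) &&
                   !is_operator_char((unsigned char)*query))
                query++;
        }
    }
    return slots;
}

/* Return nonzero if the query stays within the slot limit. */
int fits_word_limit(const char *query)
{
    return count_word_slots(query) <= MAX_WORD_SLOTS;
}
```

For example, "ice @5 cream" uses three slots under this count, and "word1&word2" uses two.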
Note:
Collocation operators are only supported for "Austext flavor"
databases. The default flavor of database created by dtsrcreate
is "Dtinfo flavor," which does not support collocation.
Boolean Query Syntax Rules
There are only 6 syntax rules and the rules are recursive. Ambiguity is
resolved by precedence and associativity rules.
1. valid_expression := word_token
A valid expression can be just a valid naked word token.
Semantically, the expression returns all documents containing
the specified word. The word_token must be a valid word in
the language of the database being searched.
2. valid_expression := valid_expression '&' valid_expression
The ASCII ampersand character is the AND character. Semanti‐
cally, it returns all documents satisfying both the first and
second expressions (boolean intersection). AND is also the
"implied" boolean operator in the following sense: the query
parser will insert an ampersand between words or expressions
that otherwise would be separated only by whitespace. For
example "word1 word2" becomes "word1 & word2".
3. valid_expression := valid_expression '|' valid_expression
The ASCII vertical bar character is the OR character. It
returns all documents satisfying either the first or the
second expression (boolean union).
4. valid_expression := '(' valid_expression ')'
Valid expressions may be recursively nested in ASCII open and
close parentheses characters. The query parser "forgives" two
common human errors. It will automatically discard excessive
close parentheses characters, and it will automatically gen‐
erate close parentheses characters if necessary at the end of
a query. For example, "aaa | (bbb & ccc)))))) ddd" becomes
"aaa | ( bbb & ccc) & ddd", and "aaa ((bbb" becomes "aaa ( (
bbb ) )".
5. valid_expression := '~' valid_expression
The ASCII tilde character is the unary NOT operator. It
returns every document in the database that is not in the set
satisfying the expression.
6. valid_expression := word_token collocation_operator word_token
Collocation operators are permitted only between words, not
expressions. Each of the word tokens and the collocation
operator itself occupy slots in the table of 8 maximum word
tokens.
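The parenthesis "forgiveness" described in rule 4 can be sketched as a single pass over the query. The function below is hypothetical and only repairs parentheses; it does not insert the implied '&' or normalize spacing the way the query parser is described as doing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy 'query' to 'out', dropping any ')' that has no matching '('
 * and appending one ')' for every '(' still open at the end.
 * Returns 0 on success, -1 if 'out' is too small. */
int repair_parens(const char *query, char *out, size_t out_size)
{
    size_t used = 0;
    int depth = 0;                  /* currently open parentheses */

    assert(query != NULL && out != NULL && out_size > 0);

    for (; *query != '\0'; query++) {
        if (*query == '(') {
            depth++;
        } else if (*query == ')') {
            if (depth == 0)
                continue;           /* excess close paren: discard */
            depth--;
        }
        if (used + 1 >= out_size)
            return -1;
        out[used++] = *query;
    }
    while (depth-- > 0) {           /* generate missing close parens */
        if (used + 1 >= out_size)
            return -1;
        out[used++] = ')';
    }
    out[used] = '\0';
    return 0;
}
```

Given "aaa | (bbb & ccc)))))) ddd" this produces "aaa | (bbb & ccc) ddd", and given "aaa ((bbb" it produces "aaa ((bbb))".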
Boolean Associativity and Precedence Table
In order from highest precedence to lowest:
Associativity   Operator     Example
(none)          COLLOC
right           NOT          "aaa~bbb" resolved as "aaa & (~bbb)"
left            AND          "aaa bbb ccc" resolved as "(aaa & bbb) & ccc"
left            OR           "aaa|bbb|ccc" resolved as "(aaa | bbb) | ccc"
(none)          naked word
Example Boolean Queries
aaa bbb ccc
Returns all records that contain at least one occurrence of all three
words.
aaa | (bbb ~ccc)
Returns all records containing "aaa", plus all records that contain
"bbb" but not "ccc".
aaa ~(aaa @1 bbb)
Returns all records containing "aaa" but omits those where "aaa" is one
character away from "bbb".
It is possible to formulate a query that requires retrieving all
records in the database that contain none of the query words (for
example, ~aaa). Users should be warned that in a large database such a
search can take a very long time.
Using the implied associativity and precedence rules, the ambiguous
query string aaa ~bbb | ccc ~ddd @10 eee is disambiguated as (aaa &
(~bbb)) | (ccc & (~(ddd @10 eee))).
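The disambiguation above can be reproduced with a small recursive descent parser. The sketch below is not the DtSearch query parser; it is a hypothetical illustration that applies the stated precedence (COLLOC, then NOT, then AND, then OR), treats a missing operator as AND, and prints the fully parenthesized form. It assumes balanced parentheses and queries shorter than its fixed buffers, and it uses a static cursor, so it is not reentrant.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define BUF 256

static const char *p;           /* cursor into the query being parsed */

static void skip_ws(void) { while (isspace((unsigned char)*p)) p++; }

static int is_word_char(int c)
{
    return c != '\0' && !isspace(c) && strchr("&|~()@", c) == NULL;
}

/* Read a run of word characters (also used for the digits after '@'). */
static void read_word(char *tok)
{
    size_t n = 0;
    while (is_word_char((unsigned char)*p) && n < BUF - 1)
        tok[n++] = *p++;
    tok[n] = '\0';
}

static void parse_or(char *out);

/* primary := '(' or_expr ')' | word [ '@n' word ] */
static void parse_primary(char *out)
{
    char left[BUF], dist[BUF], right[BUF];

    skip_ws();
    if (*p == '(') {
        p++;
        parse_or(out);
        skip_ws();
        if (*p == ')') p++;
        return;
    }
    read_word(left);
    skip_ws();
    if (*p == '@') {            /* collocation binds tightest */
        p++;
        read_word(dist);
        skip_ws();
        read_word(right);
        snprintf(out, BUF, "(%s @%s %s)", left, dist, right);
    } else {
        snprintf(out, BUF, "%s", left);
    }
}

/* not_expr := '~' not_expr | primary   (right associative) */
static void parse_not(char *out)
{
    char inner[BUF];

    skip_ws();
    if (*p == '~') {
        p++;
        parse_not(inner);
        snprintf(out, BUF, "(~%s)", inner);
    } else {
        parse_primary(out);
    }
}

/* and_expr := not_expr { ['&'] not_expr }   (left associative;
 * a missing '&' between operands is the implied AND) */
static void parse_and(char *out)
{
    char left[BUF], right[BUF];

    parse_not(left);
    for (;;) {
        skip_ws();
        if (*p == '&') p++;
        else if (*p == '\0' || *p == '|' || *p == ')') break;
        parse_not(right);
        snprintf(out, BUF, "(%s & %s)", left, right);
        snprintf(left, BUF, "%s", out);
    }
    snprintf(out, BUF, "%s", left);
}

/* or_expr := and_expr { '|' and_expr }   (left associative, lowest) */
static void parse_or(char *out)
{
    char left[BUF], right[BUF];

    parse_and(left);
    for (;;) {
        skip_ws();
        if (*p != '|') break;
        p++;
        parse_and(right);
        snprintf(out, BUF, "(%s | %s)", left, right);
        snprintf(left, BUF, "%s", out);
    }
    snprintf(out, BUF, "%s", left);
}

/* Fully parenthesize 'query' into 'out' (at least BUF bytes),
 * dropping the redundant outermost pair. */
void parenthesize(const char *query, char *out)
{
    char full[BUF];
    size_t n;

    assert(query != NULL && out != NULL);
    p = query;
    parse_or(full);
    n = strlen(full);
    if (n >= 2 && full[0] == '(' && full[n - 1] == ')') {
        full[n - 1] = '\0';
        snprintf(out, BUF, "%s", full + 1);
    } else {
        snprintf(out, BUF, "%s", full);
    }
}
```

Calling parenthesize on "aaa ~bbb | ccc ~ddd @10 eee" yields "(aaa & (~bbb)) | (ccc & (~(ddd @10 eee)))". Each precedence level gets its own function, so the level that binds loosest (OR) is entered first and the one that binds tightest (collocation) is reached last.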
ARGUMENTS
search_type
Specifies the type of search to perform. Valid values are P,
W, and S.
Search type P indicates that the query string is a sequence
of words separated by ASCII whitespace. It requests that the
words be stemmed prior to searching, that all documents con‐
taining any of the words be returned, that the results list
be statistically sorted, and that no more than the top MaxRe‐
sults list items be returned where MaxResults is the current
value returned from DtSearchGetMaxResults. Note that a type P
search is identical to a type S boolean search with an
implied boolean OR between words.
Search types W and S are boolean query searches. They indi‐
cate that the query string is a sequence of words and boolean
operators matching the syntax described under "Types S and W
Boolean Query Strings" above.
Type S requests that words be stemmed prior to searching.
Type W requests that words be left unstemmed. Both types
request that all documents containing the combinations of
query words specified by the boolean operations be returned,
that the results list be statistically sorted if possible,
and that no more than the top MaxResults list items be
returned, where MaxResults is the current value returned from
DtSearchGetMaxResults.
dbname Specifies which database is to be searched. It is any one of
the database name strings returned from DtSearchInit or
DtSearchReinit. If dbname is NULL, the first database name
string is used.
Within the specified database, searches will be restricted to
those documents whose DtSrKeytype.is_selected field is
nonzero.
date1 and date2
Specify a range of document dates to use for the
search. Only documents within the specified range will be
returned on the results list.
date1 is the older end of the range and if not NULL, requests
DtSearch to return only those records younger than (that is,
after) the specified date.
date2 is the younger end of the range and if not NULL,
requests DtSearch to return only those records older than
(that is, before) the specified date.
It is valid to specify just one of the arguments.
Undated documents always qualify for a results list regard‐
less of search date strings. The format of a valid date
string is described in DtSearchValidDateString(3).
stems and stemscount
Specify a character buffer to hold parsed and
stemmed words and a variable to receive the number of stored
words. stems and stemscount are optional; they can be NULL.
However, if either is specified, they must both be specified.
If specified, stems must point to a character buffer large
enough to hold DtSrMAX_STEMCOUNT by DtSrMAXWIDTH_HWORD bytes.
An array of parsed and stemmed query words will be stored
here by the API for use by a later call to DtSearchHighlight.
The size of the array will be stored in stemscount.
results and resultscount
Specify where a pointer to the results list
will be stored and a variable to receive the number of items
on the list.
Results lists can be manipulated with several utility func‐
tions.
In DtSearch, frequency of occurrence information is main‐
tained for words across the whole database and within docu‐
ments. For most queries, results lists are sorted by this
statistical information and presented to the user as a "prox‐
imity" number for each document on the list. Proximity is
meant to appear to a user as a distance, or a measure of the
nearness of the query to the document. Conceptually, the
smaller the proximity the "closer" the document is to the
query and the more likely it will be valuable to the user.
DtSearch searches only one database at a time and returns
only results lists for that single database. However,
browsers often provide the illusion of simultaneous searches
in multiple databases, merging the results lists by proximity
when completed. Since the domain of knowledge and density of
words and records may vary from database to database, the
value of proximity numbers may similarly vary, and some data‐
bases may be underrepresented on merged results lists.
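The merging that browsers perform across databases can be sketched as an ordinary two-way merge. The Hit structure below is a hypothetical stand-in and is not the real DtSrResult layout; the sketch also assumes that each per-database list is already sorted by ascending proximity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one results-list item; this is NOT the
 * real DtSrResult layout. */
typedef struct {
    const char *dbname;     /* database the hit came from */
    const char *key;        /* document identifier */
    long proximity;         /* smaller means closer to the query */
} Hit;

/* Merge two arrays already sorted by ascending proximity into 'out',
 * which must have room for na + nb items. Ties keep 'a' first.
 * Returns the number of items written. */
size_t merge_by_proximity(const Hit *a, size_t na,
                          const Hit *b, size_t nb, Hit *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(out != NULL);
    while (i < na && j < nb)
        out[k++] = (b[j].proximity < a[i].proximity) ? b[j++] : a[i++];
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

Because the merge compares raw proximity values, a database whose proximities run systematically larger will sink toward the end of the merged list, which is the underrepresentation effect described above.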
RETURN VALUE
This function has three common return codes.
DtSrOK is returned, as well as a results list and stems array, when the
search was completely successful.
DtSrNOTAVAIL is returned when the query was valid but the search was
unsuccessful (that is, no set of documents matched the query). There
are usually no messages with DtSrNOTAVAIL.
DtSrFAIL is returned when the search was unsuccessful, usually because
of an invalid query, and user messages on the MessageList explain why.
Any API function can also return DtSrREINIT and the return codes for
fatal engine errors at any time.
SEE ALSO
DtSrAPI(3), DtSearchReinit(3), DtSearchGetMaxResults(3),
DtSearchSetMaxResults(3), DtSearchGetKeytypes(3), DtSearchValidDateString(3),
DtSearchSortResults(3), DtSearchFreeResults(3), DtSearchHighlight(3)