This file documents the Trivial Database Search Facility (TDBSF).

This manual documents the TDBSF (version 3.2.2, 29 March 2013), a small search engine for “databases” stored in a very simple (trivial) format.

The engine is written in Python and, as of March 2009, there is only one really usable user interface, written in ELisp for Emacs. However, the TDBSF is designed to make adding interfaces easy, so whether others appear depends only on whether people want to use it outside Emacs.

The databases that can be used with the TDBSF consist of a set of text files divided into records. Each record consists of four parts: a header, a body, and two parts used to identify them, the beginning-of-header (boh) and the end-of-header (eoh).

The beginning of a header is identified by a regular expression (whose syntax is defined in the “re” module of the Python programming language). Another regular expression defines where the header ends and therefore where the body of the record begins. The body ends where the next beginning-of-header regular expression matches, or at the end of the file. For instance, a record could look like this (the last two lines belong to the second record):

     Header of the first record
     As many lines as necessary
     Body of the first record
     Header of the next record...
     Body of the next record

A TDBSF database file is a file containing one or more such records in sequence.
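The record layout described above can be sketched in a few lines of Python. This is only an illustration, not the engine's actual code: the function name split_records, the "== " header marker and the choice to end the header at the end of the first end-of-header match are all assumptions made for the example.

```python
import re


def split_records(text, boh_pattern, eoh_pattern):
    """Split text into (header, body) pairs.

    Assumption: the header runs from the start of a beginning-of-header
    match to the end of the first end-of-header match that follows it.
    """
    boh = re.compile(boh_pattern, re.MULTILINE)
    eoh = re.compile(eoh_pattern, re.MULTILINE)
    starts = [m.start() for m in boh.finditer(text)]
    records = []
    for i, start in enumerate(starts):
        # A record ends where the next header begins, or at end of file.
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        m = eoh.search(text, start, end)
        header_end = m.end() if m else end
        records.append((text[start:header_end], text[header_end:end]))
    return records


sample = (
    "== First record\n"
    "body line 1\n"
    "body line 2\n"
    "== Second record\n"
    "body line 3\n"
)

# Hypothetical delimiters: a header is a single line starting with "== ".
for header, body in split_records(sample, r"^== ", r"\n"):
    print(repr(header), repr(body))
```

With real database files, the two regular expressions would be whatever the database owner chose; nothing in the format itself imposes the ones used here.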

The TDBSF tests a search expression on each record of a database defined by its set of database files and allows the user to view/edit (depending on the interface used) the matching records.

I wrote this software because my father had for years used an old shareware program to retrieve data stored in “databases” as described above. That program has severe limitations, is not easily extensible and only works on MS-DOS and some versions of MS Windows. The TDBSF solves these problems while keeping his “databases” unchanged. Besides, it is much more powerful and much faster than the original tool.

1 How to use the TDBSF

Supposing you have followed the instructions in the INSTALL file, here is what you have to do in order to use the TDBSF with the Emacs interface [1]:

  1. Type your search expression in an Emacs buffer of your choice and set the region (characters between the Emacs point and mark) around it (this can be done by selecting the expression with the left mouse button).

    A search expression consists of regular expressions combined with optional parentheses and the following operators: & (logical and), | (logical or), ! (logical not) and near expressions. For instance, foo ~4 bar ~2 baz ~3 quux is a near expression that matches if and only if ‘foo’ and ‘bar’ are found within at most 4 words of each other, this occurrence of ‘bar’ being found within 2 words of an occurrence of ‘baz’, itself found within 3 words of ‘quux’. Simple words and quoted groups of words "like this" are valid regular expressions. Example:

              swallow ~4 carry ~3 coconut & "air-speed velocity" & !European

    which is the same as:

              (swallow ~4 carry ~3 coconut) & "air-speed velocity" & (!European)

    By the way, such expressions would probably be more useful with flags making the regular expressions case-insensitive. See section 2, “Search expression syntax”, for more information.

    Note: after being successfully parsed, a search expression is converted into a tree, and this tree is optimized to maximize the search speed for this expression. This is done by assigning a weight to each subexpression of the search expression (a regular expression has a weight related to its length; for a near expression, the weights of its elements are multiplied together; for the & and | operators, the weights of the operands are added, etc.). Then, the order in which the operations are performed is chosen so that the “lighter” operations are more likely to be performed than the “heavier” ones. The basic idea: in an expression such as a & b & c, if a is false, then neither b nor c needs to be evaluated, because the whole expression is false. Similar considerations apply to the | operator and to near expressions.

  2. Type M-x tdbsf, which will prompt you for all the other information needed, namely:

The results are presented in a summary buffer named *TDBSF Results Summary*, which is put in TDBSF Summary mode. There is one reference to a matching record per line, displaying the name of the file where it was found and the first line of its header. Just hit <RET> on a line in this buffer (or click with the middle mouse button) and another buffer will pop up, visiting the file and showing the record in TDBSF Database File mode, where some highlighting is done to show why the record matched [2].

Sometimes it may be surprising that not all matches of a regexp that is part of the search expression are highlighted in a matching record. This is expected, because the way search expressions are evaluated in each record is optimized for speed. For instance, suppose the search expression is "cat%I" | "donkey%I" | "turkey%I" and the search engine looks for matches of "cat%I" first (which is likely, since it is the shortest of the three regular expressions). If such matches are found, the two other animals won't be searched for at all, much less highlighted, since it is logically sufficient that "cat%I" matches for the whole search expression to match. Moreover, if the search string is header or body (remember the four options) and the search expression matches the header of a given record, then the body of that record won't be searched at all, and therefore won't be highlighted in any way.
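The short-circuit behavior just described can be illustrated with a minimal sketch. It is not the engine's real optimizer: the functions weight and match_any are hypothetical, the weight is simply taken to be the pattern length, and re.IGNORECASE stands in for the %I flag.

```python
import re


def weight(pattern):
    # Assumption: the weight of a regexp simply grows with its length.
    return len(pattern)


def match_any(patterns, text):
    """Evaluate p1 | p2 | ... lightest first, stopping at the first match.

    Returns (matched, tried), where tried lists the patterns that were
    actually searched for, in evaluation order.
    """
    tried = []
    for pattern in sorted(patterns, key=weight):
        tried.append(pattern)
        if re.search(pattern, text, re.IGNORECASE):
            return True, tried
    return False, tried


text = "The cat chased the turkey past a donkey."
matched, tried = match_any(["donkey", "turkey", "cat"], text)
print(matched, tried)
```

In this sketch only "cat" is ever searched for, which is why an interface relying on the recorded matches would have nothing to highlight for the other two animals.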

In TDBSF Summary mode, you can type C-c C-n or C-c C-p to jump to the next or previous matching record, respectively, visiting a new file if needed.

The following two paragraphs deal with details that can be skipped on the first reading of this manual.

Emacs markers are set in TDBSF Database File mode so that inserting or deleting text in the file will not cause any problem when jumping to the previous/next record or to any record of that file from the *TDBSF Results Summary* buffer. Also, the fontification is performed with overlays, which also use markers to specify their beginning and end; the fontification is therefore not disrupted by insertions or deletions in a buffer in TDBSF Database File mode.

Because of these mechanisms, TDBSF refuses to use a buffer (in order to display a record) if it is marked as modified. This is because information such as the positions where each record starts is stored in buffer-local variables when TDBSF performs a search, and becomes outdated when buffers visiting TDBSF database files are modified. If you edit a database file and want accurate jumps from the *TDBSF Results Summary* buffer to the modified buffer (through <RET> or middle click), the simplest way is to save the modified buffer and launch the search again.

2 Search expression syntax

A search expression is an expression which can be checked against a string (e.g. the body of a record) to see if it matches that string or not.

It is a combination of simple elements that match or don't match the string, using the boolean and (&), or (|) and not (!) operators as well as parentheses to force the order in which the operations are performed (a match is considered as a boolean “true” and a non-match as a “false”). In the TDBSF terminology, the simplest of these elements are called atoms and are either regular expressions or near-expressions (near-expressions will be defined later).

Any whitespace (spaces, tabulations or newlines) between atoms and operators is ignored and can therefore be used to improve the readability of a search expression.

2.1 Regular expressions in a TDBSF search expression

The regular expression syntax used by the TDBSF is that of the “re” module of the Python programming language, which is documented in the Python Library Reference. This syntax is very powerful and therefore somewhat complex; it cannot be detailed here, as that would only result in a useless, inferior duplicate.

There are two ways to embed a re-module-style regular expression (regexp) in a TDBSF search expression:

2.2 Near expressions

A near expression is a combination of two or more regular expressions following the syntax:

     r1 ~p1 r2 ~p2 ... ~pn-1 rn

where r1, ..., rn represent regular expressions and p1, ..., pn-1 integers (written in decimal notation).

Such an expression is said to match if and only if a match of r1 is found within p1 words of a match of r2, itself found within p2 words of a match of r3, etc. The anchors used to count words are at the beginning of words (a single regexp can match several words). Example:

     nice ~2 cat

This will not only match “nice cat”, but also “nice little cat”, “nice black cat” and “cat is nice”; it will even match “nice mean cat” (!). However, it won't match “nice black and white cat”, nor “nice fat black cat”. Another example:

     nice ~4 little ~1 cat

This will of course match “nice little cat”, but also “nice black and white little cat”, “nice, little cat”, “cat little and nice” and many others. A final example:

     "knights%I" ~3 say ~2 "Ni|Nee%I" ~10 "shrubber(y|ies)"

In addition to the constraints defined by the ~3, ~2 and ~10 operators, this search expression will accommodate some uncertainty in the spelling of “Ni” versus “Nee” (or any variation in case) or “shrubbery” versus “shrubberies”.

Any whitespace (spaces, tabs or newlines) between the regexps and the ~pk operators is ignored and can therefore be used to improve the readability of a near expression.

For now, a word is defined as a contiguous sequence of characters that are either a hyphen (‘-’) or underscore (‘_’), or for which the Python unicodedata.category function returns a string that starts with “L” (letter) or “N” (number).
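To make the word definition and the near-expression rule concrete, here is a minimal sketch. It is not the engine's implementation: the function names are hypothetical, and it assumes that a match is anchored at the last word beginning at or before the first character of the match, and that “within p words” means that the word indices of the two anchors differ by at most p (which agrees with the examples given above).

```python
import bisect
import re
import unicodedata


def is_word_char(ch):
    # Hyphen, underscore, or any Unicode letter (L*) or number (N*).
    return ch in "-_" or unicodedata.category(ch)[0] in "LN"


def word_starts(text):
    """Return the offsets at which words begin."""
    starts, in_word = [], False
    for i, ch in enumerate(text):
        now = is_word_char(ch)
        if now and not in_word:
            starts.append(i)
        in_word = now
    return starts


def near_match(text, regexps, distances):
    """Test r1 ~p1 r2 ~p2 ... rn against text (simplified sketch)."""
    starts = word_starts(text)

    def anchors(regexp):
        # Word indices at which some match of regexp is anchored.
        found = set()
        for m in re.finditer(regexp, text):
            idx = bisect.bisect_right(starts, m.start()) - 1
            if idx >= 0:
                found.add(idx)
        return found

    reachable = anchors(regexps[0])
    for regexp, dist in zip(regexps[1:], distances):
        # Keep only the matches lying close enough to a match of the
        # previous regexp that was itself kept.
        reachable = {
            j for j in anchors(regexp)
            if any(abs(j - i) <= dist for i in reachable)
        }
    return bool(reachable)


print(near_match("nice little cat", ["nice", "cat"], [2]))
print(near_match("nice fat black cat", ["nice", "cat"], [2]))
```

The first call corresponds to nice ~2 cat tested on a string where the two words are two words apart; the second to the same expression on a string where they are three words apart.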

2.3 Operator precedence

The ~pk operator used in near expressions has the highest priority. Then comes the ! operator, then the & and finally the |.

In other words, near expressions grab as many regexps as possible, then ! negation operators apply directly to the expression that follows them. In terms of priority, & and | behave like the multiplication and the addition of numbers, respectively.

If you would rather not think about such questions, just use parentheses to force the order of evaluation!
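The precedence rules can be checked with a small recursive-descent sketch that makes the implicit grouping explicit by adding parentheses. This is a simplified illustration, not the TDBSF parser: the names tokenize and parenthesize are hypothetical, and quoted regexps are treated as opaque "..." strings without any escape sequences.

```python
import re

# A token is a ~N operator, one of & | ! ( ), or an atom
# (a bare word or a "quoted" regexp, treated as opaque here).
TOKEN = re.compile(r'\s*(?:(~\d+)|([&|!()])|("[^"]*"|[^\s&|!()~"]+))')


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.rstrip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1) or m.group(2) or m.group(3))
        pos = m.end()
    return tokens


def parenthesize(expr):
    """Return expr with the implicit grouping made explicit."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def is_atom(tok):
        return tok is not None and tok[0] not in "~&|!()"

    def group(parts, sep):
        return parts[0] if len(parts) == 1 else "(" + sep.join(parts) + ")"

    def parse_binary(op, parse_operand):
        parts = [parse_operand()]
        while peek() == op:
            take()
            parts.append(parse_operand())
        return group(parts, f" {op} ")

    def parse_or():                       # lowest priority: |
        return parse_binary("|", parse_and)

    def parse_and():                      # then &
        return parse_binary("&", parse_not)

    def parse_not():                      # then !
        if peek() == "!":
            take()
            return "(!" + parse_not() + ")"
        return parse_primary()

    def parse_primary():                  # highest priority: ~N
        if peek() == "(":
            take()
            inner = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return inner
        if not is_atom(peek()):
            raise ValueError("regexp expected")
        parts = [take()]
        while peek() is not None and peek().startswith("~"):
            parts.append(take())
            if not is_atom(peek()):
                raise ValueError("regexp expected after ~N")
            parts.append(take())
        return group(parts, " ")

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(parenthesize('swallow ~4 carry ~3 coconut & "air-speed velocity" & !European'))
print(parenthesize("a | b & !c ~2 d"))
```

The nesting of the four parse functions mirrors the priority order: the function for the lowest-priority operator is called first and delegates to the next one down.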

2.4 A complex search expression example

Here is a search expression example using all the features available (well, not all available regarding the re-module-style regexps!):

     ("bridge of death%I" | "gorge of eternal peril%I") & bridgekeeper
     & (three ~3 questions) & !hesitat

No, this is not a really useful example. And yes, the parentheses around three ~3 questions are only there to improve readability.

3 International

Since version 2.0, TDBSF has supported international character sets and encodings in database files. This is done through the use of Unicode in the search engine. The search expression can of course contain any Unicode character supported by Python's “re” module.

In order to use non-ASCII characters in a database file, the encoding must be specified in an encoding declaration (this is a sane practice that has long been common among Emacs users and is now also required for Python source files, except when using the UTF-8 encoding). TDBSF refuses to work with a database file that contains non-ASCII characters and has no encoding declaration. (Actually, a UTF-8 Byte Order Mark is accepted as a substitute for an encoding declaration, but this practice is discouraged.)

The encoding declaration uses a format that is recognized by both Emacs and Python. It consists of a single line at the beginning of each database file, such as the following:

     -*- coding: utf-8 -*-

(no need to put any space before the magic delimiter -*-)

This declaration line should be followed by a blank line, then by the header of the first record of the database file. For instance:

     -*- coding: utf-8 -*-

     Header of the first record
     As many lines as necessary
     Body of the first record
     Header of the second record
     Body of the second record

Any encoding supported by Python can be used in the declaration, as long as it is ASCII-compatible (e.g., iso-8859-1, iso-8859-15, utf-8): this is necessary, because the line specifying the encoding is read before the encoding is known!
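The reason for the ASCII-compatibility requirement can be seen in a minimal sketch of how such a declaration might be read. This is an assumption-laden illustration, not the TDBSF code: detect_encoding and decode_database are hypothetical names, and the regular expression is a loose approximation of the declaration format.

```python
import codecs
import re

# The declaration is searched for in raw bytes, before any decoding,
# which only works if the declared encoding is ASCII-compatible.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def detect_encoding(raw):
    """Return the declared encoding of a database file, or None."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    first_line = raw.split(b"\n", 1)[0]
    m = CODING_RE.search(first_line)
    return m.group(1).decode("ascii") if m else None


def decode_database(raw):
    encoding = detect_encoding(raw)
    if encoding is None:
        # No declaration: only pure ASCII content is acceptable, so
        # this raises UnicodeDecodeError on any non-ASCII byte.
        return raw.decode("ascii")
    return raw.decode(encoding)


data = "-*- coding: iso-8859-15 -*-\n\nCafé header\nbody\n".encode("iso-8859-15")
print(detect_encoding(data))
print(decode_database(data).splitlines()[2])
```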

The list of valid encodings can be found at html/lib/standard-encodings.html in the documentation shipped with your Python installation (the Python 3 documentation is also available online). Of course, if you use Emacs to view or edit the database files, you should make sure that the encoding name you choose is also recognized by Emacs. That is why utf-8 is preferable to utf_8. Finding a suitable name is usually easy, because both Python and Emacs recognize a few aliases for common encodings, and Python treats as equivalent the spellings of an encoding name that differ only in case or in using a hyphen instead of an underscore.
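Python's tolerance for spelling variants can be verified with the standard codecs.lookup function, which resolves an encoding name or alias to a codec. The following short check is only an illustration; it says nothing about which spellings Emacs accepts.

```python
import codecs

# Spellings that differ only in case or in hyphen versus underscore
# resolve to the same codec in Python.
for name in ("utf-8", "utf_8", "UTF-8"):
    print(name, "->", codecs.lookup(name).name)
```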

Different database files may use different encodings; everything is converted to Unicode in the engine before it starts searching through the contents of each file.

The command-line arguments of tdbsf-front-end are automatically decoded by Python from the user's “preferred encoding” (cf. locale.getpreferredencoding). Its output is also encoded in the same encoding (tdbsf-front-end uses the default behavior of Python 3 with respect to encoding).
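To see which encoding Python would use on a given system, the following short snippet can be run in the same environment as tdbsf-front-end. It only uses the standard locale and sys modules and is independent of the TDBSF itself.

```python
import locale
import sys

# The encoding Python derives from the user's locale settings.
preferred = locale.getpreferredencoding()
print("preferred encoding:", preferred)

# The encoding actually used for standard output in this process.
print("stdout encoding:", sys.stdout.encoding)
```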

The Emacs interface offers three variables to specify the encoding it uses when communicating with tdbsf-front-end: tdbsf-coding-system-for-read, tdbsf-coding-system-for-write (both defaulting to 'utf-8) and tdbsf-pythonioencoding (which defaults to "utf-8:strict"). It should not be necessary to modify these variables, as UTF-8 can encode any Unicode character.

4 Using the search engine from a Python program

The Python tdbsf package provides a Crawler class that allows any Python program to use the TDBSF search engine. See the file for details.

5 Future improvements

It should be easy to improve the method of specifying a set of database files, for instance by reading a set of glob-module-style patterns or by allowing directories to be explored recursively. The main changes would affect the interfaces. If I get any request, I'll probably improve this aspect.
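As a rough idea of what such an improvement could look like, here is a sketch based on the standard glob module. It is not part of the TDBSF: the function expand_database_patterns is hypothetical, and the "**" recursive syntax is simply the one offered by glob.glob with recursive=True.

```python
import glob
import os


def expand_database_patterns(patterns):
    """Return the sorted, de-duplicated files matching any pattern.

    With recursive=True, a "**" path component matches any number of
    nested directories (including none).
    """
    files = set()
    for pattern in patterns:
        for path in glob.glob(pattern, recursive=True):
            if os.path.isfile(path):
                files.add(os.path.normpath(path))
    return sorted(files)


# Example: every .txt file under the current directory, at any depth.
print(expand_database_patterns([os.path.join("**", "*.txt")]))
```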

[1] No other interface is really functional as of February 2013, but the TDBSF is written so that adding interfaces is as easy as possible. :-)

[2] The fontification process tries to reflect the structure of the branch of the search expression that triggered the match by for instance giving all the elements of a matching near expression the same face (a face in Emacs is a combination of a font, a color and other such attributes).

[3] The combination of the escape character and the following one forms an escape sequence.