LogolMatch

Introduction

LogolMatch is a Logol interpreter and a pattern search tool. It takes as input a biological sequence (DNA, RNA or protein) and a grammar file. The grammar is a Logol grammar that describes a pattern to be found in the input sequence. LogolMatch analyses the grammar, then generates and executes a program that matches the pattern against the sequence. It returns a result file containing the matches with all required details.

Logol is a highly descriptive language dedicated to biological sequence analysis. It describes sequence information together with mutations and morphisms. Models and variables express a specific pattern and can be used to find repetitions of a pattern along the sequence. The language definition comes from a paper by the Symbiose team (http://www.irisa.fr/symbiose/) at INRIA: "Logol : Modelling evolving sequence families through a dedicated constrained string language".

Each variable can be defined with a number of errors or a distance. It can also be transformed before the match is tested, for example to obtain its reverse complement.

Logol also takes into account the parental information between two variables. Finally, it can manage overlaps between matches.
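The two ideas above, an error tolerance on a variable and a reverse-complement transformation applied before matching, can be illustrated with a minimal Python sketch. This is not LogolMatch code and not Logol syntax: the function names are hypothetical, and "errors" is assumed here to mean substitutions only (a Hamming distance), which is a simplification.

```python
# Illustrative sketch only; not part of LogolMatch. All names are hypothetical.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_with_errors(sequence, motif, max_errors):
    """Return 0-based start positions where motif matches the sequence
    with at most max_errors substitutions (assumption: no indels)."""
    width = len(motif)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if hamming(sequence[start:start + width], motif) <= max_errors
    ]


if __name__ == "__main__":
    # Search for the reverse complement of "CGT", which is "ACG".
    motif = reverse_complement("CGT")
    print(find_with_errors("ACGTACCT", motif, 0))  # exact matches only
    print(find_with_errors("ACGTACCT", motif, 1))  # one substitution allowed
```

With no error allowed, only the exact occurrence at position 0 is reported; allowing one substitution also reports the window "ACC" at position 4.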

LogolMatch is open source and free to use, but it relies on other tools that may require fees or specific license agreements:

- SICStus Prolog (only required to compile on an unsupported architecture)

- Vmatch tool (The VMATCH large scale sequence analysis software [http://www.vmatch.de], S. Kurtz)

Running modes

The software is designed to run on a single computer, with one or several CPUs, or on a cluster. It uses a configurable number of CPUs to process work in parallel when possible. When several sequences must be analysed with the same grammar (against a personal or public bank), the program can either run the analyses one after the other or dispatch them to a cluster through the Java DRMAA library.

Model components

Different components are in charge of specialized functions. Some of them run in parallel, while others synchronize them.

Illustration 1: LogolMatch main components

 

  1. The grammar interpreter reads a grammar file, or a model file, and generates an executable ready to search for the pattern on an input sequence, i.e. the sequence parser. The grammar file can also be a model grammar built with the model designer.

  2. The multi-sequence manager takes a group of sequences as input and runs an analysis for each sequence. Depending on the mode, sequences are analysed sequentially or sent as cluster jobs. The multi-sequence manager is built upon a job manager interface, which makes it easy to add new managers. If only one sequence is used, the program can be called directly without the multi-sequence manager.

  3. The Job Manager analyses the sequence and some configuration parameters to determine whether the file can be split to run several analyses in parallel. Depending on the outcome, one or several processes are executed, each generating its own result file. Finally, the Job Manager reassembles these files into a single result file.

  4. The result file can then be studied in a graphical analyser.
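The split, parallel run and reassembly performed by the Job Manager can be sketched as follows. This is only an illustration under stated assumptions: the real tool launches separate Prolog processes and merges result files, whereas this sketch uses Python threads and in-memory lists. The function names, the exact-match worker and the choice of an overlap of (pattern length - 1) characters between chunks are all hypothetical.

```python
# Illustrative sketch only; not the LogolMatch Job Manager. Names are hypothetical.
from concurrent.futures import ThreadPoolExecutor


def split_with_overlap(length, chunk_size, overlap):
    """Return (start, end) windows covering [0, length).
    Consecutive windows share `overlap` characters so that a match
    spanning a cut point is still fully contained in one window."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, length)
        windows.append((start, end))
        if end == length:
            break
        start = end - overlap
    return windows


def search_chunk(sequence, motif, start, end):
    """Stand-in for one worker: exact search inside sequence[start:end],
    with positions reported in whole-sequence coordinates."""
    chunk = sequence[start:end]
    hits = []
    pos = chunk.find(motif)
    while pos != -1:
        hits.append(start + pos)
        pos = chunk.find(motif, pos + 1)
    return hits


def parallel_search(sequence, motif, chunk_size, workers=2):
    """Split the sequence, search each chunk in parallel, merge the results."""
    if not motif:
        raise ValueError("motif must not be empty")
    windows = split_with_overlap(len(sequence), chunk_size, len(motif) - 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda w: search_chunk(sequence, motif, *w), windows)
        # Reassemble: one sorted list, duplicates removed.
        return sorted({hit for hits in partial for hit in hits})


if __name__ == "__main__":
    print(parallel_search("ACGTACGTACGT", "GTA", chunk_size=5))
```

The overlap matters: without it, a match lying across a cut point would be missed by every worker. A pattern with variable length or allowed errors would need a larger overlap than the one assumed here.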

The main program, written in Java, is responsible for interpreting the grammar and for coordinating the work among the different sequences or sub-sequences (if they can be cut). It executes sub-programs written in Prolog. Each sub-program parses one sequence as a string of alphabet characters and matches the defined patterns against it.

The result analyser is a separate program, independent of the match analysis. It takes an XML file as input and offers a graphical interface to inspect the results. The XML file can also be used directly if required.

Physical implementation

 

Logol structure

The Logol grammar is structured as follows:

Grammar structure

Model designer

A model designer is also available, via a web browser, to graphically build a grammar and submit it to the tool.

Logol Analyser

The analyser displays the list of matches in a web browser. It can sort the results, export a selection of them in FASTA format, or analyse the structure of a specific match.

The program keeps the whole pattern information of each match. This makes it possible to know which sub-part of the matched sequence corresponds to which part of the original pattern. For example, for a pattern like:

(X, Y)

The result file provides the match for (X, Y) as a whole, but also gives information for X and for Y separately. Everything is saved in a tree structure to keep track of the whole information.
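A tree of this kind can be sketched in Python as follows. The class and field names are hypothetical and do not reflect the actual XML schema of the result file; the example only shows how a match for (X, Y) can keep the sub-matches for X and Y as child nodes.

```python
# Illustrative sketch only; field names are hypothetical and do not
# reflect the actual LogolMatch result format.
from dataclasses import dataclass, field


@dataclass
class MatchNode:
    name: str    # part of the pattern this node corresponds to
    start: int   # 0-based start position in the sequence
    end: int     # end position, exclusive
    children: list = field(default_factory=list)

    def text(self, sequence):
        """Return the part of the sequence covered by this node."""
        return sequence[self.start:self.end]

    def walk(self, depth=0):
        """Yield (depth, node) pairs, parent first, then children in order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


if __name__ == "__main__":
    sequence = "TTACGGATCC"
    x = MatchNode("X", 2, 5)
    y = MatchNode("Y", 5, 8)
    root = MatchNode("(X, Y)", 2, 8, [x, y])
    for depth, node in root.walk():
        print("  " * depth + node.name, node.start, node.end, node.text(sequence))
```

Walking the tree from the root gives the whole match first, then each sub-match with its own coordinates, which is the kind of per-part information the analyser can display.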