(Alignment of Multiple Protein Sequences)
by Geoff Barton
Laboratory of Molecular Biophysics
South Parks Road
Oxford
SEE ALSO THE AMPS Notes from the AMBO training course held at PERI, Osaka Japan in March 1994. It highlights the basic steps for multiple alignment using the AMPS package.
Introduction to Version D 2.0
-----------------------------
This is the first major update to the AMPS package since its initial
distribution in 1988 (Version D 1.0). No major changes have been made to the
basic multiple alignment functions, except the addition of a dendrogram drawing
facility in programme ORDER and the provision of identifier codes of up to 20
characters (including the '>' symbol) (program PATT is limited to 10
characters). Existing users should not experience problems - I hope... The
major addition is the availability of the flexible pattern matching and
database scanning features of the programmes described in Barton, 1990 and
Barton & Sternberg 1990. Sections have been added to this guide to explain how
these features may be used.
All development of this package is now carried out on a Sun Microsystems SPARCstation under the Unix operating system. The most stable version is therefore the SPARCstation one. With this release is the first option for Silicon Graphics Personal Iris and compatibles. The core multiple alignment programs will compile under VAX/VMS - see the instructions in BUILD.COM.
If you have problems with the programmes and your local Guru cannot solve them, then e-mail me on gjb@bioch.ox.ac.uk stating exactly what the problem is, together with example files.
RELATED PROGRAMS
----------------
ALSCRIPT: Takes the output of AMPS (block file format) and allows prettyfication
of the alignment: boxing, shading etc.
AMAS: Performs a hierarchical analysis of a multiple alignment.
STAMP: Does multiple alignment of protein three dimensional structures.
See our anonymous ftp server for further details of these programs.
ftp geoff.biop.ox.ac.uk
login as ftp
password your email address.
----- oOo -----
CONTENTS
Introduction to Version D 2.0:
Background:
Introduction: Needleman and Wunsch Algorithm.
Predicting the likely quality of alignment.
Multiple alignment strategy.
Description of the AMPS package:
MULTALIGN: (1) Pairwise alignment.
Time considerations.
(2) Multiple alignment just using sequence information
(a) Single order alignment.
(b) Tree based alignment.
(c) Multiple alignment - building on an existing alignment.
Examples:
(a) Aligning without additional information.
(b) Secondary structure dependent gap penalties.
(c) Defining explicit weights to align ONE extra sequence.
Lookup tables.
ORDER:
FLEXIBLE PATTERN MATCHING AND DATABASE SCANNING:
Introduction:
PROGRAM LIMITATIONS:
1. Defining a Pattern.
2. Defining the scoring scheme.
3. Alternative scoring schemes.
Sorting the results of a scan using PROGRAM SORTER.
1. To sort the output file and store the entire sorted list.
2. To sort the output and store only the top N scoring results.
3. Extracting the IDentifier codes for the top scoring sequences.
4. Specifying a cutoff score.
5. Generating Prolog Clauses.
6. Generating a Histogram of the score distribution.
7. Getting Help
Extracting Sequences from a PIR Database Using PROGRAM SELECT.
Generating Alignments of the Pattern With Interesting Sequences.
Generating multiple pattern alignments with each sequence.
APPENDIX I: Summary of Commands for MULTALIGN.
APPENDIX II: MULTALIGN File Formats.
APPENDIX III: VAX/VMS installation guide.
APPENDIX IV: Sun/Unix installation guide.
References.
Known Bugs.
--- oOo ---
Background
----------
AMPS is a suite of programs designed for the multiple alignment of protein sequences and flexible pattern matching. Initial development of the algorithms programmed in this suite was performed whilst I was in the Department of Crystallography at Birkbeck College, University of London. The programs as distributed are the result of refinements and improvements on the the original Birkbeck programs during my time at the Imperial Cancer Research Fund Laboratories in London and further development at the Laboratory of Molecular Biophysics, Oxford. These programs are available for academic, teaching and non-commercial purposes. For commercial use, a commercial licence is required. Please contact the author at the above address for details.
Introduction
------------
Needleman and Wunsch Algorithm:
One of the most widely applied sequence alignment algorithms is the method of Needleman and Wunsch (1970). When supplied with two protein sequences and a matrix of pair-scores (eg. an identity matrix, or Dayhoff's mutation data matrix) the algorithm identifies the best alignment(s) between the two sequences and a score for the alignment. Whilst this approach may be used with virtually any pair of sequences, extending the method to more than 3 sequences is impractical. This computer package implements a heuristic approach to multiple sequence alignment that has been shown to be effective for many protein families. (See Barton and Sternberg 1987b for a description of the algorithm and assesment of accuracy, See Zvelebil et al, 1987, for an application in secondary structure prediction)
Predicting the likely quality of alignment:
Barton and Sternberg (1987a,b) considered families of proteins for which the three dimensional structure of at least two members had been solved by x-ray crystallography. The alignment obtained by comparing the tertiary structures was used as a 'standard' against which to test the Needleman and Wunsch (1970) sequence alignment algorithm.
In summary; if the significance score obtained for the pairwise comparison of two sequences is greater than 5-6 S.D. then there is a high probability that the sequences will be correctly aligned over most of the regions that belong to core secondary structures. Whilst this observation does not mean that alignments of sequence pairs scoring less than 5.0 S.D. are meaningless, alignments in this region should be treated with caution and additional evidence in support of the particular alignment should be sought. Conversely, if two sequences score greater than 15.0 S.D. then the alignment is likely to be correct over most of its length.
Multiple alignment strategy:
1. All pairwise comparisons between the sequences are performed. The information obtained may be used to construct a dendrogram that visualises the groups of similar residues.
2. Given a group of similar sequences, the order in which they should be aligned is determined (ie most similar pair through to least similar).
3. The multiple alignment algorithm is then applied to the sequences. Firstly the most similar pair are aligned, then the next most similar sequence is aligned to the alignment of the most similar pair. Then the next most similar sequence is aligned and so on..... An optional process is to re-align each sequence with the completed alignment in order to refine the alignment.
Description of the AMPS package
-------------------------------
Version D2.0 consists of the following programs.
MULTALIGN (AMULT/FSCAN) performs a number of functions. These include pairwise sequence alignment and assessment of statistical significance, multiple alignment by the methods of Barton and Sternberg (1987b) and a method similar to that of Feng and Doolittle (1987), and additional functions that allow the inclusion of variable gap-penalties and specific weighting schemes. Each aspect of this program and its use is described below. ORDER uses the output from a MULTALIGN pairwise run to perform cluster analysis (tree generation), ordering of the sequences by similarity and general format changes.
DAYB takes a sequence in seq_file format and creates a file in block_file format. Just type DAYB and follow the instructions.
SORTER takes the output of a database scan and sorts the results into descending-score order. SORTER also allows construction of histograms and extraction of the identifying codes for the top scoring proteins. SELECT takes a list of protein identifier codes and a sequence database and extracts the sequences from the database that correspond to the codes. Unlike some programs (e.g. PSQ) the sequences are output in the same order as the codes are presented. SELECT is normally used in conjunction with SORTER. PATT Processes the output of a flexible pattern analysis to produce a more readable output.
----- oOo -----
M U L T A L I G N
-----------------
Instructions for use under VAX/VMS
(For installation guide see Appendix 3)
(SUN Unix users see appendix 4)
############ IMPORTANT ###############
THE EXAMPLES SHOWN IN THIS MANUAL INCLUDE VAX/VMS COMMANDS. SIMPLY DELETE
THE LINES STARTING WITH THE DOLLAR SIGN FOR USE WITH UNIX.
UNIX USERS: PLEASE SEE THE APPENDICES!!
I HOPE TO RE-WRITE THIS GUIDE WITH A UNIX FLAVOUR SOMETIME SOON...
######################################
The executable image for MULTALIGN is contained in the file AMULT.EXE.
The program is run from a command file that contains a series of keyword
commands that tell the program which files to use for what, how to format the
output and above all, which type of alignment to perform.
Keyword commands are all followed by an equals sign '=' which may in turn be
followed by a string of arguments in free format. The precise formats for each
command are described in APPENDIX 1. Keyword commands may be in upper or lower
case or mixed case. See APPENDIX 2 for a full description of input/output file
formats.
(1) Pairwise Alignment
-----------------------
An example command file for this mode is globin_pairs.com. This is shown
below:
Commands Explanation number
-------- ------------------
$set default direc 1.
$multalign 2.
output_file=globin_pairs.out 3.
mode=pairwise 4.
matrix_file=ampsdir:md.mat 5.
pairwise_random=100,100,1 6.
gap_penalty=8.0 7.
constant = 8 8.
seq_file=globin.seq 9.
1. Instruction to the VAX/VMS operating system to change the current working
directory to the one named 'direc'. You should replace this with the
name of the directory in which you are currently working.
2. Instruction to VMS to run the program MULTALIGN which resides in the
directory defined by the logical name ampsdir.
3. Command to amult. set the file for output of results to 'globin_pairs.out'
This command must ALWAYS be the first. Note: The output file should NOT be set
to the log file name.
4. Specify that pairwise mode is to be used. If this command is not included,
the program defaults to mode=multiple.
5. Define the matrix file to be used in the comparisons. This file is the
Dayhoff mutation data matrix or similar file containing pairscore values for
each amino acid pair.
6. Specify that 100 randomizations of each sequence pair are to be performed
in order to estimate the statistical significance of the alignment obtained.
7. define a gap penalty of 8.0
8. define a constant of 8 to be added to the matrix defined in 5.
9. specify the file containing the sequences to be aligned pairwise.
This command file causes all pairwise comparisons to be performed on the
sequences in the file 'globin.seq'. In other words, for the 7 sequences
sequence 1 is aligned with 2, 1 with 3 and so on. For N sequences there are
N*(N-1)/2 comparisons performed.
For each sequence pair (eg. 1 and 2), a full Needleman and Wunch sequence
comparison is performed. Then the sequences are shuffled and recompared ( in
this case 100 times) in order to find the expected distribution of scores that
would be obtained if the sequences were unrelated but have the same length and
composition as 1 and 2. Various statistics on the comparisons are then output
to the specified output_file (globin_pairs.out).
The output file from this sequence alignment run contains the following
information:
A banner describing the program name, source and limitations of use.
Information on the maximum length and number of sequences allowed.
Information on the commands specified to the program, files etc.
A list of the sequences defined in the sequence file.
Finally, a set of numerical results of the run in a series of fields as
follows.
Field Description
----- -----------
1 and 2 number describing the sequences aligned on this row.
3 and 4 The lengths of the sequences described in 1 and 2.
5 The match score obtained for the comparison of the two
sequences.
6 The number of internal gaps in the alignment (overhangs
at the ends are not counted)
7 The number of positions at which two amino acids are
aligned.
8 The number of positions at which identical amino acids are
aligned
9 Percentage identity (8/7)
10 Normalised Alignment score - The match score divided by
the number of aligned positions * 100. (5/7)*100.
11 Alternative Normalised Alignment score - The match score
minus the number of gaps times the gap penalty all divided
into the number of aligned positions.
(7/((5-6)*gap_penalty))*100.
12 Number of randomizations performed
13 mean score for the randomizations
14 standard deviation of the random scores
15 Significance score for the alignment.
Given by the mean random score minus the match score all
all divided by the standard deviation.
(13-5)/14.
16 Comparison number.
This output file is in the correct format for input by the program ORDER.
Clearly, however the actual alignments have not been output. If the
alignments are required you must include the command line(s)
print_horizontal=
(for horizontal format)
print_vertical=
(for vertical format)
Note that a file containing alignments cannot be directly read by ORDER.
(the alignments would first have to be deleted using a text editor).
Time considerations:
--------------------
Performing randomizations for all pairs of sequences can be very time
consuming, particularly if a large number of sequences are to be compared prior
to multiple alignment. A good guide to the relative similarity between the
sequences is given by the normalised alignment scores. These do not require
randomization in order to calculate and so considerably speed the process.
Simply delete the line 'pairwise_random' from the command file to avoid
performing randomizations.
----- oOo -----
(2) Multiple Alignment
Just using sequence information
-------------------------------
(a) Single order alignment
--------------------------
This is the method described in Barton and Sternberg (1987b). The sequences
are aligned progressively in one order. If the sequences are all clearly
similar to each other - say > 50% identity, then the actual order will make
very little difference to the final alignment. In general however, it is
better to establish the order by first performing all pairwise comparisons for
the sequences as described in the previous section, then using program ORDER
to define an order (described below) before proceeding to multiple alignment.
The ORDER program generates an order_file. This specifies the order in which
the sequences present in the seq_file are to be aligned. A typical command
file for multiple alignment is given in globin_mult.com and shown below.
Commands Explanation number
-------- ------------------
$set default ampsdir 1.
$multalign 2.
output_file=globin_mult.out 3.
mode=multiple 4.
matrix=file=ampsdir:md.mat 5.
gap_penalty=8.0 6.
constant = 8 7.
consplot=mz 8.
print_vertical= 9.
seq_file=globin.seq 10.
order_file=globin.ord 11.
1. Instruction to the VAX/VMS operating system to change the current working
directory to the one named 'ampsdir'. You should replace this with the
name of the directory in which you are currently working.
2. Instruction to VMS to run the program MULTALIGN.
3. Command to MULTALIGN. set the file for output of results to
'globin_mult.out' This command must ALWAYS be the first. Note: The output
file should NOT be set to the log file name.
4. Specify that multiple mode is to be used. If this command is not included,
the program defaults to mode=multiple.
5. Define the matrix file to be used in the comparisons. This file is the
Dayhoff mutation data matrix or similar file containing pairscore values for
each amino acid pair.
6. define a gap penalty of 8.0
7. define a constant of 8 to be added to the matrix defined in 5.
8. optional request to perform a conservation analysis using the algorithm of
Zvelebil et al (1987) on the resulting alignment (only works if the
print_vertical command is also present).
9. specify vertical format output
10. define the file containing sequences to be aligned as 'globin.seq'
11. define the order file obtained from running theprogram ORDER on the
results of a pairwise sequence comparison run. If this command is absent then
the program aligns the sequences in the order that they are present in the
seq_file. Note. the order_file command must always appear after the seq_file
command.
{Optionally a process of iteration can be performed. Once all the sequences
have been added to the alignment, the first sequence can be realigned with the
ALIGNMENT of the sequences 2-N. Then the second sequence is aligned with 1,3-N
and so on. To specify that iteration should be performed include the command
'iterations=int' where int is an integer greater than 0. In general iteration
values of 3 can refine the alignments slightly and are of greatest use if there
are relatively few sequences to be aligned (say <10).}
The result of a run using this command file is illustrated in the file
globin_mult.out.
(b) Tree based alignment
------------------------
This option allows the alignment to be performed in a manner similar to that
described by Feng and Doolittle (1987). Rather than using a single order, the
sequences are aligned according to a defined evolutionary tree. In my
implementation, the tree can be obtained by running program ORDER, this
produces two files, an order_file and a tree_file. The order file specifies an
order that preserves the clustering of the sequences by similarity. The
tree_file gives specific instructions to the program as to which comparisons to
make and in which order.
For example, given four sequences A,B,C,D. the tree_file may specify that A
and B should be aligned first, then C with D and finally the AB alignment is
aligned with the CD alignment using average scores at each amino acid position.
Example command file. globin_mult_tree.com
Commands Explanation number
-------- ------------------
$set default ampsdir 1.
$multalign 2.
output_file=globin_mult_tree.out 3.
mode=multiple 4.
matrix=file=ampsdir:md.mat 5.
gap_penalty=8.0 6.
constant = 8 7.
consplot=mz 8.
print_vertical= 9.
seq_file=globin.seq 10.
order_file=globin_pairs_tree.ord 11.
tree_file=globin_pairs.tree 12.
This is the same as globin_mult.com with the exception of items 11 and 12.
11. order_file defines the cluster order. This is usually different to the
'simple' order defined in globin_pairs.ord.
12. tree_file defines the tree_file generated by program ORDER, or manually
typed in.
The example output file from this run is globin_mult_tree.out.
Comments
--------
It is currently not possible to say which approach to multiple alignment gives
the best results. The two methods give approximately equivalent alignments when
the sequences are quite similar (as judged by significance scores from pairwise
comparisons). If the sequences consisted of two distinct families then the
tree based alignment will give better alignments WITHIN the families but
roughly equivalent alignments BETWEEN the families. In the current
implementation the single order method will run faster so is probably the
general method of choice.
A study is in progress to compare the two methods over a number of sequence
families and should provide guidelines in the near future.
(c) Multiple alignment - building on an existing alignment
----------------------------------------------------------
Frequently more is known about some of the sequences to be aligned than others.
For example, if two or more proteins in the family of interest have had their
three-dimensional structures solved by x-ray crystallography, then they may be
aligned by inspection of the three dimensional models with much greater
confidence than by using sequence information alone. MULTALIGN therefore allows
one or more extra sequences to be added to a preexisting alignment.
At a lesser level, certain residues may be expected to align on functional
grounds - for example binding studies may suggest that a particular Histidine
residue is involved in the catalytic mechanism of two or more of the proteins
to be aligned. In this situation it is important to be able to include this
information in the alignment. This is particularly important if the overall
similarity between the sequences is low since it is likely to lead to a
biologically more reasonable alignment.
Certain regions of the sequence(s) may be expected to experience fewer gap
events than others. For example the core secondary structural elements (beta
strands and alpha helices) generally do not tolerate large insertions or
deletions. This feature can be encoded when using MULTALIGN by specifying a
variable gap-penalty. (See Barton and Sternberg 1987a for a discussion of the
benefits of secondary structure dependent gap-penalties).
Examples
--------
General:
In order to align sequences to an existing alignment (in this context, an
existing alignment may consist of a minimum of only one sequence) the alignment
must be in 'block_file' format. Essentially this is the same as the file
produced by the print_vertical command and may even contain conservation
information which is ignored when the file is read in. The minimum
requirements for a block_file are given in appendix 2.
(a) aligning without additional information
-------------------------------------------
Example: aligning myoglobin sequences with an existing alignment derived from
three dimensional structure superposition.
Command file: bash_mult.com
$set def ampsdir
$multalign
output_file=bash_mult.out
matrix=file=ampsdir:md.mat
gap_penalty=8.0
constant=8
print_vertical=
block_file=bash.bloc,1
seq_file=myoglobins.seq
These commands will mostly be familiar by now!! The only new addition is the
block_file command. This defines a block file to be read, the filename must
be followed by an integer that identifies the iteration number specified in
the block_file. In this example, the iteration number is set to 1.
This block_file contains the alignment described by Bashford et al (1987) for 7
globin sequences aligned on structural grounds. The command file aligns all 60+
myoglobin sequences with this structure based alignment. See file bash_mult.out
for the result.
(b) Secondary structure dependent gap_penalties
-----------------------------------------------
Example: Aligning with an alignment derived from three-dimensional structure
superposition but including secondary structure dependent gap_penalties.
Secondary structure dependent penalties are included by inserting range
delimiting characters into the block_file. This is illustrated by the
block_file 'bash_vg.bloc'. A start of range is indicated by the '>' character
following the alignment, an end of range by the '#' character.
A typical command file that takes advantage of this information is
'bash_mult_vg.com'. In this example, the delimited ranges correspond to the
conserved alpha helices in the proteins, the gap penalty is therefore
INCREASED within the ranges. (An alternative strategy would be to define the
ranges as the non-helix bits and then decrease the gap penalty)
Command file: bash_mult_vg.com
output_file=bash_mult_vg.out
matrix=file=ampsdir:md.mat
gap_penalty=8.0
constant=8
gap_factor=100
print_vertical=
block_file=bash_vg.bloc,1
seq_file=myoglobins.seq
The additional command is gap_factor. This defines a value by which the
standard gap_penalty is multiplied WITHIN the defined ranges. Clearly, a
number greater than one decreases the likelihood of a gap within the range, a
number less than one increases the likelihood of a gap.
In this example there is little advantage in this approach since the alignment
is very good without the use of secondary structural dependent gaps. When
aligning less similar sequences, the benefits become more apparent.
(c) Defining explicit weights to align ONE extra sequence.
-----------------------------------------------------------
One further refinement that may be incorporated when aligning to a block_file
is to include specific residue weights. Before explaining how this is
achieved it is necessary to understand some of the program's entrails.
Lookup tables
-------------
When the program reads a block_file it generates a lookup table that consists
of an array of dimensions (length of the alignment)*(the number of amino acid
types). Each element of the lookup table represents the weight assigned to
aligning a particular amino acid at that position. MULTALIGN allows you to
print out this Table and modify it to increase/decrease the weights for
particular amino acids at particular alignment positions.
The Table may then be read in by the program and used to subsequently align
another amino acid sequence.
Note. This option is meant for application when using MULTALIGN to scan
sequence databases with flexible patterns, hence it is only possible to align
ONE further sequence with the block_file. (The subject of flexible patterns and
database scanning is the basis of a paper shortly to be submitted. The
necessary commands to make use of this feature will be added as an appendix
once the paper is in print.)
The command 'save_lookup=bash_day.look' causes the lookup table to be written
out to the file bash_day.look.
The Table may be modified, then read in by the command
'read_lookup=bash_day.look'
An example command file to generate and save a lookup table is shown below
(file 'bash_look.com')
output_file=bash.out
matrix=file=ampsdir:md.mat
gap_penalty=8.0
constant=8
gap_factor=100
save_lookup=bash_day.look
block_file=bash_vg.bloc,1
non_alignment=
seq_file=globins.seq
The seq_file command must be present, and at least one sequence present in the
seq_file.
The Lookup_table (bash_day.look) might be edited to increase specific residue
weights before using it to align another sequence to the block.
An example command file is 'bash_rlook.com'
output_file=bash2.out
gap_penalty=8.0
gap_factor=100
read_lookup=bash_day.look
block_file=bash_vg.bloc,1
seq_file=myo.seq
This will cause the lookup table to be read and used to score the alignment of
the sequence in file myo.seq with the block_file. If more than one sequence
is defined in the seq_file then a garbage alignment will be generated for
sequences 2-N.
-------------------------------------------------------------------------------
User Guide to the program
ORDER
------------------
Order is a program designed to be run interactively. Its functions are to
process the output of a MULTALIGN in pairwise mode and produce the following
files.
1. Order_file.
2. tree_file/order_file. (cluster analysis)
3. graphical output - a simple dendrogram is produced.
Type ORDER
1. The program will then ask you for the name of a file produced by MULTALIGN
in pairwise mode ( but not including any alignments).
2. Does the input file have timings? If you specified pairwise_timer= when
running MUTALIGN then answer Yes, otherwise N (Default=N).
3. Enter filename for comparisons to exclude - press <return>.
4. Options: Five possible results of pairwise comparison may be used to
perform the ordering or cluster analysis.
NGAPS(1) (Use the number of gaps - totally useless)
PIDENT(2) Use percentage identities - possibly useful.
NAS(3) Normalised alignment score (see description of Multalign pairwise
mode for explanation)
NASAL(4) As for NAS
RSCORE(5) Use S.D. scores calculated from randomizations.
Answer with an integer 1-5 (Usually 3 or 5) 5 is best if you've calculated it.
5. Produce an order file? [Y]
Answering Yes to this option prompts for a filename, then exits the program.
The order is defined in a manner suitable for the single order multiple
alignment algorithm.
6. Answering No to 5. gives the question
Perform cluster analysis ? [Y]
Answer Yes then you get
6a. Save full cluster details? (ie do we want to generate a tree_file
and a compatible order_file for subsequent use by MULTALIGN to
perform tree based multiple alignment?? The default is YES.
If we answer Yes, then we are next prompted for an order filename
and tree filename to be written.
(This is followed by a request for plotting details - note,
there is a bug in the programme, you MUST supply a title and
axis title, but these won't appear on the output - The page width
requested, refers to the number of plotting characters typically
80 or 132).
7. Answering No to 6. gives the question 'Give ADDTREE filename'
This allows the pairwise scores to be written out in a form that can be
read by the cluster analysis program ADDTREE (See Corter(1982))
------------------------------------------------------------------------------
FLEXIBLE PATTERN MATCHING AND DATABASE SCANNING
-----------------------------------------------
Introduction
------------
This section describes how to use AMPS to perform flexible pattern matching by
the method of Barton and Sternberg (1990). This approach can help identify
weaker similarities between proteins that would otherwise be missed by
conventional sequence comparison methods. AMPS may also be used to scan the
database with a multiple sequence alignment using the Needleman-Wunsch
algorithm - normally this is not as effective as deriving a pattern and
scanning with that.
PLEASE NOTE: It is important that you are familiar with the AMPS alignment
features described above, before attempting flexible pattern matching.
AMPS allows the following operations to be performed:
1. Definition of a pattern representative of a particular protein fold
including the explicit description of allowed flexibility in gap length between
defined regions.
2. A variety of scoring systems for each element of the pattern. eg. based on
frequency weights, Dayhoff's matrix, conservation or fully user defined
weights.
3. Scanning of the pattern against a database of protein sequences, subsequennt
rank ordering and display of the results of the scan.
4. Detailed analysis of a single sequence for the presence of multiple
occurences of the pattern. Calculation of the significance of the best
matching pattern by reference to randomized sequences.
These points will be illustrated by reference to examples.
A typical pattern analysis might follow the following steps:
1. Define a pattern and choose a scoring scheme.
2. Scan the pattern against the database (PROGRAM MULTALIGN).
3. Sort the results (PROGRAM SORTER).
4. Get ID's of interesting proteins (PROGRAM SORTER).
5. Extract sequences of interesting proteins from the database (PROGRAM
SELECT).
6. Align pattern to the proteins - including alternative alignments
(PROGRAM MULTALIGN).
7. Produce compressed output for inspection (PROGRAM PATT)
Not every stage need be performed. For example, if we already know the subset
of the protein database that is interesting, then steps 1-4 can be avoided.
PROGRAM LIMITATIONS:
-------------------
The normal MULTALIGN program is limited to up to 500 sequences not exceeding
1200 amino acids length. As a consequence, the maximum sequence length
considered in a database scan is also 1200. Sequences of greater than 1200
amino acids in length will be IGNORED. The program reports how many sequences
were actually compared in the scan at the end of the output file.
IMPORTANT: THE AMULT PROGRAM MUST HAVE BEEN RECOMPILED FOR LONG SEQUENCES TO
USE THIS OPTION. THIS IS NOT THE DEFAULT INSTALLATION. PLEASE SEE THE
NOTES IN THE FILE CALLED BUILD.
1. Defining a Pattern
---------------------
The flexible pattern is defined around the block_file format described in
APPENDIX II. The only additions necessary are if flexible gaps are to be
included, when the following syntax is used.
>test this is a test flexible pattern
>test2 this is also a flexible pattern
*iteration 1
ag each line in this file
vl <23:28 defines a pattern "ELEMENT"
pg
ww <1:3
kr
.
. further pattern elements here
.
etc
*
This file means: score against ag at position 1, vl at position 2 then allow
between 23 and 28 gaps (inclusive) then score against pg, then ww, then allow a
gap of 1 to 3 residues, then score kr etc etc.
2. Defining the scoring scheme
------------------------------
When the pattern is read by the program, a lookup table is defined that
specifies the score for aligning each possible amino acid with each element of
the pattern. There are several possible ways of deriving this table, each of
which is defined by commands to the program.
The most flexible scoring system is to define a LOOKUP table with the
READ_LOOKUP command (see above). A simpler approach is to calculate a
lookup table based upon the observed amino acids in the pattern.
Example command file (bash_ge_4_scan1.com) to scan with the pattern defined in
the file bash_ge_4.bloc.
output_file=bash_ge_4_scan1.out 1. output file
mode = scan 2. set the mode to scan
block_file/pattern=bash_ge_4.bloc,1 3. define the pattern file
matrix_file=md.mat 4. define dayhoff matrix scoring
database=protein.seq 5. define the database to scan
This scan takes 977 seconds (16 minutes) on a Sun SPARCstation 1 (6721
sequences scanned out of a total of 6858). The result of this scan is shown in
file bash_ge_4_scan1.out.
The new commands are:
MODE=SCAN This tells the program to compare a pattern or alignment to the
DATABASE rather than perform multiple or pairwise sequence
alignment.
/PATTERN in the BLOCK_FILE command. This tells the program to use the flexible
pattern matching algorithm rather than a conventional Needleman and Wunsch.
DATABASE: This defines the sequence database to be scanned (must be
in PIR format as defined in APPENDIX II.
By specifying the md.mat MATRIX file we are defining DAYHOFF scoring
for the scan with simple averages used to define the lookup table.
3. Alternative scoring systems:
_______________________________
IDENTITY: change the matrix_file command to:
MATRIX_FILE=ampsdir:UP.MAT
the presence of an amino acid in an element will indicate a score of
1 with the same amino acid, 0 with all others.
FREQUENCY: delete the MATRIX_FILE command and insert:
FREQSCORE=
The frequency of occurence of each amino acid will be used to score.
CONSERVATION: delete FREQSCORE and replace with CONSCORE.
WEIGHT: (DOdd and Egan's approach). define FREQFILE=fname where fname is
name of a file containing amino acid abundancies.
EXPLICIT: Write a LOOKUP file, then read this into the program
with the READ_LOOKUP=fname command.
-- o O o --
Sorting the results of a scan using
PROGRAM SORTER
___________________
The results of a scan are written to the OUTPUT_FILE. However, the
results are in database order, so need to be sorted with the highest
scoring (most interesting) sequences first. The program SORTER
performs this function.
EXAMPLES: Type sorter to start the program. Some messages appear - to get
help, enter /help. Some times for operations will also be printed to the
screen (not shown in the examples below).
1. To sort the output file and store the entire sorted list:
-------------------------------------------------------------
ENTER FILE TO PROCESS: bash_ge_4_scan1.out
ENTER FILE FOR SORTED OUTPUT [.sorted]: <return>
1=MATCH, 2=NAS, 3=RMEAN, 4=RSD, 5=SCORE
ENTER CHOICE :1 (Always answer 1)
READING DATA FILE
6721 DATASETS READ IN
PERFORMING SORT
DATA SORTED
WRITING SORTED FILE
SORTED FILE WRITTEN
Generate prolog clauses? [N]: <return>
This Example created the file bash_ge_4_scan1.sorted. The file is identical in
format to bash_ge_4_scan1.out, but has the comparisons sorted in decreasing
order of similarity to the pattern.
2. To sort the output and store only the top N scoring results:
---------------------------------------------------------------
(For the sake of brevity in the following example, we only store the top 20)
ENTER FILE TO PROCESS: bash_ge_4_scan1.out/S=20
I WILL REPORT THE TOP 20 SCORES
ENTER FILE FOR SORTED OUTPUT [.sorted]: bash_ge_4_scan1.top20
1=MATCH, 2=NAS, 3=RMEAN, 4=RSD, 5=SCORE
ENTER CHOICE :1 (Always answer 1)
READING DATA FILE
6721 DATASETS READ IN
CALCULATING SCORE DISTRIBUTION
USING ITEM 1
STATISTICS COMPLETE
NUMBER OF POINTS 6721
MEAN 40.8484
SD 26.2812
SKEW 2.19682
KURTOSIS 6.44562
CALCULATING SIGSCORES
PERFORMING SORT
DATA SORTED
WRITING SORTED FILE
SORTED FILE WRITTEN
Generate prolog clauses? [N]: <return>
The /S= option calculates various statistics on the scores obtained, and
expresses each reported score in SD units from the mean. For example part of
the file bash_ge_4_scan1.top20:
>P1;HZPG : Hemoglobin zeta chain - Pig
46 141 153.14 3.33 0.00 0.00 4.27
A B C D E F G
The numbers shown after each protein identifier and title line refer to the
following:
A Number of elements in the pattern (46).
B The length of the sequence being scanned (141).
C The score for the best alignment of the pattern
and the sequence (153.14).
D the score divided by the number of pattern elements (3.33).
E The mean (if randomisations are performed).
F The Standard Deviation (if randomisations are performed).
G The Distance the score (C) is from the mean of the distribution in
standard deviation units (i.e. (153.14 - 40.84)/26.28).
If randomisations are performed using the DATABASE,N option where N= the number
of randomisations to be performed, then E,F and G refer to the mean and s.d.
of the comparisons to randomised sequences.
Although S.D. values are shown, these are not normally very useful. It is
generally better to look at the distribution of scores visually using the /hist
option described below.
3. Extracting the IDentifier codes for the top scoring sequences
----------------------------------------------------------------
The /ID option allows the identifier codes to be extracted from a file.
The resulting file contains SCORE ID pairs with the score expressed in an
integer form suitable for input to the program SELECT.
For example:
ENTER FILE TO PROCESS: bash_ge_4_scan1.top20/ID
ENTER NUMBER OF IDS TO WRITE OUT: 20
ENTER FILE FOR ID'S: bash_ge_4_scan1.top20id
1=MATCH, 2=NAS, 3=RMEAN, 4=RSD, 5=SCORE
ENTER CHOICE :1
USING: 1
READING DATA FILE
20 DATASETS READ IN
PERFORMING SORT
DATA SORTED
WRITING SORTED FILE
SORTED FILE WRITTEN
Generate prolog clauses? [N]: <return>
Giving the following file: bash_ge_4_scan1.top20id
15314 HZPG
15257 HZCZ
15257 HZHU
15142 HACHPE
14957 HADKP
14900 HEMSY2
14800 HBRB3
14728 HBPY
14714 HBFG3T
14700 HBMSH0
14686 HBTG
14686 HBTTP
14671 HGMQJ
14671 HGMQP
14671 HGMQR
14671 HGBAY
14671 HGHUA
14671 HGMKS
14657 HBOR
14657 HEGT1
4. Specifying a cutoff score
----------------------------
All results above some cutoff score may be selected, rather than sorting the
entire file:
ENTER FILE TO PROCESS: bash_ge_4_scan1.out/cutoff
ENTER CUTOFF [real]: 101.0 (for example)
I will report sequences scoring .gt. : 101.00
ENTER FILE FOR SORTED OUTPUT [.sorted]: bash_ge_4_scan1.gt101
1=MATCH, 2=NAS, 3=RMEAN, 4=RSD, 5=SCORE
ENTER CHOICE :1
USING: 1
READING DATA FILE
327 DATASETS READ IN
PERFORMING SORT
DATA SORTED
WRITING SORTED FILE
SORTED FILE WRITTEN
Generate prolog clauses? [N]: <return>
5. Generating Prolog Clauses
-----------------------------
The output of any scan may be converted to Prolog clauses by answering Y to the
Generate prolog clauses? question. An example of the format is shown in the
file: bash_ge_4_scan1.pl.
6. Generating a Histogram of the score distribution
----------------------------------------------------
The score distribution may be examined using the /HIST option as follows:
ENTER FILE TO PROCESS: bash_ge_4_scan1.out/HIST
1=MATCH, 2=NAS, 3=RMEAN, 4=RSD, 5=SCORE
ENTER CHOICE :1
USING: 1
READING DATA FILE
6721 DATASETS READ IN
ENTER Graph FILENAME [.graph]: <return>
Rescale Graph 1-100 [Y]: N
Enter the bucket interval required
10
The rescale graph option, scales the values reported to a 1-100 unit scale to
allow comparison between different scan runs. The bucket interval refers to
the range for each bucket in the histogram, you must choose a suitable value to
give a clear representation of the scan
The result of the above analysis are shown below and in the file:
bash_ge_4_scan1.graph.
Bucket Score Frequency
1 9.430 384 |===========
2 19.430 614 |=================
3 29.430 1016 |=============================
4 39.430 1486 |==========================================
5 49.430 1737 |==================================================
6 59.430 910 |==========================
7 69.430 209 |======
8 79.430 26 |
9 89.430 1 |
10 99.430 7 |
11 109.430 16 |
12 119.430 17 |
13 129.430 36 |=
14 139.430 85 |==
15 149.430 171 |====
16 159.430 5 |
7. Getting Help
---------------
Brief help is available by typing /help at the program prompt.
--- oOo ---
Extracting Sequences from a PIR Database Using
PROGRAM SELECT
----------------------------------------------
If you have access to a database management program (e.g. NBRF PSQ) then you
may use that to extract the sequences of interest from the database. However,
PROGRAM SELECT provides a method of extracting sequences from any PIR format
sequence file. For example, given the file: 'bash_ge_4_scan1.top20id' obtained
above and the PIR database file 'protein.seq'.
Program S E L E C T
Extracts sequences from PIR database
Author: G. J. Barton (1990)
Maximum Allowed Sequence Length: 8000
Maximum Allowed Number of Sequences: System Dependent
Enter name of file containing SCORE ID pairs: bash_ge_4_scan1.top20id
Opening File: bash_ge_4_scan1.top20id
Enter Database Filename: protein.seq
Opening File: protein.seq
Just Extract Identifiers/titles (no sequences) ?[Y/N]: N
Enter Output Filename: bash_ge_4_scan1.top20seq
Opening File: bash_ge_4_scan1.top20seq
Searching for: 20 Sequences
Found: HACHPE 1
Found: HADKP 2
Found: HZHU 3
Found: HZCZ 4
Found: HZPG 5
Found: HGHUA 6
Found: HGMQP 7
Found: HGMQR 8
Found: HGMQJ 9
Found: HGBAY 10
Found: HGMKS 11
Found: HBRB3 12
Found: HBMSH0 13
Found: HEGT1 14
Found: HEMSY2 15
Found: HBTG 16
Found: HBOR 17
Found: HBPY 18
Found: HBTTP 19
Found: HBFG3T 20
Extracted: 20 Sequences
The information displayed to the screen signals when each sequence is found,
the sequences are sorted into descending score order and output to the chosen
file (bash_ge_4_scan1.top20seq).
--- oOo ---
Generating Alignments of the Pattern
With Interesting Sequences
------------------------------------
We now have a pattern and a file containing sequences with which it gives a
good score. In order to see the alignments of the pattern and each sequence,
we must re-run the MULTALIGN program with the print_horizontal (or _vertical)
command. This produces a VERY verbose output.
For example, the command file bash_ge_4_scan1_top20.com:
output_file=bash_ge_4_scan1_top20.alig
mode = scan
block_file/pattern=bash_ge_4.bloc,1
matrix_file=md.mat
database=bash_ge_4_scan1.top20seq
print_horizontal=
produces the output shown in file bash_ge_4_scan1_top20.alig.
A far more compact form of output may be obtained by using the
PRINT_HORIZONTAL/PATTERN=fname command. This produces a file in a special
format for the program PATT.
output_file=bash_ge_4_scan1_top20.out
mode = scan
block_file/pattern=bash_ge_4.bloc,1
matrix_file=md.mat
database=bash_ge_4_scan1.top20seq
print_horizontal/pattern=bash_ge_4_scan1_top20.patt
We can now run the program PATT on the resulting file:
------------------------
Program P A T T E R N
------------------------
Processes FSCAN print_horizontal/pattern output
Author: Geoff Barton
Maximum Pattern Length: 2000
Maximum Pattern Hits + Pattern to Display: 2000
Enter Pattern file: bash_ge_4_scan1_top20.patt
Enter Output file: bash_ge_4_scan1_top20.pattout
Enter page width for horizontal output (>50, Def:132):
Output width: 132
Reading Pattern Description
---Initializing
---Done
Reading Pattern Alignment
153.14 >HZPG 141 1 1 Hemoglobin zeta chain - Pig
152.57 >HZCZ 142 2 1 Hemoglobin zeta-1 chain - Chimpanzee
152.57 >HZHU 141 3 1 Hemoglobin zeta chain - Human
151.43 >HACHPE 141 4 1 Hemoglobin pi' chain - Chicken
149.57 >HADKP 141 5 1 Hemoglobin pi' chain - Muscovy duck
149.00 >HEMSY2 146 6 1 Hemoglobin epsilon-y2 chain - Mouse
148.00 >HBRB3 147 7 1 Hemoglobin gamma (beta-3) chain - Rabbit
147.29 >HBPY 146 8 1 Hemoglobin beta chain - Pigeon
147.14 >HBFG3T 146 9 1 Hemoglobin beta chain - Bullfrog tadpole
147.00 >HBMSH0 147 10 1 Hemoglobin beta-h0 chain - Mouse
146.86 >HBTG 146 11 1 Hemoglobin beta chain - Australian echidna
146.86 >HBTTP 146 12 1 Hemoglobin beta chain - Western painted turt
146.71 >HGMQJ 146 13 1 Hemoglobin gamma chain - Japanese macaque
146.71 >HGMQR 146 14 1 Hemoglobin gamma chain - Rhesus macaque
146.71 >HGBAY 146 15 1 Hemoglobin gamma chain - Yellow baboon
146.71 >HGMQP 146 16 1 Hemoglobin gamma chain - Pig-tailed macaque
146.71 >HGHUA 146 17 1 Hemoglobin gamma chains - Human and chimpanz
146.71 >HGMKS 146 18 1 Hemoglobin gamma chain - Spider monkey
146.57 >HBOR 146 19 1 Hemoglobin beta chain - Duckbill platypus
146.57 >HEGT1 147 20 1 Hemoglobin epsilon-I chain - Goat
Sort the scores? [Y]
Formatting Alignments
Adding Flexible Gap Details
Writing Vertical Format Alignment
Writing Horizontal Format Alignment
The only options in this program are the output width for the results and the
option to sort the scores that are displayed. The sorting option is necessary
if multiple patterns per sequence are calculated using the PATTERN_LEVEL=N
option described below.
The output file from the program PATT contains vertical and horizontal format
multiple alignments of the pattern with all the sequences in the list. This is
a considerable compression of data over the print_horizontal= format (file
bash_ge_4_scan1_top20.alig). The vertical format output may be used to define
a further pattern for database scanning.
Generating multiple pattern alignments
with each sequence
--------------------------------------
In the previous examples, only the best scoring alignment of the pattern with
the sequence entries is output. However, a pattern may match equally well at
more than one position in a sequence (for example, if a repeat is present).
The following example, illustrates the principal of generating multiple pattern
alignments, though the results are of limited interest.
(file bash_ge_4_scan1_top20_patt2.com)
output_file=bash_ge_4_scan1_top20.junk2
mode = scan
block_file/pattern=bash_ge_4.bloc,1
matrix_file=md.mat
pattern_level=5
database=bash_ge_4_scan1.top20seq
print_horizontal/pattern=bash_ge_4_scan1_top20.patt2
The only new command necessary is the PATTERN_LEVEL=N command. This specifies
that the top N ways of aligning the pattern to EACH sequence will be output.
As before, the program PATT may be used to compress the results and store them
in the file bash_ge_4_scan1_top20.patt2out. Inspection of this file shows that
in this example, the alternative pattern alignments have identical scores and
differ only in the positioning of the first group of pattern elements.
APPENDIX I
Summary of Commands for MULTALIGN
---------------------------------
Commands may be entered in mixed case. All commands must include the '='
delimiter. All commands may use a maximum of 80 characters including their
arguments, all commands are free format.
output_file=STRING
STRING is the filename to be used for output of results and error
messages. Must be the first command issued.
mode=OPTIONS
OPTIONS are one of MULTIPLE, PAIRWISE or SCAN
MULTIPLE is the default - signifies 'normal' multiple alignment and
tree based alignments may be used.
PAIRWISE signifies conventional pairwise comparisons to be made for
all sequences in the seq_file.
SCAN specifies that an alignment or pattern are to be compared to a
group of sequences.
matrix_file=STRING
STRING is the filename of the pairscore matrix chosen (one of MD.MAT,
UP.MAT, GC.MAT etc)
seq_file=STRING
STRING is the filename of the file containing sequences.
order_file=STRING
STRING gives the name of the order file
constant=INT
INT is a number added to each element of the matrix defined by
matrix_file. (Default=8)
gap_penalty=REAL
REAL is a number used as the length independent gap penalty.
(Default=8.0).
iterations=INT
INT defines the number of iterations performed in single order
multiple alignment.
print_vertical=
Switch to specify vertical format output of alignments.
print_horizontal=
Switch to specify horizontal format output of alignments.
print_horizontal/pattern=STRING
Specify creation of an output file with a suitable format for program
PATT. STRING is the name of the file.
max_horiz=INT
INT defines the number of residues output per line when the
print_horizontal flag is set. (Default=100).
show_matrix=
Flag to specify that the pairscore matrix should be written to the
output file after any constant has been added.
block_file=STRING,INT
Specifies the block_file name defined by STRING. INT specifies which
iteration to read from the block_file. (INT MUST BE PRESENT OR AN
ERROR WILL RESULT).
block_file/pattern=STRING,INT
Specifies a block file with flexible pattern ranges.
consplot=OPTION
Signals that a conservation profile should be plotted. Only works if
the print_vertical command is set.
OPTIONS are MZ for conservation profiles according to
Zvelebil et al (1987).
GB for a profile that is based upon the N*(N-1)/2 comparisons possible
between all pairs of residues at a given alignment position.
gap_factor=REAL
When aligning to a block_file, this command specifies the factor to be
applied to the gap_penalty within the defined regions.
conscore=
Flag to specify that the alignment will be scored using conservation
values (a la Zvelebil et al 1987) instead of using the specified
pairscore matrix.
show_lookup=
Flag to specify that the lookup_table should be written out at each
cycle of the alignment. The format produced cannot subsequently be
read back in by the program.
save_lookup=STRING
STRING is the name of a file to save the first generated lookup table
to. Once saved, the program stops. The lookup table saved in this
way may be read in by the program (see read_lookup) for subsequent
processing.
s_lookup_comment=STRING
May be used in conjunction with save_lookup to specify a title
for the saved lookup Table.
read_lookup=STRING
Defines the filename of a lookup table to be read in by the program.
The Table overrides any table generated by pairscore matrices.
pairwise_random=INT1,INT2,INT3
Specifies the number of randomizations to be performed in pairwise
per sequence pair in pairwise mode.
INT1 = initial number of randomizations,
INT2 = max number of randomizations,
INT3 = increment.
Normally one only requires a single value for randomizations - say 100.
This would be defined thus: pairwise_random=100,100,1
pairwise_timer=
Flag, if present then system statistics are recorded for each pairwise
alignment.
pairwise_self=
Flag, if present then in pairwise mode a self comparison is also made
for each pair of sequences. (eg. for ABC, AvA, BvB etc.).
pairwise_align_file=STRING
Specifies that alignments produced in pairwise mode be written to a
file other than the output_file.
tree_file=STRING
Specifies the name of the file containing the tree to follow
for tree based multiple alignment. Also tells the program to
perform a tree alignment rather than a single order alignment.
numerate=
If print_vertical has been defined, then this will number each
aligned position of the alignment.
non_alignment=
Flag, signals that no alignment is performed. This can be useful to
switch formats from vertical to horizontal, calculate conservation
values or numerate the sequences. Note, if a block file is being read
in then a seq_file must also be defined).
database,N= STRING
STRING specifies the name of the database to scan using the SCAN mode.
N specifies how many randomisations to perform for each database entry
- normally do not include a value for N (very expensive in CPU and not
very useful).
gap_char_dash=
Specifies that all gaps will be output as dashes '-' rather than
spaces.
gap_char_dot=
Specifies that all gaps will be output as dots '.' rather than
spaces.
use_end_pen=
Weights gaps at the ends of the alignment (overhangs). Only effective
in tree_file alignments.
last_it=
In single order mode with multiple iterations, this requests that only
the last iteration alignment is output. (Normally alignments for
each iteration are displayed).
scanps_output=
In scan mode, this outputs a more condensed version of the output file
which only has a score, id pair on each line rather than the
verbose output produced by default. This file must be sorted using
the Unix sort utility rather than SORTER. (e.g. for a file
called RESULTS.SCAN type:
sort +0 -1 -n -r < RESULTS.SCAN > RESULTS.SORTED
If you take the top N results from the .SORTED file, you can
then extract the sequences using the SELECT program.
-------------------------------------------------------------------------------
APPENDIX II
(MULTALIGN file formats)
------------------------
Where possible, MULTALIGN reads files in free format. In some cases however,
files are fixed format- usually because these are generated by programs such
as ORDER and would not normally be typed in by hand.
seq_file
--------
This file contains one or more protein sequence in ONE LETTER CODE. The
format follows that of the NBRF (PIR) databases .SEQ file as follows.
>P1;IDENT
this is the title line
a s dhjAALLDKHGDK(D,K,L).P
L
WWPGS*
>P1;IDENT2
this is another title line
a R G S DF SDSDDDSSDAKKKFG
*
etc...
Each sequence entry starts with a '>'. This is followed by an identification
code (max 10 characters).
The next record is a title line (max 500 characters).
The following record(s) give the one letter code sequence.
Maximum of 500 characters to a line. The end of the sequence is identified by a
'*' character. Note that lowercase letters are put into uppercase, and only
alphabetic characters are read on the sequence lines, any alphabetic characters
that are not standard one_letter codes are translated as 'X' (unknown).
Minimum requirements are a record with '>', a title line and one residue
followed by a '*'.
Examples: globin.seq, myoglobins.seq.
Source: PSQ 'COPY' command, or typed in manually.
matrix_file
-----------
This follows the conventions of the NBRF (PIR) programs.
line1: Title of matrix
line2: 23 characters representing the one-letter codes and defining the order
that the matrix is stored in
lines3 to 25 are format(23I3).
MULTALIGN reads the one_letter code line and uses this as an index into the
matrix that follows.
Examples: md.mat,up.mat,gc.mat.
Source: the above example files.
order_file
----------
This is Fixed format file that defines the order of alignment.
Each line gives three numbers, A, B, C.
A is an index to which sequence in the seq_file should be aligned next,
B gives the alignment score that was obtained with...
C which is the sequence that gave the score with A.
Format(1x,i10,f20.2,i10)
MULTALIGN must have already read the seq_file before the order file is
defined. the sequences are then re-ordered within the program to conform to
the order specified by the order_file.
Examples: globin_pairs.ord
Source: output from program ORDER as a result of processing the results of a
pairwise sequence comparison.
tree_file
---------
This defines the order in which a tree based multiple alignment is performed.
The format is FIXED and very simple. The aim of tree based alignment at each
stage is to perform a pairwise alignment on two clusters of sequences that
have already been aligned, or are individual sequences. The tree_file defines
which sequences belong to each cluster at each stage of the alignment.
For example, we may have 5 sequences 1,2,3,4,5. At each stage we are aligning
two clusters of sequences A and B. The tree file might look like this.
The tree (my comments - not in the file)
1 number of seqs in cluster A
3 seq 3
1 number of seqs in cluster B
4 seq 4 (A and B are now aligned)
1 Number of seqs in next cluster A
1 seq 1
1 Number of seqs in next cluster B
5 seq 5 (new A and B are now aligned)
2 Number of seqs in next cluster A
3 4 seqs 3 and 4
2
1 5 seqs 1 and 5 ( now aliged to 3 and 4)
1 .
2 .
4 .
3 4 1 5 Finally seq 2 is aligned to 3,4,1,5.
This tree can be gerated by program ORDER, or might be input from a more
sophisticated clustering program.
Format(1x,20i5)
Example: globin_pairs.tree
Note. The sequence numbers identified in the tree_file point to the sequences
as stored in the order internally by MULTALIGN. Normally an order_file would
be used in conjuction with the tree_file so that similar sequences are
clustered together on output. See the documentation on program ORDER for
details of producing compatible tree and order files.
block_file
----------
This defines a multiple alignment in vertical format. The print_vertical
command produces a file in block_file format.
The minimum requirements for a block_file for N aligned sequences are
1. N '>comment line(s)'
2. '* iteration int'
3. 'N or more vertically aligned sequences'
4. '*'
1. The comment lines define the sequence identifiers and the number of
'>' characters preceding the first '* iteration int' line define the number of
sequences that are defined in the sequence lines.
2. This line specifies the beginning of the alignment to be read. The '*'
character specifies the column in which the alignment begins. The 'iteration
int' specifier identifies the particular alignment within this block_file.
Several alignments may follow each other providing they are identified by a
different iteration number (eg. 1,2,3).
3. The alignment is ended by a '*' character which should be in the same
column as the '*' character that started the alignment.
Simple example:
This is a block file containing two alternative alignments of three sequences.
The comments that I an writing here may appear in the block file, but are
ignored by MULTALIGN when the file is read. The only proviso is that no
'greater than' or 'star' characters must be present.
>first this is sequence A
>second this is sequence B
>third This is sequence C
* iteration 1
a
a p
avg
llg
lcr
g
pg
www
s
*
*iteration 2
a
a p
avg
llg
lcr
gpg
www
s
*
Examples: See bash.bloc, bash_vg.bloc and globin.bloc.
The block_file may also encode regions to apply variable gap_penalties within.
See bash_vg.bloc for example syntax.
Save_lookup format
------------------
First line is a comment line - default written by MULTALIGN is the filename.
Subsequent lines are all free format position specific residue weights.
Eg.
> 1 (weights versus first posn. of block_file)
a 3.0,g 4.0 p,2.0, (Alanine gly and pro given weights - others score 0)
>2 (weights for second position of block file
g 100,
p 100,
> 3
a 33,
.. etc
See bash_day.look for a complete example. Note that if a residue is not
defined at a position then its score at that position is zero.
Format for Input to program PATT
--------------------------------
A rather bizarre but flexible format as generated by the print_horizontal/pattern
option in multalign!!
#Pattern !start of pattern description
Nelm, Nchar !Number of elements in pattern/number of characters/element(10,9)
E1 !Pattern element 1 (eg. AAAGGGCCC) (80a1)
F1 !Flexible Gap range 1 (eg. 0,3)
"
"
etc
EN !Nth pattern element - this ends pattern description
#Details !Details of the first recorded pattern match
ID !Protein Identifier 10 chars
Title !Protein title - 80 chars
Len !Protein Length - 80 chars max
An,Pn !Sequence No, Pattern alignement Number
Score !Real number - max 80 chars
#Sequence !Start of sequence identifier
one letter code sequence 80a1 format
#Match !Description of the match follows
1,Pos !N,Pos - N is the pattern element, Pos is the sequence element
2,Pos !which it is aligned. If the pattern overhangs, then N is 0.
N,Pos
#Details !Start of next pattern alignment, or end of file
This format file is output by MSCAN in response to the
print_horizontal/pattern=fname command
Note: The An,Pn refer to the An(th) sequence aligned to the pattern,
and the Pn(th) alignment of the pattern to this sequence - thus
1,1 refers to the top scoring pattern match with sequence 1,
1,2 the second best pattern match to seq 1 etc.
------ oOo ------
APPENDIX III
------------
VAX installation guide
All the programs and datafiles should reside in the same directory. This
should be logically defined in the command file AMPS.COM. Simply change the
assign statements to indicate the directory that you have put AMPS in.
Executing the file AMPS.COM will set up the necessary logical names.
Thereafter, all the above examples as described should work. However, you
will have to modify the command files as supplied to agree with the examples
shown above.
If in doubt, consult your local VAX expert who should be able to sort things
Out.
---- oOo ---
APPENDIX 4
Unix Installation Guide
All files should be read into one directory using the tar utility, then
symbolic links set up to a suitable /bin directory for the executable files
amult, dayb, sorts, sorter, patt and select. All the example files are
contained in the subdirectory /test. For example, assuming the amps
programs documentation and example files have been read into
/local/progs/amps and that you use a directory /local/bin for storing local
executables. Log on a super-user, then type:
cd /local/bin
ln -s /local/amps/bin/multalign multalign
ln -s /local/amps/bin/order order
ln -s /local/amps/bin/dayb dayb
ln -s /local/amps/bin/sorter sorter
ln -s /local/amps/bin/patt patt
ln -s /local/amps/bin/select select
Oxford nmr installation
-----------------------
NOTE: these notes also apply to other installations, though the directories will be different. This document and the example files hide in /home/nmra/gjb/amps. In order to run the programs you need to add the directory /home/nmra/gjb/bin to the path in your .cshrc file. Once this is done, typing multalign will run multalign, and typing order will run the program order. Any files referred to that are not in your working directory, must be given their FULL PATH NAMES. eg. the file md.mat is in /home/nmra/gjb/amps, so when specifying the matrix, the command is: matrix_file=/home/nmra/gjb/amps/md.mat and NOT: matrix_file=ampsdir:md.mat or similar.
Running the program, eg. with the command file bash_look.com is achieved by simply typing multalign < bash_look.com
The program may be run in batch using the "at" utility - see the manual page (ie type man at). To use at, you will need to create a simple shell script to run the program, for example, this might look like this: multalign < bash_look.com > bash_look.log or similar.
NOTE:
amult program limits in this version are 500 sequences of up to 1200 residues.
Any queries, send e-mail with problem files to gjb@bioch.ox.ac.uk or GEOFF@BIOP.OX.AC.UK
--- oOo ---
References
Needleman, S. B. and Wunsch, C. D. (1970). J. Mol. Biol. 48, 443-453.
Barton, G. J. and Sternberg, M. J. E. (1987a). Protein Engineering 1, 89-94.
Barton, G. J. and Sternberg, M. J. E. (1987b). J. Mol. Biol. 198, 327-337.
Barton, G. J. and Sternberg, M. J. E. (1990). J. Mol. Biol. 212, 389-402.
Barton, G. J. (1990), Methods. Enzymol., 183, 403-428.
Zvelebil, M. J., Barton, G. J., Taylor, W. R. and Sternberg, M. J. E. (1987). J. Mol. Biol. 195, 957-961.
Bashford, D., Chothia, C., and Lesk, A.M. (1987). J. Mol. Biol. (1987).190-216.
Corter, J. E. (1982). Behaviour Research Methods and Implementation 14,353-354.
Feng, D. F. and Doolittle, R. F. (1987). J. Mol. Evol. 25, 351-360.
--- oOo ---
Known Bugs
----------
Please send bug reports to gjb@bioch.ox.ac.uk. Include example files. I will try to deal with the problems as they arise.
16th May 1991: Under VAX/VMS the print_vertical option is limited to 130 sequences.
13th Sep 1991: In Silicon Graphics version, program SELECT does not find four character id's in database - reason is obscure..
THIS BUG NOW FIXED.