Help for the ExPASy BLAST Interface
Query sequence
Enter a query protein sequence in raw format (no fasta header, use one-letter amino acid codes)
or a UniProt Knowledgebase (Swiss-Prot or TrEMBL) accession number.
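As an illustration of what "raw format" means, the following minimal Python sketch strips a FASTA header, whitespace and position numbers from pasted text and checks the remaining letters. The function name `to_raw_sequence` and the set of accepted letters (the 20 standard codes plus B, Z, X and U) are assumptions made for this example, not the rules applied by the server.

```python
import re

# Assumed alphabet: 20 standard one-letter codes plus B, Z, X, U.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY" "BZXU")


def to_raw_sequence(text: str) -> str:
    """Return an upper-case raw sequence without header, spaces or digits."""
    # Drop FASTA header lines, which start with '>'.
    lines = [ln for ln in text.splitlines() if not ln.startswith(">")]
    # Remove whitespace and position numbers, then normalise the case.
    seq = re.sub(r"[\s\d]", "", "".join(lines)).upper()
    if not seq:
        raise ValueError("empty sequence")
    bad = sorted(set(seq) - VALID_RESIDUES)
    if bad:
        raise ValueError(f"unexpected characters: {''.join(bad)}")
    return seq
```

Pasting the result of `to_raw_sequence(...)` into the query box avoids submitting a header line as if it were sequence data.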
Output format
HTML - BLAST native output format with hyperlinks and some formatting.
NiceBlast - View with full descriptions and organism sources.
Plain Text - Text format with no links.
BLAST program and databases
Programs available on ExPASy
| blastp | compares a protein query sequence against a protein sequence database. |
| tblastn | compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. |
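To make "translated in all reading frames" concrete, here is a minimal Python sketch that translates a nucleotide sequence in its six reading frames (three on each strand). It assumes the standard genetic code and plain A/C/G/T input; the function names are hypothetical and this is not the code used by tblastn itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate one frame; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict[int, str]:
    """Map frame number (+1..+3 forward, -1..-3 reverse strand) to protein."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames
```

In a tblastn search the protein query is compared against translations of this kind for every database sequence, which is why such searches take longer than blastp.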
Programs available elsewhere
| blastn | compares a nucleotide query sequence against a nucleotide sequence database. Available at EMBnet Switzerland. |
| blastx | compares a nucleotide query sequence translated in all reading frames against a protein sequence database. Available at EMBnet Switzerland. |
| tblastx | compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Available at EMBnet Switzerland. |
| PSI-BLAST | Position Specific Iterative BLAST detects weak homologs by building a profile from a multiple alignment of the highest scoring hits in an initial BLAST search. Available at NCBI. |
| PHI-BLAST | Pattern-Hit Initiated BLAST combines matching of regular expressions with local alignments surrounding the match. Available at NCBI. |
Databases
Protein Databases
| UniProt Knowledgebase (UniProtKB) | UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR. The UniProt Knowledgebase consists of two sections: Swiss-Prot, containing manually annotated records with information extracted from literature and curator-evaluated computational analysis, and TrEMBL, a section with computationally analyzed records that await full manual annotation. The database is updated biweekly and includes splice variants. Since UniProtKB contains a huge number of sequences, it may be useful to restrict the search using the following criteria: |
| UniRef100, UniRef90 and UniRef50 | The UniProt Non-redundant Reference (UniRef) databases combine closely related sequences into a single record to speed searches. The UniRef100 database combines identical sequences and sub-fragments of the UniProt Knowledgebase (from any species) into a single UniRef entry, displaying the sequence of a representative protein, the accession numbers of all the merged UniProt entries, and links to the corresponding UniProt and UniParc records. UniRef90 and UniRef50 are built by clustering UniRef100 sequences with 11 or more residues such that each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the representative sequence. UniRef90 and UniRef50 yield a database size reduction of approximately 40% and 65%, respectively, providing for significantly faster sequence searches. |
| PDB | Protein Data Bank for protein 3D structures. Sequences extracted from the PDB SEQRES lines are processed into a non-redundant set where identical sequences are merged into a single record. |
| Translated EST | Protein sequences derived from EST sequencing data (human, mouse, rat, zebrafish, drosophila, bovine, arabidopsis). This database contains many potential errors because of the low quality of the data. |
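The merging of identical sequences and sub-fragments described for UniRef100 (and, for identical sequences only, for the PDB set) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name, the record layout and the quadratic substring search are hypothetical, and the real UniRef pipeline is more elaborate.

```python
def merge_redundant(entries: dict[str, str]) -> list[dict]:
    """Merge identical sequences and sub-fragments into single records.

    `entries` maps accession numbers to sequences. Each returned record
    keeps the longest sequence as representative and lists all members.
    """
    # Longest sequences first, so a fragment is always seen after the
    # sequence that contains it; ties are broken by accession number.
    ordered = sorted(entries.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    records: list[dict] = []
    for accession, seq in ordered:
        for rec in records:
            if seq in rec["sequence"]:  # identical or sub-fragment
                rec["members"].append(accession)
                break
        else:
            records.append(
                {"representative": accession, "sequence": seq,
                 "members": [accession]}
            )
    return records
```

For example, two identical entries and one fragment of them collapse into one record with three members, so a search scans one sequence instead of three.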
DNA Databases (for tblastn)
All databases are subdivided into taxonomic sections, selectable from the Taxonomic
groups drop-down list.
| All EMBL + GSS | All entries from the EMBL database (equivalent to GenBank and DDBJ). |
| HTG | Unverified data from high-throughput genomic sequencing. Usually in the form of cosmids. |
| dbEST | Expressed sequence tag database from the NCBI. |
| EST contigs | Database of contigs based on EST clusters from Unigene (human, mouse, rat, bovine, zebrafish) and SwissClusters (Drosophila melanogaster, Arabidopsis thaliana). |
| Unigene EST | Database of EST clusters (lists of ESTs known to match the same cDNA) from the NCBI (updated occasionally). This database also contains useful information such as STS matches, tissue distribution, or transcript map. |
| Complete genomes | Genomes released in the form of a complete, assembled sequence. |
| Select a microbial genome | One of the genomes released in the form of a complete, assembled sequence. |
E-mail address
Enter your e-mail address to receive the results by e-mail. Otherwise, they are
displayed directly in your browser. The e-mail option is recommended for tblastn
searches on big databases such as EMBL. If an interactive search takes too long,
you will receive an error message asking you to resubmit it by e-mail.
Options
Comparison matrix
The comparison matrix assigns a score to each pair of aligned residues in an
alignment. The BLOSUM matrices base these scores on the frequency with which each
substitution is observed within conserved blocks of related proteins. BLOSUM62 is
among the best of the available matrices for detecting weak protein similarities.
The PAM set of matrices is also available.
If the "Auto-select" option is selected (default), the matrix will be selected
depending on the query sequence length, based on the following (empirically
constructed) table:
| Query length | Substitution matrix |
| <35 | PAM-30 |
| 35-50 | PAM-70 |
| 50-85 | BLOSUM-80 |
| >85 | BLOSUM-62 |
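The table above can be expressed as a small lookup function. This is a sketch, not the server's code: the function name is hypothetical, and because the table lists both "35-50" and "50-85", the example assumes that upper bounds are inclusive (a query of exactly 50 residues gets PAM-70, and one of exactly 85 gets BLOSUM-80).

```python
def auto_select_matrix(query_length: int) -> str:
    """Pick a substitution matrix from the query length (in residues)."""
    if query_length < 1:
        raise ValueError("query length must be positive")
    if query_length < 35:
        return "PAM-30"
    if query_length <= 50:   # assumption: 50 belongs to the 35-50 row
        return "PAM-70"
    if query_length <= 85:   # the last row is ">85", so 85 stays here
        return "BLOSUM-80"
    return "BLOSUM-62"
```

Short queries get matrices tuned for closely related sequences, since short alignments can only reach a significant score when most positions match.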
Setting the E threshold
The expectation value (E) of a match is the number of matches with an equal or
better score that are expected to occur by chance in a database of the same size.
The E threshold is the largest E-value a match may have and still be reported.
The lower the E-value, the more likely the match is to be significant. E-values
between 0.1 and 10 are generally dubious, and E-values over 10 are unlikely to
have biological significance. In all cases, matches need to be verified manually.
You may need to increase the E threshold in the following cases:
- if you have a very short query sequence
- to detect very weak similarities, or similarities in a short region
- if your sequence has a low complexity region and you use the masking option
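The following Python sketch shows why short queries often need a higher E threshold. It uses the standard relation between a normalized bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n the database length, and ignores the edge-effect corrections that BLAST applies. The function names and the example numbers are illustrative assumptions; the interpretation bands are the ones given in the text above.

```python
def expected_hits(bit_score: float, query_length: int,
                  database_length: int) -> float:
    """E = m * n * 2**(-S'), without edge-effect corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


def interpret_e_value(e_value: float) -> str:
    """Rough reading of an E-value, following the bands in the text."""
    if e_value < 0.1:
        return "more likely significant (verify manually)"
    if e_value <= 10:
        return "dubious"
    return "unlikely to have biological significance"


# Illustrative numbers only: a 10-residue query cannot accumulate a high
# score, so even a good match may end up above the usual threshold.
short_query_e = expected_hits(bit_score=25, query_length=10,
                              database_length=10**8)
```

With these illustrative values `short_query_e` is about 30, so the match would only be reported if the E threshold were raised above that value.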
Filter the sequence for low-complexity regions
Low-complexity regions (e.g. stretches of cysteine in CSP_DROME (Q03751),
hydrophobic regions in membrane proteins) tend to produce spurious, insignificant
matches with sequences in the database which have the same kind of low-complexity
regions, but are unrelated biologically. If this option is checked, the query
sequence will be run through the program SEG, and all amino acids in low-complexity
regions will be replaced by X's which will appear in the alignment. The masked regions
will also be visible as slashed regions in the PaintBlast image.
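The effect of masking can be illustrated with a deliberately simplified Python sketch. It is not the SEG algorithm: it slides a fixed window along the sequence and replaces every window whose Shannon entropy falls below a cutoff with X's. The window length of 12 and the cutoff of 2.2 bits are illustrative values chosen for this example, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    total = len(window)
    counts = Counter(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        threshold: float = 2.2) -> str:
    """Replace residues in low-entropy windows by 'X' (simplified, not SEG)."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        # Entropy is always computed on the original, unmasked sequence.
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)
```

A run of 15 cysteines embedded in a diverse sequence is replaced by X's, along with a few flanking residues of the windows that overlap it, while windows of diverse composition are left untouched.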
Gapped alignment
This option allows gaps to be introduced into the sequences when the comparison
is done. It is usually left checked.
Identity BLAST
If this box is checked, two BLAST runs are performed against the selected
database. The first run uses the selected matrix, and its results are displayed.
The second run uses an identity matrix. The sequences with the best scores for
the identity matrix are listed at the top of the BLAST output (sequences picked
up by the identity matrix have an additional score displayed in red at the end
of the line). This should reveal fragments that might have been missed with the
default matrix, or that have a rather low BLAST score despite a high similarity
over a short region. This option is especially useful for fishing out fragments
in large protein families.
Output page
The output page is divided into three sections. The first is a summary of the hits,
including the score and E-value of the best HSP (high-scoring segment pair) for each hit.
The second is a graphical view summarizing the matching portions for each hit.
The third contains the alignments between the query and the hits.
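The relation between HSPs and the hit summary can be sketched as follows: a hit may have several HSPs, and the summary keeps only the best one per hit. The `HSP` class, its field names and the ordering by score are assumptions made for this example, not the data model of the server.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    hit_id: str      # accession number of the database sequence
    score: float     # alignment score of this HSP
    e_value: float   # E-value of this HSP


def summarize_hits(hsps: list[HSP]) -> list[HSP]:
    """Keep the best-scoring HSP of each hit, best hits first."""
    best: dict[str, HSP] = {}
    for hsp in hsps:
        current = best.get(hsp.hit_id)
        if current is None or hsp.score > current.score:
            best[hsp.hit_id] = hsp
    return sorted(best.values(), key=lambda h: h.score, reverse=True)
```

The graphical view and the alignment section, by contrast, show every HSP of a hit rather than only the best one.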
From the summary of the hits, several operations may be performed on selected sequences.
These operations are only available for blastp searches against the protein databases:
- ClustalW is a multiple sequence alignment program.
- T-COFFEE is an alignment program that often gives better results than ClustalW, especially when dealing with divergent sequences and long insertions.
- Reduce redundancy is a program to reduce the redundancy in a set of unaligned sequences.
- PRATT is a tool to discover patterns that are conserved in a set of protein sequences.
- SHOPS is a tool to analyze the genomic operon context for any group of proteins selected on the basis of a set of sequence or domain identifiers.
- Retrieve selected entries/sequences/accession numbers allows several sequences (complete entries, accession numbers only or fasta format) to be retrieved at a time from the database.
Individual entries are always available by clicking on the accession
numbers.
Graphical view
The graphical view is composed of two images.
- PROSITE profile and Pfam HMM matches are drawn as black boxes on the length of the query sequence. PROSITE and Pfam are scanned with a heuristic method which is fast but misses some matches. The boxes are clickable to retrieve a description of the domain.
- The PaintBlast view represents matching regions (HSPs) between the query and each hit in the database.
- On the left side, each matching region is drawn as a box on the query sequence.
- On the right side, each matching region is drawn on the hit sequence. Since the length of hit sequences in the database may vary quite widely, the total length of each hit sequence is drawn as a gray box in a square-root scale (the scale is indicated at the top).
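The square-root scale used for the hit sequences can be illustrated with a short Python sketch. The function name, the 300-pixel maximum width and the rounding are assumptions made for this example; only the square-root relationship comes from the description above.

```python
import math


def hit_box_width(hit_length: int, max_length: int,
                  max_width_px: int = 300) -> int:
    """Width in pixels of the gray box for a hit, on a square-root scale."""
    if hit_length < 0 or max_length <= 0:
        raise ValueError("lengths must be positive")
    return round(max_width_px * math.sqrt(hit_length / max_length))
```

With a longest hit of 10,000 residues, a 100-residue hit is drawn 30 pixels wide instead of the 3 pixels it would get on a linear scale, so short hits remain visible next to very long ones.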
Other references
BLAST tutorial at NCBI
BLAST Frequently Asked Questions at NCBI (includes error messages)
The Statistics of Sequence Similarity Scores by Altschul