Aligning sequences

 

1. Paste your sequences.

 

2. Select the appropriate reference MSA. (See below for a description of the other parameters.)

 

3. Choose your output format. If you choose FASTA, you must also select where SINA should write the per-sequence metadata: in the header line (each item enclosed in []), on comment lines between the header and the sequence data (best readability), or in a separate CSV file.

 

 

4. Click on submit.

 

5. You will be redirected to the download page where you can follow your job's progress and download your file(s) once it is finished.
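If you chose in step 3 to have the metadata written into the FASTA header line, the bracketed items can be pulled apart again with a few lines of Python. This is only a minimal sketch: the text above states that each item is enclosed in [], but the key=value layout inside the brackets, the function name split_header and the sample header are assumptions made for illustration.

```python
import re

def split_header(header_line):
    """Split a FASTA header into the sequence name and its [...] items."""
    text = header_line.lstrip(">").strip()
    # Every metadata item is enclosed in square brackets.
    items = re.findall(r"\[([^\]]*)\]", text)
    # Whatever remains after removing the bracketed items is the name.
    name = re.sub(r"\[[^\]]*\]", "", text).strip()
    meta = {}
    for item in items:
        # ASSUMPTION: items look like key=value; items without "=" get "".
        key, sep, value = item.partition("=")
        meta[key.strip()] = value.strip() if sep else ""
    return name, meta

# Hypothetical header using two field names mentioned in this guide.
name, meta = split_header(">seq1 [slv_cutoff_head=0] [slv_cutoff_tail=12]")
print(name, meta)
```

If you need the metadata for many sequences, the separate CSV file is likely the easier starting point.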

Old jobs will remain listed until you restart your browser. The links will remain valid for at least one day.

The log file (the second link) contains the console output generated by SINA, including its version number and the exact parameters used for alignment. If some of your sequences are missing from the output file, the log will most likely contain a statement that no reference sequences could be found for them. Usually this means that they are not the kind of rRNA you selected.

Basic alignment parameters and Advanced alignment parameters are explained below.

Searching for related sequences

 

Use the sequence search stage of SINA to find related sequences and add them to your cart:

Parameters:

Search in:
Select the subset of the database to be searched.
Minimal identity with query:
Set the minimal identity of the search results with the query sequence.
The sequence identity is computed as the number of shared bases (common base-column pairs) divided by the length of the query sequence.
Number of results per query sequence:
This limits the number of reported results.
Number of intermediate results (k-mer search):
To speed up the search, only the closest relatives as determined by a k-mer search are used for the alignment-based comparison. This option configures the number of best matches considered. Setting this value too high will slow down the search unnecessarily; setting it too low will yield suboptimal results.
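The identity definition given under "Minimal identity with query" can be sketched as follows. The sketch assumes that both sequences are available as rows of the same alignment (equal length, gaps written as "-" or "."); the function name and the sample rows are made up for illustration and are not part of SINA.

```python
GAPS = set("-.")

def identity(query_aln, hit_aln):
    """Shared base-column pairs divided by the length of the query."""
    if len(query_aln) != len(hit_aln):
        raise ValueError("both rows must come from the same alignment")
    # The query length counts bases only, not gap characters.
    query_len = sum(1 for c in query_aln if c not in GAPS)
    if query_len == 0:
        return 0.0
    # A pair is shared when both rows hold the same base in the same column.
    shared = sum(
        1
        for q, h in zip(query_aln, hit_aln)
        if q not in GAPS and q.upper() == h.upper()
    )
    return shared / query_len

# Query has 4 bases, 3 of which sit in the same column as the same base.
print(identity("AC-GU", "ACAGA"))  # 0.75
```

Because the divisor is the query length, a short query fully contained in a long database sequence can still reach an identity of 1.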

Submit:

Click on "Align sequence(s)" to submit your request. You will be redirected to the download page where you can follow your job's progress:

Once your job has completed, a link will appear, allowing you to add all search results to the cart.

Please be aware that some sequences may be matched multiple times. The number of search results may therefore be lower than the number of query sequences times the number of requested search results. Also, since the cart system is based on accession numbers, if the search matched a genome sequence, all LSU/SSU sequences from that genome will be added to the cart.

Advanced Options:

kmer-len:
This is the length k of the k-mers used in the k-mer search. The value must be between 6 and 25. 10 is usually good. For distant sequences, 8 or 9 might yield slightly better results.
kmer-mm:
This is the number of mismatches allowed for the whole k-mer to be considered a match. A value of up to 2 is allowed. 
kmer-norel:
Normally, the k-mer distance is computed by dividing the number of k-mers shared between the query and the database sequence by the length of the shorter of the two. This "biases" the search towards short sequences: a 300 base fragment of the query sequence found in the database will get a perfect score. If this option is enabled, the k-mer distance always uses the length of the query as the divisor, thus "biasing" the search towards long sequences. When searching the Parc, enabling this flag improves the results.
kmer-nofast:
In order to speed up the k-mer search, SINA usually only uses k-mers beginning with "A". This flag disables that behavior.
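The interplay of kmer-len, kmer-norel and kmer-nofast can be illustrated with a small sketch. It is not SINA's implementation: counting distinct k-mers via sets, the function names and the toy sequences are assumptions, and mismatches (kmer-mm) are left out entirely.

```python
def kmers(seq, k=10, fast=True):
    """Distinct k-mers of seq; in fast mode only those starting with 'A'."""
    found = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if fast:
        return {kmer for kmer in found if kmer.startswith("A")}
    return found

def kmer_score(query, target, k=10, fast=True, norel=False):
    """Shared k-mers divided by a sequence length.

    norel=False: divide by the shorter of the two sequences.
    norel=True:  always divide by the length of the query.
    """
    shared = len(kmers(query, k, fast) & kmers(target, k, fast))
    divisor = len(query) if norel else min(len(query), len(target))
    return shared / divisor if divisor else 0.0

query = "AACCGGUUAACC"   # 12 bases
fragment = "AACCGG"      # 6 bases, fully contained in the query
print(kmer_score(query, fragment, k=3, fast=False))              # 4 shared / 6
print(kmer_score(query, fragment, k=3, fast=False, norel=True))  # 4 shared / 12
```

With the default divisor the short fragment scores as well as it possibly can; dividing by the query length instead halves its score in this example, which is the shift towards long sequences described for kmer-norel.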

Classifying sequences

 

Check "Enable classification" to request LCA classification of your sequences.

 

The result will be written to metadata fields of the form "slv_lca_tax_<taxonomy-name>".

Parameters:

Select taxonomies:
Check the taxonomies that should be used to derive the classification of your sequences. Selecting multiple taxonomies will result in multiple classifications.
Please be aware that this generates an LCA classification based on the selected taxonomies. It does not use the mechanisms by which those taxonomies were originally built (for example, the SILVA taxonomy is manually curated based on a phylogenetic tree inferred using maximum parsimony).
Fraction of search results used in LCA:
This parameter allows relaxing the "common" criterion in LCA. Normally, LCA will choose the deepest classification level shared by all found sequences. Setting this to a value lower than 1 allows some "outliers" or "under classified" sequences to occur without shortening the classification of the query sequence.

Hints:

Search results that are classified as "Unclassified" will be ignored during classification. If no classification could be made (no search results found or results have different domains), the query sequence will be assigned the classification "Unclassified".

Since classification is based on the results from the sequence search, care must be taken in modifying search parameters. Setting the number of search results to 1 will, for example, always get you the classification of the best database match. Setting the number of search results high will result in "shallower" classification. Lowering the required identity with the query may result in misclassification.
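The effect of the fraction parameter can be made concrete with a short sketch of an LCA vote. It is written under assumptions: taxonomy paths are taken to be semicolon-separated strings, the function name lca and the example taxa are illustrative, and SINA's actual procedure may differ in detail.

```python
from collections import Counter

def lca(paths, fraction=1.0):
    """Deepest taxonomy path supported by at least `fraction` of the hits."""
    # Search results classified as "Unclassified" are ignored.
    usable = [p.rstrip(";").split(";") for p in paths
              if p and p != "Unclassified"]
    if not usable:
        return "Unclassified"
    needed = fraction * len(usable)
    supporters, result, depth = usable, [], 0
    while True:
        # Count the taxa found at this depth among hits still in agreement.
        counts = Counter(p[depth] for p in supporters if len(p) > depth)
        if not counts:
            break
        taxon, votes = counts.most_common(1)[0]
        if votes < needed:
            break
        result.append(taxon)
        supporters = [p for p in supporters
                      if len(p) > depth and p[depth] == taxon]
        depth += 1
    return ";".join(result) if result else "Unclassified"

hits = [
    "Bacteria;Firmicutes;Bacilli",
    "Bacteria;Firmicutes;Bacilli",
    "Bacteria;Firmicutes;Clostridia",
    "Unclassified",
]
print(lca(hits))                # Bacteria;Firmicutes
print(lca(hits, fraction=0.6))  # Bacteria;Firmicutes;Bacilli
```

With the strict setting, the single differing hit shortens the classification by one level; with a fraction of 0.6, two of the three usable hits are enough to keep the deeper level.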

Basic alignment parameters

 

Unaligned sequence ends

The default is to attach the remaining unaligned bases at the end of your sequences to the last aligned base. This is appropriate if your sequences are full length and properly truncated. The number of unaligned bases at the end is reported by SINA in the fields "slv_cutoff_head" and "slv_cutoff_tail". Alternatively, you may choose to move those bases to the outer columns of the alignment. Our alignments contain 1000 empty columns at both ends, so these bases are easily removable prior to e.g. tree reconstruction (in ARB, just use the TERMINI filter). Lastly, you may opt to have these bases removed from your sequences.

Wide insertions

Insertions will, as long as there is enough room in the reference alignment, always be placed adjacent to the following aligned base. Sometimes, however, the insertion may be longer than the number of free columns between the adjoining aligned bases. The default is to disallow this case during the dynamic programming stage of sequence alignment ("forbidding during alignment"). Alternatively, you may choose to have those insertions fitted into the alignment by pushing the adjoining aligned bases outwards as required. The benefit of this option is that you can be made aware of cases where our alignment contains insufficient free columns. Lastly, you may choose to have the insertion truncated to the number of free columns. This is only appropriate if the sequences will be subjected to column filtering (i.e. prior to tree reconstruction) afterwards. Selecting this option disables sequence search.

 

Reverse / complement correction

Using default settings, SINA will verify that your sequence has the correct orientation. If a different orientation is expected to yield a better alignment, your sequences will be transformed accordingly. If you know that your sequences are correctly oriented and not complemented, you may disable this feature.
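For reference, the candidate orientations of a sequence can be generated as in the sketch below. It only produces the transformed sequences; how SINA decides which orientation is expected to align best is not described here and is not modelled. The function name and the translation table are illustrative assumptions, and the output uses the RNA alphabet.

```python
# Complement table for RNA; a "T" in the input is treated like "U".
COMPLEMENT = str.maketrans("ACGUTacgut", "UGCAAugcaa")

def orientations(seq):
    """Return the sequence in all four possible orientations."""
    comp = seq.translate(COMPLEMENT)
    return {
        "as-is": seq,
        "reversed": seq[::-1],
        "complemented": comp,
        "reverse-complemented": comp[::-1],
    }

for label, candidate in orientations("AACG").items():
    print(label, candidate)
```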

Indicate unaligned bases

If you select "indicate unaligned bases", all insertions and remaining unaligned bases at the ends of your sequences will be set in lower case letters. If you intend to validate or refine the alignment computed by SINA, identifying bases with indeterminate positions in this manner may help you locate sections in the alignment worthy of attention.

Show difference to submitted alignment

If you submitted SILVA compatibly aligned sequences, such as from a previous alignment run with different parameters or after manually modifying the alignment, checking the "show differences" option will show the sections of the new alignment differing from the submitted alignment:

Dumping pos 2528 through 2542: 
CG-G-ACA 19
CG-G-AUA 0-18 20-23 25-29 31 34-35 37-39
CG-GAAUA 33
CG-GCAUA 43 <---(## NEW ##)
CGGC-AUA 42 <---(%% ORIG %%)
GG-A-AUA 40
GG-G-AUA 41
UG-G-AUA 24 30 32 36

The example above shows all columns containing at least one base between column 2528 and 2542. The row marked with "ORIG" contains the original alignment. The row marked with "NEW" contains the new alignment. In this example, the dimer "GC" was placed one column further to the right than in the original alignment. The other rows show bases and alignment for the selected reference sequences. The numbers to the right of the alignment extract correspond to the positions of the respective reference sequence identifiers in the field "align_family_slv".
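If you want to process such a dump programmatically, each row can be split into the column extract, the list of reference positions and the NEW/ORIG marker. The following is a minimal sketch based solely on the example shown above; the function names are illustrative, and other dumps may contain variations that this parser does not handle.

```python
def expand_ranges(spec):
    """Turn '0-18 20-23 31' into a flat list of integer positions."""
    positions = []
    for token in spec.split():
        first, _, last = token.partition("-")
        positions.extend(range(int(first), int(last or first) + 1))
    return positions

def parse_dump_row(line):
    """Split one dump row into (columns, positions, marker)."""
    marker = None
    if "<---" in line:
        line, _, tag = line.partition("<---")
        marker = "NEW" if "NEW" in tag else "ORIG" if "ORIG" in tag else None
    columns, _, spec = line.strip().partition(" ")
    return columns, expand_ranges(spec), marker

print(parse_dump_row("UG-G-AUA 24 30 32 36"))
print(parse_dump_row("CG-GCAUA 43 <---(## NEW ##)"))
```

The positions returned for a reference row can then be used as indices into the list of identifiers stored in "align_family_slv".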

Advanced alignment parameters

 

Variability profile:
If this parameter is set to a value other than "none", SINA will use the conservation values from the respective ARB positional variability by parsimony (PVP) filter to assign weights to alignment columns. "automatic" performs a domain level classification to select the appropriate PVP SAI dynamically.
fs-min, fs-max, fs-msc:
These parameters configure the number of reference sequences used by SINA to align the candidate sequences. At least fs-min sequences are always used. Up to fs-max sequences may be used, but only if their k-mer distance is at least fs-msc.
fs-req:
SINA will reject the candidate sequence if fewer than this number of matches are found by the k-mer search.
fs-req-full, fs-full-len:
Irrespective of the setting of fs-max, SINA will include at least fs-req-full sequences of at least fs-full-len bases length in the alignment reference.
gene-start, gene-end:
These parameters configure the first and last columns in the alignment considered to be part of the gene. If set to zero, defaults are extracted from the reference MSA.
fs-cover-gene:
Irrespective of the setting of fs-max, SINA will include at least this number of sequences covering the gene start and gene end positions in the alignment reference.
match-score, mismatch-score:
Scores awarded for matches or mismatches during alignment. IUPAC encoded bases are treated "leniently" by SINA. That is, if a match is conceivable, the pairing is treated as such.
pen-gap, pen-gapext:
Gap open and gap extension penalties.
fs-weight:
If different from zero, match and mismatch scores will be weighted according to the frequency with which the matched/mismatched base occurs in the selected reference sequences.
fs-kmer-len:
The length k of the k-mers used to find the sequences used as an alignment reference.
fs-kmer-mm:
The number of mismatches allowed in matching each k-mer.
fs-kmer-no-fast:
If enabled, the k-mer search will use all k-mers occurring in the candidate sequence. By default, only k-mers starting with 'A' are used.
fs-kmer-norel:
If enabled, the k-mer score will use the length of the candidate sequence in the divisor (rather than the minimum of candidate and matched sequence).
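How fs-min, fs-max and fs-msc interact can be sketched as follows. This is an illustration under assumptions, not SINA's code: the k-mer value is treated as a similarity score where higher means closer, the hits are plain (name, score) pairs, and the further requirements imposed by fs-req, fs-req-full and fs-cover-gene are ignored.

```python
def select_references(hits, fs_min, fs_max, fs_msc):
    """Pick reference sequences from k-mer search hits.

    hits: list of (name, score) pairs; higher scores are assumed closer.
    """
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    # The best fs_min hits are always used, whatever their score.
    chosen = ranked[:fs_min]
    # Further hits, up to fs_max in total, must reach the fs_msc threshold.
    for hit in ranked[fs_min:fs_max]:
        if hit[1] < fs_msc:
            break
        chosen.append(hit)
    return chosen

# Hypothetical search result and parameter values.
hits = [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.75), ("e", 0.72)]
print(select_references(hits, fs_min=2, fs_max=4, fs_msc=0.7))
```

In this example the two best hits are taken unconditionally, two more are added because they reach the threshold, and the fifth hit is never considered because fs-max caps the total at four.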