
SSU and LSU Ref NR datasets (non-redundant)

Why a non-redundant version of the SILVA Ref databases?

For users interested in representative rRNA sequence collections, the rapid growth of the datasets has led to immense hardware requirements and significantly longer analysis times. For large databases such as the current rRNA datasets, ARB in particular requires large amounts of main memory (RAM) to load the database (see also box below).

There are two basic options for addressing this problem: (1) a hardware upgrade to provide the required amount of RAM, or (2) a reduction of the number of sequences in the ARB database to bring the RAM requirements down to the current hardware specifications.

For multiple reasons, the second option should be preferred as long as the resulting dataset is still "representative", a very important parameter in environmental microbiology. The SILVA project has therefore addressed this task, resulting in "non-redundant" (NR) SSU and LSU Ref datasets built by dereplicating the full SSU / LSU Ref using a 99% identity criterion.

As of SILVA release 119, the SSU Ref NR is the only SSU dataset with a manually curated guide tree. SSU Ref is still provided as an ARB dataset, but without the guide tree. As of SILVA release 138.1, the LSU Ref NR is the only LSU dataset with a manually curated guide tree. LSU Ref is still provided as an ARB dataset, but without the guide tree.

Background information for current release (SSU Ref NR 138.1, August 2020)

The SSU Ref NR 99 138.1 dataset is based on the full SSU Ref 138.1 dataset (see SILVA 138.1 documentation) and encompasses 510,508 sequences in total.

Highly similar sequences were removed by applying a 99% identity criterion with the vsearch tool. Sequences were processed in a custom order: first by presence in the previous release's Ref NR 99, and second by a combination of sequence length (weighted twofold) and quality. For the sorting, the quality of a sequence is determined by ambiguities (50%), overall alignment quality (45%), and homopolymers (5%). The overall alignment quality of a sequence is calculated from its alignment score, alignment identity, and alignment percentage (all equally weighted). Sequences from cultivated species have been preserved in all cases. The number of sequences in the SSU Ref NR 138.1 dataset was thereby reduced to 510,508, about 23% of the database entries at the starting point. Please note that due to this preservation and additional technical limitations (clustering of large datasets), the dataset can still contain sequences with an identity of >99%. The guide tree was extensively manually curated, taking into account the latest taxonomic information. More information about the SILVA and LTP taxonomic frameworks can be found in the respective paper.
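The ordering rule described above can be sketched in Python as follows. This is a minimal illustration, not the SILVA implementation: the `RefSequence` class, its field names, the assumption that every component is already normalized to the range 0 to 1 (higher is better), and the way length and quality are combined as `(2 * length + quality) / 3` are all hypothetical choices made for this example. Only the weights (50% / 45% / 5%, equal weighting of the three alignment measures, twofold weight for length) come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RefSequence:
    """Hypothetical per-sequence scores, each assumed normalized to [0, 1]."""
    name: str
    length_score: float        # longer sequences score higher (assumed)
    ambiguity_score: float     # fewer ambiguities score higher (assumed)
    homopolymer_score: float   # fewer homopolymers score higher (assumed)
    align_score: float
    align_identity: float
    align_percentage: float
    in_previous_nr: bool       # present in the previous release's Ref NR 99


def alignment_quality(seq: RefSequence) -> float:
    # Alignment score, identity, and percentage are equally weighted.
    return (seq.align_score + seq.align_identity + seq.align_percentage) / 3.0


def quality(seq: RefSequence) -> float:
    # Ambiguities 50%, overall alignment quality 45%, homopolymers 5%.
    return (0.50 * seq.ambiguity_score
            + 0.45 * alignment_quality(seq)
            + 0.05 * seq.homopolymer_score)


def combined_score(seq: RefSequence) -> float:
    # Length weighted twofold relative to quality (assumed form of the combination).
    return (2.0 * seq.length_score + quality(seq)) / 3.0


def clustering_order(seqs: List[RefSequence]) -> List[RefSequence]:
    # First criterion: sequences already in the previous Ref NR 99 come first.
    # Second criterion: higher combined length/quality score comes first.
    return sorted(seqs, key=lambda s: (not s.in_previous_nr, -combined_score(s)))
```

Because greedy clustering tools pick cluster representatives in input order, an ordering of this kind would favour sequences that were already representatives in the previous release, which keeps the NR dataset stable between releases.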

The final dataset can be used as a representative environmental dataset for classification, phylogenetic analysis and probe design (for probe match you should use a comprehensive database).

Background information for current release (LSU Ref NR 138.1, August 2020)

The LSU Ref NR 99 138.1 dataset is based on the full LSU Ref 138.1 dataset (see SILVA 138.1 documentation) and encompasses 95,286 sequences in total.

Highly similar sequences were removed by applying a 99% identity criterion with the vsearch tool. Sequences were processed in a custom order: first by presence in the previous release's Ref NR 99, and second by a combination of sequence length (weighted twofold) and quality. For the sorting, the quality of a sequence is determined by ambiguities (50%), overall alignment quality (45%), and homopolymers (5%). The overall alignment quality of a sequence is calculated from its alignment score, alignment identity, and alignment percentage (all equally weighted). Sequences from cultivated species have been preserved in all cases. The number of sequences in the LSU Ref NR 138.1 dataset was thereby reduced to 95,286, about 42% of the database entries at the starting point. Please note that due to this preservation and additional technical limitations (clustering of large datasets), the dataset can still contain sequences with an identity of >99%. The guide tree was extensively manually curated, taking into account the latest taxonomic information. More information about the SILVA and LTP taxonomic frameworks can be found in the respective paper.
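For readers who want to try a comparable dereplication on their own data, the following Python sketch assembles a vsearch command line. It is an assumption-laden illustration: the exact invocation used by the SILVA project is not documented here, and the file names are hypothetical. The options `--cluster_smallmem`, `--usersort`, `--id`, and `--centroids` are standard vsearch options; `--usersort` tells vsearch to accept the input in the order given rather than requiring length-sorted input, which is what a custom sequence order needs. The sketch does not cover the special preservation of sequences from cultivated species.

```python
import shlex
from typing import List


def build_vsearch_command(sorted_fasta: str,
                          centroids_fasta: str,
                          identity: float = 0.99) -> List[str]:
    """Build a vsearch clustering command for a pre-sorted FASTA file.

    sorted_fasta and centroids_fasta are hypothetical file names chosen by
    the caller; identity is the clustering threshold as a fraction.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in the interval (0, 1]")
    return [
        "vsearch",
        "--cluster_smallmem", sorted_fasta,  # greedy clustering in input order
        "--usersort",                        # keep the user-defined order
        "--id", str(identity),               # 0.99 = 99% identity criterion
        "--centroids", centroids_fasta,      # write one representative per cluster
    ]


if __name__ == "__main__":
    cmd = build_vsearch_command("lsu_ref_sorted.fasta", "lsu_ref_nr99.fasta")
    # Print the command instead of running it; pass cmd to subprocess.run
    # if vsearch is installed and the input file exists.
    print(shlex.join(cmd))
```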

The final dataset can be used as a representative environmental dataset for classification, phylogenetic analysis and probe design (for probe match you should use a comprehensive database).

Downloads

The Ref NR 99 datasets, or subsets thereof, can be downloaded via the Browser and as ARB database files in the common .arb format.

FASTA exports of the NR dataset are also available in the SILVA Archive (release_138_1/Exports). The archive also contains older (smaller!) versions of the SSU Ref NR dataset (ARB database files and FASTA exports). SSU Ref NR files have been produced since SILVA release 102, and LSU Ref NR files since SILVA release 138.1.
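As a quick sanity check after downloading a FASTA export, the sequences can be counted per top-level taxonomic group with a few lines of Python. The sketch below assumes that each header line has the form `>ACCESSION.START.STOP Domain;Phylum;...`, that is, an identifier followed by a space and a semicolon-separated taxonomy path. This layout is an assumption for the example and should be checked against the actual file. The sample records are invented for illustration.

```python
import io
from collections import Counter
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_by_domain(handle: TextIO) -> Counter:
    """Count records by the first element of the taxonomy path in the header."""
    counts: Counter = Counter()
    for header, _sequence in read_fasta(handle):
        _identifier, _sep, taxonomy = header.partition(" ")
        domain = taxonomy.split(";")[0] if taxonomy else "unknown"
        counts[domain] += 1
    return counts


if __name__ == "__main__":
    # Invented example records; replace with open("<your export>.fasta").
    example = io.StringIO(
        ">X00001.1.1500 Bacteria;PhylumA;ClassA\nACGU\nACGU\n"
        ">X00002.1.1400 Archaea;PhylumB\nGGCC\n"
        ">X00003.1.1450 Bacteria;PhylumC\nUUAA\n"
    )
    for domain, n in sorted(count_by_domain(example).items()):
        print(domain, n)
```

Comparing the total count with the sequence number stated for the release is a simple way to detect a truncated download.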

How to integrate the Ref NR in your workflow and hardware requirements

The Ref NR is intended as a starting point for your ARB/SILVA work with only moderate hardware requirements. It is a representative set of sequences with all features of the full Ref database (same alignment, navigation tree containing all sequences, new SILVA taxonomy, etc.); only the total number of sequences is reduced, by dereplication using a 99% identity criterion.

Once downloaded as an ARB file, the database can be supplemented with additional sequences using the Browse and Search functions of the SILVA webpage and then the ARB merge tool, e.g. if you want the full diversity represented in the SILVA Parc/Ref databases for selected groups/clusters. You can also delete selected groups/clusters from the Ref NR dataset to compensate for the added sequences.

Since ARB is a so-called "in-memory" database, the larger a dataset is, the more main memory (RAM) ARB requires to handle it. This is the only significant hardware requirement of ARB. However, it currently represents a severe bottleneck for many users due to the rapid growth of the rRNA datasets.

The following table, provided by Ribocon, gives concrete numbers on the ARB hardware requirements for the SILVA Ref NR dataset and a general idea of the correlation between dataset size and ARB memory usage: ARB/SILVA Memory Requirements.