Minimal Absent Words (MAWs) and nullomers are two terms that describe minimal-length oligomers absent from a genome or proteome. While nullomers are the shortest possible absent motifs in a species, the broader term MAW includes both nullomers and longer absent sequences that share a common characteristic: they become present once either their leftmost or their rightmost letter is removed.
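The definition above can be illustrated with a minimal brute-force sketch. It is not the method used to build the database; the function name minimal_absent_words and its parameters are hypothetical, only a single strand is scanned, and the exhaustive enumeration is practical only for short sequences and small word lengths.

```python
from itertools import product


def minimal_absent_words(sequence, max_len, alphabet="ACGT"):
    """Brute-force search for minimal absent words of up to max_len letters.

    Illustrative only: enumerates every candidate word over `alphabet`.
    """
    # Collect every substring of length <= max_len that occurs in the sequence.
    present = {
        sequence[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(sequence) - k + 1)
    }
    present.add("")  # the empty word is treated as present

    maws = []
    for k in range(1, max_len + 1):
        for letters in product(alphabet, repeat=k):
            word = "".join(letters)
            # The word itself is absent, but dropping its leftmost letter
            # or its rightmost letter gives a word that does occur.
            if (word not in present
                    and word[1:] in present
                    and word[:-1] in present):
                maws.append(word)
    return maws


# Toy example over a two-letter alphabet.
print(minimal_absent_words("AACC", 3, alphabet="AC"))
```

In this toy example the shortest word returned (the nullomer) is CA, while AAA and CCC are longer MAWs.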
The Nullomers Database is a web-based resource of significant MAWs. The term 'significant' denotes a sequence that is highly expected to occur but is nevertheless absent, as assessed by the Nullomers Assessor method. The Nullomers Database is a continuously enriched repository of significant missing sequences from various organisms and aims to serve as a central hub of information for the exploration (and reduction) of the vast space of MAWs.
To facilitate browsing and searching, the graphical user interface of the Nullomers Database is divided into three main categories: Genomic MAWs, Peptide MAWs and MAWs in viruses. The Genomic MAWs section provides significant MAWs from hundreds of genomes, ranging from microbes to human. The Peptide MAWs result from the analysis of two main organisms (Homo sapiens and Mus musculus), with particular emphasis on protein regions in which a significant MAW can 'emerge' upon a single amino acid alteration. Finally, the MAWs in viruses section hosts significant absent genomic motifs from thousands of human-isolated virus records (data retrieved from NCBI Virus).
Several annotation features, as well as the impact of putative MAW-making mutations, have been incorporated and are presented visually using the web services of UniProt and Mutation Assessor, respectively. The ultimate goal of the Nullomers Database is to prioritise and highlight the most significant absent sequences across the tree of life.
The first part of the Nullomers Database presents significant nucleotide sequences absent from several genomes. Put simply, these sequences are unlikely to be absent by chance. For more information about the probabilistic method and the statistical correction procedures that have been applied, please consult the publication.
By default, all records from all examined species are presented in a paginated table. Users can browse the resulting table or download the result set.
The interactive table allows users to sort the results alphabetically or numerically, and to narrow down the output by typing in the search box.
Results can also be filtered by selecting a species or a division (selecting a division subsequently reduces the number of species listed in the other select box). Furthermore, users can search for palindromic (reverse-complement) MAWs.
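For readers unfamiliar with the term, a palindromic (reverse-complement) sequence is one that reads the same as its reverse complement. The following minimal sketch shows the check for plain A/C/G/T sequences; the function names are hypothetical and do not describe the database's own implementation.

```python
# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_rc_palindrome(seq):
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


print(is_rc_palindrome("GAATTC"))   # an even-length palindromic site
print(is_rc_palindrome("GATTACA"))  # not palindromic
```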
It is worth noting that not every species has a significant MAW. Throughout our analysis, a fixed false discovery threshold of 1% has been applied to genomic, peptide and viral MAWs alike.
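To give a sense of how a fixed 1% threshold acts on a set of p-values, the sketch below applies the Benjamini-Hochberg procedure. This is an assumption made purely for illustration: the text above does not state which correction procedures are used, so the publication remains the reference. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.01):
    """Return one boolean per p-value: True where it passes the BH cutoff.

    Illustrative assumption: the Benjamini-Hochberg step-up procedure.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank

    # Every p-value ranked at or below the cutoff is called significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant


# Made-up p-values, one per candidate absent sequence.
print(benjamini_hochberg([0.001, 0.5, 0.004, 0.03], fdr=0.01))
```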
Proteins that are prone to 'generate' a MAW in their sequences upon a single amino acid alteration are shown in the Peptide MAWs section. This includes only proteins that are one substitution away from containing a significant MAW. The results are split into two major categories: i) MAW-making mutations in proteins of interest and ii) lists of proteins per significant absent peptide.
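The notion of a protein being 'one substitution away' from a MAW can be sketched as follows: slide a window of the MAW's length along the protein and report every window that differs from the MAW at exactly one position. This is a simplified illustration under that assumption, not the database's pipeline; the function name and the example sequences are hypothetical.

```python
def maw_making_substitutions(protein, maw):
    """List (position, original, replacement) tuples for which a single
    amino acid substitution would create `maw` in `protein`.

    Positions are 1-based, following the usual protein numbering.
    """
    hits = []
    k = len(maw)
    for start in range(len(protein) - k + 1):
        window = protein[start:start + k]
        # Offsets inside the window where it disagrees with the MAW.
        mismatches = [i for i in range(k) if window[i] != maw[i]]
        if len(mismatches) == 1:
            i = mismatches[0]
            hits.append((start + i + 1, window[i], maw[i]))
    return hits


# Made-up example: substituting Y at position 5 with W would create "TAWI".
print(maw_making_substitutions("MKTAYIAK", "TAWI"))
```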
MAWs per protein
In this section, users can search for proteins that are prone to 'generate' a MAW upon a mutation. As soon as the page loads, a suggestion engine starts, giving users a convenient way to search for proteins of interest. A search can be performed by typing a UniProt identifier, a gene name or simply a free-text description of a protein.
The suggested results are separated into Reviewed and Unreviewed records, and users can narrow down the list by selecting records from either Homo sapiens or Mus musculus only.
Upon selection, users should click the 'Search' button. A graphical table then displays information about the selected proteins, as well as the actual peptide sequences that are prone to create a MAW. The 'MAW-making' alteration is highlighted together with additional information, and a prediction of the functional impact of the specific substitution is provided by Mutation Assessor.
Clicking on a sequence opens an interactive protein viewer (MolArt plugin), which provides structural information and feature annotation of the protein. The displayed information is retrieved from the UniProt database in real time. The panel that displays the sequence annotation and the 3D structure of the selected protein is interactive in several ways.
Key features include zooming in and out, dragging on a selection, highlighting elements on click, exporting annotation and images at a specific focus, panning, and synchronization between panels while clicking or hovering. Users can zoom in or out by holding down the right mouse button while moving the mouse up or down, respectively, or rotate the entire molecule by simply clicking on it.
The structure can also be moved (pan functionality) by holding down the middle scroll wheel of the mouse. Users can instantly get information about co-occurring elements at a MAW-making position and explore disease-associated, deleterious or benign variants that have been found in previous studies.
Proteins per MAWs
Next, lists of proteins that are prone to generate one of the significant peptide MAWs can be retrieved simply by selecting an organism and a MAW. By default, only reviewed records are shown. Users can choose between Reviewed only or Reviewed and predicted records.
Subsequently, an interactive graphical table is shown, which includes UniProt IDs, gene names, actual peptides and MAW-making substitutions. The resulting information can be handled in the same way as above: results can be copied, exported, ordered and searched dynamically. Clicking on a UniProt ID opens a new browser tab that redirects users to the corresponding entry in the UniProt database.