Major Bioinformatics Databases - Intuitive Tutorials

There are many free bioinformatics databases available, which offer a wealth of biological data and information. Some of the popular ones are discussed below.

NCBI

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM). It hosts one of the largest collections of bioinformatics databases in the world, including GenBank, PubMed, Protein, and Nucleotide.

NCBI is a valuable resource for researchers in the life sciences, providing access to a vast array of biological data, including nucleotide and protein sequences, scientific literature, genetic variation, and gene expression data. The following is a detailed overview of the different databases available on the NCBI platform:

GenBank: This is a database of nucleotide sequences, including DNA and RNA, from a variety of organisms. GenBank is the primary repository for sequence data generated by the scientific community and contains over 250 billion bases from more than 800,000 organisms.
PubMed: This is a database of scientific literature that includes articles from biomedical journals, books, and conference proceedings. PubMed contains over 30 million citations, and it is a valuable resource for researchers to find information on specific topics.
Protein: This is a database of protein sequences, and it contains information on the structure and function of proteins. Protein includes over 200 million sequences from a wide range of organisms.
Nucleotide: This database aggregates DNA and RNA sequences from several sources, including GenBank, RefSeq, and PDB. Nucleotide includes over 300 million sequences from a variety of organisms.
dbSNP (Single Nucleotide Polymorphism Database): This is a database of genetic variation, and it contains information on variations in DNA sequence, such as single nucleotide polymorphisms (SNPs). dbSNP includes over 1.5 billion variations.

In addition to these databases, NCBI provides access to many other resources, including tools for sequence alignment, analysis of genetic variation, and visualization of molecular structures.

NCBI plays a vital role in advancing research in fields such as genomics, proteomics, and bioinformatics, and its databases and tools are used by researchers around the world.
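
The NCBI databases can also be queried programmatically through the Entrez Programming Utilities (E-utilities). The following is a minimal sketch, not an official client: it only builds ESearch and EFetch URLs and reads the ID list out of an ESearch JSON response. The helper names and the sample response are assumptions made for illustration, and the IDs in the sample are made up.

```python
import json
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that returns matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Build an EFetch URL that downloads the records with the given IDs."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def parse_esearch_ids(response_text):
    """Extract the list of record IDs from an ESearch JSON response."""
    return json.loads(response_text)["esearchresult"]["idlist"]


# Hand-written stand-in for a real ESearch response (the IDs are made up).
sample_response = '{"esearchresult": {"count": "2", "idlist": ["111", "222"]}}'

print(esearch_url("pubmed", "BRCA1 AND breast cancer", retmax=5))
print(parse_esearch_ids(sample_response))
print(efetch_url("nucleotide", ["NM_007294"]))
```

The URLs can be opened with any HTTP client. NCBI asks heavy users to limit their request rate and to register for an API key, so a real script should add both.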

Ensembl

Ensembl is a genome database that provides access to annotated genomes, primarily of vertebrates. Its sister project, Ensembl Genomes, extends the same approach to plants, fungi, bacteria, protists, and invertebrate metazoa. Ensembl was created in 1999 as a joint project between the European Bioinformatics Institute (EMBL-EBI) and the Wellcome Trust Sanger Institute.

Ensembl offers a user-friendly web interface that allows researchers to browse, search, and retrieve data on genes, transcripts, proteins, and their functions. The Ensembl database contains a wealth of information on genome structure, gene expression, variation, regulation, and evolution. Some of the key features of Ensembl are:

Genomic sequences: Ensembl provides high-quality, annotated genomic sequences for a large number of organisms. These sequences are assembled from raw sequencing data and annotated with information on gene structure, exon-intron boundaries, regulatory elements, repeats, and other features.
Gene annotation: Ensembl provides comprehensive annotation of protein-coding genes, non-coding RNA genes, and pseudogenes. The annotation includes gene models, transcript structures, exon-intron boundaries, functional domains, and gene ontologies.
Comparative genomics: Ensembl offers comparative genomics tools that allow researchers to compare genomes across different species and identify conserved regions, synteny, and evolutionary relationships. Ensembl also provides gene trees, species trees, and orthology and paralogy relationships.
Variation analysis: Ensembl provides a comprehensive database of genetic variation, including single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants. Ensembl also offers tools for querying and visualizing variation data and predicting the functional consequences of genetic variants.
Regulation: Ensembl provides information on regulatory elements, such as promoters, enhancers, and transcription factor binding sites. Ensembl also offers tools for predicting the effects of genetic variants on gene regulation and expression.
Functional genomics: Ensembl provides access to functional genomics data, such as gene expression, chromatin accessibility, and epigenetic modifications. Ensembl also offers tools for integrating and visualizing functional genomics data with genomic annotation.
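
Besides the web interface, Ensembl exposes much of this data through a REST service at rest.ensembl.org. As a rough sketch, the code below builds a gene lookup URL and condenses a decoded response into a few fields. The function names are hypothetical, and the sample record is hand-written with placeholder coordinates rather than fetched from Ensembl.

```python
import json
from urllib.parse import quote

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_symbol_url(species, symbol):
    """Build a /lookup/symbol URL that asks for a JSON response."""
    path = f"/lookup/symbol/{quote(species)}/{quote(symbol)}"
    return f"{ENSEMBL_REST}{path}?content-type=application/json"


def summarize_gene(record):
    """Reduce a decoded lookup record to an ID, name, biotype, and location."""
    location = f"{record['seq_region_name']}:{record['start']}-{record['end']}"
    return {
        "id": record["id"],
        "name": record.get("display_name"),
        "biotype": record.get("biotype"),
        "location": location,
    }


# Hand-written stand-in for a lookup response; start/end are placeholders.
sample = json.loads(
    '{"id": "ENSG00000139618", "display_name": "BRCA2", '
    '"biotype": "protein_coding", "seq_region_name": "13", '
    '"start": 1000, "end": 2000, "strand": 1}'
)

print(lookup_symbol_url("homo_sapiens", "BRCA2"))
print(summarize_gene(sample))
```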

UniProt

UniProt is a comprehensive and widely used protein sequence database that provides information on protein sequences, structures, functions, interactions, and post-translational modifications. UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR).

The core of UniProt is the UniProt Knowledgebase (UniProtKB), which is divided into two sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.

UniProtKB (Knowledgebase): UniProtKB is the central protein sequence database. It combines protein sequences with functional annotations, including information on protein names, functions, domains, subcellular localization, and interaction partners. UniProtKB is constantly updated with new protein sequences and annotations.
UniProtKB/Swiss-Prot: UniProtKB/Swiss-Prot is the section of UniProtKB that is manually curated (reviewed) by experts. It contains a smaller set of high-quality protein sequences with detailed and accurate annotations. UniProtKB/Swiss-Prot is considered the gold standard for protein annotation.
UniProtKB/TrEMBL: UniProtKB/TrEMBL is the automatically annotated (unreviewed) section of UniProtKB. It contains the protein sequences that have not yet been manually curated. UniProtKB/TrEMBL serves as a resource for rapid and comprehensive protein sequence searches and contains over 200 million protein sequences.
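
The Swiss-Prot/TrEMBL distinction is visible in the FASTA files that UniProt distributes: header lines start with ">sp|" for Swiss-Prot entries and ">tr|" for TrEMBL entries. The sketch below parses such a header. The function name is an assumption, the first sample header is written by hand in the UniProt style, and the second one is entirely made up.

```python
import re

# ">db|accession|entry_name description OS=organism ..."
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")

SECTIONS = {
    "sp": "UniProtKB/Swiss-Prot (reviewed)",
    "tr": "UniProtKB/TrEMBL (unreviewed)",
}


def parse_uniprot_header(header):
    """Split a UniProtKB FASTA header into its main parts."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    db, accession, entry_name, rest = match.groups()
    # The free-text protein name ends where the OS= (organism) field begins.
    description = rest.split(" OS=", 1)[0]
    organism = re.search(r"OS=(.+?)(?= OX=| GN=| PE=| SV=|$)", rest)
    return {
        "section": SECTIONS[db],
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
        "organism": organism.group(1) if organism else None,
    }


reviewed = ">sp|P05067|A4_HUMAN Amyloid-beta precursor protein OS=Homo sapiens OX=9606 GN=APP PE=1 SV=3"
# Entirely made-up TrEMBL-style header, for illustration only.
unreviewed = ">tr|A0A000EXMPL|A0A000EXMPL_9ZZZZ Uncharacterized protein OS=Example organism OX=0 PE=4 SV=1"

for header in (reviewed, unreviewed):
    print(parse_uniprot_header(header))
```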

KEGG

KEGG (Kyoto Encyclopedia of Genes and Genomes) is an integrated database resource for understanding the high-level functions of biological systems, such as cells, organisms, and ecosystems, from molecular-level information. It draws especially on the large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. Three of its main components are:

KEGG PATHWAY: It is a collection of manually drawn pathway maps representing molecular interaction and reaction networks for various cellular processes and signaling pathways in different organisms, including metabolism, genetic information processing, environmental information processing, cellular processes, human diseases, and drug development. These pathway maps are linked to genes, proteins, and other molecular entities, as well as to various databases, tools, and resources for further analyses and interpretations.
KEGG GENES: It is a collection of gene catalogs and annotations for various organisms, including complete or draft genome sequences, gene expression profiles, and functional assignments based on computational and experimental evidence. These gene catalogs are organized into different types of functional categories, such as ortholog groups, biological pathways, and molecular modules, to facilitate comparative and integrative analyses across different organisms and datasets.
KEGG BRITE: It is a collection of hierarchical classifications and annotations for various biological entities, including genes, proteins, compounds, and diseases, based on functional and structural similarities and relationships. These classifications are organized into different levels and branches, such as pathways, enzymes, diseases, drugs, and organisms, to provide a comprehensive and flexible framework for data exploration and interpretation.

KEGG is widely used by researchers in various fields of biology, such as genetics, biochemistry, molecular biology, pharmacology, and systems biology, for data mining, hypothesis generation, pathway analysis, drug discovery, and functional annotation. KEGG also provides various tools and resources, such as KEGG Mapper, KEGG API, KEGG Orthology, and KEGG Pathogen, to support advanced data analysis and visualization.
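
The KEGG API is a REST service at rest.kegg.jp whose URLs follow an operation/argument pattern (for example list, get, and link), and whose list-style responses are tab-delimited text. The following sketch builds such URLs and parses a response. The helper names are assumptions, and the sample text is typed by hand to imitate the format rather than downloaded.

```python
KEGG_REST = "https://rest.kegg.jp"


def kegg_url(operation, *args):
    """Join an operation and its arguments into a KEGG REST URL."""
    return "/".join([KEGG_REST, operation, *args])


def parse_kegg_table(text):
    """Parse tab-delimited KEGG output into (key, value) pairs."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("\t")
        rows.append((key, value))
    return rows


# Hand-typed imitation of a pathway list response for human (hsa).
sample = (
    "hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "hsa04115\tp53 signaling pathway - Homo sapiens (human)\n"
)

print(kegg_url("list", "pathway", "hsa"))      # all human pathways
print(kegg_url("get", "hsa:7157"))             # one gene entry
print(kegg_url("link", "pathway", "hsa:7157")) # pathways that contain the gene
print(parse_kegg_table(sample))
```

Note that KEGG restricts how its API may be used, in particular for non-academic users, so the terms on the KEGG website should be checked before building anything on top of it.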

Pfam

Pfam is a database of protein families, domains, and functional sites, which is used to annotate and classify protein sequences based on their structural, evolutionary, and functional relationships. The database is maintained by the European Bioinformatics Institute (EBI) and is freely available to the public.

Pfam contains multiple sequence alignments and profiles of protein domains, which are conserved regions of a protein that can fold independently and perform a specific function. These domains can be further grouped into protein families, which share significant sequence similarity and structural features, and can be used to predict the function of uncharacterized proteins. Pfam also contains information about the evolutionary relationships of these protein families, based on the phylogenetic analysis of their sequence conservation patterns.

Pfam covers a wide range of protein domains and families, including enzymatic domains, signaling domains, transcription factors, transporters, and many others. The database is updated regularly to incorporate new sequences and families, and to improve the accuracy and coverage of existing families.

Pfam is supported by tools such as HMMER, a software package that searches protein sequences against the profile hidden Markov models representing Pfam families, and PfamScan, which annotates protein sequences with Pfam domains and families. Pfam also integrates with other databases, such as UniProt and InterPro, to provide comprehensive and complementary information about protein function and structure.
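
A typical workflow is to run HMMER's hmmscan against the Pfam profiles with the --domtblout option and then post-process the resulting table. The sketch below parses that whitespace-delimited format and keeps hits under an E-value cutoff. The function name and the cutoff are assumptions, and the two sample rows are invented (domain names, accessions, and scores are all hypothetical).

```python
def parse_domtblout(text):
    """Parse HMMER3 --domtblout rows into dictionaries, one per domain hit."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        # 22 whitespace-separated columns, then a free-text description.
        fields = line.split(maxsplit=22)
        hits.append({
            "domain": fields[0],
            "pfam_acc": fields[1],
            "query": fields[3],
            "i_evalue": float(fields[12]),  # independent E-value of this domain
            "score": float(fields[13]),
            "ali_from": int(fields[17]),    # alignment start on the query
            "ali_to": int(fields[18]),      # alignment end on the query
        })
    return hits


# Invented rows in domtblout layout; none of these values are real results.
SAMPLE = (
    "# hand-written example\n"
    "Domain_X PF99999.1 120 query1 - 300 1e-30 100.0 0.1 1 1 1e-33 1e-30 99.5 0.1 1 120 50 170 49 171 0.98 Hypothetical domain X\n"
    "Domain_Y PF99998.1 80 query1 - 300 0.5 5.0 0.0 1 1 0.01 0.5 4.0 0.0 3 40 200 240 199 245 0.60 Hypothetical domain Y\n"
)

hits = parse_domtblout(SAMPLE)
significant = [h for h in hits if h["i_evalue"] <= 1e-5]
for hit in significant:
    print(hit["query"], hit["domain"], hit["ali_from"], hit["ali_to"])
```

An E-value cutoff is only one way to filter hits. Pfam also defines curated per-family gathering thresholds, which hmmscan can apply directly with its --cut_ga option.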

RCSB Protein Data Bank

The RCSB Protein Data Bank (PDB) is a publicly accessible database that provides three-dimensional structures of biological macromolecules, including proteins, nucleic acids, and complex assemblies. It is maintained by the Research Collaboratory for Structural Bioinformatics (RCSB), which is a partnership among multiple institutions, including Rutgers University, University of California San Diego, and the National Institute of Standards and Technology (NIST).

The PDB contains over 180,000 structures of macromolecules that have been experimentally determined using X-ray crystallography, NMR spectroscopy, and electron microscopy. The database also includes information about the experimental methods, sample preparation, and refinement procedures used to determine each structure, as well as annotations about the biological function, ligands, and interactions of the macromolecules.

The PDB is widely used by researchers in various fields of biology, such as biochemistry, biophysics, and drug discovery, to study the structure and function of biological macromolecules. The database provides various tools and resources to facilitate data access, visualization, and analysis, such as the PDBjViewer, a molecular viewer that allows users to interactively visualize and manipulate the 3D structures, and the Ligand Explorer, a tool for exploring ligand binding sites and interactions.

The RCSB PDB also collaborates with other databases, such as UniProt, Enzyme Function Initiative, and the Genome Browser, to provide a comprehensive and integrated view of protein structure and function. Furthermore, the RCSB PDB provides educational resources and outreach activities to promote science education and public awareness of structural biology.
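
Structures can be downloaded from RCSB as plain-text files, and the legacy PDB format stores each atom on a fixed-width ATOM line. As an illustration, the sketch below builds a download URL, extracts the alpha-carbon (CA) atoms, and computes their centroid. The function names are assumptions, and the sample lines use made-up coordinates rather than a real structure.

```python
RCSB_FILES = "https://files.rcsb.org/download"


def pdb_file_url(pdb_id):
    """Build the download URL for an entry in legacy PDB format."""
    return f"{RCSB_FILES}/{pdb_id.upper()}.pdb"


def parse_atom_line(line):
    """Read the fixed-width fields of one ATOM record."""
    return {
        "atom": line[12:16].strip(),     # columns 13-16: atom name
        "residue": line[17:20].strip(),  # columns 18-20: residue name
        "chain": line[21],               # column 22: chain identifier
        "res_seq": int(line[22:26]),     # columns 23-26: residue number
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def alpha_carbons(pdb_text):
    """Return the parsed CA atoms of all ATOM records."""
    return [
        parse_atom_line(line)
        for line in pdb_text.splitlines()
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    ]


def centroid(atoms):
    """Average the x, y, and z coordinates of the given atoms."""
    n = len(atoms)
    return tuple(sum(a["xyz"][i] for a in atoms) / n for i in range(3))


# Made-up coordinates laid out in PDB ATOM columns.
SAMPLE = "\n".join([
    "ATOM      1  N   MET A   1       1.000   2.000   3.000  1.00  0.00           N",
    "ATOM      2  CA  MET A   1       2.000   3.000   4.000  1.00  0.00           C",
    "ATOM      3  CA  GLY A   2       4.000   5.000   6.000  1.00  0.00           C",
])

print(pdb_file_url("1tup"))
print(centroid(alpha_carbons(SAMPLE)))
```

Large structures are distributed in the newer PDBx/mmCIF format instead, so production code is usually better served by an established parser than by hand-written column slicing.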

STRING

STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a bioinformatics database and web resource that provides a comprehensive view of protein-protein interactions and functional networks in various organisms. The database is maintained by a consortium of European research institutions and is freely accessible to the public.

STRING integrates various sources of experimental and computational data to predict and annotate protein-protein interactions, including physical interactions, co-expression, co-localization, and co-regulation. The database also provides functional annotations and predictions based on the Gene Ontology (GO) terms, KEGG pathways, and other functional categories.

STRING covers a wide range of organisms, from bacteria to humans, and includes over 24 million proteins from more than 2,000 organisms. The database allows users to search for specific proteins or genes, and to visualize the interactions and functional networks of the query proteins, as well as their orthologs and paralogs.

STRING also provides various tools and resources to facilitate data analysis and interpretation, such as stringApp, a Cytoscape app that lets users integrate and analyze their own data within STRING networks, the STRINGdb R package, and a REST API that allows programmatic access to the database and its features.
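
For programmatic access, STRING offers a REST interface under string-db.org/api, where the URL names an output format and a method, and multiple identifiers are separated by a carriage return. A minimal sketch, assuming the network method and TSV output: the helper names are hypothetical, and the sample table is trimmed to three columns with made-up scores and a made-up partner gene.

```python
import csv
import io
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"


def network_url(identifiers, species, output_format="tsv", required_score=None):
    """Build a network query URL for a list of protein names."""
    # STRING expects identifiers separated by a carriage return (%0D).
    params = {"identifiers": "\r".join(identifiers), "species": species}
    if required_score is not None:
        params["required_score"] = required_score  # request-side cutoff, 0-1000
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


def confident_edges(tsv_text, min_score=0.7):
    """Keep interactions whose combined score (0-1) reaches min_score."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (row["preferredName_A"], row["preferredName_B"], float(row["score"]))
        for row in reader
        if float(row["score"]) >= min_score
    ]


# Trimmed, hand-written table; scores and GENE_X are invented.
SAMPLE = (
    "preferredName_A\tpreferredName_B\tscore\n"
    "TP53\tMDM2\t0.999\n"
    "TP53\tGENE_X\t0.450\n"
)

print(network_url(["TP53", "MDM2"], species=9606))
print(confident_edges(SAMPLE))
```

The species argument is an NCBI taxonomy ID (9606 for human).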

GEO

The Gene Expression Omnibus (GEO) is a publicly accessible database and web resource maintained by the National Center for Biotechnology Information (NCBI). The GEO database stores and shares high-throughput gene expression and genomic data generated by various experimental techniques, such as microarrays, RNA sequencing, and chromatin immunoprecipitation sequencing (ChIP-seq).

GEO provides researchers with a centralized and standardized platform for archiving and accessing large-scale gene expression data sets, as well as tools for data analysis and visualization. The database includes gene expression data from various organisms, tissues, and conditions, and allows users to search, browse, and download data sets, as well as to submit their own data.

GEO data sets can be analyzed using various software tools and bioinformatics resources, such as the GEOquery R package, which provides a convenient interface for accessing and analyzing GEO data sets in the R statistical environment, and the GEO2R tool, which allows users to compare gene expression profiles between different experimental conditions and identify differentially expressed genes.

GEO also provides curated data sets and gene expression profiles from various disease conditions, including cancer, cardiovascular disease, and neurological disorders, which can be used to generate hypotheses and insights into the molecular mechanisms of these diseases.

The GEO database is an important resource for researchers in various fields of biology and medicine, as it provides a wealth of gene expression data that can be used to identify new targets for drug discovery, understand the molecular basis of disease, and explore the biological functions of genes and pathways.
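
GEO records are identified by accession numbers whose prefix indicates the record type: GSE for a Series, GSM for a Sample, GPL for a Platform, and GDS for a curated DataSet. The sketch below classifies an accession and builds the URL of its record page on the GEO website. The function names are assumptions, and the accession numbers are arbitrary placeholders rather than references to particular studies.

```python
import re

GEO_RECORD_TYPES = {
    "GSE": "Series",
    "GSM": "Sample",
    "GPL": "Platform",
    "GDS": "DataSet",
}

ACCESSION_RE = re.compile(r"^(GSE|GSM|GPL|GDS)(\d+)$")


def classify_accession(accession):
    """Return the GEO record type for an accession such as 'GSE12345'."""
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a GEO accession: {accession!r}")
    return GEO_RECORD_TYPES[match.group(1)]


def geo_record_url(accession):
    """Build the URL of the record page for a valid GEO accession."""
    classify_accession(accession)  # raises ValueError if malformed
    acc = accession.strip().upper()
    return f"https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"


# Placeholder accessions, chosen only to show each prefix.
for acc in ("GSE12345", "gsm1000", "GPL570", "GDS100"):
    print(acc, classify_accession(acc), geo_record_url(acc))
```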

These are just a few examples of the many free bioinformatics databases available. The choice of database for a given study often depends on the specific research question and the type of data required.
