Bioinformatics notes, Lecture notes of Bioinformatics

These bioinformatics notes explain in detail about all the biological databases and tools used in bioinformatics

Typology: Lecture notes

2020/2021

Available from 03/01/2023

knowledgebyprats

BIOINFORMATICS

ASSIGNMENT

BIOLOGICAL DATABASES

In simple terms, a database is a systematic collection of data or information that is stored and accessed electronically from a computer system. A biological database is therefore an organised collection of biological information that can be accessed, managed and updated easily. There are different kinds of biological databases, such as nucleotide databases, gene databases, protein databases and metabolic pathway databases. Based on the source of their data, biological databases are of two types:

  1. Primary database
  2. Derivative or secondary database

A primary database is one in which experimental results are deposited directly. Its entries are the original submissions made by experimentalists. Examples are GenBank and GEO. Secondary or derived databases contain the results of analysing the data held in primary databases. Examples are RefSeq and UniProt.

Features of Databases

The features of a database are:

  1. A database should be easy to understand.
  2. A database should be simple.
  3. It should be easy to search and locate.
  4. It should be annotated, but not excessively.
  5. A database should have minimum redundancy; that is, the same data should not be stored in multiple locations.
  6. It should be cross referenced.
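
The ideas of minimum redundancy and cross-referencing can be illustrated with a small sketch. It is only a toy model: the `Record` class, the accessions and the `merge_redundant` helper are hypothetical and do not describe how any real database is curated. Records with the same sequence are collapsed into one entry, and the duplicate's accession and links are kept as cross-references.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    # Hypothetical record layout, used only for this illustration.
    accession: str
    sequence: str
    cross_refs: list = field(default_factory=list)


def merge_redundant(records):
    """Collapse records that share the same sequence into a single entry."""
    merged = {}
    for rec in records:
        key = rec.sequence.upper()
        if key not in merged:
            merged[key] = Record(rec.accession, key, list(rec.cross_refs))
            continue
        kept = merged[key]
        # Keep the duplicate's accession and links as cross-references.
        for ref in [rec.accession, *rec.cross_refs]:
            if ref not in kept.cross_refs:
                kept.cross_refs.append(ref)
    return list(merged.values())


records = [
    Record("A1", "ATGC", ["PMID:1"]),
    Record("A2", "atgc", ["PMID:2"]),  # same sequence as A1
    Record("A3", "GGCC"),
]
unique = merge_redundant(records)
print(len(unique), unique[0].cross_refs)
```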

LITERATURE DATABASE

PubMed

PubMed is a literature database developed and maintained by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), which is part of the National Institutes of Health (NIH). It mainly contains citations and abstracts of journal articles in the life sciences, biomedicine, chemical sciences and bioinformatics, most of which come from MEDLINE. It also provides links to related resources on other websites. All citations in MEDLINE are assigned MeSH terms and publication types from NLM's controlled vocabulary. The biggest disadvantage of PubMed is that it does not contain the full text of articles. It may link a bibliographic record to the full text on the journal website; whether the article is free to the public depends on the publisher and on how the article was published.
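
Because MEDLINE citations carry MeSH terms and publication types, a search can be restricted to them. The following is a minimal sketch that only builds a search URL for the NCBI E-utilities `esearch` service and does not contact the network. The helper name `pubmed_search_url` and the example terms are assumptions made for illustration.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(mesh_term, publication_type=None, max_results=20):
    """Build an esearch URL limited to a MeSH term and, optionally, a publication type."""
    query = f'"{mesh_term}"[MeSH Terms]'
    if publication_type:
        query += f' AND "{publication_type}"[Publication Type]'
    params = {"db": "pubmed", "term": query, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = pubmed_search_url("Neoplasms", publication_type="Review")
print(url)
```

Opening such a URL returns a list of PubMed identifiers, not the articles themselves, which matches the point above that PubMed stores citations rather than full text.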

SEQUENCE DATABASES

GenBank

GenBank is a publicly available, comprehensive database of nucleotide sequences and their protein translations. These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. It is a primary database. It exchanges data daily with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ), which ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed.
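
Besides the Entrez web pages, records can be retrieved programmatically through the NCBI E-utilities. The sketch below only constructs an `efetch` URL for a nucleotide record and makes no network request. The helper name `genbank_fetch_url` is an assumption, and U49845 is used purely as an example accession.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession, rettype="gb"):
    """Build an efetch URL; rettype 'gb' asks for the flat file, 'fasta' for the sequence only."""
    params = {
        "db": "nuccore",      # Entrez nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


fasta_url = genbank_fetch_url("U49845", rettype="fasta")
print(fasta_url)
```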

UniProt

The Universal Protein Resource (UniProt) is a freely accessible database used for finding information about protein sequences and their functions. The core activities of UniProt are manual curation of protein sequences assisted by computational analysis, sequence archiving, development of a user-friendly UniProt website and the provision of additional value-added information through cross-references to other databases. UniProt is a joint effort of the European Bioinformatics Institute (EBI), the Protein Information Resource (PIR) and the Swiss Institute of Bioinformatics (SIB). UniProt comprises four components for different uses: the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), the UniProt Archive (UniParc) and the UniProt Metagenomic and Environmental Sequences (UniMES). UniProt is a good resource for students as well as anyone working in bioinformatics, as it interconnects information from large and disparate sources, and it is the most comprehensive catalogue of protein sequences and functional annotation.
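
Part of this annotation is visible even in the header line of a UniProtKB FASTA entry. The sketch below parses such a header, assuming the usual layout `>db|accession|entry_name protein name OS=... OX=... GN=... PE=... SV=...`. The function name and the example header are illustrative assumptions, not text taken from a downloaded file.

```python
import re

HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+(?P<rest>.*)$"
)


def parse_uniprot_header(line):
    """Split a UniProtKB FASTA header into its named fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a UniProtKB FASTA header")
    info = match.groupdict()
    rest = info.pop("rest")
    # Split before each KEY= tag: organism (OS), taxonomy id (OX), gene (GN),
    # protein existence (PE) and sequence version (SV).
    parts = re.split(r"\s(?=(?:OS|OX|GN|PE|SV)=)", rest)
    info["protein_name"] = parts[0]
    for part in parts[1:]:
        key, value = part.split("=", 1)
        info[key] = value
    return info


header = ">sp|P05067|A4_HUMAN Amyloid-beta precursor protein OS=Homo sapiens OX=9606 GN=APP PE=1 SV=3"
fields = parse_uniprot_header(header)
print(fields["accession"], fields["OS"], fields["GN"])
```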

RefSeq

RefSeq is a publicly available database of annotated genomic, transcript and protein sequence records. It is maintained and curated by the National Center for Biotechnology Information (NCBI). It is a secondary database. It provides a set of stable, non-redundant reference sequences, and this non-redundancy is its biggest advantage.
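
RefSeq records can usually be recognised by their accession format: a two-letter prefix, an underscore, and then the identifier, whereas GenBank accessions have no underscore. The following sketch is a heuristic only. The prefix table lists a few common prefixes and is not complete, and the function name is an assumption.

```python
import re

# A few common RefSeq prefixes; the real list is longer.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA model",
    "XP": "predicted protein model",
}

REFSEQ_RE = re.compile(r"^(?P<prefix>[A-Z]{2})_[A-Z0-9]+(?:\.\d+)?$")


def refseq_molecule_type(accession):
    """Return a description for a RefSeq-style accession, or None if it does not look like one."""
    match = REFSEQ_RE.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group("prefix"), "other RefSeq record")


print(refseq_molecule_type("NM_000546.6"))  # RefSeq-style accession
print(refseq_molecule_type("U49845.1"))     # GenBank-style accession
```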

PROSITE

PROSITE is a secondary protein database that collects the patterns and profiles found in protein sequences rather than the complete sequences. With the help of these patterns and profiles, it can rapidly and reliably indicate which family or class a protein belongs to. PROSITE is complemented by ProRule, a collection of rules based on profiles and patterns, which increases the discriminatory power of profiles and patterns by providing additional information about functionally and/or structurally critical amino acids. PROSITE is largely used for the annotation of domain features of UniProtKB/Swiss-Prot entries.
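
A PROSITE pattern is a compact description of a sequence motif, and it maps quite directly onto a regular expression. The sketch below converts the common parts of the pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). It is a simplified converter written for illustration, not the PROSITE scanning tool, and it ignores rarer syntax. The example pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        if element.startswith("<"):       # anchored at the N-terminus
            regex += "^"
            element = element[1:]
        anchored_end = element.endswith(">")
        if anchored_end:                  # anchored at the C-terminus
            element = element[:-1]
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            regex += "."                          # any residue
        elif core.startswith("{"):
            regex += "[^" + core[1:-1] + "]"      # any residue except these
        else:
            regex += core                         # single residue or [..] set
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        if anchored_end:
            regex += "$"
    return regex


glyco = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(glyco, "AANGSAA")
print(glyco, hit.group(), hit.start())
```

Note that `re.search` and `re.finditer` report non-overlapping matches only, so overlapping motif occurrences would need extra handling.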

SECONDARY PROTEIN STRUCTURE DATABASES

PDB

The Protein Data Bank (PDB) is a database that provides information about the 3D structures of biological molecules such as nucleic acids and proteins. It gives insight into the 3D structures of protein targets that are important for the discovery of new drugs. The PDB was established in 1971 as the first open-access digital data resource in biology and is now managed by the Worldwide PDB (wwPDB) partnership. The Research Collaboratory for Structural Bioinformatics PDB (RCSB PDB) operates the USA data center for the global PDB archive. The structures in the PDB are usually obtained by X-ray crystallography and NMR. It is a valuable tool for anyone associated with bioinformatics, biotechnology or biomedicine.
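
In the classic PDB text format, each atom of a structure is stored on one fixed-width `ATOM` (or `HETATM`) line, with the 3D coordinates in set columns. The sketch below reads the main fields from such a line using the standard column positions. The function name is an assumption and the sample line is illustrative.

```python
def parse_atom_line(line):
    """Extract the main fields from a fixed-width ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "atom_name": line[12:16].strip(),  # columns 13-16
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "x": float(line[30:38]),           # columns 31-38
        "y": float(line[38:46]),           # columns 39-46
        "z": float(line[46:54]),           # columns 47-54
    }


SAMPLE = "ATOM      1  N   THR A   1      17.047  14.099   3.625  1.00 13.79           N"
atom = parse_atom_line(SAMPLE)
print(atom["res_name"], atom["chain_id"], atom["x"], atom["y"], atom["z"])
```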

SCOP

Structural Classification of Proteins (SCOP) is a secondary structural database of proteins. Its main purpose is to describe the structural and evolutionary relationships between proteins, i.e. how proteins have evolved over time. It also helps the user find similarities between proteins. The source of these protein structures is the PDB. The unit of classification is usually the protein domain. Proteins are classified in SCOP on the following hierarchical levels:

  1. Family: Proteins are kept in this group on the basis of two criteria – first, all proteins that have residue identities of 30% and greater; second, proteins with lower sequence identities but whose functions and structures are very similar.
  2. Superfamily: Families whose sequences are not very similar but whose structural and functional features suggest a common evolutionary origin.
  3. Fold: Superfamilies and families that have the same major secondary structures in the same arrangement and with the same topological connections. The structural similarities of proteins in the same fold category probably arise from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
  4. Class: The different folds have been grouped into classes. Most folds fall into four classes: (a) all-α, those whose structure is essentially formed by α-helices; (b) all-β, those whose structure is essentially formed by β-sheets; (c) α/β, those with α-helices and β-strands that are largely interspersed; (d) α+β, those in which α-helices and β-strands are largely segregated.

SCOP includes not only the proteins in the current version of the PDB, but also many proteins for which there are published descriptions but whose co-ordinates are not yet available. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database so far.
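
The first Family criterion, residue identity of 30% or greater, can be checked with a simple calculation once two sequences have been aligned. The sketch below assumes the alignment already exists, uses `-` as the gap character, and counts identity over the columns where neither sequence has a gap. Other definitions of percent identity use a different denominator, so this is one convention among several. The function names and sequences are illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the aligned, gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def meets_family_identity(aligned_a, aligned_b, threshold=30.0):
    """True if the pair satisfies the 30% residue-identity criterion."""
    return percent_identity(aligned_a, aligned_b) >= threshold


identity = percent_identity("ACDEFGHIKL", "ACDQFG-IRL")
print(round(identity, 1), meets_family_identity("ACDEFGHIKL", "ACDQFG-IRL"))
```

A pair that falls below the threshold may still belong to the same family under the second criterion, which depends on similar structure and function and cannot be decided from sequence identity alone.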

CATH

CATH stands for Class Architecture Topology Homologous superfamily. It is a free protein structure classification database. It classifies proteins on the basis of:

(C)lass: The three main classes are mainly-α proteins, whose structure is essentially formed by α-helices; mainly-β proteins, whose structure is essentially formed by β-sheets; and α/β proteins, which contain both α-helices and β-strands.

(A)rchitecture: The architecture of a protein is the overall shape of the domain; it does not take the connectivity of the secondary structures into account.

(T)opology: The topology level groups structures that have the same number, arrangement and connectivity of secondary structures.

(H)omologous superfamily: Proteins with high structural similarity are kept at this level, which suggests that they have evolved from a common ancestor.

Sequence family: Proteins with sequence identity greater than 35% are kept in this category, which again suggests that they have evolved from a common ancestor.

One big disadvantage of CATH is that it classifies only the protein structures that are in the PDB.

CATH-Gene 3D

As we know, CATH is a protein database that takes its structures from the PDB. Gene 3D uses the protein structure information from CATH; the structures are split into their constituent polypeptide chains where applicable. Their protein domains are then identified and classified according to the CATH hierarchy levels.

Uses of CATH: It shows how secondary structures are connected with each other and how proteins have evolved, helps in finding conserved sites, and helps in predicting the 3D structure of a protein.
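
The four CATH levels are reflected in the dotted numeric code given to each superfamily, for example 3.40.50.300, where each further number narrows the classification by one level. The sketch below splits such a code into its levels. The class numbering in `CATH_CLASSES` follows the usual CATH convention and is stated here as an assumption, since it is not given in these notes; the function name is also an assumption.

```python
# Class numbers as used in CATH codes (assumed; not stated in the notes above).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


def parse_cath_code(code):
    """Split a C.A.T.H code into the identifier of each hierarchy level."""
    parts = code.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError("expected a code of the form C.A.T.H, e.g. 3.40.50.300")
    return {
        "class": parts[0],
        "class_name": CATH_CLASSES.get(int(parts[0]), "unknown"),
        "architecture": ".".join(parts[:2]),
        "topology": ".".join(parts[:3]),
        "homologous_superfamily": ".".join(parts),
    }


levels = parse_cath_code("3.40.50.300")
print(levels)
```

Two domains share a level when their codes agree up to that position, so domains with codes 3.40.50.300 and 3.40.50.720 have the same topology but belong to different homologous superfamilies.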

KEGG (Kyoto Encyclopedia of Genes and Genomes)

KEGG is a biological database which provides information about the genes and genomes, chemical reactions, systems for the basic understanding of biological systems and diseases and drugs. It is a group of sixteen databases which are