Biological Databases: A Comprehensive Overview, Study notes of Genomics

The notes have been prepared by a thorough reading of the book and understanding of lectures.

Typology: Study notes

2019/2020

Available from 06/17/2022

Ipsi_ta07
BIOLOGICAL DATABASE
Biological databases are libraries of life sciences information, collected from scientific
experiments, published literature, high-throughput experiment technology, and
computational analyses. They contain information from research areas including
genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.
Information contained in biological databases includes gene function, structure,
localization (both cellular and chromosomal), clinical effects of mutations as well as
similarities of biological sequences and structures.
Biological databases are an important tool in assisting scientists to understand and
explain a host of biological phenomena from the structure of biomolecules and their
interaction, to the whole metabolism of organisms and to understanding the evolution of
species. This knowledge helps facilitate the fight against diseases, assists in the
development of medications and in discovering basic relationships amongst species in
the history of life.
Biological knowledge is distributed amongst many different general and specialized
databases. This sometimes makes it difficult to ensure the consistency of information.
Biological databases cross-reference other databases with accession numbers as one way
of linking their related knowledge together.
An important resource for finding biological databases is a special yearly issue of the
journal Nucleic Acids Research (NAR). The Database Issue of NAR is freely available, and
categorizes many of the publicly available online databases related to biology and
bioinformatics.
Biological data is highly complex and interrelated. Vast amounts of biological
information need to be stored, organized, and indexed so that the information can be
retrieved and used. There are five major types of databases, namely nucleotide databases,
protein databases, protein structure databases, metabolic pathway databases, and
bibliographic databases.
Genome Browser:
NCBI (National Center for Biotechnology Information):
NCBI is one of the leading online resources for biological sequence information.
NCBI is part of the United States National Library of Medicine (NLM), a branch of the
National Institutes of Health (NIH). As a national resource for molecular
biology information, NCBI's mission is to develop new information technologies to aid in
the understanding of fundamental molecular and genetic processes that control health
and disease. More specifically, the NCBI has been charged with creating automated
systems for storing and analyzing knowledge about molecular biology, biochemistry, and
genetics.
NCBI is connected to various other sequence databases in order to be more efficient in
answering sequence queries. User queries are submitted, and sequence information is
delivered, through NCBI's search tool, called Entrez.
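
Besides the web search box, Entrez can be queried programmatically through NCBI's E-utilities. The sketch below only builds a search URL for the esearch endpoint and sends no request. The function name `build_esearch_url` and the search term are illustrative assumptions, not part of the notes.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "esearch" service (searches an Entrez database).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Return an esearch URL for the given Entrez database and query term.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    params = {"db": database, "term": term, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


# "nuccore" is the Entrez nucleotide database; the query term is a made-up example.
url = build_esearch_url("nuccore", "Escherichia coli[Organism] AND lacZ")
print(url)
```

Opening the printed URL in a browser should return an XML list of matching record identifiers, which can then be passed to other E-utilities to fetch the records themselves.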

Home Page:
NCBI has a simplified homepage from which the user can navigate to different resources. The left-hand pane of the homepage has a site map followed by categories that help narrow down the search for the right sequence. On the right-hand side is a list of popular resources, which is very useful for first-time users.

GenBank:
The GenBank sequence database is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC). The National Center for Biotechnology Information is a part of the National Institutes of Health in the United States. GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. In the more than 20 years since its establishment, GenBank has become the most important and most influential database for research in almost all biological fields, and its data are accessed and cited by millions of researchers around the world. GenBank continues to grow at an exponential rate, doubling every 18 months.

Entrez:
The NCBI database accepts queries and delivers data via a custom-made search engine called Entrez. The home page of NCBI has a search box which directs the user to Entrez. Entrez is internally connected to various biological databases, which increases the probability of getting the correct information.

BLAST:
BLAST stands for Basic Local Alignment Search Tool. BLAST is a tool used to find sequences homologous to a particular sequence. The BLAST program was developed by Stephen Altschul of NCBI and colleagues in 1990 and has since become one of the most popular programs for sequence analysis.
BLAST compares the query sequence against all the sequences in the database and returns a list of hits, which are usually ranked in decreasing order of score (best match first). BLAST is available at http://blast.ncbi.nlm.nih.gov/

Variants of BLAST:
  • BLAST-N: compares a nucleotide sequence with nucleotide sequences
  • BLAST-P: compares a protein sequence with protein sequences
  • BLAST-X: compares a translated nucleotide sequence against protein sequences
  • tBLAST-N: compares a protein sequence against the translations of nucleotide sequences
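
The choice among the variants listed above depends only on the type of the query and the type of the database. A minimal sketch of that lookup follows. The helper `choose_blast_variant` is hypothetical, and the lower-case program names are the usual command-line spellings of the four variants.

```python
# Maps (query type, database type) to the BLAST variant listed in the notes.
BLAST_VARIANTS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated before comparison
    ("protein", "nucleotide"): "tblastn",  # database is translated before comparison
}


def choose_blast_variant(query_type, database_type):
    """Return the BLAST variant for a query/database type pair (illustrative helper)."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_VARIANTS:
        raise ValueError(f"unsupported combination: {key}")
    return BLAST_VARIANTS[key]


print(choose_blast_variant("protein", "nucleotide"))
```

For example, a protein query searched against a nucleotide database selects tBLAST-N, because the database entries must be translated before they can be compared with the protein.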

✓ a location - a genomic region (for example, rat X:100000..200000);
  • Most search results will take you to the appropriate Ensembl view through a results page.
  • If you search using a location you will be directed straight to the location tab (this tab provides a view of a region of a genome).

BLAT/BLAST:
BLAT (the BLAST-Like Alignment Tool) is fast, but it demands more exact matches. BLAST allows lower-scoring hits and more gaps in alignments. If you have a sequence but are not sure what the gene name or ID in Ensembl is, you can align it to the genome with BLAST or BLAT.

UCSC Genome Browser:
The UCSC Genome Browser is an online, downloadable genome browser hosted by the University of California, Santa Cruz (UCSC). The UCSC Genome Bioinformatics website consists of a suite of free, open-source, online tools that can be used to browse, analyse, and query genomic data. These tools are available to anyone who has an Internet browser and an interest in genomics. The website provides a quick and easy-to-use visual display of genomic data. It places annotation tracks beneath genome coordinate positions, allowing rapid visual correlation of different types of information. Many of the annotation tracks are submitted by scientists worldwide; the others are computed by the UCSC Genome Bioinformatics group from publicly available sequence data. On July 7, 2000, the newly assembled human genome was released on the web at http://genome.ucsc.edu, along with the initial prototype of a graphical viewing tool, the UCSC Genome Browser.

BASIC PROTOCOL USING THE UCSC GENOME BROWSER
The Genome Browser (GB) is the right place to start if you are new to the UCSC Genome Bioinformatics website. Navigate to the GB by clicking on the Genomes link in the top blue navigation bar. The resulting GB gateway page lists several dozen organisms.
Many organisms have more than one assembly available. Human genome assemblies are named hg (short for human genome) followed by a number. The hg18 assembly, for example, is dated March 2006; newer human assemblies such as hg19 and hg38 have since been released.

Major tools of UCSC:

Printing an image
The GB provides a way to produce a copy of the tracks image suitable for publication or printing.

Retrieving DNA sequence

To capture the underlying DNA sequence for the chromosomal position shown in the annotation display, click on the DNA link in the navigation bar. The resulting page contains configuration options for the DNA output format.

BLAT
BLAT (BLAST-Like Alignment Tool) is a sequence alignment tool. It can align both DNA and protein sequences to the underlying genome. BLAT on DNA works by keeping an index of the entire genome in memory, which makes it very fast. BLAT on DNA sequence is designed to quickly find sequences of 95% or greater similarity, of a length of 40 bases or more. Navigate to the BLAT tool by clicking on the BLAT link in the top blue navigation bar. Configure the BLAT page by choosing the genome and assembly to which you would like to align your DNA or protein sequence. Prepare your sequence in FASTA format for submission to the BLAT tool. FASTA is a very simple plain-text format for displaying nucleotide or protein sequence. For each record, there is one header line that begins with ">" and contains a description or name of the record, followed by one or more lines whose letters represent the DNA or protein sequence.

VISTA:
VISTA is a comprehensive suite of programs and databases for comparative analysis of genomic sequences. The VISTA family of tools is developed and hosted at the Genomics Division of Lawrence Berkeley National Laboratory. This collaborative effort is supported by the Programs for Genomic Applications grant from the NHLBI/NIH and the Office of Biological and Environmental Research, Office of Science, US Department of Energy. Comparison of DNA sequences from different species is a fundamental method for identifying functional elements in genomes. The VISTA family of tools was created to assist biologists in carrying out this task.
The first VISTA server, at http://www-gsd.lbl.gov/vista/, was launched in the summer of 2000 and was designed to align long genomic sequences and visualize these alignments with associated functional annotations. Currently the VISTA site includes multiple comparative genomics tools. Users can browse pre-computed whole-genome alignments of large vertebrate genomes and other groups of organisms with VISTA Browser, submit their own sequences of interest to several VISTA servers for various types of comparative analysis, and obtain detailed comparative analysis results for a set of cardiovascular genes. There are two ways of using VISTA: you can submit your own sequences and alignments for analysis (VISTA servers) or examine pre-computed whole-genome alignments of different species.
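
The FASTA format described in the BLAT section above (a header line beginning with ">" followed by one or more sequence lines) is simple enough to read with a few lines of code. The sketch below is a minimal reader written for illustration. The function name `parse_fasta` and the two example records are made up, and real tools may handle edge cases differently.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined so a record split over lines is read whole.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Two made-up records: one DNA sequence split over two lines, one short peptide.
example = """>seq1 hypothetical example record
ACGTACGTAC
GTTAGC
>seq2 another made-up record
MKVLAT
"""

records = parse_fasta(example)
for name, sequence in records:
    print(name, len(sequence))
```

Joining the sequence lines matters because a single record is commonly wrapped across many lines of fixed width.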

MODEL ORGANISMS' GENOMES AND DATABASES:

Most of our knowledge about the basic properties of metabolism, growth, and division in living cells is a result of studies on species described as “model organisms”. These species include the bacterium Escherichia coli, bakers’ yeast (Saccharomyces cerevisiae), the fruit fly (Drosophila melanogaster), the nematode worm (Caenorhabditis elegans), and the mouse (Mus musculus). Model organism databases (MODs) host the genomic and functional information produced by organism-specific research projects and provide query and visualization tools to access these data. At every stage of the scientific process, MODs contribute to basic and applied research. By consulting MODs, researchers can easily find background information on large sets of genes, such as those involved in a biological process or implicated in a disease. MOD users can thus plan experiments efficiently, combine their data with existing knowledge, and construct novel hypotheses.

The genome of the bacterium Escherichia coli:
Most prokaryotic cells contain their genetic material in the form of a large circular piece of double-stranded DNA, usually less than 5 Mb long. In addition, they may contain plasmids. The protein-coding regions of bacterial genomes do not contain introns. In many prokaryotic genomes the protein-coding regions are partially organized into operons, that is, tandem genes transcribed into a single messenger RNA molecule under common transcriptional control. The typical prokaryotic genome contains only a relatively small amount of non-coding DNA (in comparison with eukaryotes), distributed throughout the sequence. In E. coli only ~11% of the DNA is non-coding.

E. coli strain K-12 has long been the workhorse of molecular biology. The genome of strain MG1655, published in 1997 by the group of F. Blattner at the University of Wisconsin, contains 4,639,221 bp in a single circular DNA molecule, with no plasmids. Approximately 89% of the sequence codes for proteins or structural RNAs. An inventory reveals:

  • 4285 protein-coding genes
  • 122 structural RNA genes
  • non-coding repeat sequences

Genome database of E. coli:
The EcoGene database provides a set of gene and protein sequences derived from the genome sequence of Escherichia coli K-12. EcoGene is a source of re-annotated sequences for the SWISS-PROT and Colibri databases. EcoGene is used for genetic and physical map compilations in collaboration with the Coli Genetic Stock Centre. The EcoGene12 release includes 4293 genes.

In collaboration with A. Bairoch, all EcoGene protein sequence revisions become part of the SWISS-PROT database, with cross-references to EcoGene EG accession numbers. The EcoGene model has also been applied to Salmonella typhimurium to create StyGene in collaboration with K. Sanderson of the Salmonella Genetic Stock Centre.

The genome of Saccharomyces cerevisiae (baker's yeast):
Yeast is one of the simplest known eukaryotic organisms. Its cells, like our own, contain a nucleus and other specialized intracellular compartments. The sequencing of its genome, by an unusually effective international consortium involving ~100 laboratories, was completed in 1996. The yeast genome contains 12,057,500 bp of nuclear DNA, distributed over 16 chromosomes. It contains 5885 predicted protein-coding genes, ~140 genes for ribosomal RNAs, 40 genes for small nuclear RNAs, and 275 transfer RNA genes. Of the 5885 potential protein-coding genes, 3408 correspond to known proteins. About 1000 more show some similarity to known proteins in other species. In two respects the yeast genome is denser in coding regions than the known genomes of the more complex eukaryotes Drosophila melanogaster and human:

  1. Introns are relatively rare, and relatively small. Only 231 genes in yeast contain introns.
  2. There are fewer repeat sequences compared with more complex eukaryotes.

Genome Database of Saccharomyces cerevisiae:
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is a comprehensive web-accessible resource for the genetics and molecular/cell biology of the yeast Saccharomyces cerevisiae. SGD was started by David Botstein and colleagues at Stanford University in 1993 with a mission to collect and organize biological information about yeast genes and proteins, and to make it available to the research community in a consistent, user-friendly form that facilitates retrieval and understanding. SGD integrates functional information about budding yeast genes and their products with a set of analysis tools that facilitate exploring their biological details. The various types of functional data available at SGD can be searched, retrieved, and analysed.

The genome of Drosophila melanogaster:
Drosophila melanogaster, the fruit fly, has been the subject of detailed studies of genetics and development for almost a century. Its genome sequence, the product of a collaboration between Celera Genomics and the Berkeley Drosophila Genome Project, was announced in 1999. The chromosomes of D. melanogaster are nucleoprotein complexes. Approximately one-third of the genome is contained in heterochromatin, which is highly coiled and compact.
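
The genome sizes and protein-coding gene counts quoted above for E. coli and yeast can be turned into a rough gene density, which makes the difference between a compact prokaryotic genome and a simple eukaryotic one concrete. This is only an illustrative calculation from the figures in these notes. The variable and function names are assumptions, and the result ignores RNA genes.

```python
# Figures as quoted in the notes: (genome size in bp, protein-coding genes).
GENOMES = {
    "Escherichia coli K-12 MG1655": (4_639_221, 4285),
    "Saccharomyces cerevisiae": (12_057_500, 5885),
}


def genes_per_megabase(genome_size_bp, gene_count):
    """Return the number of protein-coding genes per million base pairs."""
    return gene_count / (genome_size_bp / 1_000_000)


densities = {
    name: genes_per_megabase(size_bp, genes)
    for name, (size_bp, genes) in GENOMES.items()
}

for name, density in densities.items():
    print(f"{name}: {density:.0f} protein-coding genes per Mb")
```

On these numbers E. coli packs nearly twice as many protein-coding genes into each megabase as yeast does.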

genome; repeat sequences over 50%. The finding of only about 30000-40000 genes suggests that alternative splicing patterns make a very significant contribution to our protein repertoire. The human genome is distributed over 22 chromosome pairs plus the X and Y chromosomes. The DNA contents of the autosomes range from 279 Mbp down to 48 Mbp. The X chromosome contains 163 Mbp and the Y chromosome only 51 Mbp. The exons of human protein-coding genes are relatively small compared with those in other known eukaryotic genomes. The introns are relatively long. For instance, the dystrophin gene, coding for a 3685 aa protein, is > 2.4 Mb long.

RNA genes:
RNA genes in the human genome include:

  • 497 transfer RNA genes. One large cluster contains 140 tRNA genes within a 4 Mb region on chromosome 6.
  • Genes for 28S and 5.8S ribosomal RNAs appear in a 44-kb tandem repeat unit of 150 - 200 copies.
  • 5S RNA genes also appear in tandem arrays containing 200–300 genes, the largest of which is on chromosome 1.
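
The dystrophin figures quoted above give a sense of how little of a human gene can be coding sequence. The short calculation below is a sketch under two stated assumptions: each amino acid is encoded by three bases (the stop codon is ignored), and the gene length is taken as exactly 2.4 Mb, which the notes give only as a lower bound.

```python
protein_length_aa = 3685      # dystrophin protein length from the notes
gene_length_bp = 2_400_000    # "> 2.4 Mb": a lower bound, used here as an assumption

# Three bases per amino acid; the stop codon is not counted.
coding_bp = protein_length_aa * 3
coding_fraction = coding_bp / gene_length_bp

print(f"coding sequence: {coding_bp} bp")
print(f"fraction of gene that is coding: {coding_fraction:.2%}")
```

Because the true gene length is larger than 2.4 Mb, the coding fraction computed this way is an upper estimate; well under 1% of the gene encodes protein, and the rest is mostly intron.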