Introduction to Motifs and Motif Finding

Jeremy Buhler

July 29, 2014

1 What do we Mean by a Sequence Motif?

We can talk about motifs from a biological or a computational standpoint.

  • Biologically speaking, a motif in DNA or RNA (or protein) sequence is a short functional sequence element.
  • Examples found in genomic DNA include
    • transcription factor binding sites
    • small noncoding RNAs
    • small repetitive elements (e.g. inverted repeats)
    • common mRNA elements, such as Shine-Dalgarno or Kozak sequences, splice enhancers and suppressors.
  • We will be strongly tempted to talk about conserved DNA sequence motifs. But we must be careful to recognize that many motifs, especially short transcription factor binding sites, are known to arise convergently. Hence, the various instances of a motif, while similar in appearance, may not actually be homologs. Don’t say “conserved” unless you mean it!

To understand motif-finding algorithms, it will help to consider a more abstract conception of what a “motif” is.

  • A motif is a short sequence element that is repeated, perhaps with variation, multiple times in a collection of sequences.
  • Typical motif lengths are five to a few tens of bases.
  • What does “repeated with variation” mean? We need to define this idea more precisely in order to build motif-finding algorithms.

A motif is described by a computational model that specifies how it may vary across its instances. We will consider several important questions for working with motif models:

  1. Given a putative motif instance, how well does it fit the model?
  2. Given a collection of putative instances, can we derive a model that fits them well?
  3. When is a putative motif interesting enough to warrant further study?

Let’s focus for a minute on what “interesting” means.

  • Given a putative motif, we would like to reject the null hypothesis that it is just a bunch of unrelated sequences.
  • More precisely, we typically assume a background model for sequences that are not part of the motif, and we want to reject the hypothesis that the putative motif’s sequences are unrelated samples from this background.
  • Intuitively, a putative motif whose instances display less variation should be more interesting, since the instances are more similar than would be expected for unrelated sequences drawn from the background.
  • Often, the background is assumed to consist of i.i.d. (independent and identically distributed) random DNA bases, with each base chosen according to some fixed multinomial distribution.
  • We will sketch quantitative measures of interest for motifs, but how to turn these measures into p-values is beyond the scope of this talk. Different programs use different approaches.
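
As a small worked form of the i.i.d. background assumption above: writing B(c) for the background probability of base c and m for the sequence length (both symbols are notation assumed here for illustration), the probability that a sequence s = s_1 ... s_m arises from the background is a product over its bases:

$$\Pr(s \mid B) = \prod_{i=1}^{m} B(s_i)$$

For instance, under a uniform background with B(c) = 1/4 for every base, any sequence of length 6 has probability (1/4)^6 ≈ 0.00024.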

2 A Simple Model of Motif Sequence Variation

To precisely describe motifs, we introduce two formal models: the consensus and the weight matrix model. We consider the consensus model first.

  • Let C be a “Platonic ideal” sequence of the motif. For example, for a transcription factor binding site, C might be the sequence that most strongly binds its target protein domain.

3 Another Model of Motif Sequence Variation

The consensus model, while appealing to people who like combinatorics, is limited in its ability to describe variation in a motif.

  • Not every position of a motif may be equally variable. For example, eukaryotic splice donor sites have an almost invariant “gt” at the intron boundary, but the surrounding sequence, while not invariant, exhibits more similarity across splice sites than would be expected by chance.
  • Not all differences from the consensus may be equally likely. IUPAC ambiguity symbols cannot capture an arbitrary set of base frequencies in a particular position.

To capture these position- and sequence-dependent effects, we introduce the weight matrix model (WMM). For simplicity, we will assume that motifs modeled by a WMM have fixed length: their instances cannot exhibit insertions and deletions relative to the model.

  • A WMM for a motif of length ℓ is a 4 × ℓ matrix W of probabilities. The four rows of W are labeled with the four DNA bases, while the columns are labeled 1, ..., ℓ.
  • W (c, i) is the probability that position i of a motif instance is the base c.
  • Instances of the motif are assumed to be drawn at random according to W. More precisely, the ith base of an instance is drawn independently from the multinomial distribution W (∗, i).
  • Here’s an example W of a WMM with ℓ = 6:

|   | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| a | 0.4 | 0.1 | 0.25 | 0.3 | 0.1 | 0.1 |
| c | 0.1 | 0.7 | 0.25 | 0.2 | 0.1 | 0.4 |
| g | 0.4 | 0.1 | 0.25 | 0.2 | 0.1 | 0.1 |
| t | 0.1 | 0.1 | 0.25 | 0.3 | 0.7 | 0.4 |

Once again, let’s consider our fundamental questions about motif models.

  • For any sequence s of length ℓ, we can compute Pr(s | W ), the probability that s arises as an instance of W, by multiplying together the probabilities W (c, i) of seeing each base c of s at its position i.
  • For example,

$$\Pr(\text{acagtc} \mid W) = 0.4 \times 0.7 \times 0.25 \times 0.2 \times 0.7 \times 0.4 \approx 0.004.$$

  • Given a collection of putative instances of common length ℓ, we can infer a motif model W by setting the probabilities directly from the counts of each base at each position.
  • That is,

$$W(c, i) = \frac{\#\text{ of instances with base } c \text{ at position } i}{\text{total } \#\text{ of instances}}$$

  • The resulting W is guaranteed to give the highest total probability for the observed instances among all possible weight matrices; that is, it is a maximum-likelihood estimate of W given the instances.
  • (In practice, we don’t want to let any entry of W be zero, since we infer these probabilities from only a finite number of examples.)
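
The two calculations above (scoring a sequence against W, and estimating W from counts) can be sketched in Python. This is a minimal illustration rather than any particular program's method: the function names, the dictionary layout of W, the toy instances, and the `pseudocount` parameter (one simple way to keep entries away from zero) are all assumptions made for this example.

```python
from collections import Counter

BASES = "acgt"

# The example weight matrix from the notes: one row per base, one column per position.
W_EXAMPLE = {
    "a": [0.4, 0.1, 0.25, 0.3, 0.1, 0.1],
    "c": [0.1, 0.7, 0.25, 0.2, 0.1, 0.4],
    "g": [0.4, 0.1, 0.25, 0.2, 0.1, 0.1],
    "t": [0.1, 0.1, 0.25, 0.3, 0.7, 0.4],
}


def instance_probability(s, W):
    """Pr(s | W): multiply W(c, i) over the bases c of s at positions i."""
    if len(s) != len(W["a"]):
        raise ValueError("sequence length must equal the motif length")
    p = 1.0
    for i, base in enumerate(s):
        p *= W[base][i]
    return p


def estimate_wmm(instances, pseudocount=0.0):
    """Estimate W(c, i) from equal-length instances.

    With pseudocount=0 this is the plain count ratio (the maximum-likelihood
    estimate). A positive pseudocount (an illustrative choice, not taken from
    the notes) is added to every cell so that no entry is exactly zero.
    """
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("all instances must have the same length")
    n = len(instances)
    W = {c: [0.0] * length for c in BASES}
    for i in range(length):
        counts = Counter(s[i] for s in instances)  # missing bases count as 0
        for c in BASES:
            W[c][i] = (counts[c] + pseudocount) / (n + 4 * pseudocount)
    return W


# The worked example from the notes: about 0.004.
print(instance_probability("acagtc", W_EXAMPLE))

# Made-up instances, used only to exercise the estimator.
toy_instances = ["acgt", "aagt", "acgg", "tcgt"]
W_ml = estimate_wmm(toy_instances)
W_smooth = estimate_wmm(toy_instances, pseudocount=0.5)
```

In `W_ml` the entry for "g" at the first position is zero because no toy instance starts with g; in `W_smooth` it becomes a small positive value, and every column still sums to 1.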

How do we measure how interesting a putative motif is in this model?

  • Let W be a motif inferred from n putative instances s_1, ..., s_n.
  • For any s_j, we can compare Pr(s_j | W) to the chance Pr(s_j | B) that s_j arose from some background base distribution B.
  • The quantity log [Pr(s_j | W) / Pr(s_j | B)] is the log-likelihood ratio (LLR) for comparing hypotheses W and B given s_j. If this quantity is greater than 0, s_j is more likely to have come from the motif W than from B. If it is less than 0, the opposite is true.
  • Our measure of interest for W is the total LLR score

$$\sum_{j=1}^{n} \log \frac{\Pr(s_j \mid W)}{\Pr(s_j \mid B)}$$

  • Note that the total LLR of a collection of putative instances is the sum of the LLR for each; equivalently, it is the log of the product of the probability ratios for the instances.
  • We hypothesize that there exists a common motif with instances in some or all of these promoter sequences, though we don’t know where the instances are.
  • Our goal is to find the most plausible motif – a set of putative instances matching a motif that is as unlike the background (high total LLR, or small radius for length) as possible.
  • Such a motif (hopefully) explains its putative instances significantly better than does the null hypothesis.
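
The total LLR score described above can be sketched as follows. This is an illustrative sketch only: the function names and the uniform background B are assumptions, W is the example matrix from the notes, and the code relies on every entry of W being nonzero so that the logarithm is defined.

```python
import math

# Example weight matrix from the notes (all entries nonzero).
W = {
    "a": [0.4, 0.1, 0.25, 0.3, 0.1, 0.1],
    "c": [0.1, 0.7, 0.25, 0.2, 0.1, 0.4],
    "g": [0.4, 0.1, 0.25, 0.2, 0.1, 0.1],
    "t": [0.1, 0.1, 0.25, 0.3, 0.7, 0.4],
}

# Assumed i.i.d. background with equal base frequencies (not from the notes).
B = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}


def llr(s, W, B):
    """log(Pr(s | W) / Pr(s | B)), computed as a sum of per-position log ratios."""
    return sum(math.log(W[base][i] / B[base]) for i, base in enumerate(s))


def total_llr(instances, W, B):
    """Total LLR: the sum of the LLRs of the individual instances."""
    return sum(llr(s, W, B) for s in instances)


print(llr("acagtc", W, B))  # positive: better explained by W than by B
print(llr("tgatac", W, B))  # negative: better explained by B than by W
print(total_llr(["acagtc", "tgatac"], W, B))
```

Working with sums of logs rather than products of probabilities also avoids numerical underflow when many instances are scored.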

Finding the best possible motif in a set of sequences is a computationally hard problem!

  • In the consensus model, we typically fix a motif length ℓ, a minimum instance count k, and a radius r, then seek sets of at least k instances that are all within radius r of some common consensus.
  • Algorithms that do this search are often enumerative – they enumerate all possible consensus sequences C and check whether there is a good motif (as defined above) in the data matching each C.
  • Many sneaky tricks can be used to speed up this enumerative search, but its cost is fundamentally exponential in the motif length ℓ and/or the input sequence length.
  • For the WMM, we fix a motif length ℓ and seek a set of putative instances that maximize total LLR of their best WMM vs. the background (which is assumed to consist of all sequence not part of a motif instance).
  • We may make any of several assumptions as to how motif instances are distributed in each input sequence – one instance per sequence, at most one instance per sequence, multiple instances per sequence, etc.
  • While enumerative search can be used here as well, a more common approach is local search: guess an initial WMM W_0 matching a set S_0 of instances present in the input, then progressively make small changes to S_0 and W_0 so as to improve the motif’s total LLR.
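  • There are two well-known, principled strategies for local search: expectation maximization and Gibbs sampling. Expectation maximization never decreases the likelihood from one iteration to the next, so it settles on a motif that is locally best in the “neighborhood” of the original guess; Gibbs sampling is randomized, so it tends toward such a motif without the same deterministic guarantee.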
  • These approaches can yield good (though not necessarily globally optimal) answers much faster than enumeration. They are the basis for well-known programs like MEME and GibbsDNA.
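
The brute-force enumerative search for the consensus model mentioned above can be sketched as follows, without any of the speed-up tricks. Two assumptions are made here because their definitions are not part of this excerpt: the radius is measured as Hamming distance, and every length-ℓ window of every input sequence counts as a candidate instance. The function names and toy sequences are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def enumerate_consensus_motifs(seqs, length, k, r):
    """Try every possible consensus of the given length.

    Returns {consensus: [(sequence index, offset), ...]} for each consensus
    that has at least k windows within Hamming distance r.
    """
    # Every window of the right length in every input sequence.
    windows = [
        (j, i, s[i:i + length])
        for j, s in enumerate(seqs)
        for i in range(len(s) - length + 1)
    ]
    found = {}
    for letters in product("acgt", repeat=length):  # 4**length candidates
        consensus = "".join(letters)
        hits = [(j, i) for j, i, w in windows if hamming(consensus, w) <= r]
        if len(hits) >= k:
            found[consensus] = hits
    return found


# Made-up sequences: "acgt" occurs exactly in the first two and with one
# mismatch ("actt") in the third.
seqs = ["ttacgtaa", "ggacgtcc", "caacttgg"]
motifs = enumerate_consensus_motifs(seqs, length=4, k=3, r=1)
print(motifs["acgt"])
```

The outer loop visits 4**length candidate consensus strings, which is the exponential cost in the motif length noted above; this is only practical for short motifs.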

5 Using Homologous Sequences in Motif Finding

Conservation can sometimes provide an additional signal to make motif finding easier.

  • Because a motif is a functional sequence element, it is subject to evolutionary pressures, in particular conservation against change.
  • As with any functional feature, we expect that a motif instance will be more strongly conserved against mutation than the surrounding sequence (assuming the latter is not itself functional!).
  • Hence, if we have several homologous sequences, each of which contains a motif instance at the same location, aligning them should cause the motif to stand out as better conserved than its surroundings.
  • Using this approach to locate motifs is called phylogenetic footprinting: the motif leaves a “footprint” of higher-than-normal conservation in the alignment.
  • An example of this approach is the EvoPrinter tool.
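
The footprinting idea above can be illustrated with a toy calculation: score each column of an alignment by how often its most common base appears, then look for the window with the highest average score. This is a deliberately simple sketch and not how EvoPrinter or any other tool works; the function names, the conservation measure, and the gap-free toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of rows sharing the most common character in each column."""
    n = len(alignment)
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("aligned rows must have equal length")
    return [
        Counter(row[i] for row in alignment).most_common(1)[0][1] / n
        for i in range(width)
    ]


def best_window(scores, length):
    """Return (start, mean score) of the highest-scoring window."""
    best_start, best_mean = 0, -1.0
    for start in range(len(scores) - length + 1):
        mean = sum(scores[start:start + length]) / length
        if mean > best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean


# Made-up, gap-free alignment: columns 3 to 8 hold an identical block
# ("tgacgt") in every row, while the flanking columns all differ.
alignment = [
    "acttgacgtcag",
    "ggatgacgtatc",
    "tactgacgtgca",
    "ctgtgacgttgt",
]
scores = column_conservation(alignment)
print(best_window(scores, 6))
```

A realistic score would also need to handle gap characters and to compare the window against a null model for homologous background sequence, which this sketch does not attempt.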

The footprinting approach is subject to several challenges.

  • First, the input sequences must be globally alignable with high confidence; otherwise, errors due to incorrect placement of indels could result in instances of the motif not being aligned at all!
  • Second, the motif should appear at the same place in each of the sequences being aligned. Recall that motifs can arise convergently; they can also “disappear” from one place in a promoter or enhancer region and arise independently elsewhere in the same region.
  • Hence, it’s important to align homologous sequences that diverged on a time scale shorter than that of motif instance migration.
  • Third, it must be possible to distinguish the enhanced conservation of the motif from the background. But the null model for homologous background sequences is different from that for unrelated sequences.