
Formal Languages and Automata Theory
D. Goswami and K. V. Krishna
November 5, 2010


Contents

Chapter 1

Mathematical Preliminaries

Chapter 2

Formal Languages

A language can be seen as a system suitable for the expression of certain ideas, facts and concepts. To formalize the notion of a language, one must cover all varieties of languages, such as natural (human) languages and programming languages. Let us look at some features common across languages. Broadly, a language is a collection of sentences; a sentence is a sequence of words; and a word is a combination of syllables. If a language has a script, then a word is a sequence of symbols of its underlying alphabet. Formal learning of a language has the following three steps.

  1. Learning its alphabet - the symbols that are used in the language.
  2. Its words - as various sequences of symbols of its alphabet.
  3. Formation of sentences - sequence of various words that follow certain rules of the language.

In this learning, step 3 is the most difficult part. Let us postpone the construction of sentences and concentrate on steps 1 and 2. For the time being, instead of ignoring sentences completely, one may observe what a word and a sentence have in common: both are just sequences of symbols of the underlying alphabet. For example, the English sentence "The English articles - a, an and the - are categorized into two types: indefinite and definite." may be treated as a sequence of symbols from the Roman alphabet together with punctuation marks such as the comma, full stop and colon, and one more special symbol, the blank space, which is used to separate two words. Thus, abstractly, a sentence or a word may be interchangeably used

Clearly, the binary operation of concatenation on $\Sigma^*$ is associative, i.e., for all $x, y, z \in \Sigma^*$, $x(yz) = (xy)z$.

Thus, $x(yz)$ may simply be written as $xyz$. Also, since $\varepsilon$ is the empty string, it satisfies the property $\varepsilon x = x\varepsilon = x$

for any string $x \in \Sigma^*$. Hence, $\Sigma^*$ is a monoid with respect to concatenation. The operation of concatenation is not commutative on $\Sigma^*$. For a string $x$ and an integer $n \ge 0$, we write

$$x^{n+1} = x^n x \quad \text{with the base condition } x^0 = \varepsilon.$$

That is, $x^n$ is obtained by concatenating $n$ copies of $x$. Also, whenever $n = 0$, the string $x_1 \cdots x_n$ represents the empty string $\varepsilon$. Let $x$ be a string over an alphabet $\Sigma$. For $a \in \Sigma$, the number of occurrences of $a$ in $x$ shall be denoted by $|x|_a$. The length of a string $x$, denoted by $|x|$, is defined as

$$|x| = \sum_{a \in \Sigma} |x|_a.$$

Essentially, the length of a string is obtained by counting the number of symbols in the string. For example, $|aab| = 3$ and $|a| = 1$. Note that $|\varepsilon| = 0$. If we denote by $A_n$ the set of all strings of length $n$ over $\Sigma$, then one can easily ascertain that

$$\Sigma^* = \bigcup_{n \ge 0} A_n.$$

Hence, since each $A_n$ is a finite set, $\Sigma^*$ is a countably infinite set. We say that $x$ is a substring of $y$ if $x$ occurs in $y$, that is, $y = uxv$ for some strings $u$ and $v$. The substring $x$ is said to be a prefix of $y$ if $u = \varepsilon$. Similarly, $x$ is a suffix of $y$ if $v = \varepsilon$. Generalizing the notation used for the number of occurrences of a symbol $a$ in a string $x$, we adopt the notation $|y|_x$ for the number of occurrences of a string $x$ in $y$.
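The string notions above can be tried out directly. The following is a minimal Python sketch, not part of the original notes; the function names are hypothetical, and counting overlapping occurrences for $|y|_x$ is an assumption, since the text does not say whether overlaps are counted.

```python
from itertools import product

def power(x, n):
    """x^n, following x^0 = empty string and x^(n+1) = x^n x."""
    return "" if n == 0 else power(x, n - 1) + x

def occurrences(y, x):
    """|y|_x for a nonempty x; overlapping occurrences are counted (an assumption)."""
    return sum(1 for i in range(len(y) - len(x) + 1) if y.startswith(x, i))

def strings_of_length(sigma, n):
    """A_n: the set of all strings of length n over the alphabet sigma."""
    return {"".join(t) for t in product(sigma, repeat=n)}

def is_prefix(x, y):
    return y.startswith(x)   # y = x v for some v

def is_suffix(x, y):
    return y.endswith(x)     # y = u x for some u

def is_substring(x, y):
    return x in y            # y = u x v for some u, v

# The length is the sum of the symbol counts: |aab| = |aab|_a + |aab|_b.
assert len("aab") == sum(occurrences("aab", a) for a in "ab")
```

For a finite alphabet with k symbols, `strings_of_length(sigma, n)` has k^n elements, which is why each $A_n$ is finite while their union is infinite.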

2.2 Languages

We have got acquainted with the formal notion of strings that are basic elements of a language. In order to define the notion of a language in a broad spectrum, it is felt that it can be any collection of strings over an alphabet. Thus we define a language over an alphabet Σ as a subset of Σ∗.

Example 2.2.1.

  1. The empty set $\emptyset$ is a language over any alphabet. Similarly, $\{\varepsilon\}$ is also a language over any alphabet.
  2. The set of all strings over $\{0, 1\}$ that start with 0.
  3. The set of all strings over $\{a, b, c\}$ having $ac$ as a substring.

Remark 2.2.2. Note that $\emptyset \ne \{\varepsilon\}$, because the language $\emptyset$ does not contain any string, whereas $\{\varepsilon\}$ contains a string, namely $\varepsilon$. Also, it is evident that $|\emptyset| = 0$, whereas $|\{\varepsilon\}| = 1$.

Since languages are sets, we can apply various well-known set operations such as union, intersection, complement and difference to languages. The notion of concatenation of strings can be extended to languages as follows. The concatenation of a pair of languages $L_1$, $L_2$ is

$$L_1 L_2 = \{xy \mid x \in L_1 \wedge y \in L_2\}.$$

Example 2.2.3.

  1. If $L_1 = \{0, 1, 01\}$ and $L_2 = \{1, 00\}$, then $L_1 L_2 = \{01, 11, 011, 000, 100, 0100\}$.
  2. For $L_1 = \{b, ba, bab\}$ and $L_2 = \{\varepsilon, b, bb, abb\}$, we have $L_1 L_2 = \{b, ba, bb, bab, bbb, babb, baabb, babbb, bababb\}$.

Remark 2.2.4.

  1. Since concatenation of strings is associative, so is the concatenation of languages. That is, for all languages $L_1$, $L_2$ and $L_3$,

$$(L_1 L_2)L_3 = L_1(L_2 L_3).$$

Hence, $(L_1 L_2)L_3$ may simply be written as $L_1 L_2 L_3$.

  2. The number of strings in $L_1 L_2$ is always less than or equal to the product of the individual numbers, i.e.

$$|L_1 L_2| \le |L_1||L_2|.$$

  3. For a nonempty language $L_1$, $L_1 \subseteq L_1 L_2$ if and only if $\varepsilon \in L_2$.
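Concatenation of finite languages is easy to experiment with. The sketch below, which is an illustration and not part of the original notes, models a language as a Python set of strings (with `""` standing for ε) and replays Example 2.2.3 and Remark 2.2.4 on sample languages; the name `concat` is hypothetical.

```python
def concat(l1, l2):
    """Concatenation L1 L2 = {xy | x in L1 and y in L2}."""
    return {x + y for x in l1 for y in l2}

# Example 2.2.3 (1)
L1, L2 = {"0", "1", "01"}, {"1", "00"}
assert concat(L1, L2) == {"01", "11", "011", "000", "100", "0100"}

# Example 2.2.3 (2): 3 * 4 = 12 pairs (x, y), but only 9 distinct strings,
# because, for instance, b.abb, ba.bb and bab.b all give babb.
M1, M2 = {"b", "ba", "bab"}, {"", "b", "bb", "abb"}
M = concat(M1, M2)
assert len(M) == 9 and len(M) <= len(M1) * len(M2)

# Remark 2.2.4 (1): associativity, checked on these samples only.
assert concat(concat(L1, L2), M1) == concat(L1, concat(L2, M1))

# Remark 2.2.4 (3), checked on these samples only.
assert M1 <= M                    # the empty string is in M2
assert not L1 <= concat(L1, L2)   # the empty string is not in L2
```

The second example shows why the bound in Remark 2.2.4 is an inequality: different pairs can concatenate to the same string.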

The positive closure of a language $L$, denoted by $L^+$, is defined as

$$L^+ = \bigcup_{n \ge 1} L^n.$$

Thus, $L^* = L^+ \cup \{\varepsilon\}$.

We can often describe a formal language in English by stating the property that is to be satisfied by its strings. However, describing languages in set-builder form is desirable, not only for elegant representation but also for understanding the properties of the languages better. Consider the set of all strings over $\{0, 1\}$ that start with 0. Note that each such string can be seen as $0x$ for some $x \in \{0, 1\}^*$. Thus the language can be represented by $\{0x \mid x \in \{0, 1\}^*\}$.
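Since $L^+$ and $L^*$ are usually infinite, a program can only list their strings up to some length bound. The following Python sketch is an illustration under that assumption and is not part of the original notes; the function names and the `max_len` parameter are hypothetical. Truncating by length is exact here, because a string of length at most `max_len` can only be built from pieces of length at most `max_len`.

```python
def concat(l1, l2, max_len):
    """Strings of L1 L2 whose length is at most max_len."""
    return {x + y for x in l1 for y in l2 if len(x) + len(y) <= max_len}

def positive_closure(lang, max_len):
    """Strings of L+ (the union of L^n for n >= 1) of length at most max_len."""
    result = set()
    frontier = {x for x in lang if len(x) <= max_len}   # L^1, truncated
    while frontier:
        result |= frontier
        # Extend only the newly found strings by one more factor from L.
        frontier = concat(frontier, lang, max_len) - result
    return result

def kleene_star(lang, max_len):
    """Strings of L* = L+ united with {empty string}, of length at most max_len."""
    return positive_closure(lang, max_len) | {""}

# {0x | x in {0,1}*}, restricted to strings of length at most 3.
starts_with_0 = concat({"0"}, kleene_star({"0", "1"}, 2), 3)
assert all(w.startswith("0") for w in starts_with_0)
```

For instance, `positive_closure({"ab"}, 6)` contains ab, abab and ababab, and `kleene_star({"ab"}, 6)` additionally contains the empty string.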

Examples

  1. The set of all strings over $\{a, b, c\}$ that have $ac$ as a substring can be written as $\{xacy \mid x, y \in \{a, b, c\}^*\}$. This can also be written as

$$\{x \in \{a, b, c\}^* \mid |x|_{ac} \ge 1\},$$

that is, the set of all strings over $\{a, b, c\}$ in which the number of occurrences of the substring $ac$ is at least 1.

  2. The set of all strings over some alphabet $\Sigma$ with an even number of $a$'s is

$$\{x \in \Sigma^* \mid |x|_a = 2n, \text{ for some } n \in \mathbb{N}\}.$$

Equivalently, $\{x \in \Sigma^* \mid |x|_a \equiv 0 \bmod 2\}$.

  3. The set of all strings over some alphabet $\Sigma$ with an equal number of $a$'s and $b$'s can be written as

$$\{x \in \Sigma^* \mid |x|_a = |x|_b\}.$$

  4. The set of all palindromes over an alphabet $\Sigma$ can be written as

$$\{x \in \Sigma^* \mid x = x^R\},$$

where $x^R$ is the string obtained by reversing $x$.

  5. The set of all strings over some alphabet $\Sigma$ that have an $a$ in the 5th position from the right can be written as

$$\{xay \mid x, y \in \Sigma^* \text{ and } |y| = 4\}.$$

  6. The set of all strings over some alphabet $\Sigma$ with no consecutive $a$'s can be written as $\{x \in \Sigma^* \mid |x|_{aa} = 0\}$.
  7. The set of all strings over $\{a, b\}$ in which no occurrence of $b$ comes before an occurrence of $a$ can be written as

$$\{a^m b^n \mid m, n \ge 0\}.$$

Note that this is the set of all strings over $\{a, b\}$ which do not contain $ba$ as a substring.
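Each set-builder description above is a membership condition, so it can be written as a predicate on strings. The Python sketch below is an illustration, not part of the original notes; all function names are hypothetical. It also checks the closing remark of the last example by brute force over short strings, which is evidence rather than a proof.

```python
from itertools import product

def all_strings(sigma, max_len):
    """All strings over sigma of length at most max_len."""
    return ["".join(t) for n in range(max_len + 1) for t in product(sigma, repeat=n)]

def has_ac(x):                 # |x|_ac >= 1
    return "ac" in x

def even_as(x):                # |x|_a is congruent to 0 mod 2
    return x.count("a") % 2 == 0

def equal_as_bs(x):            # |x|_a = |x|_b
    return x.count("a") == x.count("b")

def is_palindrome(x):          # x = x^R
    return x == x[::-1]

def a_fifth_from_right(x):     # x = u a y with |y| = 4
    return len(x) >= 5 and x[-5] == "a"

def no_consecutive_as(x):      # |x|_aa = 0
    return "aa" not in x

def as_then_bs(x):             # x = a^m b^n; intended for strings over {a, b}
    m = len(x) - len(x.lstrip("a"))
    return x[m:] == "b" * (len(x) - m)

# a^m b^n versus "does not contain ba", for all strings over {a, b} up to length 7.
assert all(as_then_bs(x) == ("ba" not in x) for x in all_strings("ab", 7))
```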

2.3 Properties

The usual set-theoretic properties with respect to union, intersection, complement, difference, etc. hold even in the context of languages. Now we observe certain properties of languages with respect to the newly introduced operations concatenation, Kleene closure, and positive closure. In what follows, $L$, $L_1$, $L_2$, $L_3$ and $L_4$ are languages.

P1 Recall that concatenation of languages is associative.

P2 Since concatenation of strings is not commutative, we have $L_1 L_2 \ne L_2 L_1$, in general.

P3 L{ε} = {ε}L = L.

P4 L∅ = ∅L = ∅.

Proof. Let $x \in L\emptyset$; then $x = x_1 x_2$ for some $x_1 \in L$ and $x_2 \in \emptyset$. But $\emptyset$, being the empty set, cannot hold any element. Hence there cannot be any element $x \in L\emptyset$, so that $L\emptyset = \emptyset$. Similarly, $\emptyset L = \emptyset$ as well.

P5 Distributive Properties:

  1. $(L_1 \cup L_2)L_3 = L_1 L_3 \cup L_2 L_3$.

Proof. Suppose $x \in L^* L$. Then $x = yz$ for some $y \in L^*$ and $z \in L$. But $y \in L^*$ implies $y = y_1 \cdots y_n$ with $y_i \in L$ for all $i$. Hence,

$$x = yz = (y_1 \cdots y_n)z = y_1(y_2 \cdots y_n z) \in LL^*.$$

The converse is similar. Hence, $L^* L = LL^*$. Further, when $x \in L^* L$, as above, $x = y_1 \cdots y_n z$ is clearly in $L^+$. On the other hand, $x \in L^+$ implies $x = x_1 \cdots x_m$ with $m \ge 1$ and $x_i \in L$ for all $i$. Now write $x' = x_1 \cdots x_{m-1}$ so that $x = x' x_m$. Here, note that $x' \in L^*$; in particular, when $m = 1$, $x' = \varepsilon$. Thus, $x \in L^* L$. Hence, $L^+ = L^* L$.

P11 $(L^*)^* = L^*$.

P12 $L^* L^* = L^*$.

P13 $(L_1 L_2)^* L_1 = L_1 (L_2 L_1)^*$.

Proof. Let $x \in (L_1 L_2)^* L_1$. Then $x = yz$, where $z \in L_1$ and $y = y_1 \cdots y_n \in (L_1 L_2)^*$ with $y_i \in L_1 L_2$. Now each $y_i = u_i v_i$, for $u_i \in L_1$ and $v_i \in L_2$. Note that $v_i u_{i+1} \in L_2 L_1$, for all $i$ with $1 \le i \le n - 1$. Hence, $x = yz = (y_1 \cdots y_n)z = (u_1 v_1 \cdots u_n v_n)z = u_1(v_1 u_2 \cdots v_{n-1} u_n v_n z) \in L_1 (L_2 L_1)^*$. The converse is similar. Hence, $(L_1 L_2)^* L_1 = L_1 (L_2 L_1)^*$.

P14 $(L_1 \cup L_2)^* = (L_1^* L_2^*)^*$.

Proof. Observe that $L_1 \subseteq L_1^*$ and $\{\varepsilon\} \subseteq L_2^*$. Hence, by properties P and P6, we have $L_1 = L_1\{\varepsilon\} \subseteq L_1^* L_2^*$. Similarly, $L_2 \subseteq L_1^* L_2^*$. Hence, $L_1 \cup L_2 \subseteq L_1^* L_2^*$. Consequently, $(L_1 \cup L_2)^* \subseteq (L_1^* L_2^*)^*$. For the converse, observe that $L_1^* \subseteq (L_1 \cup L_2)^*$. Similarly, $L_2^* \subseteq (L_1 \cup L_2)^*$. Thus, $L_1^* L_2^* \subseteq (L_1 \cup L_2)^* (L_1 \cup L_2)^*$. But, by property P12, we have $(L_1 \cup L_2)^* (L_1 \cup L_2)^* = (L_1 \cup L_2)^*$, so that $L_1^* L_2^* \subseteq (L_1 \cup L_2)^*$. Hence,

$$(L_1^* L_2^*)^* \subseteq ((L_1 \cup L_2)^*)^* = (L_1 \cup L_2)^*.$$
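Properties P11 to P14 can be sanity-checked on concrete languages by comparing both sides up to a length bound. The Python sketch below is an illustration, not part of the original notes; the helper names, the sample languages and the bound `N` are all assumptions chosen for the example. A passing check on samples is evidence only; the proofs above are what establish the properties.

```python
def concat(l1, l2, max_len):
    """Strings of L1 L2 whose length is at most max_len."""
    return {x + y for x in l1 for y in l2 if len(x) + len(y) <= max_len}

def star(lang, max_len):
    """Strings of L* of length at most max_len, computed as a fixpoint."""
    result = {""}
    frontier = {""}
    while frontier:
        frontier = concat(frontier, lang, max_len) - result
        result |= frontier
    return result

N = 6                                   # hypothetical length bound
L1, L2 = {"a", "ab"}, {"b", "ba"}       # hypothetical sample languages

# P11: (L*)* = L*
assert star(star(L1, N), N) == star(L1, N)
# P12: L* L* = L*
assert concat(star(L1, N), star(L1, N), N) == star(L1, N)
# P13: (L1 L2)* L1 = L1 (L2 L1)*
assert concat(star(concat(L1, L2, N), N), L1, N) == concat(L1, star(concat(L2, L1, N), N), N)
# P14: (L1 u L2)* = (L1* L2*)*
assert star(L1 | L2, N) == star(concat(star(L1, N), star(L2, N), N), N)
```

Truncation does not distort the comparison: every string of length at most `N` on either side is a concatenation of pieces that are themselves of length at most `N`, so both sides are computed exactly up to the bound.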

2.4 Finite Representation

Proficiency in a language does not require one to know all the sentences of the language; rather, with some limited information one should be able to come up with all possible sentences of the language. Even in the case of programming languages, a compiler validates a program - a sentence in the programming language - with a finite set of instructions incorporated in it. Thus, we are interested in a finite representation of a language - that is, by giving a finite amount of information, all the strings of a language shall be enumerated/validated.

Now, we look at the languages for which a finite representation is possible. Given an alphabet $\Sigma$, to start with, the single-string language $\{x\}$ and the language $\emptyset$ have finite representations, say $x$ and $\emptyset$, respectively. In this way, finite languages can also be given a finite representation, say, by enumerating all the strings. Thus, giving a finite representation for infinite languages is the nontrivial and interesting problem. In this context, the operations on languages may be helpful. For example, the infinite language $\{\varepsilon, ab, abab, ababab, \ldots\}$ can be considered as the Kleene star of the language $\{ab\}$, that is, $\{ab\}^*$. Thus, using the Kleene star operation we can have a finite representation for some infinite languages.

While operations are under consideration, to give finite representations for languages one may first look at the indivisible languages, namely $\emptyset$, $\{\varepsilon\}$, and $\{a\}$, for all $a \in \Sigma$, as basis elements. To construct $\{x\}$, for $x \in \Sigma^*$, we can use the operation of concatenation over the basis elements. For example, if $x = aba$ then choose $\{a\}$ and $\{b\}$, and concatenate $\{a\}\{b\}\{a\}$ to get $\{aba\}$. Any finite language over $\Sigma$, say $\{x_1, \ldots, x_n\}$, can be obtained by considering the union $\{x_1\} \cup \cdots \cup \{x_n\}$.

In this section, we look at the aspects of considering operations over basis elements to represent a language. This is one aspect of representing a language. There are many other aspects of giving finite representations; some of them will be considered in later chapters.

2.4.1 Regular Expressions

We now consider the class of languages obtained by applying union, concatenation, and Kleene star finitely many times on the basis elements. These languages are known as regular languages and the corresponding finite representations are known as regular expressions.

Definition 2.4.1 (Regular Expression). We define a regular expression over an alphabet Σ recursively as follows.

Example 2.4.8. The language $L$ over $\{0, 1\}$ that contains 01 or 10 as a substring is regular.

$$\begin{aligned}
L &= \{x \mid 01 \text{ is a substring of } x\} \cup \{x \mid 10 \text{ is a substring of } x\} \\
  &= \{y01z \mid y, z \in \Sigma^*\} \cup \{u10v \mid u, v \in \Sigma^*\} \\
  &= \Sigma^*\{01\}\Sigma^* \cup \Sigma^*\{10\}\Sigma^*
\end{aligned}$$

Since $\Sigma^*$, $\{01\}$, and $\{10\}$ are regular, $L$ is regular. In fact, at this point, one can easily notice that

$$(0 + 1)^* 01 (0 + 1)^* + (0 + 1)^* 10 (0 + 1)^*$$

is a regular expression representing $L$.

Example 2.4.9. Consider the set of all strings over $\{a, b\}$ which do not contain $ab$ as a substring. By analyzing the language one can observe that it is precisely $\{b^n a^m \mid m, n \ge 0\}$.

Thus, a regular expression for the language is $b^* a^*$ and hence the language is regular.
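The regular expressions of Examples 2.4.8 and 2.4.9 can be compared against the plain-English descriptions by brute force. The sketch below is an illustration, not part of the original notes. It uses Python's `re` module, whose syntax differs from the notation of the text: union is written `|` rather than `+`, and `re.fullmatch` is used so that the whole string must match. The helper `all_strings` and the length bound 8 are assumptions.

```python
import re
from itertools import product

def all_strings(sigma, max_len):
    """All strings over sigma of length at most max_len."""
    return ["".join(t) for n in range(max_len + 1) for t in product(sigma, repeat=n)]

# Example 2.4.8: (0+1)*01(0+1)* + (0+1)*10(0+1)* in Python syntax.
r_248 = re.compile(r"(0|1)*01(0|1)*|(0|1)*10(0|1)*")
# Example 2.4.9: b*a*
r_249 = re.compile(r"b*a*")

for x in all_strings("01", 8):
    assert (r_248.fullmatch(x) is not None) == ("01" in x or "10" in x)

for x in all_strings("ab", 8):
    assert (r_249.fullmatch(x) is not None) == ("ab" not in x)
```

Agreement on all strings up to length 8 is a check on the examples, not a proof that the expressions represent the languages.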

Example 2.4.10. The set of strings over $\{a, b\}$ which contain an odd number of $a$'s is regular. Although the set can be represented in set-builder form as

$$\{x \in \{a, b\}^* \mid |x|_a = 2n + 1, \text{ for some } n\},$$

writing a regular expression for the language is a little tricky. Hence, we postpone the argument to Chapter 3 (see Example 3.3.6), where we construct a regular grammar for the language. A regular grammar is a tool to generate regular languages.

Example 2.4.11. The set of strings over $\{a, b\}$ which contain an odd number of $a$'s and an even number of $b$'s is regular. As above, a set-builder form of the set is:

$$\{x \in \{a, b\}^* \mid |x|_a = 2n + 1, \text{ for some } n, \text{ and } |x|_b = 2m, \text{ for some } m\}.$$

Writing a regular expression for the language is even trickier than in the earlier example. This will be handled in Chapter 4 using finite automata, yet another tool to represent regular languages.

Definition 2.4.12. Two regular expressions $r_1$ and $r_2$ are said to be equivalent if they represent the same language; in which case, we write $r_1 \approx r_2$.

Example 2.4.13. The regular expressions $(10 + 1)^*$ and $((10)^* 1^*)^*$ are equivalent. Since $L((10)^*) = \{(10)^n \mid n \ge 0\}$ and $L(1^*) = \{1^m \mid m \ge 0\}$, we have $L((10)^* 1^*) = \{(10)^n 1^m \mid m, n \ge 0\}$. This implies

$$\begin{aligned}
L(((10)^* 1^*)^*) &= \{(10)^{n_1} 1^{m_1} (10)^{n_2} 1^{m_2} \cdots (10)^{n_l} 1^{m_l} \mid m_i, n_i \ge 0 \text{ and } 1 \le i \le l\} \\
&= \{x_1 x_2 \cdots x_k \mid x_i = 10 \text{ or } 1\}, \text{ where } k = \sum_{i=1}^{l} (m_i + n_i) \\
&\subseteq L((10 + 1)^*).
\end{aligned}$$

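Conversely, suppose $x \in L((10 + 1)^*)$. Then,

$$\begin{aligned}
& x = x_1 x_2 \cdots x_p \text{ where } x_i = 10 \text{ or } 1 \\
\implies\ & x = (10)^{p_1} 1^{q_1} (10)^{p_2} 1^{q_2} \cdots (10)^{p_r} 1^{q_r} \text{ for } p_i, q_j \ge 0 \\
\implies\ & x \in L(((10)^* 1^*)^*).
\end{aligned}$$

Hence, $L((10 + 1)^*) = L(((10)^* 1^*)^*)$ and consequently, $(10 + 1)^* \approx ((10)^* 1^*)^*$.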

From property P14, by choosing $L_1 = \{10\}$ and $L_2 = \{1\}$, one may notice that $(\{10\} \cup \{1\})^* = (\{10\}^* \{1\}^*)^*$.

Since 10 and 1 represent the regular languages $\{10\}$ and $\{1\}$, respectively, from the above equation we get

$$(10 + 1)^* \approx ((10)^* 1^*)^*.$$
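The equivalence $(10 + 1)^* \approx ((10)^* 1^*)^*$ can also be observed experimentally by checking that the two expressions accept exactly the same strings up to some length. The sketch below is an illustration, not part of the original notes; it uses Python's `re` syntax (`|` for union, `re.fullmatch` for whole-string matching), and the helper name and the bound 10 are assumptions. A bounded check of this kind can refute an equivalence but cannot prove one; the proof is the argument given above.

```python
import re
from itertools import product

def all_strings(sigma, max_len):
    """All strings over sigma of length at most max_len."""
    return ["".join(t) for n in range(max_len + 1) for t in product(sigma, repeat=n)]

r1 = re.compile(r"(10|1)*")        # (10 + 1)*
r2 = re.compile(r"((10)*1*)*")     # ((10)* 1*)*

def accepts(r, x):
    """True if the whole string x is matched by the compiled expression r."""
    return r.fullmatch(x) is not None

# Both expressions agree on every binary string of length at most 10.
assert all(accepts(r1, x) == accepts(r2, x) for x in all_strings("01", 10))
```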

Since those properties hold good for all languages, by specializing them to regular languages and in turn replacing the languages by the corresponding regular expressions, we get the following identities for regular expressions. Let $r$, $r_1$, $r_2$, and $r_3$ be any regular expressions.

  1. $r\varepsilon \approx \varepsilon r \approx r$.
  2. $r_1 r_2 \not\approx r_2 r_1$, in general.
  3. $r_1 (r_2 r_3) \approx (r_1 r_2) r_3$.
  4. $r\emptyset \approx \emptyset r \approx \emptyset$.
  5. $\emptyset^* \approx \varepsilon$.
  6. $\varepsilon^* \approx \varepsilon$.

Chapter 3

Grammars

In this chapter, we introduce the notion of a grammar called context-free grammar (CFG) as a language generator. The notion of derivation is instrumental in understanding how strings are generated in a grammar. We explain the various properties of derivations using a graphical representation called derivation trees. A special case of CFG, viz. regular grammar, is discussed as a tool to generate regular languages. A more general notion of grammars is presented in Chapter 7.

In the context of natural languages, the grammar of a language is a set of rules which are used to construct/validate sentences of the language. It was pointed out in the introduction of Chapter 2 that this is the third step in the formal learning of a language. We now draw the reader's attention to the general features of the grammars of natural languages, in order to formalize the notion in the present context and so facilitate a better understanding of formal languages. Consider the English sentence

The students study automata theory.

To see that the sentence is grammatically correct, one may apply certain rules of English grammar to the sentence and validate it. For instance, the Article the followed by the Noun students forms a Noun-phrase, and similarly the Noun automata theory forms a Noun-phrase. Further, study is a Verb. Now, choose the Sentential form “Subject Verb Object” of English grammar. As a Subject or an Object can be a Noun-phrase, by plugging in the above words one may conclude that the given sentence is a grammatically correct English sentence. This verification/derivation is depicted in Figure 3.1. The derivation can also be represented by a tree structure as shown in Figure 3.2.

Sentence ⇒ Subject Verb Object
         ⇒ Noun-phrase Verb Object
         ⇒ Article Noun Verb Object
         ⇒ The Noun Verb Object
         ⇒ The students Verb Object
         ⇒ The students study Object
         ⇒ The students study Noun-phrase
         ⇒ The students study Noun
         ⇒ The students study automata theory

Figure 3.1: Derivation of an English Sentence

In this process, we observe that two types of words are in the discussion.

  1. The words like the, study, students.
  2. The words like Article, Noun, Verb.

The main difference is this: if you arrive at a stage where type (1) words appear, then you need not say anything more about them. If you arrive at a stage where you find a word of type (2), then you are expected to say something more about the word. For example, if the word Article appears, then one should say which article is to be chosen among a, an and the. Let us call the type (1) and type (2) words terminals and nonterminals, respectively, as per their features. Thus, a grammar should include terminals and nonterminals along with a set of rules which attribute some information regarding nonterminal symbols.

3.1 Context-Free Grammars

We now understand that a grammar should have the following components.

  • A set of nonterminal symbols.
  • A set of terminal symbols.
  • A set of rules.
  • As the grammar is to construct/validate sentences of a language, we distinguish a symbol in the set of nonterminals to represent a sentence – from which various sentences of the language can be generated/validated.
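The four components listed above can be written down as plain data. The following Python sketch is an illustration, not the formal definition of a CFG given in the notes; the rule set is a hypothetical reading of the English example discussed earlier, and the names `RULES`, `START` and `leftmost_derivation` are assumptions. The function searches for a derivation that always rewrites the leftmost nonterminal, and it relies on the fact that no rule here produces the empty string.

```python
# Nonterminals are the keys of RULES; every other symbol is a terminal.
RULES = {
    "Sentence":    [["Subject", "Verb", "Object"]],
    "Subject":     [["Noun-phrase"]],
    "Object":      [["Noun-phrase"]],
    "Noun-phrase": [["Article", "Noun"], ["Noun"]],
    "Article":     [["The"]],
    "Noun":        [["students"], ["automata theory"]],
    "Verb":        [["study"]],
}
START = "Sentence"   # the distinguished nonterminal

def leftmost_derivation(form, target):
    """Return the list of sentential forms from `form` to `target`, or None."""
    if len(form) > len(target):       # forms never shrink, so this cannot succeed
        return None
    for i, symbol in enumerate(form):
        if symbol in RULES:           # leftmost nonterminal: try each of its rules
            for rhs in RULES[symbol]:
                rest = leftmost_derivation(form[:i] + rhs + form[i + 1:], target)
                if rest is not None:
                    return [form] + rest
            return None
        if symbol != target[i]:       # a terminal that disagrees with the target
            return None
    return [form] if form == target else None

sentence = ["The", "students", "study", "automata theory"]
for step in leftmost_derivation([START], sentence):
    print(" ".join(step))
```

With these rules, the printed sentential forms follow the same order of rewriting as the derivation of the sentence The students study automata theory shown in Figure 3.1.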