




























































































Formal Languages and Automata Theory
Typology: Lecture notes
A language can be seen as a system suitable for expressing certain ideas, facts and concepts. To formalize the notion of a language, one must cover all varieties of languages, such as natural (human) languages and programming languages. Let us look at some features common across languages. Broadly, a language is a collection of sentences; a sentence is a sequence of words; and a word is a combination of syllables. If a language has a script, then a word is a sequence of symbols of its underlying alphabet. A formal learning of a language has the following three steps.
In this learning, step 3 is the most difficult part. Let us postpone the discussion of constructing sentences and concentrate on steps 1 and 2. For the time being, instead of ignoring sentences completely, one may look at the common features of a word and a sentence and agree that both are just sequences of symbols of the underlying alphabet. For example, the English sentence "The English articles - a, an and the - are categorized into two types: indefinite and definite." may be treated as a sequence of symbols from the Roman alphabet together with punctuation marks such as comma, full stop and colon, and one more special symbol, namely the blank space, which is used to separate two words. Thus, abstractly, a sentence or a word may be interchangeably used
Clearly, the binary operation concatenation on $\Sigma^*$ is associative, i.e., for all $x, y, z \in \Sigma^*$, $x(yz) = (xy)z$.
Thus, $x(yz)$ may simply be written as $xyz$. Also, since $\varepsilon$ is the empty string, it satisfies the property $\varepsilon x = x\varepsilon = x$
for any string $x \in \Sigma^*$. Hence, $\Sigma^*$ is a monoid with respect to concatenation. The operation concatenation is not commutative on $\Sigma^*$. For a string $x$ and an integer $n \ge 0$, we write
$x^{n+1} = x^n x$ with the base condition $x^0 = \varepsilon$.
That is, $x^n$ is obtained by concatenating $n$ copies of $x$. Also, whenever $n = 0$, the string $x_1 \cdots x_n$ represents the empty string $\varepsilon$. Let $x$ be a string over an alphabet $\Sigma$. For $a \in \Sigma$, the number of occurrences of $a$ in $x$ shall be denoted by $|x|_a$. The length of a string $x$, denoted by $|x|$, is defined as
$$|x| = \sum_{a \in \Sigma} |x|_a.$$
Essentially, the length of a string is obtained by counting the number of symbols in the string. For example, $|aab| = 3$ and $|a| = 1$. Note that $|\varepsilon| = 0$. If we denote by $A_n$ the set of all strings of length $n$ over $\Sigma$, then one can easily ascertain that
$$\Sigma^* = \bigcup_{n \ge 0} A_n.$$
Hence, since each $A_n$ is a finite set, $\Sigma^*$ is a countably infinite set. We say that $x$ is a substring of $y$ if $x$ occurs in $y$, that is, $y = uxv$ for some strings $u$ and $v$. The substring $x$ is said to be a prefix of $y$ if $u = \varepsilon$. Similarly, $x$ is a suffix of $y$ if $v = \varepsilon$. Generalizing the notation used for the number of occurrences of a symbol $a$ in a string $x$, we adopt the notation $|y|_x$ for the number of occurrences of a string $x$ in $y$.
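The string notions above can be tried out directly. The following is a minimal sketch in Python, assuming strings over an alphabet are modelled as `str` values; the helper names (`power`, `occurrences`, `strings_of_length`) are illustrative and not part of the notes. It also assumes that occurrences of a substring are counted with overlaps, which the text does not specify.

```python
from itertools import product

def power(x: str, n: int) -> str:
    """Return x^n using x^0 = empty string and x^(n+1) = x^n x."""
    return "" if n == 0 else power(x, n - 1) + x

def occurrences(y: str, x: str) -> int:
    """|y|_x: occurrences of x in y, overlaps counted (assumed convention)."""
    if not x:
        raise ValueError("x must be a nonempty string")
    return sum(1 for i in range(len(y) - len(x) + 1) if y.startswith(x, i))

def strings_of_length(sigma: str, n: int) -> set:
    """A_n: the set of all strings of length n over the alphabet sigma."""
    return {"".join(t) for t in product(sigma, repeat=n)}

def is_prefix(x: str, y: str) -> bool:
    """x is a prefix of y when y = xv for some string v."""
    return y.startswith(x)

def is_suffix(x: str, y: str) -> bool:
    """x is a suffix of y when y = ux for some string u."""
    return y.endswith(x)

sigma = "ab"
x = "aab"
# The length is the sum of the symbol counts: |x| = sum over a of |x|_a.
assert len(x) == sum(occurrences(x, a) for a in sigma)
# A_n has |sigma|^n elements, so every A_n is finite.
assert len(strings_of_length(sigma, 3)) == len(sigma) ** 3
```

Python's built-in `+` on strings is concatenation, so associativity and the identity role of the empty string `""` hold exactly as in the monoid described above.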
2.2 Languages
We are now acquainted with the formal notion of strings, which are the basic elements of a language. To define the notion of a language in a broad sense, we allow it to be any collection of strings over an alphabet. Thus we define a language over an alphabet $\Sigma$ as a subset of $\Sigma^*$.
Example 2.2.1.
Remark 2.2.2. Note that $\emptyset \neq \{\varepsilon\}$, because the language $\emptyset$ does not contain any string but $\{\varepsilon\}$ contains a string, namely $\varepsilon$. Also it is evident that $|\emptyset| = 0$, whereas $|\{\varepsilon\}| = 1$.
Since languages are sets, we can apply various well-known set operations such as union, intersection, complement and difference to languages. The notion of concatenation of strings can be extended to languages as follows. The concatenation of a pair of languages $L_1, L_2$ is
$$L_1L_2 = \{xy \mid x \in L_1 \wedge y \in L_2\}.$$
Example 2.2.3.
Remark 2.2.4.
$(L_1L_2)L_3 = L_1(L_2L_3)$.
Hence, $(L_1L_2)L_3$ may simply be written as $L_1L_2L_3$.
$|L_1L_2| \le |L_1||L_2|$.
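For finite languages, concatenation and the two facts in the remark can be checked mechanically. This is a small sketch, assuming finite languages are represented as Python sets of `str`; the function name `concat` and the sample languages are chosen here for illustration only.

```python
def concat(l1: set, l2: set) -> set:
    """L1 L2 = {xy | x in L1 and y in L2}, for finite languages."""
    return {x + y for x in l1 for y in l2}

L1 = {"a", "ab"}
L2 = {"b", "bb"}
L3 = {"", "c"}

# Associativity: (L1 L2) L3 = L1 (L2 L3).
assert concat(concat(L1, L2), L3) == concat(L1, concat(L2, L3))

# The size bound can be strict: a.bb and ab.b give the same string abb.
print(sorted(concat(L1, L2)))  # ['ab', 'abb', 'abbb']
assert len(concat(L1, L2)) < len(L1) * len(L2)
```

The strict inequality in this sample arises because two different pairs (x, y) produce the same string, which is why the remark states a bound rather than an equality.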
The positive closure of a language $L$, denoted by $L^+$, is defined as
$$L^+ = \bigcup_{n \ge 1} L^n.$$
Thus, $L^* = L^+ \cup \{\varepsilon\}$. We can often describe formal languages in English by stating the property that the strings of the respective language must satisfy. Describing languages in set-builder form is desirable, not only for elegant representation but also for understanding the properties of the languages better. Consider the set of all strings over $\{0, 1\}$ that start with 0. Note that each such string can be written as $0x$ for some $x \in \{0, 1\}^*$. Thus the language can be represented by $\{0x \mid x \in \{0, 1\}^*\}$.
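Since $L^+$ and $L^*$ are usually infinite, a program can only list their strings up to a chosen length. The sketch below does that for a finite language given as a Python set of `str`; the names `positive_closure_upto` and `star_upto` and the length bound are assumptions made here for illustration. It builds $L^*$ from $L^+$ exactly as in the relation $L^* = L^+ \cup \{\varepsilon\}$.

```python
def positive_closure_upto(lang: set, max_len: int) -> set:
    """All strings of L+ whose length is at most max_len."""
    base = {w for w in lang if len(w) <= max_len}
    result = set(base)      # strings found so far
    frontier = set(base)    # strings found in the latest round
    while frontier:
        # Extend the newest strings by one more factor from L.
        new = {x + y for x in frontier for y in base
               if len(x + y) <= max_len} - result
        result |= new
        frontier = new
    return result

def star_upto(lang: set, max_len: int) -> set:
    """All strings of L* = L+ united with {empty string}, up to max_len."""
    return positive_closure_upto(lang, max_len) | {""}

print(sorted(star_upto({"ab"}, 6), key=len))  # ['', 'ab', 'abab', 'ababab']

# {0x | x in {0,1}*}, restricted to strings of length at most 3.
starts_with_zero = {"0" + x for x in star_upto({"0", "1"}, 2)}
assert all(w.startswith("0") for w in starts_with_zero)
```

Truncating by length is safe here because concatenation never shortens a string, so every string of length at most `max_len` in $L^+$ is a product of factors that each stay within the bound.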
Examples
$\{x \in \{a, b, c\}^* \mid |x|_{ac} \ge 1\}$,
that is, the set of all strings over $\{a, b, c\}$ in which the substring $ac$ occurs at least once.
$\{x \in \Sigma^* \mid |x|_a = 2n, \text{ for some } n \in \mathbb{N}\}$.
Equivalently, $\{x \in \Sigma^* \mid |x|_a \equiv 0 \bmod 2\}$.
$\{x \in \Sigma^* \mid |x|_a = |x|_b\}$.
$\{x \in \Sigma^* \mid x = x^R\}$,
where $x^R$ is the string obtained by reversing $x$.
$\{xay \mid x, y \in \Sigma^* \text{ and } |y| = 4\}$.
$\{a^mb^n \mid m, n \ge 0\}$.
Note that this is the set of all strings over $\{a, b\}$ which do not contain $ba$ as a substring.
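Each set-builder description above amounts to a membership test on strings. The following sketch writes those tests as Python predicates, reading the intended languages off the set-builder forms shown; all function names are hypothetical. It then checks, for every string over $\{a, b\}$ up to a small length, the closing observation that $\{a^mb^n \mid m, n \ge 0\}$ is exactly the set of strings without $ba$ as a substring.

```python
from itertools import product

def contains_ac(x: str) -> bool:
    return "ac" in x                       # |x|_ac >= 1

def even_number_of_a(x: str) -> bool:
    return x.count("a") % 2 == 0           # |x|_a = 2n for some n

def equal_a_and_b(x: str) -> bool:
    return x.count("a") == x.count("b")    # |x|_a = |x|_b

def is_palindrome(x: str) -> bool:
    return x == x[::-1]                    # x = x^R

def fifth_from_end_is_a(x: str) -> bool:
    return len(x) >= 5 and x[-5] == "a"    # x = u a y with |y| = 4

def is_am_bn(x: str) -> bool:
    m = len(x) - len(x.lstrip("a"))        # length of the leading block of a's
    return x[m:] == "b" * (len(x) - m)     # the rest must consist of b's only

def all_strings(sigma: str, max_len: int):
    """Yield every string over sigma of length at most max_len."""
    for n in range(max_len + 1):
        for t in product(sigma, repeat=n):
            yield "".join(t)

# a^m b^n coincides with "no occurrence of ba" on all short strings.
assert all(is_am_bn(x) == ("ba" not in x) for x in all_strings("ab", 8))
```

An exhaustive check over short strings is evidence rather than a proof, but it is a quick way to catch a wrongly stated set-builder form.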
2.3 Properties
The usual set-theoretic properties with respect to union, intersection, complement, difference, etc. hold even in the context of languages. Now we observe certain properties of languages with respect to the newly introduced operations concatenation, Kleene closure, and positive closure. In what follows, $L$, $L_1$, $L_2$, $L_3$ and $L_4$ are languages.
P1 Recall that concatenation of languages is associative.
P2 Since concatenation of strings is not commutative, we have $L_1L_2 \neq L_2L_1$, in general.
P3 L{ε} = {ε}L = L.
P4 L∅ = ∅L = ∅.
Proof. Let $x \in L\emptyset$; then $x = x_1x_2$ for some $x_1 \in L$ and $x_2 \in \emptyset$. But $\emptyset$, being the empty set, cannot contain any element. Hence there cannot be any element $x \in L\emptyset$, so that $L\emptyset = \emptyset$. Similarly, $\emptyset L = \emptyset$ as well.
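Properties P2 to P4 can be observed on small finite languages. This is an illustrative sketch only, assuming languages are Python sets of `str` and using a hypothetical helper `concat`; `{""}` plays the role of $\{\varepsilon\}$ and `set()` the role of $\emptyset$.

```python
def concat(l1: set, l2: set) -> set:
    """L1 L2 = {xy | x in L1 and y in L2}, for finite languages."""
    return {x + y for x in l1 for y in l2}

L = {"a", "ab"}

# P2: concatenation of languages is not commutative in general.
assert concat({"a"}, {"b"}) != concat({"b"}, {"a"})

# P3: {epsilon} is the identity for concatenation.
assert concat(L, {""}) == concat({""}, L) == L

# P4: concatenating with the empty language gives the empty language,
# because there is no second factor to pair with any string of L.
assert concat(L, set()) == concat(set(), L) == set()
```

The P4 check mirrors the proof above: the set comprehension iterates over pairs, and there are no pairs when one of the sets is empty.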
P5 Distributive Properties:
Proof. Suppose $x \in L^*L$. Then $x = yz$ for some $y \in L^*$ and $z \in L$. But $y \in L^*$ implies $y = y_1 \cdots y_n$ for some $n \ge 0$ with $y_i \in L$ for all $i$. If $n = 0$, then $x = z = z\varepsilon \in LL^*$. Otherwise,
$$x = yz = (y_1 \cdots y_n)z = y_1(y_2 \cdots y_nz) \in LL^*.$$
The converse is similar. Hence, $L^*L = LL^*$. Further, when $x \in L^*L$, as above, $x = y_1 \cdots y_nz$ is clearly in $L^+$. On the other hand, $x \in L^+$ implies $x = x_1 \cdots x_m$ with $m \ge 1$ and $x_i \in L$ for all $i$. Now write $x' = x_1 \cdots x_{m-1}$ so that $x = x'x_m$. Here, note that $x' \in L^*$; in particular, when $m = 1$ we have $x' = \varepsilon$. Thus, $x \in L^*L$. Hence, $L^+ = L^*L$.
Proof. Let $x \in (L_1L_2)^*L_1$. Then $x = yz$, where $z \in L_1$ and $y = y_1 \cdots y_n \in (L_1L_2)^*$ with $y_i \in L_1L_2$. Now each $y_i = u_iv_i$, for $u_i \in L_1$ and $v_i \in L_2$. Note that $v_iu_{i+1} \in L_2L_1$ for all $i$ with $1 \le i \le n - 1$, and $v_nz \in L_2L_1$. Hence, $x = yz = (y_1 \cdots y_n)z = (u_1v_1 \cdots u_nv_n)z = u_1(v_1u_2 \cdots v_{n-1}u_nv_nz) \in L_1(L_2L_1)^*$. (When $n = 0$, we have $x = z \in L_1 \subseteq L_1(L_2L_1)^*$.) The converse is similar. Hence, $(L_1L_2)^*L_1 = L_1(L_2L_1)^*$.
P14 $(L_1 \cup L_2)^* = (L_1^*L_2^*)^*$.
Proof. Observe that $L_1 \subseteq L_1^*$ and $\{\varepsilon\} \subseteq L_2^*$. Hence, by properties P and P6, we have $L_1 = L_1\{\varepsilon\} \subseteq L_1^*L_2^*$. Similarly, $L_2 \subseteq L_1^*L_2^*$. Hence, $L_1 \cup L_2 \subseteq L_1^*L_2^*$. Consequently, $(L_1 \cup L_2)^* \subseteq (L_1^*L_2^*)^*$. For the converse, observe that $L_1^* \subseteq (L_1 \cup L_2)^*$. Similarly, $L_2^* \subseteq (L_1 \cup L_2)^*$. Thus, $L_1^*L_2^* \subseteq (L_1 \cup L_2)^*(L_1 \cup L_2)^*$. But, by property P12, we have $(L_1 \cup L_2)^*(L_1 \cup L_2)^* = (L_1 \cup L_2)^*$ so that $L_1^*L_2^* \subseteq (L_1 \cup L_2)^*$. Hence,
$$(L_1^*L_2^*)^* \subseteq ((L_1 \cup L_2)^*)^* = (L_1 \cup L_2)^*.$$
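The closure identities proved above can be sanity-checked on sample languages by comparing both sides restricted to strings of bounded length. The sketch below is one way to do this, under the assumption that languages are finite Python sets of `str`; the helpers `cat`, `plus`, `star` and the bound `N` are names introduced here, not taken from the notes. Because concatenation never shortens strings, truncating at every step gives exactly the strings of length at most `N` on each side.

```python
def cat(a: set, b: set, n: int) -> set:
    """Strings of AB having length at most n."""
    return {x + y for x in a for y in b if len(x + y) <= n}

def plus(lang: set, n: int) -> set:
    """Strings of L+ = L^1 u L^2 u ... having length at most n."""
    result, layer = set(), {""}
    while True:
        layer = cat(layer, lang, n)   # layer now holds L^k, truncated
        if layer <= result:           # nothing new: later powers add nothing
            return result
        result |= layer

def star(lang: set, n: int) -> set:
    """Strings of L* = L+ united with {empty string}, length at most n."""
    return plus(lang, n) | {""}

N = 6
L1, L2 = {"a", "ab"}, {"b"}

# L+ = L*L = LL*
assert plus(L1, N) == cat(star(L1, N), L1, N) == cat(L1, star(L1, N), N)

# (L1 L2)* L1 = L1 (L2 L1)*
assert cat(star(cat(L1, L2, N), N), L1, N) == cat(L1, star(cat(L2, L1, N), N), N)

# P14: (L1 u L2)* = (L1* L2*)*
assert star(L1 | L2, N) == star(cat(star(L1, N), star(L2, N), N), N)
```

Agreement up to length `N` on one pair of languages does not replace the proofs; it only confirms that the identities behave as stated on a concrete instance.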
2.4 Finite Representation
Proficiency in a language does not require one to know all the sentences of the language; rather, with some limited information one should be able to come up with all possible sentences of the language. Even in the case of programming languages, a compiler validates a program - a sentence in the programming language - with a finite set of instructions incorporated in it. Thus, we are interested in a finite representation of a language - that is, by giving a finite amount of information, all the strings of a language shall be enumerated/validated.
Now, we look at the languages for which a finite representation is possible. Given an alphabet $\Sigma$, to start with, the languages with a single string $\{x\}$ and $\emptyset$ can have finite representations, say $x$ and $\emptyset$, respectively. In this way, finite languages can also be given a finite representation, say, by enumerating all the strings. Thus, giving a finite representation for infinite languages is the nontrivial and interesting problem. In this context, the operations on languages may be helpful. For example, the infinite language $\{\varepsilon, ab, abab, ababab, \ldots\}$ can be considered as the Kleene star of the language $\{ab\}$, that is, $\{ab\}^*$. Thus, using the Kleene star operation we can have finite representations for some infinite languages.
While operations are under consideration, to give finite representations for languages one may first look at the indivisible languages, namely $\emptyset$, $\{\varepsilon\}$, and $\{a\}$, for all $a \in \Sigma$, as basis elements. To construct $\{x\}$, for $x \in \Sigma^*$, we can use the operation concatenation over the basis elements. For example, if $x = aba$ then choose $\{a\}$ and $\{b\}$, and concatenate $\{a\}\{b\}\{a\}$ to get $\{aba\}$. Any finite language over $\Sigma$, say $\{x_1, \ldots, x_n\}$, can be obtained by considering the union $\{x_1\} \cup \cdots \cup \{x_n\}$.
In this section, we look at the aspects of considering operations over basis elements to represent a language. This is one aspect of representing a language.
There are many other aspects to give finite representations; some such aspects will be considered in the later chapters.
We now consider the class of languages obtained by applying union, concatenation, and Kleene star finitely many times on the basis elements. These languages are known as regular languages and the corresponding finite representations are known as regular expressions.
Definition 2.4.1 (Regular Expression). We define a regular expression over an alphabet Σ recursively as follows.
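Example 2.4.8. The language $L$ of all strings over $\Sigma = \{0, 1\}$ that contain 01 or 10 as a substring is regular.
$$\begin{aligned}
L &= \{x \mid 01 \text{ is a substring of } x\} \cup \{x \mid 10 \text{ is a substring of } x\} \\
&= \{y01z \mid y, z \in \Sigma^*\} \cup \{u10v \mid u, v \in \Sigma^*\} \\
&= \Sigma^*\{01\}\Sigma^* \cup \Sigma^*\{10\}\Sigma^*
\end{aligned}$$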
Since $\Sigma^*$, $\{01\}$, and $\{10\}$ are regular, $L$ is regular. In fact, at this point, one can easily notice that
$$(0 + 1)^*01(0 + 1)^* + (0 + 1)^*10(0 + 1)^*$$
is a regular expression representing $L$.
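The regular expression of Example 2.4.8 can be compared against the defining property by brute force over short strings. This is a sketch using Python's standard `re` module, which is a practical pattern syntax rather than the formal notation of the notes: union is written `|` where the notes write `+`. The names `PATTERN`, `in_L` and `agrees_up_to` are chosen here for illustration.

```python
import re
from itertools import product

# (0+1)*01(0+1)* + (0+1)*10(0+1)*  transcribed into Python's re syntax.
PATTERN = re.compile(r"(0|1)*01(0|1)*|(0|1)*10(0|1)*")

def in_L(x: str) -> bool:
    """Defining property of L: x contains 01 or 10 as a substring."""
    return "01" in x or "10" in x

def agrees_up_to(max_len: int) -> bool:
    """Check pattern and property on every binary string up to max_len."""
    for n in range(max_len + 1):
        for t in product("01", repeat=n):
            x = "".join(t)
            if (PATTERN.fullmatch(x) is not None) != in_L(x):
                return False
    return True

assert agrees_up_to(8)
```

`fullmatch` is used because a regular expression in the sense of the notes describes whole strings, whereas `re.search` would look for a match anywhere inside the string.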
Example 2.4.9. Consider the set of all strings over $\{a, b\}$ which do not contain $ab$ as a substring. By analyzing the language one can observe that it is precisely $\{b^na^m \mid m, n \ge 0\}$.
Thus, a regular expression of the language is $b^*a^*$ and hence the language is regular.
Example 2.4.10. The set of strings over $\{a, b\}$ which contain an odd number of $a$'s is regular. Although the set can be represented in set-builder form as
$$\{x \in \{a, b\}^* \mid |x|_a = 2n + 1, \text{ for some } n\},$$
writing a regular expression for the language is a little tricky. Hence, we postpone the argument to Chapter 3 (see Example 3.3.6), where we construct a regular grammar for the language. A regular grammar is a tool to generate regular languages.
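Although the notes defer the argument to Chapter 3, a candidate expression can already be tested experimentally. The sketch below checks one possible expression, $b^*ab^*(ab^*ab^*)^*$, against the set-builder description on all short strings; this candidate is supplied here as an assumption and is not necessarily the expression the notes arrive at. It uses Python's `re` module, and the names `CANDIDATE` and `odd_number_of_a` are illustrative.

```python
import re
from itertools import product

# Candidate: one a (with surrounding b's), followed by a's taken in pairs.
CANDIDATE = re.compile(r"b*ab*(ab*ab*)*")

def odd_number_of_a(x: str) -> bool:
    """Set-builder condition |x|_a = 2n + 1 for some n."""
    return x.count("a") % 2 == 1

def agrees_up_to(max_len: int) -> bool:
    """Compare candidate and condition on every string up to max_len."""
    for n in range(max_len + 1):
        for t in product("ab", repeat=n):
            x = "".join(t)
            if (CANDIDATE.fullmatch(x) is not None) != odd_number_of_a(x):
                return False
    return True

assert agrees_up_to(8)
```

Such a test can reject a wrong candidate quickly, but agreement on short strings is not a proof; the grammar-based argument mentioned in the example is still needed for that.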
Example 2.4.11. The set of strings over $\{a, b\}$ which contain an odd number of $a$'s and an even number of $b$'s is regular. As above, a set-builder form of the set is:
$$\{x \in \{a, b\}^* \mid |x|_a = 2n + 1, \text{ for some } n, \text{ and } |x|_b = 2m, \text{ for some } m\}.$$
Writing a regular expression for the language is even trickier than in the earlier example. This will be handled in Chapter 4 using finite automata, yet another tool to represent regular languages.
Definition 2.4.12. Two regular expressions $r_1$ and $r_2$ are said to be equivalent if they represent the same language; in which case, we write $r_1 \approx r_2$.
Example 2.4.13. The regular expressions $(10 + 1)^*$ and $((10)^*1^*)^*$ are equivalent. Since $L((10)^*) = \{(10)^n \mid n \ge 0\}$ and $L(1^*) = \{1^m \mid m \ge 0\}$, we have $L((10)^*1^*) = \{(10)^n1^m \mid m, n \ge 0\}$. This implies
$$\begin{aligned}
L(((10)^*1^*)^*) &= \{(10)^{n_1}1^{m_1}(10)^{n_2}1^{m_2} \cdots (10)^{n_l}1^{m_l} \mid m_i, n_i \ge 0 \text{ and } 1 \le i \le l\} \\
&= \{x_1x_2 \cdots x_k \mid x_i = 10 \text{ or } 1\}, \quad \text{where } k = \sum_{i=1}^{l} (m_i + n_i) \\
&\subseteq L((10 + 1)^*).
\end{aligned}$$
Conversely, suppose $x \in L((10 + 1)^*)$. Then,
$$\begin{aligned}
x = x_1x_2 \cdots x_p \text{ where } x_i = 10 \text{ or } 1 &\implies x = (10)^{p_1}1^{q_1}(10)^{p_2}1^{q_2} \cdots (10)^{p_r}1^{q_r} \text{ for } p_i, q_j \ge 0 \\
&\implies x \in L(((10)^*1^*)^*).
\end{aligned}$$
Hence, $L((10 + 1)^*) = L(((10)^*1^*)^*)$ and consequently, $(10 + 1)^* \approx ((10)^*1^*)^*$.
From property P14, by choosing $L_1 = \{10\}$ and $L_2 = \{1\}$, one may notice that $(\{10\} \cup \{1\})^* = (\{10\}^*\{1\}^*)^*$.
Since 10 and 1 represent the regular languages $\{10\}$ and $\{1\}$, respectively, from the above equation we get
$$(10 + 1)^* \approx ((10)^*1^*)^*.$$
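The equivalence of Example 2.4.13 can also be observed experimentally by listing, up to a length bound, the strings each expression accepts. The sketch below uses Python's `re` module with `|` in place of `+`; the names `R1`, `R2` and `language_up_to` are assumptions made for this illustration.

```python
import re
from itertools import product

R1 = re.compile(r"(10|1)*")      # (10 + 1)*
R2 = re.compile(r"((10)*1*)*")   # ((10)* 1*)*

def language_up_to(pattern, max_len: int) -> set:
    """Binary strings of length at most max_len fully matched by pattern."""
    return {
        "".join(t)
        for n in range(max_len + 1)
        for t in product("01", repeat=n)
        if pattern.fullmatch("".join(t))
    }

# Both expressions accept exactly the same strings up to length 8.
assert language_up_to(R1, 8) == language_up_to(R2, 8)
```

Equal finite samples are consistent with $r_1 \approx r_2$ but do not establish it; the two inclusions shown in the example, or property P14, provide the actual proof.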
Since those properties hold for all languages, by specializing them to regular languages and in turn replacing the languages by the corresponding regular expressions, we get the following identities for regular expressions. Let $r$, $r_1$, $r_2$, and $r_3$ be any regular expressions
In this chapter, we introduce the notion of a grammar called context-free grammar (CFG) as a language generator. The notion of derivation is instrumental in understanding how strings are generated in a grammar. We explain the various properties of derivations using a graphical representation called derivation trees. A special case of CFG, viz. regular grammar, is discussed as a tool to generate regular languages. A more general notion of grammars is presented in Chapter 7.
In the context of natural languages, the grammar of a language is a set of rules which are used to construct/validate sentences of the language. It has been pointed out, in the introduction of Chapter 2, that this is the third step in a formal learning of a language. We now draw the reader's attention to the general features of grammars (of natural languages) in order to formalize the notion in the present context, which facilitates a better understanding of formal languages. Consider the English sentence
The students study automata theory.
In order to observe that the sentence is grammatically correct, one may attribute certain rules of the English grammar to the sentence and validate it. For instance, the Article the followed by the Noun students forms a Noun-phrase, and similarly the Noun automata theory forms a Noun-phrase. Further, study is a Verb. Now, choose the Sentential form "Subject Verb Object" of the English grammar. As Subject or Object can be a Noun-phrase, by plugging in the above words one may conclude that the given sentence is a grammatically correct English sentence. This verification/derivation is depicted in Figure 3.1. The derivation can also be represented by a tree structure as shown in Figure 3.2.
Sentence ⇒ Subject Verb Object
⇒ Noun-phrase Verb Object
⇒ Article Noun Verb Object
⇒ The Noun Verb Object
⇒ The students Verb Object
⇒ The students study Object
⇒ The students study Noun-phrase
⇒ The students study Noun
⇒ The students study automata theory
Figure 3.1: Derivation of an English Sentence
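The derivation in Figure 3.1 can be replayed step by step. The following sketch encodes only the rules that the figure actually uses as a Python dictionary and always rewrites the leftmost word that still has a rule; the dictionary `RULES`, the function `derive` and the list of choices are a hypothetical encoding made for this illustration, not a definition from the notes.

```python
# Rules read off Figure 3.1: each key can be replaced by one of its alternatives.
RULES = {
    "Sentence": [["Subject", "Verb", "Object"]],
    "Subject": [["Noun-phrase"]],
    "Object": [["Noun-phrase"]],
    "Noun-phrase": [["Article", "Noun"], ["Noun"]],
    "Article": [["The"]],
    "Noun": [["students"], ["automata theory"]],
    "Verb": [["study"]],
}

def derive(start: str, choices: list) -> list:
    """Replay a derivation: each choice picks an alternative for the
    leftmost word that still has a rule. Returns every intermediate form."""
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        i = next(k for k, word in enumerate(form) if word in RULES)
        form[i:i + 1] = RULES[form[i]][choice]
        steps.append(" ".join(form))
    return steps

steps = derive("Sentence", [0, 0, 0, 0, 0, 0, 0, 1, 1])
print(" ⇒ ".join(steps))
assert steps[-1] == "The students study automata theory"
```

Words that appear as keys of `RULES` are the ones about which something more must be said, while words such as `The` or `study` are never rewritten.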
In this process, we observe that two types of words are in the discussion.
The main difference is this: if you arrive at a stage where type (1) words appear, then you need not say anything more about them. If you arrive at a stage where you find a word of type (2), then you are expected to say something more about the word. For example, if the word Article comes up, then one should say which article needs to be chosen among a, an and the. Let us call the type (1) and type (2) words terminals and nonterminals, respectively, as per their features. Thus, a grammar should include terminals and nonterminals along with a set of rules which attribute some information regarding nonterminal symbols.
3.1 Context-Free Grammars
We now understand that a grammar should have the following components.