快速了解 Formal Language (形式語言)

這篇文章只會簡單的介紹 formal language,目的是用最快的速度擁有最粗淺的了解,如果對於這個領域還有興趣,可以再分項一一往下研究。 Turing Machine 在 1936 年由 Alan Turing 提出後,便奠定了現今電腦的基礎,整個 Formal Language 大約是在 1930-1940 年代發展成熟,而人類第一台通用計算機 ENIAC 則是 1946 的事情了。

Overview

Why we study formal language theory?

theory provides concepts and principles that help us understand the general nature of the discipline.

學習理論讓我們知道計算機科學的最核心的概念和原則,進而了解電腦的極限,而不用透過無意義的 trial-and-error 以及猜測來理解電腦,以更高的角度來看待計算機科學。

the ideas we will discuss have some immediate and important applications.e.g. digital design, programming languages, and compilers.

就實作面來說,formal language 是 compiler 的理論基礎,因為 compiler 就是在處理語言和字串,本質上和 formal language 所探討的問題是一致的;programming languages 核心的概念也是把人類理解的語言,轉成機器可以執行的語言,不論高階語言還是低階語言,本質上都是處理語言和字串的轉換。

the subject matter is intellectually stimulating and fun.

最後一個理由是,透過理解這些理論基礎和證明,會再次感受到前人研究的偉大,而我們也只是站在這些巨人的肩膀上,希望獲得新知識的過程能帶來無比的成就感以及樂趣。

Using Formal Language :

  • definition of programming languages.
  • interpreters.
  • compilers.
關於 formal language 的翻譯 如果把 formal language 翻譯成「正規語言」,那 regular language 可能會不知道該如何翻譯,因此比較好的翻譯應該是「形式語言」,而「正規語言」是包含於「形式語言」。

Definition : Languages, Grammars, Automata

在開始討論 formal language 之前,我們必須對於要討論的對象做根本上的定義。

Languages 是一個數學上的字串集合; 這些字串集合的 derivation rules 就稱為 Grammars; 而設計一個自動機來決定給定的字串是否屬於該集合則稱為 Accepter (Automata 的一種)。

Languages 的例子:

Automata 基本上分為兩種:

  • Accepter : output response is limited to a simple “yes” or “no”.
  • Transducer : producing strings of symbols as output.

Grammars 是用數學嚴謹的定義來描述 language:

  • variables (V) : finite set.
  • terminal symbols (T) : finite set.
  • start variable (S) : special symbol.
  • productions (P) : which used to derives the strings.

這個 Grammar 其實就是 Language $L_1$ 那個例子的 Grammar,所以

Regular Languages and DFA

我們將會從最簡單的 automata 開始討論,由於接下來的 automata 都是 accepters,所以可以暫時理解 $automata=accepters$。

Language 是一種字串的集合,關於這個集合我們會用各種角度和方式來描述它,讓它變得更加具體,其中一個就是設計一個 automata 讓所有該集合的字串都能通過該 automata(accepter);Grammars 則是透過數學的方式來描述該 language。

Finite Automata (Finite State Machine, FSM)

  • Deterministic Finite Automata (DFA) is defined by internal states, input alphabet, transition function, initial state, final states.
    • each move consumes one input symbol.
    • the string is accepted if the automaton is in one of its final states.
  • Nondeterministic Finite Automata (NFA)
    • transition function range is a set of possible states.
    • can make a transition without consuming an input symbol.
    • there is no transition defined for the specific situation.

The classes of DFA and NFA are equally powerfully.

Regular Language

A language $L$ is called regular if and only if there exists some DFA $M$ such that, $L=L(M)$.

Regular Grammar

  • A language $L$ is regular if and only if there exists a regular grammar $G$ such that $L=L(G)$.
  • A regular grammar is one that is either right-linear or left-linear, which there is exactly one variable occuring as the rightmost/leftmost symbol.
    Right-Linear Grammars :$A \to xB$
    Left-Linear Grammars : $A \to Bx$

Regular Expression

  • One way of describing regular language is via the notation of regular expression.
  • Two regular expressions are equivalent if they denote the same language.

There are three ways to describe regular languages:DFA, regular expression, regular grammars.

想知道更多?
  • Closure Properties of Regular Languages
  • membership algorithm
  • Identifying Nonregular Languages
    • Pigeonhole Principle (鴿籠原理)
    • Pumping Lemma

Context-Free Languages and Pushdown Automata

雖然 regular language 廣泛應用在 computer science 領域中,但是還是有很多是 regular language 無法處理的,例如我們舉的第一個例子:

假如 $a=“(”, b=”)”$,那麼這個 language 所產生的 automata 將會是判斷 programming language 的括號是否合法的方法之一。可以想像的是,利用 finite state 將無法紀錄已經 parsing 幾個 a 了,因此我們需要更強大的 language 來處理 regular language 無法處理的 language。 最核心的概念就是在 finite state 以外新增 storage (stack)。你將會發現「新增/修改 storage」是區別每個 language 所對應的 automata 重要的 property 之一。

Context-Free Grammars

  • all productions in P have the form $A \to x$.
  • retaining the restriction on the left side, but permitting anything on the right.

Context-Free Languages

  • A language $L$ is said to be context-free if and only if there is a context-free grammar $G$ such that $L=L(G)$.
想知道更多?
  • Derivation Tree (Parse tree)
  • Sentential form of the derivation
  • Parsing and Ambiguity
    • exhaustive search parsing (brute force parsing)
    • Simple Grammar (s-grammar)
  • Simplification of Context-Free Grammars and Normal Forms
    • Chomsky Normal Form : A->BC or A->a
    • Greibach Normal Form : A->ax

Pushdown Automata

  • Nondeterministic Pushdown Automata (NPDA)
    • For any context-free language $L$, there exists an NPDA $M$ such that $L=L(M)$.
    • If $L=L(M)$ for some NPDA $M$, then $L$ is a context-free language.
  • Deterministic Pushdown Automata (DPDA)
    • A language $L$ is said to be a deterministic context-free language if and only if there exists a DPDA $M$ such that $L=L(M)$.

deterministic and nondeterministic pushdown automata are not equivalent.

想知道更多?
  • Properties of Context-Free Languages
    • A pumping lemma for context-free languages
    • Closure Properties for Context-Free Languages

Recursively enumerable Languages and Turing Machine

我們發現還是有 Language 不屬於 regular and context-free language

因此透過修改 storage 可以得到更 powerful 的 language : Turing Machine (Using Tapes).

Turing Machine

  • Definition : associated with the tape is a read-write head that can travel right or left.
  • The Turing machine enters a final state and halts, then the string is considered to be accepted.
  • main features :
    • tape is unbounded in both directions.
    • deterministic.
    • no special input file / output device.
  • Church Turing Thesis : 任何在算法上可計算的問題同樣可由圖靈機計算。

Recursively enumerable Languages

  • A language $L$ is said to be recursively enumerable if there exists a Turing machine that accepts it.
  • There is an enumeration procedure for every recursively enumerable language.
  • enumeration procedure : 存在一個 algorithm 可以列出所有 strings in the language.
  • Recursively enumerable Languages is countable.

Recursive Languages

  • A language is recursive in and only if there exists a membership algorithm for it.
  • membership algorithm : 存在一個 algorithm 可以決定該 string 是否屬於 language。
  • If a language is recursive, then there exists an enumeration procedure.
  • the family of recursive language is a proper subset of the family of recursively enumerable languages.
想知道更多?
  • 這邊常用的證明方法 : diagonalization

Unrestricted Grammars

  • A grammar is called unrestricted if all the productions are of the form $u \to v$.
  • Any number of variables and terminals can be on the left or righting any order.
  • Any language generated by an unrestricted grammar is recursively enumerable.
  • For every recursively enumeration language $L$, there exists an unrestricted grammar $G$, such that $L=L(G)$.

Context-Sensitive Languages and Linear Bounded Automata

Context-free languages,顧名思義就在 derivation 的時候並不考慮上下文,因此根據 definition,there is no terminal on the left of any production.接下來將會介紹根據上下文的 Context-Sensitive Languages。

Context-Sensitive Grammars

  • A grammar is called Context-Sensitive if all the productions are of the form $x \to y$, where .
  • the length of successive sentential forms can never decrease.

Linear Bounded Automata

  • Turing machine with limiting the tape.
  • For every context-sensitive language $L$ (not including empty string), there exists some linear bounded automata $M$ such that $L=L(M)$.

Context-Sensitive Languages

  • A language $L$ is said to be context-sensitive if and only if there is a context-sensitive grammar $G$ such that $L=L(G)$.

The Chomsky Hierarchy

  • Type 0 : recursively enumerable languages (unrestricted grammars).
  • Type 1 : context-sensitive language.
  • Type 2 : context-free language.
  • Type 3 : regular language.
想知道更多?
  • Recursive Languages
  • Linear Languages
  • Determinitstic Context-Free Languages

Limits of Algorithmic Computation

Computable Functions : It states that a function on the natural numbers can be calculated by an effective method, if and only if it is computable by a Turing machine.

接下來,我們將會討論那些 Turing machine 無法處理的問題,在討論之前,先介紹幾個概念。

Computability

A function $f$ is said to be computable if there exists a Turing Machine that computes the value of $f$.

Decidability

The result of a computation is simple “yes” or “no”, in this case, a problem being decidable or undecidable.

The Turing Machine Halting Problem

「給定一個 input string 和 Turing Machine,把 string 丟進去 Turing Machine 執行後是否會停在某個 state,抑或是執行後永不停止。」

這個問題是 undecidable,證明方法主要是利用反證法:If halting problem were decidable, then every recursively enumerable language would be recursive. 因為如果知道會不會停止,就可以建構出 membership algorithm,而凡是可以建構出 membership algorithm 的 language 就是 recursive language。

可以 reduced to Halting Problem 的問題:

  • The state-entry Problem : the state is ever entered when applied to input string.
  • The blank-tape halting Problem : Turing machine halts if started with a blank tape.

證明的方法就是利用該 Problem 建構出 Halting Problem,但已知 Halting Problem is undecidable,因此該 Problem 也是 undecidable (反證法)。

想知道更多?
  • Undecidable Problems for Recursively Enumerable Languages
  • The Post Correspondence Problem
  • Undecidable Problems for Context-Free Languages

Computational Complexity

我們只介紹 P Problem, NP Problem。

  • P Problem : Includes all languages that are accepted by some deterministic Turing machine in polynomial time. 指的是有明確 polynomial time 的解法 (通常指的是暴力解)。
  • NP Problem : Includes all languages that are accepted by some nondeterministic Turing machine in polynomial time. 提供一個解法,可以在 polynomial time 被驗證 (通常暴力解會是 exponential time)。
想知道更多?
  • The SAT problem (NP)
  • The Hamiltonian Path Problem (NP)
  • The Clique Problem (NP)
  • NP-compete Problem
  • NP-hard Problem

Resource

Comments