在Coq中实现算符优先文法分析器

10 minute read

Published:

一、编译原理与算符优先文法(OPG)

编译的工作可以分为前端和后端两大部分。前端主要处理字符串(即程序代码文本),也就是把具体的字符串 变成 抽象的数据结构(抽象语法树)。后端主要进行目标代码生成(以及各种编译检查、优化),也就是把抽象的数据结构(抽象语法树) 变成 能在具体平台(如X86)上运行的汇编语言代码。

图 1. 编译总览

再细分,前端的工作可以分为 词法分析语法分析 。词法分析将字符串分成一个个词法单元(token)。语法分析将一个个词法单元根据 形式文法(语法) 组建成抽象语法树。

我们现在要处理的 算符优先文法 就是一种 形式文法 。(进一步地,属于 上下文无关文法 (CFG))算符优先文法是 任意产生式的右部都不含两个相继的非终结符算符 的文法。

二、Coq证明助手

主流的定理证明主要依赖社会途径,即以非形式化的学术通用语言来呈现证明,证明的正确与否需要他人验证。然而,对于大型的证明,社会途径的验证是不实际,不可靠的。大型计算机程序逻辑正确性的验证需要大篇幅的证明,完全不适合社会途径,需要寻找新的正确性保障方式。交互式定理证明是计算机辅助证明的一种。交互式指每一步证明的条件和目标都会由程序展现,便于人类思考。

Coq证明助手是一种交互式定理证明器,可用来验证程序是否达到了特定的逻辑功能。已有许多著名的定理证明工作基于Coq完成。

三、文法表示

先定义文法中的 符号 的表示形式(symbol)。其中构造子T代表终结符,NT代表非终结符。

Inductive symbol: Type :=
| T: string -> symbol
| NT: string -> symbol
| U: string -> symbol. (* U -> undecide T or NT*)

然后就可以定义 文法 的表示形式。产生式(rule)由一个非终结符(symbol)和一个或多个规则(expr)构成。文法(grammar)是产生式(rule)的列表。

Inductive expr: Type :=
| base: (list symbol) -> expr
| combine: expr -> expr -> expr.

Definition rule: Type := symbol * expr.
Definition grammar: Type:= list rule.

举个例子,现在有如下的文法,

E \rightarrow E + T \;| \; T \\ T \rightarrow T * F \;| \; F\\ F \rightarrow ( E ) \;| \; i

我们可以这样表示它(注意,这里的符号(symbol)全部使用构造子U),

Definition test: grammar := 
    [   ((U "E"), combine (base [(U "E"); (U "+"); (U "T")]) (base [(U "T")]));
        ((U "T"), combine (base [(U "T"); (U "*"); (U "F")]) (base [(U "F")]));
        ((U "F"), combine (base [(U "("); (U "E"); (U ")")]) (base [(U "i")]))
    ].

由于已知只有产生式左部是非终结符,可以编写程序将文法中的终结符与非终结符找到。

Fixpoint find_NT (target: grammar): list string :=
    match target with
    | nil => nil
    | (x, _)::xs => let l := find_NT xs in
                        let s := get_string_from_symbol x in
                            if negb (In_string l s) then s::l else l
    end.

(* x not in find_NT iff. x is T *)
Definition find_T (target: grammar): list string := 
    (reify_EOF
        (delete_multi_occur_str 
            (filter (fun x => negb (In_string (find_NT target) x)) (find_string_in_grammar target))))++["$"].

对于以上例子,可以得到

Compute (find_NT test). (* = ["E"; "T"; "F"] : list string *)
Compute (find_T test). (* = ["+"; "*"; "("; ")"; "i"] : list string *)

基于此,可以实现reify函数,可以将原始文法转为标记了终结符和非终结符的文法,

Definition reify_symbol (nt: list string)(s: symbol): symbol :=
    let str:= get_string_from_symbol s in
        if (In_string nt str) then NT str else T str.

举个例子,对于测试文法,

Compute (reify test).
(* = [(NT "E", combine (base [NT "E"; T "+"; NT "T"]) (base [NT "T"]));
      (NT "T", combine (base [NT "T"; T "*"; NT "F"]) (base [NT "F"]));
      (NT "F", combine (base [T "("; NT "E"; T ")"]) (base [T "i"]))]
    : grammar *)

四、文法检查

算符优先文法要求任意产生式的右部都不含两个相继的非终结符算符。在实现中,就是要判断有没有列表中相继的2个 以NT为构造子的符号

Fixpoint check_expr (nt: list symbol)(l: list symbol): bool :=
    match l with
    | nil => true
    | x::xs => if In_symbol x nt then 
                    if In_symbol (nth_default default xs 0) nt then false else check_expr nt xs
                else check_expr nt xs
    end.

Fixpoint check_expr_l (nt: list symbol)(e: expr): bool :=
    match e with
    | base l => check_expr nt l
    | combine a b => andb (check_expr_l nt a) (check_expr_l nt b)
    end. 

Fixpoint _type_checker (nt: list symbol)(g: grammar): bool :=
    match g with
    | nil => true
    | (_, e)::xs => andb (check_expr_l nt e) (_type_checker nt xs)
    end.

(* Check the grammar *)

(* G -> ... Q P ... is not accepted *)
Definition type_checker (g: grammar): bool :=
    _type_checker (map NT (find_NT g)) g.

五、构建 firstVT 和 lastVT 集

构建firstVT集的方法如下所示,

firstVT(E)=\{a|E→a...或E→Qa... 或 E→Q...且a∈firstVT(Q)\}

也就是包含以下三种情况,

  • E->a…….,即以终结符a开头
  • E->Qa…….,即以非终结符Q开头,紧跟Q的终结符a
  • E->Q…..,即先以非终结符Q开头,非终结符Q的firstVT

构建lastVT的方法如下所示,

lastVT(E)=\{a|E→...a或E→...aQ 或 E→...Q且a∈lastVT(Q)\}

也就是包含以下三种情况,

  • E->…….a,即以终结符a结尾
  • E->…….aQ,即以非终结符Q结尾,紧跟Q的终结符a
  • E->…..Q,即先以非终结符Q结尾,非终结符Q的lastVT

不难发现 firstVT 和 lastVT 的构建方法的对称关系,此处我们使用firstVT的构建算法来实现lastVT。

首先,根据 firstVT 的构建方法,实现函数 find_first ,由于 Coq 对递归函数的限制,此处引入了递归深度,

Definition look_ahead (l: list symbol): list symbol:=
    match (nth_default default l 1) with
    | T x => [T x]
    | _ => nil (* default is 'U "ERROR"', which is in this clause *)
    end.

Fixpoint find_first_ (target: symbol)(g: grammar)(dec: nat): list symbol :=
    match dec with
    | S n =>
        match (get_expr target g) with
        | Some e => _find_first target g e n
        | None => nil
        end
    | O => nil
    end
with _find_first (target: symbol)(g: grammar)(e: expr)(dec: nat):=
    match dec with
    | S n =>
        match e with
        | base l => let fst:= (nth_default default l 0) in 
                        if (symbol_dec fst target) 
                            then look_ahead l
                            else (
                                match fst with
                                | T x => [T x] ++ look_ahead l
                                | NT _ => (find_first_ fst g n) ++ look_ahead l
                                | U _ => nil
                                end
                            )
        | combine a b => (_find_first target g a n) ++ (_find_first target g b n)
        end
    | O => nil
    end.

(* max execution depth: 100 *)
(* to pass the Gallina's recursion check *)
Definition MAX_EXECUTION_DEPTH := 100.

Definition find_first (target: symbol)(g: grammar) := find_first_ target g MAX_EXECUTION_DEPTH.

函数式的好处尽数体现,算法的非形式化描述几乎和算法的形式化实现一样。

基于 find_first 函数,就可以构造出 firstVT 和 lastVT 集,

Definition firstvt (g: grammar) := 
let nt := find_NT g in    
    let l := map (fun s => map get_string_from_symbol (delete_multi_occur (find_first (NT s) g))) nt in
        (nt :: map reify_EOF l).

(* reverse the grammar, then use firstvt() to define lastvt() *)
Fixpoint rev_expr (e: expr): expr :=
    match e with
    | base l => base (rev l)
    | combine a b => combine (rev_expr a) (rev_expr b)
    end.

Definition lastvt (g: grammar):= 
    let rev_g:= map (fun x => (fst x, rev_expr (snd x))) g in
        firstvt rev_g.

举个例子,对于测试文法,

Compute (firstvt test_reify).
(* = [["E"; "T"; "F"]; ["+"; "*"; "("; "i"]; ["*"; "("; "i"]; ["("; "i"]]
: list (list string) *)
Compute (lastvt test_reify).
(* = [["E"; "T"; "F"]; ["+"; "*"; ")"; "i"]; ["*"; ")"; "i"]; [")"; "i"]]
: list (list string) *)

五、构建算符优先关系表

算符优先分析表中有3种关系: >< , = ,在 Coq 中如下定义关系(relation)与关系表(matrix),

Inductive relation: Type :=
| le : relation (* < *)
| ge : relation (* > *)
| eq : relation (* = *)
| emp : relation. (* ? undecided *)

Definition matrix: Type := list (list relation).

等于关系:如果产生式中有形如 aQb 或 ab,则 a=b

(* find => aQb | ab => a equal b*)
Fixpoint equal_list_eval (m: matrix)(l: list symbol)(g: grammar): matrix :=
    match l with
    | nil => m
    | x::xs => if is_T x 
                    then (let (snd, trd):= (nth_default default xs 0, nth_default default xs 1) in
                            if (andb (is_NT snd) (is_T trd)) (* aQb *)
                                then (let (a, b):= (find_x_in_g_symbol g x, find_x_in_g_symbol g trd) in 
                                        equal_list_eval (change_n_m_to_x m a b eq) xs g)
                                else if is_T snd (* ab *)
                                        then (let (a, b):= (find_x_in_g_symbol g x, find_x_in_g_symbol g snd) in 
                                                equal_list_eval (change_n_m_to_x m a b eq) xs g)
                                        else equal_list_eval m xs g)
                    else equal_list_eval m xs g
    end.

小于关系:如果产生式有形如 aQ,则 a < firstVT(Q)

(* find => aQ => a < firstvt(Q)*)
Fixpoint less_list_eval (m: matrix)(l: list symbol)(g: grammar): matrix :=
    match l with
    | nil => m
    | x::xs => if is_T x 
                    then (let snd:= nth_default default xs 0 in
                            if (is_NT snd) (* aQ *)
                                then (let fstq:= firstvt_gen g snd in 
                                        less_list_eval (fold_le g fstq m (find_x_in_g_symbol g x)) xs g)
                                else less_list_eval m xs g)
                    else less_list_eval m xs g
    end.

大于关系:如果产生式有形如 Qa ,则 lastVT(Q) > a

(* find => Qa => lastvt(Q) > a*)
Fixpoint greater_list_eval (m: matrix)(l: list symbol)(g: grammar): matrix :=
    match l with
    | nil => m
    | x::xs => if is_NT x 
                    then (let snd:= nth_default default xs 0 in
                            if (is_T snd) (* Qa *)
                                then (let lstq:= lastvt_gen g x in 
                                        greater_list_eval (fold_ge g lstq m (find_x_in_g_symbol g snd)) xs g)
                                else greater_list_eval m xs g)
                    else greater_list_eval m xs g
    end.

至此,可以分3次分别生成出算符优先关系表中的等于、大于、小于关系,

(* for less, greater and equal generation *)
Fixpoint meta_expr_eval (f: matrix -> list symbol -> grammar -> matrix)(m: matrix)(e: expr)(g: grammar): matrix :=
    match e with
    | base l => f m l g
    | combine a b => meta_expr_eval f (meta_expr_eval f m a g) b g
    end.

Fixpoint _meta_eval (f: matrix -> list symbol -> grammar -> matrix)(m: matrix)(exprs: list expr)(g: grammar): matrix :=
    match exprs with
    | nil => m
    | x::xs => _meta_eval f (meta_expr_eval f m x g) xs g
    end.

Definition meta_eval (f: matrix -> list symbol -> grammar -> matrix)(g: grammar):=
    _meta_eval f (matrix_gen_emp g) (map snd g) g.

(* OPG matrix can be generated using following 3 functions *)
Definition equal_eval (g: grammar):= meta_eval equal_list_eval g.
Definition less_eval (g: grammar):= meta_eval less_list_eval g.
Definition greater_eval (g: grammar):= meta_eval greater_list_eval g.

最后将三张关系表合并,得到最终的算符优先关系表(注意:如果此处有2张表合并时在同行同列有冲突的关系,则会产生一个空行),

(* merge three matrix into one *)
Fixpoint merge_list (l1 l2: list relation): list relation :=
    if Nat.eqb (length l1) (length l2) then
        match l1, l2 with
        | x1::xs1, x2::xs2 => if is_emp x1 then x2::(merge_list xs1 xs2)
                                           else if is_emp x2 then x1::(merge_list xs1 xs2)
                                                             else nil (* both are not empty, should not happen! *)
        | _, _ => nil
        end
    else nil.

Fixpoint merge (m1 m2: matrix): matrix :=
    if Nat.eqb (length m1) (length m2) then
        match m1, m2 with
        | x1::xs1, x2::xs2 => (merge_list x1 x2)::(merge xs1 xs2)
        | _, _ => nil
        end
    else nil.

Definition get_opg_matrix (g: grammar) := merge (merge (less_eval g) (equal_eval g)) (greater_eval g).

到现在为止,我们已经可以通过原始文法生成算符优先关系表:

Definition output_t (g: grammar):= find_T (reify g).
Definition output_nt (g: grammar):= find_NT (reify g).
Definition output_matrix (g: grammar):= get_opg_matrix (reify g).

六、文法的文法分析(Haskell)

为了更方便地进行文法的文法分析与美化输出,此处将之前的 Coq 代码导出到 Haskell,

(* Extraction to Haskell for I/O *)
Require Coq.extraction.Extraction.
Extraction Language Haskell.
Require Coq.extraction.ExtrHaskellString.
Require Import ExtrHaskellBasic.
Extraction "./Module/OPG_core.hs" output_t output_nt output_matrix test type_checker.

对于一个文法,我们可能希望是这样表示的:

E -> E + T | T
T -> T * F | F
F -> ( E ) | i

这里使用 Parser Combinator 对文法进行文法分析,

exprsToExpr :: [Expr] -> Expr
exprsToExpr = foldl1 Combine

line :: GenParser Char st Rule
line =
  do
    constr <- symbolContent
    skipMany space
    string "->"
    skipMany space
    result <- exprs
    let expr = exprsToExpr result
    return (constr, expr)

exprs :: GenParser Char st [Expr]
exprs =
  do
    skipMany space
    first <- exprContent
    skipMany space
    next <- remainingExpr
    return (first : next)

remainingExpr :: GenParser Char st [Expr]
remainingExpr =
  (char '|' >> exprs)
    <|> return []

exprContent :: GenParser Char st Expr
exprContent =
  do
    l <- many symbolContent
    return (Base l)

symbolContent :: GenParser Char st Symbol
symbolContent =
  do
    first <- many1 (noneOf "->|\n ")
    skipMany space
    return (U first)

eol :: GenParser Char st Char
eol = char '\n'

parseOPG :: String -> Either ParseError Rule
parseOPG = parse line "(unknown)"

parseOPGfile :: [String] -> Either ParseError [Rule]
parseOPGfile [] = Right []
parseOPGfile (x : xs) = case parseOPG x of
  Left err -> Left err
  Right p -> case parseOPGfile xs of
    Left err -> Left err
    Right p1 -> Right (p : p1)

此时,字符形式的文法已经被 parse 成了原始文法形式的文法语法树,然后使用刚才在 Coq 中实现的 output_matrix 函数来得到对应的算符优先关系表,

compute :: String -> IO ()
compute path =
  if unsafePerformIO (doesFileExist path)
    then do
      let grammar_raw = unsafePerformIO (readLines path)
      let grammar = parseOPGfile grammar_raw
      if parseCheck grammar
        then do
          successMessage $ Text.pack "Grammar loaded."
          output (path ++ "_output.md") (unwrap grammar)
        else errorMessage $ Text.pack "Grammar parse failed!"
    else errorMessage $ Text.pack "File does not exist!"

再进行一些输出美化,最终结果如下所示,

图2.测试文法

图3. 更多测试

七、结语

用 Coq 写大作业成就达成✔

MisakaCenter/OPG_Parser

Oh, from RegEx to NFA to DFA to miniDFA
Oh, from LL(1) to LR(0) to SLR to LALR to LR(1)
Oh, from SDD to SDT
Oh, from AST to 3AC
Oh, from IR to BB to CFG to Assembly