CentraleSupélecDépartement informatique
Plateau de Moulon
3 rue Joliot-Curie
F-91192 Gif-sur-Yvette cedex
Sémantique d'expressions régulières simples

L'objectif de cet exercice est de donner une sémantique à des expressions régulières simples comportant :

  • l'expression vide
  • l'expression atomique, qui reconnaît uniquement son atome
  • l'alternative entre deux expressions régulières, qui reconnaît ce que reconnaît soit l'une, soit l'autre
  • la séquence d'expressions régulières, qui reconnaît les mots composés d'un préfixe reconnu par la première, et d'un suffixe reconnu par la seconde
  • la clôture de Kleene, qui reconnaît toute séquence finie de mots reconnus par une expressions régulière.

On spécifiera le langage reconnu par une expression régulière et on démontrera un certain nombre de propriétés.

RegExprExercise.thy

theory RegExprExercise

imports Main

begin

text ‹
  A theory of simple regular expressions on an alphabet 'a.
  We handle:
  ▪ the empty regular expression ε, which matches nothing
  ▪ the atomic regular expression, which matches its atom
  ▪ the alternative between two regular expressions, which matches either one or the other
  ▪ the sequence of regular expressions, which matches a sequence of matches of two REs
  ▪ the Kleene closure of a regular expression, which matches any sequence of matches of the RE
›
datatype 'a regexpr =
    Empty                           (‹ε›)
  | Atom 'a                         (‹<_>›)
  | Alt ‹'a regexpr› ‹'a regexpr›   (infixr ‹.|.› 65)
  | Seq ‹'a regexpr› ‹'a regexpr›   (infixr ‹.+.› 75)
  | Star ‹'a regexpr›               (‹_*›)

text ‹
  Some examples on characters
›
term ‹<CHR ''a''>›
term ‹<CHR ''a''> .|. <CHR ''b''>›
term ‹<CHR ''a''>*›
term ‹(<CHR ''a''> .|. <CHR ''b''>)*›
term ‹(<CHR ''a''> .|. <CHR ''b''>)* .+. <CHR ''c''>›

text ‹
  Languages are sets of words, a word being an 'a list.
›
type_synonym 'a language = ‹'a list set›

text ‹
  The Kleene closure of a language is the smallest set that contains:
  ▪ the empty word
  ▪ all concatenations of words of the language.

  inductive_set is like inductive, but directly defines a set
›
inductive_set Kleene :: ‹'a language ⇒ 'a language› (‹_⋆›)
                      for L :: ‹'a language›
where
    ― ‹The empty word belongs to the Kleene closure›
    Empty:  ‹[] ∈ Kleene L›
    ― ‹Which other words belong to the closure? Think about 'append', the @ operator on lists›
  | Cons:   ‹⟦w ∈ L ⟧ ⟹ w ∈ Kleene L› ― ‹FIXME: this sets L⋆ = [] ∪ L›

text ‹
  Language recognized by a regular expression
›
fun language :: ‹'a regexpr ⇒ 'a language›
where
    Lang_Empty: ‹language ε = undefined›
  | Lang_Atom:  ‹language <a> = undefined›
  | Lang_Alt:   ‹language (r1 .|. r2) = undefined›
  | Lang_Seq:   ‹language (r1 .+. r2) = undefined›
  | Lang_Star:  ‹language (r*) = undefined›


section ‹Some properties›

text ‹What is the language of the empty regexpr?›
lemma epsilon_empty: ‹language ε = X›
  sorry

text ‹And what about its Kleene closure?›
lemma ‹language (ε*) = X›
  sorry

text ‹The alternative commutes›
lemma ‹A .|. B = B .|. A› ― ‹FIXME: I am wrong›
  sorry

text ‹ε is left and right absorbant for the sequence›
lemma ‹ε .+. A = ε› sorry
lemma ‹A .+. ε = ε› sorry
― ‹FIXME: these lemmas are wrong, write the correct form and prove them›

text ‹ε is neutral for the alternative›
lemma ‹ε .|. A = A› sorry
lemma ‹A .|. ε = A› sorry
― ‹FIXME: these lemmas are wrong, write the correct form and prove them›

text ‹The language of a regexpr is included in its Kleene closure›
lemma kleene_includes: ‹something about ''r'' ⊆ someting about ''r*''› ― ‹FIXME: I am wrong›
  sorry

text ‹Correspondance between Kleene and Star›
lemma morph_kleene: ‹something about ''*'' = something about ''⋆''›
  by simp

text ‹The Kleene closure is stable by concatenation›
lemma kleene_closed: ‹⟦w ∈ (L⋆); w' ∈ (L⋆)⟧ ⟹ w@w' ∈ (L⋆)›
  sorry

text ‹Dual version for regexpr›
corollary kleene_closed': ‹True› ..

text ‹
  Some checks on concrete strings and regexprs.
›
schematic_goal a_or_b_then_c_star:
  ‹language (((<CHR ''a''> .|. <CHR ''b''>) .+. <CHR ''c''>)*) = ?X›
by simp

thm a_or_b_then_c_star

lemma ‹''acbc'' ∈ language (((<CHR ''a''> .|. <CHR ''b''>) .+. <CHR ''c''>)*)› (is ‹?w ∈ language (?R*)›)
proof -
  have ‹''ac'' ∈ language (?R)› by simp
  hence 1:‹''ac'' ∈ language (?R*)› using kleene_includes by blast
  have ‹''bc'' ∈ language (?R)› by simp
  hence 2:‹''bc'' ∈ language (?R*)› using kleene_includes by blast
  have ‹?w = ''ac''@''bc''› by simp
  with kleene_closed'[OF 1 2] show ?thesis by presburger
qed

end