At uni, one of my favorite courses was our introduction to functional programming. We were given the task of implementing a basic finite-state automaton [1], with the purpose of lexing a basic mathematical expression language.
Obviously I chose Clojure as my language. Find my full code in the repository [2].
The lexer works by generating a stream of tokens that are then rendered into an HTML file with basic syntax highlighting. See the example image:
The state machine itself is encoded as a single Clojure map. The map defines a starting state, and a list of transitions from a given state to other states based on a condition. Each transition optionally executes some actions (transition-based state machine), such as emitting a token.
You can see the hand-written state machine in all it's glory here:
(def language
{:start :stmt
:transitions
{;; statements, are either empty, comments, or assignemnts
:stmt [{:where :end :to :halt}
{:where :ws :to :stmt}
{:where :newline :to :stmt :action :linebreak}
{:where :slash :to :cmt-0 :action :eat}
{:where :istart :to :asg-token :action :eat}]
;; comments begin with two slashes and end with a newline
:cmt-0 [{:where :slash :to :cmt-1 :action :eat}]
:cmt-1 [{:where :end :to :halt :action :out-comment}
{:where :return :to :cmt-1}
{:where :newline :to :stmt :action [:out-comment :linebreak]}
{:where :any :to :cmt-1 :action :eat}]
;; assignments begin with a token
:asg-token [{:where :irest :to :asg-token :action :eat}
{:where :ws :to :asg-equal :action :out-token}
{:where :equal :to :asg-expr :action [:out-token :out-equal]}]
;; then have an assignment operator
:asg-equal [{:where :equal :to :asg-expr :action :out-equal}
{:where :ws :to :asg-equal}]
;; then have an expression
:asg-expr [{:where :num :to :expr-num :action :eat}
{:where :istart :to :expr-token :action :eat}
{:where :minus :to :asg-expr-m :action :eat}
{:where :oparen :to :asg-expr :action [:add-paren :out-oparen]}
{:where :ws :to :asg-expr}]
;; this allows for negative values (unary '-' operator)
:asg-expr-m [{:where :num :to :expr-num :action :eat}
{:where :istart :to :expr-token :action :eat}
{:where :oparen :to :asg-expr :action [:out-op :add-paren :out-oparen]}
{:where :ws :to :asg-expr-m}]
;; expressions have numeric parts
:expr-num [{:where :end :to :halt :action [:check-paren :out-num]}
{:where :dot :to :expr-float :action :eat}
{:where :num :to :expr-num :action :eat}
{:where :newline :to :stmt :action [:check-paren :out-num :linebreak]}
{:where :slash :to :expr-cmt-0 :action [:out-num :eat]}
{:where :op :to :asg-expr :action [:out-num :eat :out-op]}
{:where :cparen :to :expr-op :action [:del-paren :out-num :out-cparen]}
{:where :ws :to :expr-op :action :out-num}]
;; floats are slightly more complex numbers with a decimal part
:expr-float [{:where :num :to :expr-float-rest :action :eat}]
:expr-float-rest [{:where :end :to :halt :action [:check-paren :out-float]}
{:where :e-notation :to :expr-float-exponent :action :eat}
{:where :num :to :expr-float-rest :action :eat}
{:where :newline :to :stmt :action [:check-paren :out-float :linebreak]}
{:where :slash :to :expr-cmt-0 :action [:out-float :eat]}
{:where :op :to :asg-expr :action [:out-float :eat :out-op]}
{:where :cparen :to :expr-op :action [:del-paren :out-float :out-cparen]}
{:where :ws :to :expr-op :action :out-float}]
:expr-float-exponent [{:where :num :to :expr-float-exponent-r :action :eat}
{:where :minus :to :expr-float-exponent-n :action :eat}]
:expr-float-exponent-n [{:where :num :to :expr-float-exponent-r :action :eat}]
:expr-float-exponent-r [{:where :end :to :halt :action [:check-paren :out-float]}
{:where :num :to :expr-float-exponent-r :action :eat}
{:where :newline :to :stmt :action [:check-paren :out-float :linebreak]}
{:where :slash :to :expr-cmt-0 :action [:out-float :eat]}
{:where :op :to :asg-expr :action [:out-float :eat :out-op]}
{:where :cparen :to :expr-op :action [:del-paren :out-float :out-cparen]}
{:where :ws :to :expr-op :action :out-float}]
;; expressions can also have tokens instead of a numeric part
:expr-token [{:where :end :to :halt :action [:check-paren :out-token]}
{:where :irest :to :expr-token :action :eat}
{:where :newline :to :stmt :action [:check-paren :out-token :linebreak]}
{:where :slash :to :expr-cmt-0 :action [:out-num :eat]}
{:where :op :to :asg-expr :action [:out-token :eat :out-op]}
{:where :cparen :to :expr-op :action [:del-paren :out-token :out-cparen]}
{:where :ws :to :expr-op :action :out-token}]
;; expressions can end optionally by comments, which can be a little confused by division '/'
:expr-cmt-0 [{:where :slash :to :cmt-1 :action [:check-paren :eat]}
{:where :num :to :expr-num :action [:out-op :eat]}
{:where :istart :to :expr-token :action [:out-op :eat]}
{:where :oparen :to :asg-expr :action [:out-op :add-paren :out-oparen]}
{:where :ws :to :asg-expr :action :out-op}]
;; separated by operators
:expr-op [{:where :end :to :halt :action :check-paren}
{:where :ws :to :expr-op}
{:where :newline :to :stmt :action [:check-paren :linebreak]}
{:where :cparen :to :expr-op :action [:del-paren :out-cparen]}
{:where :slash :to :expr-cmt-0 :action :eat}
{:where :op :to :asg-expr :action [:eat :out-op]}]}})
We have to remember that finite-state machines cannot actually process input with parenthesized expressions. This is because this kind of programs are not actually a regular language [3]! For this reason the parenthesis are actually handled as part of the actions and kept track of elsewhere. In this sense it could be considered a Push-down Automaton.
(Side track, did you know that "automaton" is the singular noun, while "automata" is the plural form? I certainly did not.)
Once we have a defined state machine, we just need a basic main loop. The loop just keeps track of the current state and executes transitions based on the input sequence. Once we are done, we can just process the tokens in any way we see fitting. In my case I just create a basic HTML file with some CSS. The output is a nice syntax highlighter for our custom language.
That's it!
[1]: https://en.wikipedia.org/wiki/Finite-state_machine/
[2]: https://github.com/eduardvercaemer/itesm-tc2037-automata-lexer/
[3]: https://en.wikipedia.org/wiki/Regular_language/