Norvig’s spelling corrector in 16 lines
[Update: I’ve trimmed the code to 15 lines. See the next post.]
I need to confess a dirty secret of mine: I like language shoot-outs.
I know they are not very scientific. I know they don’t show real-world code. I know that the results depend on the skill of individual contributors. But whether shoot-outs compare speed or brevity I still love them, for a simple reason: They let me see a lot of languages in action at once. If you look at the results as an order of magnitude thing rather than as an exact comparison, they become telling.
Peter Norvig (of the Teach Yourself Programming in Ten Years fame) recently posted a rough explanation of how Google’s spelling corrector works. Since I did something similar ten ago with the Serbian language (it’s depressing how statistics on large input samples can beat experts), I found this interesting.
As it goes on the internets, a lot of people couldn’t resist and rewrote Peter’s sample in their language of choice. While looking at the results it struck me as wrong that F# would do so much worse than C#. So when Ana and the kids went to bed last night I rewrote Lorenzo Stoakes’s C# code in sixteen lines of F#:
#light
open System.Text.RegularExpressions
let (+..) = Seq.append
let edits1 (w:string) =
seq { for i in { 0 .. w.Length - 1} -> w.Substring (0, i) + w.Substring (i + 1) } +..
seq { for i in { 0 .. w.Length - 2 } -> w.Substring(0, i) + w.Substring(i + 1, 1) + w.Substring(i, 1) + w.Substring(i + 2) } +..
seq { for i in { 0 .. w.Length - 1 } do for c in { 'a' .. 'z' } -> w.Substring(0, i) + c.ToString() + w.Substring(i + 1) } +..
seq { for i in { 0 .. w.Length - 1 } do for c in { 'a' .. 'z' } -> w.Substring(0, i) + c.ToString() + w.Substring(i) }
let re = Regex ("[a-zA-Z]+", RegexOptions.Compiled)
let words = System.IO.File.ReadAllText "big.txt" |> re.Matches |> Seq.cast |> Seq.map (fun (x:Match) -> x.Value.ToLower())
let wordMap = words |> Seq.count_by id |> Map.of_seq
let arg = Sys.argv.[1]
let cor1 = arg |> Seq.singleton |> Seq.filter wordMap.ContainsKey
let cor2 = if Seq.is_empty cor1 then edits1 arg |> Seq.filter wordMap.ContainsKey else cor1
let cor3 = if Seq.is_empty cor2 then edits1 arg |> Seq.map_concat edits1 |> Seq.filter wordMap.ContainsKey else cor2
(if Seq.is_empty cor3 then arg else cor3 |> Seq.sort_by (fun w -> - wordMap.[w]) |> Seq.hd) |> printf "%s\n"
A simple run from the command line shows it in action:
PS C:\vsp\spellfs\bin\Debug> ./spellfs droylea doyle
Some notes, in no particular order:
1. I have a nagging feeling that the edits1 function can be rewritten to be much shorter, but I can’t put my finger on it just yet.
2. The cor1->cor2->cor3 sequence makes me go yuck. I need to check whether the F# standard library has an equivalent of the C# ?? operator that would also work on sequences. I could add it here but it would increase the line count. (Lousy measure, I know, but I’m a competitive bastard.)
3. Somebody needs to rewrite this in one line of K and the go “nya nya nya” to all of us.
4. The recipe for winning the brevity shoot-outs is quite simple: a) be functional, b) have a large standard library with a lot of data structures and a shitload of algorithms, and c) have a lot of operators. If you don’t overdo it this is also a good recipe for succeeding in the real world. A lot of places in the code and not performance-critical.

24. December 2008 at 23:02
I noticed that he updated his table… But I’ve updated mine… Perhaps we should stop at 15 lines before we start getting unreadable code
!
seasons greetings…