Another area where fuzzy string search is really useful is computational biology. You often have a particular DNA or RNA sequence and you want to figure out what it is, but the version on file can differ by a few mutations. For this sort of search, where the database can be quite large, there are specialized algorithms like BLAST that index by n-grams.
Just thought I'd mention it because if you find it interesting, computational biology is a real growth area right now - the cost of sequencing genomes has been dropping faster than Moore's law, and it's quite possible that within our lifetimes we will be using this sort of technology to genetically engineer people.
Yes, it's useful for alignment. However, I've noticed that such alignment techniques are quite slow, since many of them fall back to brute-force O(n^2) dynamic programming, which becomes painful once the strings reach a few thousand characters.
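To make the O(n^2) concrete, here's the textbook edit-distance dynamic program (not the parent's code, just the standard formulation): two nested loops over the strings, so doubling both lengths quadruples the work.

```python
def levenshtein(a, b):
    """Classic O(len(a) * len(b)) edit-distance DP, rows kept two at a time.

    This quadratic cost is exactly what makes naive alignment slow on
    multi-kilobase strings.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```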
I've found that a better way to approach this problem is to use suffix arrays to generate the best overlaps (ranges) from the reference and then perform a "range compression" (not of the audible kind) to coalesce adjacent ranges that differ by very little and insert/delete the offending characters.
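A minimal sketch of the suffix-array half of that idea, under my own assumptions (naive O(n^2 log n) construction and a made-up `longest_match` helper; a real pipeline would use a linear-time builder like SA-IS and would report match positions as ranges, not just lengths):

```python
import bisect

def suffix_array(s):
    """Naive suffix array: indices of s sorted by the suffix they start."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def longest_match(ref, sa, query):
    """Length of the longest prefix of query found anywhere in ref.

    Binary-search the sorted suffixes; the best match must be adjacent
    to the insertion point.
    """
    suffixes = [ref[i:] for i in sa]
    pos = bisect.bisect_left(suffixes, query)
    best = 0
    for cand in suffixes[max(0, pos - 1):pos + 1]:
        k = 0
        while k < min(len(cand), len(query)) and cand[k] == query[k]:
            k += 1
        best = max(best, k)
    return best
```

Repeating the longest-match query as you walk along the target string gives you the overlap ranges to coalesce.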
I'm using a similar technique for genome compression against a reference and am able to get 700x compression on 2 Korean genomes and 130x compression on hg18 (build 36.1) and another build/assembly (number 36.3) of the human reference genome.
I have been working on a Ternary Search Trie implementation for Python on and off for a while. This is definitely a good read for some other methods for fuzzy string search. The benefits of the TST (versus other Trie formulations) are summarized on my blog.[1] I think I will do a follow-up post covering their characteristics in comparison to the methods Nikita covered in this post.
The author mentions at the end that he didn't consider Tries. However, I think they are a valuable data structure.
For those that don't make it all the way through the article: the author has implemented all the algorithms in Java, and the code is available at his project page (above).
I've used Damerau–Levenshtein + Jaro-Winkler for database record matching/customer dedupe. These things work magic: we found 2x as many matches versus exact comparison.
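For anyone curious what Jaro-Winkler actually computes: it scores by matching characters within a sliding window plus a bonus for a shared prefix, which is why it works so well on names and addresses. This is just the textbook formulation, not the parent's code (Damerau–Levenshtein is the edit distance above plus adjacent transpositions):

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)
```

The prefix bonus is what makes it favor "MARTHA" vs "MARHTA" over pairs that differ early, which matches how typos actually occur in customer data.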
I'm currently using a Patricia trie in an open source project I'm working on. I'm still pretty new to trees/tries, but now that I have begun to use them I can see how widely they are used in modern software products.
I have implemented TST (ternary search tree) in C to create fast dictionary (histogram) for counting words in Wikipedia. It is fast and memory efficient.
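The structure is simple enough to sketch in a few lines (in Python here rather than C, and with my own names; not the parent's implementation). Each node holds one character and three children: less-than, equal (next character of the word), and greater-than:

```python
class TSTNode:
    """One node of a ternary search tree: a char plus lo/eq/hi children."""
    __slots__ = ("ch", "lo", "eq", "hi", "count")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.count = 0  # times a word ending here was inserted

def tst_add(node, word, i=0):
    """Insert word into the TST rooted at node; returns the (new) root."""
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_add(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_add(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_add(node.eq, word, i + 1)
    else:
        node.count += 1
    return node

def tst_count(node, word, i=0):
    """Return how many times word was inserted."""
    if node is None:
        return 0
    ch = word[i]
    if ch < node.ch:
        return tst_count(node.lo, word, i)
    if ch > node.ch:
        return tst_count(node.hi, word, i)
    if i + 1 < len(word):
        return tst_count(node.eq, word, i + 1)
    return node.count
```

Compared to a plain trie with per-node arrays, a TST only allocates nodes for characters actually seen, which is where the memory efficiency comes from.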