Another area where fuzzy string search is really useful is computational biology. You often have a particular DNA or RNA sequence and you want to figure out what it is, but the version on file can differ by a few mutations. For this sort of search, where the database can be quite large, there are specialized algorithms like BLAST that index by n-grams.
Just thought I'd mention it because if you find it interesting, computational biology is a real growth area right now - the cost of sequencing genomes has been dropping faster than Moore's law, and it's quite possible that within our lifetimes we will be using this sort of technology to genetically engineer people.
Yes, it's useful for alignment. However, I've noticed that such alignment techniques are quite slow, since many of them fall back to brute-force O(n^2) dynamic programming, which becomes painful once the strings reach a few thousand characters.
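To make the O(n^2) concrete, here's the textbook edit-distance dynamic program (not the parent's code, just the standard formulation): two nested loops over the strings, so doubling both lengths quadruples the work.

```python
def levenshtein(a, b):
    """Classic O(len(a) * len(b)) edit-distance DP, rows kept two at a time.

    This quadratic cost is exactly what makes naive alignment slow on
    multi-kilobase strings.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```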
I've found that a better way to approach this problem is to use suffix arrays to generate the best overlaps (ranges) from the reference and then perform a "range compression" (not of the audible kind) to coalesce adjacent ranges that differ by very little and insert/delete the offending characters.
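A minimal sketch of the suffix-array half of that idea, under my own assumptions (naive O(n^2 log n) construction and a made-up `longest_match` helper; a real pipeline would use a linear-time builder like SA-IS and would report match positions as ranges, not just lengths):

```python
import bisect

def suffix_array(s):
    """Naive suffix array: indices of s sorted by the suffix they start."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def longest_match(ref, sa, query):
    """Length of the longest prefix of query found anywhere in ref.

    Binary-search the sorted suffixes; the best match must be adjacent
    to the insertion point.
    """
    suffixes = [ref[i:] for i in sa]
    pos = bisect.bisect_left(suffixes, query)
    best = 0
    for cand in suffixes[max(0, pos - 1):pos + 1]:
        k = 0
        while k < min(len(cand), len(query)) and cand[k] == query[k]:
            k += 1
        best = max(best, k)
    return best
```

Repeating the longest-match query as you walk along the target string gives you the overlap ranges to coalesce.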
I'm using a similar technique for genome compression against a reference and am able to get 700x compression on 2 Korean genomes and 130x compression on hg18 (build 36.1) and another build/assembly (number 36.3) of the human reference genome.
I have been working on a Ternary Search Trie implementation for Python on and off for a while. This is definitely a good read for some other methods for fuzzy string search. The benefits of the TST (versus other Trie formulations) are summarized on my blog.[1] I think I will do a follow-up post covering their characteristics in comparison to the methods Nikita covered in this post.
The author mentions at the end that he didn't consider Tries. However, I think they are a valuable data structure.
For those that don't make it all the way through the article: the author has implemented all the algorithms in Java, and the code is available at his project page (above).
I've used Damerau–Levenshtein + Jaro-Winkler for database record matching/customer dedupe. These things work magic: we found 2x as many matches versus exact comparison.
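For anyone curious what Jaro-Winkler actually computes: it scores by matching characters within a sliding window plus a bonus for a shared prefix, which is why it works so well on names and addresses. This is just the textbook formulation, not the parent's code (Damerau–Levenshtein is the edit distance above plus adjacent transpositions):

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)
```

The prefix bonus is what makes it favor "MARTHA" vs "MARHTA" over pairs that differ early, which matches how typos actually occur in customer data.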
I'm currently using a Patricia trie in an open source project I'm working on. I'm still pretty new to trees/tries, but now that I have begun to use them I can see how widely they are used in modern software products.
I have implemented TST (ternary search tree) in C to create fast dictionary (histogram) for counting words in Wikipedia. It is fast and memory efficient.
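The structure is simple enough to sketch in a few lines (in Python here rather than C, and with my own names; not the parent's implementation). Each node holds one character and three children: less-than, equal (next character of the word), and greater-than:

```python
class TSTNode:
    """One node of a ternary search tree: a char plus lo/eq/hi children."""
    __slots__ = ("ch", "lo", "eq", "hi", "count")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.count = 0  # times a word ending here was inserted

def tst_add(node, word, i=0):
    """Insert word into the TST rooted at node; returns the (new) root."""
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_add(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_add(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_add(node.eq, word, i + 1)
    else:
        node.count += 1
    return node

def tst_count(node, word, i=0):
    """Return how many times word was inserted."""
    if node is None:
        return 0
    ch = word[i]
    if ch < node.ch:
        return tst_count(node.lo, word, i)
    if ch > node.ch:
        return tst_count(node.hi, word, i)
    if i + 1 < len(word):
        return tst_count(node.eq, word, i + 1)
    return node.count
```

Compared to a plain trie with per-node arrays, a TST only allocates nodes for characters actually seen, which is where the memory efficiency comes from.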