Here at work, we often need to find a string from the list of strings that is the closest match to some other input string. Currently, we are using Needleman-Wunsch algorithm. The algorithm often returns a lot of false-positives (if we set the minimum-score too low), sometimes it doesn't find a match when it should (when the minimum-score is too high) and, most of the times, we need to check the results by hand. We thought we should try other alternatives.
Do you have any experiences with the algorithms? Do you know how the algorithms compare to one another?
I'd really appreciate some advice.
PS: We're coding in C#, but you shouldn't care about it - I'm asking about the algorithms in general.
Oh, I'm sorry I forgot to mention that.
No, we're not using it to match duplicate data. We have a list of strings that we are looking for - we call it search-list. And then we need to process texts from various sources (like RSS feeds, web-sites, forums, etc.) - we extract parts of those texts (there are entire sets of rules for that, but that's irrelevant) and we need to match those against the search-list. If the string matches one of the strings in search-list - we need to do some further processing of the thing (which is also irrelevant).
We can not perform the normal comparison, because the strings extracted from the outside sources, most of the times, include some extra words etc.
Anyway, it's not for duplicate detection.
Similar questions here stackoverflow.com/questions/31494/how-to-detect-duplicate-data and here stackoverflow.com/questions/42013/… Others can be found through related tags and search terms. However, you didn't specify exactly why you need to match strings this way -- are you checking for duplicate data?
Are you matching 'real' strings (ie, English) or bioinformatics? If real strings, what are you using for your substitution matrix?
This is, indeed, very interesting.