Reading political candidates' campaign material to figure out what they stand for is a rather annoying exercise. Invariably, it's 95% fluff. Look at
http://www.johnmccain.com/Informing/Issues/65bd0fbe-737b-4851-a7e7-d9a37cb278db.htm .
It isn't my intent to pick on McCain, but this makes a good example. These five paragraphs are almost but not quite semantically null. I can glean from the first paragraph that he believes in man-made global warming, and from the fourth paragraph that he supports nuclear energy. This is useful information, but having to parse five paragraphs to discover it seems inefficient.
I think text classifiers might be able to improve this process. "Nuclear energy" seems like it would be a pretty strong ham token, while "for our children" and "addressing the challenges" seem pretty spammy. Trying this out is not quite as simple as just feeding things into CRM114, because we're trying to classify parts of messages rather than complete messages. It ought to be possible to work around that, though: perhaps score each clause, each sentence, and each paragraph, and then somehow derive an overall score from these inputs.
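To make the multi-level idea concrete, here is a rough Python sketch. It does not use CRM114; a hand-written phrase-weight table stands in for a trained classifier, and everything in it is an assumption made for illustration: the weights, the regex-based sentence and clause splitting, and the names `WEIGHTS`, `score_span`, and `score_paragraph`. The "somehow derive an overall score" step is filled in with a plain weighted average, which is only one of many possible choices.

```python
import re

# Hypothetical phrase weights: positive = informative ("ham"),
# negative = fluff ("spam"). A real system would learn these from
# labelled text instead of hard-coding them.
WEIGHTS = {
    "nuclear energy": 2.0,
    "for our children": -1.5,
    "addressing the challenges": -1.5,
}


def score_span(text):
    """Sum the weights of every known phrase that occurs in the span."""
    lowered = text.lower()
    return sum(w * lowered.count(phrase) for phrase, w in WEIGHTS.items())


def split_sentences(paragraph):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def split_clauses(sentence):
    """Naive split on commas, semicolons and colons."""
    parts = re.split(r"[,;:]\s*", sentence)
    return [p for p in parts if p.strip()]


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def score_paragraph(paragraph, w_clause=0.5, w_sentence=0.3, w_paragraph=0.2):
    """Combine clause-, sentence- and paragraph-level scores.

    The level weights are arbitrary placeholders, not tuned values.
    """
    sentences = split_sentences(paragraph)
    clauses = [c for s in sentences for c in split_clauses(s)]
    clause_score = mean(score_span(c) for c in clauses)
    sentence_score = mean(score_span(s) for s in sentences)
    paragraph_score = score_span(paragraph)
    return (w_clause * clause_score
            + w_sentence * sentence_score
            + w_paragraph * paragraph_score)


if __name__ == "__main__":
    fluff = "We must act for our children, addressing the challenges ahead."
    substance = "I support nuclear energy."
    print(score_paragraph(fluff), score_paragraph(substance))
```

With these made-up weights the fluffy sentence scores below zero and the nuclear-energy sentence scores above it. Whether learned weights would separate real campaign text this cleanly is exactly the open question.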
Anyone think this could work?