Hack idea: text classification for political propaganda
15 points by dfranke on Feb 13, 2008 | 18 comments
Reading the campaign material of political candidates in order to figure out what they stand for is a rather annoying exercise. Invariably, they're 95% fluff. Look at

http://www.johnmccain.com/Informing/Issues/65bd0fbe-737b-4851-a7e7-d9a37cb278db.htm .

It isn't my intent to pick on McCain, but this makes a good example. These five paragraphs are almost but not quite semantically null. I can glean from the first paragraph that he believes in man-made global warming, and from the fourth paragraph that he supports nuclear energy. This is useful information, but having to parse five paragraphs to discover it seems inefficient.

I think text classifiers might be able to improve this process. "Nuclear energy" seems like it would be a pretty strong ham token, while "for our children" and "addressing the challenges" seem pretty spammy. Trying this out is not quite as simple as just feeding things into CRM114, because we're trying to classify parts of messages rather than complete messages. It ought to be possible to work around that, though: perhaps score each clause, each sentence, and each paragraph, and then somehow derive an overall score from these inputs.
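To make that concrete, here's a rough sketch in Python (rather than CRM114) of the sentence/paragraph scoring idea, with clause-level splitting left out. The "substance" and "fluff" phrase lists and the 50/50 weighting are invented purely for illustration; a real version would learn its token odds from a hand-labeled corpus.

    # Toy sketch: score text at sentence and paragraph level using
    # naive-Bayes-style log odds. The "substance" and "fluff" phrases
    # below are invented purely for illustration.
    import math
    import re
    from collections import Counter

    SUBSTANCE = [
        "supports nuclear energy",
        "cap and trade emissions targets",
        "raise the federal gas tax",
    ]
    FLUFF = [
        "for our children and grandchildren",
        "addressing the challenges we face together",
        "a brighter future for all americans",
    ]

    def tokens(text):
        return re.findall(r"[a-z']+", text.lower())

    def counts(docs):
        c = Counter()
        for d in docs:
            c.update(tokens(d))
        return c

    SUB, FLF = counts(SUBSTANCE), counts(FLUFF)

    def token_score(tok):
        # log odds of "substance" vs "fluff", with add-one smoothing
        p_sub = (SUB[tok] + 1) / (sum(SUB.values()) + len(SUB) + 1)
        p_flf = (FLF[tok] + 1) / (sum(FLF.values()) + len(FLF) + 1)
        return math.log(p_sub / p_flf)

    def score(text):
        toks = tokens(text)
        return sum(token_score(t) for t in toks) / max(len(toks), 1)

    def paragraph_score(paragraph):
        # score each sentence, then blend with the whole-paragraph score;
        # the 50/50 weighting is arbitrary
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        sent = sum(score(s) for s in sentences) / max(len(sentences), 1)
        return 0.5 * sent + 0.5 * score(paragraph)

    print(paragraph_score("We are addressing the challenges we face "
                          "together for our children."))    # negative: fluffy
    print(paragraph_score("He supports nuclear energy and cap and trade "
                          "emissions targets."))             # positive: substance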

Anyone think this could work?




This kind of verbose obfuscation also applies to economists. Apparently, Alan Greenspan wasn't happy with a speech until he had read the newspaper coverage the following day: if one newspaper reported the opposite of another, he was satisfied. You don't want to spook the market.


Even very simple text classification can get you somewhere. Last year a couple of economists, Matthew Gentzkow and Jesse Shapiro, published a wonderful paper in which they did something similar.

Take a look at it: http://home.uchicago.edu/~jmshapir/biasmeas052507_formatted....

"To measure news slant, we examine the set of all phrases used by members of Congress in the 2005 Congressional Record, and identify those that are used much more frequently by one party than by another. We then index newspapers by the extent to which the use of politically charged phrases in their news coverage resembles the use of the same phrases in the speech of a congressional Democrat or Republican."

They then go on to compare their index of politically slanted language in newspapers to the politics of the newspapers' readers, and conclude that newspaper bias is driven more by readers wanting their own prejudices confirmed than by the politics of the newspapers' owners. It's a really great piece of work: worth reading in full if you have any interest in text classification or politics.
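To make the phrase-matching idea concrete, here is a crude sketch in Python. It is not the paper's actual estimator: the phrases and per-party counts below are invented, and the real index is built from phrase frequencies in the 2005 Congressional Record.

    # Crude sketch of the phrase-frequency slant idea (not the paper's
    # actual estimator). The phrases and per-party counts are invented.
    import math
    import re

    # phrase -> (Democratic count, Republican count), all numbers made up
    PHRASE_COUNTS = {
        "death tax": (10, 120),
        "estate tax": (130, 15),
        "war on terror": (20, 90),
        "civil liberties": (80, 25),
    }

    def slant(article_text):
        text = article_text.lower()
        llr = 0.0
        for phrase, (dem, rep) in PHRASE_COUNTS.items():
            hits = len(re.findall(re.escape(phrase), text))
            # positive -> usage resembles Democrats, negative -> Republicans
            llr += hits * math.log((dem + 1) / (rep + 1))
        return llr

    print(slant("Repealing the death tax is the centerpiece of the agenda."))
    print(slant("Advocates of civil liberties want to keep the estate tax."))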


I like the idea, though if it came into widespread use, writers would do as spammers do--write to get past your filter.

One concern is that it'd be labor-intensive to build up a large enough corpus, with some metric to compare against, so you know how well you're doing. Marking it by hand takes a lot of effort, and if you spread the work across a lot of humans, the labels might be a little inconsistent.


I don't think they'd do that. The purpose of the fluff is not (usually) to deliberately conceal their beliefs. It's there because it's what some people want to read. They want a candidate who "cares". I don't think the people who write the fluff would have any motivation to try to beat fluff filters.


Come to think of it, instead of trying to beat the filters, what if they specifically cooperated with them so that they'd filter out nothing at all? Think of what a great boast it would make: "I'll take you straight to the point, and these folks over here will prove it mathematically". I could picture Ralph Nader trying this.


I think they already do that. The phrases on politicians' posters remind me a lot of the way I write my CV, inserting all the keywords that recruiters look for.

In fact, it would probably be easy to identify the best political keywords with Google AdWords: just see which ones get clicked the most.


Semi-related:

You might want to take a look at this blog entry:

http://billburnham.blogs.com/burnhamsbeat/2008/02/skygrid-an...

It's about a startup that has accomplished some cool things in the area of recognizing meaning from text. They've developed some "sentiment"-measuring algorithms that allow them to classify articles as "good" or "bad" for the purpose of deciding whether to invest in a stock. Pretty cool stuff! It might give you some ideas on similar approaches you could apply to your problem.


A professor at the university where I did my undergrad has done research on this topic:

http://www.cs.queensu.ca/home/skill/papers.html

For example, he took the Enron email data set and applied machine learning techniques to try to distinguish "dishonest" emails, where something was being concealed, from normal ones. A student of his has applied similar techniques to speeches given by MPs in Parliament (the Canadian equivalent of Congress).


So, a program that loads some text, recognizes its primary points, and prints a filtered version?

I'd think that there would be much better uses for it than as a BS-O-Meter for political literature.

Think about a program that could generate the summary of an essay or book, or assist novice writers in writing concisely. Maybe use it as a service that takes news items and abbreviates them for busy people who want to stay on top of things. You could call it snapnews.com or something.


I think you'd find that usefulness diminishes pretty rapidly as the problem domain expands. The reason spam filters work as well as they do is that most spam pretty much looks alike. Likewise, there are only so many ways you can phrase soothing platitudes that will get soccer moms to vote for you.


I'm working on something different, but related. I'm hoping to open-source part of it this week, so stay tuned. :)


Shoot me an e-mail when you do, please?

fedorov@rutgers.edu


A bit off topic, but that reminds me of Isaac Asimov's Foundation. At one point, a politician comes, makes a big speech with lots of promises, and everybody is happy, until a day-long sophisticated semantic analysis of the speech reveals that it was actually devoid of any meaningful content.


Good idea, I think it could be done. Perhaps you could look at congressional speeches and match them to how each speaker voted?
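A minimal sketch of that, assuming scikit-learn is available and using invented speeches and votes as stand-ins for real Congressional Record data:

    # Toy sketch: treat each (speech, vote) pair as a labeled example and
    # train a bag-of-words classifier. Speeches and votes are invented.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    speeches = [
        "we must cut taxes and shrink government",
        "we need to invest in health care for working families",
        "strong national defense and lower spending",
        "expand access to education and protect the environment",
    ]
    votes = ["nay", "yea", "nay", "yea"]   # votes on some hypothetical bill

    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    model.fit(speeches, votes)
    print(model.predict(["lower taxes and a strong defense"]))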



Where do you get good and bad seed phrases?


Read transcripts from every Speaker of the House you've never heard of. Being powerful but creating nothing memorable is an indicator.


Another possibility: watch a few speeches on YouTube, and write down the applause lines.



