
Parsley is great for this. http://github.com/fizx/parsley For Python: http://github.com/fizx/pyparsley

There used to be a website for sharing "parselet" scripts.



I find it amazing that somebody came up with a universal language for data extraction and nothing is being done with it.


Hi, I wrote Parsley.

We (tectonic and I) also wrote an in-browser IDE and a JavaScript-driven web crawler that runs on headless Firefoxes on EC2. We also wrote something similar to ScraperWiki that integrated a simpler version of the IDE.

We got a couple consulting deals, building smart web crawlers for clients, and about 15 passionate open-source users. I think the primary problem is that unless you're scraping many sites, it's easier to write 50 lines in Ruby/Python/$language_you_already_know than to learn a new language and cut it to 5 lines.

If there's interest in digging this project up, please contact me at kyle@kylemaxwell.com.
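
To illustrate the trade-off described above, here is a minimal sketch of the general idea behind a declarative extraction spec: a small mapping from field names to selectors, applied by one generic function. This is not Parsley's actual syntax or API. The SPEC format, the extract() helper, and the sample PAGE are all hypothetical, and the sketch uses only Python's standard library, which assumes well-formed markup (real-world HTML would need a more tolerant parser).

```python
import xml.etree.ElementTree as ET

# Hypothetical spec format (not Parsley's): field name -> (XPath, attribute).
# An attribute of None means "take the element's text".
SPEC = {
    "title": (".//h1", None),
    "links": (".//a", "href"),
}


def extract(markup, spec):
    """Apply every rule in spec to well-formed markup and collect the matches."""
    root = ET.fromstring(markup)
    result = {}
    for field, (path, attr) in spec.items():
        nodes = root.findall(path)
        if attr is None:
            # Collect the stripped text of each matching element.
            result[field] = [(node.text or "").strip() for node in nodes]
        else:
            # Collect the attribute value, skipping elements that lack it.
            result[field] = [
                node.get(attr) for node in nodes if node.get(attr) is not None
            ]
    return result


# Made-up sample page, kept well-formed so ElementTree can parse it.
PAGE = """<html><body>
<h1>Example</h1>
<a href="/one">One</a>
<a href="/two">Two</a>
<a>no href</a>
</body></html>"""

data = extract(PAGE, SPEC)
print(data)
```

The per-site part is only the SPEC mapping; the extract() function is written once and reused, which is where the "cut it to 5 lines" saving would come from when scraping many sites.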


I'd like something in .NET that I could throw a parselet/blob/widget/thingy at and it would return a list of important stuff from a website that I was authenticated on. Others could do the same from other platforms using the same setup. And then the parselets would be independent of the platform. Ideally it would be integrated into the browser or the O/S. This means when I sign up for your service X, I also download my widget for it. I can then access the data anytime I like without having to play inside your walled garden.

Not sure I'd want to pursue it right now, but shit, it'd be a game changer on all kinds of levels. Especially if it allowed two-way communication.

EDIT: In fact, I know just how I'd deploy it -- as a shell mod in Windows. You have a drive Y which is really a NortonBackup, no reason why you can't have drives and folders that represent any sort of online service you have that stores your stuff -- FB updates, Twitter, etc. Let the O/S worry about synchronization and all of that. Why read 27 ads for FB games when you can just go to X:\Facebook\FriendsStatus and read all of the unadorned status updates, which is why you visit anyway?


This would be really easy on Unix with FUSE. It's been a while since I've done Windows development.


Ain't nothing but a thing. Assuming there's a C library that would fetch the page, apply the widget, and return the list, it's probably 3-4 days of work.

Using my Markham's Estimating Tool (double the number and go to next higher units) that comes out at 6-8 weeks.

Windows shell extensions are all COM nonsense. You have to know the magic numbers and know where to put them. The rest is pretty straightforward code.


That is awesome. At first cut it does look nicer than Nokogiri, which I had been using for quick scrapes.



