
The parser we're using now is not based on that JavaScript codebase at all. It's excellent to see it being used in so many ways now, though.

We're all about making reading better on the web.




Are you using the lxml port or did you guys rewrite it again?


A completely rewritten stack. It's continually improving — one of our biggest efforts.

We do heavily utilize lxml though.


Given the number of articles I guess you're processing each day, I think you should probably rewrite your parser in C. I used to run a service which basically consisted of a feed reader where every article was preprocessed by an algorithm similar to readability. I wrote the parser using lxml and it looked fast enough, but when I got into the 400K-500K pages per day territory I started having performance problems. Since parsing the pages is easily parallelizable across multiple machines, I could have just rented some more servers. But where's the fun in that? So I sat in front of the computer and 4 hours later I had a C implementation which passed the entire test suite and, according to valgrind, didn't have any memory leaks. As soon as I deployed it into production, CPU and memory usage dropped by something like 10x (don't remember the exact number) and I was able to remove some servers and bring the costs down. Sadly, I had to close that project because I was spending too much time on it compared to the revenue it was generating, but it was so much fun while it lasted.


Another anecdote: I was writing an HTML-to-text converter. The prototype used lxml and some custom DOM-traversal and formatting logic in Python. I got about a 17x speedup from porting the thing to use C and libxml2 (the parser that lxml uses). The port to C took most of an afternoon, and it's currently chewing through a lot of HTML without a problem.
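For anyone who wants a picture of what such a prototype roughly looks like, here is a minimal sketch of an HTML-to-text converter in Python. It is not the code described above: it swaps in the standard library's `html.parser` instead of lxml/libxml2 so that it runs without extra dependencies, and the names `TextExtractor`, `html_to_text`, `SKIP_TAGS`, and `BLOCK_TAGS` are made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: tags whose contents a reader never sees.
SKIP_TAGS = {"script", "style", "head"}
# Assumption: tags that should start a new line in the text output.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6", "tr"}


class TextExtractor(HTMLParser):
    """Collects visible text and inserts line breaks around block tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).split("\n")
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in raw_lines)
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Some <b>bold</b> text.</p>"))
```

The traversal and formatting logic is the part that lives in Python here, which is plausibly where an interpreter-bound prototype would spend much of its time; a C port along the lines described above would do the same walk directly over the libxml2 tree.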


I can confirm that libxml2 is indeed great.


I would love to use Enjoy Reading, but it is a non-starter for me until it works with noscript... :(



