jmt_ · on Nov 19, 2022

Hey Simon, thanks for creating Django with Adrian. I was deeply interested in programming from a young age, but learning Django in my teens sparked a passion for web development that has yet to wane so many years later! Appreciate all your contributions to this space.

hnfong · on Nov 19, 2022

OMG.

The most impactful thing I've done outside of paid work is a website running on Django. I could live without queryBySelector or their descendants, but not without Django.

Thank you, Simon.

yuuu · on Nov 19, 2022

jmt_ · on Nov 19, 2022

Please tell me you're legally allowed to talk more about Megaupload and the work you did - sounds like an absolutely amazing blog post, would love to hear as much as you're able to discuss.

Also, I have a project in production at work where a device needs to grab its public IP address. My code has a list of sites that provide that info and I have ip4.me as a fallback in that list, so thank you for building it!
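A minimal sketch of the fallback-list approach described above, using only the standard library. The function names, the timeout, and the choice to scan the response body for the first globally routable IPv4 address are assumptions made for illustration; this is not the commenter's actual code, and each provider's real response format should be checked before relying on it.

```python
import ipaddress
import re
from typing import Callable, Iterable, Optional
from urllib.request import urlopen

# Loose pattern for dotted quads; real validation happens in extract_ipv4().
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def http_get(url: str, timeout: float = 5.0) -> str:
    """Default fetcher: return the response body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_ipv4(text: str) -> Optional[str]:
    """Return the first globally routable IPv4 address found in text, else None."""
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            addr = ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # matched the loose pattern but is not a valid address
        if addr.is_global:
            return str(addr)
    return None


def public_ip(
    providers: Iterable[str],
    fetch: Callable[[str], str] = http_get,
) -> Optional[str]:
    """Try each provider URL in order and return the first usable answer."""
    for url in providers:
        try:
            ip = extract_ipv4(fetch(url))
        except (OSError, ValueError):
            continue  # unreachable, timed out, or malformed URL: try the next one
        if ip is not None:
            return ip
    return None
```

Calling `public_ip([...])` with your own ordered list of provider URLs (with the fallback last) returns the address as a string, or `None` if every provider fails. The `fetch` parameter exists so the logic can be exercised without network access.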

jmt_ · on Nov 19, 2022

It's kind of amazing anyone chooses to go into healthcare having to work like this. It's the absolute last field I would ever want to go into, even as an engineer who wouldn't need to actually practice medicine. Seems like you need to practically give up your life to save countless others. Your wife, and those like her, are truly performing an innately critical job at an absurd cost to themselves - God bless.

jmcgough · on Nov 19, 2022

It depends a lot on the specialty. Obgyn is particularly hellish.

But yeah, there's a good reason why suicide rates are so high for doctors...

jmt_ · on Nov 18, 2022

Interesting idea leveraging a cheap Android. I wonder how difficult it would be to modify an instance of a regular headless browser in order to convince a website you're using an Android browser. Not sure if Androids just come with mobile Chrome these days or if OEM/carrier-developed type stock browsers still get shipped.

Also totally right on the IP reputation point. I saw a post on HN in the last few months of someone describing how they used a cheap mobile data plan + USB LTE modem to proxy their web scraping. I believe you get effectively treated as a residential IP (depends on the complexity of the system - if they're simply blacklisting datacenter IPs then this should work) with the additional benefit of being able to change the IP assigned to the modem easily.

jmt_ · on Nov 18, 2022

How would you actually use an anti-detect browser programmatically? Would you need to write a custom Selenium driver for it or equivalent for Playwright? Even if the browser is built off something like Chrome, you'd still need a way to interact with the anti-detect related features.

A good trick I discovered is using webkit through Playwright to bypass fingerprinting and related anti-bot measures. Firefox/Chrome simply leak too much information, even with various "stealth" modifications. E.g., I have been able to reliably scrape a well-known company's site that implemented a "state of the art, AI-powered, behavioral analysis, etc." anti-bot product. Using Chrome/Firefox + stealth measures in Playwright did not work; simply switching to Webkit with no further modifications did the trick.

Not exactly what you're asking, but my point is, that with a little time and effort, I've usually been able to find fairly simple holes in most anti-bot measures -- it probably wouldn't be terribly hard (especially since you're versed in scraping) to build-out something similar to what you're looking to achieve without having to pay for sketchy anti-detect browsers.

DantesTravel · on Nov 18, 2022

Yes, that’s what I’ve done up to now. When forced to use Playwright, I’ve noticed too that Webkit is less detected, but it depends on the website. I tried the solution described on the Substack: fundamentally, the gologin browser, which is based on Chromium, opens a port on your local machine and Playwright connects to that browser, automating the crawling.
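For readers unfamiliar with that setup, here is a rough sketch of attaching Playwright's Python port to an already-running Chromium-based browser over its local debugging port. `connect_over_cdp` is a real Playwright call; the helper name `crawl_titles`, the endpoint `http://localhost:9222`, and the idea of collecting page titles are assumptions for illustration, and the port a given anti-detect browser exposes will differ.

```python
def crawl_titles(playwright, cdp_url, urls):
    """Attach to a running Chromium-based browser and collect page titles.

    `playwright` is the object yielded by `sync_playwright()`.
    `cdp_url` is the local debugging endpoint (assumed, e.g. http://localhost:9222).
    """
    browser = playwright.chromium.connect_over_cdp(cdp_url)
    try:
        # Reuse the browser's existing context (and its profile) if there is one.
        context = browser.contexts[0] if browser.contexts else browser.new_context()
        page = context.new_page()
        titles = {}
        for url in urls:
            page.goto(url)
            titles[url] = page.title()
        page.close()
        return titles
    finally:
        browser.close()


def main():
    # Imported here so the helper above can be exercised without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        print(crawl_titles(p, "http://localhost:9222", ["https://example.com"]))
```

Running `main()` assumes the external browser has already been started and is listening on that port.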

jmt_ · on Nov 18, 2022

Yeah, Chrome is the worst choice for this use-case - see my last comment on this thread for more on that. Can you speak a bit more on what you'd like to use a headless anti-detect browser for over regular headless browsers? Is it to leverage their built-in fingerprinting control, effectively avoiding anti-bot measures with little effort, or management of multiple "profiles", etc.? My system effectively comes down to using webkit, and storing credentials (encrypted w/ symmetric key) as well as whatever information is needed by Playwright to reconstruct the session. Simply using webkit + DB effectively achieves a headless anti-detect browser, but you're right that webkit alone isn't always a one-and-done solution.

anxiously · on Nov 18, 2022

Any tips/code examples for your webkit solution(s)? Where does one begin with using webkit for scraping?

I think using anti-fingerprinting is itself a fingerprint. I imagine it would be easier to hide in the noise of regular browsers.

jmt_ · on Nov 18, 2022

> I think using anti-fingerprinting is itself a fingerprint. I imagine it would be easier to hide in the noise of regular browsers.

That's what I thought originally too. The problem is the "leaky-ness" of Chrome and Firefox - they expose a large amount of information that can be easily used to train various ML classifiers. The Chrome DevTools Protocol is most commonly used when headless access to Chrome is desired and is inherently "leaky" by design, as a protocol for debugging. Don't even try to use any flavor of headless Chrome, even with stealth plugins. Firefox isn't much better.

Webkit doesn't seem to expose as much information, and since it has a much smaller share of usage, I think there's simply less information to feed into a classifier to learn to detect it reliably. There are a few sites that offer fingerprint testing, such as:

- https://amiunique.org/fp

- https://webscraping.pro/wp-content/uploads/2021/02/testresul...

Try writing a script that goes to a page like this and have it take a screenshot, using Chrome, Firefox, and then Webkit to see the difference yourself. I use the Python port of Playwright personally. In the project I mentioned in my last comment, all I had to do was change the browser Playwright was using to webkit - i.e., "browser = p.webkit.launch()" where "p" is a sync_playwright context manager instance. I tried Chrome and Firefox with many, many attempts at stealth modifications and none worked. Removing my "stealth code" for the other browsers and changing it to webkit was all that was needed. Blew me away that it was that simple honestly. I've used this trick on other websites and have noticed webkit just gets processed differently by captchas/anti-bot, etc. Selenium should also offer support for a WebKit driver if you prefer it over Playwright.
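A small sketch of the comparison script suggested above, using the Python port of Playwright. The helper name, the output directory, and the use of a full-page screenshot are assumptions for illustration; note that Playwright's bundled engine is named `chromium` rather than Chrome, and the results will vary by site and over time.

```python
from pathlib import Path

# Playwright's three bundled engines, as attribute names on the playwright object.
ENGINES = ("chromium", "firefox", "webkit")


def screenshot_with_each_engine(playwright, url, out_dir):
    """Open `url` in each engine and save one screenshot per engine.

    `playwright` is the object yielded by `sync_playwright()`.
    Returns the list of screenshot paths, in ENGINES order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in ENGINES:
        browser = getattr(playwright, name).launch()  # e.g. playwright.webkit.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            target = out / f"{name}.png"
            page.screenshot(path=str(target), full_page=True)
            saved.append(target)
        finally:
            browser.close()
    return saved


def main():
    # Imported here so the helper above can be exercised without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        for path in screenshot_with_each_engine(p, "https://amiunique.org/fp", "fingerprints"):
            print(path)
```

Running `main()` requires the browsers to be installed first (`playwright install`); comparing the three images side by side shows what each engine reveals to the test page.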

jmt_ · on Nov 18, 2022

Don't remember where I saw this but: "The act of purchasing physical books and the act of reading them are two entirely different hobbies and practices".

Additionally, if someone downloads your book for free, and barely reads it, as you suggest, have they _really_ done much more than picking up the book in person, scanning the table of contents, reading some pages in particular, then setting the book back down? What you're describing sounds like the digital version of what I used to do at physical bookstores; it seems a less compelling argument against what the z-lib founders were doing, not more.

SturgeonsLaw · on Nov 18, 2022

https://www.raptitude.com/2022/01/everything-must-be-paid-fo...

jmt_ · on Nov 17, 2022

I've built all my web backends in either Flask or Django - what are the selling points/advantages of using Rust vs. Python in this context? Certainly not arguing against it but am very curious and unfamiliar with Rust. If anyone has moved from Python to Rust for this kind of work, can you speak on your experiences with doing so?

drogus · on Nov 17, 2022

I have very limited experience with Python (3 months of production experience), but vast Ruby experience, and I think a lot of the things apply to both. For me these are the reasons why I would choose Rust over Python or Ruby in most cases:

* it's relatively easy to write code that will pretty much never crash. Yes, Rust will not prevent all of the bugs, but it will prevent almost all of the things that end up as a runtime exception in dynamic languages. So you will not end up with an error tracker full of errors

* refactoring is so much easier in Rust - I can enter a new project, change a bunch of stuff and after I fix compile errors and tests my confidence that I didn't break anything is like 10 times higher than in Ruby or Python

* while most of the time speed is not a major concern for backend development, it sometimes is a huge bonus. There were cases in the past when I had to spend a lot of time to optimise Ruby code, because it was just too slow (imagine rendering a lot of HTML, doing computation that is hard to do on the DB side etc)

* handling JSON with serde is just on another level

* Rust is very versatile. When a company starts using a language, they will naturally try to fit as much stuff into the language as possible, after all if you have mostly Python devs, you will prefer Python. A lot of people say "just use the right tool for the job", but even if there was something like "the right tool for the job", in practice it's more like: if the downsides of using our primary tech are huge, let's consider introducing a new language, otherwise let's stick with what we have. I've seen it numerous times in the past. I feel like with Rust the downsides of using it for most of the stuff are much smaller than for example for most of the other languages

fiedzia · on Nov 18, 2022

Modern web will be Python + Django/Flask + Js/Typescript (and maybe Swift or Java for mobile apps). You can replace almost all of that with Rust, so you'll get one language to rule them all, but focusing on the Python part only: obviously performance, sane tooling and dependency management, and a statically typed language results in fewer runtime errors. Overall Python gets you a lot faster from 0 to "it works on my laptop", Rust gets you faster to "it's good enough to put it in production".

theptip · on Nov 17, 2022

If you really need speed (high RPS) then Python is going to tap out at some point. Normally for line-of-business apps and most startups, developer productivity trumps raw perf IMO (obviously depends on the domain though).

There is some argument that you (or at least some developers) can be more productive with a strongly-typed ORM, but I think FastAPI / Django-ninja capture most of these benefits in the Python ecosystem so it’s not a big win in this dimension IMO.

But in summary it’s a perf vs productivity trade-off.

dehrmann · on Nov 17, 2022

It's written in Rust. Rust is at the stage of its life where people will try to do anything and everything with it because they can. Some will stick, while some are just silly.

If I interviewed somewhere that used Rust for web apis, I'd be very hesitant because this isn't really what Rust is good at (yet?), and someone chose it because they wanted to try it more than use the right tool for the job.

jmt_ · on Nov 16, 2022

Trash language is a bit harsh. I'm not sure I would try to put an R project into production or build a huge project with it but, at the very least, R/RStudio was the best scientific calculator I've ever used. Was particularly great during college.

jmt_ · on Nov 12, 2022

I just want to be able to choose which I want on a day-by-day basis. Some days I am much more productive in the office because I need to, say, communicate with a bunch of people or there's some sort of back and forth that needs to occur. Or I just need to get out of the house for a bit. Other days, I just want to stay home and spend the time focusing on something in particular without being bothered. But I need to have both to do my job effectively. However, I work at a small company and am the primary software engineer, which changes the dynamic a bit.

jjav · on Nov 12, 2022

> Some days I am much more productive in the office because I need to, say, communicate with a bunch of people

But on that day all of those people might've chosen to work from home so you sit all alone in the office.

The hybrid solution only really works well if the office days are the same for everyone.

justahuman74 · on Nov 12, 2022

Forced-hybrid is just using a sledgehammer on the calendar rather than providing the flexibility that people actually like about an optional-attendance office.

If the staff actually find this face-to-face time valuable, they'll organize and go into the office themselves. Personally, I've had zero requests from anyone to do that outside of explicit team-building activities.

closeparen · on Nov 12, 2022

In the before times, "Is there a Zoom for this?" was a live question. Zoom fatigue, such as it was, did not dominate working life. Savvy collaborators anticipated when a conversation would demand more than Zoom, and they made judicious use of the public transit between our offices to keep things smooth.

My enthusiasm for RTO is not about changing the venue from which we take our Zoom calls, but about changing our relationship with Zoom to something like its pre-pandemic state. To have some conversations naturally again. But there's no getting around that the WFHers would have to either show up or be excluded from them.

I personally find Zoom excruciating, and I blame the transition to remote meetings for all communication for turning a once optimistic and joyful career experience into a miserable slog.

theshrike79 · on Nov 12, 2022

We have a weekly slack poll every Friday for office days the next week. Works pretty well.

We used to make every Thursday a "come to the office if you can" day, but switched to the poll system.

jmt_ · on Nov 1, 2022

I can see why you're saying that, but the use of commutative diagrams communicates the structure between the functions and objects of interest - it is this perspective which is the core idea of category theory. So I'd argue that the diagrams are a result of communicating a category-theoretic model rather than the end result itself, and therefore have much deeper meaning than just "boxes being connected by arrows".