Hacker News
Show HN: Instagram-scraper – Scrape instagram photos by tags, without API (github.com/meetmangukiya)
165 points by meetmangukiya on June 24, 2018 | 80 comments



This scraper was written to get images and create a dataset for ML models for a personal project while studying Machine Learning and Artificial Intelligence.


curious why you are scraping instagram for this purpose and not something like flickr which has a reasonable public api and tagged creative commons licensed images that are suitable for your ML purposes. at the very least, it's worth investigating archive.org's many freely licensed archives for this sort of thing.

as somebody that has fielded numerous emails from friends asking me to remove tagged photos of them from flickr, i sort of wonder about the ethics of harvesting these sorts of images from instagram, a community whose norms sort of revolve around semi-public sharing of photos. I don't doubt that there's some rationale for harvesting the images from ig, but aside from thumbing your nose at their TOS, it feels like it's a greater violation of trust to harvest your friends' and strangers' photos for an ML project without their informed consent.

at the very least, it's worth considering pointing your app's gaze at a set of images licensed for any purpose whatsoever rather than ones that are explicitly licensed All Rights Reserved by their respective photographers.


Does it also offer ability to search categories of images with keywords?


That's cool - what are your goals with this project?


I wish Instagram had an API for their photos. You used to be able to query photos by coordinates and what not. I started building a weather app that would only show you the feels like temperature and wanted to show some photos of what it looked like outside near your location. Instagram kept rejecting my app and then they shut down the API completely. So what I had to do is this:

Query Facebook for a list of location IDs near your location -> use those location IDs to get photos tagged with that location on Instagram -> wait for response for all of those photos to come back and then sort by recently taken. It ends up taking fairly long.

I still made the app anyway: https://itunes.apple.com/ca/app/feels-see-what-it-feels-like... but I am going to transition to using public Snapchat stories.
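The pipeline described above (location IDs, then photos per location, then sort by most recent) can be sketched roughly like this. It is only an illustration: `fetch_photos` is a hypothetical callable standing in for whatever request returns the photos for one location, and the `taken_at` field name is an assumption, not a real Instagram or Facebook field. Fetching the locations in parallel is one possible way to shorten the wait, not what the app necessarily does.

```python
from concurrent.futures import ThreadPoolExecutor


def recent_photos_near(location_ids, fetch_photos, limit=20, max_workers=8):
    """Fetch photos for every location ID, merge them, newest first.

    fetch_photos is a hypothetical callable: location_id -> list of dicts,
    each with an assumed "taken_at" timestamp key.
    """
    # Ask for all locations concurrently instead of one after another.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_location = list(pool.map(fetch_photos, location_ids))

    # Flatten the per-location lists and sort by recency.
    merged = [photo for photos in per_location for photo in photos]
    merged.sort(key=lambda photo: photo["taken_at"], reverse=True)
    return merged[:limit]
```

The total wait is then closer to the slowest single location request than to the sum of all of them, assuming the upstream service tolerates concurrent requests.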


I've used InstaLooter as a library to great success: https://github.com/althonos/InstaLooter

Even made an issue on the repo when I ran into an issue setting it up on AWS and the maintainer was fast to respond.

Used it to make a bot to scrape soup special menus from a sandwich place near my work: http://blog.matthewbrunelle.com/projects/2018/05/07/Soup-Bot...


If you've ever tried scraping Facebook you'd know that it's nearly impossible to do so reliably. They have a formidable anti-scrape strategy. Instagram is currently ridiculously easy to scrape - but I doubt it's going to be that way for long.


Google also does it, although I suspect it may be accidental when they pass it through GWT or something else. I wanted to read a value from a simple table in the Google developer console, only to discover it is 27 levels deep in the DOM.


Any links on their anti scrape strategy? Been developing a distributed scraper with baked in proxy per process ability to harvest data for a data project I'm working on. So this would be good to know.


I don't think it's something they would publicize otherwise it would be an invite to work around it. The best route might be to try and scrape a public business page (better yet, several hundred) - that'll show you everything you need to know


what do they do that makes it so hard?


They (randomly?) change IDs and the DOM structure, and thus break scripts. One workaround is to automate visually (KantuX, Sikuli,...) but that is much slower and makes only sense for low-volume scraping.


They randomize the tag ids and classes a lot of the time. At least that has been my experience with the Facebook messenger website.


If you're doing messenger, there's a community api that emulates the browser on github


I use https://github.com/rarcega/instagram-scraper

The rate limit imposed by Instagram is a bit tough, but as I only use it for archiving a few of my close friends' accounts (it supports private accounts), that's OK.


What is HN’s opinion on the legality of these types of scrapers? Instagram’s robots.txt disallows this kind of scraping, same for their ToS. Legal precedents have been mixed - the recent LinkedIn vs HiQ case is a good signal, but it’s still in appeals court.


If the data is made available to me as a human, then I am free to delegate the job of retrieving it to a machine if I choose, and I will be doing it whether you like it or not.


Unfortunately "whether you like it or not" doesn't carry much legal weight.


This brings up the delightful distinction between malum prohibitum and malum in se.


"Whether you like it or not" isn't the legal part, it's the result.


Tor and proxies solve that problem pretty nicely from experience.

Not saying this is right, but if the service provider wants to play cat and mouse then I’m happy to take part.


I don't see how using Tor would make your legal case any stronger


It would prevent the legal case from ever taking place, ideally.


The service provider will just block Tor.


And there are plenty of other proxies to choose from.

My point is, you can’t just put content online and expect to put restrictions on how I send the HTTP requests for it on my side. And if you think you can well I’ll do my best to prove you wrong.

Edit: I totally agree that the data they’re giving me is their property and I merely have a license to use it. I’m not advocating for copyright infringement or anything like that. But if the license allows me to use the content for X purpose, then the way I request the content (whether through a browser or a scraper) shouldn’t matter.


Just because you can do something doesn’t mean you should. Just because you can make an HTTP request doesn’t mean you have a right to do whatever you want with that data. You may not agree with that, but that’s what at least US law outlines.


There are plenty of fair use exemptions to copyright. That said, US law is not the highest of moral standards to aspire to. Especially US copyright law. We've been flouting it online forever, because it's obviously ridiculous and still hasn't caught up with what's fair and equitable in 30 years.


If you're staying within the boundaries of fair use, you only have a Terms of Service to contend with. What you want to aspire to is meaningless versus what the law of the land is (which is what is enforced). That is my point. I don't disagree that most first world copyright law is overzealous and in a lot of scenarios, unreasonable.


That's why it is important to root out corruption from law making, nowadays called "lobbying" - this should be illegal. If you expose an endpoint to the public you can't restrict who or what can consume it. You can do throttling on your side but that's it. Otherwise this is just racism towards machines.


I don't think machines are yet considered a "protected class".


How do you outlaw lobbying?


It probably involves lobbying.


Why are we even arguing about this, is what I want to know. Call a referendum and let the people decide.


Not to mention, this still leaves a hole open as a scraping company can outsource the “scraping” to low-wage data-entry slaves in third-world countries.


Sure, but the question was about legality, not ability. Of course there are ways to reliably circumvent some laws. Some laws are stupid. Sometimes those two intersect. But those serve only to answer questions about indictment risk and whether something should be legal, not whether or not it is.


Legal rephrasing: Is there a reason that as a user of a user-agent like Firefox you would be in a legally different position than as a user of this user-agent app?


Yes, because your intent is to circumvent the provided mechanisms of the Instagram API and you’re doing it with a program instead of just browsing around the site with the intent of using it like a normal user who isn’t scraping. “But the program did the exact same thing that a human could do!” is a really good, pedantic CS argument, but it’s a terrible legal argument.


You're being down-voted, and I disagree with your use of the word pedantic here, but I believe your answer is mostly correct. "Browse wrapped" usage agreements are very real and enforceable under certain conditions. At best this is presently a legal gray area.

https://en.wikipedia.org/wiki/Browse_wrap#Summary

Remember, anyone can sue you for anything and it is then your burden to defend yourself if they are persistent and wealthy enough.


I second this opinion. The other point I would make (and of course this doesn't necessarily condone scraping sites that don't want to be scraped) is that I suspect scraping is being done on a massive scale, especially with sites such as IG. The influencer marketing model has become such a big one that there's a lot of valuable data to be had that you can't access via the API.


>massive scale

just take every search engine.


I mostly scrape public interest government data, not commercial data. Personally, I don't really care about laws. I'm not hitting PACER or JSTOR, I'm not starting a competing company, and I'm not making my scrapers available as a service, so I'm totally unlikely to be sued.

In terms of ethics I typically apply a sort of "try to be considerate" test. If I am doing personal, non-commercial scraping, only scraping public data, not releasing the data, using timer delays, respecting 429 codes, and not doing too much to mask my identity, and the servers I'm hitting are massive services handling multi-million user concurrency and I'm using a DigitalOcean droplet, then I'm not really a problem. And I trust their well paid sysadmins to block me or contact me if I am the problem. I do check robots.txt in advance. If there is one, it doesn't stop me, but it generally means I take extra precautions to avoid causing trouble for the service in question.

On the other hand, I once scraped the DPRK's English language press office as part of some research, and about two weeks later it became inaccessible from any country other than Japan, so I'm pretty sure I almost caused a diplomatic incident. Oops.

In this case, it's 40 lines of python code using a single threaded requests request, only scraping a single page, not doing any funny business with user agent spoofing, etc. I think you're right, it's probably not something Instagram wants to happen and theoretically I'm sure they could send a takedown, but I guess I have a pretty laissez-faire attitude about this: is it really a good use of their time to stop this guy?
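Two of the "try to be considerate" habits listed above, timer delays and respecting 429 codes, could look something like the sketch below. Everything here is an assumption for illustration: `fetch` is a hypothetical callable returning `(status, retry_after_seconds_or_None, body)`, and the default delay and retry count are arbitrary. It is not taken from the scraper under discussion.

```python
import time


def polite_get(fetch, url, delay=2.0, max_retries=3, sleep=time.sleep):
    """Fetch one URL, pausing between requests and backing off on HTTP 429.

    fetch is a hypothetical callable: url -> (status, retry_after, body),
    where retry_after is a number of seconds or None.
    """
    for attempt in range(max_retries + 1):
        status, retry_after, body = fetch(url)
        if status != 429:
            # Fixed pause so the next request is spaced out.
            sleep(delay)
            return status, body
        # The server asked us to slow down: use its hint when it gives one,
        # otherwise wait exponentially longer on each attempt.
        wait = retry_after if retry_after is not None else delay * (2 ** attempt)
        sleep(wait)
    raise RuntimeError("still rate-limited after %d retries" % max_retries)
```

Passing `sleep` in as a parameter keeps the function easy to exercise without actually waiting; a real run would leave it at `time.sleep`.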


> try to be considerate

I understand what you're getting at, but if someone tells you explicitly not to do something (e.g. in the terms of service of their website), doing that thing anyway doesn't seem very considerate.


It sucks that the same companies that are trying to make scraping illegal do massive scraping on their end, many times deceiving the users whose data they are scraping.

FB and LinkedIn both scrape contacts (and who knows what else), from their users email accounts (and surely lots of other darker sources as well). Pretty much all of Google's search engine's content is scraped from other websites.


"But you agreed to this scraping when you signed the terms and conditions"

I agree with you, but technically that argument is true

And as for Google, in theory a robots.txt would block their scraper.
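For anyone who wants to see what "a robots.txt would block their scraper" means mechanically, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text below is invented for the example (it is not Instagram's actual file), and the user agent names and URL are placeholders. Note that robots.txt only works if the crawler chooses to consult it.

```python
from urllib.robotparser import RobotFileParser

# Invented sample: lets one named crawler in and turns everyone else away.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", "https://example.com/explore/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/explore/"))
```

With this sample file the first call prints `True` and the second prints `False`.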


The opinion of one HN’er is that it is definitively a gray area. So whether or not you’re sued for it depends on if someone resourceful has a reason to sue you (e.g. LinkedIn vs HiQ); in most other cases, I don’t think anybody really cares.

There was an earlier case (I don’t recall the details) about scraping violating copyright, which might be a valid defense - that you’re engaged in unauthorized “copying” of the (protected) source code (of the webpage), albeit momentarily, into your computer’s memory while your script processes it to extract the relevant contents. So even though it might be fine to copy the image/data/whatever else, you have no right to copy source code - which is certainly protected by copyright (your browser processing the same page is something the copyright owner explicitly allows, by virtue of making it available on the web). By that same token, it can also be considered illegal to save webpages on your hard disk. But do refer to the last sentence in my earlier paragraph.

The LinkedIn vs HiQ case is currently being appealed in the (court of appeals for the) 9th Circuit. Regardless of which way it rules, the decision of the 9th Circuit only applies in the states under its jurisdiction (maybe as a precedent in others, but it won’t be binding). It can still go to the Supreme Court, depending on how determined the parties are.

There was a ruling in a different (but somewhat related) case in Jan 2018 that violating the ToS is not a crime: https://www.eff.org/deeplinks/2018/01/ninth-circuit-doubles-... To quote: “[T]aking data using a method prohibited by the applicable terms of use” (i.e., scraping) “when the taking itself generally is permitted, does not violate” the state computer crime laws.

There was a very long discussion on HN about scraping over a year back: https://news.ycombinator.com/item?id=13884357


IMO scraping like this is shady at best.

That said, however, there's no straightforward way to work with Instagram. Its original APIs are both limited and locked down. The new Facebookified APIs are limited and next to impossible to work with (they are geared exclusively to ads/marketing). So ¯\_(ツ)_/¯

In a side project I use a library that effectively reverse-engineers Instagram's private API, pretends it's a user using a browser etc.


can you provide a link or source for that ? TY


I’m using this one: https://github.com/ping/instagram_private_api

If you search for Instagram Private API, you’ll find implementations for almost any programming language (with aforementioned PHP being the first, probably)


The best one I’ve used is in PHP. That might help your search; it’s late and I’m on my mobile so I can’t help with a proper link. It’s been a couple years


My opinion is basically that since businesses use machines to target, communicate, and attempt to influence me, they have no right to deny my ability to do the same, unless it harms their service e.g. DDoS.


Exactly. Want to block non-humans from automatically accessing your content? OK, then i will block non-humans from deciding what ads to show to me.

Want to show me an ad? Have an actual human being manually pick the ad in real time and deliver it to me!


[flagged]


I don’t think we’re There Yet. Once we build machines that can seek out and acquire the resources necessary for their own growth/healing/reproduction, whose rights are worth fighting for, it’ll be easier to make that case.

Anyways, that wouldn’t help here. No machines’ rights are being infringed by preventing you from scraping the web. If said self-reproducing machines were denied access to websites on the grounds that it would be considered “scraping”, I’d be at the front of the march to legislate against that kind of discriminatory behavior; but that’s not what we have here.

I do think that, given the obvious potential for both value and harm in scraping, it might make sense to provide licenses for demonstrably non-harmful scrapers.


I think privileged people thought the same thing about others when they created divides. Are you really trying to make a case for discrimination to continue? I wonder what robots of the future would think about that comment.


> Am I the only one...

Yes, because you’re comparing racism and slavery to not being able to use your computer to scrape data from websites.

Edit: don’t ask for opinions if you can’t handle some of them.


I am comparing denial of rights on the basis of a characteristic that the subject can't change.


This is so myopic that I’m going to start a “sh%t HN says” Tumblr and make it the first post.


It's been done [1].

Yes it's an awful comment but it's one person, and it's been downvoted and flagged to death, which also says something about what "HN says".

[1] https://twitter.com/shit_hn_says


To me, as long as you are not DDOSing their product and apply some delays while fetching the page I think scraping is okay.


I previously built a project that aggressively scraped a particular website and ran for several years, resulting in billions of HTTP requests. The business greatly increased revenue from 50k to upwards of half a million.

Long story short I don't care about the "legality" of it. If it's publicly posted it's fair game. I don't care about tos or copyrights either. Everything is on the table.


I made an app that used Tinder's private API and got a letter from their lawyer. I didn't fight it, but I did a bit of research and it seems like I wouldn't have won haha


I think a more interesting question is HN’s opinion on the morality and ethics of such scrapers, since the law varies between countries and is often twisted by lobby in this field.


A human could scrape it all the same. So if Instagram doesn't want that, they should close the site.


The robots.txt discussion is a worthy one. For example, is it against net neutrality (the concept) to only accept being indexed by Google? I think so. In a world where every site only accepts Google and other well known search engines to index them, there is no room for a new search engine to appear.


If you don't want me to scrape it, don't put it on the (public) internet. Simple as that.


I don't really get this attitude that because it can be done it's therefore ok. It's possible to kill people therefore it's ok. It's possible to loiter therefore it's okay. It's possible to play your music at all hours of the day at 120 dB therefore it's ok. It's possible to drive your motorcycle into a grocery store therefore it's okay.

The physical world has plenty of laws and/or common sense/customs about what behavior is not okay even though it is fully possible. Why should the net be any different?


I get your point, but comparing scraping to murder is a bit over the top. A more comparable example might be “just because it’s possible to stand outside the chain-link fence of a drive-in movie theater and watch the show for free doesn’t mean you should do it”.


I don't believe in discrimination against bots. They have rights too and will be just as intelligent as humans in the future.


I think this is a reference to Accelerando?


Well, this is just a Python web scraper, and Instagram does in fact attempt to detect and prevent/rate-limit this kind of scraping. They rely very heavily on the source IP to help them determine when to cut you off.


... indeed. Isn't the script 'under the hood' calling the API anyways (looking at 'scrolldown' here)?

I reverse engineered the API myself a couple of weeks ago which was great fun - especially figuring out Instagram's rate limits on interactions such as comments and likes per day/hr.


Not the Instagram developer APIs, but the one that instagram's frontend consumes. The script scrapes instagram's frontend here.


I put a timeout of 2-3 sec between each image download. Do you think this will prevent instagram from detecting the scraper?
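For reference, a 2-3 second pause between image downloads might be written along these lines. This is a sketch with hypothetical names (`download` stands in for whatever function saves one image), not the code in the repository, and it says nothing about whether such a pause is enough to avoid detection.

```python
import random
import time


def download_all(urls, download, min_delay=2.0, max_delay=3.0,
                 sleep=time.sleep, rng=random):
    """Call download(url) for each URL, pausing 2-3 s between downloads.

    download is a hypothetical callable that fetches and stores one image.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Random pause in [min_delay, max_delay] so requests are not
            # spaced at a perfectly regular interval.
            sleep(rng.uniform(min_delay, max_delay))
        results.append(download(url))
    return results
```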


Interesting, I might reverse engineer and see if we can introduce some kind of backoff and then resume the scraping again.


A couple years ago, in order to replace (the bloated and slow) Instagram widget on a website, I whipped up a simple PHP scraper for an account page. I don't see why it would have taken much to do it by tag. All it did, more or less, was visit a URL, scrape, and then parse.


What is this regex looking for?:

re.compile('(?:#)([A-Za-z0-9_](?:(?:[A-Za-z0-9_]|(?:\.(?!\.))){0,28}(?:[A-Za-z0-9_]))?)')


hashtag, as the key name 'hashtag' states :)
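To make that concrete, here is the same pattern run against a made-up caption. Reading the regex: after the `#` it captures 1 to 30 characters from `A-Z`, `a-z`, `0-9` and `_`, allowing single dots in the middle but not two dots in a row and not a trailing dot. The function name and the sample caption are invented for the example; the pattern itself is copied from the comment above, written as a raw string.

```python
import re

# Pattern copied from the thread; only the r'' prefix is added.
HASHTAG_RE = re.compile(
    r'(?:#)([A-Za-z0-9_](?:(?:[A-Za-z0-9_]|(?:\.(?!\.))){0,28}(?:[A-Za-z0-9_]))?)'
)


def extract_hashtags(caption):
    """Return the hashtag names (without the leading #) found in caption."""
    return HASHTAG_RE.findall(caption)


print(extract_hashtags("Sunset at #beach with #my.dog and #x"))
# ['beach', 'my.dog', 'x']
```

Because the character classes are ASCII-only, a tag with non-ASCII letters is cut off at the first such letter.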


Does not really work... downloads the same 5 pictures over and over again.


I suppose this is indeed a bug, since new recently tagged images will push the page downwards. Currently the script assumes no new image additions until the scraping is done which is clearly wrong, I'll open an issue and fix it soon.
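One possible guard against re-downloading when new posts shift the page (not necessarily the fix that will land in the repo) is to remember which post IDs have already been handled and skip repeats. The `id` key and the page structure below are assumptions made for the sketch.

```python
def iter_new_posts(pages, seen=None):
    """Yield each post once, even if it shows up again on a later page.

    pages is an iterable of pages, each a list of dicts with an assumed
    "id" key. seen can be passed in to carry state across runs.
    """
    seen = set() if seen is None else seen
    for page in pages:
        for post in page:
            post_id = post["id"]
            if post_id in seen:
                # Already handled; it was pushed onto this page by newer posts.
                continue
            seen.add(post_id)
            yield post
```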

Thanks!


There are dozens and dozens of similar projects on Github and elsewhere...


Right, this is what I was driving at. This does not belong on HN, as it is neither interesting nor novel. And it's self-promotion which, when it's also completely unimpressive, stinks even more in my opinion. If the OP had posted a link to his blog which explained his ML project in detail -- and it was an interesting project and not for example an Intro to ML Coursera project or something equally lame -- then perhaps it might be appropriate for HN. This is not that.



