You can search for the longer phrases in the spintax and find all of the places this spam was dropped before it apparently "barfed its recipe." It's pretty fascinating - Google returns over 600,000 results:
For those uninitiated in SEO spam, the trick is to get past the spam filter, then leave a link in the name/username. You generate hundreds of thousands of backlinks to one page, which Google considers "votes" for your website. You can either send links directly to your "money site" or you can send them somewhere else and do a 301 redirect, passing the link juice.
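For illustration, the redirect hop can be just a few lines. This is a hypothetical sketch using Flask; the domain and route are made up:

    # Hypothetical throwaway domain passing "link juice" on to a money site.
    from flask import Flask, redirect

    app = Flask(__name__)

    @app.route("/")
    def hop():
        # 301 is a permanent redirect; search engines historically passed
        # most link equity through it.
        return redirect("https://money-site.example/", code=301)

    if __name__ == "__main__":
        app.run()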
Usually a spammer starts by using something like Scrapebox and a lot of proxies to find thousands of blogs or forums with open comment fields, then feeds a script like the OP into something like XRumer (http://en.wikipedia.org/wiki/XRumer). The OP's syntax is called spintax: it chooses a random option inside each set of {}s, yielding an effectively unlimited number of comment variations. You find a few hundred thousand open comment fields to post in, ride a little wave of SEO boost until Google finds you and kills your site, rinse and repeat.
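To make the mechanics concrete, here's a minimal spintax expander in Python (just a sketch; commercial tools add weighting, nesting limits, and so on):

    import random
    import re

    def spin(text):
        # Repeatedly replace the innermost {a|b|c} group with one random
        # option; resolving the innermost group first makes nesting work.
        pattern = re.compile(r"\{([^{}]*)\}")
        while True:
            m = pattern.search(text)
            if m is None:
                return text
            text = text[:m.start()] + random.choice(m.group(1).split("|")) + text[m.end():]

    print(spin("{Great|Awesome|Nice} {post|article}, I {love|enjoy} your {blog|site}!"))
    # e.g. "Awesome article, I love your site!"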
Nowadays most spam as simple as the OP's will be caught by filters or penalized by Google, but most spammers are actually really bad at what they do (as you can see here: someone forgot a bracket and dumped the entire spintax).
Tip: The result count estimate on the first results page is often off by several orders of magnitude. It's usually possible to page to the end of the results pretty quickly, at which point the number drops to something like 469:
I agree that it's annoying when journalists do it to demonstrate that a topic is important/controversial/etc. But I'm not sure there's anything wrong with it when the point being made is literally (and only) about the number of times a specific piece of text appears on the Internet.
Yeah, I'm curious about that. My guess is it reports 600,000 results for things it thinks match your query, based on how it interpreted it. But once you page past the top 10 or 20, the query seems to be interpreted more strictly, and synonyms, slight misspellings, and related pages fall off the radar.
I don't get that. I clicked your link, then the first results page, and now I'm browsing result pages 10, 15, 25, 30 for some reason and it doesn't stop.
Personalization should affect ranking, but not so much the total set of results. Domain, language & safe search prefs could. I wonder if the dedup filter can vary a lot (did the results end in "similar results omitted"?).
Google knows people rarely browse beyond the first few pages of results. The total count is a holdover from the days when it was important to show off how big your archive was, and they're basically still playing that vanity game: claim 600,000 results, nearly all of such low quality that it would be embarrassing for Google to actually show them, while still faking a large result count (but they know users never page that far, so it doesn't matter).
I'm rather inclined to believe this is an artefact of the algorithm, something like:
1) grab index for each word in the query
2) grab first x results from each index
3) cross these to filter down to actual matches
I believe step 3 is roughly as expensive as rendering the results, and nobody wants to wait for 600k results to be processed before seeing any of them. Thus they stick with some simple estimate, like max(size of found indices), or an average, or whatnot.
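A toy illustration of that hypothesis (the index layout and numbers here are invented, not how Google actually stores anything):

    def search(index, query_terms, x=1000):
        # index: dict mapping term -> sorted list of doc ids containing it
        postings = [index[t] for t in query_terms]      # step 1
        candidates = [set(p[:x]) for p in postings]     # step 2: first x of each
        matches = set.intersection(*candidates)         # step 3: the expensive part
        estimate = max(len(p) for p in postings)        # cheap headline number
        return matches, estimate

    index = {"spintax": list(range(600_000)), "recipe": list(range(900))}
    matches, estimate = search(index, ["spintax", "recipe"])
    print(estimate, len(matches))  # 600000 estimated, far fewer actual matches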
Seems like a reasonable hypothesis, but you can easily show it's wrong by searching for a single word, in which case there is no step 3.
I did a search for "hackerspace" (no quotes) and it claims "About 588,000 results." Paging through the results, I eventually got "In order to show you the most relevant results, we have omitted some entries very similar to the 390 already displayed." Even with omitted entries included, there seem to be only 832 results.
I agree. When I follow kuschku's link (starting at the 900th result), there is no next page link at the bottom, and if I try hacking the "start" value in the URL to go even one document farther I get no results and a "Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 901.)" below the search box.
Google quickly gets a set of results from some servers, and then estimates the total.
Say, for example, that their Search cluster has 10000 nodes. And suppose that the query returns 60 results from the first node itself; so it could multiply 60*10,000 and claim there may be 600,000 results. But when it is asked to actually go and fetch the results, for various reasons, it may not get to that figure; most of the nodes may just shrug and say "we got nothin'".
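In code, the guessed arithmetic is trivial (the node and hit counts are the made-up figures from above):

    def estimate_total(hits_on_first_node, node_count):
        # Extrapolate from one shard to the whole cluster: 60 hits on one
        # of 10,000 nodes -> "About 600,000 results".
        return hits_on_first_node * node_count

    print(estimate_total(60, 10_000))  # 600000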
Fascinating, thanks for the explanation. I'm curious: I thought blogs/forums/etc output user-supplied links with rel="nofollow" so google ignores them... so how is the link actually beneficial (even for a short period of time)?
Thanks for responding, but I don't understand what you're saying. The 3rd sentence is clear (although it seems absolutely crazy to me that google wouldn't ignore it since it was their idea in the first place to combat this exact kind of spam). But I am unclear what sentences 1 and 2 mean -- care to clarify?
I really wonder how effective comment spam like that really is; most platforms with user comments mark outbound links as nofollow, meaning Google and other search engines won't consider them, making them useless for dark SEO practices. They'd still drive clickthroughs for maybe 1% of viewers, though, of which another 1% might get infected by the malware on the target page - which, I guess, can be a good enough return on investment. It's all a matter of scale.
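For context, the markup most platforms emit for user-supplied links looks something like what this hypothetical helper produces:

    import html

    def render_comment_link(username, website):
        # Typical output for a user-supplied link in a comment; the
        # rel="nofollow" attribute tells crawlers not to pass ranking credit.
        return '<a href="%s" rel="nofollow">%s</a>' % (
            html.escape(website, quote=True),
            html.escape(username),
        )

    print(render_comment_link("cheap-pills", "http://spam.example/"))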
I had an employer who did this on purpose, insisting that because it had worked so well for a time, it was the best way to do SEO. We owned almost 400 domains related to the business, and just had tons of comments and backlinks pushing juice to the handful of real websites.
I complained and explained it was only a matter of time before the whole thing fell apart in an algorithm update or a blacklist, but he didn't listen. I did at least redesign the main websites in HTML5 with good metadata to mitigate the damage.
This is what happens when C-levels think it's still the dot-com era.
A new technique I've seen is that they don't link directly to their "money" site; they link to a page that links to another page, which links to the money site.
So if they manage to put a link on your blog, the next step is to make spam links to your site.
Google needs to scrap PageRank and come up with another metric. Lately I've seen a drastic decrease in quality, where 9 out of 10 pages in the top 10 are just garbage.
Even if I copy a whole sentence from a site with low PageRank, it will be hidden deep on page 4 or 5 of the Google search results.
We don't know: 1) how much of a factor PageRank still is, and 2) how PageRank is calculated. Google may very well be beating down PageRank as a factor while simultaneously adjusting how it gets calculated so these spam links matter much less.
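For reference, the classic PageRank iteration is simple enough to sketch. This is a toy version; Google's production signal is obviously far more elaborate:

    def pagerank(links, damping=0.85, iterations=50):
        # links: dict mapping each page to the list of pages it links to.
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / len(pages) for p in pages}
            for p, outs in links.items():
                targets = outs or pages  # dangling pages spread rank evenly
                for t in targets:
                    new[t] += damping * rank[p] / len(targets)
            rank = new
        return rank

    # Thousands of throwaway pages pointing at one target inflate its rank:
    web = {"money": [], "spam1": ["money"], "spam2": ["money"], "spam3": ["money"]}
    print(pagerank(web))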
My bigger gripe is how awful it is to get rid of bad links, whether from your own past indiscretions (or indiscretions you've inherited) or from negative SEO.
That's called "spintax" and the overall approach is known as article spinning[1]. There's a small industry of software tools that facilitate that sort of thing (e.g. The Best Spinner[2]). Seems to be dying out as that kind of posting is more likely to get you penalized quickly nowadays rather than produce any real results.
I don't know in depth how either one works, but I saw some similarities, and with my limited knowledge of how that name-generator format works, I figured I'd have at least a rough idea of how article spinning works.
Yep, this is how these forums work. Tools are very rare compared to other services sold, because making tools is hard, and making a tool that people want to buy is even harder.
You may see software like TheBestSpinner, XRumer, and another one whose name I can't remember - basically a WordPress comment-spam cannon - that sells thousands of copies, more than the author would make by using the tools themselves.
I'm surprised no one with an eye toward reducing the number/variety of such tools has inflicted an intentional App Store-style price race to the bottom: create such a tool, preferably a superior one, and sell it well under cost (e.g. $1.00) to make developing the others unprofitable. It might take multiple "competing" tools to pull this off, of course, and at some point they'd want to stop supplying their tools.
Of course, your $100 price might be a figure pulled out of thin air, not an accurate one.
It isn't that simple. There are other issues that the developers may not want to deal with in using their own tech. Access to proxies is a big one. Obtaining lists of sites with enough traffic/page rank for this to matter is another.
That said, these blackhat forums tend to be full of people with little money. Many users are from places like India and Pakistan, trying whatever they can to eke out an existence. Since almost none of the users on those forums have money to buy the tools (even $100 is a very high price point for products on these forums), writing tools for them is a fool's errand.
People creating this stuff need something to sell, so they might as well sell something they already know. Sort of like how marketing experts will market their own marketing teleseminars, etc.
The company I'm working for has alarms that monitor log files for "FATAL" and the like. Some of those logs also contain user input, so naming your kid something like this is another good way to annoy some ops people :D
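The failure mode is easy to picture. A sketch of such a naive monitor (the regex and log line are invented for illustration):

    import re

    FATAL = re.compile(r"\bFATAL\b")

    def should_alert(log_line):
        # Naive keyword monitor: fires on the literal token anywhere in the
        # line, including inside echoed user input.
        return FATAL.search(log_line) is not None

    print(should_alert('2015-06-09 INFO registered user "FATAL EXCEPTION"'))  # True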
The guy who published a book about XSS, which ended up causing an XSS exploit on Amazon via the "book preview" function[1], is my favorite version of that.
There was also an attempted attack on the Swedish voting system via a handwritten SQL injection, but that was unsuccessful[2].
With that said, the English is actually quite good compared to most of the spam I see (although some of the synonyms don't make sense, so were probably automatically generated).
"It is the best time to make a few plans for the future and it is time to be happy. I have learn this put up and if I may just I desire to counsel you few attention-grabbing things or tips. Maybe you could write subsequent articles regarding this article. I desire to learn even more things approximately it! I'll right away clutch your rss feed as I can not to find your email subscription hyperlink or e-newsletter service. Do you've any? Kindly permit me recognize in order that I may subscribe. Thanks."
It's a rather general problem. It has ads, requires scripting (to view text!), breaks formatting, doesn't have sensible syntax highlighting available, etc. IMHO, it's probably the worst paste service anyone could use for anything. http://ix.io and http://sprunge.us are significantly better alternatives.
But if one is dead-set on using pastebin.com, the least they could do is post the raw link so users can escape some of the horror.
This is a pretty outdated tool for cheaply generating "unique" versions of articles in order to fool search engines for "link juice". What you normally do is hire someone for $5-100[0] to write an article, then manually or with a tool generate a spintax version. During posting, the bot simply selects a random alternative from the spin list.
When Google started using machine learning to filter out machine-generated content, this method quickly lost effectiveness. However, it is more effective on less sophisticated search engines, so using it can still show some benefit.
One of the more current methods is Markov-chain-based article generation seeded with provided keywords. And, perhaps more importantly, paying people to write content.
[0] With internet proliferation this has dropped substantially.
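A minimal sketch of the Markov-chain approach mentioned above (the seed file is hypothetical; real tools also weave in target keywords):

    import random
    from collections import defaultdict

    def build_chain(text, order=2):
        # Map each run of `order` words to the words observed to follow it.
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - order):
            chain[tuple(words[i:i + order])].append(words[i + order])
        return chain

    def generate(chain, order=2, max_words=40):
        state = random.choice(list(chain))
        out = list(state)
        for _ in range(max_words):
            followers = chain.get(tuple(out[-order:]))
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    corpus = open("seed_articles.txt").read()  # hypothetical scraped source text
    print(generate(build_chain(corpus)))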
Often you'll see WordPress comments where the comment is some generic praise to make it seem legitimate and avoid deletion, and the "website" field is filled out so their username links to wherever they want.
One of my sites gets these every now and then and, given that the website field is omitted on my forum, users are usually quite bemused at the mysterious intent of the authors.
Other responses covered the basics, but there are a few other interesting uses.
- Text that doesn't clearly market anything or provide a meaningful backlink may be part of a larger link network, or simply a test to generate a list of vulnerable blogs. Those lists can be sold to others or used for your own purposes.
- Similar to some recent spam bots hitting Google Analytics referrer results, these people may not care about organic rankings. They might be marketing to the blog owner who sees the comment and investigates the username/link. If I'm spamming for SEO services, and you are a blog owner looking to increase traffic for example, the attempt may not be about ranking for SEO terms (a losing battle), but simply getting you to look at my site.
It's pretty much useless now that Google actually penalises people for this - pretty funny that brands who used to do this now have to beg for their spam to be removed.
Basically, they would spin these comments out to avoid looking overly spammy, and list the website they were trying to promote in the name/website fields, since people used to believe that having millions of shitty backlinks was good SEO.
It seems pretty counterproductive for Google to penalize the linked-to website for this, because they don't necessarily have any control over the links. This means that now a good strategy would be to make a bunch of spam links to your competitors' websites and have their rankings go down through no fault of their own. While that would be less effective than having your own go up, it's still useful.
What Google should do is make the links have no effect at all, thus preventing this abuse.
This is tricky. Generally the spammers are doing it because they're paid to do it by someone. Penalizing is the right thing to do. But you raise an interesting point - perhaps an unscrupulous firm would unleash a spambot pointing to their competition, to damage them. I'm not sure of a good way for Google to differentiate, other than your suggestion of ignoring.
Thank you for allowing me to talk myself into a circle. :-)
A lot of these guys are all about A/B split testing. If they find something that hurts rather than helps, they'll use that tactic as an offensive on their competition.
I'm curious to see when spam bots will start using sophisticated language models to generate text, aimed at defeating a search engine's own machine-learning spam detection. Setting aside the likely scenario where Google's resources are larger and its employees more capable than the spammers', it would be interesting to watch spam bots generate more and more convincing content in an AI arms race.
You can use a context-free grammar to produce amusing and sometimes convincing writing if the writing style it's intended to imitate is sufficiently obtuse, e.g.:
(I once ported it to Python and wrote a parser for a more-convenient syntax like the OP's, then never released it partly because I'd hate to see spammers using it.)
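For a flavor of the approach, here's a minimal sketch in Python; the grammar and vocabulary are invented for illustration:

    import random

    # Toy grammar; the production rules and word lists are made up.
    GRAMMAR = {
        "S":   [["NP", "VP"]],
        "NP":  [["the", "ADJ", "N"], ["the", "N"]],
        "VP":  [["V", "NP"], ["V", "that", "S"]],
        "ADJ": [["hermeneutic"], ["post-structural"], ["recursive"]],
        "N":   [["discourse"], ["paradigm"], ["signifier"]],
        "V":   [["problematizes"], ["destabilizes"], ["reifies"]],
    }

    def expand(symbol):
        if symbol not in GRAMMAR:
            return symbol  # terminal word
        return " ".join(expand(s) for s in random.choice(GRAMMAR[symbol]))

    print(expand("S"))
    # e.g. "the recursive paradigm destabilizes that the signifier reifies the discourse"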
https://www.google.com/?gws_rd=ssl#q=%22time+to+make+some+pl...
Looking at this link (http://prophesyagain.org/radio/#comment-98), it looks like this spam was left for a URL that redirects to http://www.itunescoms.com/, a fake looking iTunes knockoff that probably drops all kinds of nasty adware/malware on your PC.