I wonder what my blog's stats would look like with all the malicious activity (spam and exploit bots) removed. My suspicion is that these amount to almost half of the visits. Anybody have promising filtering approaches?

Following a suggestion by @leip4Ier, I created a small Python script to filter the log – only RSS requests and requests by IP addresses that loaded CSS are passed through. Almost all bot activity is filtered away and I am left with around 50% of the visits on most days.
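
For the curious, here is a minimal sketch of the idea – not my actual script, and the stylesheet and feed paths are placeholders you'd adjust for your own site. It assumes a standard combined access log:

```python
import re
import sys

# Matches the start of a combined-format access log line:
# IP, ident, user, timestamp, then the request string.
LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)')

def filter_log(path):
    entries = []
    css_ips = set()
    with open(path) as log:
        for line in log:
            match = LINE.match(line)
            if not match:
                continue
            ip, url = match.group("ip"), match.group("path")
            entries.append((ip, url, line))
            # Remember every (anonymized) IP address that loaded a stylesheet.
            if ".css" in url:
                css_ips.add(ip)

    for ip, url, line in entries:
        # RSS requests pass unconditionally, everything else only if the same
        # IP address also loaded CSS at some point.
        if url.startswith("/rss.xml") or ip in css_ips:
            yield line

if __name__ == "__main__":
    sys.stdout.writelines(filter_log(sys.argv[1]))
```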

But there is far less fluctuation now. Apparently, the spike earlier this month was caused by bot activity. Some articles went way down in terms of traffic – particularly, as I already suspected, the one with "login" in the title and an ancient article about Flash.

Unsurprisingly, all planet.mozilla.org referrers went away – these were image requests only (same with some forums that embedded my images directly). 90% of the Twitter and 70% of the GitHub referrers vanished as well, however – these weren't actual clicks.

One of the remaining requests that looked like vulnerability scanning is apparently produced by the DotGit browser extension (thanks @leip4Ier for the hint). This one queries /.git/HEAD for any website visited in order to recognize websites that have their repository metadata exposed.
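
In other words, a check along these lines – a rough sketch of that kind of probe, not necessarily what the extension actually sends:

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

def git_metadata_exposed(origin, timeout=5):
    """Fetch /.git/HEAD and check whether the response looks like git metadata.
    On a checked-out repository this file normally starts with "ref: "."""
    url = origin.rstrip("/") + "/.git/HEAD"
    try:
        with urlopen(Request(url), timeout=timeout) as response:
            return response.status == 200 and response.read(64).startswith(b"ref:")
    except (URLError, OSError):
        return False

# Example: prints True only if the site serves its .git directory publicly.
print(git_metadata_exposed("https://example.com"))
```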

@WPalant bots usually don't load styles after visiting html pages, i filtered by that. but i assume those were websites with much less traffic than yours and with much less advanced bots visiting them, so it probably won't work for you..

@leip4Ier Wow, that turned out surprisingly radical – only 10% of the visits are left. 😂 It filtered out all RSS clients, so maybe if I allow these as well…

@leip4Ier Now I have roughly a third of my regular visit numbers left. But some of the bots scanning for vulnerabilities are definitely loading CSS files – I still see them in the stats.

@WPalant huh, i wonder if those are puppeteer/phantomjs. if so, it must be expensive to scan random websites using a full-fledged browser!

@leip4Ier Looked up one of those – a pretty conventional bot. Apparently, it's just a coincidence that a real user visited the site from the same subnet five days later (IP addresses are anonymized, so telling apart different addresses within the same subnet isn't possible, and it's a university network). Maybe I should put a limit on how far apart in time the requests are allowed to be – that will make the approach more complicated however…

@leip4Ier Never mind, this wasn't a regular user – a single request only. I was too quick to draw conclusions here.

@leip4Ier That bot is requesting a page every now and then (only the HTML file), and on other occasions it will also request a bunch of WordPress URLs. In a few cases there is also a request that looks like it came from a regular person – but the subnet belongs to a hosting provider. Probably the same bot using Puppeteer. After all, Chrome 71 this month? By now, this user agent appears to be used exclusively by bots…

@WPalant oh, i see.. now it'd be interesting to know its success rate, how many requests end up useful for it.

@leip4Ier On my site pretty much all of its requests end up as 404. Except when it requests my articles, but those appear to be its entry URLs, probably gathered from a search engine.

@WPalant yeah, and it keeps trying! what are they expecting? sounds like a very inefficient scheme..

@leip4Ier These bots don't care whether it's inefficient. Many of those are running on compromised systems, looking for more systems to compromise.

But it sure is a very noisy scheme, prone to detection. It certainly isn't a targeted attack – rather, it's looking for low-hanging fruit all over the web.

@leip4Ier I wonder what those "/.git/HEAD" requests are – they are being made alongside requests for other page resources but without a referrer. The user agents and IP addresses look like those of regular users. Is there a Chrome/Firefox browser extension that will produce that request during regular web surfing?

I made a tiny Python script which is a lot faster than grep and can consider timing as well. Looks good – that probing for git repositories is the only thing left.
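
Roughly the same idea as before, just with timing added – a request only counts if the same IP address loaded a stylesheet close enough in time. A sketch with a made-up ten-minute window, again assuming a combined log format and placeholder paths:

```python
import re
import sys
from datetime import datetime, timedelta

LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "\S+ (?P<path>\S+)')
WINDOW = timedelta(minutes=10)  # arbitrary choice

def filter_log(path):
    entries = []
    css_times = {}  # ip -> timestamps of stylesheet requests
    with open(path) as log:
        for line in log:
            match = LINE.match(line)
            if not match:
                continue
            when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            ip, url = match.group("ip"), match.group("path")
            entries.append((ip, url, when, line))
            if ".css" in url:
                css_times.setdefault(ip, []).append(when)

    for ip, url, when, line in entries:
        # RSS requests always pass; anything else needs a CSS download from the
        # same IP address within the time window.
        near_css = any(abs(when - t) <= WINDOW for t in css_times.get(ip, ()))
        if url.startswith("/rss.xml") or near_css:
            yield line

if __name__ == "__main__":
    sys.stdout.writelines(filter_log(sys.argv[1]))
```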

@WPalant wait, github and twitter embed images from 3rd-party domains?..

@leip4Ier No, these are mostly HEAD requests – they seem to be checks of whether a page exists and metadata retrievals, yet they have the referrer set. I didn't check whether it's Twitter and GitHub themselves performing these requests or some third parties. But they definitely aren't browsers.
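
So when counting referrer clicks, simply ignoring everything that isn't a GET request should drop most of these – roughly like this, again assuming a combined log format:

```python
import re
import sys
from collections import Counter

# Combined log format: request string, then status, size, referrer, user agent.
LINE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) \S+[^"]*" '
                  r'\d+ \S+ "(?P<referrer>[^"]*)"')

def count_referrers(path):
    referrers = Counter()
    with open(path) as log:
        for line in log:
            match = LINE.match(line)
            # Only GET requests with a non-empty referrer count as actual clicks.
            if match and match.group("method") == "GET" \
                    and match.group("referrer") not in ("", "-"):
                referrers[match.group("referrer")] += 1
    return referrers

if __name__ == "__main__":
    for referrer, count in count_referrers(sys.argv[1]).most_common(20):
        print(f"{count:6} {referrer}")
```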
