Removing Google Analytics Spam (Part 2/2)

5 Steps to Dealing With Google Analytics Spam

In our last post, we gave you an introduction to Google Analytics spam and an overview of the two main types of spam – ghost spam and crawler spam. (If you haven’t already, it may be worthwhile reading that post before going any further here.) Now that we understand how this spam is finding its way into our Google Analytics reports, we’re in a position to block it, or at least filter it out in our reporting views. Here are five steps you can follow to deal with all the types of spam discussed in our last post:

1. Property and Views Configuration

Before getting into setting up filters to exclude spam, you should ensure that you have set up the following 3 views for each of your Google Analytics properties:

Raw View

This view should be created as soon as you set up your Google Analytics property and should have no configuration, i.e. no filters applied and no goals set. This view is essentially a backup in case anything goes wrong in your other views.

Test View

This is the view where you should test all configurations before moving them to your Main View. Before applying goals, filters or any other configuration changes to your Main View, you should test them here first.

Main View

This is the main view that you should use for analysis and reporting. This is where all filters are applied, goals and custom reports are set up, and site search is enabled after being tested in the Test View.

2. Exclude All Hits From Known Bots

Google has taken measures to reduce the amount of spam making its way to Google Analytics reports and has included an option to exclude all hits from known bots and spiders under view settings.

To enable this feature go to your Google Analytics admin panel and select your reporting view. Then select “View Settings”, scroll down to “Bot Filtering” and check the box. Then click save.

3. Ghost Spam Filter

As I mentioned, ghost spam will always leave either a fake hostname or no hostname (not set) in Google Analytics. Legitimate traffic to your site, on the other hand, will always use a real hostname. This is usually the domain of your website but, depending on the configuration of your Google Analytics, there may be others, e.g. if you use cross domain tracking.

So to stop this type of spam showing in our reports, we need to create a filter (in our main reporting view) which excludes data that references any hostname except for those that we know are associated with our website.

Before you set up the filter, you will want to define the list of valid hostnames of visits to your website.

To find a list of valid hostnames go to Audience > Technology > Network report and select Hostname as the primary dimension. Set a significant time period – 6+ months if you have that much data to ensure you gather info on all hostnames.

Here’s what that might look like for our site:

[Screenshot: Hostname report used to build the ghost spam filter]

So let’s identify the valid hostnames for our website. Obviously, all of the following are valid:

  • arekibo.com
  • blog.arekibo.com
  • www.arekibo.com (usually this gets rewritten to arekibo.com but obviously there have been a few cases where this did not go as planned – nothing much to worry about)

There are also a few other hostnames which, whilst not the domains of any of our sites, are valid. These are:

  • webcache.googleusercontent.com – this is the hostname which is used when someone views one of our web pages from the Google cache.
  • translate.googleusercontent.com – this is the hostname which is used when someone views one of our web pages through Google’s translate service.

Note: There are some other hostnames which may be valid for your site – e.g. other translate or caching services, shopping carts, other subdomains, payment services and IP addresses.

What about google.com and google.org? Don’t be fooled by these hostname entries that may not look spammy – spammers often use well known URLs to mislead people.

So by the power of deduction, we end up with a valid hostname list of:

  • arekibo.com
  • blog.arekibo.com
  • www.arekibo.com
  • webcache.googleusercontent.com
  • translate.googleusercontent.com

Steps to create our filter:

A. Go to the Admin tab in Google Analytics and select the view where you want to filter out ghost spam (main reporting view or test view)

B. Select “Filters” and “+Add Filter”

C. Enter a name for your filter e.g. “Valid Hostname”

D. Select “Custom” as your Filter Type

E. Select “Include” and select “Hostname” from the Filter Field dropdown

F. In the “Filter Pattern” box we want to enter a REGEX expression to include all the valid hostnames identified earlier. To do so we simply type each hostname separated by a pipe bar (|) with no spaces. (If you’d like to read more about REGEX for Google Analytics, this guide by Robbin Steif is a great place to start.)

Our expression will be: arekibo.com|blog.arekibo.com|www.arekibo.com|webcache.googleusercontent.com|translate.googleusercontent.com

G. Save your filter.

Note: If you add your Google Analytics tracking ID to any other domain, subdomain or service, you will need to update this filter accordingly.
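If you’d like to sanity-check a hostname pattern before saving the filter, here is a minimal Python sketch. It is only an illustration, not how Google Analytics itself evaluates filters: the function name is_included is made up for this example, the sample hostnames in the loop are hypothetical, and it assumes the filter pattern may match anywhere in the hostname (modelled here with re.search).

```python
import re

# Valid hostnames taken from the list in this post.
VALID_HOSTNAMES = [
    "arekibo.com",
    "blog.arekibo.com",
    "www.arekibo.com",
    "webcache.googleusercontent.com",
    "translate.googleusercontent.com",
]

# The filter pattern as written in step F: hostnames joined by "|", no spaces.
FILTER_PATTERN = "|".join(VALID_HOSTNAMES)


def is_included(hostname, pattern=FILTER_PATTERN):
    """Return True if an Include filter with this pattern would keep the hit.

    Assumption: the pattern is allowed to match anywhere in the hostname,
    so re.search is used rather than re.fullmatch.
    """
    return re.search(pattern, hostname) is not None


# Hypothetical hostnames, as they might appear in the Hostname report.
for host in ["www.arekibo.com", "blog.arekibo.com", "(not set)", "google.com"]:
    print(host, "->", "kept" if is_included(host) else "filtered out")
```

Under that partial-match assumption, "arekibo.com" on its own would already cover the blog and www subdomains; listing each hostname explicitly, as we do above, simply makes the filter easier to read and maintain.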

4. Crawler Spam Filter

Since crawler spam uses a valid hostname, we need to create a separate filter to exclude this type of spam. We will again use a REGEX expression but this time instead of filtering Hostnames, we will filter Sources.

Identifying all sources of crawler spam is trickier than identifying fake hostnames. Here are a couple of options:

A. One option is to go to your Acquisition > All Traffic > Source/Medium report and filter by “Hostname – exactly matches – yoursite.com” and “Avg. Session Duration – less than – 1”.

This will show you traffic to your site that is not ghost spam, including crawler spam, with an average session duration of less than 1 second. That is a key indication that the traffic is spam: crawlers don’t generally hang around!

So you can export this list, clean it (i.e. remove any traffic that is unlikely to be spam) and use this as a basis for your filter expression. In the screenshot below, all the listed sources are spam except for tpc.googlesyndication.com (traffic from DoubleClick / Google Display Network).

[Screenshot: Source/Medium report filtered by hostname and average session duration]

So our list of main offenders for crawler spam is:

  • keywords-monitoring-your-success.com
  • timer4web.com
  • top1-seo-service.com
  • free-video-tool.com
  • fix-website-errors.com
  • keywords-monitoring-success.com

B. Alternatively, you can find a list of common crawler spam sources online.
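The export-and-clean step in option A can also be sketched in Python. Everything below is illustrative: the rows are hypothetical sample data shaped like an exported Source/Medium report (source, hostname, average session duration in seconds), and crawler_candidates and KNOWN_GOOD are names invented for this example. The result is only a shortlist to review by hand, not a definitive spam list.

```python
# Hypothetical rows shaped like an exported report:
# (source, hostname, average session duration in seconds).
rows = [
    ("timer4web.com", "arekibo.com", 0.0),
    ("tpc.googlesyndication.com", "arekibo.com", 0.0),
    ("google", "arekibo.com", 95.0),
    ("free-video-tool.com", "arekibo.com", 0.0),
    ("some-ghost-referrer.example", "(not set)", 0.0),
]

# Sources reviewed by hand and judged legitimate despite very short sessions.
KNOWN_GOOD = {"tpc.googlesyndication.com"}


def crawler_candidates(rows, hostname, max_duration=1.0, known_good=frozenset()):
    """Return sources seen on `hostname` with an average session under max_duration."""
    return sorted(
        {
            source
            for source, host, duration in rows
            if host == hostname
            and duration < max_duration
            and source not in known_good
        }
    )


candidates = crawler_candidates(rows, "arekibo.com", known_good=KNOWN_GOOD)
print(candidates)
```

The hostname check mirrors the “Hostname – exactly matches” report filter, so rows with a fake or missing hostname (ghost spam) are left to the Ghost Spam Filter from step 3.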

Steps to create our filter:

A. Go to the Admin tab in Google Analytics and select the view where you want to filter out crawler spam (main reporting view or test view)

B. Select “Filters” and “+Add Filter”

C. Enter a name for your filter e.g. “Crawler Spam Exclusion”

D. Select “Custom” as your Filter Type

E. Select “Exclude” and select “Source” from the Filter Field dropdown

F. In the “Filter Pattern” box we want to enter a REGEX expression to exclude all the spammy sources that we have identified. To do so we simply type spammy sources separated by a pipe bar (|) with no spaces.

Our expression, based on the spammy sources identified within the Arekibo account earlier, will be: keywords-monitoring-your-success\.com|timer4web\.com|top1-seo-service\.com|free-video-tool\.com|fix-website-errors\.com|keywords-monitoring-success\.com

(Note: the reason I’ve added a backslash (\) before the “.com” in each entry is to indicate that the “.” should be treated as a plain text character and not a REGEX character)

G. Save your filter.
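If your list of spammy sources is long, typing the expression by hand gets error-prone. Here is a small Python sketch that builds the same kind of expression from a list and checks it against a few sources. The helper names build_exclude_pattern and is_excluded are made up for this example, and the check assumes the pattern may match anywhere in the Source field (modelled with re.search); it is not a reproduction of how Google Analytics applies filters.

```python
import re

# Spam sources listed earlier in this post.
SPAM_SOURCES = [
    "keywords-monitoring-your-success.com",
    "timer4web.com",
    "top1-seo-service.com",
    "free-video-tool.com",
    "fix-website-errors.com",
    "keywords-monitoring-success.com",
]


def build_exclude_pattern(sources):
    """Join sources with '|' and put a backslash before each literal dot."""
    return "|".join(source.replace(".", r"\.") for source in sources)


def is_excluded(source, pattern):
    """Return True if an Exclude filter with this pattern would drop the hit.

    Assumption: the pattern is allowed to match anywhere in the source.
    """
    return re.search(pattern, source) is not None


pattern = build_exclude_pattern(SPAM_SOURCES)
print(pattern)

# tpc.googlesyndication.com is the legitimate source mentioned above.
for source in ["timer4web.com", "tpc.googlesyndication.com"]:
    print(source, "->", "excluded" if is_excluded(source, pattern) else "kept")
```

The dots are escaped with a plain string replace rather than Python’s re.escape, because re.escape would also put a backslash before each hyphen and the result would no longer read like the expression above.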

5. Language Spam Filter

Your Ghost Spam filter should exclude most language spam but just in case, you can set up a final filter to exclude any that slips through the net.

The logic behind this filter: most legitimate language settings sent by browsers are 5-6 characters long, and traffic with 8-9 characters in this field is rare. There are also characters which cannot be used in a legitimate language value, but which spammers use to create fake domain names (e.g. “secret,google,com”).

Therefore we will create a regular expression which excludes all traffic with a non-standard language value (standard values look like en-us, en-uk or es), i.e. where the language dimension contains any of these invalid characters or where it contains 15 or more characters.

Our expression will be: \s[^\s]*\s|.{15,}|\.|,
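To see which language values an expression like this would catch, here is a short Python sketch. It is an illustration only: is_language_spam is a name invented for this example, the sample values are hypothetical, and the middle character class is written as [^\s] (any non-whitespace character), so the first alternative reads “whitespace, a run of non-whitespace, whitespace”.

```python
import re

# Four alternatives: two whitespace characters with only non-whitespace
# between them, 15 or more characters, a literal dot, or a comma.
LANGUAGE_SPAM_PATTERN = r"\s[^\s]*\s|.{15,}|\.|,"


def is_language_spam(language):
    """Return True if the language value matches the exclusion pattern."""
    return re.search(LANGUAGE_SPAM_PATTERN, language) is not None


# Hypothetical values for the Language dimension.
samples = ["en-us", "es", "secret,google,com", "visit some site today", "example.com"]
for value in samples:
    print(repr(value), "->", "spam" if is_language_spam(value) else "ok")
```

Short, standard codes such as en-us pass through, while values containing dots, commas, several words or 15+ characters are flagged.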

Conclusion

Google Analytics is an indispensable tool that allows organisations of all shapes and sizes to understand their website traffic, users and behaviour. Unfortunately, it is not simply a “plug in and play” solution that needs no configuration. One critical aspect of this configuration, to ensure that you receive reliable and trustworthy data, is to apply filters that exclude as much Google Analytics spam from your reports as possible.

By following the steps outlined in this post, you will ensure that your Google Analytics reporting data will not be skewed by any of the most common types of spam.

We hope that these last two blog posts have given you a solid understanding of Google Analytics spam and how to remove this spam from your reports going forward.

If you have any questions, comments or suggestions, feel free to get in touch with us at enquiry@arekibo.com.

And keep an eye out for our next post on our social media channels: Twitter & LinkedIn