Read articles behind paywalls by masquerading as Googlebot

TechExpert

Thu, 02/25/2016 - 23:41

The Internet is at a tipping point. The continued rise of adblocking has put an end to the revenue model that relies solely on ad dollars to operate websites and businesses.

News sites in particular have started to experiment with ways to diversify their income, and one prominent option, implemented by sites such as The Wall Street Journal, Financial Times, The New York Times and The Washington Post, is the paywall.

There are different types of paywalls, but they all block access to content, either immediately or after a certain number of articles have been read on the site.

Visitors are then asked to subscribe to the site to continue reading articles on it.

It may make sense from a business point of view, and may be more lucrative than battling it out with users who run adblockers, but there is a downside to it both for the paywalled site and the blocked user.

Sites lose a high percentage of visitors when they implement a paywall. It is unclear how high that percentage really is, and it probably varies from site to site, but it is likely much higher than the percentage of visitors who subscribe when asked to do so in order to read an article.

Masquerade your browser

It is no secret that news sites allow access to news aggregators and search engines. If you check Google News or Search for instance, you will find articles from sites with paywalls listed there.

In the past, news sites allowed access to visitors coming from major news aggregators such as Reddit, Digg or Slashdot, but that practice seems to be as good as dead nowadays.

Another trick, pasting the article title into a search engine to read the cached copy of the story there, no longer seems to work reliably either, as articles on sites with paywalls are usually not cached anymore.

User-Agent and Referrer

You are probably wondering how sites block or allow access to their content. The methods have improved over the years, and it is no longer enough to simply change the referrer of the browser to https://www.google.com/ to gain full access to a site's content.

Instead, sites use various checks that include user-agent, referrer and cookies, and sometimes even more than that, to determine the legitimacy of access.
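To make this more concrete, here is a minimal sketch in Python of what such a server-side check could look like. It is an illustration only: the function name decide_access, the cookie name "subscriber" and the rules themselves are assumptions made up for this example, not the logic of any real site.

```python
from typing import Mapping


def decide_access(headers: Mapping[str, str], cookies: Mapping[str, str]) -> str:
    """Return "full" or "paywall" for a request (hypothetical rules)."""
    # HTTP header names are case-insensitive, so normalise them first.
    normalised = {name.lower(): value for name, value in headers.items()}
    user_agent = normalised.get("user-agent", "")
    # The HTTP header is spelled "Referer" (one r in the middle).
    referer = normalised.get("referer", "")

    # Assumed rule 1: a subscriber cookie grants full access.
    if cookies.get("subscriber") == "yes":
        return "full"

    # Assumed rule 2: a crawler user agent arriving from Google is let through.
    if "Googlebot" in user_agent and referer.startswith("https://www.google.com/"):
        return "full"

    # Everyone else is shown the paywall.
    return "paywall"
```

A real site may combine more signals than these three, which is one reason why changing a single value is often not enough.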

General information

Probably the best way to masquerade the browser is to make it appear to be Googlebot.

Referrer: https://www.google.com/
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
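As a rough illustration of where these two values end up, the sketch below builds an HTTP request object with Python's standard library and attaches them as headers. Note that the header is spelled "Referer" in HTTP, even though the word is usually written "referrer". The URL news.example.com and the helper name build_request are placeholders for this example, and the request is only constructed, never sent.

```python
from urllib.request import Request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
GOOGLE_REFERER = "https://www.google.com/"


def build_request(url: str) -> Request:
    """Create (but do not send) a request carrying the two headers."""
    return Request(
        url,
        headers={
            "User-Agent": GOOGLEBOT_UA,
            "Referer": GOOGLE_REFERER,  # HTTP spelling of "referrer"
        },
    )


# Placeholder URL, used only to show the resulting headers.
request = build_request("https://news.example.com/article")
# urllib stores header names capitalised as "User-agent" and "Referer".
print(request.get_header("User-agent"))
print(request.get_header("Referer"))
```

The browser extensions described below do the same thing, only inside the browser: they replace these two header values before a page is requested.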

Firefox

Firefox users need two browser add-ons for that: the first, RefControl, to change the referrer value when visiting news sites, the second, User Agent Switcher, to change the user agent of the browser.

1. Download and install both extensions in the Firefox web browser.
2. Tap on the Alt-key, and select Tools > RefControl Options.
3. Click on "add site", enter a domain name under site, select custom action, and enter https://www.google.com/ as the referrer.
4. Repeat this for all news sites you want to access (some may not work even if you make the changes, so keep that in mind).
5. When you are done, close the configuration window.
6. Tap on the Alt-key again, and select Tools > Default User Agent > Edit User Agents from the menu.
7. Select New > User Agent, and replace the string in the User Agent field with Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). Name it Googlebot.
8. Exit the menu.
9. Before you access these sites, tap on Alt, and select Default User Agent > Googlebot.

This is all there is to it. It is a bit unfortunate that there is no extension for Firefox that changes the user agent automatically based on the sites you visit.

Google Chrome

Google Chrome users can install extensions like User Agent Switcher and Referer Control that are available for the browser to do the same.

There is however another possibility, and that is to create a custom extension which automates the process in the browser.

Instructions are provided on Elaineou. All it takes, basically, is to create a new directory on the local computer, create the two files background.js and manifest.json inside it, and copy and paste the code found on the site into the files.

You need to enable "developer mode" on chrome://extensions/, and can then select "load unpacked extension" to pick the folder you have created the two files in to load the extension in Chrome.

You may modify the list of sites it supports to add new ones.
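The extension's own code is not reproduced here. As a sketch of the general idea of a per-site list, the Python example below returns the replacement headers only for hosts that appear in a list of supported sites. It is written in Python rather than the JavaScript a Chrome extension uses, and the names SUPPORTED_SITES, OVERRIDES and headers_for, as well as the example domains, are hypothetical.

```python
from typing import Dict
from urllib.parse import urlsplit

# Placeholder domains; a real list would name the sites you want to cover.
SUPPORTED_SITES = {"news.example.com", "paper.example.org"}

OVERRIDES = {
    "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Referer": "https://www.google.com/",
}


def headers_for(url: str) -> Dict[str, str]:
    """Return the header overrides for a URL, or an empty dict if none apply."""
    host = urlsplit(url).hostname or ""
    for site in SUPPORTED_SITES:
        # Match the listed domain itself and any of its subdomains.
        if host == site or host.endswith("." + site):
            return dict(OVERRIDES)
    # Sites that are not on the list keep the browser's normal headers.
    return {}
```

Adding a site then only means adding one entry to the list, which is also why other sites keep seeing the browser's regular user agent.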

This article was first seen on ComTek's "TekBits" Technology News
