It’s understood… that EU dudes… sell GDPRization…

I’ve been recently thinking of GDPR, and its influence on the non-EU websites… in particular, I was curious how the legislation affects the user experience for non-EU sites for visitors from EU. We hear about many websites in US simply denying the access e.g. LA Times:

but I was curious how many other web sites really do so…

I came up with a quick & dirty (and pretty simple) idea of checking how the popular web sites respond to the regulation… by visiting them and taking a screenshot.

Of course, manual check would be too labor-intensive, so I automated it.

First, I needed a list of top world web sites so I downloaded the Cisco Umbrella list. I know it’s biased, but don’t know any better source (since the  free Alexa top 1M is long gone, and others – I really don’t know how accurate they are).

I then created a simple script in perl to extract the first 10000 top unique domains from the list (and exclude all subdomains on the way):

use strict;
use warnings;
my %h;
my $cnt=0;
while (<>)
{
  if (/,([^\.]+\.[^\.]+$)/)
  {
    if (!defined($h{$1}))
    {
      print "$1";
      $h{$1}++;
      $cnt++;
      exit if $cnt >= 10000;
    }
  }
}

Next, I wrote a simple phantomjs script to grab a screenshot of these domains (all accessed via http and then rerunning for https for these that didn’t work):

system = require('system')
var page = require('webpage').create();
     page.viewportSize = { width: 1024, height: 768 };
     page.clipRect = { top: 0, left: 0, width: 1024, height: 768 };
address = system.args[1];
output  = system.args[2];
page.open(address, function() {
  page.render(output);
  phantom.exit();
});

And then I ran the phantomjs on domains from this data set… each page visited is saved as a png.

To my surprise, the experiment didn’t work as I anticipated.

Most of web sites visited didn’t really make any comment on GDPR and it was business as usual. Some offer an option to accept new privacy policies. In the end I only came up with a bunch of examples.

Still, it was worth trying…

Lessons learned…

  • Some web sites detect phantom JS as a bot – they will block your IP, or offer a captcha challenge
  • Lots of top domains don’t even host a web site; you can see default IIS, Nginx pages, errors (404, 403s ;))
  • Privacy banners, if they exist, are handled in many different ways – from simple OK, to more advanced settings with a multi-choice questionnaire; I include some example below
  • Many non-English web sites provide information about privacy in their native language; this is an interesting conundrum to solve in general – how a non-speaker can use the web site w/o an ability to understand the Privacy Policies? I provide some examples in French, Italian and Dutch (and of course, English)
  • Way too many advertising and marketing web sites, all united to promise you the best monetization ever; and yes, AI-based advertising is already here 🙂

I am wondering if the methodology I used was incorrect? Perhaps it would be faster to just query google for all the web sites that refer to GDPR? I couldn’t come up with a good google dork though. And searching still brings many of such geo-locked web sites and include them in ‘normal’ results. You only learn about GDPR stuff when you try to visit the actual page. Google cache is still available though in some cases. So… I guess this transitional stage will last for some time. If you have any idea on how to run a research like this better, please let me know.

And finally some screenshots

diynetwork.com

goodrx.com

chicago tribune

Collect and gather

Everquote

Fubo.tv

Gannet

Orlando Sentinel

Pandora (not sure if it is GDPR related though)

Myspace

Ebates

European Union page itself

Atlas Obscura

At Hoc

Cosmopolitan

My recipes

piwikpro

Simpsons World

Le Monde

Meteo IT

NOS

And finally NSFW, all the screenshots related to porn.