In Soviet Russia, ad finds you

This is a post I wrote for work to explain the basics and foundations of ad tracking. It was meant for a fairly nontechnical audience, so there are probably some oversimplifications. There are also possibly some mistakes because I'm just looking at ads from the outside - I've never really dealt with them professionally. Just, like, every day on the internet.


So there you are, hanging out, talking about your favorite cleaning product. Next thing you know, you see an ad for Tide on Instagram. What happened? Is your phone listening to you? Or is there something deeper and more complex going on behind everything? Hopefully this article will shed some light, though maybe not on the conspiracy you thought. I'll try to keep it not very technical.

You know what happens to nosy fellows?

Cookies

BUT FIRST: cookies! You need to know about cookies. Remember back in the '90s when you thought they sounded cute? They're the source of everything, but they’re just tiny bits of text that a website stores in your browser - little files, really. They're pretty helpful - they keep us logged in, they save things that we add to a cart, they save our Dark Mode preferences, and our darkest secrets.
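If you want to see just how boring a cookie is up close, here is a minimal Python sketch. The cookie names and values are made up for illustration (they are not real Carhartt cookies). It reads a few `Set-Cookie` lines the way a browser roughly would, then builds the `Cookie` line that gets sent back on the next visit.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie values a shop might send; names and values are made up.
set_cookie_headers = [
    "session_id=abc123; Path=/; HttpOnly",
    "theme=dark; Max-Age=31536000; Path=/",
    "cart=chore-coat-x1; Path=/",
]

def parse_cookies(headers):
    """Collect cookie names and values from a list of Set-Cookie strings."""
    jar = {}
    for header in headers:
        cookie = SimpleCookie()
        cookie.load(header)
        for name, morsel in cookie.items():
            jar[name] = morsel.value
    return jar

jar = parse_cookies(set_cookie_headers)

# Roughly what the browser sends back on the next request to the same site.
cookie_header = "; ".join(f"{name}={value}" for name, value in jar.items())
print(cookie_header)
```

That really is all there is to it: a name, a value, and a few instructions about how long to keep it.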

You can see them if you open up your Chrome dev tools and go to the Application tab like below. This is just a sample of the cookies we load on carhartt.com.

[Screenshot: Devtools cookies]

Looks pretty boring, right? Well, look at the Domain column there...see anything funny? Facebook? Yahoo? doubleclick.net? channeladvisor.com? Those aren't Carhartt at all. They are, in fact, third-party cookies. How did they get added to carhartt.com?
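To make "third-party" a bit more concrete, here is a rough Python sketch that sorts cookie domains by whether they belong to the page you are actually on. The exact domain strings are assumptions based on the names above, and the "last two labels" trick is a simplification: real browsers use the Public Suffix List to decide what counts as the same site.

```python
def site_of(host):
    """Naive 'site': the last two labels of a hostname.

    Real browsers use the Public Suffix List; this shortcut breaks on
    domains like example.co.uk and is only meant as an illustration.
    """
    return ".".join(host.lstrip(".").lower().split(".")[-2:])

def is_third_party(cookie_domain, page_host):
    """True when the cookie belongs to a different site than the page."""
    return site_of(cookie_domain) != site_of(page_host)

page_host = "www.carhartt.com"

# Assumed domain strings, based on the names in the Domain column.
cookie_domains = [
    ".carhartt.com",
    ".facebook.com",
    ".yahoo.com",
    ".doubleclick.net",
    ".channeladvisor.com",
]

for domain in cookie_domains:
    label = "third-party" if is_third_party(domain, page_host) else "first-party"
    print(f"{domain:<22} {label}")
```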

HTTP requests!

Sorry, did I say this wouldn't get technical? You also need to know about HTTP requests. You simply must. HTTP is what the web is built on. It's how you view webpages and look at good dogs all day. Whenever you go to a website, your browser asks that website for the page's code (that's called an HTTP request), and the web server keeps a record of all of the requests. That record can look something like this:

[Screenshot: Access log]

It's also important to note that everything else that a website pulls in goes through the same process. When you think of everything else that might be pulled in as part of a website, think: images, videos, code to make the page fancy and good-looking, code for analytics, etc. These are called "resources" in the biz. So up there, you might see requests for images (there's a line for GET /apache_pb.gif).
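For the curious, here is a small Python sketch of how a server-side tool might pull one of those log lines apart. The sample line is a stand-in in the widely used Common Log Format, not a copy of the log pictured above, and the field names are my own labels.

```python
import re

# Common Log Format: ip, identity, user, [time], "request", status, size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Illustrative sample line (an assumption, not taken from a real server).
sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

entry = parse_log_line(sample)
print(entry["ip"], entry["method"], entry["path"], entry["status"])
```

Every image, script, and page you load leaves a line like this on somebody's server.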

One of the cool parts of the internet is that people are able to kind of borrow other people's images. When they do that, a request is sent out to the site that "owns" the image ("hosts" would be the technical word). When you load a page where a picture is being pulled in like that, what you don't see is that the "owner" of the picture gets a tiny bit of information about you. For instance, if you're on carhartt.com, and we're loading an image from Scene7, then Scene7 will know exactly which page that image is being loaded on. Or if I click through to a page from Google, the site I land on will know that, too. It's called the "Referer" [sic]. Along with that information, there are also some cookies that are being passed back and forth.

Here's what Chrome is sending to a website after I google "What are my http headers" and click on one of the links:

[Screenshot: HTTP headers]

The referer is telling that website that I came from Google. The User-Agent is also interesting. You can check it out yourself here. Note that these headers are easily spoofed, so they're not necessarily the most reliable way of collecting information.
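How easily spoofed? Here is a minimal Python sketch that builds a request (without actually sending it) and claims whatever Referer and User-Agent it likes. The URL and the browser name are made up for illustration.

```python
from urllib.request import Request

def build_request(url, referer, user_agent):
    """Build (but do not send) a request with hand-picked headers."""
    return Request(url, headers={"Referer": referer, "User-Agent": user_agent})

# Hypothetical values: nothing stops a client from claiming any of this.
req = build_request(
    "https://example.com/some-page",
    referer="https://www.google.com/",
    user_agent="TotallyARealBrowser/1.0",
)

# urllib stores header names with only the first letter capitalized.
print(req.get_header("Referer"))
print(req.get_header("User-agent"))
```

The server on the other end only sees what the client chooses to tell it, which is why headers alone make for shaky evidence.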

Bringing it all together

Probably not super exciting so far, but this is where everything finally starts to tie together. Cookies and HTTP are important to know about because most websites load things from all over, which is why privacy has become such a big concern. Remember that sample of cookies I showed above? Carhartt.com is loading things from all of those sites and more. And when you load a resource, any cookies for that site are sent along with the rest of the information.

Let’s see how that applies to Facebook and Google. Not that they are the only examples - adtech is an area with a whole lot of different businesses involved. But they are two that everyone knows, and that probably have the most complex operations.

Facebook

Let's say 69% of Americans have a Facebook account. Let's say you logged into Facebook last night and were cruising your news feed. Imagine that "like" buttons are still popular (they don't need "like" buttons to get this info anymore). And also imagine that Carhartt has a "like" button on our product pages.

That "like" button is really just a snippet of code that Facebook tells the website people to copy-paste into the page source. In that code, there is an image that's loaded from facebook.com. Say you're looking at our classic Chore Coat, and that there is a "like" button underneath. Just by going to that page, Facebook knows that you're considering purchasing that chore coat. And they know that it might be ill-fitting what with your philosophy and computer science degree and that she never did love you after all...just for example...

How do they know all that? Well - remember cookies? You've been to facebook.com, so there is a cookie set for that site. The cookie is essentially the key to your profile - it’s how you can close your tab and then come back to it later without having to sign in again. Facebook has a lot of your personal information - like a list of your fav post-rock bands from 2005. You are loading their image, so you are sending them the referer (the Carhartt Chore Coat page) and your profile (via the cookie). Now Facebook has a network of sites that you've visited and products that you've viewed and can tie that right to your identity.
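Here is a toy Python sketch of that bookkeeping. It is my own simplification and certainly not how Facebook's systems are actually built: the `Tracker` class, the cookie ID, and the URLs are all hypothetical. The point is only that a cookie plus a referer is enough to start a browsing history with a name on it.

```python
from collections import defaultdict

class Tracker:
    """Toy model of a third party that sees a cookie and a referer per request."""

    def __init__(self):
        self.profiles = {}                 # cookie ID -> who logged in with it
        self.history = defaultdict(list)   # cookie ID -> pages that loaded our image

    def login(self, cookie_id, name):
        """The user signs in on the tracker's own site, tying the cookie to a name."""
        self.profiles[cookie_id] = name

    def handle_request(self, cookie_id, referer):
        """Some other site loads our image; the browser sends cookie and referer."""
        self.history[cookie_id].append(referer)

    def report(self, cookie_id):
        """Who is this, and where have we seen them?"""
        name = self.profiles.get(cookie_id, "unknown visitor")
        return name, list(self.history[cookie_id])

tracker = Tracker()
tracker.login("cookie-42", "You")

# Hypothetical pages that each embed the tracker's "like" button image.
tracker.handle_request("cookie-42", "https://www.carhartt.com/hypothetical-chore-coat-page")
tracker.handle_request("cookie-42", "https://example.com/post-rock-reunion-tour")

name, pages = tracker.report("cookie-42")
print(name, pages)
```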

Now imagine that you also have a mobile device. Imagine it's smart and that you visit websites with that device. Including Facebook. Now Facebook can tie different devices to your profile and can probably infer which are for work. And they definitely know your location (google "device graphing").

Now imagine that you have friends. Imagine that those friends, too, have mobile devices. Imagine that they also visit Facebook and give them their location without necessarily realizing it. Imagine that you hang out with those friends. Well, Facebook already knew that - they have your friend network and everyone's locations.

Google + other adtech companies

My guess is that all ad networks kind of work the same way, but with less data than Google or Facebook (so Google and Facebook can argue that their ads are better and sell them for more $$).

When you sign your website up for an ad network, let's say Google's specifically, they tell you to put a script on your site. A script can run any code it wants once it's loaded, so it's hard to say exactly what is in that script. But at a minimum, you'll be sharing the same information you're giving Facebook through those "like" buttons. Google is able to build up a network of sites that you've visited, so they know what to show you, or can guess, or just show something generic or profitable. They also know what you click on, and probably what your mouse hovers over, or if you've stopped scrolling so you can read the ad.

Not to get conspiratorial, but...

Remember Google+? My guess would be that they created that social network so they could tie ads directly to your profile. And it kind of worked. I'm always signed into Google when I'm searching, so they can tie everything right back to me.

Remember Chrome? Boy is it a popular browser. And they made it super easy to stay signed into Google and sync between devices.

Do you ever use Google Maps or Waze?

Have an Android phone?

Anyways.

Beyond cookies

I mentioned above that Facebook no longer needs “Like” buttons to track your browsing habits. Now that Facebook has proved itself to be a very valuable advertising tool, people will happily let them take whatever information they want (see: Facebook pixel). Facebook isn’t the only company that does this, of course - they are just a stand-in for any other company that likes data. Google and Amazon definitely do it. I’m sure Adobe does. Even educated fleas do it. Any third-party script that someone adds to their page can take whatever information it wants from the user.

What is a script? A script in this context is a chunk of code that people add to their site, and it runs whenever the page is loaded. Take Google Analytics - they give you a snippet of code, which in turn loads some of their code, and gives you the user journeys on your site. There are even some websites that use their users to mine cryptocurrency (not fantastic for battery life). That code is usually “minified” (squashed down to save space, which also makes it hard to read) and sometimes deliberately obfuscated, so again, it’s hard to say exactly what each script does. You mostly have to trust the organization whose code you’re pulling in.

Now that all of these websites have scripts from Facebook and Google added willy-nilly, and since those are two of the major online advertisers (and gatekeepers for your online identity), there is less of a need for third-party cookies. Which is convenient, because the world is actually moving away from third-party cookies as a means of tracking (Firefox and Safari already restrict them by default + Chrome has said it plans to do the same).

I don't really want to get into mobile apps here, but assume that they are, in some ways, even more permissive than scripts. Apple won't let apps do any sneaky bitcoin mining, but the apps are often given access to your location, photos, contacts, etc.

Forget it, Jake. It's Chinatown

Targeted advertising

The companies have all of your data - now let’s look at what they do with it:

[Image: Credit card ad]

Do you see that, or have you unconsciously blocked it?

For all of the data they collect, what you see is still just some banner ad. Maybe it’s for a brand that’s a bit more relevant. There’s that cliche: “The greatest minds of our time are thinking about how to make people click ads”, which...yeah, probably. The relatively new Amazon ad team started bringing in $1 billion in like a year. That’s because with all of the knowledge that they have of consumers, these companies can charge a lot of money - up to 500% more - for targeted ads vs. untargeted ads. That is, ads that show you something tailored to your personal profile vs. ads that show you something less bespoke. That is, ads that track you vs. ads that don’t. There are reports that targeted advertising isn’t worth the extra money, so we’ll see how that shakes out, but people generally seem to think it is. Contextual ads (ads matched to the page you're reading rather than to you) are one of the alternatives.

There are other applications of your data, too, like training facial recognition algorithms on all of the labelled pictures that have been posted to Facebook and Instagram, or creating recommendation engines (think Netflix or Amazon's suggestions).

Privacy

The reason that a lot of sites join ad networks or bring in third-party scripts from adtech companies is simple - they want to make money. They want each person that reads their blog to give them a fraction of a penny so maybe they can dream of paying their bills with the proceeds.

When businesses do it, it’s typically because they want to sell more of their product. Large parts of Facebook, etc. are dedicated to proving to advertisers that their ad spend is generating large returns (how honest they are is up for debate). So companies put the scripts on their sites.

User privacy usually isn’t part of the equation, or if it is, well, they have to drive traffic, convert users, and sell things. I mean, imagine if a company didn’t do that - it would be crazy - just completely irresponsible. Who knows - they might go out of business because of it. And, what, big tech knows that the user is looking at well-made, hardworking apparel? That’s hardly the end of the world.

And, speaking as a web developer myself, that stuff is hard to avoid adding. You need to know your traffic and the users on your site, and there isn’t time to build something new. So if Google has a tool I can add that’s free or cheap, and that has been tested by essentially the entire internet, I’m probably going to go with that. Repeat that process for other foundational tools.

The problem is that the number of sites with those tools adds up, creating a huge network, with the effect that big tech knows about 90% of the sites you visit, and can track you everywhere you go (via phone location data). How much of what you think about is reflected in search terms or online research? Messages to friends? Photos and online posts? That's a whole lot of your physical, mental, and emotional state that is being tracked by these companies. And oftentimes sold, directly or indirectly.

How can you guard against that? Here is a guide. Or, off the top of my head, you can block third-party cookies, for one. Turn off location sharing in your apps + exit them when you aren’t using them. And/or opt for their websites instead. Firefox also has better privacy defaults than Chrome. Delete Facebook? Or you can use Tor and never sign into social networks. Eh - better just use that guide.

There are also some interesting ideas where you generate thousands of online profiles with your name and a bunch of different interests so that the real you is obfuscated.

Conclusion

So that's how, once you go to a company's website, you start seeing their logo on top of half the internet. It began with cookies and HTTP and has since evolved. I hope that answered all of your- what's that? The Instagram ad for Tide? Oh, no, Facebook was definitely listening to you.

Etc.

Yeah, pretty heavy on The Verge...