Real-Time Bidding transparency via Ads.txt
Web privacy and transparency engineering can work for both businesses and users. Sometimes a technology opens up opportunities for enhancing transparency even without a clear intention of doing so.
In this post, I analyze an extension to OpenRTB (the Real-Time Bidding specification) that is meant to reduce fraud in programmatic advertising. As a side effect, the specification also boosts transparency for users.
Real-Time Bidding (RTB) is the system fuelling a multi-billion dollar programmatic advertising industry. In this paradigm, bidders compete for impressions in online auctions. Imagine tens or hundreds of bidders attempting to buy the opportunity to direct a message (such as an ad) to the user. I have been interested in these systems for a while now, especially in their security and privacy aspects. In the past I’ve done some work in the domain (e.g. Privacy & value analysis, flows, transparency). Back in 2013 I was analyzing how RTB auctions can be abused and how messages of a political nature may be channeled via these systems. These days we call it either political PR or, sometimes, FakeNews. It’s actually a day-to-day reality.
Another aspect of my consideration was fraud.
To keep it short: Real-Time Bidding involves auctions that sell users’ data. These systems are also routinely misused to serve malware (malvertising). Additionally, fraudsters can exploit the system to make money through fraudulent activity. Think click fraud: fake clicks on ads. Fraud is a real issue, and the expected loss is estimated to reach as much as $16 billion.
But in the RTB world, fraud can be even more clever. In an RTB spoofing attack, attackers claim they’re selling ads on valuable sites, for example the New York Times, while in fact they only offer space on a pretty much unknown website. In other words, publishers such as the NYT are discovering that someone (for simplicity, a rogue Ad Exchange) is offering ads on their sites, though in reality this is not the case. Spoofing fraud is possible because RTB has no notion of trust and no way to verify whether the offered ad space is really on sale. There are many other abuse methods, but this particular one might soon be solved.
The Interactive Advertising Bureau (IAB) introduced a specification that allows publishers (such as the New York Times or CNN) to indicate which parties can offer ad space on their sites. The standard is called Ads.txt (Authorized Digital Sellers) and closely resembles the well-known robots.txt. But Ads.txt is strictly about programmatic advertising. The specification itself can be found here.
The contributors listed in the specification often come from key industry players. It’s a good question why the spec is designed in a way that makes it seemingly difficult to use, develop, or extend. And it will apparently be extended to support subdomains, mobile apps, and so on.
The specification is very new and is not yet widely adopted, but you can see it, for example, on the Washington Post site:
indexexchange.com, 183960, DIRECT
adtech.com, 10316, DIRECT
aolcloud.net, 10316, DIRECT
appnexus.com, 7466, DIRECT
google.com, pub-3980300725513096, DIRECT
c.amazon-adsystem.com, 3041, DIRECT
openx.com, 537108359, DIRECT
openx.com, 537143344, RESELLER
teads.tv, 12293, DIRECT, 15a9c44f6d26cbe1
teads.tv, 10794, DIRECT, 15a9c44f6d26cbe1
google.com, pub-1995032544933848, DIRECT
I’ll simplify the description here. It’s a comma-separated file in which each line gives the domain of an ad exchange (the party that holds RTB auctions), the account ID (e.g. the ID of the New York Times in the AppNexus or DoubleClick ad exchanges), and the relationship. DIRECT means the ad space is sold directly; RESELLER means the party may resell it onward.
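The record layout described above can be sketched as a small parser. This is a minimal sketch, not a reference implementation: the names AdsTxtRecord and parse_ads_txt are my own, the handling of "#" comments is an assumption, and the optional fourth value (visible in the teads.tv lines of the example) is simply kept verbatim without interpretation.

```python
from typing import List, NamedTuple, Optional


class AdsTxtRecord(NamedTuple):
    domain: str           # domain of the ad exchange
    account_id: str       # publisher's account ID at that exchange
    relationship: str     # "DIRECT" or "RESELLER"
    extra: Optional[str]  # optional fourth value, kept verbatim


def parse_ads_txt(text: str) -> List[AdsTxtRecord]:
    """Parse the body of an ads.txt file into records (hypothetical helper)."""
    records = []
    for raw in text.splitlines():
        # Assumption: anything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3:
            continue  # not a data record
        relationship = fields[2].upper()
        if relationship not in ("DIRECT", "RESELLER"):
            continue  # skip lines that do not fit the layout
        extra = fields[3] if len(fields) > 3 and fields[3] else None
        records.append(
            AdsTxtRecord(fields[0].lower(), fields[1], relationship, extra)
        )
    return records


sample = """appnexus.com, 7466, DIRECT
openx.com, 537143344, RESELLER
teads.tv, 12293, DIRECT, 15a9c44f6d26cbe1
"""
for record in parse_ads_txt(sample):
    print(record)
```

Lines that do not fit the three-field layout are silently skipped here; a stricter tool might prefer to report them.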
The mechanism is pretty simple. It’s just a text file, easy to manage, and it may deliver on its promise. After all, if publishers are concerned about fraudulent traffic, they should be interested in the small effort it takes to boost the transparency and authenticity of programmatic ads. Again, it’s transparency in programmatic ads, not for web users.
In 2014, I co-authored the first privacy analysis of RTB (Selling Off Privacy at Auction). I also released a transparency-enhancing technology that showed the price (in dollars) that someone was paying for a particular user’s visit to a website. It all worked in real time. Back then, it was necessary to perform a lot of processing to detect two things: that a site is displaying ads served via RTB, and the monetary value involved.
With Ads.txt, though it was not the design goal (intermezzo: the European Union is working on ePrivacy, which might result in requiring more transparency in RTB), it’s pretty simple to conclude that:
- A site is taking advantage of RTB ads
- A site is offering its traffic (so also: visitors’ data) to third parties, the ad exchanges
You can also easily see the names of Ad Exchanges.
Transparency Measurement
I made a little transparency measurement. My goal was to check the current adoption of Ads.txt, to see how it’s actually used, and perhaps discover something of interest.
Results
I crawled the 10,000 most popular sites (from Alexa). Some early results are below. The method of data gathering was good old curl and a bunch of shell hackery.
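For readers who prefer Python to curl, the crawl step could look roughly like the sketch below. It is not the script used for this measurement. It assumes the file is served from /ads.txt at the site root (as with robots.txt) over plain HTTP, and the function names are hypothetical. Because many sites answer unknown paths with an HTML page, the body is only accepted if at least one line looks like an Ads.txt record.

```python
import http.client
import urllib.request
from typing import Optional


def ads_txt_url(site: str) -> str:
    # Assumption: the file lives at the site root, like robots.txt.
    return f"http://{site}/ads.txt"


def looks_like_ads_txt(body: str) -> bool:
    """True if at least one line has the 'domain, id, relationship' shape."""
    for raw in body.splitlines():
        line = raw.split("#", 1)[0].strip()
        fields = [f.strip() for f in line.split(",")]
        if len(fields) >= 3 and fields[2].upper() in ("DIRECT", "RESELLER"):
            return True
    return False


def fetch_ads_txt(site: str, timeout: float = 10.0) -> Optional[str]:
    """Return the ads.txt body for a site, or None if absent or unreachable."""
    try:
        with urllib.request.urlopen(ads_txt_url(site), timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are OSError subclasses.
        return None
    return body if looks_like_ads_txt(body) else None


if __name__ == "__main__":
    # Hypothetical site list; a real run would read the top-sites ranking.
    for site in ["example.com"]:
        found = fetch_ads_txt(site) is not None
        print(site, "has ads.txt" if found else "no ads.txt")
```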
Not many users
The specification is relatively new but is already gaining adoption. I detected 26 sites using Ads.txt. This indicated that crawling beyond the first 10,000 sites was not justified at the moment. Among the most prominent were Washington Post, New York Times, FoxNews, The Economist, but also collegehumor.com. I expect this number to grow.
I made a plot which displays the total number of Ad Exchange domain names in the Ads.txt files of particular websites. The plot also shows the numbers of Ad Exchanges marked as “reseller” and as “direct”. As an example, the New York Times includes Google, Facebook and Amazon, while the Washington Post’s site includes Google and Amazon. The website with the largest number of entries in its Ads.txt was dailymail.co.uk, a quality tabloid. The one with the fewest was Business Insider. The total number of Ad Exchanges found across all the Ads.txt files was 58, with examples including OpenX (I know them pretty well and you may too), Rubiconproject, Facebook and Google. Each of these has tens or hundreds of bidders who bid in online auctions.
Caveat: the numbers of “reseller” and “direct” domains do not add up to the “total”, as domain names are sometimes included twice or even more often. Sometimes an Ad Exchange is included twice, once as a “reseller” and once as “direct”. In some of these cases the publisher ID was identical, which indicates that some entries are redundant. It probably doesn’t change anything, but it says one thing: the Ads.txt specification is relatively simple, but still not clear enough for some. I’m wondering what kind of “fun” may result from this in the future?
Another informative observation is that publishers sometimes have a number of IDs in the same Ad Exchange system.
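The counting caveat is easy to see in code. The following is a minimal sketch with hypothetical function names and made-up sample data. It counts distinct exchange domains per relationship, flags entries where the same domain and ID appear under both roles, and lists exchanges where a publisher holds several IDs.

```python
from collections import defaultdict


def summarize_exchanges(records):
    """records: iterable of (domain, account_id, relationship) tuples."""
    by_relationship = defaultdict(set)
    for domain, _account_id, relationship in records:
        by_relationship[relationship].add(domain)
    direct = by_relationship["DIRECT"]
    reseller = by_relationship["RESELLER"]
    return {
        "total": len(direct | reseller),  # distinct domains overall
        "direct": len(direct),
        "reseller": len(reseller),
        "both": sorted(direct & reseller),  # why the counts do not add up
    }


def redundant_entries(records):
    """(domain, id) pairs listed as both DIRECT and RESELLER."""
    roles = defaultdict(set)
    for domain, account_id, relationship in records:
        roles[(domain, account_id)].add(relationship)
    return sorted(pair for pair, seen in roles.items() if len(seen) > 1)


def multiple_ids(records):
    """Exchanges where the publisher has more than one account ID."""
    ids = defaultdict(set)
    for domain, account_id, _relationship in records:
        ids[domain].add(account_id)
    return {d: sorted(v) for d, v in ids.items() if len(v) > 1}


# Made-up data for illustration only.
records = [
    ("openx.com", "1", "DIRECT"),
    ("openx.com", "2", "RESELLER"),
    ("google.com", "pub-1", "DIRECT"),
    ("google.com", "pub-2", "DIRECT"),
    ("exchange.example", "9", "DIRECT"),
    ("exchange.example", "9", "RESELLER"),
]
print(summarize_exchanges(records))
print(redundant_entries(records))
print(multiple_ids(records))
```

In this made-up sample there are three direct and two reseller domains but only three distinct domains in total, which is exactly the mismatch described above.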
Flow analysis
Ads.txt enables a linkage analysis. What links separate sites A and B are the domains their Ads.txt files have in common. The picture below (I made it in Python NetworkX) is a graph of relationships indicating which websites (blue nodes) offer their ad space/viewership/users to which Ad Exchanges (red nodes). For instance, it shows that only one site provides data to A9 (Amazon’s Ad Exchange), while Rubiconproject and Google are much more popular. The graph is already rather dense, even with the small data sample (26 domain names with Ads.txt). When the specification becomes more popular, expect the network structure to change significantly and to emerge more visibly. The true picture will be thousands of sites connected to only about a hundred or so Ad Exchanges (the original-size image can be accessed here).
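The linkage idea itself needs no graph library. The sketch below is not the NetworkX code behind the picture. It uses plain dictionaries and sets, with hypothetical function names and made-up site and exchange names, to show which sites feed each exchange and which pairs of sites are linked through a common exchange.

```python
from collections import defaultdict
from itertools import combinations


def exchange_to_sites(site_exchanges):
    """Invert {site: {exchange, ...}} into {exchange: {site, ...}}."""
    index = defaultdict(set)
    for site, exchanges in site_exchanges.items():
        for exchange in exchanges:
            index[exchange].add(site)
    return dict(index)


def shared_exchanges(site_exchanges):
    """Map each pair of sites to the exchanges both of them list."""
    links = {}
    for a, b in combinations(sorted(site_exchanges), 2):
        common = site_exchanges[a] & site_exchanges[b]
        if common:
            links[(a, b)] = common
    return links


# Made-up data for illustration only.
site_exchanges = {
    "site-a.example": {"google.com", "openx.com"},
    "site-b.example": {"google.com"},
    "site-c.example": {"smallexchange.example"},
}

# Exchanges ordered by how many sites list them (the "popularity" in the graph).
for exchange, sites in sorted(
    exchange_to_sites(site_exchanges).items(), key=lambda kv: -len(kv[1])
):
    print(exchange, len(sites))

print(shared_exchanges(site_exchanges))
```

The same dictionary could be loaded into a bipartite graph for drawing; the inverted index is usually enough to spot the few exchanges that most sites connect to.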
Summary
I’m always fascinated with the challenges posed by RTB - in terms of security, privacy, and fraud.
Transparency and trust engineering is very important for many industries. Ads.txt is a work-in-progress specification for programmatic advertising.
Transparency engineering is also important for users. It lets them understand which of their data is being processed, and how. In itself, it builds trust (in the user, and between the user and a service).
I’ll consider updating/improving the study in the future. I think Ads.txt can easily be used to improve the transparency of web browsing. Here, too, I have an idea. Perhaps I’ll write about it in the future, or build it. I also have some ideas for improving the Ads.txt standard itself.
I’m always open to interesting projects, analyses or other work. If you’re interested, feel free to contact me.