
Mozilla Drops Onerep After CEO Admits to Running People-Search Networks

By Brian Krebs

The nonprofit organization that supports the Firefox web browser said today it is winding down its new partnership with Onerep, an identity protection service recently bundled with Firefox that offers to remove users from hundreds of people-search sites. The move comes just days after a report by KrebsOnSecurity forced Onerep’s CEO to admit that he has founded dozens of people-search networks over the years.

Mozilla Monitor. Image: Mozilla Monitor Plus video on YouTube.

Mozilla only began bundling Onerep in Firefox last month, when it announced the reputation service would be offered on a subscription basis as part of Mozilla Monitor Plus. Launched in 2018 under the name Firefox Monitor, Mozilla Monitor also checks data from the website Have I Been Pwned? to let users know when their email addresses or passwords are leaked in data breaches.

On March 14, KrebsOnSecurity published a story showing that Onerep’s Belarusian CEO and founder Dimitri Shelest has launched dozens of people-search services since 2010, including a still-active data broker called Nuwber that sells background reports on people. Onerep and Shelest did not respond to requests for comment on that story.

But on March 21, Shelest released a lengthy statement wherein he admitted to maintaining an ownership stake in Nuwber, a consumer data broker he founded in 2015 — around the same time he launched Onerep.

Shelest maintained that Nuwber has “zero cross-over or information-sharing with Onerep,” and said any other old domains that may be found and associated with his name are no longer being operated by him.

“I get it,” Shelest wrote. “My affiliation with a people search business may look odd from the outside. In truth, if I hadn’t taken that initial path with a deep dive into how people search sites work, Onerep wouldn’t have the best tech and team in the space. Still, I now appreciate that we did not make this more clear in the past and I’m aiming to do better in the future.” The full statement is available here (PDF).

Onerep CEO and founder Dimitri Shelest.

In a statement released today, a spokesperson for Mozilla said it was moving away from Onerep as a service provider in its Monitor Plus product.

“Though customer data was never at risk, the outside financial interests and activities of Onerep’s CEO do not align with our values,” Mozilla wrote. “We’re working now to solidify a transition plan that will provide customers with a seamless experience and will continue to put their interests first.”

KrebsOnSecurity also reported that Shelest’s email address was used circa 2010 by an affiliate of Spamit, a Russian-language organization that paid people to aggressively promote websites hawking male enhancement drugs and generic pharmaceuticals. As noted in the March 14 story, this connection was confirmed by research from multiple graduate students at my alma mater George Mason University.

Shelest denied ever being associated with Spamit. “Between 2010 and 2014, we put up some web pages and optimize them — a widely used SEO practice — and then ran AdSense banners on them,” Shelest said, presumably referring to the dozens of people-search domains KrebsOnSecurity found were connected to his email addresses (dmitrcox@gmail.com and dmitrcox2@gmail.com). “As we progressed and learned more, we saw that a lot of the inquiries coming in were for people.”

Shelest also acknowledged that Onerep pays to run ads “on a handful of data broker sites in very specific circumstances.”

“Our ad is served once someone has manually completed an opt-out form on their own,” Shelest wrote. “The goal is to let them know that if they were exposed on that site, there may be others, and bring awareness to there being a more automated opt-out option, such as Onerep.”

Reached via Twitter/X, HaveIBeenPwned founder Troy Hunt said he knew Mozilla was considering a partnership with Onerep, but that he was previously unaware of the Onerep CEO’s many conflicts of interest.

“I knew Mozilla had this in the works and we’d casually discussed it when talking about Firefox monitor,” Hunt told KrebsOnSecurity. “The point I made to them was the same as I’ve made to various companies wanting to put data broker removal ads on HIBP: removing your data from legally operating services has minimal impact, and you can’t remove it from the outright illegal ones who are doing the genuine damage.”

Playing both sides — creating and spreading the same digital disease that your medicine is designed to treat — may be highly unethical and wrong. But in the United States it’s not against the law. Nor is collecting and selling data on Americans. Privacy experts say the problem is that data brokers, people-search services like Nuwber and Onerep, and online reputation management firms exist because virtually all U.S. states exempt so-called “public” or “government” records from consumer privacy laws.

Those include voting registries, property filings, marriage certificates, motor vehicle records, criminal records, court documents, death records, professional licenses, and bankruptcy filings. Data brokers also can enrich consumer records with additional information, by adding social media data and known associates.

The March 14 story on Onerep was the second in a series of three investigative reports published here this month that examined the data broker and people-search industries, and highlighted the need for more congressional oversight — if not regulation — on consumer data protection and privacy.

On March 8, KrebsOnSecurity published A Close Up Look at the Consumer Data Broker Radaris, which showed that the co-founders of Radaris operate multiple Russian-language dating services and affiliate programs. It also appears many of their businesses have ties to a California marketing firm that works with a Russian state-run media conglomerate currently sanctioned by the U.S. government.

On March 20, KrebsOnSecurity published The Not-So-True People-Search Network from China, which revealed an elaborate web of phony people-search companies and executives designed to conceal the location of people-search affiliates in China who are earning money promoting U.S.-based data brokers that sell personal information on Americans.

Inside the Massive Alleged AT&T Data Breach

By Troy Hunt

I hate having to use that word - "alleged" - because it's so inconclusive and I know it will leave people with many unanswered questions. (Edit: 12 days after publishing this blog post, it looks like the "alleged" caveat can be dropped, see the addition at the end of the post for more.) But sometimes, "alleged" is just where we need to begin and over the course of time, proper attribution is made and the dots are joined. We're here at "alleged" for two very simple reasons: one is that AT&T is saying "the data didn't come from us", and the other is that I have no way of proving otherwise. But I have proven, with sufficient confidence, that the data is real and the impact is significant. Let me explain:

Firstly, just as a primer if you're new to this story, read BleepingComputer's piece on the incident. What it boils down to is in August 2021, someone with a proven history of breaching large organisations posted what they claimed were 70 million AT&T records to a popular hacking forum and asked for a very large amount of money should anyone wish to purchase the data. From that story:

From the samples shared by the threat actor, the database contains customers' names, addresses, phone numbers, Social Security numbers, and date of birth.

Fast forward two and a half years and the successor to this forum saw a post this week alleging to contain the entire corpus of data. Except that rather than put it up for sale, someone has decided to just dump it all publicly and make it easily accessible to the masses. This isn't unusual: "fresh" data has much greater commercial value and is often tightly held for a long period before being released into the public domain. The Dropbox and LinkedIn breaches, for example, occurred in 2012 before being broadly distributed in 2016 and just like those incidents, the alleged AT&T data is now in very broad circulation. It is undoubtedly in the hands of thousands of internet randos.

AT&T's position on this is pretty simple:

AT&T continues to tell BleepingComputer today that they still see no evidence of a breach in their systems and still believe that this data did not originate from them.

The old adage of "absence of evidence is not evidence of absence" comes to mind (just because they can't find evidence of it doesn't mean it didn't happen), but as I said earlier on, I (and others) have so far been unable to prove otherwise. So, let's focus on what we can prove, starting with the accuracy of the data.

The linked article talks about the author verifying the data with various people he knows, as well as other well-known infosec identities verifying its accuracy. For my part, I've got 4.8M Have I Been Pwned (HIBP) subscribers I can lean on to assist with verification, and it turns out that 153k of them are in this data set. What I'll typically do in a scenario like this is reach out to the 30 newest subscribers (people who will hopefully recall the nature of HIBP from their recent memory), and ask them if they're willing to assist. I linked to the story from the beginning of this blog post and got a handful of willing respondents to whom I sent their data and asked two simple questions:

  1. Does this data look accurate?
  2. Are you an AT&T customer and if not, are you a customer of another US telco?

The first reply I received was simple, but emphatic:

Inside the Massive Alleged AT&T Data Breach

This individual had their name, phone number, home address and most importantly, their social security number exposed. Per the linked story, social security numbers and dates of birth exist on most rows of the data in encrypted format, but two supplemental files expose these in plain text. Taken at face value, it looks like whoever snagged this data also obtained the private encryption key and simply decrypted the vast bulk (but not all) of the protected values.

Inside the Massive Alleged AT&T Data Breach

The above example simply didn't have plain text entries for the encrypted data. Just by way of raw numbers, the file that aligns with the "70M" headline actually has 73,481,539 lines with 49,102,176 unique email addresses. The file with decrypted SSNs has 43,989,217 lines and the decrypted dates of birth file only has 43,524 rows. (Edit: the reason for this later became clear - there is only one entry per date of birth which is then referenced from multiple records.) The last file, for example, has rows that look just like this:

.encrypted_value='*0g91F1wJvGV03zUGm6mBWSg==' .decrypted_value='1996-07-18'

That encrypted value is precisely what appears in the large file hence providing an easy way of matching all the data together. But those numbers also obviously mean that not every impacted individual had their SSN exposed, and most individuals didn't have their date of birth leaked. (Edit: per above, the same entries in the DoB file are referenced by multiple source records so whilst not every record had a DoB recorded, the difference isn't as stark as I originally reported.)
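To make that matching concrete, here's a minimal sketch of how one of the supplemental files could be joined back to the main records. The file name is a placeholder, and the assumption that the main file carries the same encrypted token in its DoB column is mine, based on the sample row above:

```python
import re

# Placeholder file name; the real dump's file names aren't reproduced here.
DOB_LOOKUP_FILE = "decrypted_dob.txt"

# Matches lines in the format shown above:
# .encrypted_value='*0g91F1wJvGV03zUGm6mBWSg==' .decrypted_value='1996-07-18'
LINE_RE = re.compile(r"\.encrypted_value='([^']+)'\s+\.decrypted_value='([^']+)'")

def load_lookup(path: str) -> dict:
    """Build a map of encrypted token -> decrypted value from a supplemental file."""
    lookup = {}
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LINE_RE.search(line)
            if match:
                lookup[match.group(1)] = match.group(2)
    return lookup

# Resolving a main-file record then becomes a dictionary lookup on its encrypted token.
dob_by_token = load_lookup(DOB_LOOKUP_FILE)
print(dob_by_token.get("*0g91F1wJvGV03zUGm6mBWSg=="))  # -> '1996-07-18' if present
```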

Inside the Massive Alleged AT&T Data Breach

As I'm fond of saying, there's only one thing worse than your data appearing on the dark web: it appearing on the clear web. And that's precisely where it is; the forum this was posted to isn't within the shady underbelly of a Tor hidden service, it's out there in plain sight on a public forum easily accessed by a normal web browser. And the data is real.

That last response is where most people impacted by this will now find themselves - "what do I do?" Usually I'd tell them to get in touch with the impacted organisation and request a copy of their data from the breach, but if AT&T's position is that it didn't come from them then they may not be much help. (Although if you are a current or previous customer, you can certainly request a copy of your personal information regardless of this incident.) I've personally also used identity theft protection services since as far back as the '90s now, simply to know when actions such as credit enquiries appear against my name. In the US, this is what services like Aura do and it's become common practice for breached organisations to provide identity protection subscriptions to impacted customers (full disclosure: Aura is a previous sponsor of this blog, although we have no ongoing or upcoming commercial relationship).

What I can't do is send you your breached data, or an indication of what fields you had exposed. Whilst I did this in that handful of aforementioned cases as part of the breach verification process, this is something that happens entirely manually and is infeasible en masse. HIBP only ever stores email addresses and never the additional fields of personal information that appear in data breaches. In case you're wondering why that is, we got a solid reminder only a couple of months ago when a service making this sort of data available to the masses had an incident that exposed tens of billions of rows of personal information. That's just an unacceptable risk for which the old adage of "you cannot lose what you do not have" provides the best possible fix.

As I said in the intro, this is not the conclusive end I wanted for this blog post... yet. As impacted HIBP subscribers receive their notifications and particularly as those monitoring domains learn of the aliases in the breach (many domain owners use unique aliases per service they sign up to), we may see a more conclusive outcome to this incident. That may not necessarily be confirmation that the data did indeed originate from AT&T, it could be that it came from a third-party processor they use or from another entity altogether that's entirely unrelated. The truth is somewhere there in the data; I'll add any relevant updates to this blog post if and when it comes out.

As of now, all 49M impacted email addresses are searchable within HIBP.
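If you'd rather check programmatically than via the website, the documented HIBP v3 API does the same lookup; note it requires an API key, unlike searching on the site. A minimal sketch, with the key name and user agent chosen for illustration:

```python
import json
import os
import urllib.error
import urllib.parse
import urllib.request

API_KEY = os.environ["HIBP_API_KEY"]  # an HIBP API key, obtained separately

def breaches_for(email: str) -> list:
    """Return the breaches HIBP lists for an address, or an empty list if it's in none."""
    url = ("https://haveibeenpwned.com/api/v3/breachedaccount/"
           f"{urllib.parse.quote(email)}?truncateResponse=false")
    req = urllib.request.Request(
        url, headers={"hibp-api-key": API_KEY, "user-agent": "hibp-example-check"})
    try:
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 simply means the address isn't in any indexed breach
            return []
        raise

print([b["Name"] for b in breaches_for("someone@example.com")])
```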

Edit (31 March): AT&T have just released a short statement making 2 important points:

  1. AT&T data-specific fields were contained in a data set
  2. it is not yet known whether the data in those fields originated from AT&T or one of its vendors

They've also been mass-resetting account passcodes after TechCrunch apparently alerted AT&T to the presence of these in the data set. That article also includes the following statement from AT&T:

Based on our preliminary analysis, the data set appears to be from 2019 or earlier, impacting approximately 7.6 million current AT&T account holders and approximately 65.4 million former account holders

Between originally publishing this blog post and AT&T's announcements today, there have been dozens of comments left below that attribute the source of the breach to AT&T in ways that made it increasingly unlikely that the data could have been sourced from anywhere else. I know that many journos (myself included) reached out to folks at AT&T to draw their attention to this, and I'm happy to now end this blog post by quoting myself from the opening para 😊

But sometimes, "alleged" is just where we need to begin and over the course of time, proper attribution is made and the dots are joined.

The Data Breach "Personal Stash" Ecosystem

By Troy Hunt

I've always thought of it a bit like baseball cards; a kid has a card of this one player that another kid is keen on, and that kid has a card the first one wants so they make a trade. They both have a bunch of cards they've collected over time and by virtue of existing in the same social circles, trades are frequent, and cards flow back and forth on a regular basis. That's the analogy I often use to describe the data breach "personal stash" ecosystem, but with one key difference: if you trade a baseball card then you no longer have the original card, but if you trade a data breach, which is merely a digital file, it replicates.

There are personal stashes of data breaches all over the place and they're usually presented like this one:

The Data Breach "Personal Stash" Ecosystem

You'll recognise many of those names because they're noteworthy incidents that received a bunch of press. MySpace. Adobe. LinkedIn. Ashley Madison.

The same incidents appear here:

The Data Breach "Personal Stash" Ecosystem

And so on and so forth. Stashes of breaches like this are all over the place and they fuel an exchange ecosystem that replicates billions of records of personal data over and over again. Your data. My data. The data of a significant portion of the global internet-using population, just freely flowing backwards and forwards not just in the shady corners of "the dark web" but traded out there in the clear on mainstream websites. Until inevitably:

The Data Breach "Personal Stash" Ecosystem

Diogo Santos Coelho was 14 when he started RaidForums, and was 21 by the time he was arrested for running the service 2 years ago. A kid, exchanging data without the maturity to understand the consequences of his actions. RaidForums left a void that was quickly filled by BreachForums:

The Data Breach "Personal Stash" Ecosystem

Conor Fitzpatrick was 20 years old when he was finally picked up for running the service last year. Still just a kid, at least in the colloquial fashion in which we refer to youngsters as when we get a bit older, but surely still legally a minor when he chose to begin collecting data breaches.

Websites like these are taken down for a simple reason:

The ecosystem of personal stashes exchanged with other parties fuels crime.

For example, data breaches seed services set up with the express intent of monetising a broad range of personal attributes to the detriment of people who are already victims of a breach. Call them shady versions of Have I Been Pwned if you will, and this talk I gave at AusCERT a couple of years ago is a great explainer (deep-linked to the start of that segment):

The first service I spoke about in that segment was We Leak Info and it was run by two 22-year-old guys. The website first appeared 3 years earlier - only a year after the creators had left childhood - and it allowed anyone with the money to access anyone else's personal data including:

names, email addresses, usernames, phone numbers, and passwords

One of the duo was later sentenced to 2 years in prison for his role, and when you read the sorts of conversations they were having, you can't help but think they behaved exactly like you'd expect a couple of young guys who thought they were anonymous would:

The Data Breach "Personal Stash" Ecosystem

In the video, I mentioned Jordan Bloom in relation to LeakedSource, a veritable older gentleman of this class of crime being 24 when the site first appeared.

The company operating LeakedSource, Defiant Tech Inc, which was founded by Jordan Bloom, eventually entered a guilty plea to charges that included trafficking in identity information, and when you read what that involved, you can see why this would attract the ire of law enforcement agencies:

However, unlike other breach notification services, such as Have I Been Pwned, LeakedSource also gave subscribers access to usernames, passwords (including in clear text), email addresses and IP addresses. LeakedSource services were often advertised on hacking forums and there was suspicion that its operators were actively looking to hack organizations whose data they could add to their database.

In 2016, a well-wisher purchased my own data from LeakedSource and sent over a dozen different records similar to this one:

The Data Breach "Personal Stash" Ecosystem

Not mentioned in my talk but running in the same era was Leakbase, yet another service that collated huge volumes of sensitive data and sold it to absolutely anyone:

The Data Breach "Personal Stash" Ecosystem

And just like all the other ones, the same data appeared over and over again:

The Data Breach "Personal Stash" Ecosystem

It went dark at the end of 2017 amidst speculation the disappearance was tied to the takedown of the Hansa dark web market. If that was the case, why did we never hear of charges being laid as we did with We Leak Info and LeakedSource? Could it be that the operator of Leakbase was only ever so slightly younger than the other guys mentioned above and not having yet reached adulthood, managed to dodge charges? It would certainly be consistent with the demographic pattern of those with personal stashes of data breaches.

Speaking of patterns: We Leak Info, LeakedSource, Leakbase - it's like there's a theme of shady services attached to the word. As I say in the video, there's also a theme of attempting to remain anonymous (which clearly hasn't worked very well!), and a theme of attempting to eschew legal responsibility for how the data is used by merely putting words in the terms of service. For example, here's Jordan's go at deflecting his role in the ecosystem and yes, this was the entire terms of service:

The Data Breach "Personal Stash" Ecosystem

I particularly like this clause:

You may only use this tool for your own personal security and data research. You may only search information about yourself, or those you are authorized in writing to do so.

That's not going to keep you out of trouble! Time and time again, I see this sort of wording on services used as if it's going to make a difference when the law comes asking hard questions; "Hey we literally told people to play nice with the data!"

We Leak Info used similar entertaining wording with some of the highlights including:

  1. We Leak Info strictly prohibits the use of its Services to cause damage or harm to others
  2. You may not use Our Services in acts deemed illegal by the laws in Your region
  3. We Leak Info does not knowingly participate in the act of obtaining or distributing Data
  4. We Leak Info will cooperate with any legal investigations that it determines worthy and valid at its own discretion

That last one in particular is an absolute zinger! But again, remember, we're talking about guys who stood this service up as teenagers and literally worked on the assumption of "as [l]ong as we cooperate they [the FBI] won't fuck with us" 🤦‍♂️ The ignorance of that attitude whilst advertising services on criminal forums is just mind-blowing, even for kids.

All of which brings me to the inspiration for this blog post:

Interesting find by @MayhemDayOne, wonder if it was from a shady breach search service (we’ve seen a bunch shut down over the years)? Either way, collecting and storing this data is now trivial so not a big surprise to see someone screw up their permissions and (re)leak it all. https://t.co/DM7udeUcRk

— Troy Hunt (@troyhunt) January 22, 2024

It's like I've seen it all before! No, really, because only a couple of days later someone running a service popped up and claimed responsibility for having exposed the data due to "a firewall misconfiguration". I'm not going to name or link the service, but I will describe a few key features:

  1. After purchasing access, it returns extensive personal information exposed in data breaches including names, email addresses, usernames, phone numbers, and passwords
  2. The operator is clearly trying to remain anonymous with no discoverable information about who is running it
  3. It has ToS that include: "You may only use this service for your own personal security and research. Furthermore, you may only search for information about yourself or those who you are authorized in writing to do so." (I know what you're thinking, so I diff'd it for you)
  4. The name of the service starts with the word "leak"

I could write predictions about the future of this service but if you've read this far and paid attention to the precedents, you can reliably form your own conclusion. The outcome is easily predictable and indeed it was the predictability of the whole situation when I started getting bombarded with queries about the "Mother of all Breaches" that frustrated me; of course it was someone's personal stash, because we've seen it all before and we live in an era where it's dead easy to build services like this. Cloud is ubiquitous and storage is cheap, you can stand up great looking websites in next to no time courtesy of freely available templates, and the whole data breach trading ecosystem I referred to earlier can easily seed services like this.

Maybe the young guy running this service (assuming the previously observed patterns apply) will learn from history and quietly exit while the getting is good, I don't know, time will tell. At the very least, if he reads this and takes nothing else away, don't go driving around in a bright green Lamborghini!

Edit: In the original version of this blog post, it was incorrectly implied that Jordan Bloom may have been the person who pled guilty to charges when in fact it was the company that ran LeakedSource, Defiant Tech Inc, that the plea was entered under. To the extent that the blog contained words to the effect of, or otherwise implied or contained innuendo that Mr Bloom engaged in criminal or otherwise illegal conduct, or pled guilty to trafficking identity information, I apologise and unreservedly retract such statements and this blog has been edited to ensure that the facts involved in this matter are accurately portrayed.

Inside the Massive Naz.API Credential Stuffing List

By Troy Hunt

It feels like not a week goes by without someone sending me yet another credential stuffing list. It's usually something to the effect of "hey, have you seen the Spotify breach", to which I politely reply with a link to my old No, Spotify Wasn't Hacked blog post (it's just the output of a small set of credentials successfully tested against their service), and we all move on. Occasionally though, the corpus of data is of much greater significance, most notably the Collection #1 incident of early 2019. But even then, the rapid appearance of Collections #2 through #5 (and more) quickly became, as I phrased it in that blog post, "a race to the bottom" I did not want to take further part in.

Until the Naz.API list appeared. Here's the back story: this week I was contacted by a well-known tech company that had received a bug bounty submission based on a credential stuffing list posted to a popular hacking forum:

Inside the Massive Naz.API Credential Stuffing List

Whilst this post dates back almost 4 months, it hadn't come across my radar until now and inevitably, also hadn't been sent to the aforementioned tech company. They took it seriously enough to take appropriate action against their (very sizeable) user base which gave me enough cause to investigate it further than your average cred stuffing list. Here's what I found:

  1. 319 files totalling 104GB
  2. 70,840,771 unique email addresses
  3. 427,308 individual HIBP subscribers impacted
  4. 65.03% of addresses already in HIBP (based on a 1k random sample set)

That last number was the real kicker; when a third of the email addresses have never been seen before, that's statistically significant. This isn't just the usual collection of repurposed lists wrapped up with a brand-new bow on it and passed off as the next big thing; it's a significant volume of new data. When you look at the above forum post the data accompanied, the reason why becomes clear: it's from "stealer logs" or in other words, malware that has grabbed credentials from compromised machines. Apparently, this was sourced from the now defunct illicit.services website which (in)famously provided search results for other people's data along these lines:

Inside the Massive Naz.API Credential Stuffing List

I was aware of this service because, well, just look at the first example query 🤦‍♂️

So, what does a stealer log look like? Website, username and password:

Inside the Massive Naz.API Credential Stuffing List

That's just the first 20 rows out of 5 million in that particular file, but it gives you a good sense of the data. Is it legit? Whilst I won't test a username and password pair on a service (that's way too far into the grey for my comfort), I regularly use enumeration vectors on websites to validate whether an account actually exists or not. For example, take that last entry for racedepartment.com, head to the password reset feature and mash the keyboard to generate a (quasi) random alias @hotmail.com:

Inside the Massive Naz.API Credential Stuffing List

And now, with the actual Hotmail address from that last line:

Inside the Massive Naz.API Credential Stuffing List

The email address exists.

The VideoScribe service on line 9:

Inside the Massive Naz.API Credential Stuffing List

Exists.

And even the service on the very first line:

Inside the Massive Naz.API Credential Stuffing List

From a verification perspective, this gives me a high degree of confidence in the legitimacy of the data. Leaving aside the question of how valid the accompanying passwords remain, time and time again the email addresses in the stealer logs checked out on the services they appeared alongside.
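For what it's worth, this kind of check is easy to script, though every site words its reset flow differently. The endpoint and the "unknown address" wording below are placeholders rather than any real service's behaviour - a sketch only:

```python
import secrets
import urllib.parse
import urllib.request

# Placeholder endpoint and response wording: neither belongs to a real service.
RESET_URL = "https://example.com/password-reset"
UNKNOWN_ADDRESS_MARKER = "no account with that email"

def account_seems_to_exist(email: str) -> bool:
    """Submit the reset form and infer existence from the response wording."""
    data = urllib.parse.urlencode({"email": email}).encode()
    with urllib.request.urlopen(RESET_URL, data=data) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return UNKNOWN_ADDRESS_MARKER not in body

# Control case first: a keyboard-mash alias that shouldn't exist anywhere.
random_alias = f"{secrets.token_hex(8)}@hotmail.com"
print(account_seems_to_exist(random_alias))                          # expected: False
print(account_seems_to_exist("address-from-the-log@hotmail.com"))    # placeholder address
```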

Another technique I regularly use for validation is to reach out to impacted HIBP subscribers and simply ask them: "are you willing to help verify the legitimacy of a breach and if so, can you confirm if your data looks accurate?" I usually get pretty prompt responses:

Yes, it does. This is one of the old passwords I used for some online services. 

When I asked them to date when they might have last used that password, they believed it was either 2020 or 2021.

And another whose details appear alongside a Webex URL:

Yes, it does. but that was very old password and i used it for webex cuz i didnt care and didnt use good pass because of the fear of leaking

And another:

Yes these are passwords I have used in the past.

Which got me wondering: is my own data in there? Yep, turns out it is and with a very old password I'd genuinely used pre-2011 when I rolled over to 1Password for all my things. So that sucks, but it does help me put the incident in more context and draw an important conclusion: this corpus of data isn't just stealer logs, it also contains your classic credential stuffing username and password pairs too. In fact, the largest file in the collection is just that: 312 million rows of email addresses and passwords.

Speaking of passwords, given the significance of this data set we've made sure to roll every single one of them into Pwned Passwords. Stefán has been working tirelessly the last couple of days to trawl through this massive corpus and get all the data in so that anyone hitting the k-anonymity API is already benefiting from those new passwords. And there's a lot of them: it's a rounding error off 100 million unique passwords that appeared 1.3 billion times across the corpus of data 😲 Now, what does that tell you about the general public's password practices? To be fair, there are instances of duplicated rows, but there's also a massive prevalence of people using the same password across multiple different services and completely different people using the same password (there are a finite set of dog names and years of birth out there...) And now more than ever, the impact of this service is absolutely huge!

When we weren't looking, @haveibeenpwned's Pwned Passwords rocketed past 7 *billion* requests in a month 😲 pic.twitter.com/hVDxWp3oQG

— Troy Hunt (@troyhunt) January 16, 2024

Pwned Passwords remains totally free and completely open source for both code and data so do please make use of it to the fullest extent possible. This is such an easy thing to implement, and it has a profound impact on credential stuffing attacks so if you're running any sort of online auth service and you're worried about the impact of Naz.API, this now completely kills any attack using that data. Password reuse remains rampant, so attacks of this type prosper (23andMe's recent incident comes immediately to mind); definitely get out in front of this one as early as you can.
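If you want a sense of just how little is involved, here's a minimal sketch of the k-anonymity range lookup: hash the password with SHA-1, send only the first five characters of the hash to the public range API, then match the remaining suffix locally against the list that comes back.

```python
import hashlib
import urllib.request

def pwned_count(password: str) -> int:
    """Look up a password in Pwned Passwords via the k-anonymity range API.

    Only the first five characters of the SHA-1 hash are sent; the full hash
    and the password itself never leave this machine.
    """
    sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = sha1[:5], sha1[5:]
    with urllib.request.urlopen(f"https://api.pwnedpasswords.com/range/{prefix}") as resp:
        for line in resp.read().decode("utf-8").splitlines():
            candidate, _, count = line.partition(":")
            if candidate == suffix:
                return int(count)
    return 0  # not found in the corpus

print(pwned_count("password123"))  # a notoriously common password; expect a large count
```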

So that's the story with the Naz.API data. All the email addresses are now in HIBP and searchable either individually or via domain and all those passwords are in Pwned Passwords. There are inevitably going to be queries along the lines of "can you show me the actual password" or "which website did my record appear against" and as always, this just isn't information we store or return in queries. That said, if you're following the age-old guidance of using a password manager, creating strong and unique ones and turning 2FA on for all your things, this incident should be a non-event. If you're not and you find yourself in this data, maybe this is the prompt you finally needed to go ahead and do those things right now 🙂

Edit: A few clarifications based on comments:

  1. The blog post refers to both stealer logs and classic credential stuffing lists. Some of this data does not come from malware and has been around for a significant period of time. My own email address, for example, accompanied a password not used for well over a decade and did not accompany a website indicating it was sourced from malware.
  2. If you're in this corpus of data and are not sure which password was compromised, 1Password can automatically (and anonymously) scan all your passwords against Pwned Passwords which includes all passwords from this corpus of data.
  3. It's already in the last para of the blog post but given how many comments have asked the question: no, we don't store any data beyond the email addresses in the breach. This means we don't store any additional data from the breach such as if a specific website was listed next to a given address.

A Decade of Have I Been Pwned

By Troy Hunt

A decade ago to the day, I published a tweet launching what would surely become yet another pet project that scratched an itch, was kinda useful to a few people but other than that, would shortly fade away into the same obscurity as all the other ones I'd launched over the previous couple of decades:

It's alive! "Have I been pwned?" by @troyhunt is now up and running. Search for your account across multiple breaches http://t.co/U0QyHZxP6k

— Have I Been Pwned (@haveibeenpwned) December 4, 2013

And then, as they say, things kinda escalated quickly. The very next day I published a blog post about how I made it so fast to search through 154M records and thus began a now 185-post epic where I began detailing the minutiae of how I built this thing, the decisions I made about how to run it and commentary on all sorts of different breaches. And now, a 10th birthday blog post about what really sticks out a decade later. And that's precisely what this 185th blog post tagging HIBP is - the noteworthy things of the years past, including a few things I've never discussed publicly before.

Pwned?

You know why it's called "Have I Been Pwned"? Try coming up with almost any conceivable normal sounding English name and getting a .com domain for it. Good luck! That was certainly part of it, but another part of the name choice was simply that I honestly didn't expect this thing to go anywhere. It's like I said in the intro of this post where I fully expected this to be another failed project, so why does the name matter?

But it's weird how "pwned" has stuck and increasingly, become synonymous with HIBP. For many people, the first time they ever hear the word is in the context of "Have I Been..." with an ensuing discussion often explaining the origins of the term as it relates to gaming culture. And if you do go and look for a definition of the term online, you'll come across resources such as How “PWNED” went from hacker slang to the internet’s favourite taunt:

Then in 2013, when various web services and sites saw an uptick in personal data breaches, security expert Troy Hunt created the website “Have I Been Pwned?” Anyone can type in an email address into the site to check if their personal data has been compromised in a security breach.

And somehow, this little project is now referenced in the definition of the name it emerged from. Weird.

But, because it's such an odd name that has so frequently been mispronounced or mistyped, I've ended up with a whole raft of bizarre domain names including haveibeenpaened.com, haveibeenpwnded.com, haveibeenporned.com and my personal favourite, haveibeenprawned.com (because a journo literally pronounced it that way in a major news segment 🤦‍♂️). Not to mention all the other weird variations including haveibeenburned.com, haveigotpwned.com, haveibeenrekt.com and after someone made the suggestion following the revelation that PornHub follows me, haveibeenfucked.com 🤷‍♂️

Press

It's difficult to even know where to start here. How does the little site with the weird name end up in the press? Inevitably, "because data breaches", and it's nuts just how much exposure this project has had because of them. These are often mainstream news events and what reporters often want to impart to people is along the lines of "Here's what you should do if you've been impacted", which often boils down to checking HIBP.

Press is great for raising awareness of the project, but it has also quite literally DDoS'd the service with the Martin Lewis Money Show in the UK knocking it offline in 2016. Cool! No, for real, I learned some really valuable lessons from that experience which, of course, I shared in a blog post. And then ensured could never happen again.

Back in 2018, Gizmodo reckoned HIBP was one of the top 100 websites that shaped the internet as we knew it, alongside the likes of Wikipedia, Google, Amazon and Goatse (don't Google it). Only the year after it launched, TIME magazine reckoned it was one of the 50 best websites of the year. And every time I do a Google search for a major news outlet, I find this little website. The Wall Street Journal. The Standard (nice headline!). USA Today. Toronto Star. De Telegraaf. VG. Le Monde. Corriere della Sera. It's wild - I just kept Googling for the largest newspapers in various parts of the world and kept getting hits!

The point is that it's had impact, and nobody is more surprised about that than me.

Congress

How on earth did I end up here?!

A Decade of Have I Been Pwned

6 years and a few days ago now, I found myself in a place I'd only ever seen before in the movies: Congress. American Congress. Saying "pwned"!

For reasons I still struggle to completely grasp, the folks there thought it would be a good idea if I flew to the other side of the world and talked about the impact of data breaches on identity verification. "You know they're just trying to get you to DC so they can arrest you for all that stolen data you have, right?! 🤣", the internet quipped. But instead, I had one of the most memorable moments of my career as I read my testimony (these are public hearings so it's all recorded and available to watch), responded to questions from congressmen and congresswomen and rounded out the trip staring down at where they inaugurate presidents:

A Decade of Have I Been Pwned

Today, that photo adorns the wall outside my office and dozens of times a day I look at it and ask the same question - how did it all lead to this?!

Svalbard

The potential sale of HIBP was a very painful, very expensive chapter of life, announced in a blog post from June 2019. For the most part, I was as transparent and honest as I could be about the reasons behind the decision, including the stress:

To be completely honest, it's been an enormously stressful year dealing with it all.

More than one year later, I finally wrote about the source of so much of that stress: divorce. Relationship circumstances had put a huge amount of pressure on me and I needed a relief valve which at the time, I thought would be the sale of the project I loved so much but was becoming increasingly demanding. Ultimately, Project Svalbard (the code name for the sale of HIBP), had the opposite effect as years of bitter legal battles with my ex ensued, in part due to the perceived value that would have been realised had it been sold and some big tech company owned my arse for years to come. The project I built out of a passion to do community good was now being used as a tool to extract as much money out of me as possible. There's a wild story to be told there one day but whilst that saga is now well and truly behind me, the scars are still raw.

There were many times throughout Project Svalbard where I felt like I was living out an episode of Silicon Valley, especially as I hopped between interviews at the who's-who of tech firms in San Francisco to meet potential acquirers. But there was one moment in particular that I knew at the time would form an indelible memory, so I took a photo of it:

A Decade of Have I Been Pwned

I'm sitting in a rental car in Yosemite whilst driving from the aforementioned meetings in SF and onto Vegas for the annual big cyber-events. I had a scheduled call with a big tech firm who was a potential acquirer and should that deal go through, the guy I was speaking to would be my new boss. I'd done that dozens of times by now and I don't know if it was because I was especially tired or emotional or if there was something in the way he phrased the question, but this triggered something deep inside me:

So Troy, what would your perfect day in the office look like?

I didn't say it this directly, but I kid you not this is exactly what popped into my mind:

I get on my jet ski and I do whatever the fuck I want

My potential new overlord had somehow managed to find exactly the raw nerve to touch that made me realise how valuable independence had become to me. 6 months later, Project Svalbard was dead after a deal I'd struck fell through. I still can't talk about the precise circumstances due to being NDA'd up to the wazoo, but the term we chose to use was "a change of business circumstances on behalf of the purchaser". With the benefit of hindsight, I've never been so happy to have lost so much 😊

The FBI

10 years ago, I certainly didn't see this on the cards:

This is so cool, thanks @FBI 😊 pic.twitter.com/aqMi3as91O

— Troy Hunt (@troyhunt) June 28, 2023

Nor did I expect them to be actively feeding data into HIBP. Or the UK's NCA to be feeding data in. Or various other law enforcement agencies the world over. And I never envisioned a time where dozens of national governments would be happy to talk about using the service.

A couple of months ago, the ABC wrote a long piece on how this whole thing is, to use their term, a strange sign of the times.

He’s just “a dude on the web”, but Troy Hunt has ended up playing an oddly central role in global cybersecurity.
A Decade of Have I Been Pwned

It's strange until you look at it through the lens of aligned objectives: the whole idea of HIBP was "to do good things after bad things happen" which is well aligned with the mandates of law enforcement agencies. You could call it... common ground:

This is something I suspect a lot of people don't understand - that law enforcement agencies often work in conjunction with private enterprise to further their goals of protecting people just like you and me. It's something I certainly didn't understand 10 years ago, and I still remember the initial surprise when agencies started reaching out. Many years on, these have become really productive relationships with a bunch of top notch people, a number of whom I now count as friends and make an effort to spend time with on my travels.

Passwords

This was never on the cards originally. In fact, I'd always been adamant that there should never be passwords in HIBP although in my defence, the sentiment was that they should never appear next to the username they originally accompanied. But looking at passwords through the lens of how breach data can be used to do good things, a list of known compromised passwords disassociated from any form of PII made a lot of sense. So, in 2017, Pwned Passwords was born. You know what I was saying earlier about things escalating quickly? Yeah:

Setting all new records for Pwned Passwords this week: biggest day ever yesterday at 282M requests and biggest rolling 30 days ever, now passing the 6 *billion* requests mark! pic.twitter.com/dQiuQim3da

— Troy Hunt (@troyhunt) September 12, 2023

As if to make the point, I just checked the latest stats and last week we did 301.6M requests in a single day. 100% of those requests - and that's not a rounded number either, it's 100.0000000000% - were served from Cloudflare's cache 🤯

There's so much I love about this service. I love that it's free, there's no auth, it's entirely open source (both code and data), the FBI feeds data into it and perhaps most importantly, it has real impact on security. It's such a simple thing, but every time you see a headline such as "Big online website hit with credential stuffing attack", a significant portion of the accounts being taken over have passwords that could easily have been blocked.

The Paradox of Handling Data Breaches

On multiple occasions now, I've had conversations that can best be paraphrased as follows:

Random Internet Person: I'm going to report you to the FBI for having all that stolen data

Me: Maybe you should start by Googling "troy hunt fbi" first...

But I understand where they're coming from and the paradox I refer to is the perceived conflict between handling what is usually the output of a crime whilst simultaneously trying to perform a community good. It's the same discussion I've often had with people citing privacy laws in their corner of the world (often the EU and GDPR) as the reason why HIBP shouldn't exist: "but you're processing data without informed consent!", they'll claim. The issue of there being other legal bases for processing aside, nobody consents to being in a data breach! The natural progression of that conversation is that being in a data breach is a parallel discussion to HIBP then indexing it and making it searchable, which is something I've devoted many words to addressing in the past.

But for all the bluster the occasional random internet person can have (and honestly, I could count the number of annual instances of this on one hand), nothing has come of any complaints. And when I say "complaints", it's often nothing more than a polite conversation which may simply conclude with an acknowledgment of opposing views and that's it. There has been one exception in the entire decade of running this service where a complaint did come via a government privacy regulator, I responded to all the questions that were asked and that was the end of it.

People

When you have a pet project like HIBP was in the beginning, it's usually just you putting in the hours. That's fine, it's a hobby and you're scratching an itch, so what does it matter that there's nobody else involved? Like many similar passion projects, HIBP consumed a lot of hours from early on, everything from obviously building the service then sourcing data breaches, verifying and disclosing them, writing up descriptions and even editing every single one of those 700+ logos by hand to be just the right dimensions and file size. But in the beginning, if I'd just stopped one day, what would happen? Nothing. But today, a genuinely important part of the internet that a huge number of individuals, corporations and governments have built dependencies on would stop working if I lost interest.

The dependency on just me was partly behind the possible sale in 2019, but clearly that didn't eventuate. There was always the option to employ people and build it out like most people would a normal company, but every time I gave that consideration it just didn't stack up for a whole bunch of reasons. It was certainly feasible from the perspective of building some sort of valuable commercial entity, but in just the same way as that question about my perfect day in the office sucked the soul from my body, so did the prospect of being responsible for other people. Employment contracts. Salary negotiations. Performance reviews. Sick leave and annual leave and all sorts of other people issues from strangers I'd need to entrust with "my baby". So, bringing in more people was a really unattractive idea, with 2 exceptions:

In early 2021, my (soon to be at the time) wife Charlotte started working for HIBP.

A Decade of Have I Been Pwned

Charlotte had spent the last 8 years working with people just like me: software nerds. As a project manager for the NDC conferences based out of Norway, she'd dealt with hundreds of speakers (including me on many occasions), and thousands of attendees at the best conference I've ever been a part of. Plus, she spent a great deal of time coordinating sponsors, corporate attendees and all sorts of other folks that live in the tech world HIBP inhabited. For Charlotte, even though she's not a technical person (her qualifications are in PR and entrepreneurial studies), this was very familiar territory.

So, for the last few years, Charlotte has done absolutely everything that she can to ensure that I can focus on the things that need my attention. She onboards new corporate subscribers, handles masses of tickets for API and domain subscribers and does all the accounting and tax work. And she does this tirelessly every single day at all sorts of hours whether we're at home or travelling. She is... amazing 🤩

Earlier this year, Stefán Jökull Sigurðarson started working for us part time writing code, cleaning up code, migrating code and, well, doing lots of different code things.

A Decade of Have I Been Pwned

Just today I asked Stefán what I should write about him, thinking he'd give me some bullet points I'd massage and then incorporate into this blog post. Instead, I reckon what he wrote was so spot on that I'm just going to quote the entire thing here:

"Just" that having had my eye on the service since it was released and then developing one of the first big integrations with the PwnedPasswords v2 API in EVE, coinciding with us meeting for the first time at NDC Oslo in 2018 shortly after,  HIBP has managed to take me on this awesome journey where it has been a part of launching my public speaking career, contributing to OSS with Pwned Passwords, becoming an MVP and helped me meet a bunch of awesome people and allowed me to contribute to a better and hopefully safer internet. I'm very happy and honoured to a be a part of this project which is full of awesome challenges and interesting problems to deal with. Having meeting invites from the FBI in my inbox a few years after doing a few experimental rest calls to the Pwned Passwords API in early 2018 was definitely not something I was expecting 😅

What really resonated with me in Stefán's message is that for him, this isn't just a job, it's a passion. His journey is my journey in that we freely devoted our time to do something we love and it led to many wonderful things, including MVP roles and speaking at "Charlotte's" conference, NDC. Stefán is based in Iceland, but we've still had many opportunities to share beers together and establish a relationship that transcends merely writing code. I can't think of anyone better to do what he does today.

Breaches

731 breaches later, here we are. So, what stands out? Just going off the top of my head here:

Ashley Madison. Everyone knows the name so it needs no introduction, but that incident in 2015 had a major impact on HIBP in terms of use of the service, and also a major impact on me in terms of the engagements I had with impacted parties. My blog post on Here’s what Ashley Madison members have told me still feels harrowing to read.

Collection #1. This is the one that really contributed to my stress levels in early 2019 and had a profound impact on my decision to look at selling the service. Read about where those 773M records came from (still the largest breach in HIBP to date).

Rosebutt. Don't make a joke about it, don't make a joke about it, don't... aw man, thanks The Register! (link to an archive.org version as they seem to have thought better of their image choice later on...) The point is that even serious data breaches can have their moments of levity.

Shit Express. Sometimes, you just need a bit of hilarity in your data breach. Shit Express is literally a site to send other people pieces of that - anonymously - and they got breached, thus somewhat affecting their anonymity. The more serious point is that as I later wrote, claims of anonymity are often highly misleading.

Future

I often joke about my life being very much about getting up each morning, reading my emails and events from overnight and then just winging it from there. Of course there are the occasional scheduled things not to mention travel commitments, but for the most part it's very much just rolling with whatever is demanding attention on the day. This is also probably a significant part of why I don't really want to see this thing grow into a larger concern with more responsibilities, I just don't want to lose that freedom. Yet...

We're gradually moving in a direction where things become more formalised. 3 years ago, I did 100% of everything myself. 1 year ago, I did everything technical myself. 6 months ago, we had no ticketing system for support. But these are small, incremental steps forward and that's what I'd like to see continuing. I want HIBP to outlive me, I just don't want it to become a burden I'm beholden to in the process. I'd like to have more people involved but as you can see from above, that's been a very slow process with only those very close to me playing a role.

The only thing I have real certainty on at the moment is that there will be more breaches. I've commented many times recently that the scourge that is ransomware feels like it's really accelerated lately, I wonder how many of the people in the emails and documents and all sorts of other data that get dumped there ever learn of their exposure? It's a non-trivial exercise to index that (for all sorts of reasons), but it also seems like an increasingly worthy exercise. Who knows, let's see how I feel when I get up tomorrow morning 🙂

Finally, for this week's regular video, I'm going to make a birthday special and do it live with Charlotte. Please come and join us, I'm not entirely sure what we'll cover (I'll work it out on the morning!) but let's make a virtual 10th birthday party out of it 🎂

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

By Troy Hunt

Allegedly, Acuity had a data breach. That's the context that accompanied a massive trove of data that was sent to me 2 years ago now. I looked into it, tried to attribute and verify it then put it in the "too hard basket" and moved on to more pressing issues. It was only this week as I desperately tried to make some space to process yet more data that I realised why I was short on space in the first place:

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

Ah, yeah - Acuity - that big blue 437GB blob. What follows is the process I went through trying to work out what on earth this thing is, the confusion surrounding the data, the shady characters dealing with it and ultimately, how it's now searchable in Have I Been Pwned (HIBP), which may be what brought you to this blog post in the first place.

One of the first things I do after receiving a data breach is to literally just Google it: acuity data breach. Which immediately yielded this top result from June:

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

Ah, so Acuity is a healthcare company. But wait - here's the next result:

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

That's not about healthcare, that's Acuity Brands. How many breached companies called "Acuity" are there?! Let's see what references I have in my email:

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

Another one 🤦‍♂️ That "breach" could be circumstantial, so we'll call it a "maybe", but it's yet another Acuity with a question mark next to it. So how many "Acuity" companies are out there in total?! Just in the course of investigating this data, I came across a total of 6 of them that as far as I can tell, are completely unrelated:

  1. Acuity Healthcare (definitely breached): acuity.healthcare
  2. Acuity Brands (definitely breached): acuitybrands.com
  3. Acuity Scheduling (maybe breached): acuityscheduling.com
  4. Acuity Insurance: acuity.com
  5. Acuity "Innovative technical solutions for Federal agencies that support the National Security & Public Safety missions": myacuity.com
  6. Acuity Ads: acuityads.com (now redirects to illumin.com)

Ugh, great. We'll work through them and try to figure out where they fit into the picture in a moment, but first let's look at the actual data. We already know it's 437GB, but it's the breadth of column headings that's most stunning; here are all 414 of them:

Just by eyeballing these, it really doesn't feel like the sort of data that comes from a healthcare provider, a brands company or a scheduler. The other 3, however... Maybe.

Some more data points before going further:

  1. The file is named "ACUITY_MASTER_18062020.csv" (this is the date I've elected to stamp the breach with - 18 June 2020)
  2. There are 21,873,706 email addresses in the file
  3. Of those, "only" 14,055,729 are unique so there's some redundancy
  4. The data is cleansed and formatted in a fashion that definitely isn't reflective of how data is entered by end users

On that final point, here's an example of what I'm talking about:

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

The last names are the same, as are the salutations. The physical addresses are spot-on accurate in their structure, as are the phone numbers; there are no spaces, no dashes and no other artifacts typical of millions of different humans entering data. This is clean - too clean.
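
To get a sense of just how uniform that formatting is, one rough approach is to reduce each value to a "shape" and count the distinct shapes; millions of humans typing phone numbers produce lots of shapes, whereas a cleansed aggregator export tends to collapse to one or two. Here's a minimal Node.js sketch of that idea - the "phone" column name and the naive comma splitting are assumptions for illustration only, and a real pass over 437GB would want a proper streaming CSV parser:

const fs = require('fs');
const readline = require('readline');

// Stream the file line by line and tally the distinct "shapes" of the phone column,
// where every digit is collapsed to '#' (e.g. "0412 345 678" becomes "#### ### ###").
async function countPhoneShapes(path) {
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  const shapes = new Map();
  let header = null;
  for await (const line of rl) {
    const cols = line.split(','); // naive split: ignores quoted commas, fine for a sketch
    if (!header) { header = cols; continue; }
    const phone = cols[header.indexOf('phone')]; // hypothetical column name
    if (!phone) continue;
    const shape = phone.replace(/\d/g, '#');
    shapes.set(shape, (shapes.get(shape) || 0) + 1);
  }
  console.log([...shapes.entries()].sort((a, b) => b[1] - a[1]).slice(0, 10));
}

countPhoneShapes('ACUITY_MASTER_18062020.csv');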

The "datasource" field is another interesting data point with the top 10 values being:

  1. Buy.com
  2. Popularliving.com
  3. studentsreview.com
  4. TAGGED.COM
  5. jamster.com
  6. Expedia.com
  7. cbsmarketwatch.com
  8. netflix.com
  9. selfwealthsystem.com
  10. gocollegedegree.com

Each of these entries appeared at least hundreds of thousands of times, if not millions. Does that mean that Netflix, for example, provided customer data to this list? Almost certainly not, but it does feel reminiscent of the Acxiom / Live Ramp misattribution post I wrote a year ago where I listed full counts of a similar column. One of the top values there was also "TAGGED.COM" (also all in uppercase), alongside several other values that also appeared in both sources.

Back to attribution and a post on a popular hacking forum jumps out:

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

Many things here line up, for example the column names that are unique to this data source, including "estimatedincomecode", "del_point_check_digit" and "secondaryaddresspresent". The attribution is to the insurance company named "Acuity", but is that accurate? Insurance companies collect a lot of data as it's relevant to how they run their business, but that data is highly unlikely to include fields such as:

  1. SpectatorSportsBasketball
  2. SewingKnittingNeedlework
  3. PresenceOfUpscaleRetailCard

That's much more in the "data enrichment" space where a company sells a massive data set so that it can expand the profile data of the purchaser's existing customer base. It's a legitimate, honest, legal business model. It's also indistinguishable from this:

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

Hey, it's 437GB! And the column names line up! And it's called Acuity! Slightly different column count to mine (and similar but different to the hacker forum post), and slightly different email count, but the similarities remain striking. How I got to this resource is also interesting, having come via someone I was discussing the data with a couple of years ago:

Acuity Who? Attempts and Failures to Attribute 437GB of Breached Data

The YouTube video is a walkthrough of a campaign management tool to send emails to customers. Could that indicate the data came from Acuity Ads (now Illumin)? No, not in and of itself; the walkthrough there isn't that dissimilar to other campaign tools I've used in the past. No matter how much I looked, I just couldn't find a solid lead back to Acuity Ads and anything even remotely related was merely circumstantial. It could be from them, but it could also be from many other places, and the mere fact that a near identical corpus of data was sitting there on an outright spam site only makes the whole mystery that much deeper. There was just one more interesting data point in that email:

i myself am in that dataset and i've been getting 100x more phishing/scam calls, emails, and physical mail

Let me end this with a best guess: this feels like the same situation as the massive Master Deeds incident in South Africa in 2017. In that case, a legally operating data aggregator (I think you know how I feel about those by now...) sold personal information to a real estate business which then left it publicly exposed. I say it feels the same because it's just such a clean set of data and it's clearly very comprehensive in terms of the columns. It's exactly what I'd expect a data aggregator to prepare and sell to other businesses so they could identify which of their existing customers likes needlework.

In the past, publishing blog posts like this has helped identify an origin service and if that happens again here then I'll be sure to provide an update. For now, I've loaded it into HIBP and flagged it as a spam list which means it won't impact the size of anyone's domains and bump them into a different subscription level. If you do have any interesting insights on this data, please leave a comment below and with any luck, one of the Acuity entities out there will emerge as the source.

Note: just after loading the data, I ran the calcs on how many of the addresses were pre-existing in HIBP. This seems like a statistically significant number 😲

So, 100% (just under actually, but it rounded up). Working through a bunch of sample addresses, they appeared across all sorts of other existing spam lists and dodgy data aggregator breaches. Who knows which ones came first, just more data in the big swimming pool of breaches. https://t.co/Ux2rw6uaAk

— Troy Hunt (@troyhunt) November 15, 2023

Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset

By Troy Hunt
Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset

Edit (1 day later): After posting this, the party responsible for leaking the data turned around and said "that was only a small part of it, here's the whole thing", and released a further 14M records. I've added those into HIBP and will shortly be re-sending notifications to people monitoring domains as the count of impacted addresses will likely have changed. Everything else about the subsequent dataset is consistent with what you'll read below in terms of structure, patterns and conclusions.

The same threat actor has leaked larger amounts of data from LinkedIn dated 2023. They claim this new data contains 35M lines and is 12 GB uncompressed. They also issue an apology to @troyhunt. #Breach #Clearnet #DarkWeb #DarkWebInformer #Database #Leaks #Leaked #LinkedIn https://t.co/qBFAofvppU pic.twitter.com/Clg5o92b6t

— Dark Web Informer (@DarkWebInformer) November 7, 2023

I like to think of investigating data breaches as a sort of scientific search for truth. You start out with a theory (a set of data coming from an alleged source), but you don't have a vested interest in whether the claim is true or not; rather, you follow the evidence and see where it leads. Verification that supports the alleged source is usually quite straightforward, but disproving a claim can be a rather time-consuming exercise, especially when a dataset contains fragments of truth mixed in with data that is anything but. Which is what we have here today.

To lead with the conclusion and save you reading all the details if you're not inclined, the dataset so many people flagged to me this week titled "Linkedin Database 2023 2.5 Millions" turned out to be a combination of publicly available LinkedIn profile data and 5.8M email addresses mostly fabricated from a combination of first and last name. It all began with this tweet:

A threat actor has allegedly leaked a database from LinkedIn @LinkedIn dated 2023. They claim the database shows emails, profile data, phones, full names, and more confidential info. #Breach #Clearnet #DarkWeb #DarkWebInformer #Database #Leaks #Leaked #LinkedIn pic.twitter.com/8MQecKc1vz

— Dark Web Informer (@DarkWebInformer) November 4, 2023

All good lies are believable at face value; is it feasible a massive corpus of LinkedIn data is floating around? Well, they were proper breached in 2012 to the tune of 164M records (by which I mean that incident was genuinely internal data such as email addresses and passwords extracted via a vulnerability), then they were massively scraped in 2021 with another 126M records going into Have I Been Pwned (HIBP). So, when you see a claim like the one above, it seems highly feasible at face value, which is exactly how many people took it. But I'm a bit more suspicious than most people 🙂

First, the claim:

This one is similar to my twitter data scrapped [sic] but for linkedin plus 2023

Now, there's a whole debate about whether scraped data is breached data and indeed whether the definition of it even matters. With the rising prevalence of scraped data, this topic came up enough that I wrote a dedicated blog post about it a couple of years ago and concluded the following in terms of how we should define the term "breach":

A data breach occurs when information is obtained by an unauthorised party in a fashion in which it was not intended to be made available

Which makes scrapes like this alleged one a breach. If indeed it was accurate, LinkedIn data had been taken and redistributed in a way it was never intended to be by either the service itself or the individuals whose data was in this corpus. So, it's something to take seriously, and that warranted further investigation.

I scrolled through the 10M+ rows of data (many records spanned multiple rows due to line returns), and my eyes fell on a fellow Aussie who for the purposes of this exercise we'll call "EM", being the initials of her first and last name. Whilst the data I'm going to refer to is either public by design or fabricated, I don't want to use a real person as an example without their consent so let's just play it safe. Here's a fragment of EM's record:

Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset

There are 5 noteworthy parts of this that immediately caught my attention:

  1. There are 5 different email addresses here with the alias for each one represented in "[first name].[last name]@" form. These exist in a column titled "PROFILE_USERNAMES". (Incidentally, this is why the headline of 2.5M accounts expands out to 5.8M email addresses as there are often multiple addresses per account.)
  2. There's a LinkedIn profile ID in the form of "[first name]-[last name]-[random hexadecimal chars]" under a column titled "PROFILE_LINKEDIN_ID". That successfully loaded EM's legitimate profile at https://www.linkedin.com/in/[id]/
  3. The numeric value in the "PROFILE_LINKEDIN_MEMBER_ID" column matched with the value on EM's profile from the previous point.
  4. The 2 dates starting with "2020-" are in columns titled "PROFILE_FETCHED_AT" and "PROFILE_LINKEDIN_FETCHED_AT". I assume these are self-explanatory.
  5. EM's first and last name, precisely as it appears in each of her 5 email addresses.

On its own, this record would be unremarkable. It'd be entirely feasible - this could very well be legit - except when you keep looking through the remainder of the data. A pattern quickly emerged and I'm going to bold it here because it's the smoking gun that ultimately indicates that a bunch of this data is fake:

Every single record with multiple email addresses had exactly the same alias on completely unrelated domains and it was almost always in the form of "[first name].[last name]@".

Representing email addresses in this fashion is certainly common, but it's far from ubiquitous, and that's easy to demonstrate. For example, I have tons of emails from Pluralsight so I dug one out from my friend "CU":

Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset

There's no dot, rather a dash. Every single real Pluralsight email address I looked at was a dash rather than a dot, yet when I delved into the alleged LinkedIn data and dug out another sample Pluralsight address, here's what I found:

Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset

That's not LM's real address because it has a dot instead of a dash. Every. Single. One. Is. Fake.

Let's try this the other way around and load up the existing breached accounts in HIBP for the domain of one of EM's alleged email addresses and see how they're formed:

Hackers, Scrapers & Fakers: What's Really Inside the Latest LinkedIn Dataset

That's definitely not the same format as EM's address, not by a long shot. And time and time again, the same pattern of addresses in the corpus of data in the original tweet emerged, drawing me to what seems to be a pretty logical conclusion:

Each email address was fabricated by taking the actual domain of a company the individual legitimately worked at and then constructing the alias from their name.
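
That hypothesis is easy to test programmatically: for any record with multiple addresses, check whether every alias reduces to exactly "[first name].[last name]". Here's a minimal sketch of that check - the column names come straight from the data's own headings, but how multiple usernames are delimited within a record is an assumption on my part:

// Does every alias in a record reduce to exactly "first.last"? Real, human-created
// mailboxes vary (dashes, initials, nicknames); identical first.last aliases across
// completely unrelated domains fit the fabrication pattern described above.
function looksFabricated(record) {
  const expected = `${record.PROFILE_FIRST_NAME}.${record.PROFILE_LAST_NAME}`.toLowerCase();
  const aliases = record.PROFILE_USERNAMES      // e.g. "jane.smith@a.com jane.smith@b.com"
    .split(/[,;\s]+/)
    .filter(Boolean)
    .map(addr => addr.split('@')[0].toLowerCase());
  return aliases.length > 1 && aliases.every(alias => alias === expected);
}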

And these are legitimate companies too because every single LinkedIn profile I checked had all the cues of accurate information and each domain I checked in the corpus of data was indeed the correct one for the company they worked at. I imagine someone has effectively worked through the following logic:

  1. Get a list of LinkedIn profiles whether that be by ID or username or simply parsing them out of crawler results
  2. Scrape the profiles and pull down legitimate information about each individual, including their employment history
  3. Resolve the domain for each company they worked at and construct the email addresses
  4. Profit?

On that final point, what is the point? The data wasn't being sold in that original tweet, rather it was freely downloadable. But per the date on EM's profile, the data could have been obtained much earlier and previously monetised. And on that, the date wasn't constant across records, rather there was a broad range of them as recent as July last year and as old as... well, I stopped when the records got older than me. What is this?!

I suspect the answer may partly lie in the column headings which I've pasted here in their entirety:

"PROFILE_KEY", "PROFILE_USERNAMES", "PROFILE_SPENDESK_IDS", "PROFILE_LINKEDIN_PUBLIC_IDENTIFIER", "PROFILE_LINKEDIN_ID", "PROFILE_SALES_NAVIGATOR_ID", "PROFILE_LINKEDIN_MEMBER_ID", "PROFILE_SALESFORCE_IDS", "PROFILE_AUTOPILOT_IDS", "PROFILE_PIPL_IDS", "PROFILE_HUBSPOT_IDS", "PROFILE_HAS_LINKEDIN_SOURCE", "PROFILE_HAS_SALES_NAVIGATOR_SOURCE", "PROFILE_HAS_SALESFORCE_SOURCE", "PROFILE_HAS_SPENDESK_SOURCE", "PROFILE_HAS_ASGARD_SOURCE", "PROFILE_HAS_AUTOPILOT_SOURCE", "PROFILE_HAS_PIPL_SOURCE", "PROFILE_HAS_HUBSPOT_SOURCE", "PROFILE_FETCHED_AT", "PROFILE_LINKEDIN_FETCHED_AT", "PROFILE_SALES_NAVIGATOR_FETCHED_AT", "PROFILE_SALESFORCE_FETCHED_AT", "PROFILE_SPENDESK_FETCHED_AT", "PROFILE_ASGARD_FETCHED_AT", "PROFILE_AUTOPILOT_FETCHED_AT", "PROFILE_PIPL_FETCHED_AT", "PROFILE_HUBSPOT_FETCHED_AT", "PROFILE_LINKEDIN_IS_NOT_FOUND", "PROFILE_SALES_NAVIGATOR_IS_NOT_FOUND", "PROFILE_EMAILS", "PROFILE_PERSONAL_EMAILS", "PROFILE_PHONES", "PROFILE_FIRST_NAME", "PROFILE_LAST_NAME", "PROFILE_TEAM", "PROFILE_HIERARCHY", "PROFILE_PERSONA", "PROFILE_GENDER", "PROFILE_COUNTRY_CODE", "PROFILE_SUMMARY", "PROFILE_INDUSTRY_NAME", "PROFILE_BIRTH_YEAR", "PROFILE_MARVIN_SEARCHES", "PROFILE_POSITION_STARTED_AT", "PROFILE_POSITION_TITLE", "PROFILE_POSITION_LOCATION", "PROFILE_POSITION_DESCRIPTION", "PROFILE_COMPANY_NAME", "PROFILE_COMPANY_LINKEDIN_ID", "PROFILE_COMPANY_LINKEDIN_UNIVERSAL_NAME", "PROFILE_COMPANY_SALESFORCE_ID", "PROFILE_COMPANY_SPENDESK_ID", "PROFILE_COMPANY_HUBSPOT_ID", "PROFILE_SKILLS", "PROFILE_LANGUAGES", "PROFILE_SCHOOLS", "PROFILE_EXTERNAL_SEARCHES", "PROFILE_LINKEDIN_HEADLINE", "PROFILE_LINKEDIN_LOCATION", "PROFILE_SALESFORCE_CREATED_AT", "PROFILE_SALESFORCE_STATUS", "PROFILE_SALESFORCE_LAST_ACTIVITY_AT", "PROFILE_SALESFORCE_OWNER_CONTACT_ID", "PROFILE_SALESFORCE_OWNER_CONTACT_NAME", "PROFILE_SPENDESK_SIGNUP_AT", "PROFILE_SPENDESK_DELETED_AT", "PROFILE_SPENDESK_ROLES", "PROFILE_SPENDESK_AVERAGE_NPS_SCORE", "PROFILE_SPENDESK_NPS_SCORES_COUNT", "PROFILE_SPENDESK_FIRST_NPS_SCORE", "PROFILE_SPENDESK_LAST_NPS_SCORE", "PROFILE_SPENDESK_LAST_NPS_SCORE_SENT_AT", "PROFILE_SPENDESK_PAYMENTS_COUNT", "PROFILE_SPENDESK_TOTAL_EUR_SPENT", "PROFILE_SPENDESK_ACTIVE_SUBSCRIPTIONS_COUNT", "PROFILE_SPENDESK_LAST_ACTIVITY_AT", "PROFILE_AUTOPILOT_MAIL_CLICKED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_CLICKED_AT", "PROFILE_AUTOPILOT_MAIL_OPENED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_OPENED_AT", "PROFILE_AUTOPILOT_MAIL_RECEIVED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_RECEIVED_AT", "PROFILE_AUTOPILOT_MAIL_UNSUBSCRIBED_AT", "PROFILE_AUTOPILOT_MAIL_REPLIED_AT", "PROFILE_AUTOPILOT_LISTS", "PROFILE_AUTOPILOT_SEGMENTS", "PROFILE_HUBSPOT_CFO_CONNECT_SLACK_MEMBER_STATUS", "PROFILE_HUBSPOT_IS_CFO_CONNECT_MEETUPS_MEMBER", "PROFILE_HUBSPOT_CFO_CONNECT_AREAS_OF_EXPERTISE", "PROFILE_HUBSPOT_CORPORATE_FINANCE_EXPERIENCE_YEARS_RANGE"

Check out some of those names: LinkedIn is obviously there, but so is Salesforce and Spendesk and Hubspot, among others. This reads more like an aggregation of multiple sources than it does data solely scraped from LinkedIn. My hope is that in posting this someone might pop up and say "I recognise those column headings, they're from..." Who knows.

So, here's where that leaves us: this data is a combination of information sourced from public LinkedIn profiles, fabricated email addresses and, in part (anecdotally, based on simply eyeballing the data, a small part), the other sources in the column headings above. But the people are real, the companies are real, the domains are real and in many cases, the email addresses themselves are real. There are over 1.8k HIBP subscribers in the data set and these are folks who have double opted-in, so they've successfully received an email to that address in the past. Further, when the data was loaded into HIBP there were nearly a million email addresses that were already in the system, so evidently they were addresses that had previously been in use. Which stands to reason, because even if every address was constructed by an algorithm, the pattern is common enough that there'll be a bunch of hits.

Because the conclusion is that there's a significant component of legitimate data in this corpus, I've loaded it into HIBP. But because there are also a significant number of fabricated email addresses in there, I've flagged it as a spam list which means the addresses won't impact the scale of anyone's paid subscription if they're monitoring domains. And whilst I know some people will suggest it shouldn't go in at all, time and time again when I've polled the public about similar incidents the overwhelming majority of people have said "we want to know about it then we'll make up our own minds what action needs to be taken". And in this case, even if you find an email address on your domain that doesn't actually exist, that person who either currently works at your company or previously did has still had their personal data dumped in this corpus. That's something most people will still want to know.

Lastly, one of the main reasons I decided to invest hours into this today is that I loathe disinformation and I hate people using that to then make statements that are completely off base. I'm looking at my Twitter feed now and see people angry at LinkedIn for this, blaming an insider due to recent layoffs there, accusing them of mishandling our data and so on and so forth. No, not this time, the evidence has led us somewhere completely different.

68k Phishing Victims are Now Searchable in Have I Been Pwned, Courtesy of CERT Poland

By Troy Hunt
68k Phishing Victims are Now Searchable in Have I Been Pwned, Courtesy of CERT Poland

Last week I was contacted by CERT Poland. They'd observed a phishing campaign that had collected 68k credentials from unsuspecting victims and asked if HIBP may be used to help alert these individuals to their exposure. The campaign began with a typical email requesting more information:

68k Phishing Victims are Now Searchable in Have I Been Pwned, Courtesy of CERT Poland

In this case, the email contained a fake purchase order attachment which requested login credentials that were then posted back to infrastructure controlled by the attacker:

68k Phishing Victims are Now Searchable in Have I Been Pwned, Courtesy of CERT Poland

All in all, CERT Poland identified 202 other phishing campaigns using the same infrastructure which has subsequently been taken offline. Data accumulated by the malicious activity spanned from October 2022 until just last week.

The advice to impacted individuals is as follows:

  1. Get a digital password manager to help you make all passwords strong and unique
  2. If you've been reusing passwords, change them to strong and unique versions now, starting with the most important services you use
  3. Turn on multi-factor authentication wherever it's available, especially for important accounts such as email, social media and banking
  4. Never open attachments or follow links unless you're confident in the trustworthiness of their origin and if in doubt, delete the email

Data From The Qakbot Malware is Now Searchable in Have I Been Pwned, Courtesy of the FBI

By Troy Hunt
Data From The Qakbot Malware is Now Searchable in Have I Been Pwned, Courtesy of the FBI

Today, the US Justice Department announced a multinational operation involving actions in the United States, France, Germany, the Netherlands, and the United Kingdom to disrupt the botnet and malware known as Qakbot and take down its infrastructure. Beyond just taking down the backbone of the operation, the FBI began actively intercepting traffic from the botnet and instructing infected machines to uninstall the malware:

To disrupt the botnet, the FBI was able to redirect Qakbot botnet traffic to and through servers controlled by the FBI, which in turn instructed infected computers in the United States and elsewhere to download a file created by law enforcement that would uninstall the Qakbot malware

As part of the operation, the FBI have requested support from Have I Been Pwned (HIBP) to help notify impacted victims of their exposure to the malware. We provided similar support in 2021 with the Emotet botnet, although this time around with a grand total of 6.43M impacted email addresses. These are now all searchable in HIBP, albeit with the incident flagged as "sensitive" so you'll need to verify you control the email address via the notification service first, or you can search any domains you control via the domain search feature. Further, the passwords from the malware will shortly be searchable in the Pwned Passwords service which can either be checked online or via the API. Pwned Passwords is presently requested 5 and a half billion times each month to help organisations prevent people from using known compromised passwords.
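
If you want to check passwords against Pwned Passwords yourself, the range API uses a k-anonymity model: you hash the password locally with SHA-1 and only ever send the first 5 hex characters of that hash. Here's a small Node.js sketch of a lookup against the publicly documented endpoint (error handling omitted for brevity):

const crypto = require('crypto');

// Returns how many times the password appears in Pwned Passwords, or 0 if never seen.
// Only the first 5 characters of the SHA-1 hash ever leave your machine.
async function pwnedCount(password) {
  const sha1 = crypto.createHash('sha1').update(password).digest('hex').toUpperCase();
  const prefix = sha1.slice(0, 5);
  const suffix = sha1.slice(5);
  const res = await fetch(`https://api.pwnedpasswords.com/range/${prefix}`);
  const body = await res.text(); // lines of "HASH_SUFFIX:COUNT"
  for (const line of body.split('\n')) {
    const [hashSuffix, count] = line.trim().split(':');
    if (hashSuffix === suffix) return parseInt(count, 10);
  }
  return 0;
}

pwnedCount('P@ssw0rd').then(count => console.log(`Seen ${count} times`));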

Guidance for those impacted by this incident is the same tried and tested advice given after previous malware incidents:

  1. Keep security software such as antivirus up to date with current definitions. I personally use Microsoft Defender which is free, built into Windows and updates automatically via Windows Update.
  2. If you're reusing passwords across services, get a password manager and change them to be strong and unique.
  3. Enable multi-factor authentication where supported, at least for your most important services (email, banking, social, etc.)
  4. For administrators with affected users, CISA has a report which explains the malware in more detail, including links to YARA rules to help identify the presence of the malware within your network.

Fighting API Bots with Cloudflare's Invisible Turnstile

By Troy Hunt
Fighting API Bots with Cloudflare's Invisible Turnstile

There's a "hidden" API on HIBP. Well, it's not "hidden" insofar as it's easily discoverable if you watch the network traffic from the client, but it's not meant to be called directly, rather only via the web app. It's called "unified search" and it looks just like this:

Fighting API Bots with Cloudflare's Invisible Turnstile

It's been there in one form or another since day 1 (so almost a decade now), and it serves a sole purpose: to perform searches from the home page. That is all - only from the home page. It's called asynchronously from the client without needing to post back the entire page and by design, it's super fast and super easy to use. Which is bad. Sometimes.

To understand why it's bad we need to go back in time all the way to when I first launched the API that was intended to be consumed programmatically by other people's services. That was easy, because it was basically just documenting the API that sat behind the home page of the website already, the predecessor to the one you see above. And then, unsurprisingly in retrospect, it started to be abused so I had to put a rate limit on it. Problem is, that was a very rudimentary IP-based rate limit and it could be circumvented by someone with enough IPs, so fast forward a bit further and I put auth on the API which required a nominal payment to access it. At the same time, that unified search endpoint was created and home page searches updated to use that rather than the publicly documented API. So, 2 APIs with 2 different purposes.

The primary objective for putting a price on the public API was to tackle abuse. And it did - it stopped it dead. By attaching a rate limit to a key that required a credit card to purchase it, abusive practices (namely enumerating large numbers of email addresses) disappeared. This wasn't just about putting a financial cost to queries, it was about putting an identity cost to them; people are reluctant to start doing nasty things with a key traceable back to their own payment card! Which is why they turned their attention to the non-authenticated, non-documented unified search API.

Let's look at a 3 day period of requests to that API earlier this year, keeping in mind this should only ever be requested organically by humans performing searches from the home page:

Fighting API Bots with Cloudflare's Invisible Turnstile

This is far from organic usage with requests peaking at 121.3k in just 5 minutes. Which poses an interesting question: how do you create an API that should only be consumed asynchronously from a web page and never programmatically via a script? You could chuck a CAPTCHA on the front page and require that be solved first but let's face it, that's not a pleasant user experience. Rate limit requests by IP? See the earlier problem with that. Block UA strings? Pointless, because they're easily randomised. Rate limit an ASN? It gets you part way there, but what happens when you get a genuine flood of traffic because the site has hit the mainstream news? It happens.

Over the years, I've played with all sorts of combinations of firewall rules based on parameters such as geolocations with incommensurate numbers of requests to their populations, JA3 fingerprints and, of course, the parameters mentioned above. Based on the chart above these obviously didn't catch all the abusive traffic, but they did catch a significant portion of it:

Fighting API Bots with Cloudflare's Invisible Turnstile

If you combine it with the previous graph, that's about a third of all the bad traffic in that period or in other words, two thirds of the bad traffic was still getting through. There had to be a better way, which brings us to Cloudflare's Turnstile:

With Turnstile, we adapt the actual challenge outcome to the individual visitor or browser. First, we run a series of small non-interactive JavaScript challenges gathering more signals about the visitor/browser environment. Those challenges include, proof-of-work, proof-of-space, probing for web APIs, and various other challenges for detecting browser-quirks and human behavior. As a result, we can fine-tune the difficulty of the challenge to the specific request and avoid ever showing a visual puzzle to a user.

"Avoid ever showing a visual puzzle to a user" is a polite way of saying they avoid the sucky UX of CAPTCHA. Instead, Turnstile offers the ability to issue a "non-interactive challenge" which implements the sorts of clever techniques mentioned above and as it relates to this blog post, that can be an invisible non-interactive challenge. This is one of 3 different widget types with the others being a visible non-interactive challenge and a non-intrusive interactive challenge. For my purposes on HIBP, I wanted a zero-friction implementation nobody saw, hence the invisible approach. Here's how it works:

Fighting API Bots with Cloudflare's Invisible Turnstile

Get it? Ok, let's break it down further as it relates to HIBP, starting with when the front page first loads and it embeds the Turnstile widget from Cloudflare:

<script src="https://challenges.cloudflare.com/turnstile/v0/api.js" async defer></script>

The widget takes responsibility for running the non-interactive challenge and returning a token. This needs to be persisted somewhere on the client side which brings us to embedding the widget:

<div ID="turnstileWidget" class="cf-turnstile" data-sitekey="0x4AAAAAAADY3UwkmqCvH8VR" data-callback="turnstileCompleted"></div>

Per the docs in that link, the main thing here is to have an element with the "cf-turnstile" class set on it. If you happen to go take a look at the HIBP HTML source right now, you'll see that element precisely as it appears in the code block above. However, check it out in your browser's dev tools so you can see how it renders in the DOM and it will look more like this:

Fighting API Bots with Cloudflare's Invisible Turnstile

Expand that DIV tag and you'll find a whole bunch more content set as a result of loading the widget, but that's not relevant right now. What's important is the data-token attribute because that's what's going to prove you're not a bot when you run the search. How you implement this from here is up to you, but what HIBP does is pick up the token and set it in the "cf-turnstile-response" header, then send it along with the request when that unified search endpoint is called:
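
Put together, the client side ends up looking something like the sketch below. The "turnstileCompleted" callback name comes from the data-callback attribute above; the endpoint path, form name and fallback behaviour are illustrative assumptions rather than HIBP's exact implementation:

let turnstileToken = null;

// Invoked by the Turnstile widget once the invisible challenge has been solved.
function turnstileCompleted(token) {
  turnstileToken = token;
}

// Called when the user searches from the home page.
async function unifiedSearch(account) {
  const res = await fetch(`/unifiedsearch/${encodeURIComponent(account)}`, { // path is an assumption
    headers: { 'cf-turnstile-response': turnstileToken }
  });
  if (res.status === 401) {
    // Token missing, invalid or already consumed: fall back to a full page post back
    // which can then apply other controls such as an interactive challenge.
    document.forms['searchForm'].submit(); // hypothetical form name
    return null;
  }
  return res.json();
}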

Fighting API Bots with Cloudflare's Invisible Turnstile

So, at this point we've issued a challenge, the browser has solved the challenge and received a token back, now that token has been sent along with the request for the actual resource the user wanted, in this case the unified search endpoint. The final step is to validate the token and for this I'm using a Cloudflare worker. I've written a lot about workers in the past so here's the short pitch: it's code that runs in each one of Cloudflare's 300+ edge nodes around the world and can inspect and modify requests and responses on the fly. I already had a worker to do some other processing on unified search requests, so I just added the following:

const token = request.headers.get('cf-turnstile-response');

if (token === null) {
    return new Response('Missing Turnstile token', { status: 401 });
}

const ip = request.headers.get('CF-Connecting-IP');

let formData = new FormData();
formData.append('secret', '[secret key goes here]');
formData.append('response', token);
formData.append('remoteip', ip);

const turnstileUrl = 'https://challenges.cloudflare.com/turnstile/v0/siteverify';
const result = await fetch(turnstileUrl, {
    body: formData,
    method: 'POST',
});
const outcome = await result.json();

if (!outcome.success) {
    return new Response('Invalid Turnstile token', { status: 401 });
}

That should be pretty self-explanatory and you can find the docs for this on Cloudflare's server-side validation page which goes into more detail, but in essence, it does the following:

  1. Gets the token from the request header and rejects the request if it doesn't exist
  2. Sends the token, your secret key and the user's IP along to Turnstile's "siteverify" endpoint
  3. If the token is not successfully verified then return 401 "Unauthorised", otherwise continue with the request

And because this is all done in a Cloudflare worker, those 401 responses never even touch the origin. Not only do I not need to process the request in Azure, the person attempting to abuse my API gets a nice speedy response directly from an edge node near them 🙂

So, what does this mean for bots? If there's no token then they get booted out right away. If there's a token but it's not valid then they get booted out at the end. But can't they just take a previously generated token and use that? Well, yes, but only once:

If the same response is presented twice, the second and each subsequent request will generate an error stating that the response has already been consumed.

And remember, a real browser had to generate that token in the first place so it's not like you can just automate the process of token generation then throw it at the API above. (Sidenote: that server-side validation link includes how to handle idempotency, for example when retrying failed requests.) But what if a real human fails the verification? That's entirely up to you but in HIBP's case, that 401 response causes a fallback to a full page post back which then implements other controls, for example an interactive challenge.

Time for graphs and stats, starting with the one in the hero image of this page where we can see the number of times Turnstile was issued and how many times it was solved over the week prior to publishing this post:

Fighting API Bots with Cloudflare's Invisible Turnstile

That's a 91% hit rate of solved challenges which is great. That remaining 9% is either humans with a false positive or... bots getting rejected 😎

More graphs, this time how many requests to the unified search page were rejected by Turnstile:

Fighting API Bots with Cloudflare's Invisible Turnstile

That 990k number doesn't marry up with the 476k unsolved ones from before because they're 2 different things: the unsolved challenges are when the Turnstile widget is loaded but not solved (hopefully due to it being a bot rather than a false positive), whereas the 401 responses to the API is when a successful (and previously unused) Turnstile token isn't in the header. This could be because the token wasn't present, wasn't solved or had already been used. You get more of a sense of how many of these rejected requests were legit humans when you drill down into attributes like the JA3 fingerprints:

Fighting API Bots with Cloudflare's Invisible Turnstile

In other words, of those 990k failed requests, almost 40% of them were from the same 5 clients. Seems legit 🤔

And about a third were from clients with an identical UA string:

Fighting API Bots with Cloudflare's Invisible Turnstile

And so on and so forth. The point being that the number of actual legitimate requests from end users that were inconvenienced by Turnstile would be exceptionally small, almost certainly a very low single-digit percentage. I'll never know exactly because bots obviously attempt to emulate legit clients and sometimes legit clients look like bots and if we could easily solve this problem then we wouldn't need Turnstile in the first place! Anecdotally, that very small false positive number stacks up as people tend to complain pretty quickly when something isn't optimal, and I implemented this all the way back in March. Yep, 5 months ago, and I've waited this long to write about it just to be confident it's actually working. Over 100M Turnstile challenges later, I'm confident it is - I've not seen a single instance of abnormal traffic spikes to the unified search endpoint since rolling this out. What I did see initially though is a lot of this sort of thing:

Fighting API Bots with Cloudflare's Invisible Turnstile

By now it should be pretty obvious what's going on here, and it should be equally obvious that it didn't work out real well for them 😊

The bot problem is a hard one for those of us building services because we're continually torn in different directions. We want to build a slick UX for humans but an obtrusive one for bots. We want services to be easily consumable, but only in the way we intend them to... which might be by the good bots playing by the rules!

I don't know exactly what Cloudflare is doing in that challenge and I'll be honest, I don't even know what a "proof-of-space" is. But the point of using a service like this is that I don't need to know! What I do know is that Cloudflare sees about 20% of the internet's traffic and because of that, they're in an unrivalled position to look at a request and make a determination on its legitimacy.

If you're in my shoes, go and give Turnstile a go. And if you want to consume data from HIBP, go and check out the official API docs, the uh, unified search doesn't work real well for you any more 😎

All New Have I Been Pwned Domain Search APIs and Splunk Integration

By Troy Hunt
All New Have I Been Pwned Domain Search APIs and Splunk Integration

I've been teaching my 13-year-old son Ari how to code since I first got him started on Scratch many years ago, gradually progressing through to the current day where he's getting into Python in Visual Studio Code. As I was writing the new domain search API for Have I Been Pwned (HIBP) over the course of this year, I was trying to explain to him how powerful APIs are:

Think of HIBP as one website that does pretty much one thing; you load it in your browser and search through data breaches which then display on the screen. But when you have an API, it's no longer just locked into your browser, it's in all sorts of other systems. Mobile apps, other websites, dashboards and if you really want, you can even integrate the lights in your room with HIBP! Why? How? Well, there's a Home Assistant integration for HIBP and being pwned in a new breach could raise an event there that you can then act on with a bit of YAML, for example flashing a light red. That might be weird and unnecessary, but when you have an API, suddenly all these things you never thought of are possible.

It took Brett Adams less than a day after we released the new domain search API last Monday for him to reach out to me with one of those ideas. He wanted to build a Splunk app (Brett is a Splunk MVP so this was right up his alley) to surface breached data about an organisation's domains right into the place where so many security engineers spend their days. He just wanted 2 new APIs to make the user experience the best it could be:

  1. One that can show you the subscription level for someone's key
  2. One that can show you all the domains they're monitoring

That seems so ridiculously obvious, why didn't I think of that originally?! But hey, easy fix, so the next day Brett had his APIs. And today, you also have the APIs because they're now all publicly documented and ready for you to consume. You also have Brett's Splunk app and because he's published it to Splunkbase, you can go and pull it into your own Splunk instance, plug in your HIBP API key and it's job done!
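
For what it's worth, calling those two new endpoints is as simple as any other authenticated HIBP API. The paths below ("subscription/status" and "subscribeddomains") are my reading of the public API docs, so do confirm them there before building anything on top of this sketch:

const HIBP = 'https://haveibeenpwned.com/api/v3';
const headers = {
  'hibp-api-key': process.env.HIBP_API_KEY, // your existing key
  'user-agent': 'example-splunk-style-integration'
};

// 1. The subscription level for the key
async function getSubscriptionStatus() {
  const res = await fetch(`${HIBP}/subscription/status`, { headers });
  return res.json();
}

// 2. The domains being monitored with that key
async function getMonitoredDomains() {
  const res = await fetch(`${HIBP}/subscribeddomains`, { headers });
  return res.json();
}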

I'll leave you with a bunch of screen caps from Brett's work, starting with a zoomed in grab of what I suspect folks will find the most valuable - the addresses on their domains and their appearances across breaches:

All New Have I Been Pwned Domain Search APIs and Splunk Integration

That's a fragment of the broader dashboard that also breaks down the incidents over time:

All New Have I Been Pwned Domain Search APIs and Splunk Integration

The starting point for this is simply plugging your API key into the interface:

All New Have I Been Pwned Domain Search APIs and Splunk Integration

I like these headline figures and I picture particularly large organisations that have gone through various acquisitions of different brands with various domains finding this really useful:

All New Have I Been Pwned Domain Search APIs and Splunk Integration
All New Have I Been Pwned Domain Search APIs and Splunk Integration
All New Have I Been Pwned Domain Search APIs and Splunk Integration

And speaking of breaches, there's a lot of them which Brett has visualised across the course of time:

All New Have I Been Pwned Domain Search APIs and Splunk Integration

So that's it, you can see all the APIs documented on the HIBP website and you can grab Brett's app right now from Splunkbase. You can also find all the code for this in Brett's GitHub repo should you wish to have a read through it.

The HIBP APIs are there for other people to build awesome things. If you're one of those people, please get in touch with me and show me what you've created, I can't wait to see more integrations like Brett's 😊

Welcome to the New Have I Been Pwned Domain Search Subscription Service

By Troy Hunt
Welcome to the New Have I Been Pwned Domain Search Subscription Service

This is a big one. A massive one. It's the culmination of a solid 7 months of work that finally, as of now, is live. The full back story is in my blog post from mid-June about The Big 5 Announcements but to save you trawling through all of that, here are the cliff notes:

  1. Domain searches in HIBP are resource intensive and the impact was becoming increasingly obvious
  2. More than half the Fortune 500 are using this feature, along with a who's who of big brands
  3. We decided to introduce pricing tiers to the largest domain searches...
  4. ...but also add stuff, most notably domain searches by API and formal support...
  5. ...and remove stuff, most notably the need for verifying control of a domain after you've done it once

I've spent the last 8 weeks since publishing that post crunching numbers, writing code, doing loads of formal things (namely terms of use and privacy policy), and regularly talking about it on my weekly video. I've had loads of enormously useful feedback, much of which has shaped the state of the services we're launching here today. Thank you everyone who contributed, now let me get into it and explain exactly what we've come up with 🙂

The Pricing Structure

We've been thinking about the best way to structure this since January. How do we take something that has been provided for free for almost a decade and put a reasonable price on it? That's a highly subjective word - reasonable - and there'll never be complete consensus, so it's more about passing the pub test where your average person will look at this and go "yeah, that seems fair enough". Let me explain the thinking and how we reached the pricing structure you'll see further down:

Firstly, we wanted most domain searches to remain free. This keeps with the spirit of HIBP's roots being a community service and ensures the data is accessible without barrier to the majority of people. It would also mean that for most people, these changes would have absolutely no impact on the way they've been using the service, not unless they want access to the new bits.

Next, we wanted to divide the commercial offerings into a manageable number of tiers. The public API key has 4 tiers and I reckon that's the sweet spot; it's not too many options, but it's enough to provide a good separation between the scale of each. We then wanted to distribute the number of domains that would fall into the commercial category roughly equally between those 4 tiers, so it was pretty much a matter of taking what was left after the free ones and dividing them into 4 groups and putting a price on them.

Finally, we wanted the first commercial tier to be easily affordable so that most people could access it without thinking twice about it. My measure for that has always been "the cost of a cup of coffee", so I went down to my favourite local and checked what I was blindly paying when I waved my watch in the general direction of the EFTPOS machine:

Welcome to the New Have I Been Pwned Domain Search Subscription Service

$6 Aussie, or just under $4 in USD. Which led us to here (all in USD from now on):

Plan    | Breached addresses | Percent of all domains | Price / m
Pwned 0 | Up to 10           | 60%                    | Free!
Pwned 1 | Up to 25           | 10%                    | $3.95
Pwned 2 | Up to 100          | 10%                    | $16.95
Pwned 3 | Up to 500          | 10%                    | $28.50
Pwned 4 | Unlimited          | 10%                    | $115.00

What you're looking at here is a list of plan names (more on that soon), the size of the domain it covers (expressed in the number of breached email addresses on it), what percentage of all domains presently being monitored in HIBP this represents and, of course, the monthly price. As with the public API, if you subscribe annually then it's "pay for 10, get 12" which means that "Pwned 1" price works out at only $3.25 a month. As I flagged in the earlier post, this is all based around the number of addresses that appear in a breach, with one important caveat I'll expand on later: this number excludes all breaches flagged as a spam list. As a rough rule of thumb, over the years I've found approximately 20% of addresses on a domain have been breached so by that logic, you'll need 55 actual email addresses on a domain before there's a cost. Or up to 130 before it costs more than a coffee a month. (If you're a stickler for detail and are thinking those percentages are too perfect, I've rounded them from their actual values of 59.1%, 9.7%, 11.3%, 10.4% and 9.4%.)

But what if you have multiple domains? Easy - the one plan will cover all your domains within the size of that plan. For example, if you have 3 domains and one has 5 breached addresses, one has 20 and one has 90, you can get a single "Pwned 2" plan and cover them all. Or get a single "Pwned 1" plan and cover just the first 2. It's pretty simple.
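
In other words, the plan you need is driven by your largest single domain (with spam-list breaches excluded, as noted above). A tiny sketch of that selection logic using the figures from the table above:

// Pick the smallest plan whose limit covers your largest domain, per the example above.
const plans = [
  { name: 'Pwned 0', limit: 10, monthlyUsd: 0 },
  { name: 'Pwned 1', limit: 25, monthlyUsd: 3.95 },
  { name: 'Pwned 2', limit: 100, monthlyUsd: 16.95 },
  { name: 'Pwned 3', limit: 500, monthlyUsd: 28.50 },
  { name: 'Pwned 4', limit: Infinity, monthlyUsd: 115.00 }
];

function planFor(breachedCountsPerDomain) {
  const largest = Math.max(...breachedCountsPerDomain);
  return plans.find(plan => largest <= plan.limit);
}

console.log(planFor([5, 20, 90]).name); // "Pwned 2", matching the example above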

So that was our initial thinking - stand this up as a product that sits alongside the existing API key one, then you just purchase whichever one you want. Then, Brendan gave me a much better idea - combine them all together! You can imagine the gears turning in my head as I read his suggestion, and as the days progressed and I gave it more thought, it became a brilliant idea. It massively simplifies the code base, it removes a lot of confusion that I'm sure would have otherwise ensued and perhaps most importantly, it gives you all something more than you would have had otherwise. The one fly in the ointment was the price disparity; the above prices are 13% to 15% higher than the old corresponding API key ones. So, what we've decided to do is run the old prices until 8 October then revise everything to the new prices above. That gives more than 60 days' notice to everyone with an existing API key (we'll have to email everyone anyway as the terms of use have changed to incorporate the domain bits), and there's clear verbiage everywhere about the change for anyone purchasing a new subscription. Plus, it gives everyone a little incentive to lock in for a year now and delay the increase until later in 2024. Thanks Brendan! 😊

So that's the rationale. There's no change for 60% of domains that have previously been searched, a negligible cost for the next 10% of them, with the remainder paying commensurately more based on their scale. But we didn't just want to whack a cost on an existing service and leave you down a few bucks a month with nothing more to show for it, so let's talk about new stuff!

But Wait, There's More!

There are two brand new features we're now offering to all commercial subscribers. Even if your domain is small and has fewer than 10 breached addresses on it, you can still get access to these features via the entry level plan and they're both pretty self-explanatory: API-level access and formal support.

API first as I think it's the coolest and it's exactly what it sounds like: there's now a public endpoint you can throw a domain at and get a JSON response of breached aliases and the incidents they've appeared in. It looks just like this:

GET https://haveibeenpwned.com/api/v3/breacheddomain/{domain}
hibp-api-key: [your key]

Which then responds like this:

{
  "alias1": [
   "Adobe"
  ],
  "alias2": [
    "Adobe",
    "Gawker",
    "Stratfor"
  ],
  "alias3": [
    "AshleyMadison"
  ]
}

If you're already paying for an API key, you have immediate access to this! Same key, same logic in terms of resolving the returned breach name to the full thing via the unauthenticated API that returns breach metadata; the only caveats are that it has to be a domain you've previously demonstrated you control and it has to be within your plan size (e.g. you have a Pwned 1 plan and your domains don't exceed 25 breached addresses). Otherwise:

Subscription upgrade required.

Just one more thing with the domain search API: it only makes sense to hit it after a new breach is loaded. There's absolutely no point in hammering away at it non-stop as you'll only get the same result, so instead try polling the brand new API we've just added to return only the most recent breach (it's massively cached at Cloudflare anyway) and just hit the domain search API when there's a new one. But because not everybody will do this and domain searches are expensive relative to other queries, the terms and conditions include this clause:

Controls such as rate limiting may be added to the domain search API if excessive API requests are made despite no new breaches appearing since the last request.

There is a rate limit based on a variety of factors and it's possible you may receive an HTTP 429 if you request it more frequently than is necessary. The only reason I'm not going into the details of how that works here is that I expect it will adapt and change pretty frequently in response to how people use the service. What I can confidently say now though, is that if you use the domain search feature in the way it's intended to work - querying each domain after a new breach is added - you won't have a problem with rate limits.
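
As a rough sketch of that intended usage pattern: poll the unauthenticated most-recent-breach endpoint (heavily cached at Cloudflare) and only call the domain search API when something new has appeared. The "latestbreach" path is my reading of the new API mentioned above, so confirm it against the docs; the breacheddomain route is the one shown earlier:

const API = 'https://haveibeenpwned.com/api/v3';
const keyHeaders = { 'hibp-api-key': process.env.HIBP_API_KEY };
let lastSeenBreach = null;

async function checkDomainsIfNewBreach(domains) {
  const latest = await (await fetch(`${API}/latestbreach`)).json(); // path is my assumption
  if (latest.Name === lastSeenBreach) return; // nothing new, so don't touch the domain search API
  lastSeenBreach = latest.Name;
  for (const domain of domains) { // domains you've already verified control of
    const res = await fetch(`${API}/breacheddomain/${domain}`, { headers: keyHeaders });
    const breachedAliases = await res.json(); // { "alias": ["BreachName", ...], ... }
    console.log(domain, Object.keys(breachedAliases).length, 'breached aliases');
  }
}

// e.g. run on a schedule:
// setInterval(() => checkDomainsIfNewBreach(['example.com']), 6 * 60 * 60 * 1000);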

I'm really excited to see how people will integrate this data into their existing tooling, do please let me know if you do something awesome 😊

Then there's the formal support which we offer via Zendesk at support.haveibeenpwned.com. That launched with the API key upgrades last November and since that time, we've answered almost 600 tickets. We've been trying to fine tune things to the extent that the knowledge base there answers the most common questions, but there's certainly a great deal of time that still goes into supporting the questions that pop up. Adding domain searches to the mix will inevitably increase that, possibly by an order of magnitude, which is why we're only making this available to commercial subscribers.

So, that's the new bits. If you're in that 60% group of people with smaller domains outside of the commercial tiers, you can get access to both the API and support by subscribing to the smallest possible plan for that cup of coffee a month. We feel that's a pretty reasonable balance, and I hope you do too.

Speaking of reasonable, about those spam lists...

Data Breaches Ain't Data Breaches

I mentioned sharing as much as I could in my weekly update videos, including the intended pricing structure and how it would be based on the number of breached email addresses on a domain. Several people raised a very important point as it related to the calculations: data breaches ain't data breaches or more specifically, there are breaches in HIBP that shouldn't be treated like the other ones as they artificially inflate the pwn count. Could these be excluded?

The Onliner Spambot incident was the worst culprit and in the case of one person that contacted me, it caused his personal domain to read as though hundreds of addresses had been breached when the correct number was... zero. Someone else had their domain pegged at 40 breached addresses whereas once you took this breach out, the number came down to 13. This created somewhat of a rock-and-a-hard-place situation because whilst those aliases did appear in this incident, they weren't real addresses. But what's a "real" email address anyway? Or more specifically, how can I tell via a string alone whether an address is real or not? A decade ago now I wrote about how hard this is and, per the comments on that post, concluded that the only way to tell for sure is to send an email and have the recipient perform some sort of explicit action such as clicking on a link. Clearly, that's not feasible in this situation but equally, putting a price on a service based on a metric that has been artificially inflated just wasn't fair.

Adding spam lists back in 2016 was the right thing to do but equally, excluding them from the number that determines the pricing tier is also the right thing to do. We've tried to make this logic as clear as possible throughout the system and focus on a simple UX that's explicit but can also provide more insight if required:

Welcome to the New Have I Been Pwned Domain Search Subscription Service

And if you're interested in which breaches specifically have been classified as a spam list, I've added a filter to the API that lists all breaches. It's an unauthenticated API you can load directly in your browser via GET request and at the time of writing, has 11 breaches on it with nearly 1.4 billion records.

The very last thing from that screen cap is the "Enable debug mode" link and for that, we need to talk about "domain creep".

Domain Creep, and Getting What You Paid For

Data breaches are obviously an ongoing thing. Always have been, always will be, so what that means is that when you look at a domain today and see, say, 20 breached accounts on it, that might be 30 breached accounts tomorrow. I think everyone who uses HIBP understands that, but it does create a bit of a problem when domain searches are priced on a metric that can "creep". What if you've just paid for a year's worth of Pwned 1 subscription and per the example here, you've suddenly got more than 25 breached accounts on your domain and can no longer search it?

The sentiment of how this should be handled was always obvious: people have to get what they pay for. We didn't want a situation where someone could be left disappointed, and our fear was that the organic increase in breaches could lead to that event. The solution was easy: when you buy a subscription at a certain scale, every domain you're currently monitoring that can be searched on the first day of the subscription can still be searched on the last day of the subscription. If you take out one year of Pwned 1 today and per the example above, the domain creeps beyond 25 breached accounts tomorrow, it'll have zero impact for the next 364 days.

I'm conscious that this concept can get confusing: domain searches are based on the number of breached accounts on the domain, but not including spam lists, and then locked in at the size of the domain until the next subscription renewal... phew! The debug mode link mentioned above aims to show all this logic in its raw detail:

Welcome to the New Have I Been Pwned Domain Search Subscription Service

Even though domain1.com in this example has grown to 26 breached addresses, because it was 22 breached addresses when the subscription was taken out then that's the number it's locked at until it renews in August next year. I hope this is clear enough, do please leave a comment if we can do better.
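
Reduced to code, the rule is simply that the count compared against your plan limit is the one captured when the subscription began, not today's. The field names here are hypothetical, purely to illustrate:

// A domain stays searchable for the whole term if it fit the plan on day 1,
// no matter how much it creeps afterwards.
function isSearchable(domain, plan) {
  return domain.countAtSubscriptionStart <= plan.limit;
}

console.log(isSearchable({ countAtSubscriptionStart: 22, currentCount: 26 }, { limit: 25 })); // true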

Lastly, let me put some raw numbers around the "domain creep" situation as I foresee this causing concern beyond what might be warranted. Let's start with the number of unique email addresses which is approximately 6 billion. There have been about 723M records added in the last 12 months and a bunch of those will be for the same email address (shout out to everyone who was pwned again in the last year!) Further, of that number, most email addresses were already pwned. That's a link through to the Twitter feed where I broadcast the percentage of previously seen addresses and you'll see that number is regularly around the 60% to 70% range. In other words, it's probably in the order of 250M new addresses we've seen in the last year which is appx 4% of the entire corpus. So, yes, over the course of time we'll see domains slip into higher plans, but only at about the rate of CPI.

Finally, locking domain counts for the duration of the subscription creates additional incentive to make it an annual one, and that's beyond the existing incentive of "buy 10 months, get 12 months". That's also in addition to massively cutting down on the number of times you may need to deal with corporate bureaucracy. Speaking of which...

Satisfying Corporate Bureaucracy

Let me start with a story: Many years ago during my lengthy tenure at Pfizer, I pushed hard to drive us away from traditional hosting models and towards modern cloud paradigms, namely the Azure App Service. Here we had a model where you could self-service provision resources that cost about $50 per month and completely replaced a model that was costing us tens of thousands a year. It was an easy win, however... the organisation demanded vendor assessments, compliance paperwork and a billing model which, of course, was favourable to them. But Microsoft's model was "chuck your credit card in and off you go", so that's what one of my colleagues did. And paid for it himself, entirely out of his own pocket in order to save one of the world's largest companies money. My point is that I've done time on the inside and I understand the barriers organisations put in place "because reasons". I touched on this in the June post about the upcoming domain changes:

To be honest, the experience with the public API keys has taught me that it's usually not money that's the barrier to using commercial services, it's corporate procurement bureaucracy. Onboarding documentation. Vendor assessments. Tax forms.

I also have experience from the outside, having regularly received requests to invest hours doing manual labour for the sake of something an organisation is paying a few bucks a month for. That simply doesn't scale, and the whole point of providing services like this at volume is that you can go and set everything up yourself with nothing more than a credit card. This one came in while preparing this blog post:

My company is looking to purchase an API key so we can automate user lookups on your site. Our procurement process is wildly complex and I was wondering if we have the option of submitting a Purchase Order instead of using the Stripe credit card payment method?

If this situation resonates, you have my sympathies; my own corporate bureaucracy scars are still raw! If there's more we can do to ease the onboarding path without creating manual labour on a per-customer basis then please let me know. I'm sure there are improvements that can be made; the last thing I want to see is you ending up like my old mate from Pfizer 😞

We've tried to do everything possible to remove barriers. We've made significant investments in legal counsel to get the terms of use and privacy policy right and we've tried to provide answers to all the regular questions in the FAQs. We've even publicly provided a W-8BEN-E US tax form, which was often requested by folks in the US. But it won't be enough for some organisations, which is why we do exactly what Pfizer often found themselves doing: providing an enterprise-orientated process where we deal with all this rigmarole... and charge accordingly. If that's you, then get in touch with me.

But What About...?

There will be lots of "but what about...?" edge cases. Let me give you some examples and our views on them:

But what about addresses that don't actually exist?
For most data breaches, email addresses are extracted using a regular expression run over the entire corpus of data. You can see what this looks like in the open source email address extractor used to process breaches. So, what is an email address? Per my earlier explanation, it's anything that matches the regex when run across the breach. That could mean strings that aren't actually an address on a domain get caught up and reported incorrectly. It happens, but there's no way to practically stop it and it's extraordinarily rare.
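To make that concrete, here's a tiny Python sketch of the idea. The regex is illustrative only, not the exact pattern the open source extractor uses:

```python
import re

# Illustrative pattern only - "an email address" is simply whatever matches.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

breach_dump = """
alice@example.com,Password1
bob@example.com;hunter2
backup_2017_users@example.com.old  <- matches the regex, but was never a real mailbox
"""

for address in sorted(set(EMAIL_PATTERN.findall(breach_dump))):
    print(address)
```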

But what about email addresses from years ago that still appear as breached on a domain?
The argument here is that whilst these are genuine addresses that did indeed exist at one point, they aren't really relevant anymore, either due to their age or because the address no longer exists (e.g. ex staff). I have both a philosophical and a technical view on this, the former being that data breaches are immutable: at a point in time, addresses were exposed, and that fact can never be reversed. As for the latter, those addresses remain in a storage construct we need to continue to support, and every single domain query needs to pick those addresses up and return them to the code processing the search (the design of HIBP means that Azure's Table Storage returns the entire partition on each domain query). Further, in most cases this doesn't stop the total number of breached accounts being a reasonable metric for organisation size and, subsequently, the pricing tier an organisation should fit into.
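For anyone curious what that partition behaviour looks like in practice, here's a minimal sketch using the azure-data-tables Python SDK. The table name and entity properties are hypothetical; the point is just that a domain search maps to pulling back a whole partition, old addresses included:

```python
from azure.data.tables import TableClient

# Hypothetical layout for illustration: domain as PartitionKey, alias as RowKey.
table = TableClient.from_connection_string(
    conn_str="<storage account connection string>",
    table_name="breachedaccounts",
)

# Every entity in the partition comes back, including aliases for long-departed
# staff - the history is immutable and the storage still has to serve it.
for entity in table.query_entities("PartitionKey eq 'example.com'"):
    print(entity["RowKey"], entity.get("Breaches"))
```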

But what about old breaches I don't care about any more causing me to require a higher plan?
It's a similar answer to the previous point insofar as the immutability of history and the need to store the data. It also remains the most reliable metric we have to determine the size of the domain and in many cases, the organisation that owns it. Think of this measurement primarily as a means of slicing up the corpus of data within HIBP and distributing the cost as equitably as possible across the organisations using the domain search feature.

But what about people who don't want to use a credit card?
I'll give you a two-part answer on this, beginning with the recognition that cards can pose legitimate challenges for some people. Just as I was drafting this blog post, someone trying to sign up to the public API reached out after failing to subscribe multiple times with different cards:

Welcome to the New Have I Been Pwned Domain Search Subscription Service

For a variety of reasons, I believe the guy is legit, but Stripe reports two payments declined by his bank and another due to an invalid CVC. But using Stripe doesn't just mean credit cards; it also means Apple Pay and Google Pay, WeChat Pay in China, EPS in Austria, Afterpay in Australia and a raft of other payment mechanisms in different parts of the world. It's hard to imagine a legitimate case where someone does not have access to any of the available payment mechanisms, which brings me to the second part:

The reason we don't support the likes of anonymous cryptocurrency and rely solely on fiat money payments is that it very quickly weeds out the bad actors. That was the whole rationale for putting a payment gateway on the public API back in 2019 - to cut out the abuse. It turns out that once you have to pass the sort of KYC barriers financial institutions put in place, people don't misbehave under their own identity. And yes, there's always fraudulent use of cards, but Stripe has gotten so good at handling that (we pay for their Radar service as well), our dispute rate is only one in many thousands of transactions.

But what about [other reasons related to calculations and costs]?
Amongst the corpus of 12.6 billion records, there will be anomalies. It'll almost certainly be sub-1% and the anomalies won't be evenly distributed across domains; they'll affect some more than others. It's infeasible to ever get that down to zero and it's also infeasible to respond to every single request I know will come through asking for an anomaly to be rectified. The most practical way we could find to deal with this is to keep the pricing structure such that anomalies will be unlikely to have much impact of consequence.

We're also conscious that some people will challenge the cost, and it happens all the time with the existing public API key, either because of the individual's position in life or the nature of the organisation they work in. But this is why we've structured it as we have, with the majority of domains being within that free tier and the entry-level cost being the price of a cup of coffee that gets you things like API-level access and formal support. This was the most reasonable, equitable model we could come up with and I hope that shines through in the explanations above.

Summary

I know there'll be individuals with catch all domains that have ended up in a couple of dozen data breaches and they think paying $3.95 to see them is unreasonable. I know there'll be organisations with much larger numbers who feel it's unreasonable because similarly sized orgs are more profitable. But I also know that I've been running domain searches totally out of my own pocket for almost a decade so whilst I'm sympathetic to anyone who now needs to pay for a service that was previously free, I'm also comfortable that a reasonable and well thought out model has been arrived at.

I'm excited to see what people do with the new API. The email address search one is presently requested millions of times a day and people have built all sorts of amazing things with it, everything from corporate awareness campaigns to tooling to help protect customers from account takeover attacks to integration within the corporate SOC. It's cases like that last one where I think the domain search API will really shine and if you do something awesome with it, please get in touch and let me know.

I know this was a long read, I hope it adequately explains the rationale for the subscription service and that you use it to do amazing things 😊

You can get started right now from the domain search page on HIBP.

Update: Following feedback and consultation with a range of existing users of the service, we now provide a model for the education and non-profit sectors. See the KB titled Do you provide discounts based on the nature of the organisation? for more information.

Have I Been Pwned Domain Searches: The Big 5 Announcements!

By Troy Hunt
Have I Been Pwned Domain Searches: The Big 5 Announcements!

There are presently 201k people monitoring domains in Have I Been Pwned (HIBP). That's massive! That's 201k people that have searched for a domain, left their email address for future notifications when the domain appears in a new breach and successfully verified that they control the domain. But that's only a subset of all the domains searched, which totals 231k. In many instances, multiple people have searched for the same domain (most likely from the same company given they've successfully verified control), and also in many instances, people are obviously searching for and monitoring multiple domains. Companies have different brands, mergers and acquisitions happen and so on and so forth. Larger numbers of domains also means larger numbers of notifications; HIBP has now sent out 2.7M emails to those monitoring domains after a breach has occurred. And the largest number of the lot: all those domains being monitored encompass an eye watering 273M breached email addresses 😲

The point is, just as HIBP itself has escalated into something far bigger than I ever expected, so too has the domain search feature. Today, I'm launching an all new domain search experience and 5 announcements about major changes surrounding it. Let's jump into it!

Announcement 1: There's an all new domain search dashboard

Every time I look at numbers related to domain searches, they stagger me. One of the stats I found particularly interesting was that of those 200k people monitoring domains, 23k of them were monitoring 2 or more domains. 8.5k were monitoring 3 or more. 4.6k were 4 or more and so on and so forth. The point being that there are a very large number of people monitoring multiple domains. In fact, 1k people are monitoring 9 or more and hundreds have gone through the manual verification process at least 2 dozen times.

To make life much, much easier on those folks monitoring multiple domains, they're now all bundled up into a centralised dashboard accessible from the existing Domain search link on the website. Because I already know who is monitoring which domains and the email address they're using for notifications, that same email address can be used to verify your identity and drop you straight into the dashboard. Here's mine:

Have I Been Pwned Domain Searches: The Big 5 Announcements!

One of the problems the dashboard approach helps tackle is unsubscribing on an individual domain basis. In the past, the only way to unsubscribe from domain notifications was to wait until one landed in your inbox then unsubscribe from every single monitored domain in one go. It was an all or nothing affair that nuked the lot of them whereas now, it's a domain-by-domain exercise.

Another problem this solves is how I respond to an often-received question: "Hey, can you tell me which domains I'm currently subscribed to". Uh, the ones you verified? Like, possibly almost a decade ag... ah, yeah, that's a poor answer! The dashboard now makes the answer crystal clear.

And finally, another massive problem it helps tackle is verification, and that brings me to the second big announcement:

Announcement 2: From now on, domain verification only needs to happen once

I originally introduced domain searches to HIBP only 6 weeks after the project first launched. Up until this week, it functioned exactly the same way for almost a decade: plug in a domain name, verify control of it, then see the results. Each and every time. What that meant is that if you searched a domain and successfully demonstrated control, then came back later and tried to search it again, you had to go back through the same process:

Have I Been Pwned Domain Searches: The Big 5 Announcements!

You'd be surprised at how many emails I get about the difficulty this poses. We don't have any of those 4 aliases on our domain. We can't add a meta tag. We can't upload a file. We can't touch DNS. It leaves me prone to asking "well do you really have control of the domain?" Thing is, "control" is a bit of a nuanced term; there are many people in roles where they don't have access to any of the above means of verification but they're legitimately responsible for infosec and responding to precisely the sorts of notifications HIBP sends out after a breach. Usually in these cases they can get support to go through the verification process, but it involves formal internal processes, ticketing, documentation and having to explain to some IT ops person why a data breach website with a funny name needs one of the above things to happen. This doesn't fix the pain of doing it once, but it does mean that it's now a one-off pain.

Announcement 3: Domain searches are now entirely "serverless"

As the popularity of HIBP and domain searches has grown over the years, another challenge has emerged. Let me illustrate by example: in January this year, I loaded a rather large breach into HIBP:

New scraped data: Twitter had over 200M accounts scraped from a vulnerable API in 2021. Email addresses were passed in and Twitter profiles returned. 98% were already in @haveibeenpwned. Read more: https://t.co/FRBDFk3nkp

— Have I Been Pwned (@haveibeenpwned) January 5, 2023

That's a sizeable whack of data; in fact, it was the 14th largest in HIBP out of the existing 644 in there at the time. It also had a massive impact on HIBP subscribers: I sent over 1 million emails to individuals using the notification service, which made it the single largest corpus of notification emails we'd ever sent by a significant margin. But further, I also sent 60,851 emails to people monitoring domains. And that's when this started happening:

Have I Been Pwned Domain Searches: The Big 5 Announcements!

6 minutes later...

Have I Been Pwned Domain Searches: The Big 5 Announcements!

And so on and so forth until my inbox looked like this:

Have I Been Pwned Domain Searches: The Big 5 Announcements!

This was Azure auto-scale doing its thing and it was one of the early attractions for me building HIBP on Microsoft's PaaS offering way back in 2013. Need more resources? Just add more cloud! Job done, next problem. Except there are 2 major drawbacks with this:

  1. Auto-scale is reactive. You get extra capacity in response to demand but if demand spikes too fast, you're left without sufficient resources. I learned this the hard way and wrote about it in detail in 2016.
  2. I pay for it. When load spikes and additional instances are scaled out, I'm billed for it whilst those instances are spun up. It's great that domain searches are free for the end user, but they're not free for me 😔

Domain searches were actually one of the last remnants of a resource-intensive process still running on PaaS; most of the other important bits (namely email address searches and Pwned Passwords' k-anonymity searches) had been on Azure Functions for ages. Functions are awesome as they're "serverless" (except for the servers they run on, but don't let me get in the marketing team's way here), in that you're never deploying large logical containers of compute like with auto-scale, so that solves problem 1 above.

As of now, all domain searches run on Azure Functions. There's literally no domain search logic remaining in the Azure App Service PaaS model; it's all gone. That moves things over to much more scalable infrastructure and massively reduces the likelihood of a timeout when searching a larger domain.
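To give a sense of the shape of that, here's a purely illustrative HTTP-triggered function using the Azure Functions Python programming model. HIBP itself is a .NET codebase, and the route and helper below are hypothetical:

```python
import json
import azure.functions as func

app = func.FunctionApp()

@app.route(route="domainsearch/{domain}", auth_level=func.AuthLevel.FUNCTION)
def domain_search(req: func.HttpRequest) -> func.HttpResponse:
    # Each request spins up (or reuses) a small unit of compute rather than a
    # pre-provisioned App Service instance, so spiky load scales per request.
    domain = req.route_params.get("domain")
    results = lookup_breached_aliases(domain)  # hypothetical helper over the storage layer
    return func.HttpResponse(json.dumps(results), mimetype="application/json")

def lookup_breached_aliases(domain: str) -> dict:
    # Placeholder standing in for the partition query shown earlier.
    return {"alice@": ["Adobe"], "bob@": ["Adobe", "LinkedIn"]}
```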

Announcement 4: There are lots of little optimisation tweaks

I didn't just want to ship a model from years ago and reproduce all the assumptions of the day, so I made a bunch of tweaks to further optimise things. These are all things that benefit both those searching domains and me running the platform as they reduce overhead on everyone.

For example, there was no point searching for a domain then listing every alias on it with the full "@domain.com" suffix, so now you'll just see "alias@" instead. Doesn't sound like a lot, but imagine a domain with tens of thousands of results and then a heap of orgs running searches on them. More data equals more processing equals more egress bandwidth equals more latency and more cost. (Sidenote: if you're wondering "how costly can a bit of bandwidth really be", read my post from last year on How I Got Pwned by My Cloud Costs.)

The same logic extended to exporting the domain search results in Excel or JSON format - strip out the redundant data. I went even harder on the JSON front as this format is primarily used for ingestion into other apps where there's a large amount of programmatic control. So, rather than returning a heap of redundant breach metadata over and over again, now each alias just lists the name of the breach and you can match that up to the data from the breaches API. To be clear, the domain search JSON format itself was never an "API"; it wasn't designed for programmatic consumption, it required manual verification first and I set no expectation of stability. That's something that will change soon - there'll be a proper API - but I'll come back to that at the end of this post.
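In other words, the consuming code does the join itself. Here's a minimal Python sketch of what that looks like, assuming the trimmed-down export simply maps each alias to a list of breach names (the sample data below is made up):

```python
import requests

# Assumed shape of the trimmed-down domain search export: alias -> breach names.
domain_export = {
    "alice@": ["Adobe", "Dropbox"],
    "bob@": ["Adobe"],
}

# Breach metadata comes from the separate, unauthenticated breaches API and only
# needs to be fetched once, then joined locally.
breaches = {
    b["Name"]: b
    for b in requests.get("https://haveibeenpwned.com/api/v3/breaches", timeout=30).json()
}

for alias, names in domain_export.items():
    for name in names:
        meta = breaches[name]
        print(f"{alias} -> {meta['Title']} ({meta['BreachDate']})")
```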

Something else I've been working away on in the background is to better leverage Cloudflare's WAF to minimise the impact on the origin services. For example, last week I did a thread on blocking 401 and excessive 428 responses at the edge rather than having to process them (and pay to process them) at the origin. I've been using similar logic to keep some, well, let's just call it "very excessive" domain queries under control. For example, one particular domain was searched 140 times after a breach was loaded in April, followed by another 40 times immediately after a breach the following month:

Have I Been Pwned Domain Searches: The Big 5 Announcements!

Clearly, this is just unnecessary. Remember how domain searches are a resource intensive process that hits my bottom line pretty hard? Yeah, well, not any more!

And finally on the performance front, if you were previously monitoring multiple domains and you got a breach alert, you could run a single search that bundled all the results in together. You reckon searching for one domain can be resource intensive? Try throwing a bunch of them into the one search! As the system grew and grew, this model became increasingly hard to sustain and equally, it became increasingly noisy. So now, exactly the same domains can be searched one by one which breaks the processing down into smaller, more manageable units. Hey, wouldn't it be great to have an API around that so you could just automate the entire thing? Read on!

All these tweaks, along with the move to Azure Functions, have made a massive difference to the performance problem mentioned earlier, but another problem remains: I'm still paying for your domain searches. Azure Functions are charged based on a combination of how long they run for and how many resources they consume. Both those factors are extraordinarily small for individual email address searches, but they're not for domain searches. That's why soon, the largest users of the service are going to see a small fee.

Announcement 5: Searches for small domains will remain free whilst larger domains will soon require a commercial subscription

Pick a brand. A big brand. If I were to bet you that either the brand directly or its parent company has used the HIBP domain search feature in the past, I'd win. I wouldn't win every bet, but I'd come out on top over a bunch of them, and I know this because I have the data to be confident of my odds 🙂

Knowing which big brands use which domains for their email is actually a hard metric to define:

Anyone know where I can find a list of the Fortune 500’s domains used for email accounts? There may be more than 1 per company and it may be different to their primary website.

— Troy Hunt (@troyhunt) January 15, 2023

But by cobbling enough OSINT data together, I was able to confidently demonstrate that more than half the Fortune 500 have used this service and the vast majority of those continue to do so via ongoing domain monitoring. That's awesome! And that pattern extends all the way down to much more localised brands too. My bank. My telco. My supermarket. All sorts of commercial organisations running businesses and using data sourced from HIBP to help them do so.

I started analysing the metrics back at that tweet in January, just the week after all the domain searches that followed the scraped Twitter data going into HIBP. For the last 5 months, I've been trawling through the usage patterns and watching how organisations are using the service. I also paid a lot of attention to the reactions following last November's change in rate limits and annual billing for the public API that enables email address searches. That's now given me a pretty good sense of how to structure a commercial domain search model. It's not final yet, but I do hope to put the finishing touches on it next month and in the interim, welcome feedback on the high-level overview of how it'll work, which I'll list here in point form:

  1. I can reliably establish the size of a domain based on the number of email addresses that appear against it in breaches
  2. There is a size at which domain searches should remain totally free and that size will usually indicate a small business or website or a personal domain (certainly every domain you see in the hero image of this blog post, for example)
  3. Like with the aforementioned API for email address searches, there should be tiers of scale that reflect domain size and increase proportionately in price for larger organisations
  4. Commercial subscribers should get more than they do now - they should get domain searches by API!

That last point in particular is hotly requested and as of a couple of months ago, already under development:

UserVoice suggestion for @haveibeenpwned to add domain search capability to the API now started! Follow along, vote and subscribe to updates here: https://t.co/Z32eC0d9nb

— Troy Hunt (@troyhunt) April 20, 2023

I'm still working through the mechanics of all this, both technically and commercially. One part of that is looking at raw numbers; for example, about half of all the domains being monitored have 10 or fewer breached accounts on them. These aren't commercial entities of any scale and whilst I'm not saying "10 is the free tier number", clearly there are a massive number of domains that are tiny and shouldn't be at all impacted by this.

To be honest, the experience with the public API keys has taught me that it's usually not money that's the barrier to using commercial services, it's corporate procurement bureaucracy. Onboarding documentation. Vendor assessments. Tax forms. All sorts of things that demand hours of our time, often for the sake of only $3.50 per month. So we politely decline 😊 I know that will be an issue, in fact I suspect it will be the issue and a lot of the work we've been doing this year is to try and ease that pain to the fullest extent possible. I'll talk more about that once things finally launch but for now, that's the direction we're heading and the sorts of issues we're tackling in preparation.

Summary

As we approach the 10th birthday of HIBP later this year, it's hard not to look back and reflect. So much has changed in that time, yet the service still feels very much like what it was on day 1. The challenge for me over this time has been to work out how to adapt to the changes whilst keeping true to the original intent of the service. Nothing has happened quickly in that regard, and the transparent fashion in which I've chosen to run HIBP has made the rationale for any change very clear to everyone. Even this blog post has been 5 months in the making, gradually evolving to reflect my thinking on the issues until I was confident enough in the path forward.

Go and use the new dashboard. Give it a good run and let me know what you think as I'm sure there are many things we can do better. And do provide your feedback on both the changes announced here and those to come regarding the commercial tiers too; the more input we get on this, the better equipped we are to make good decisions.

Seized Genesis Market Data is Now Searchable in Have I Been Pwned, Courtesy of the FBI and "Operation Cookie Monster"

By Troy Hunt
Seized Genesis Market Data is Now Searchable in Have I Been Pwned, Courtesy of the FBI and "Operation Cookie Monster"

A quick summary first before the details: This week, the FBI in cooperation with international law enforcement partners took down a notorious marketplace trading in stolen identity data in an effort they've named "Operation Cookie Monster". They've provided millions of impacted email addresses and passwords to Have I Been Pwned (HIBP) so that victims of the incident can discover if they have been exposed. This breach has been flagged as "sensitive" which means it is not publicly searchable; rather, you must demonstrate you control the email address being searched before the results are shown. This can be done via the free notification service on HIBP and involves you entering the email address then clicking on the link sent to your inbox. Specific guidance prepared by the FBI in conjunction with the Dutch police on further steps you can take to protect yourself is detailed at the end of this blog post on the gold background. That's the short version, here's the whole story:


Ever heard that saying about how "data is the new oil"? Or that "data is the currency of the digital economy"? You've probably seen stories and infographics about how much your personal information is worth, both to legitimate organisations and criminal networks. Like any valuable commodity, marketplaces selling data inevitably emerge, some operating as legal businesses and others, well, not so much. In its simplest form, the illegal data marketplace has long involved the exchange of currency for personal records containing attributes such as email addresses, passwords, names, etc. Cybercriminals then use this data for purposes ranging from identity theft to phishing attacks to credential stuffing. So, we (the good guys) adapt and build better defences. We block known breached passwords. We implement two factor authentication. We roll out user behavioural analytics that identifies abnormalities in logins (why is Joe suddenly logging in from the other side of the world with a new machine?) And in turn, the criminals adapt, which brings us to Genesis Market.

Until this week, Genesis had been up and running for 4 years. This is an excellent primer from Catalin Cimpanu, and it describes how in order to circumvent the aforementioned fraud protection measures, cybercriminals are increasingly relying on obtaining more abstract pieces of information from victims in order to gain access to their accounts. Rather than relying on the credentials themselves and then being subject to all the modern fraud detection services mentioned above, criminals instead began to trade in a combination of "fingerprints" and "cookies". The latter will be a familiar term to most people (and was obviously the inspiration for the name behind the FBI's operation), whilst the former refers to observable attributes of the user and their browser. To see a very easy demonstration of what fingerprinting involves, go and check out amiunique.org and hit the "View my browser fingerprint" button. You'll get something similar to this:

Seized Genesis Market Data is Now Searchable in Have I Been Pwned, Courtesy of the FBI and "Operation Cookie Monster"

Among more than 1.6M sampled clients, nobody has the same fingerprint as me. Somehow, using the current version of Chrome on the current version of Windows, I am a unique snowflake. Why I'm so unique is partly explained by my time zone, which is shared by less than half a percent of people, but it's when that's combined with the other observable fingerprint attributes that you realise just how special I really am. For example, less than 0.01% of people have a content language request header of "en-US,en,en-AU". Only 0.12% of people share a screen width of 5,120 pixels (I'm using an ultrawide monitor). And so on and so forth. Because they're so unique, fingerprints are increasingly used as a fraud detection method such that if a malicious party attempts to impersonate a legitimate user with otherwise correct attributes (for example, the correct cookies) but the wrong fingerprint, they're rejected. Which is why we now have IMPaaS.
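Before moving on to IMPaaS, a quick back-of-envelope on why those percentages matter so much when combined. This treats the attributes as independent, which they aren't strictly, so it's a rough illustration only:

```python
# Rough illustration: multiplying the shares quoted above shows how quickly
# the anonymity set collapses when attributes are combined.
timezone_share = 0.005            # "less than half a percent"
content_language_share = 0.0001   # "less than 0.01%"
screen_width_share = 0.0012       # "0.12%"

combined = timezone_share * content_language_share * screen_width_share
print(f"Roughly 1 in {1 / combined:,.0f} users share all three attributes")
# Against a sample of 1.6M clients, a combination that rare is effectively unique.
```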

There's an excellent IMPaaS explanation from the Eindhoven University of Technology in the Netherlands via a paper titled Impersonation-as-a-Service: Characterising the Emerging Criminal Infrastructure for User Impersonation at Scale. Released only a year and a half after the emergence of Genesis, the paper explains the mechanics of IMPaaS:

IMPaaS allows attackers to systematically collect and enforce user profiles (consisting of user credentials, cookies, device and behavioural fingerprints, and other metadata) to circumvent risk-based authentication system and effectively bypass multifactor authentication mechanisms

In other words, if you have all the bits of information a website requires to persist authenticated state after the login process has successfully completed (including after any 2FA requirements), you can perform a modern equivalent of session hijacking. Obtaining this level of information is typically done via malicious software running on the victim's machine which can then grab anything useful and send it off to a C2 server where it can then be sold and used to commit fraud (from the IMPaaS paper):

Seized Genesis Market Data is Now Searchable in Have I Been Pwned, Courtesy of the FBI and "Operation Cookie Monster"

Catalin's story from the early days of Genesis showed how buyers could browse through a list of compromised victims and pick their target based on the various services they had authenticated to, along with their operating system and location. Pricing was inevitably based on the value of those services, with the examples below going for $41.30 each (and just like a legitimate marketplace, these were marked-down prices so a real bargain!)

Seized Genesis Market Data is Now Searchable in Have I Been Pwned, Courtesy of the FBI and "Operation Cookie Monster"

To make things as turn-key as possible for the criminals, buyers would then run a browser extension from Genesis that would reconstruct the required fingerprint based on the information the malware had obtained and grant them access to the victims' accounts (I'm having flashbacks of Firesheep here). It was that simple... until this week. As of now, the following banner greets anyone browsing to the Genesis website:

Seized Genesis Market Data is Now Searchable in Have I Been Pwned, Courtesy of the FBI and "Operation Cookie Monster"

The aptly named "Operation Cookie Monster" is a joint effort between the FBI and a coalition of law enforcement agencies across the globe who have now put an abrupt end to Genesis. I imagine they'll be having some "discussions" with those involved in running the service, but what about the individuals who are the victims? These are the people whose identities have been put up for sale, purchased by other criminals and then abused to their detriment. The FBI approached me and asked if HIBP could be used as a mechanism to help warn victims of their exposure in the same way as we'd previously done with the Emotet malware a couple of years ago. This is well aligned with the mantra of HIBP - to do good and constructive things with data breaches after they occur - and I was happy to provide support.

There are 2 separate things that have now been loaded into HIBP, each disassociated from the other:

  1. Millions of compromised passwords that are now searchable via Pwned Passwords
  2. Millions of email addresses that are now searchable after verifying control of the address using the notification service

The Pwned Passwords API is presently hit more than 4 billion times each month, and the downloadable data set is hit, well, I don't know because anyone can grab it and run it offline. The point is that password corpuses loaded into HIBP have huge reach and are used by thousands of different online services to help people make better password choices. You're probably using it without even knowing it when you sign up or log in to various services, but if you want to check it directly, you can browse to the web interface. (If you're worried about the privacy of your password, there's a full explainer on how the service preserves anonymity, but I also suggest testing it after you've changed it as a generally good practice.)
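If you've never called it yourself, the range API is worth seeing once. Here's a minimal Python sketch of the k-anonymity flow; only the first five characters of the hash ever leave your machine:

```python
import hashlib
import requests

def pwned_count(password: str) -> int:
    """Return how many times a password appears in Pwned Passwords."""
    sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = sha1[:5], sha1[5:]
    # Only the 5-character prefix is sent; the full hash stays on this machine.
    resp = requests.get(f"https://api.pwnedpasswords.com/range/{prefix}", timeout=30)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        candidate, count = line.split(":")
        if candidate == suffix:
            return int(count)
    return 0

print(pwned_count("P@ssw0rd"))  # a depressingly large number
```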

The email address search is what HIBP is so well known for and that's obviously what will help you understand if you've been impacted. Per the opening paragraph, this breach is flagged as "sensitive" so you will not get a result when searching directly from the front page or via the API, rather you'll need to use the free notification service. This approach was chosen to avoid the risk of people being further targeted as a result of their inclusion in Genesis. All existing HIBP subscribers have been sent notification emails and between individuals and those monitoring domains, tens of thousands of emails have now been sent out. Whilst the volume of accounts represented is "8M", please note that this is merely an approximation (hence the perfectly round number on HIBP), intended to be an indicative representation of scale as many of the breached accounts didn't include email addresses. This number only represents the number of unique email addresses which showed up in the data set so consider it a subset of a much larger corpus.

Let me add some final context and this is important if you do find yourself in the Genesis data: due to the nature of how the malware collected personal information and the broad range of different services victims may have been using at the time, the exposed data can differ significantly person by person. What's been provided by the FBI is one set of passwords (incidentally, as SHA-1 and NTLM hash pairs fed into the law enforcement ingestion pipeline), one set of email addresses and a list of metadata. Beyond the data already listed here, the metadata includes names, physical addresses, phone numbers and full credit card details among other personal attributes. This does not mean that all impacted individuals had each of those data classes exposed. The hope is that by listing these fields it will help victims understand, for example, why they may have observed fraudulent transactions on their card, and they can then take informed and appropriate steps to better protect themselves.

Lastly, as flagged in the intro, following is the guidance prepared by the FBI and Dutch police on how people can safeguard themselves if they get a hit in the Genesis data or frankly, just want to better protect themselves in future:

The FBI reached out to Have I Been Pwned (HIBP) to continue sharing efforts to help victims determine if they've been victimized. In this instance, the data shared emanates from the Initial Access Broker Marketplace Genesis Market. The FBI has taken action against Genesis Market, and in the process has been able to extract victim information for the purposes of alerting victims.

In all, millions of passwords and email addresses were provided which span a wide range of countries and domains. These emails and passwords were sold on Genesis Market and were used by Genesis Market users to access the various accounts and platforms that were for sale.

Prepared in conjunction with the FBI, following is the recommended guidance for those that find themselves in this collection of data:

To safeguard yourself against fraud in the future, it is important that you immediately remove the malware from your computer and then change all your passwords. Do this as follows:

  1. Log out of all open sessions in all web browsers on your computer.
  2. Remove all cookies and temporary internet files.
  3. Then choose one of the following two options:
    1. Update the virus scanner on your computer.
      1. Then carry out a virus scan on your computer.
      2. The malware will be removed.
      3. Then (and only then) change all your passwords. Don’t do this any earlier, as otherwise the cybercriminals will see the new passwords.

        OR

    2. Reset the infected computer to the factory default settings:
      1. Then (and only then) change all your passwords. Don’t do this any earlier, as otherwise the cybercriminals will see the new passwords.

How can I prevent my data being stolen (again)?

  1. Use a virus scanner and keep it up to date.
  2. Use strong passwords that are unique for each account/website.
  3. Use multifactor authentication. If you use a fingerprint, facial recognition, or approval on another device (such as a phone) to confirm your identity on login, it is harder for someone to access your accounts.
  4. Never download or install illegal software. This is a very common source of malware infection.
  5. When installing legal software, always check that the website is genuine.

Just one more thing to end on a lighter note: a quick shoutout to whoever at the bureau slipped a half-eaten cookie into the takedown image, having been munched on by what I can only assume is a very satisfied FBI agent after a successful "Operation Cookie Monster" 😊

Seized Genesis Market Data is Now Searchable in Have I Been Pwned, Courtesy of the FBI and "Operation Cookie Monster"

To Infinity and Beyond, with Cloudflare Cache Reserve

By Troy Hunt
To Infinity and Beyond, with Cloudflare Cache Reserve

What if I told you... that you could run a website from behind Cloudflare and only have 385 daily requests miss their cache and go through to the origin service?

To Infinity and Beyond, with Cloudflare Cache Reserve

No biggy, unless... that was out of a total of more than 166M requests in the same period:

To Infinity and Beyond, with Cloudflare Cache Reserve

Yep, we just hit "five nines" of cache hit ratio on Pwned Passwords: 99.999%. Actually, it was 99.9998%, but we're at the point now where that's just splitting hairs. Let's talk about how we've managed to have only two requests in a million hit the origin, beginning with a bit of history:

Optimising Caching on Pwned Passwords (with Workers)- @troyhunt - https://t.co/KjBtCwmhmT pic.twitter.com/BSfJbWyxMy

— Cloudflare (@Cloudflare) August 9, 2018

Ah, memories 😊 Back then, Pwned Passwords was serving way fewer requests in a month than what we do in a day now and the cache hit ratio was somewhere around 92%. Put another way, instead of 2 in every million requests hitting the origin it was 85k. And we were happy with that! As the years progressed, the traffic grew and the caching model was optimised so our stats improved:

There it is - Pwned Passwords is now doing north of 2 *billion* requests a month, peaking at 91.59M in a day with a cache-hit ratio of 99.52%. All free, open source and out there for the community to do good with 😊 pic.twitter.com/DSJOjb2CxZ

— Troy Hunt (@troyhunt) May 24, 2022

And that's pretty much where we levelled out, at about the 99-and-a-bit percent mark. We were really happy with that as it was now only 5k requests per million hitting the origin. There was bound to be a number somewhere around that mark due to the transient nature of cache and eviction criteria inevitably meaning a Cloudflare edge node somewhere would need to reach back to the origin website and pull a new copy of the data. But what if Cloudflare never had to do that unless explicitly instructed to do so? I mean, what if it just stayed in their cache unless we actually changed the source file and told them to update their version? Welcome to Cloudflare Cache Reserve:

To Infinity and Beyond, with Cloudflare Cache Reserve

Ok, so I may have annotated the important bit but that's what it feels like - magic - because you just turn it on and... that's it. You still serve your content the same way, you still need the appropriate cache headers and you still have the same tiered caching as before, but now there's a "cache reserve" sitting between that and your origin. It's backed by R2 which is their persistent data store and you can keep your cached things there for as long as you want. However, per the earlier link, it's not free:

To Infinity and Beyond, with Cloudflare Cache Reserve

You pay based on how much you store for how long, how much you write and how much you read. Let's put that in real terms and just as a brief refresher (longer version here), remember that Pwned Passwords is essentially just 16^5 (just over 1 million) text files of about 30KB each for the SHA-1 hashes and a similar number for the NTLM ones (albeit with slightly smaller file sizes). Here are the Cache Reserve usage stats for the last 9 days:

To Infinity and Beyond, with Cloudflare Cache Reserve

We can now do some pretty simple maths with that and working on the assumption of 9 days, here's what we get:

To Infinity and Beyond, with Cloudflare Cache Reserve

2 bucks a day 😲 But this has taken nearly 16M requests off my origin service over this period of time, so I haven't paid for either the Azure Function execution (which is cheap) or the egress bandwidth (which is not cheap). But why are there only 16M read operations over 9 days when earlier we saw 167M requests to the API in a single day? Because if you scroll back up to the "insert magic here" diagram, Cache Reserve is only a fallback position and most requests (i.e. 99.52% of them) are still served from the edge caches.

Note also that there are nearly 1M write operations and there are 2 reasons for this:

  1. Cache Reserve is being seeded with source data as requests come in and miss the edge cache. This means that our cache hit ratio is going to get much, much better yet, as not even half of all the potentially cacheable API queries are in Cache Reserve. It also means that the 48c per day cost is going to come way down 🙂
  2. Every time the FBI feeds new passwords into the service, the impacted file is purged from cache. This means that there will always be write operations and, of course, read operations as the data flows to the edge cache and makes corresponding hits to the origin service. The prevalence of all this depends on how much data the feds feed in, but it'll never get to zero whilst they're seeding new passwords.
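Pulling the figures above together, here's a rough back-of-envelope version of that maths in Python. The R2 list prices in the comments are assumptions based on Cloudflare's published pricing at the time of writing (check their pricing page for current numbers), and the usage figures are the approximations quoted above rather than the exact dashboard values, so expect the same order of magnitude rather than the precise daily cost:

```python
# Assumed R2 list prices (check Cloudflare's pricing page for current figures):
STORAGE_PER_GB_MONTH = 0.015   # USD per GB-month
CLASS_A_PER_MILLION = 4.50     # writes
CLASS_B_PER_MILLION = 0.36     # reads

days = 9
files = 2 * 16 ** 5            # SHA-1 + NTLM hash prefix files
avg_file_kb = 30
storage_gb = files * avg_file_kb / 1024 / 1024

storage_cost = storage_gb * STORAGE_PER_GB_MONTH * (days / 30)
write_cost = 1.0 * CLASS_A_PER_MILLION    # ~1M writes seeding/purging the reserve
read_cost = 16.0 * CLASS_B_PER_MILLION    # ~16M reads falling back from the edge

total = storage_cost + write_cost + read_cost
print(f"~{storage_gb:.0f} GB stored, ~${total:.2f} over {days} days (~${total / days:.2f}/day)")
```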

An untold number of businesses rely on Pwned Passwords as an integral part of their registration, login and password reset flows. Seriously, the number is "untold" because we have no idea who's actually using it, we just know the service got hit three and a quarter billion times in the last 30 days:

To Infinity and Beyond, with Cloudflare Cache Reserve

Giving consumers of the service confidence that not only is it highly resilient, but also massively fast, is essential to adoption. In turn, more adoption helps drive better password practices, fewer account takeovers and more smiles all round 😊

As those remaining hash prefixes populate Cache Reserve, keep an eye on the "cf-cache-status" response header. If you ever see a value of "MISS" then congratulations, you're literally one in a million!
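Checking is a one-liner, for example with Python's requests library ("5BAA6" is the SHA-1 prefix for "password"):

```python
import requests

# "5BAA6" is the first five characters of the SHA-1 hash of "password".
resp = requests.get("https://api.pwnedpasswords.com/range/5BAA6", timeout=30)
print(resp.headers.get("cf-cache-status"))  # almost always "HIT"; a "MISS" makes you one in a million
```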

Full disclosure: Cloudflare provides services to HIBP for free and they helped in getting Cache Reserve up and running. However, they had no idea I was writing this blog post and reading it live in its entirety is the first anyone there has seen it. Surprise! 👋

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

By Troy Hunt
Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

I found myself going down a previously unexplored rabbit hole recently, or more specifically, what I thought was "a" rabbit hole but in actual fact was an ever-expanding series of them that led me to what I refer to in the title of this post as "6 rabbits deep". It's a tale of firewalls, APIs and sifting through layers and layers of different services to sniff out the root cause of something that seemed very benign, but actually turned out to be highly impactful. Let's go find the rabbits!

The Back Story

When you buy an API key on Have I Been Pwned (HIBP), Stripe handles all the payment magic. I love Stripe, it's such an awesome service that abstracts away so much pain and it's dead simple to integrate via their various APIs. It's also dead simple to configure Stripe to send notices back to your own service via webhooks. For example, when an invoice is paid or a customer is updated, Stripe sends information about that event to HIBP and then lists each call on the webhooks dashboard in their portal:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

There are a whole range of different events that can be listened to and webhooks fired; here we're seeing just a couple of them that are self-explanatory in name. When an invoice is paid, the callback looks something like this:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰
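For illustration, here's a minimal sketch of the receiving side using Flask and the official stripe Python library. HIBP itself is a .NET codebase, and the route, secret handling and helper below are hypothetical; the important parts are verifying the webhook signature and reacting to the invoice.paid event:

```python
import os
import stripe
from flask import Flask, request

app = Flask(__name__)
endpoint_secret = os.environ["STRIPE_WEBHOOK_SECRET"]  # hypothetical config

@app.post("/callback/stripe")  # hypothetical path, not HIBP's actual endpoint
def stripe_webhook():
    # Verify the signature so only genuine Stripe callbacks are processed.
    sig_header = request.headers.get("Stripe-Signature", "")
    try:
        event = stripe.Webhook.construct_event(request.get_data(), sig_header, endpoint_secret)
    except (ValueError, stripe.error.SignatureVerificationError):
        return "invalid payload or signature", 400

    if event["type"] == "invoice.paid":
        invoice = event["data"]["object"]
        extend_or_upgrade_api_key(invoice["customer"])  # hypothetical helper
    return "", 200

def extend_or_upgrade_api_key(customer_id: str) -> None:
    ...  # update the key's validity period / rate limit in the database
```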

HIBP receives this call and updates its own DB such that for a new customer, they can now retrieve an API key, or for an existing customer whose subscription has renewed, the API key validity period has been extended. The same callback is also issued when someone upgrades an API key, for example when going from 10RPM (requests per minute) to 50RPM. It's super important that HIBP gets that callback so it can appropriately upgrade the customer's key and they can immediately begin making more requests. When that call doesn't happen, well, let's go down the first rabbit hole.

The Failed API Key Upgrade 🐰

This should never happen:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

This came in via HIBP's API key support portal and is pretty self-explanatory. I checked the customer's account on Stripe and it did indeed show an active 50RPM subscription, but when drilling down into the associated payment, I found the following:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

Ok, so at least I know where things have started to go wrong, but why? Over to the webhooks dashboard and into the failed payments and things look... suboptimal:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

Dammit! Fortunately this is only a small single-digit percentage of all callbacks, but every time this fails it's either stopping someone like the guy above from making the requests they've paid for or, potentially, causing someone's API key to expire even though they've paid for it. The latter in particular I was really worried about as it would nuke their key and whatever they'd built on top of it would cease to function. Fortunately, because that's such an impactful action I'd built in heaps of buffer for just such an occurrence and I'd gotten onto this issue quickly, but it was disconcerting all the same.

So, what's happening? Well, the response is HTTP 403 "Forbidden" and the body is clearly a Cloudflare challenge page so something at their end is being triggered. Looks like it's time to go down the next rabbit hole.

Cloudflare's Firewall and Logs 🐰 🐰

Desperate just to quickly restore functionality, I dropped into Cloudflare's WAF and allowed all Stripe's outbound IPs used for webhooks to bypass their security controls:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

This wasn't ideal, but it only created risk for requests originating from Stripe and it got things up and running again quickly. With time up my sleeve I could now delve deeper and work out precisely what was going on, starting with the logs. Cloudflare has a really extensive set of APIs that can control a heap of features of the service, including pulling back logs (note: this is a feature of their Enterprise plan). I queried out a slice of the logs corresponding to when some of the 403s from Stripe's dashboard occurred and found 2 entries similar to this one:

{"BotScore":1,"BotScoreSrc":"Verified Bot","CacheCacheStatus":"unknown","ClientASN":16509,"ClientCountry":"us","ClientIP":"54.187.205.235","ClientRequestHost":"haveibeenpwned.com","ClientRequestMethod":"POST","ClientRequestReferer":"","ClientRequestURI":"[redacted]","ClientRequestUserAgent":"Stripe/1.0 (+https://stripe.com/docs/webhooks)","EdgeRateLimitAction":"","EdgeResponseStatus":403,"EdgeStartTimestamp":1674073983931000000,"FirewallMatchesActions":["managedChallenge"],"FirewallMatchesRuleIDs":["6179ae15870a4bb7b2d480d4843b323c"],"FirewallMatchesSources":["firewallManaged"],"OriginResponseStatus":0,"WAFAction":"unknown","WorkerSubrequest":false}

That's one of Stripe's outbound IPs, 54.187.205.235, and the "FirewallMatchesRuleIDs" collection has a value in it. Ergo, something about this request triggered the firewall and caused it to be challenged. I'm sure many of us have gone through the following thought process before:

What did I change?

Did I change anything?

Did they change something?

Except "they" could have been either Cloudflare or Stripe; if it wasn't me (and I was fairly certain it wasn't), was it a Cloudflare change to the rules or a Stripe change to a webhook payload that was now triggering an existing rule? Time to dig deeper again so it's over to the Cloudflare dashboard and down into the WAF events for requests to the webhook callback path:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

Yep, something proper broke! Let's drill deeper and look at recent events for that IP:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

As you dig deeper through troubleshooting exercises like this, you gradually turn up more and more information that helps piece the entire puzzle together. In this case, it looks like the "Inbound Anomaly Score Exceeded" rule was being triggered. What's that? And why? Time to go down another rabbit hole.

The Cloudflare OWASP Core Ruleset 🐰 🐰 🐰

So, deeper and deeper down the rabbit holes we go, this time into the depths of the requests that triggered the managed rule:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

Well that's comprehensive 🙂

There's a lot to unpack here so let's begin with the ruleset that the previously identified "Inbound Anomaly Score Exceeded" rule belongs to, the Cloudflare OWASP Core Ruleset:

The Cloudflare OWASP Core Ruleset is Cloudflare’s implementation of the OWASP ModSecurity Core Rule Set (CRS). Cloudflare routinely monitors for updates from OWASP based on the latest version available from the official code repository.

That link is yet another rabbit hole altogether so let me summarise succinctly here: Cloudflare uses OWASP's rules to identify anomalous traffic based on a customer-defined paranoia level (how strict you want to be) and then applies a score threshold (also customer-defined) at which an action will be taken, for example challenging the request. What I learned as this saga progressed is that the "Inbound Anomaly Score Exceeded" rule is actually a rollup of the rules beneath it. The OWASP score of "26" is the sum of the 6 rules listed beneath it and once it exceeds 25, the superset rule is triggered.
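A tiny sketch of that scoring model, with hypothetical rule names and per-rule scores, just to make the rollup behaviour concrete:

```python
THRESHOLD = 25  # customer-defined score threshold

# Hypothetical per-rule scores for a single request; in reality these come from
# the individual CRS rules the request matched.
matched_rule_scores = {"rule A": 5, "rule B": 5, "rule C": 4, "rule D": 4, "rule E": 5, "rule F": 3}

total = sum(matched_rule_scores.values())
action = "managed_challenge" if total > THRESHOLD else "allow"
print(total, action)  # 26 managed_challenge - the rolled-up rule fires
```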

Further - and this is the really important bit - Cloudflare routinely updates the rules from OWASP which makes sense because these are ever-evolving in response to new threats. And when did they last upgrade the rules? It looks like they announced it right before I started having issues:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

Whilst it's not entirely clear from above when this release was scheduled to occur, I did reach out to Cloudflare support and was advised it had already taken place:

Please note that we did bump the OWASP version, which we are integrating with to 3.3.4 as noted on our scheduled changes.

So maybe it's not Cloudflare's fault or Stripe's fault, but OWASP's fault? In fairness to all, I don't think it's anyone's fault per se; it's instead just an unfortunate result of everyone doing their best to keep the bad guys out. Unless... it really is Stripe's fault because there's something in the request payload that was always fishy and is now being caught? But why for only some requests and not others? Next rabbit!

Cloudflare Payload Logging 🐰 🐰 🐰 🐰

Sometimes, people on the internet lose their minds a bit over things they really shouldn't. One of those things, in my experience, is Cloudflare's interception of traffic and it's something I wrote about in detail nearly 7 years ago now in my piece on security absolutism. Cloudflare plays an enormously valuable role in the internet's ecosystem and a substantial part of the value comes from being able to inspect, cache, optimise, and yes, even reject traffic. When you use Cloudflare to protect your website, they're applying rulesets like the aforementioned OWASP ones and in order to do that, they must be able to inspect your traffic! But they don't log it, not all of it, rather just "metadata generated by our products" as they refer to it on their logs page. We saw an example of that earlier on with Stripe's request from their IP showing it triggered a firewall rule, but what we didn't see is the contents of that POST request, the actual payload that triggered the rule. Let's go grab that.

Because the contents of a POST request can contain sensitive information, Cloudflare doesn't log it. Obviously they see it in transit (that's how OWASP's rules can be applied to it), but it's not stored anywhere and even if you want to capture it, they don't want to be able to see it. That's where payload logging (another Enterprise plan feature) comes in and what's really neat about that is every payload must be encrypted with a public key retained by Cloudflare whilst only you retain the private key. The setup looks like this:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

Pretty self-explanatory and once done, right under where we previously saw the additional logs we now have the ability to decrypt the payload:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

As promised, this requires the private key from earlier:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

And now, finally, we have the actual payload that triggered the rule, seen here with my own test data:

[ " },\n \"billing_reason\": \"subscription_update\",\n \"charge\": null,\n \"collection_method\": \"charge_automatically\",\n \"created\": 1674351619,\n \"currency\": \"usd\",\n \"custom_fields\": null,\n \"customer\": \"cus_MkA71FpZ7XXRlt\",\n \"customer_address\": ", " },\n \"customer_email\": \"troy-hunt+1@troyhunt.com\",\n \"customer_name\": \"Troy Hunt 1\",\n \"customer_phone\": null,\n \"customer_shipping\": null,\n \"customer_tax_exempt\": \"none\",\n \"customer_tax_ids\": [\n\n ],\n \"default_payment_method\": null,\n \"default_source\": null,\n \"default_tax_rates\": [\n\n ],\n \"description\": \"You can manage your subscription (i.e. cancel it or regenerate the API key) at any time by verifying your email address here: https://haveibeenpwned.com/API/Key\",\n \"discount\": null,\n \"discounts\": [\n\n ],\n \"due_date\": null,\n \"ending_balance\": -11804,\n \"footer\": null,\n \"from_invoice\": null,\n \"hosted_invoice_url\": \"https://invoice.stripe.com/i/acct_1EdQYpEF14jWlYDw/test_YWNjdF8xRWRRWXBFRjE0aldsWUR3LF9OREo5SlpqUFFvVnFtQnBVcE91YUFXemtkRHFpQWNWLDY0ODkyNDIw02004bEyljdC?s=ap\",\n \"invoice_pdf\": \"https://pay.stripe.com/invoice/acct_1EdQYpEF14jWlYDw/test_YWNjdF8xRWRRWXBFRjE0aldsWUR3LF9OREo5SlpqUFFvVnFtQnBVcE91YUFXemtkRHFpQWNWLDY0ODkyNDIw02004bEyljdC/pdf?s=ap\",\n \"last_finalization_error\": null,\n \"latest_revision\": null,\n \"lines\": ", " ", " ],\n \"discountable\": false,\n \"discounts\": [\n\n ],\n \"invoice_item\": \"ii_1MSsXfEF14jWlYDwB1nfZvFm\",\n \"livemode\": false,\n \"metadata\": ", " },\n \"period\": ", " },\n \"plan\": ", " },\n \"nickname\": null,\n \"product\": \"prod_Mk4eLcJ7JYF02f\",\n \"tiers_mode\": null,\n \"transform_usage\": null,\n \"trial_period_days\": null,\n \"usage_type\": \"licensed\"\n },\n \"price\": ", " },\n \"nickname\": null,\n \"product\": \"prod_Mk4eLcJ7JYF02f\",\n \"recurring\": ", " },\n \"tax_behavior\": \"unspecified\",\n \"tiers_mode\": null,\n \"transform_quantity\": null,\n \"type\": \"recurring\",\n \"unit_amount\": 15000,\n \"unit_amount_decimal\": \"15000\"\n },\n \"proration\": true,\n \"proration_details\": ", " \"il_1MMjfcEF14jWlYDwoe7uhDPF\"\n ]\n }\n },\n \"quantity\": 1,\n \"subscription\": \"sub_1MMjfcEF14jWlYDwi8JWFcxw\",\n \"subscription_item\": \"si_N6xapJ8gSXdp7W\",\n \"tax_amounts\": [\n\n ],\n \"tax_rates\": [\n\n ],\n \"type\": \"invoiceitem\",\n \"unit_amount_excluding_tax\": \"-14304\"\n },\n ", " ],\n \"discountable\": true,\n \"discounts\": [\n\n ],\n \"livemode\": false,\n \"metadata\": ", " },\n \"period\": ", " },\n \"plan\": ", " },\n \"nickname\": null,\n \"product\": \"prod_Mk4lTSl4axd9mt\",\n \"tiers_mode\": null,\n \"transform_usage\": null,\n \"trial_period_days\": null,\n \"usage_type\": \"licensed\"\n },\n \"price\": ", " },\n \"nickname\": null,\n \"product\": \"prod_Mk4lTSl4axd9mt\",\n \"recurring\": ", " },\n \"tax_behavior\": \"unspecified\",\n \"tiers_mode\": null,\n \"transform_quantity\": null,\n \"type\": \"recurring\",\n \"unit_amount\": 2500,\n \"unit_amount_decimal\": \"2500\"\n },\n \"proration\": false,\n \"proration_details\": ", " },\n \"quantity\": 1,\n \"subscription\": \"sub_1MMjfcEF14jWlYDwi8JWFcxw\",\n \"subscription_item\": \"si_NDJ98tQrCcviJf\",\n \"tax_amounts\": [\n\n ],\n \"tax_rates\": [\n\n ],\n \"type\": \"subscription\",\n \"unit_amount_excluding_tax\": \"2500\"\n }\n ],\n \"has_more\": false,\n \"total_count\": 2,\n \"url\": \"/v1/invoices/in_1MSsXfEF14jWlYDwxHKk4ASA/lines\"\n },\n \"livemode\": false,\n \"metadata\": ", " },\n \"next_payment_attempt\": null,\n 
\"number\": \"04FC1917-0008\",\n \"on_behalf_of\": null,\n \"paid\": true,\n \"paid_out_of_band\": false,\n \"payment_intent\": null,\n \"payment_settings\": ", " },\n \"period_end\": 1674351619,\n \"period_start\": 1674351619,\n \"post_payment_credit_notes_amount\": 0,\n \"pre_payment_credit_notes_amount\": 0,\n \"quote\": null,\n \"receipt_number\": null,\n \"rendering_options\": null,\n \"starting_balance\": 0,\n \"statement_descriptor\": null,\n \"status\": \"paid\",\n \"status_transitions\": ", " },\n \"subscription\": \"sub_1MMjfcEF14jWlYDwi8JWFcxw\",\n \"subtotal\": -11804,\n \"subtotal_excluding_tax\": -11804,\n \"tax\": null,\n \"test_clock\": null,\n \"total\": -11804,\n \"total_discount_amounts\": [\n\n ],\n \"total_excluding_tax\": -11804,\n \"total_tax_amounts\": [\n\n ],\n \"transfer_data\": null,\n \"webhooks_delivered_at\": 1674351619\n }\n },\n \"livemode\": false,\n \"pending_webhooks\": 1,\n \"request\": ", " },\n \"type\": \"invoice.paid\"\n}" ]

But enough of what's present in the payload; it's what's absent that especially struck me: no obvious XSS patterns, no SQL injection, nor any other suspicious-looking strings. The request looked totally benign, so why did it trigger the rule?

I wanted to compare the payload of a blocked request with a similar request that wasn't blocked, but they're only logged at Cloudflare when they trigger a rule. No problem, it's easy to grab the full request from Stripe's webhook history so I found one that passed and one that failed and diff'd them both:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

This clearly isn't the full 200 lines, but it's a very similar story over the remainder of the files; tiny differences largely down to dates, IDs, and of course, the customers themselves. No suspicious patterns, no funky characters, nothing visibly abnormal. It's a bit pointless to even mention it because they're near identical, but the payload on the left is the one that passed the firewall whilst the payload on the right was blocked.
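If you ever want to run the same comparison yourself, a minimal sketch along these lines does the job (the file names are just placeholders for two payloads saved out of Stripe's webhook history):

import difflib
import json

# Two webhook payloads saved from Stripe's webhook history (names are illustrative).
with open("webhook_passed.json") as f:
    passed = json.dumps(json.load(f), indent=2, sort_keys=True).splitlines()
with open("webhook_blocked.json") as f:
    blocked = json.dumps(json.load(f), indent=2, sort_keys=True).splitlines()

# Print only the differing lines so the tiny deltas (dates, IDs, customer details)
# stand out against the hundreds of identical ones.
for line in difflib.unified_diff(passed, blocked, fromfile="passed", tofile="blocked", lineterm=""):
    print(line)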

Next rabbit hole!

Cloudflare's Internal Rules Engine 🐰 🐰 🐰 🐰 🐰

Completely running out of ideas and options, focus moved to the folks inside Cloudflare who were already aware there was an issue:

We are actively looking into this and will likely release an update to the Cloudflare OWASP ruleset soon

— Michael Tremante (@MichaelTremante) January 20, 2023

What followed was a period of back and forth initially with Cloudflare, then Stripe as well with everyone trying to nut out exactly where things were going wrong. Essentially, the process went like this:

Is Cloudflare inadvertently blocking the requests?

Is the OWASP ruleset raising false positives?

Is Stripe issuing requests that are deemed to be malicious?

And round and round we went. At one point, Cloudflare identified a change in the OWASP ruleset which appeared to have resulted in their implementation inadvertently triggering the WAF. They rolled it back and... the same thing happened. We deferred back to Stripe on the assumption that something must have changed on their end, but they couldn't identify any change that would have any sort of material impact. We were stumped, but we also had an easy fix just one last rabbit hole away...

Fine Tuning the Cloudflare WAF 🐰 🐰 🐰 🐰 🐰 🐰

The joy of a managed firewall is that someone else takes all the rigmarole of looking after it away. I'm going to talk more about that in the summary shortly but clearly, that also creates risk as you're delegating control of traffic flow to someone else. Fortunately, Cloudflare gives you a load of configurability with their managed rules which makes it easy to add custom exceptions:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

This meant I could create a simple exception that was much more intelligent than the previous "just let all outbound Stripe IPs in" by filtering down to the specific path those webhooks were flowing in to:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰
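For anyone recreating something similar, the custom filter behind an exception like that boils down to a one-line expression in Cloudflare's rules language (the path here is purely illustrative rather than the real webhook endpoint):

(http.request.method eq "POST" and http.request.uri.path eq "/api/stripe-webhook")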

And finally, because sequence matters, I dragged that rule right up to the top of the pile so it would cause matching inbound requests to skip all the other rules:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

And finally, there were no more rabbits 😊

Lessons Learned

I know what you're thinking - "what was the actual root cause?" - and to be honest, I still don't know. I don't know if it was Cloudflare or OWASP or Stripe or if it even impacted other customers of these services and yes, that's a little frustrating. But I learned a bunch of stuff and for that alone, this was a worthwhile exercise. I took three big lessons away from it:

Firstly, understanding the plumbing of how all these bits work together is super important. I was lucky this wasn't a time critical issue and I had the luxury of learning without being under duress; how rules, payload inspection and exception management all work together is really valuable stuff to understand. And just like that, as if to underscore my first point, I found this right before hitting the publish button on the blog post:

Down the Cloudflare / Stripe / OWASP Rabbit Hole: A Tale of 6 Rabbits Deep 🐰 🐰 🐰 🐰 🐰 🐰

I added a couple more OWASP rules to the exception in Cloudflare (things like a MySQL rule that was adding 5 points), and we were back in business.

Secondly, I look at the managed WAF Cloudflare provides more favourably than I did before simply because I have a better understanding of how comprehensive it is. I want to write code and run apps on the web, that's my focus, and I want someone else to provide that additional layer on top that continuously adapts to block new and emerging threats. I want to understand it (and I now do, at least certainly better than before), but I don't want managing it day in and day out to be my job.

And finally, IMHO, Stripe needs a better mechanism to report on webhook failures:

In live mode you are notified after 3 days of trying. You can also query the events (https://t.co/0mujOPssV0) to create a running list of statuses on web hooks that have been sent and alert on that via your own app.

— Blake Krone (@blakekrone) January 19, 2023

Waiting until stuff breaks really isn't ideal and whilst I'm sure you could plug into the (very extensive) API ecosystem Stripe has, this feels like an easy feature for them to build in. So, Stripe friends, when you read this that's a big "yes" vote from me for some form of anomalous webhook response alerting.
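In the meantime, if you did want to roll your own along the lines Blake suggests, a rough sketch against Stripe's Python library might look something like this (the 24-hour window and the print statement standing in for real alerting are just placeholders):

import time
import stripe

stripe.api_key = "sk_test_..."  # placeholder key

# Look at events from the last day and flag any whose webhooks still haven't been delivered.
cutoff = int(time.time()) - 24 * 60 * 60
events = stripe.Event.list(limit=100, created={"gte": cutoff})

for event in events.auto_paging_iter():
    if event.pending_webhooks > 0:
        # Hook this into whatever alerting you already have (email, Slack, etc.)
        print(f"Undelivered webhook(s) for event {event.id} ({event.type})")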

This experience was equal parts frustration and fun and whilst the former is probably obvious, the latter is simply due to having an opportunity to learn something new that's a pretty important part of the service I run. May my frustrated fun story here make your life easier in the future if you face the same problems 😊

Pwned Passwords Adds NTLM Support to the Firehose

By Troy Hunt
Pwned Passwords Adds NTLM Support to the Firehose

I think I've pretty much captured it all in the title of this post but as of about a day ago, Pwned Passwords now has full parity between the SHA-1 hashes that have been there since day 1 and NTLM hashes. We always had both as a downloadable corpus but as of just over a year ago with the introduction of the FBI data feed, we stopped maintaining downloadable behemoths of data.

A little later, we added the downloader to make it easy to pull down the latest and greatest complete data set directly from the same API that so many of you have integrated into your own apps. But because we only had an API for SHA-1 hashes, the downloader couldn't grab the NTLM versions and increasingly, we had 2 corpuses well out of parity.

I don't know exactly why, but just over the last few weeks we've had a marked uptick in requests for an updated NTLM corpus. Obviously there's still a demand to run this against local Active Directory environments and clearly, the more up to date the hashes are the more effective they are at blocking the use of poor passwords.
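For anyone wanting to consume the NTLM hashes via the same k-anonymity range API the SHA-1 hashes use, a minimal sketch might look something like this (mode=ntlm is the documented query parameter for NTLM ranges; note that MD4, which NTLM is built on, may require OpenSSL's legacy provider on newer systems):

import hashlib
import urllib.request

def ntlm_pwned_count(password: str) -> int:
    # NTLM is MD4 over the UTF-16LE encoding of the password.
    ntlm = hashlib.new("md4", password.encode("utf-16le")).hexdigest().upper()
    prefix, suffix = ntlm[:5], ntlm[5:]

    # k-anonymity: only the first 5 hash characters ever leave this machine.
    url = f"https://api.pwnedpasswords.com/range/{prefix}?mode=ntlm"
    with urllib.request.urlopen(url) as response:
        body = response.read().decode()

    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate == suffix:
            return int(count)
    return 0

print(ntlm_pwned_count("P@ssw0rd"))  # a non-zero count means it's been seen in breaches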

So, Chief Pwned Passwords Wrangler Stefán Jökull Sigurðarson got to work and just went ahead and built it all for you. For free. In his spare time. As a community contribution. Seriously, have a look through the public GitHub repos and it's all his work ranging from the API to the Cloudflare Worker to the downloader so if you happen to come across him say, at NDC Oslo in a few months' time, show your appreciation and buy the guy a beer 🍺

Lastly, every time I look at how much this tool is being used, I'm a bit shocked at how big the numbers are getting:

Pwned Passwords Adds NTLM Support to the Firehose

That's well more than double the number of monthly requests from when I wrote the blog post about the FBI and NCA only just over a year ago, and I imagine that will only continue to increase, especially with today's announcement about NTLM hashes. Thank you to everyone that has taken this data and done great things with it, we're grateful that it's been put to good use and has undoubtedly helped an untold number of people to make better password choices 😊

Pwned or Bot

By Troy Hunt
Pwned or Bot

It's fascinating to see how creative people can get with breached data. Of course there's all the nasty stuff (phishing, identity theft, spam), but there are also some amazingly positive uses for data illegally taken from someone else's system. When I first built Have I Been Pwned (HIBP), my mantra was to "do good things after bad things happen". And arguably, it has, largely by enabling individuals and organisations to learn of their own personal exposure in breaches. However, the use cases go well beyond that and there's one I've been meaning to write about for a while now after hearing about it firsthand. For now, let's just call this approach "Pwned or Bot", and I'll set the scene with some background on another problem: sniping.

Think about Miley Cyrus as Hannah Montana (bear with me, I'm actually going somewhere with this!) putting on shows people would buy tickets to. We're talking loads of tickets as back in the day, her popularity was off the charts with demand well in excess of supply. Which, for enterprising individuals of ill-repute, presented an opportunity:

Ticketmaster, the exclusive ticket seller for the tour, sold out numerous shows within minutes, leaving many Hannah Montana fans out in the cold. Yet, often, moments after the shows went on sale, the secondary market flourished with tickets to those shows. The tickets, whose face value ranged from $21 to $66, were resold on StubHub for an average of $258, plus StubHub’s 25% commission (10% paid by the buyer, 15% by the seller).

This is called "sniping", where an individual jumps the queue and snaps up products in limited demand for their own personal gain and consequently, to the detriment of others. Tickets to entertainment events is one example of sniping, the same thing happens when other products launch with insufficient supply to meet demand, for example Nike shoes. These can be massively popular and, par for the course of this blog, released in short demand. This creates a marketplace for snipers, some of whom share their tradecraft via videos such as this one:

"BOTTER BOY NOVA" refers to himself as a "Sneaker botter" in the video and demonstrates a tool called "Better Nike Bot" (BnB) which sells for $200 plus a renewal fee of $60 every 6 months. But don't worry, he has a discount code! Seems like hackers aren't the only ones making money out of the misfortune of others.

Have a look at the video and watch how at about the 4:20 mark he talks about using proxies "to prevent Nike from flagging your accounts". He recommends using the same number of proxies as you have accounts, inevitably to avoid Nike's (automated) suspicions picking up on the anomaly of a single IP address signing up multiple times. Proxies themselves are a commercial enterprise but don't worry, BOTTER BOY NOVA has a discount code for them too!

The video continues to demonstrate how to configure the tool to ultimately blast Nike's service with attempts to purchase shoes, but it's at the 8:40 mark that we get to the crux of where I'm going with this:

Pwned or Bot

Using the tool, he's created a whole bunch of accounts in an attempt to maximise his chances of a successful purchase. These are obviously just samples in the screen cap above, but inevitably he'd usually go and register a bunch of new email addresses he could use specifically for this purpose.

Now, think of it from Nike's perspective: they've launched a new shoe and are seeing a whole heap of new registrations and purchase attempts. In amongst that lot are many genuine people... and this guy 👆 How can they weed him out such that snipers aren't snapping up the products at the expense of genuine customers? Keeping in mind tools like this are deliberately designed to avoid detection (remember the proxies?), it's a hard challenge to reliably separate the humans from the bots. But there's an indicator that's very easy to cross-check, and that's the occurrence of the email address in previous data breaches. Let me phrase it in simple terms:

We're all so comprehensively pwned that if an email address isn't pwned, there's a good chance it doesn't belong to a real human.

Hence, "Pwned or Bot" and this is precisely the methodology organisations have been using HIBP data for. With caveats:

If an email address hasn't been seen in a data breach before, it may be a newly created one especially for the purpose of gaming your system. It may also be legitimate and the owner has just been lucky to have not been pwned, or it may be that they're uniquely subaddressing their email addresses (although this is extremely rare) or even using a masked email address service such as the one 1Password provides through Fastmail. Absence of an email address in HIBP is not evidence of possible fraud, that's merely one possible explanation.

However, if an email address has been seen in a data breach before, we can say with a high degree of confidence that it did indeed exist at the time of that breach. For example, if it was in the LinkedIn breach of 2012 then you can conclude with great confidence that the address wasn't just set up for gaming your system. Breaches establish history and as unpleasant as they are to be a part of, they do actually serve a useful purpose in this capacity.

Think of breach history not as a binary proposition indicating the legitimacy of an email address, but rather as a means of assessing risk, with "pwned or bot" being just one of many factors. The best illustration I can give is how Stripe defines risk by assessing a multitude of fraud factors. Take this recent payment for HIBP's API key:

Pwned or Bot

There's a lot going on here and I won't run through it all, the main thing to take away from this is that in a risk evaluation rating scale from 0 to 100, this particular transaction rated a 77 which puts it in the "highest risk" bracket. Why? Let's just pick a few obvious reasons:

  1. The IP address had previously raised early fraud warnings
  2. The email address had only ever been seen on Stripe once before, and that was just 3 minutes ago
  3. The customer's name didn't match their email address
  4. Only 76% of transactions from the IP address had previously been authorised
  5. The customer's device had previously had 2 other cards associated with it

Any one of these fraud factors may not have been enough to block the transaction, but all combined it made the whole thing look rather fishy. Just as this risk factor also makes it look fishy:

Pwned or Bot

Applying "Pwned or Bot" to your own risk assessment is dead simple with the HIBP API and hopefully, this approach will help more people do precisely what HIBP is there for in the first place: to help "do good things after bad things happen".

Data Breach Misattribution, Acxiom & Live Ramp

By Troy Hunt
Data Breach Misattribution, Acxiom & Live Ramp

If you find your name and home address posted online, how do you know where it came from? Let's assume there's no further context given, it's just your legitimate personal data and it also includes your phone number, email address... and over 400 other fields of data. Where on earth did it come from? Now, imagine it's not just your record, but it's 246 million records. Welcome to my world.

This is a story about a massive corpus of data circulating widely within the hacking community and misattributed to a legitimate organisation. That organisation is Acxiom, and their business hinges on providing their customers with data on their customers. By the very nature of their business, they process large volumes of data that includes a broad set of personal attributes. By pure coincidence, there is nominal commonality between Acxiom’s records and the ones in the 246M corpus I mentioned earlier. But I'm jumping ahead to the conclusion, let's go back to the beginning:

Disclosure and Attribution Debunking

In June last year, I received an email from someone I trust who had sent me data for Have I Been Pwned (HIBP) in the past:

Have you seen Axciom [sic] data? It was just sent to us. Seems to being traded/sold on some forums. Have you received it yet? If not i can upload it for you. It's quite large tho, ~250M Records.

A corpus of data that size is particularly interesting as it impacts such a huge number of people. So, I reviewed the data and concluded... pretty much nothing. Looks legit, smells legit but there was absolutely nothing beyond the word of one person to tie it to Acxiom (and who knows who they got that word from). Burdened by other more immediately actionable data breaches, I filed it away until recently when that name popped up again, this time on a popular hacking forum:

Data Breach Misattribution, Acxiom & Live Ramp

It was referred to as "LiveRamp (Formerly Acxiom)" and before I go any further, let's just clarify the problem with that while you're looking at the image above: LiveRamp was previously a subsidiary of Acxiom, but that hasn't been the case since they separated businesses in 2018 so whoever put this together is referring back to a very old state of play. Regardless, those downloading it from the forum were clearly very excited about it. Seeing this for the second time and spreading far more broadly, I decided to reach out to the (alleged) source and ask Acxiom what was going on.

I dread this process - contacting an organisation about a breach - because I usually get either no response whatsoever or a standoffish one. Rarely do I find a receptive organisation willing to fully investigate an alleged incident, but that's exactly what I found on this occasion. Much of the reason why I wanted to write this post is because whilst I hate breached organisations not properly investigating an incident, I also hate seeing misattribution of a breach to an innocent party. That's a particularly sore point for me right now because of this incident just last week:

This is the dumbest infosec story I’ve read in… forever? It is so profoundly incorrect, poorly researched, never verified, rambling and indistinguishable from parody that I literally went looking for the parody reference. I think he’s actually serious! https://t.co/oLyIHxb8D3

— Troy Hunt (@troyhunt) November 15, 2022

I've had various public users of HIBP, commercial users and even governments reach out to ask what's going on because they were concerned about their data. Whilst this incident won't do HIBP any actual harm (and frankly, I'm stunned anyone took that story seriously), I can very easily see how misattribution can be damaging to an organisation, indeed that's a key reason why I invest so much effort into properly investigating these claims before putting anything into HIBP. But that ridiculous example is nothing compared to the amount of traction some misattributions get. Remember how just recently a couple of billion TikTok accounts had been "breached"? This made massive news headlines until...

The thread on the hacking forum with the samples of alleged TikTok data has been deleted and the user banned for “lying about data breaches” https://t.co/9ZKkKvu8JT

— Troy Hunt (@troyhunt) September 5, 2022

"Lying about data breaches". Ugh, criminals are so untrustworthy! This happens all the time and when I'm not sure of the origin of a substantial breach, I often write a blog post like this and on many occasions, the masses help establish the origin. So, here goes:

The Data

Let's jump into the data, starting with 2 of the most obvious things I look for in any new data breach:

  1. The total number of unique email addresses is 51,730,831 (many records don't have this field populated)
  2. The most recent data I can find is from mid-2020 (which also speaks to the inaccuracy of the LiveRamp association)

As to the aforementioned attributes, they total 410 different columns:

To my eye, this data is very generic and looks like a superset of information that may be collected across a large number of people. For example, the sort of data requested when filling out dodgy online competitions. However, unlike many large corpuses of aggregated data I've seen in the past, this one is... neat. For example, here's a little sample of the first 5 columns (redaction of some chars with a dash), note how the names are all uniformly presented:

120321486,4,BE-----,B,TAYLOR
120321487,2,JOY,M,----EY
120321466,1,DOYLE,E,------HAM
120321486,3,L----,,TAYLOR
120321486,2,R---,M,TAYLOR

Sure, this is just uppercasing characters but over and over again, I found data that was just too neat. The addresses. The phone numbers. Everything about it was far too curated to simply be text entered by humans. My suspicion is that it's likely a result of either a very refined collection process or in the case of addresses, matched using a service to resolve the human-entered address to a normalised form stored centrally.

Perhaps what I was most interested in though was the URL column as that seems to give some indication of where the data might have come from. I queried out the top 100 most common ones and took a look:

Eyeballing them, I couldn't help but feel that my earlier hunch was on the money - "dodgy online competitions". Not just competitions but a general theme of getting stuff for cheap or more specifically, services that look like they've been built to entice people to part with their personal data.

Take the first one, for example, DIRECTEDUCATIONCENTER.COM. That's a dead domain as of now but check out what it looked like in March last year:

Data Breach Misattribution, Acxiom & Live Ramp

"I may be contacted by trusted partners and others". What's "others"? Untrusted partners? 🤷‍♂️

Let's try the next one being originalcruisegiveaway.com and again, the site is now gone so it's back over to archive.org:

Data Breach Misattribution, Acxiom & Live Ramp

It's different, but somehow the same. Clicking through to the claim form, it seems the only way you can enter is if you agree to receive comms from all sorts of other parties:

Data Breach Misattribution, Acxiom & Live Ramp

Ok, one more, this time free-ukstuff.com which is also now a dead site, and not even indexed by archive.org. Next then, is findyourdegreenow.com which is - you're not gonna believe this - a dead site! Here's what it used to look like:

Data Breach Misattribution, Acxiom & Live Ramp

And again, it feels the same. Same same, but different.

To try and get a sense of how localised this data was, I queried out all the values in the "state" column. Is this a US-only data set? If that column is anything to go by, yes:

Something didn't add up when I first saw that and after a quick check of the population of each US state, it became immediately obvious: there's no California, the most populous state in the country. Nor Texas, the second most populous state. In fact, with only 35 rows there are a bunch of US states missing. Why? Who knows, the only thing I can say for sure is that this is a subset of the population with some glaring geographical omissions.

Then there's another curveball - what about the URL quickquid.co.uk, that doesn't look very US-centric. Heading over there redirects to casheuronetukadministration.grantthornton.co.uk which advises that as of last month, "The Administration of CashEuroNet UK, LLC has closed and the Joint Administrators have ceased to act". So something has obviously been wound up, wonder what was there originally? I had to go back a few years to find this:

Data Breach Misattribution, Acxiom & Live Ramp

To my mind, this is more of the same ilk in terms of a service targeted at people after quick money. But it's clearly all in GBP and with a .co.uk TLD, this being right after I've just said all the states are in the US, what gives? Back to the source data, filter the records down to that URL and sure enough, everyone has a US address. Grabbing a random selection of IP addresses had them all resolving to the US too, so I have absolutely no idea how this geographically inconsistent set of data came to be.

And that's really the theme across the data set when doing independent analysis - how is this so? What service or process could have pulled the data together in this way? Maybe the people who this data actually refers to will have the answers, let's go and ask them.

Responses From Impacted HIBP Subscribers

We're approaching 4.5M subscribers to HIBP's free notification service now which makes for a great corpus of people I can reach out to when doing breach verification. I grabbed a handful of addresses from this data set and asked them if they could help out. I sent those that responded positively their full record and asked some questions about the legitimacy of the data and where they thought it might have come from. Here's what they said:


1. The data is mostly accurate.

A few things are off, such as date of birth (could very well be a fake one I've entered before) and details of household members.

There are a lot of columns with single-letter values, which I can't verify without knowing what they mean.

But overall, it's quite accurate.

2. No idea where it came from, sorry. There is a URL in the third-to-last column, but it doesn't seem like a website I would have used before.


I looked through the csv file and couldn't find anything I recognized. I saw the names [redacted], [redacted] and [redacted]- I don't know anyone by those names. I live in Ontario, Canada, but addresses in the file were located in the united states.

Data says I have one child between the ages of 0 and 2, but that's not true - my only son is five. Birth date is wrong - my birthday is [redacted], but the file says [redacted].

There were a few urls in the file and I don't recognize any of them.

Not sure if this last thing is relevant or not. I sometimes get emails intended for other people. I searched my inboxes for the names [redacted] and [redacted]. Nothing came up for [redacted], but I do see an email for [redacted] from [redacted]. I searched through the csv to see if anything matched the data in the email (member number, confirmation number), but nothing matched.

I also noticed that although my email address ([redacted]) is in the csv data, there's also another email address ([redacted]) which is not mine.

I'm not sure if that's helpful or not, but if there's anything more I can do, let me know. :)


As far as name and address they are correct.  number of ppl living at the house has changed.  The other information I can't seem to understand what the information for example under column AQ row 2 it has a U and I don't know what the U is for.  I have noticed that some information is really outdated, so I wouldn't know where the data originated from.


Thank you for sharing, I took a look at the data, let me see if I can answer your questions:

1. While that is my email, the rest of the data actually belongs to an immediate family member. With the exception of a few outdated fields, the data on my family member is correct.

2. I am unfamiliar with Acxiom and am unsure of where this data originated from. I want to note that I have recently been doxxed and have reason to believe data breaches may have been used; however, the data you've provided here was not used in the attacks, to my knowledge.

Please let me know if you have any other questions, or if there is anything else I may do to help.


"Mostly accurate". The feeling I have when reading this is that whoever is responsible for this corpus of data has put it together from multiple sources and quite likely made some assumptions along the way. I can picture how that would happen; imagine trying to match various sources of data based on human-provided text fields in order to "enrich" the collection.

Analysis by Acxiom

This isn't the first time Acxiom has had to deal with misattribution, and they'd seen exactly the same data set passed around before. Think about it from their perspective: every time there's a claim like this they need to treat it as though it could be legitimate, because we've all seen what happens when an organisation brushes off a disclosure attempt (I could literally write a book about this!). Thus it becomes a burdensome process for them as they repeat the same analysis over and over again, each time drawing the same conclusion.

And what was that conclusion? Simply put, the circulating data didn't align with their own. They're in the best position of all of us to draw that conclusion as they have access to both data sets and whilst I suspect some people may retort with "how do you know you can trust them", not only do I not have a good reason to doubt their findings, I also don't have a good reason to attribute it to them. Every reference I've seen to Acxiom has been from whoever is handing the data around; I've been able to find absolutely nothing within the data set itself to tie it back to them. In almost all breaches I've processed, the truth is in the data and there's nothing here that points the finger at them.

I offered Acxiom the opportunity to further clarify their position with a statement which I've included in its entirety here:

“Acxiom has worked to build a reputation over the course of fifty years for having the highest standards around data privacy, data protection and security. In the past, questionable organizations have falsely attached our name to a data file in an attempt to create a deceitful sense of legitimacy for an asset. In every instance, Acxiom conducts an extensive analysis under our cyber incident response and privacy programs. These programs are guided by stakeholders including working with the appropriate authorities to inform them of these crimes.  The forensic review of the case that Troy has looked into, along with our continuous monitoring of security, means we can conclusively attest that the claims are indeed false and that the data, which has been readily available across multiple environments, does not come from Acxiom and is in no way the subject of an Acxiom breach.

Acxiom’s Commitment To Data Protection/ Data Privacy:
We value consumer privacy.   U.S. consumers who would like to know what information Acxiom has collected about them and either delete it or opt out of Acxiom’s marketing products, may visit acxiom.com/privacy for more information.”

Summary

The email addresses from the data set have now been loaded into HIBP and are searchable. One point of note that became evident after loading the data is that 94% of the email addresses had already been pwned. That's a very high number (a quick look through the HIBP Twitter feed shows the count is normally between 40% and 80%), and it suggests that this corpus of data may be at least partially constructed from other data already in circulation.

Because the question will inevitably come up, no, I won't send you your full record; I simply don't have the capacity to operate as a personal data lookup and delivery service. I know it's frustrating finding yourself in a breach like this and not being able to take any action. All you can really do at this point is treat it as another reminder of how our data spreads around the web, often without us having any idea about it.

Full disclosure: I have absolutely no commercial interest in Acxiom, no money has changed hands and I wasn't incentivised in any way, I just want everyone to have a much healthier suspicion when alleging the source of a data breach 🙂

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

By Troy Hunt
The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

A couple of weeks ago I wrote about some big changes afoot for Have I Been Pwned (HIBP), namely the introduction of annual billing and new rate limits. Today, it's finally here! These are two of the most eagerly awaited, most requested features on HIBP's UserVoice so it's great to see them finally knocked off after years of waiting. In implementing all this, there are changes to the existing "one size fits all" model so if you're using the HIBP API, please make sure you read this carefully and understand the impact (if any) on you. Here goes:

The Rate Limits and (Some) Pricing is Different

The launch blog post for the authenticated API explained the original rationale behind the $3.50 per month price and most importantly, how I wanted to ensure it didn't pose a barrier:

In choosing the $3.50 figure, I wanted to ensure it was a number that was inconsequential to a legitimate user of the service

As I said in the previous blog post, what I didn't understand at the time was that paradoxically, the low amount was a barrier to many organisations! But equally, it's made the API super accessible to the masses so that price stays. The rate limit, however, needed revisiting and to understand why, let's go back to the beginning:

The "1 request per 1,500ms" rate dated all the way back to 2016 where I'd initially attempted to combat abuse by applying the limit per IP. This was an entirely non-empirical, gut feel, "let's just try and fix the problem right now" decision and it was only very recently I actually started trawling through the data and looking at how the API was being consumed. 1 request every 1,500ms is a maximum of 57,600 requests in a day; here's the number of requests by the top 20 consumers of the service in a recent 24 hour mid-week period:

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

Keeping in mind that you're never going to achieve the full 57,600 requests in a day as you'd have to time every single one of them perfectly so as not to hit the rate limit, only 1 subscriber even achieved half that potential. In fact, only 9 subscribers achieved even a quarter of the potential with everyone else very quickly falling back to a small fraction of even that. To be fair, I'm conscious that I'm taking a full day of data and talking about requests as if they were evenly distributed across the entire period when there are inevitably use cases where it's more a short burst rather than a prolonged, even distribution. Regardless, what the data is saying is that the default "one size fits all" rate limit is way above and beyond what almost every single subscriber is actually consuming, and by a significant order of magnitude too. In a way, what we ended up with is the little guys subsidising the big guys.

The bottom line is that we're simultaneously adding a bunch of higher rate limits whilst reducing the entry level rate limit. It's easier if you see it all in context so let's just jump straight into the pricing (all in USD):

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

This is from Stripe's embeddable pricing table I mentioned in the previous post and it's what you see when you first sign up for a key. With new limits, it's easier to talk about "requests per minute" or RPM so that's the nomenclature we're sticking with now. That entry level 10RPM model will work for well in excess of 90% of current subscribers and it's only a very small percentage of the existing subscriber base exceeding it. (And yes, again, I know these requests are sometimes made in bursts but even still, 10RPM is far in excess of the vast majority of use cases.)

There are economies of scale that have been factored in here. Going from 10RPM to 100RPM isn't a 10x increase in price, it's about a 7x increase. Going to 5 times more requests is only 4 times the price, and so on and so forth. The hope is that this makes it easier for the folks who were previously buying multiple keys to justify scratching all the kludge previously used to do that and replacing it with a single key at a higher RPM.

To get to this outcome, we trawled back through heaps of data ranging from the high-level aggregated stats in the earlier chart to the nature of the organisations buying multiple keys (which we can obviously determine based on the email address used). I also chatted with a bunch of API users both during this process and over the preceding years and have a pretty good sense of the use cases. A few trends became immediately clear:

Firstly, use cases that are genuinely personal have a very low rate limit requirement. Checking your own address(es) or those of your family via a custom app, for example. Or one of my favourite uses (and one I definitely use), the Home Assistant integration:

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

On an ongoing basis, HA makes 1 request every 15 minutes. That's all. Each time we looked at genuine personal use cases, 10RPM was plenty.

Next, we found a bunch of use cases used within internal corporate environments, for example to monitor staff exposure in breaches. Now we're talking larger numbers of requests, but it's also something that's way more efficiently done via the existing domain search feature on the website. It's an on-demand, self-service and totally free feature that's been there for years. I know it's not API-based and there are good reasons for that (see the comment from me on that idea), but there's also the Enterprise route if API access is really that important (more on that later). Other examples included things like scanning customer emails to assess exposure at points where, for example, account takeover was a risk. In each of these cases, we're primarily talking about business entities using the service and I'm comfortable with commercial ventures wearing a greater cost.

And finally, there were the "heavy hitters", the ones with large volumes of keys. One such example using the API en masse provides security services to the big end of town and was funded to the tune of a figure that looks like a phone number. And again, I'm perfectly comfortable with them wearing a cost that's more commensurate to the value as opposed to a figure that was originally arrived at just to keep the bad guys out.

Existing Subscribers are Grandfathered in for 60 Days

Before I talk about the annual pricing, I want to make sure this headline is clear. Nothing changes for existing subscribers until the 6th of Jan next year, which is 60 days from today. On that date, the legacy rate limit of 1 request every 1,500ms will roll to the new 10RPM limit at exactly the same price. For that handful of big users for whom the 10RPM limit will be insufficient, you've got a couple of months to work out the best path forward. I'll be emailing every single active subscriber today to ensure everyone is notified well in advance (there's also an updated Terms of Use which requires a notification email to be sent).

What does this mean in practical terms? If you want annual billing or a higher rate limit, you can go and implement that whenever you're ready (more on that soon). Alternatively, if you just want to stick with 10 RPM then you don't have to do anything, nothing will change. What I do strongly suggest though (and this hasn't changed, it's always been the guidance), is to make sure you're handling HTTP 429 responses gracefully. Regardless of what your rate limit is, if you're consuming the API in a fashion where you're not directly controlling the rate yourself, make sure you handle those responses appropriately.
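For what it's worth, a minimal sketch of that graceful handling might look something like this (the retry cap is arbitrary; the retry-after header is what the API returns alongside a 429):

import time
import urllib.error
import urllib.request

def get_with_backoff(url: str, headers: dict, max_attempts: int = 5) -> bytes:
    # Retry politely when the API says we're going too fast (HTTP 429),
    # honouring the retry-after header it sends back.
    for attempt in range(max_attempts):
        request = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(request) as response:
                return response.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            wait = int(e.headers.get("retry-after", "2"))
            time.sleep(wait)
    raise RuntimeError("Still rate limited after retries")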

Billing Can Now Occur Annually

This is the easy one to explain: annual payments are now a thing 😊 As I explained in the previous blog post, frequent payments of small amounts can play havoc with reimbursements in the corporate environment. It sucks, I've been there, but it is what it is. Annual billing alleviates that through a combination of a 12x reduction in the frequency of an expense claim and a larger single sum that's easier to explain to your procurement people than $3.50.

So, what do you charge for annual rather than monthly billing? My initial temptation was just to make it literally 12 times more because I don't have a lot of patience for spivvy marketing guff. However, there's a valid case to be made that a 12x reduction on individual payments warrants a discount as it removes overhead from our end (there's a constant percentage of all payments that are disputed or fail or cause other demands on our time), plus there's an argument to be made along the lines of customer loyalty warranting a discount. There's also just the very simple mathematics of the whole thing, best illustrated by a recent payment in Stripe:

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

That's 8.5% that disappears on every transaction, largely due to the 30c AUD charge no matter what the price of the transaction is:

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

The point is that there's merit for all in incentivising annual rather than monthly payments. We decided to look at what a typical annual discount was and time and time again, found the same thing:

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing
The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing
The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

Or in other words, a couple of months for free when you sign up for a year. In fact, coincidentally, that's exactly what I just signed up for with Nabu Casa (Home Assistant cloud) after receiving an email saying annual billing was now available 😊

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

It's never exactly 17%, rather it's like each example took 17% off 12 months' worth of a normal monthly fee then moved the number to something that looked pretty 🙂 Some examples were less (Pluralsight is 14%) and others were more (the higher tiers of Zendesk are 20%), but ultimately we decided to work to that 17% number and came up with the following:

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

In keeping with the "pay for 10, get 12" theme, these prices are exactly 10 times the monthly ones. Easy peasy.

Stripe Customer Portal Magic Makes Changing Plans Easy

As I mentioned in the "big changes ahead" blog post, I've been deleting code like crazy in favour of deferring more processing back to Stripe themselves. By using their Customer Portal paradigm, it's now easy to change an existing plan:

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

The change can be to a different rate limit or to a different renewal cadence:

The Have I Been Pwned API Now Has Different Rate Limits and Annual Billing

Stripe automatically prorates everything too so whilst you can upgrade immediately to a higher RPM or from monthly to annually, you'll only pay for the difference between the previous plan and the new one. Or, you can downgrade and on next renewal the lower plan will be automatically applied. It's super simple and it's all self-service.

Enterprise

For more than 7 years now, a small handful of organisations have used HIBP in a larger scale commercial fashion. Some of them you're familiar with, for example both 1Password and Mozilla do email address searches using k-Anonymity and that's not something that's a self-service "put your card into Stripe" sort of model (in part because k-Anonymity returns a huge number of results for each search). Infosec firms use Enterprise to support customers via domain level API searches. Identity theft companies use it to advise customers when they're exposed in a breach. One firm even uses it to help detect bot signups; it turns out that so many of us are so pwned, if someone signs up for their service and they're not pwned, that's a little bit suspicious (that's just one of many indicators they use).

This is a fundamentally different model, one that involves a close working relationship, lots of legal documents, procurement people, invoicing instead of credit cards and all sorts of other "Enterprisey" things. That still exists and nothing in today's blog post changes that. I mention this now in today's post simply because some of the folks from those organisations with Enterprise subscriptions will read this post and wonder where they sit. Likewise, I suspect those "100+ key" subscribers of the public API really should be on Enterprise and I'll be reaching out to them separately given the rate limit change will have a bigger impact on them than most.

In Closing

For that vast majority of users who are only at a fraction of the old rate limit, nothing changes other than there now being a key available for 17% less than before on an annual subscription. Meanwhile, for the folks battling corporate bureaucracy around small, frequent payments, this will sort you out and give you choices around rate limits you didn't have before.

There will be some people that fall between the cracks of the use cases outlined above and won't be happy with the changes. I expect that - I know it will happen - but I hope the rationale outlined here demonstrates the volume of thought and consideration that has gone into trying to find the sweet spot for pricing and rate limits. I also expect people will ask about adding other rate limits, for example to fill the gaps between say, 100RPM and 500RPM. We started out with more options, but a combination of that creating the whole paradox of choice problem and deeper analysis of how the API was actually being used led us to simplify things. But who knows over the longer term, feedback is certainly welcome.

Lastly, if you're watching closely, you'll notice a lot more structure going in around the way HIBP is run. Last week I wrote about rolling out Zendesk for support so there's now a formal ticketing system in place. I also explained how Charlotte is playing a very active role in the management of HIBP and in the coming months, you'll see more around other initiatives to make the project more sustainable. I'm thinking of it like this: what must HIBP do to be sustainable in a post-Troy world? Or in other words, how can we get what has increasingly become an essential service for so many to be more robust and more self-sustaining beyond what one person can do as a sole operator devoting spare time to a passion project.

Stay tuned, there's much more to come 🙂

Better Supporting the Have I Been Pwned API with Zendesk

By Troy Hunt
Better Supporting the Have I Been Pwned API with Zendesk

I've been investing a heap of time into Have I Been Pwned (HIBP) lately, ranging from all the usual stuff (namely trawling through masses of data breaches) to all new stuff, in particular expanding and enhancing the public API. The API is actually pretty simple: plug in an email address, get a result, and that's a very clearly documented process. But where things get more nuanced is when people pay money for it because suddenly, there are different expectations. For example, how do you cancel a subscription once it's started? You could read the instructions when signing up for a key, but who remembers what they read months ago? There's also a greater expectation of support for everything from how to construct an API request to what to do when you keep getting 429 responses because you're (allegedly) making too many requests. And yes, some of these queries are, um, "basic", but they're still things people want support with.

In the beginning, all emails from HIBP came from noreply@haveibeenpwned.com because I simply wasn't geared up to provide support. In my naivety, I assumed people would see "noreply" and not reply. Instead, they'd send email to that address, get frustrated when there was no reply (from the "noreply" address...) and seek out my personal contact info. Or they'd lodge a dispute with Stripe because they'd emailed noreply@ asking for their subscription to be cancelled and it wasn't. So, back in September I started looking for a better solution:

I’m thinking of setting up a more formal support process for @haveibeenpwned, especially for folks buying API keys and having queries around billing or implementation. Any suggestions on a service? Something that can triage requests, perhaps also have FAQs. Thoughts?

— Troy Hunt (@troyhunt) September 29, 2022

This was a non-trivial exercise. We've all used support services before, so we have an idea of what to expect from an end user perspective, but it's a different story once you dive into all the management bits behind them. Frankly, I find this sort of thing mind-numbing but fortunately it's a task my amazing wife Charlotte picked up with gusto. She has become increasingly involved in all things troyhunt.com and HIBP lately as she brings order, calm and frankly, much needed sanity into my otherwise crazy, demanding professional life. We also figured that if we did this right, she'd be able to handle a lot of the support queries I previously did myself, so she was always going to play a big part in choosing the support platform.

Largely based on Charlotte's work, we settled on Zendesk and about a week ago, silently pushed out support.haveibeenpwned.com:

Better Supporting the Have I Been Pwned API with Zendesk

There are FAQs that cover a bunch of frequent questions, troubleshooting that addresses common problems and, of course, the ability to submit a request if you still need help. These are all a work in progress, and we'll add a lot more content in response to queries, just so long as they're about the right thing. Speaking of which:

This service is only for users of the public commercial API key, not for general HIBP queries.

Why? Because I constantly get queries like this:

Uh… and why am I sleeping during the day?! pic.twitter.com/BUGTJtgl7t

— Troy Hunt (@troyhunt) November 1, 2022

Is that even a query?! I don't know! But I do know that someone took the time to track down my personal email address this week and send it to me, and it's not the sort of thing we're going to be responding to on Zendesk. Nor are queries along the lines of the following:

I've been pwned, now what?

Or:

How do I remove my data from data breaches?

Or one of my personal favourites:

I demand you delete all my data from the data breaches or you'll get a letter from my lawyer!

This whole data breach landscape is a foreign concept for many people, and I understand there being questions, but Charlotte and I can't simultaneously run a free service and reply to queries like this from the masses. But the queries that come in via Zendesk are something we can manage as it's clearly scoped, there's lots of supporting docs and for the most part, we're dealing with tech professionals who understand this world a bit better than your average punter in the first place.

As I announced in last week's blog post, we're pushing ahead with new rate limits and annual billing for the API key and getting this piece out first was always an important prerequisite. It's all part of gearing up for bigger things ahead for HIBP 😊

Big Changes are Afoot: Expanding and Enhancing the Have I Been Pwned API

By Troy Hunt
Big Changes are Afoot: Expanding and Enhancing the Have I Been Pwned API

Just over 3 years ago now, I sat down at a makeshift desk (ok, so it was a kitchen table) in an Airbnb in Oslo and built the authenticated API for Have I Been Pwned (HIBP). As I explained at the time, the primary goal was to combat abuse of the service and by adding the need to supply a credit card, my theory was that the bad guys would be very reluctant to, well, be bad guys. The theory checked out, and now with the benefit of several years of data, I can confidently say abuse is near non-existent. I just don't see it. Which is awesome 😊

But there were other things I also didn't see, and it's taken a while for me to get around to addressing them. Some of them are fixed now (like right now, already in production), and some of them will be fixed very, very soon. I think it's all pretty cool, let me explain:

Payments Can Be Hard... if You Don't Stripe Right

A little more background will help me explain this better: in the opening sentence of this blog post I mentioned building the original authenticated API out on a kitchen table at an Airbnb in Oslo. By that time, everyone knew I was going through an M&A process with HIBP I called Project Svalbard, which ultimately failed. What most people didn't know at the time was the other very stressful goings on in my life which combined, had me on a crazy rollercoaster ride I had little control over. It was in that environment that I created the authenticated API, complete with the Azure API Management (APIM) component and Stripe integration. It was rough, and I wish I'd done it better. Now, I have.

In the beginning, I pushed as much of the payment processing as possible to the HIBP website. This was due to a combination of me wanting to create a slick UX and frankly, not understanding Stripe's own UI paradigms. It looked like this:

Big Changes are Afoot: Expanding and Enhancing the Have I Been Pwned API

Cards never ended up hitting HIBP directly, rather the site did a dance with Stripe that involved the card data going to them directly from the client side, a token coming back and then that being used for the processing. It worked, but it had numerous problems ranging from lack of support for things like 3D Secure payments and other payment mechanisms such as Google Pay and Apple Pay, to the increasingly large amounts of plumbing required to tie it all together. For example, there were hundreds of lines of code on my end to process payments, change the default card and show a list of previous receipts. The Stripe APIs are extraordinarily clever, but I couldn't escape writing large troves of my own code to make it work the way I originally designed it.

Two new things from Stripe since I originally wrote the code have opened up a whole new way of doing this:

  1. Customer Portal: This is a fully hosted environment where payments are made, cards and subscriptions are managed, invoices and receipts are retrieved and basically, a huge amount of the work I'd previously hand-built can be managed by them rather than by me
  2. Embeddable Pricing Table: This brings the products and prices defined in Stripe into the UI of third party services (such as HIBP) such that customers can select their product then head off to Stripe and do the purchasing there

Rolling to these services removed a huge amount of code from HIBP with the bulk of what's left being email address verification, API key management and handling callbacks from Stripe when a payment is successful. What all this means is that when you first create a subscription, after verifying your email address, you see these two screens:

Big Changes are Afoot: Expanding and Enhancing the Have I Been Pwned API
Big Changes are Afoot: Expanding and Enhancing the Have I Been Pwned API

That's the embeddable pricing table followed by Stripe's own hosted payment page. I left the browser address bar in the latter to highlight that this is served by Stripe rather than HIBP. I love distancing myself from any sort of card processing and what's more, everything to do with actually taking the payment is now Stripe's problem 😊 If you're interested in the mechanics of this, a successful payment calls a webhook on HIBP with the customer's details which updates their account with a month of API key whilst the screen above redirects them over to the HIBP website where they can grab their key. Easy peasy.
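For anyone curious what that flow looks like in code, here's a rough sketch only (Flask, the endpoint path and the key-extension helper are all illustrative, not HIBP's actual implementation):

import os
import stripe
from flask import Flask, request, abort

app = Flask(__name__)
endpoint_secret = os.environ["STRIPE_WEBHOOK_SECRET"]

def extend_api_key(customer_id: str, months: int) -> None:
    # Placeholder: this is where the subscriber's key validity would be extended.
    pass

@app.post("/stripe/webhook")  # illustrative path
def stripe_webhook():
    payload = request.get_data()
    signature = request.headers.get("Stripe-Signature", "")
    try:
        # Verify the event really came from Stripe before trusting it.
        event = stripe.Webhook.construct_event(payload, signature, endpoint_secret)
    except (ValueError, stripe.error.SignatureVerificationError):
        abort(400)

    if event["type"] == "invoice.paid":
        customer_id = event["data"]["object"]["customer"]
        extend_api_key(customer_id, months=1)

    return "", 200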

I silently rolled this out a week ago, watched it being used, made a few little tweaks and then waited until now to write about it. The rollout coincided with a typical email I've received so many times before:

First of all I would like to thank you for the wonderful service that helps people to keep track of their email breaches. I was trying to build a product to provide your services via my website, something similar to Firefox, avast and 100's of other companies doing. We were trying to do it according to the guidelines mentioned in the website. However I am not able to renew my purchase due to payment gateway failures at stripe payment. Requesting you to kindly check the same and advise me on alternate methods for making the payment.

The old model often caused payments to be rejected, especially from subscribers in India. The painful thing for me when trying to help folks is that Stripe would simply report the failed payment as follows:

Screenshot: how Stripe reported the failed payment.

However, going back to the individual who raised the query above after rolling out this update, things changed very dramatically:

Screenshot: the same subscriber's payment succeeding after the change.

To the title of this section, I simply wasn't "Striping" right. I'm sure there's a way with enough plumbing that it's feasible, but why bother? I cut hundreds of lines of code out just by delegating more of the workload back to them. Further, with ever-tightening PCI DSS standards (read Scott's piece, interesting stuff), the less I have to do with cards, the better.

This was a "penny drop" moment for me and it's already made a big difference in a positive way. But there's another penny that dropped for me at the same time: one-off keys were an unnecessary problem.

There Are No More One-Off Keys

It was at the moment I was ripping out those hundreds of lines of code that I wondered: why do I have all the additional kludge to support the paradigm of a one-off key that only lasts a month? Why had I built (and was now maintaining) server side code to handle different types of purchases and UX paradigms to represent one-off versus recurring keys? My gut feel was that most payments formed part of an ongoing subscription but hey, who needs gut feels when you have real data?! So I pulled the numbers:

Only 7% of payments were one-offs, with 93% of payments forming part of ongoing subscriptions.

And so I killed the one-off keys. Kinda, because you can still have a key for only one month; you just purchase a monthly subscription then immediately cancel it via the Stripe Customer Portal:

Screenshot: cancelling the subscription from the Stripe Customer Portal.

That's linked to from the API key dashboard on HIBP and it'll take all of 5 seconds to do (also note the ability to change payment method directly on the Stripe site). I've added text to that effect on the HIBP website (you may have spotted that in the earlier screen cap), so in practice, the ability to purchase a one-off key is still there and the main upside of this is that I've just killed a trove of code I no longer have to worry about 🙂 Because this is the internet, I'm sure someone will still be upset, but if you only want a key for a month then that capability still well and truly exists.

All of this so far amounts to doing the same things that were always there but better. Now let's talk about the all new stuff!

Annual Billing and Different Rate Limits are Coming... Very Soon!

The title is self-explanatory and "very soon" is in about 2 weeks from now 😎

Let me illustrate the first part of that title with a message I received recently:

Is there a way to procure a 10 year API key? Our client wants to use the Have I been Pwned plugin for [redacted service name]; however, the $3.50 monthly subscription is too small to go through procurement.

What's that saying about no good deed going unpunished? In my naivety, I made the pricing low, thinking that was a good thing, yet here we are with that posing barriers! This was a recurring message over and over again, with folks simply struggling to get their $3.50 reimbursed. I should have seen this coming after years of living the corporate life myself and filling out an untold number of expense reports (I have vivid flashbacks of how hard it was to get small sums reimbursed). Speaking of which, this was another recurring theme:

Is there a way to pay yearly for HIBP API access vs monthly?  Monthly adds overhead in paperwork.

And again, I get it, this is a painful process. It somehow feels even more painful due to the fact the sum is so low; how much time are people burning trying to justify $3.50 to their boss?! It's painful, and this likely explains why the request for annual payments is the second most requested idea on HIBP's UserVoice. The comments there speak for themselves, and I'm having corporate PTSD flashbacks just reading them again now!

Sticking with the UserVoice theme, the 5th most requested feature is for different pricing on different rate limits. This is mostly self-explanatory but what I wasn't aware of until I went and pulled the stats was just how many people were hacking around the rate limit problem. There are heaps of API accounts like this:

hibp+1@domain.com
hibp+2@domain.com
hibp+3@domain.com
...

Because there can only be one key per email address, organisations are creating heaps of unique sub-addressed emails in order to buy multiple keys. This would have been a manual, laborious process; there's no automated way to do this, quite the contrary, with anti-automation controls built into the process. Further, each key has its own rate limit, so I imagine they were also building a bunch of plumbing in the back end to then distribute requests across a collection of keys which, yeah, I get it, but man that seems like hard work! When I say "a collection of keys", I'm not just talking about a few of them either; the largest number of active in-use keys by a single organisation is 112. One hundred and twelve! The next largest is 110. I never expected that 🤯 (Incidentally, these orgs and the others obtaining multiple keys are all precisely the kinds I want using the API to do good things.)
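
If you're wondering what that back-end plumbing might look like, here's a hypothetical sketch of spreading lookups across a pool of keys against the documented v3 API; the key values are obviously placeholders and this is my guess at the pattern, not anyone's actual code:

import itertools
import time

import requests

# Placeholders for the keys bought against hibp+1@, hibp+2@ and so on.
API_KEYS = ["key-one", "key-two", "key-three"]
key_cycle = itertools.cycle(API_KEYS)

def breached_sites(account: str) -> list:
    # Rotate through the pool; each key carries its own rate limit.
    for _ in range(len(API_KEYS) * 2):
        headers = {"hibp-api-key": next(key_cycle), "user-agent": "example-client"}
        resp = requests.get(
            f"https://haveibeenpwned.com/api/v3/breachedaccount/{account}",
            headers=headers,
            timeout=10,
        )
        if resp.status_code == 404:   # address not found in any breach
            return []
        if resp.status_code == 429:   # this key is rate limited, move on to the next
            time.sleep(1.5)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError("all keys are currently rate limited")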

Building the mechanics of annual billing and different rate limits is only part of the challenge, and most of that is already done; the harder part is pricing it. I'm pulling troves of analytics from APIM at present to better understand the usage patterns, and it's quite interesting to see the data as it relates to requests for the API:

Chart: API request volumes reported by Azure API Management.

There's no persistent logging of the actual queries themselves, but APIM makes it easy to understand both the volume of queries and how many of them are successful versus failed, typically because they exceed the existing rate limit or were made with an invalid (likely expired) key. So, that's what I need to work out over the next couple of weeks when I'll launch everything and write it up, as always, in detail 🙂

Summary

The HIBP API has become an increasingly important part of all sorts of different tools and systems that use the data to help protect people impacted by data breaches. The changes I've pushed out over the last week help make the service more accessible and easier to manage, but it's the coming changes I'm most excited about. These are the ones that will make life so much easier on so many people integrating the service and, I sincerely hope, will enable them to do things that make a much more profound impact on all of us who've been pwned before.

Go and check out how the whole API key process works, I'd love to hear your feedback 😊

Smartphone Alternatives: Ease Your Way into Your Child’s First Phone

By McAfee

“But everyone else has one.” 

Those are familiar words to a parent, especially if you’re having the first smartphone conversation with your tween or pre-teen. In their mind, everyone else has a smartphone so they want one too. But does “everyone” really have one? Well, your child isn’t wrong.  

Our recent global study found that 76% of children aged 10 to 14 reported using a smartphone or mobile device, with Brazil leading the way at 95% and the U.S. trailing the global average at 65%.   

In other words, our figures show that children with smartphones and mobile devices make up a decisive majority of younger children overall. 

Of course, just because everyone else has a smartphone doesn’t mean that it’s necessarily right for your child and your family. After all, with a smartphone comes wide and practically unfettered access to the internet, apps, social media, instant messaging, texting, and gaming, all within nearly constant reach. Put plainly, some tweens and pre-teens simply aren’t ready for that just yet, whether in terms of their maturity, habits, or ability to care for and use a device like that responsibly. 

Yet from a parent’s standpoint, a first smartphone holds some major upsides. One of the top reasons parents give a child a smartphone is “to stay in touch,” and that’s understandable. There’s something reassuring about knowing that your child is a call or text away—and that you can keep tabs on their whereabouts with GPS tracking. Likewise, it’s good to know that they can reach you easily too. Arguably, that may be a reason why some parents end up giving their children a smartphone a little sooner than they otherwise would.  

However, you don’t need a smartphone to text, track, and talk with your child. You have alternatives. 

Smartphone alternatives 

One way to think about the first smartphone is that it’s something you ease into. In other words, if the internet is a pool, your child should learn to navigate the shallows with some simpler devices before diving into the deep end with a smartphone.  

Introducing technology and internet usage in steps can build familiarity and confidence for them while giving you control. You can oversee their development, while establishing rules and expectations along the way. Then, when the time is right, they can indeed get their first smartphone. 

But how to go about that? 

It seems a lot of parents have had the same idea and device manufacturers have listened. They’ve come up with smartphone alternatives that give kids the chance to wade into the mobile internet, allowing them to get comfortable with device ownership and safety over time without making the direct leap to a fully featured smartphone. Let’s look at some of those options, along with a few other long-standing alternatives. 

GPS trackers for kids 

These small and ruggedly designed devices can clip to a belt loop, backpack, or simply fit in a pocket, giving you the ability to see your child’s location. In all, it’s quite like the “find my” functionality we have on our smartphones. When it comes to GPS trackers for kids, you’ll find a range of options and form factors, along with different features such as an S.O.S. button, “geofencing” that can send you an alert when your child enters or leaves a specific area (like home or school), and how often it sends an updated location (to regulate battery life).  

Whichever GPS tracker you select, make sure it’s designed specifically for children. So-called “smart tags” designed to locate things like missing keys and wallets are just that—trackers designed to locate things, not children. 

Smart watches for kids 

With GPS tracking and many other communication-friendly features for families, smart watches can give parents the reassurance they’re looking for while giving kids a cool piece of tech that they can enjoy. The field of options is wide, to say the least. Smart watches for kids can range anywhere from devices offered by mobile carriers like Verizon, T-Mobile, and Vodafone to others from Apple, Explora, and Tick Talk. Because of that, you’ll want to do a bit of research to determine the right choice for you and your child.  

Typical features include restricted texting and calling, and you’ll find that some devices are more durable and more water resistant than others, while yet others have cameras and simple games. Along those lines, you can select a smart watch that has a setting for “school time” so that it doesn’t become a distraction in class. Also, you’ll want to look closely at battery life, as some appear to do a better job of holding a charge than others.  

Smartphones for kids 

Another relatively recent entry on the scene is smartphones designed specifically for children, which offer a great step toward full-blown smartphone ownership. These devices look, feel, and act like a smartphone, but without web browsing, app stores, and social media. Again, features will vary, yet there are ways kids can store and play music, stream it via Bluetooth to headphones or a speaker, and install apps that you approve of.  

Some are paired with a parental control app that allows you to introduce more and more features over time as your child grows and as you see fit—and that can screen texts from non-approved contacts before they reach your child. Again, a purchase like this one calls for some research, yet names like Gabb Wireless and the Pinwheel phone offer a starting point. 

The flip phone 

The old reliable. Rugged and compact, and typically with a healthy battery life to boot, flip phones do what you need them to—help you and your child keep in touch. They’re still an option, even if your child may balk at the idea of a phone that’s “not as cool as a smartphone.” However, if we’re talking about introducing mobile devices and the mobile internet to our children in steps, the flip phone remains in the mix.  

Some are just phones and nothing else, while other models can offer more functionality like cameras and slide-out keyboards for texting. And in keeping with the theme here, you’ll want to consider your options so you can pick the phone that has the features you want (and don’t want) for your child. 

Ease into that first smartphone 

Despite what your younger tween or pre-teen might think, there’s no rush to get that first smartphone. And you know it too. You have time. Time to take eventual smartphone ownership in steps, with a device that keeps you in touch and that still works great for your child.  

By easing into that first smartphone, you’ll find opportunities where you can monitor and guide their internet usage. You’ll also find plenty of moments to help your child start forming healthy habits around device ownership and care, etiquette, and safety online. In all, this approach can help you build a body of experience that will come in handy when that big day finally comes—first smartphone day. 

The post Smartphone Alternatives: Ease Your Way into Your Child’s First Phone appeared first on McAfee Blog.

The Security Pros and Cons of Using Email Aliases

By BrianKrebs

One way to tame your email inbox is to get in the habit of using unique email aliases when signing up for new accounts online. Adding a “+” character after the username portion of your email address — followed by a notation specific to the site you’re signing up at — lets you create an infinite number of unique email addresses tied to the same account. Aliases can help users detect breaches and fight spam. But not all websites allow aliases, and they can complicate account recovery. Here’s a look at the pros and cons of adopting a unique alias for each website.

What is an email alias? When you sign up at a site that requires an email address, think of a word or phrase that represents that site for you, and then add that prefaced by a “+” sign just to the left of the “@” sign in your email address. For instance, if I were signing up at example.com, I might give my email address as krebsonsecurity+example@gmail.com. Then, I simply go back to my inbox and create a corresponding folder called “Example,” along with a new filter that sends any email addressed to that alias to the Example folder.
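
If you’d rather script the alias than type it out each time, a tiny, purely illustrative helper might look like this:

def site_alias(address: str, site: str) -> str:
    # Insert "+site" just to the left of the "@" sign.
    user, domain = address.split("@", 1)
    return f"{user}+{site}@{domain}"

print(site_alias("krebsonsecurity@gmail.com", "example"))  # krebsonsecurity+example@gmail.com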

Importantly, you don’t ever use this alias anywhere else. That way, if anyone other than example.com starts sending email to it, it is reasonable to assume that example.com either shared your address with others or that it got hacked and relieved of that information. Indeed, security-minded readers have often alerted KrebsOnSecurity about spam to specific aliases that suggested a breach at some website, and usually they were right, even if the company that got hacked didn’t realize it at the time.

Alex Holden, founder of the Milwaukee-based cybersecurity consultancy Hold Security, said many threat actors will scrub their distribution lists of any aliases because there is a perception that these users are more security- and privacy-focused than normal users, and are thus more likely to report spam to their aliased addresses.

Holden said freshly-hacked databases also are often scrubbed of aliases before being sold in the underground, meaning the hackers will simply remove the aliased portion of the email address.

“I can tell you that certain threat groups have rules on ‘+*@’ email address deletion,” Holden said. “We just got the largest credentials cache ever — 1 billion new credentials to us — and most of that data is altered, with aliases removed. Modifying credential data for some threat groups is normal. They spend time trying to understand the database structure and removing any red flags.”

According to the breach tracking site HaveIBeenPwned.com, only about .03 percent of the breached records in circulation today include an alias.

Email aliases are rare enough that seeing just a few email addresses with the same alias in a breached database can make it trivial to identify which company likely got hacked and leaked said database. That’s because the most common aliases are simply the name of the website where the signup takes place, or some abbreviation or shorthand for it.

Hence, for a given database, if there are more than a handful of email addresses that have the same alias, the chances are good that whatever company or website corresponds to that alias has been hacked.

That might explain the actions of Allekabels, a large Dutch electronics web shop that suffered a data breach in 2021. Allekabels said a former employee had stolen data on 5,000 customers, and that those customers were then informed about the data breach by Allekabels.

But Dutch publication RTL Nieuws said it obtained a copy of the Allekabels user database from a hacker who was selling information on 3.6 million customers at the time, and found that the 5,000 number cited by the retailer corresponded to the number of customers who’d signed up using an alias. In essence, RTL argued, the company had notified only those most likely to notice and complain that their aliased addresses were suddenly receiving spam.

“RTL Nieuws has called more than thirty people from the database to check the leaked data,” the publication explained. “The customers with such a unique email address have all received a message from Allekabels that their data has been leaked – according to Allekabels they all happened to be among the 5000 data that this ex-employee had stolen.”

HaveIBeenPwned’s Hunt arrived at the conclusion that aliases account for about .03 percent of registered email addresses by studying the data leaked in the 2013 breach at Adobe, which affected at least 38 million users. Allekabels’s ratio of aliased users was considerably higher than Adobe’s — .14 percent — but then again European Internet users tend to be more privacy-conscious.

While overall adoption of email aliases is still quite low, that may be changing. Apple customers who use iCloud to sign up for new accounts online automatically are prompted to use Apple’s Hide My Email feature, which creates the account using a unique email address that automatically forwards to a personal inbox.

What are the downsides to using email aliases, apart from the hassle of setting them up? The biggest downer is that many sites won’t let you use a “+” sign in your email address, even though this functionality is clearly spelled out in the email standard.

Also, if you use aliases, it helps to have a reliable mnemonic to remember the alias used for each account (this is a non-issue if you create a new folder or rule for each alias). That’s because knowing the email address for an account is generally a prerequisite for resetting the account’s password, and if you can’t remember the alias you added way back when you signed up, you may have limited options for recovering access to that account if you at some point forget your password.

What about you, Dear Reader? Do you rely on email aliases? If so, have they been useful? Did I neglect to mention any pros or cons? Feel free to sound off in the comments below.

Welcoming the Polish Government to Have I Been Pwned

By Troy Hunt

Continuing the rollout of Have I Been Pwned (HIBP) to national governments around the world, today I'm very happy to welcome Poland to the service! The Polish CSIRT GOV is now the 34th government onboard the service and has free and open access to APIs allowing them to query their government domains.

Seeing the ongoing uptake of governments using HIBP to do useful things in the wake of data breaches is enormously fulfilling and I look forward to welcoming many more national CSIRTs in the future.

Understanding Have I Been Pwned's Use of SHA-1 and k-Anonymity

By Troy Hunt

Four and a half years ago now, I rolled out version 2 of HIBP's Pwned Passwords that implemented a really cool k-anonymity model courtesy of the brains at Cloudflare. Later in 2018, I did the same thing with the email address search feature used by Mozilla, 1Password and a handful of other paying subscribers. It works beautifully; it's ridiculously fast, efficient and above all, anonymous. Yet from time to time, I get messages along the lines of this:

Why are you using SHA-1? It's insecure and deprecated.

Or alternatively:

Our [insert title of person who fills out paperwork but has no technical understanding here] says that k-anonymity involves sending you PII.

Both these positions make no sense whatsoever when you peel back the covers and understand what's happening underneath, but I get how on face value these conclusions can be drawn. So, let's settle it here in a more complete fashion than what I can do via short tweets or brief emails.

SHA-1 is Just Fine for k-Anonymity

Let's begin with the actual problem SHA-1 presents. Actually, the multiple problems, the first of which is that it's just way too fast for storing user passwords in an online system. More than a decade ago now, I wrote about how Our Password Hashing Has no Clothes and in that post, showed the massive rate at which consumer-grade hardware can calculate these hashes and consequently "crack" the password. Since that time, Moore's Law has done its thing many times over making the proposition of SHA-1 (or SHA-256 or SHA-512) even worse than before. For a modern day reference of how you should be storing passwords, check out OWASP's Password Storage Cheat Sheet.
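
To make the "way too fast" point concrete, here's a rough Python comparison between a general-purpose hash and the kind of deliberately slow, memory-hard derivation function OWASP recommends for stored passwords; the scrypt parameters are illustrative rather than a recommendation, and the timings will vary by machine:

import hashlib
import os
import time

password = b"P@ssw0rd"
salt = os.urandom(16)

# A fast general-purpose hash: great for integrity checks, terrible for stored passwords.
start = time.perf_counter()
for _ in range(100_000):
    hashlib.sha1(password).hexdigest()
print(f"100,000 SHA-1 hashes: {time.perf_counter() - start:.2f}s")

# A deliberately slow, memory-hard KDF (illustrative parameters only).
start = time.perf_counter()
hashlib.scrypt(password, salt=salt, n=2**14, r=8, p=1)
print(f"one scrypt hash:      {time.perf_counter() - start:.2f}s")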

The other problem relates to how SHA-1 is used for integrity checks. Hashing algorithms provide an efficient means of comparing two files and establishing whether their contents are the same, due to the deterministic nature of the algorithm (the same input always produces the same output). If a trustworthy source says "the hash of the file is 3713..42" (shown in abbreviated form) then any file with that same hash is assumed to be the same as the one described by the trustworthy source. We use hashes all over the place for precisely this purpose; for example, if I wanted to download Windows 11 Business Editions from my MSDN subscription, I can refer to the hash Microsoft provides on the download page:

Screenshot: the SHA-256 hash Microsoft lists on the download page.

After download, I can then use a utility such as PowerShell's Get-FileHash to verify that the file I downloaded is indeed the same one listed above. (There's another rabbit hole we can go down about how you trust the hash above, but I'll leave that for another post.)
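
If PowerShell isn't your thing, the same verification takes a few lines of Python; the file name and expected hash below are placeholders for whatever the download page lists:

import hashlib

def sha256_of(path: str) -> str:
    # Stream the file in chunks so large ISOs never need to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

expected = "hash published on the download page"  # placeholder
print(sha256_of("Win11_Business.iso").lower() == expected.lower())  # placeholder file name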

We also use hashes when implementing subresource integrity (SRI) on websites to ensure external dependencies haven't been modified. Every time this very blog loads Font Awesome from Cloudflare's CDN, for example, it's verified against the hash in the integrity attribute of the script tag (view source for yourself).

And finally (although not exhaustively - there are many other places we use hashing algorithms in tech), we use hashing algorithms on digital certificate signatures. To pick another example from this blog, the certificate issued by Cloudflare uses SHA-256 as the signature hash algorithm:

Screenshot: this blog's certificate, showing SHA-256 as the signature hash algorithm.

But ponder this: if a hashing algorithm always produces a fixed length output (in the case of SHA-1, it's 40 hexadecimal characters), then there are a finite number of hashes in the world. In that SHA-1 example, the finite number is 16^40 as there are 16 possible values (0-9 and a-f) and 40 positions for them. But how many different input strings are there in the world? Infinite! So, there must be multiple input strings that produce the same output, and this is what we refer to as a "hash collision". It's possible for this to occur naturally, although it's exceedingly unlikely simply due to the massive number of possibilities 16^40 presents. However, what if you could manufacture a hash collision? I mean what if you could take an existing hash for an existing document and say "I'm going to create my own document that's different but when passed through SHA-1, produces the same hash!"?

Half a decade ago now, Google researchers demonstrated precisely this with their SHAttered attack. Their simple infographic tells the story:

Infographic: Google's SHAttered attack, producing two different documents with the same SHA-1 hash.

And this is the heart of the integrity problem with SHA-1: it's simply past its use-by date as an algorithm we can be confident in. That's why the signature hash algorithm of the TLS cert on this blog uses SHA-256 instead, among other examples of where we've eschewed the weaker algorithm in favour of stronger variants.

So, now that you understand the problem with SHA-1, let's look at how it's used in HIBP and why it isn't a problem there. There are actually 2 reasons, and I'll start with a sample of passwords used in Pwned Passwords:

P@ssw0rd
abc123
635,someone@example.com,+61430978216,37 example street
money
qwerty

That middle line isn't a password, it's a parsing problem. Not necessarily my parsing problem, it just turns out that you can't always trust hackers to dump breached data in a clean format 🤷‍♂️ So, instead of providing passwords to people in plain text format, I provide them as SHA-1 hashes:

21BD12DC183F740EE76F27B78EB39C8AD972A757
6367C48DD193D56EA7B0BAAD25B19455E529F5EE
A4DDCDA001E137C72FF8259F36BC67C5F9E083AA
C95259DE1FD719814DAEF8F1DC4BD64F9D885FF0
B1B3773A05C0ED0176787A4F1574FF0075F7521E

4 of those hashes are easily cracked (Google is great at that, just try searching for the first one) and that's just fine; nobody is put at risk by learning that some unidentified party used a common password. The 1 hash that won't yield any search results (until Google indexes this blog post...) is the middle one. The fact that SHA-1 is fast to calculate and has proven hash collision attacks against its integrity doesn't diminish the purpose it serves in protecting badly parsed data.

The second reason is best explained by walking through the process of how the API is queried. Let's take an example of someone signing up to a website with the following password:

P@ssw0rd

This will pass many password complexity criteria (uppercase, lowercase, number, non-alphanumeric character, 8 chars long) but is clearly terrible. Because they're signing up to a responsible website that checks Pwned Passwords on registration, that website now creates a SHA-1 hash of the provided password:

21BD12DC183F740EE76F27B78EB39C8AD972A757

Let's pause here for a sec: whether it's a hash of a password or a hash of an email address, what we're looking at is a pseudonymous representation of the original data. There's no anonymity of substance achieved here because in the specific case above, you can simply Google the hash, and in the case of an email address, you can determine with near certainty (hash collisions aside) whether a given plain text email address is the one used to generate the hash.

This, however, is a different story:

21BD1

This is the first 5 characters only of the hash and it's passed to the Pwned Passwords API as follows:

https://api.pwnedpasswords.com/range/21BD1

You can easily run this yourself and see the result but to summarise, the API then responds with 788 lines, including the following 5:

2D6980B9098804E7A83DC5831BFBAF3927F:1
2D8D1B3FAACCA6A3C6A91617B2FA32E2F57:1
2DC183F740EE76F27B78EB39C8AD972A757:83129
2DE4C0087846D223DBBCCF071614590F300:3
2DEA2B1D02714099E4B7A874B4364D518F6:1

What we're looking at here is the hash suffix of every hash that begins with 21BD1 followed by the number of times that password has been seen. Turns out that "P@ssw0rd" ain't a great choice as it's the one in the middle that's been seen over 83k times. The consumer of the Pwned Passwords service knows it's this one because when combined with the prefix, it's a perfect match to the full hash of the password. I'll touch more on the mathematical properties of this in a moment; for now, I want to explain the second reason why SHA-1 is used:

SHA-1 makes it very easy to segment the entire corpus of hashes into roughly equal-sized chunks that can be queried by prefix. As I already touched on, there are 16^5 different possible hash prefixes which is specifically 1,048,576 or "roughly a million". Not every hash prefix has exactly 788 associated suffixes; some have more and others less, but taking that as an average explains how the approximately 850M passwords in the service are divided down into a million smaller collections.

Why the first 5 characters? Because if it was the first 4 then each response would be 16 times larger and it would start hurting response times. If it was the first 6 then each response would be 16 times smaller and it would start hurting anonymity. 5 characters was the sweet spot between the two.
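
Pulling the mechanics above together, here's roughly all a consuming service has to do; a minimal Python sketch against the public range API (the function name and use of the requests library are mine, not from any particular client):

import hashlib

import requests

def pwned_count(password: str) -> int:
    # How many times a password appears in Pwned Passwords, via the k-anonymity range API.
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = digest[:5], digest[5:]
    # Only the 5-character prefix ever leaves this machine.
    resp = requests.get(f"https://api.pwnedpasswords.com/range/{prefix}", timeout=10)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        candidate, _, count = line.partition(":")
        if candidate == suffix:
            return int(count)
    return 0

print(pwned_count("P@ssw0rd"))  # a depressingly large number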

Why not SHA-256? Instead of 40 characters each hash would be 64 characters and whilst I could have achieved the same anonymity properties by still just using the first 5 characters of the hash, each suffix in the response would be an additional 24 characters and multiplying that 788 times over adds multiple kb to each response, even when compressed on the transport layer. It's also a slower hashing algorithm; still totally unsuitable for storing user passwords in an online system, but it can have a hit on the consuming service if doing huge amounts of calculations. And for what? Integrity doesn't matter because there's no value in modifying the source password to forge a colliding hash. You'd further increase the anonymity by a factor of 16^24, but then why not use SHA-512, which at 128 characters offers another 16^64 possibilities beyond even SHA-256? Because, as you'll read in the next section, even SHA-1 provides way more practical anonymity than you'll ever need anyway.
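
The back-of-envelope numbers behind those trade-offs are easy to reproduce, assuming the roughly 850M passwords mentioned above:

TOTAL_PASSWORDS = 850_000_000  # approximate corpus size quoted above

for prefix_len in (4, 5, 6):
    buckets = 16 ** prefix_len
    avg_suffixes = TOTAL_PASSWORDS / buckets
    print(
        f"prefix of {prefix_len}: {buckets:,} buckets, "
        f"~{avg_suffixes:,.0f} suffixes per response, "
        f"suffix length {40 - prefix_len} (SHA-1) vs {64 - prefix_len} (SHA-256)"
    )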

In summary, think of the choice of SHA-1 simply being to obfuscate poorly parsed input data to protect inadvertently included info, and as a means of dividing the collection of data down into nice easily segmentable and queryable collections. If your position is "SHA-1 is broken", then you simply don't understand its purpose here.

PII and the Protection Provided by k-Anonymity

Let's turn the discussion more to the privacy aspects of the email address search I mentioned earlier on. The principles are identical to the password search but for one difference in the technical implementation: queries are done on the first 6 characters of a SHA-1 hash, not the first 5. The reason is simple: there are a lot more email addresses in the system than passwords, about 5 billion in total. Querying via the first 6 characters of a SHA-1 hash means there are 16 times more possibilities than with the password search, therefore 16^6 or just over 16M. Let's take this email address:

test@example.com

Which hashes down to this value with SHA-1:

567159D622FFBB50B11B0EFD307BE358624A26EE

And similar to the password search, it's only the prefix that is sent to HIBP when performing a query:

567159

So, putting the privacy hat on, what's the risk when a service sends this data to HIBP? Mathematically, with the next 34 characters unknown, there are 16^34 different possible hashes that this prefix could belong to. Just to really labour the point, given a 6-character SHA-1 hash prefix you could take a 1 in 87,112,285,931,760,200,000,000,000,000,000,000,000,000 guess as to what the full hash is. And then due to the infinite number of potential input strings, multiply that number out to... well... infinity. That's the total number of possible email addresses it could represent. By any definition of the term, those first 6 characters tell you absolutely nothing useful about what email address is being searched for.
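
Both the prefix and that rather large number are easy to reproduce yourself:

import hashlib

email = "test@example.com"
digest = hashlib.sha1(email.encode("utf-8")).hexdigest().upper()

print(digest)      # 567159D622FFBB50B11B0EFD307BE358624A26EE
print(digest[:6])  # 567159, the only part ever sent to HIBP
print(f"{16**34:.3e} possible full hashes share that 6-character prefix")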

But we're left with a more semantic, possibly philosophical question: is "567159" personally identifiable information? In practice, no, for all intents and purposes it's impossible to tell who this belongs to without the remaining 34 characters and even then, you still need to be able to crack that hash, which is most likely only going to happen if you have a dictionary of email addresses to work through in which the given one appears. But it's derived from pseudonymous PII, and this is where the occasional [insert title of person who fills out paperwork but has no technical understanding here] loses their mind.

To explain this in more colloquial terms, it's like saying that the "t" at the beginning of the email address I used above is personally identifying. Really? My own email address begins with a "t", so it must be mine! It's a nonsense argument.

I'll wrap up with a definition and I like NIST's the best, not just because it's clear and concise but because they're a great authoritative source on this sort of thing (it was actually their guidance on prohibiting passwords from previous breach corpuses that led me to create Pwned Passwords in the first place):

Any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means.

Phone numbers are PII. Physical addresses are PII. IP addresses are PII. The first 6 characters of a SHA-1 hash of someone's email address is not PII.

Summary

None of the misunderstandings I've explained above have dented the adoption of these services. Pwned Passwords is now doing in excess of 2 billion queries a month and has an ongoing feed of new passwords directly from the FBI. The k-anonymity search for email addresses sees over 100M queries a month and is baked into everything from browsers to password managers to identity theft services. The success of these services isn't due to any technical genius on my part (hat-tip again to Cloudflare), but rather to their simple yet effective implementations that (almost) everyone can easily understand 😊
