A very early weekly update this time after an especially hectic week. The process with the couple of data breaches in particular was a real time sap and it shouldn't be this hard. Seriously, the amount of effort that goes into trying to get organisations to own their breach (or if they feel strongly enough about it, help attribute it to another party) is just nuts. It's not getting any better either. Regardless, listen to how these couple went and as always, if you've got any bright ideas about how to make this process less painful then I'd love to hear them.
How best to punish spammers? I give this topic a lot of thought because I spend a lot of time sifting through the endless rubbish they send me. And that's when it dawned on me: the punishment should fit the crime - robbing me of my time - which means that I, in turn, need to rob them of their time. With the smallest possible overhead on my time, of course. So, earlier this year I created Password Purgatory with the singular goal of putting spammers through the hellscape that is attempting to satisfy really nasty password complexity criteria. And I mean really nasty criteria, like much worse than you've ever seen before. I open-sourced it, took a bunch of PRs, built out the API to present increasingly inane password complexity criteria then left it at that. Until now because finally, it's live, working and devilishly beautiful.
This is the easy bit - I didn't have to do anything for this step! But let me put it into context and give you a real world sample:
Ugh. Nasty stuff, off to hell for them it is, and it all begins with filing the spam into a special folder called "Send Spammer to Password Purgatory":
That's the extent of work involved on a spam-by-spam basis, but let's peel back the covers and look at what happens next.
Microsoft Power Automate (previously "Microsoft Flow") is a really neat way of triggering a series of actions based on an event, and there's a whole lot of connectors built in to make life super easy. Easy on us as the devs, that is, less easy on the spammers because here's what happens as soon as I file an email in the aforementioned folder:
Using the built-in connector to my Microsoft 365 email account, the presence of a new email in that folder triggers a brand new instance of a flow. Following that, I've added the "HTTP" connector which enables me to make an outbound request:
All this request does is make a POST to an API on Password Purgatory called "create-hell". It passes an API key because I don't want just anyone making these requests as it will create data that will persist at Cloudflare. Speaking of which, let's look at what happens over there.
Let's start with some history: Back in the not too distant past, Cloudflare wasn't a host and instead would just reverse proxy requests through to origin services and do cool stuff with them along the way. This made adding HTTPS to any website easy (and free), added heaps of really neat WAF functionality and empowered us to do cool things with caching. But this was all in-transit coolness whilst the app logic, data and vast bulk of the codebase sat at that origin site. Cloudflare Workers started to change that and suddenly we had code on the edge running in hundreds of nodes around the world, nice and close to our visitors. Did that start to make Cloudflare a "host"? Hmm... but the data itself was still on the origin service (transient caching aside). Fast forward to now and there are multiple options to store data on Cloudflare's edges including their (presently beta) R2 service, Durable Objects, the (forthcoming) D1 SQL database and of most importance to this blogpost, Workers KV. Does this make them a host if you can now build entire apps within their environment? Maybe so, but let's skip the titles for now and focus on the code.
All the code I'm going to refer to here is open source and available in the public Password Purgatory Logger Github repo. Very early on in the index.js file that does all the work, you'll see a function called "createHell" which is called when the flow step above runs. That code creates a GUID then stores it in KV after which I can easily view it in the Cloudflare dashboard:
There's no value yet, just a key and it's returned via a JSON response in a property called "kvKey". To read that back in the flow, I need a "Parse JSON" step with a schema I generated from a sample:
At this point I now have a unique ID in persistent storage and it's available in the flow, which means it's time to send the spammer an email.
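To make the "create-hell" step a little more concrete, here's a minimal sketch of the idea in Python. It is not the code from the repo: the real thing is a Cloudflare Worker writing to Workers KV, whereas this uses an in-memory dict as a stand-in, and the function name, status codes and error body are all assumptions for illustration. Only the `kvKey` property name comes from the description above.

```python
import json
import uuid

# Stand-in for Workers KV (assumption: a plain dict, not Cloudflare's API).
kv_store = {}


def create_hell(provided_api_key, expected_api_key):
    """Hypothetical equivalent of the "create-hell" endpoint."""
    # Reject callers without the right key so strangers can't create entries.
    if provided_api_key != expected_api_key:
        return 401, json.dumps({"error": "unauthorised"})

    kv_key = str(uuid.uuid4())   # the GUID that identifies this spammer
    kv_store[kv_key] = None      # just a key for now, no value yet
    return 200, json.dumps({"kvKey": kv_key})


# What the flow does: call the endpoint, then parse "kvKey" out of the JSON.
status, body = create_hell("example-key", "example-key")
kv_key = json.loads(body)["kvKey"]
```

The parsed `kv_key` is the value that then gets fed into the hyperlink in the reply email.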
Because it would be rude not to respond, I'd like to send the spammer back an email and invite them to my very special registration form. To do this, I've grabbed the "Reply to email" connector and fed the kvKey through to a hyperlink:
It's an HTML email with the key hidden within the hyperlink tag so it doesn't look overtly weird. Using this connector means that when the email sends, it looks precisely like I've lovingly crafted it myself:
With the entire flow now executed, we can view the history of each step and see how the data moves between them:
Now, we play the waiting game.
Wasting spammer time in and of itself is good. Causing them pain by having them attempt to pass increasingly obtuse password complexity criteria is better. But the best thing - the pièce de résistance - is to log that pain and share it publicly for our collective entertainment.
So, by following the link the spammer ends up here (you're welcome to follow that link and have a play with it):
The kvKey is passed via the query string and the page invites the spammer to begin the process of becoming a partner. All they need to leave is an email address... and a password. That page then embeds 2 scripts from the Password Purgatory website, both of which you can find in the open source and public Github repository I created in the original blog post. Each attempt at creating an account sends off the password only to the original Password Purgatory API I created months ago, after which it responds with the next set of criteria. But each attempt also sends off both the criteria that was presented (none on the first go, then something increasingly bizarre on each subsequent go), the password they tried to use to satisfy the criteria and the kvKey so it can all be tied together. What that means is that the Cloudflare Workers KV entry created earlier gradually builds up as follows:
There are a couple of little conditions built into the code:
That's everything needed to lure the spammer in and record their pain, now for the really fun bit.
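The way each attempt gets tied back to the kvKey can be sketched roughly like this. Again, this is a Python illustration with a dict standing in for Workers KV, not the Worker code itself: the field names, the timestamp format and the choice to reject unknown keys are my assumptions, and the criteria text is just an example.

```python
from datetime import datetime, timezone

# Stand-in for Workers KV, pre-seeded with a key created earlier (hypothetical value).
attempt_log = {"example-kv-key": None}


def log_attempt(kv_key, criteria, password):
    """Append one password attempt to the entry for kv_key and return the attempt count."""
    if kv_key not in attempt_log:
        # Assumption: only keys created by the earlier step are accepted.
        raise KeyError(f"unknown kvKey: {kv_key}")

    entries = attempt_log[kv_key] or []
    entries.append({
        "criteria": criteria,   # None on the first go, nothing was presented yet
        "password": password,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    attempt_log[kv_key] = entries
    return len(entries)


first = log_attempt("example-kv-key", None, "password1")
second = log_attempt("example-kv-key", "Password must contain a US post code", "password90210")
```

Keeping a timestamp on every entry is what makes it possible to work out how long the spammer spent on the form.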
The very first time the spammer's password attempt is logged, the Cloudflare Worker sends me an email to let me know I have a new spammer hooked (this capability using MailChannels only launched this year):
It was so exciting getting this email yesterday, I swear it's the same sensation as literally getting a fish on your line! That link is one I can share to put the spammer's pain on display for the world to see. This is achieved with another Cloudflare Workers route that simply pulls out the logs for the given kvKey and formats it neatly in an HTML response:
Ah, satisfaction. I listed the amount of time the spammer burned with the goal of further refining the complexity criteria in the future to attempt to keep them "hooked" for longer. Is the requirement for a US post code in the password a bit too geographically specific, for example? Time will tell and I wholeheartedly welcome PRs to that effect in the original Password Purgatory API repo.
Oh - and just to ensure traction and exposure are maximised, there's a neatly formatted Twitter card that includes the last criteria and password used, you know, the ones that finally broke the spammer's spirit and caused them to give up:
Spammer burned a total of 80 seconds in Password Purgatory #PasswordPurgatory https://t.co/VwSCHNZ2AW
- Troy Hunt (@troyhunt) August 3, 2022
Clearly, I've taken a great deal of pleasure in messing with spammers and I hope you do too. I've gotta be honest - I've never been so excited to go through my junk mail! But I also thoroughly enjoyed putting this together with Power Automate and Workers KV, I think it's super cool that you can pull an app together like this with a combination of browser-based config plus code and storage that runs directly in hundreds of globally distributed edge nodes around the world. I hope the spammers appreciate just how elegant this all is.
I didn't intend for a bunch of this week's vid to be COVID related, but between the breach of an anti-vaxxer website and the (unrelated) social comments directed at our state premier following some pretty simple advice, well, it just kinda turned out that way. But there's more on other breaches too, in particular the alleged Paytm one and the actual Customer.io one.
I'm really looking forward to next week's update, here's a little teaser of what you can expect to hear about then.
I broke Yoda's stick! 3D printing woes, and somehow I managed to get through the explanation without reverting to a chorus of My Stick by a Bad Lip Reading (and now you've got that song stuck in your head). Loads of data breaches this week and whilst "legacy", still managed to demonstrate how bad some practices remain today (hi Shadi.com). Never a dull moment in data breach land, more from there next week.
How many times have you heard the old adage about how nothing in life is free:
If you're not paying for the product, you are the product
Facebook. LinkedIn. TikTok. But this isn't an internet age thing, the origins go back way further, originally being used to describe TV viewers being served ads. Sure, TV was "free" in that you don't pay to watch it (screwy UK TV licenses aside), but running a television network ain't cheap so it was (and still is) supported by advertisers paying to put their message in front of viewers. A portion of those viewers then go out and buy the goods and services they've been pitched hence becoming the "product" of TV.
But what I dislike - no, vehemently hate - is when the term is used disingenuously to imply that nobody ever does anything for free and that there is a commercial motive to every action. To bring it closer to home for my audience, there is a suggestion that those of us who create software and services must somehow be in it for the money. Our time has a value. We pay for hardware and software to build things. We pay for hosting services. If not to make money, then why would we do it?
There are many, many non-financial motives and I'm going to talk about just a few of my own. In my very first ever blog post almost 13 years ago now, I posited that it was useful to one's career to have an online identity. My blog would give me an opportunity to demonstrate over a period of time where my interests lie and one day, that may become a very useful thing. Nobody that read that first post became a "product", quite the contrary if the feedback is correct.
The first really serious commitment I made to blogging was the following year when I began the OWASP Top 10 for ASP.NET series. That was ten blog posts of many thousands of words each that took a year and a half to complete. I had the idea whilst literally standing in the shower one day thinking about the things that bugged me at work: "I'm so sick of sending developers who write code for us basic guidance on simple security things". I wanted to solve that problem, and as I started writing the series, it turned out to be useful for a whole range of people which was awesome! Did that make them the product? No, of course not, it just made them a consumer of free content.
I can't remember exactly when I put ads on my blog. I think it was around the end of 2012, and they were terrible! I made next to no money out of them and I got rid of them altogether in 2016 in favour of the sponsorship line of text you still see at the top of the page today. Did either of these make viewers "the product" in a way that they weren't when reading the same content prior to their introduction? By any reasonable measure, no, not unless you stretch reality far enough to claim that the ads consumed some of their bandwidth or device power or in some other way were detrimental such that they pivoted from being a free consumer to a monetised reader. Then that argument dies when ads rolled to sponsorship. Perhaps it could be claimed that people became the product because the very nature of sponsorship is to get a message out there which may one day convert visitors (or their employers) to customers and that's very true, but that doesn't magically pivot them from being a free consumer of content to a "product" at the moment sponsorship arrived, that's a nonsense argument.
How about ASafaWeb in 2011? Totally free and designed to solve the common problem of ASP.NET website misconfiguration. I never made a cent from that. Never planned to, never did. So why do it? Because it was fun. Seriously, I really enjoyed building that service and seeing people get value from it was enormously fulfilling. Of course nobody was the product in that case, they just consumed something for free that I enjoyed building.
Which brings me to Have I Been Pwned (HIBP), the project that's actually turned out to be super useful and is the most frequent source of the "if you're not paying for the product" bullshit argument. There were 2 very simple reasons I built that and I've given this same answer in probably a hundred interviews since 2013:
That's it. Those 2 reasons. No visions of grandeur, no expectation of a return on my time, just itches I wanted to scratch. Months later, I posed this question:
A number of people have asked for a donate button on @haveibeenpwned. What do you think? Worth donating to? Or does it come across as cheap?
- Troy Hunt (@troyhunt) March 7, 2014
Which is exactly what it looks like on face value: people appreciating the service and wanting to support what I was doing. It didn't make anyone "the product". Nor did the first commercial use of HIBP the following year make anyone a product, it didn't change their experience one little bit. The partnership with 1Password several years later is the same again; arguably, it made HIBP more useful for the masses or non-techies that had never given any consideration to a password manager.
What about Why No HTTPS? Definitely no product there either, whether that's the service itself or the people that use it. Or HTTPS is Easy? Nope, and Cloudflare certainly didn't pay me a cent for it either, they had no idea I was building it, I just got up and felt like it one day. Password Purgatory? I just want to mess with spammers, and I'm happy to spend some of my time doing that. (Unless... do they become the product if their responses are used for our amusement?!) And then what must be 100+ totally free user group talks, webinars, podcasts and other things I can't even remember that, by their very design, were simply intended to get information to people for free.
What gets me a bit worked up about the "you're the product" sentiment is that it implies there's an ulterior motive for any good deed. I'm dependent on a heap of goodwill for every single project I build and none of that makes me feel like "the product". I use NWebsec for a bunch of my security headers. I use Cloudflare across almost every single project (they provide services to HIBP for free) and that certainly doesn't make me a product. The footer of this blog mentions the support Ghost Pro provides me - that's awesome, I love their work! But I don't feel like a "product".
Conversely, there are many things we pay for yet still remain "the product" of, by the definition referred to in this post. YouTube Premium, for example, is worth every cent but do you think you cease being "the product" once you subscribe versus when you consume the service for free? Can you imagine Google, of all companies, going "yeah, nah, we don't need to collect any data from paying subscribers, that wouldn't be cool". Netflix. Disqus. And pretty much everything else. Paying doesn't make you not the product any more than not paying makes you the product, it's just a terrible term used way too loosely and frankly, often feels insulting.
Before jumping on the "you're the product" bandwagon, consider how it makes those who simply want to build cool stuff and put it out there for free feel. Or if you're that jaded and convinced that everything is done for personal fulfilment then fine, go and give me a donation. And now you're thinking "I bet he wrote this just to get donations" so instead, go and give Let's Encrypt a donation... but then that would kinda make free certs a commercial endeavour! See how stupid this whole argument is?
It's very much a last-minute agenda this week as I catch up on the inevitable post-travel backlog and pretty much just pick stuff from my tweet timeline over the week. But hey, there's some good stuff in there and I still managed to knock out almost an hour worth of content!
And we're finally done with this trip. 26 days, 14 different accommodations, 5,146km of driving through 4 states and the last 4 weekly vids all done on the road. Travel is great, but right now going home is even better. Next week's vid will be back in my comfy office with good lighting, video, audio and better planning. Until then, here's a (late) weekly update 303:
11 years now, wow! It's actually 11 and a bit because it was April Fool's Day in 2011 that my first MVP award came through. At the time, I referred to myself as "The Accidental MVP" as I'd no expectation of an award, it just came from me being me. It's the same again today, and the last year has been full of just doing the stuff I love; loads of talks (which, like the one above at AusCERT, are actually starting to happen in front of real live humans again), live streams every week, blog posts and perhaps my favourite thing of all, open sourcing Pwned Passwords and standing up an ingestion pipeline for the FBI. Cool.
But it has to be said that all these things only happen through the support of the community. There'd be no open source Pwned Passwords if nobody wanted to contribute, no live streams or blog posts if people didn't want to watch them and no conference talks if nobody attended. So, thank you for tuning in and giving me a platform to do what I love.
Continuing the rollout of Have I Been Pwned (HIBP) to national governments around the world, today I'm very happy to welcome Poland to the service! The Polish CSIRT GOV is now the 34th onboard the service and has free and open access to APIs allowing them to query their government domains.
Seeing the ongoing uptake of governments using HIBP to do useful things in the wake of data breaches is enormously fulfilling and I look forward to welcoming many more national CSIRTs in the future.
In a complete departure from the norm, this week's video is the much-requested "cultural differences" one with Charlotte. No tech (other than my occasional plug for the virtues of JavaScript), but lots of experiences from both of us living and working in different parts of the world. Most of it is what Charlotte has learned being thrown into the deep end of Aussieness (without the option of even getting out of the country until very recently), which I thought made for some pretty funny viewing.
We almost got through the entire content I had planned... then my phone went into battery saving mode and killed the mic so apologies for that last little bit of missing content. But hey, it was worth it when the battery was low due to capturing these epic shots earlier in the day:
Stunning pic.twitter.com/s1TRJ3bcb1
- Troy Hunt (@troyhunt) July 1, 2022
I think this made for fun viewing with heaps of audience engagement, I hope you enjoy watching it.
Four and a half years ago now, I rolled out version 2 of HIBP's Pwned Passwords that implemented a really cool k-anonymity model courtesy of the brains at Cloudflare. Later in 2018, I did the same thing with the email address search feature used by Mozilla, 1Password and a handful of other paying subscribers. It works beautifully; it's ridiculously fast, efficient and above all, anonymous. Yet from time to time, I get messages along the lines of this:
Why are you using SHA-1? It's insecure and deprecated.
Or alternatively:
Our [insert title of person who fills out paperwork but has no technical understanding here] says that k-anonymity involves sending you PII.
Both these positions make no sense whatsoever when you peel back the covers and understand what's happening underneath, but I get how on face value these conclusions can be drawn. So, let's settle it here in a more complete fashion than what I can do via short tweets or brief emails.
Let's begin with the actual problem SHA-1 presents. Actually, the multiple problems, the first of which is that it's just way too fast for storing user passwords in an online system. More than a decade ago now, I wrote about how Our Password Hashing Has no Clothes and in that post, showed the massive rate at which consumer-grade hardware can calculate these hashes and consequently "crack" the password. Since that time, Moore's Law has done its thing many times over making the proposition of SHA-1 (or SHA-256 or SHA-512) even worse than before. For a modern day reference of how you should be storing passwords, check out OWASP's Password Storage Cheat Sheet.
The other problem relates to how SHA-1 is used for integrity checks. Hashing algorithms provide an efficient means of comparing two files and establishing if their contents are the same due to the deterministic nature of the algorithm (the same input always produces the same output). If a trustworthy source says "the hash of the file is 3713..42" (shown in abbreviated form) then any file with that same hash is assumed to be the same as the one described by the trustworthy source. We use hashes all over the place for precisely this purpose; for example, if I wanted to download Windows 11 Business Editions from my MSDN subscription, I can refer to the hash Microsoft provides on the download page:
After download, I can then use a utility such as PowerShell's Get-FileHash to verify that the file I downloaded is indeed the same one listed above. (There's another rabbit hole we can go down about how you trust the hash above, but I'll leave that for another post.)
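If you're not on PowerShell, the same check is only a few lines of Python. This is a sketch that assumes the published hash is SHA-256 (which is also Get-FileHash's default algorithm); the temp file and its contents are stand-ins for a real download so the example runs on its own.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large downloads don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


# Stand-in for a downloaded file (hypothetical contents).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example download contents")
    demo_path = tmp.name

# Stand-in for the hash the trustworthy source published.
published_hash = hashlib.sha256(b"example download contents").hexdigest().upper()

matches = file_sha256(demo_path) == published_hash
os.remove(demo_path)
```

If `matches` is false, the file you have is not the file the source described, whatever the reason.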
We also use hashes when implementing subresource integrity (SRI) on websites to ensure external dependencies haven't been modified. Every time this very blog loads Font Awesome from Cloudflare's CDN, for example, it's verified against the hash in the integrity attribute of the script tag (view source for yourself).
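For reference, an SRI value is just the base64-encoded digest of the file prefixed with the algorithm name. Here's a minimal sketch of generating one; the script content is made up, and SHA-384 is my choice for the example rather than a statement about what this blog uses.

```python
import base64
import hashlib


def sri_value(content, algorithm="sha384"):
    """Build the string that goes in an integrity attribute for the given bytes."""
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"


# Hypothetical script body; in practice this would be the exact bytes served by the CDN.
integrity = sri_value(b"console.log('hello');")
```

The browser recalculates the same digest over what it downloads and refuses to use the resource if the two don't match.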
And finally (although not exhaustively - there are many other places we use hashing algorithms in tech), we use hashing algorithms on digital certificate signatures. To pick another example from this blog, the certificate issued by Cloudflare uses SHA-256 as the signature hash algorithm:
But ponder this: if a hashing algorithm always produces a fixed length output (in the case of SHA-1, it's 40 hexadecimal characters), then there are a finite number of hashes in the world. In that SHA-1 example, the finite number is 16^40 as there are 16 possible values (0-9 and a-f) and 40 positions for them. But how many different input strings are there in the world? Infinite! So, there must be multiple input strings that produce the same output, and this is what we refer to as a "hash collision". It's possible for this to occur naturally, although it's exceedingly unlikely simply due to the massive number of possibilities 16^40 presents. However, what if you could manufacture a hash collision? I mean what if you could take an existing hash for an existing document and say "I'm going to create my own document that's different but when passed through SHA-1, produces the same hash!"?
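The pigeonhole argument is easy to see in miniature. The sketch below doesn't attack SHA-1 at all; it just truncates the hash to its first 4 hex characters (only 16^4 = 65,536 possible values) so that two different inputs sharing the same truncated hash turn up almost immediately. The input strings are arbitrary.

```python
import hashlib
from itertools import count

# 40 hex characters with 16 values each is the same as 160 bits.
total_sha1_hashes = 16 ** 40


def find_truncated_collision(hex_chars=4):
    """Return two different inputs whose SHA-1 hashes share the same leading hex characters."""
    seen = {}
    for i in count():
        text = f"input-{i}"
        prefix = hashlib.sha1(text.encode("utf-8")).hexdigest()[:hex_chars]
        if prefix in seen:
            return seen[prefix], text, prefix
        seen[prefix] = text


first, second, shared_prefix = find_truncated_collision()
```

With only 65,536 possible outputs the loop has to find a match within 65,537 inputs; with the full 16^40 the same guarantee holds but the numbers become absurd, which is why a natural collision is so unlikely.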
Half a decade ago now, Google researchers demonstrated precisely this with their SHAttered attack. Their simple infographic tells the story:
And this is the heart of the integrity problem with SHA-1: it's simply past its use-by date as an algorithm we can be confident in. That's why the signature hash algorithm of the TLS cert on this blog uses SHA-256 instead, among other examples of where we've eschewed the weaker algorithm in favour of stronger variants.
So, now that you understand the problem with SHA-1, let's look at how it's used in HIBP and why it isn't a problem there. There are actually 2 reasons, and I'll start with a sample of passwords used in Pwned Passwords:
P@ssw0rd
abc123
635,someone@example.com,+61430978216,37 example street
money
qwerty
That middle line isn't a password, it's a parsing problem. Not necessarily my parsing problem, it just turns out that you can't always trust hackers to dump breached data in a clean format. So, instead of providing passwords to people in plain text format, I provide them as SHA-1 hashes:
21BD12DC183F740EE76F27B78EB39C8AD972A757
6367C48DD193D56EA7B0BAAD25B19455E529F5EE
A4DDCDA001E137C72FF8259F36BC67C5F9E083AA
C95259DE1FD719814DAEF8F1DC4BD64F9D885FF0
B1B3773A05C0ED0176787A4F1574FF0075F7521E
4 of those hashes are easily cracked (Google is great at that, just try searching for the first one) and that's just fine; nobody is put at risk by learning that some unidentified party used a common password. The 1 hash that won't yield any search results (until Google indexes this blog post...) is the middle one. The fact that SHA-1 is fast to calculate and has proven hash collision attacks against its integrity doesn't diminish the purpose it serves in protecting badly parsed data.
The second reason is best explained by walking through the process of how the API is queried. Let's take an example of someone signing up to a website with the following password:
P@ssw0rd
This will pass many password complexity criteria (uppercase, lowercase, number, non-alphanumeric character, 8 chars long) but is clearly terrible. Because they're signing up to a responsible website that checks Pwned Passwords on registration, that website now creates a SHA-1 hash of the provided password:
21BD12DC183F740EE76F27B78EB39C8AD972A757
Let's pause here for a sec: whether it's a hash of a password or a hash of an email address, what we're looking at is a pseudonymous representation of the original data. There's no anonymity of substance achieved here because in the specific case above, you can simply Google the hash and in the case of an email address, you can determine with near certainty (hash collisions aside), if a given plain text email address is the one used to generate the hash.
This, however, is a different story:
21BD1
This is the first 5 characters only of the hash and it's passed to the Pwned Passwords API as follows:
https://api.pwnedpasswords.com/range/21BD1
You can easily run this yourself and see the result but to summarise, the API then responds with 788 lines, including the following 5:
2D6980B9098804E7A83DC5831BFBAF3927F:1
2D8D1B3FAACCA6A3C6A91617B2FA32E2F57:1
2DC183F740EE76F27B78EB39C8AD972A757:83129
2DE4C0087846D223DBBCCF071614590F300:3
2DEA2B1D02714099E4B7A874B4364D518F6:1
What we're looking at here is the hash suffix of every hash that begins with 21BD1 followed by the number of times that password has been seen. Turns out that "P@ssw0rd" ain't a great choice as it's the one in the middle that's been seen over 83k times. The consumer of the Pwned Passwords service knows it's this one because when combined with the prefix, it's a perfect match to the full hash of the password. I'll touch more on the mathematical properties of this in a moment, for now I want to explain the second reason why SHA-1 is used:
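Here's roughly what a consumer of the service does, sketched in Python. To keep it runnable offline it doesn't call the API; instead it matches against the 5 sample lines shown above (a real response would be the full set of lines for that prefix). The function names are mine, not part of any official client.

```python
import hashlib

# The 5 sample lines from above, standing in for the full range response.
SAMPLE_RESPONSE = """\
2D6980B9098804E7A83DC5831BFBAF3927F:1
2D8D1B3FAACCA6A3C6A91617B2FA32E2F57:1
2DC183F740EE76F27B78EB39C8AD972A757:83129
2DE4C0087846D223DBBCCF071614590F300:3
2DEA2B1D02714099E4B7A874B4364D518F6:1
"""


def split_hash(password):
    """SHA-1 the password and split it into the 5 char prefix and 35 char suffix."""
    full_hash = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return full_hash[:5], full_hash[5:]


def pwned_count(suffix, response_text):
    """Find our suffix in the response; 0 means the password wasn't in the range."""
    for line in response_text.splitlines():
        candidate, _, times_seen = line.partition(":")
        if candidate.strip() == suffix:
            return int(times_seen)
    return 0


prefix, suffix = split_hash("P@ssw0rd")
# Only the prefix ever leaves the client; the suffix is matched locally.
url = f"https://api.pwnedpasswords.com/range/{prefix}"
times_seen = pwned_count(suffix, SAMPLE_RESPONSE)
```

The important bit is that the suffix comparison happens on the consumer's side, so the service never learns which of the returned hashes (if any) was the one being checked.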
SHA-1 makes it very easy to segment the entire corpus of hashes into roughly equal sized chunks that can be queried by prefix. As I already touched on, there are 16^5 different possible hash prefixes which is specifically 1,048,576 or "roughly a million". Not every hash prefix has 788 associated suffixes, some have more and others less but if we take that as an average, that explains how the approximately 850M passwords in the service are divided down into a million smaller collections.
Why the first 5 characters? Because if it was the first 4 then each response would be 16 times larger and it would start hurting response times. If it was the first 6 then each response would be 16 times smaller and it would start hurting anonymity. 5 characters was the sweet spot between the two.
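The trade-off is just arithmetic, and it's easy to play with. This sketch takes the approximate 850M figure from above and assumes hashes spread evenly across prefixes, so the per-prefix numbers are averages rather than what any one query returns.

```python
CORPUS_SIZE = 850_000_000  # approximate number of passwords, per the figure above

average_per_prefix = {}
for prefix_length in (4, 5, 6):
    prefix_count = 16 ** prefix_length  # number of possible prefixes of this length
    average_per_prefix[prefix_length] = CORPUS_SIZE / prefix_count
    print(f"{prefix_length} chars: {prefix_count:>10,} prefixes, "
          f"~{average_per_prefix[prefix_length]:,.0f} hashes per response")
```

Each extra character in the prefix divides the response size by 16, and with it the crowd of other hashes that the searched-for one is hiding amongst.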
Why not SHA-256? Instead of 40 characters each hash would be 64 characters and whilst I could have achieved the same anonymity properties by still just using the first 5 characters of the hash, each suffix in the response would be an additional 24 characters and multiplying that 788 times over adds multiple kb to each response, even when compressed on the transport layer. It's also a slower hashing algorithm; still totally unsuitable for storing user passwords in an online system, but it can have a hit on the consuming service if doing huge amounts of calculations. And for what? Integrity doesn't matter because there's no value in modifying the source password to forge a colliding hash. You'd further increase the anonymity by 16^24 more possibilities, but then why not use SHA-512 which is 128 characters therefore another 16^64 possibilities than even SHA-256? Because, as you'll read in the next section, even SHA-1 provides way more practical anonymity than you'll ever need anyway.
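To put a number on that extra weight, here's the back-of-envelope version. It uses the 788 line response from the earlier example and counts uncompressed characters only, so treat it as an upper bound rather than what actually goes over the wire.

```python
import hashlib

LINES_PER_RESPONSE = 788  # from the 21BD1 example above

# Hex length of each algorithm's output (2 hex characters per byte).
hex_length = {
    name: hashlib.new(name).digest_size * 2
    for name in ("sha1", "sha256", "sha512")
}

# Extra characters per response compared to SHA-1, before compression.
extra_chars = {
    name: (hex_length[name] - hex_length["sha1"]) * LINES_PER_RESPONSE
    for name in ("sha256", "sha512")
}
```

That's `extra_chars["sha256"]` more characters on every single response, for no gain in the properties that matter here.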
In summary, think of the choice of SHA-1 simply being to obfuscate poorly parsed input data to protect inadvertently included info, and as a means of dividing the collection of data down into nice easily segmentable and queryable collections. If your position is "SHA-1 is broken", then you simply don't understand its purpose here.
Let's turn the discussion more to the privacy aspects of the email address search I mentioned earlier on. The principles are identical to the password search but for one difference in the technical implementation: queries are done on the first 6 characters of a SHA-1 hash, not the first 5. The reason is simple: there are a lot more email addresses in the system than passwords, about 5 billion in total. Querying via the first 6 characters of a SHA-1 hash means there are 16 times more possibilities than with the password search, therefore 16^6 or just over 16M. Let's take this email address:
test@example.com
Which hashes down to this value with SHA-1:
567159D622FFBB50B11B0EFD307BE358624A26EE
And similar to the password search, it's only the prefix that is sent to HIBP when performing a query:
567159
So, putting the privacy hat on, what's the risk when a service sends this data to HIBP? Mathematically, with the next 34 characters unknown, there are 16^34 different possible hashes that this prefix could belong to. Just to really labour the point, given a 6 character SHA-1 hash prefix you could take a 1 in 87,112,285,931,760,200,000,000,000,000,000,000,000,000 guess as to what the full hash is. And then due to the infinite number of potential input strings, multiply that number out to... well... infinity. That's the total number of possible email addresses it could represent. By any definition of the term, those first 6 characters tell you absolutely nothing useful about what email address is being searched for.
But we're left with a more semantic, possibly philosophical question: is "567159" personally identifiable information? In practice, no, for all intents and purposes it's impossible to tell who this belongs to without the remaining 34 characters and even then, you still need to be able to crack that hash which is most likely only going to happen if you have a dictionary of email addresses to work through in which the given one appears. But it's derived from pseudonymous PII, and this is where the occasional [insert title of person who fills out paperwork but has no technical understanding here] loses their mind.
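You can reproduce those numbers in a few lines. This sketch hashes the address exactly as written; whether a real client normalises it first (trimming or lowercasing, say) is not something I'm asserting here.

```python
import hashlib

email = "test@example.com"
full_hash = hashlib.sha1(email.encode("utf-8")).hexdigest().upper()

prefix = full_hash[:6]                              # the only part sent in a query
possible_prefixes = 16 ** 6                         # "just over 16M"
possible_full_hashes = 16 ** (len(full_hash) - 6)   # 34 unknown characters

print(prefix)
print(f"{possible_prefixes:,}")
print(f"{possible_full_hashes:,}")
```

Python's integers are arbitrary precision so the last line prints the exact value of 16^34 rather than a rounded one.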
To explain this in more colloquial terms, it's like saying that the "t" at the beginning of the email address I used above is personally identifying. Really? My own email address begins with a "t", so it must be mine! It's a nonsense argument.
I'll wrap up with a definition and I like NIST's the best, not just because it's clear and concise but because they're a great authoritative source on this sort of thing (it was actually their guidance on prohibiting passwords from previous breach corpuses that led me to create Pwned Passwords in the first place):
Any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means.
Phone numbers are PII. Physical addresses are PII. IP addresses are PII. The first 6 characters of a SHA-1 hash of someone's email address are not PII.
None of the misunderstandings I've explained above have dented the adoption of these services. Pwned Passwords is now doing in excess of 2 billion queries a month and has an ongoing feed of new passwords directly from the FBI. The k-anonymity search for email addresses sees over 100M queries a month and is baked into everything from browsers to password managers to identity theft services. The success of these services isn't due to any technical genius on my part (hat-tip again to Cloudflare), but rather to their simple yet effective implementations that (almost) everyone can easily understand.
First up, I'm really sorry about the audio quality on this one. It's the exact same setup I used last week (and carefully tested first) but it's obviously just super sensitive to the wind. If you look at the trees in the background you can see they're barely moving, but inevitably that was enough to really mess with the audio quality. I do actually have a windsock for the mic, but it's in a drawer at home so for the remainder of this trip it'll be indoor recording only. Speaking of which, because there was a lot of enthusiasm for Charlotte and me to do one together on the cultural differences we've both experienced living in different parts of the world, that'll be next week's video. Less techie, but hopefully something you'll all enjoy.
Well, we're about 2,000km down on this trip and are finally in Melbourne, which was kinda the point of the drive in the first place (things just escalated after that). The whole journey is going into a long tweet thread you can find below (or mute - that's partly why it's in a single thread):
It's time for the next great road trip pic.twitter.com/9B9k9cXQvH
- Troy Hunt (@troyhunt) June 14, 2022
Next week is NDC Melbourne so please get along to the event if you're in town, it's kinda amazing to think I'll finally be back at an NDC after all this time.
How on earth does an enterprise rack-mounted NAS not come with rails to actually install it in the rack?! So yeah, that's what's in the box, something that should have been in the original box and not in a separate purchase. Just to add to the Synology packaging insanity, I went to install a couple of spare NVMe drives in it today and... there were no screws in the NVMe slots. I'll be doing the next four weekly updates from various locations around the country as we hit the road again, stay tuned for epic tweet threads of amazing locations.
Four years ago now, I started making domains belonging to various governments around the world freely searchable via a set of APIs in Have I Been Pwned. Today, I'm very happy to welcome the 33rd government, Indonesia! As of now, the Indonesian National CERT managed under the National Cyber and Crypto Agency has full access to this service to help protect government departments within the country.
Indonesia's inclusion marks the first Asian nation to take up this service and I look forward to many more from across the globe following in future.
I somehow ended up blasting through an hour and a quarter in this week's video with loads of discussion on the CTARS / NDIS data breach then a real time "let's see what the fuss is about" with news that one of our state's digital driver's licenses (DDL) may be easily forgeable. I think the whole discussion is actually really interesting when looked at through the lens of how on balance, a digitised license compares to a physical one. As you'll see, I think the reporting on this is overblown however... the weak encryption keys do seem like an oversight and the response of Service NSW to criticism has been lacklustre at best. Let's see how it goes in other states, I'll be first in line when they roll out in Queensland so I can finally start leaving my wallet at home!
So I basically spent my whole day yesterday playing with Ubiquiti gear and live-tweeting the experience. This was an unapologetically geeky pleasure and it pretty much dominates this week's video but hey, it's a fun topic. Still, there's a bunch of data breach stuff up front and as I write this, 25M more records courtesy of the MGM breach are making their way up into HIBP. Get ready for a bunch of notification emails going out on that one. Here's this week's video:
Data breaches, 3D printing and passwords - just the usual variety of things this week. More specifically, that really cool Pwned Passwords downloader that I know a bunch of people have been waiting on and that we've now finally released. It hits the existing k-anonymity API over 1 million times and that API is already going on 2 billion requests a month so I'm kinda curious to see what happens if everyone starts running the downloader at the same time...
Just before Christmas, the promise to launch a fully open source Pwned Passwords fed with a firehose of fresh data from the FBI and NCA finally came true. We pushed out the code, published the blog post, dusted ourselves off and that was that. Kind of - there was just one thing remaining...
The k-anonymity API is lovely and that's not just me saying that, that's people voting with their feet:
That's already up 58% by volume on my December blog post, only 5 months ago to the day. It's also just a rounding error off a 100% cache hit ratio. But the bit that remained was the promise I made in that last blog post:
Lastly, as of right now, the code to take the ingestion pipeline and dump all passwords into a downloadable corpus is yet to be written. We want to do this - we have every intention of doing this - but given how long it frequently was between releases, we don't feel the need to rush.
The idea of taking 16^5 hash ranges, bundling them all up into a single monolithic archive then making it all downloadable seemed a non-trivial task. Plus, I was still licking my wounds from the massive costs I got hit with after releasing the last archives and them exceeding the cacheable limit at the time on Cloudflare's edge. And that's when it hit me - why don't we just write a script to download all the hashes from the same k-anonymity API so many organisations are already using? It's just 16^5 separate requests and the responses could be dumped into a big text file, how hard could it be? It'd almost all be cached and there's super efficient brotli compression between the client and the Cloudflare edge so it should be fast too, so... why not?
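To give a feel for how simple the idea is, here's a rough Python sketch of it. This is not the downloader's actual code, just an illustration: it assumes the public range endpoint at `https://api.pwnedpasswords.com/range/{prefix}` returns one `SUFFIX:COUNT` pair per line, and the function names and the User-Agent string are made up for the example:

```python
import itertools
import urllib.request

HEX_DIGITS = "0123456789ABCDEF"
# Assumed endpoint shape for the k-anonymity range search
RANGE_URL = "https://api.pwnedpasswords.com/range/{prefix}"


def all_prefixes(length: int = 5):
    """Yield every hex prefix of the given length in sorted order (16^5 by default)."""
    for combo in itertools.product(HEX_DIGITS, repeat=length):
        yield "".join(combo)


def expand_range(prefix: str, body: str) -> list:
    """Turn the SUFFIX:COUNT lines of one response into FULLHASH:COUNT lines."""
    return [prefix + line.strip() for line in body.splitlines() if line.strip()]


def fetch_range(prefix: str) -> list:
    """Fetch a single hash range over the network (never called at import time)."""
    request = urllib.request.Request(
        RANGE_URL.format(prefix=prefix),
        headers={"User-Agent": "range-download-sketch"},  # hypothetical agent name
    )
    with urllib.request.urlopen(request) as response:
        return expand_range(prefix, response.read().decode("utf-8"))


# Counting the prefixes needs no network access at all
print(sum(1 for _ in all_prefixes()))
```

Looping `fetch_range` over `all_prefixes()` and appending the results to a file is the whole trick. Everything else is about making it fast and resilient.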
I threw the idea over to Stefán and in his typically cool Icelandic way he not only built the feature, but did it much better than I was thinking in the first place. So, here's how it works in point form:
And that's it. Run it up and it looks like this:
The -p switch defines the level of parallelism to apply and when run in the Azure VM I tested this from, it took 26 minutes to pull everything down. Obviously YMMV based on connection speed, but with that massive cache hit ratio (also reflected in the output above), at least you'll be retrieving almost every single hash range from a location very close to you.
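As a rough illustration of what a parallelism setting like that buys you, here's a Python sketch using a thread pool. It is not how the real tool is implemented. The `download_all` function and its `parallelism` parameter are names I've made up, and the `fetch` argument stands in for whatever function retrieves one hash range (a fake one is used below so the example runs without touching the network):

```python
import io
from concurrent.futures import ThreadPoolExecutor


def download_all(prefixes, fetch, out, parallelism: int = 8) -> int:
    """Fetch every range with a pool of workers and write the lines to `out`.

    Returns the number of lines written. Executor.map yields results in the
    same order as the input, so the output stays sorted by prefix even though
    the requests themselves run concurrently.
    """
    written = 0
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        for lines in pool.map(fetch, prefixes):
            for line in lines:
                out.write(line + "\n")
                written += 1
    return written


def fake_fetch(prefix: str) -> list:
    """Stand-in for a real network call, returning two made-up entries per range."""
    return [prefix + "A:1", prefix + "B:2"]


buffer = io.StringIO()
count = download_all(["00000", "00001", "00002"], fake_fetch, buffer, parallelism=2)
print(count)
print(buffer.getvalue())
```

Because each request spends most of its time waiting on the network, threads are a reasonable fit here. How far you can push the worker count will depend on your connection.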
I'm conscious the one remaining gap we have is that this doesn't make the NTLM versions downloadable and there are folks out there eagerly awaiting that. I suspect we'll take a similar approach there so stay tuned for that, it shouldn't be a biggy now we've established a pattern. I'm also conscious that to make this tool more useful, it would be handy to know when to actually run it by seeing how many new password hashes have been added since a given date. That's on the list - we know it's wanted - and especially as the volume of inbound passwords ramps up I know it'll be super useful for people.
So, go forth and grab the tool, pull down the hashes whenever you feel like it and do good things with them. Now I'm kinda curious to see what those API hit numbers look like once the masses grab this tool and make 1M+ requests each.
A short one this week as the previous 7 days disappeared with AusCERT and other commitments. Geez it was nice to not only be back at an event, but out there socialising and attending all the related things that tend to go along with it. I'll leave you with this tweet which was a bit of a highlight for me, having Ari alongside me at the event and watching his enthusiasm being part of the industry I love.
At #AusCERT with Ari for "take your son to work" day
- Troy Hunt (@troyhunt) May 12, 2022
I'm up next on stream 2 at 14:45 talking about Pwned Passwords, the FBI, the NCA and giving the whole thing over to the community, come say hi! https://t.co/PqSgb1AjMS pic.twitter.com/Z88xIrrHYW
It's back to business as usual with more data breaches, more poor handling of them and more IoT pain. I think on all those fronts there's a part of me that just likes the challenge and the opportunity to fix a broken thing. Or maybe I'm just a sucker for punishment, I don't know, but either way it's kept me entertained and given me plenty of new material for this week's video.
Didn't get a lot done this week, unless you count scuba diving, snorkelling, spear fishing and lying around on tropical sand cays. This week is predominantly about the time we just spent up on the Great Barrier Reef which has very little relevance to infosec, IoT, 3D printing and the other usual topics. But as I refer to in the guitar lessons blog post referenced below, I share what I do pretty transparently and organically and this week, that's what I want to talk about. So, either enjoy it or skip it until next week when I'll be back to business as usual.
Well that was an unusual ending. Both my mouse and keyboard decided to drop off right at the end of this week's video and without any control whatsoever, there was no way to end the live stream! Wired devices from kids borrowed, I eventually got back control and later discovered that all things Bluetooth had suddenly decided to die without any warning whatsoever. I certainly wasn't updating drivers mid-live stream or anything like that so...
Anyway, other than that it's business as usual this week, enjoy!