What Really Is That 16B Password "Leak"?

DAK · June 23, 2025

Last week, a number of news outlets and organizations posted a story (which was then followed by something close to a retraction) about a darkweb password leak comprising 16B records. This immediately triggered a furor over whether this was really a single leak, where it came from, who was exposed and how, and so on, as always happens around these things.

I’ve been reading the discourse around this breach and did some digging into what the data actually is. There are a number of things to cover in order to clarify the mixed messaging (spoiler: incorrect messaging) around this dataset and the risk it poses. If you’d like to get up to speed on the basics, read back through a couple of my previous posts on the alien_txtbase leaks and stealer malware attack vectors; I’m referring to these specific posts for a reason.

What is this “leak”?

This “leak” is a compilation of infostealer logs spanning some date range, the kind I’ve previously discussed on this blog. The current discourse is that it’s all “old data”; this isn’t fully true, and the reality is much greyer than that. It contains both old data from a number of the large infostealer leaks and recent data from newer dumps that I’ve discussed.

It is, by definition, not a leak: it does not come from a single source or any single cloud provider. It is a compilation of stealer log data, stolen via the previously discussed Redline, Lumma, and other campaigns, which together comprise somewhere around 16B records. The original messaging around this data was therefore incorrect, owing to a lack of research into the data and the sources involved. However, the retraction is also incorrect, again because of a lack of thorough research into what the data is.

I will not rehash my previous research into the threat vectors or the contents here; I’ll continue that work going forward, as I always have. However, I want to stress that these retractions are also very wrong and based on a lack of information, and that this data poses a real risk.

Further Reading

This blog has published a number of articles that are relevant to a full understanding of the situation and of where the data is coming from. They can be found in the following posts:

  • The Alien Txtbase Leaks (which is the base the old data is built on). Alien_Txtbase was, and continues to be, a large source of infostealer credential sets; this “leak” has been discussed by a number of sources, such as Troy Hunt, and its contents have been analyzed by Specops Software
  • The Anatomy Of A Stealer Package discusses the contents of the packages that get uploaded to the C2 system when the infostealer executes on a victim machine.
  • Stealer Malware Attack Vectors discusses the attack vectors that can be gleaned from screenshots contained in said infostealer packages in some Lumma infostealer attacks. This is the campaign that was taken down by Microsoft this year, as written on Microsoft’s Blog.

Reviewing these articles can help provide the bigger picture of what these attacks comprise, how they infect and exploit end users, and the current state of the Lumma C2 campaigns and other ongoing infostealer operations.

The Situation In Brief

  • The original post was incorrect: this is not a single leak, it is a compilation of data from infostealer campaigns
  • The BleepingComputer retraction is incorrect: some of the data is years old and some of it is newer than that, assuming their sources are good
  • The data is not necessarily in HIBP or in an existing breached corpus, and users are not necessarily protected from it by previous work

To quote BleepingComputer’s followup:

Similar credential collections were released in the past, such as the RockYou2024 leak, with over 9 billion records, and “Colection #1,” [sic] which contained over 22 million unique passwords.

Despite the buzz, there’s no evidence this compilation contains new or previously unseen data

This is patently untrue: the data is not all old, and it does contain previously unseen data. The claim is an assumption on the part of the writer. To know that it is in fact all old infostealer data, and not more recent, one would need the entire dataset. The screenshots and information used in the clarification do not provide sufficient proof to brush off these infostealer campaigns.

The referenced RockYou2024 article posted by Specops Software provides insufficient evidence to posit that this infostealer campaign data is not a threat.

The HIBP Situation

I pulled a random sample from one of the large files that have been analyzed here in the past and ran the records through the HIBP password check to see whether they’re present in previous leaks. Of the 125 records checked, 58 (about 46%) were not present in HIBP. This does not mean they are absent from other third-party breached corpuses; this is solely a HIBP check. Keep in mind that this data generally comes from end-user machines, as explained in other posts here.
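To put the 58-of-125 figure in context, here is a minimal sketch of how much uncertainty a sample of that size carries. It assumes the 125 records were drawn uniformly at random from the one file sampled, and it uses a Wilson score interval, which is my choice for illustration and not something the original check relied on. The function name wilson_interval is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Figures from the check described above: 58 of 125 sampled records not in HIBP.
low, high = wilson_interval(58, 125)
print(f"{58 / 125:.1%} not found; interval {low:.1%} to {high:.1%}")
```

With these inputs the interval comes out at roughly 38% to 55%. That range describes the sampled file only, under the random-sampling assumption; it says nothing by itself about the other files in the compilation.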

A sample of the passwords that were checked and found not to be in HIBP, so that readers can run their own check against HIBP or other corpuses:

Password
Yumyim48!
Klyde94310
V.xt.dqpfCVTMH7
ograce2580
R.0857
wilkin03?
mumbaispine123
Ax442001
beem111053
94862466Lcgs240465
evandareen170114
Puncak3s#
DassaultMonCompte4@
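For anyone who wants to repeat the check, the following is a minimal sketch against the public Pwned Passwords range endpoint, which uses k-anonymity: only the first five hex characters of the SHA-1 hash are sent, never the password itself. This is not necessarily how the check above was run; the function names and the User-Agent string are my own, chosen for illustration.

```python
import hashlib
import urllib.request
from typing import Callable, Tuple

RANGE_URL = "https://api.pwnedpasswords.com/range/"


def split_sha1(password: str) -> Tuple[str, str]:
    """Return (first 5 hex chars, remaining 35) of the uppercase SHA-1 hash."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(body: str, suffix: str) -> int:
    """Parse 'SUFFIX:COUNT' lines and return the count for our suffix, else 0."""
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


def fetch_range(prefix: str) -> str:
    """Fetch every known hash suffix for a 5-character prefix."""
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "pwned-check-sketch"},  # arbitrary name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def pwned_count(password: str, fetch: Callable[[str], str] = fetch_range) -> int:
    """Number of times the password appears in the corpus; 0 means not found."""
    prefix, suffix = split_sha1(password)
    return count_in_range(fetch(prefix), suffix)


# Example (performs a network request when uncommented):
# print(pwned_count("Klyde94310"))
```

The fetch parameter exists so the lookup can be swapped for a local copy of the hash ranges or for a stub when testing. Results can change over time: a password that returned zero when this post was written may return a non-zero count later if it is added to the corpus.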

Naturally, this is not commentary on the efficacy of HIBP as a breached corpus; it simply shows that it’s incorrect to brush off these infostealer campaigns as “old outdated data”.

A Cautionary Tale

The information and misinformation around this “leak” should be seen as a cautionary tale when consuming information around data breaches and security leaks. It is difficult to determine which side is incorrect without doing one’s own research. Articles such as those that went around this week about the data and the protection users are afforded should be taken with a grain of salt, especially when specifics such as these are not provided.

Referencing the BleepingComputer correction, one can see that the Telegram message in question comes from the same source that has been referenced here regarding infostealer packages: the Telegram channel in the screenshot on BleepingComputer is NEW_DAISYCLOUD, with the leading characters blurred. So the article correctly clarifies what the data is, and then goes on to incorrectly assure the reader that the data is very old and therefore not a concern.
