I have a 164GB password list and rockyou.txt, and I'd like to remove all of the RockYou passwords from the 164GB list. Is there any way to do this? I've researched it a bit, but I haven't found a way to do it with such a large file.
1 Answer
First, a disclaimer: dictionary files of this size are usually so noisy - full of naive 'pass001', 'pass002', 'pass003' sequences that would be much more efficiently applied as rules on GPU instead - that you're better off studying how to use rules and other, better alternatives. Further, the utility or time savings of removing a relatively small list like RockYou from such a large one are probably limited.
But let's assume the use case is still worthwhile. There is some complexity here: the established solutions in password-cracking practice trade off how much RAM you have against how much pre-processing you have to do, and there are differing opinions on the "best" way to do it:
- With unsorted input files and unlimited RAM, you can simply use `rling`. But even with its slower `-b` option, it needs an amount of RAM larger than the target file. In practice, `rling`'s write-temp-to-disk options (which work around having less RAM during processing) are currently less performant than the alternatives. `rling` was otherwise expressly designed for this use case, and can even handle files of this size if you divide them into chunks first (more on that below).
- With limited RAM, you can use `split` to break the file into chunks small enough to be processed within your available RAM (or a little over). You'll have to experiment a bit to find the right chunk size, but a little less than half of your available RAM is often a reasonable starting point. As a bonus, simply leave your 164GB file in these smaller chunks - it will save you a lot of trouble by letting all future analysis and attacks run against the smaller files in series. Be sure to use `split`'s `-n l/N` option, so that the file is split between words/records rather than in the middle of a word.
- You can also `sort` the entire files first. This is a popular option because it scales to arbitrary file sizes for common workflows that do not require preserving frequency order. For efficiency, set the `LC_ALL=C` environment variable to speed up sorting (it skips complex multibyte collation), and use `sort`'s `-S` (memory usage cap, to match your system RAM) and `-T` (tempfile location, preferably on fast storage) options to make this as efficient as feasible. But here, again, I'd recommend keeping the file divided into smaller files rather than constantly working around its monolithic size for no real added benefit (and, as noted earlier, sorting destroys the frequency information). Once the files are sorted, you can use `rli2` from hashcat-utils to remove the RockYou lines from your large file. This needs no significant RAM, because the sorted files can now be processed like a classic mergesort pass - read record by record, compared, and either written to disk or skipped, with the assurance that no line ever has to be compared again, because its predecessors and successors are already known. The major limitation of `rli2` is that, unlike `rling`, it can only take one "removefile" as an argument.

Rough command sketches for each of these approaches follow.
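For the unlimited-RAM case, a minimal sketch, assuming `rling`'s `input output [removefile ...]` argument order (verify against `rling`'s own usage output; filenames are placeholders):

```
# Remove every rockyou.txt entry from the big list in one pass.
rling big-164gb.txt big-minus-rockyou.txt rockyou.txt

# Slower, lower-memory variant using the -b (binary search) mode:
rling -b big-164gb.txt big-minus-rockyou.txt rockyou.txt
```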
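For the limited-RAM case, something like the following (GNU `split`; the chunk count of 200 and all filenames are placeholders - pick a count that keeps each chunk comfortably under about half your RAM - and the same `rling` argument-order assumption as above applies):

```
# Split into 200 chunks at line boundaries (-n l/200 never breaks a record),
# then clean each chunk against rockyou.txt separately.
split -n l/200 big-164gb.txt chunk_

for f in chunk_*; do
    rling "$f" "cleaned_$f" rockyou.txt
done
```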
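For the sort-then-merge approach, a sketch along these lines - the `-S`/`-T` values and the temp path are placeholders, and the exact `rli2` calling convention (two sorted files, surviving lines to stdout) is my assumption, so check the hashcat-utils documentation:

```
# Byte-wise collation: faster, and matches the simple byte-order comparison
# a merge tool expects.
export LC_ALL=C

# Sort and de-duplicate both lists, capping sort's memory and pointing its
# temp files at fast storage.
sort -u -S 16G -T /fast/tmp -o big-164gb.sorted big-164gb.txt
sort -u -S 2G  -T /fast/tmp -o rockyou.sorted   rockyou.txt

# Single merge pass over both sorted files; RAM usage stays small.
rli2 big-164gb.sorted rockyou.sorted > big-minus-rockyou.txt
```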
All that being said ... by the time you have a 164GB file, you already have the wrong problem. Mashing many different wordlists into a single monolithic wordlist is an anti-pattern, and usually, by the time such a list reaches you, multiple people have already mashed other terrible lists into it, without regard to source or quality.
Instead, you can shift your model to do the following (which is what I do):
- Instead of always mashing new sources into your giant file, keep all source files separate, and remove duplicates from each new source before adding it to your wordlist repository - using `rling` to remove all of your other files from each new file. Even if a new file is large, simply break it into enough pieces to be comfortably processed. (A command sketch of this workflow appears after the list below.)
I prefer this approach because:
- it scales to handle arbitrary total aggregate size, while fitting into your existing available RAM as needed
- it preserves original file order
- each new file can be deduplicated easily with `rling` prior to being added to your 'library'
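A minimal sketch of that per-file workflow, with placeholder paths and the same assumed `rling` argument order as above (input, output, then any number of remove files):

```
# Remove everything already in the library from a newly acquired list,
# then keep only the cleaned copy alongside the existing files.
rling incoming/new-leak.txt library/new-leak.txt library/*.txt
```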
