If you can't load the file into memory, why not split it smartly into smaller files and work on those? You only need to guarantee that identical lines end up in the same file, and you want some collisions so you don't end up with a huge number of files.
Here is a script that takes the prefix of each line (the prefix size can be changed, obviously) and puts the line in the file corresponding to that prefix.
This is actually much like a hash map, only not in memory, since your RAM cannot handle the amount of data you're trying to process.
The result is many smaller files (buckets, if you will) where all occurrences of a line are grouped in the same file (same prefix). They can be de-duplicated individually and then merged into the result file.
Here is how it's done:
Initialize the program to read from the file input.txt, write to output.txt, and use a prefix size of 2 to hash/split:
import os
input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2
Create the folder that will hold the split files containing similar and identical lines:
# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)
Line-distributing function - puts a line in a specified file:
# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)
Hash function that guarantees some collisions (which is good) and that identical lines end up in the same file:
def prefix_hash(line):
    return line[:prefix_size]
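For example, with a prefix size of 2, the first two (hypothetical) lines below land in the bucket file named he while the third goes to a different bucket; identical lines always map to the same bucket:

prefix_hash('hello world\n')  # -> 'he'
prefix_hash('help me\n')      # -> 'he'  (a collision: different lines, same bucket)
prefix_hash('apple pie\n')    # -> 'ap'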
Now we distribute the lines to their smaller files (like hash "buckets"):
with open(input_file_name) as f:
    # convenience method: hash a line and write it to its bucket file
    def putter(line):
        put_in_file(prefix_hash(line), line)

    # make sure every line ends with a newline so the buckets stay line-aligned
    for line in f:
        putter(
            line + ('\n' if not line.endswith('\n') else '')
        )
Generate a list of created file names:
# build a list (not a lazy map) so the file names can be iterated more than once
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]
De-duplicate the lines in the smaller files (each bucket should now be small enough to hold in memory):
for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))
Join smaller files into the result file:
output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())
The whole thing together:
import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2

# create hash files folder
if not os.path.exists(split_folder):
    os.makedirs(split_folder)

# a function to put a line in a file
def put_in_file(file_name, line):
    with open(os.path.join(split_folder, file_name), 'a') as f:
        f.write(line)

def prefix_hash(line):
    return line[:prefix_size]

with open(input_file_name) as f:
    # convenience method: hash a line and write it to its bucket file
    def putter(line):
        put_in_file(prefix_hash(line), line)

    # make sure every line ends with a newline so the buckets stay line-aligned
    for line in f:
        putter(
            line + ('\n' if not line.endswith('\n') else '')
        )

# build a list (not a lazy map) so the file names can be iterated more than once
split_file_names = [
    os.path.join(split_folder, x) for x in os.listdir(split_folder)
]

for split_file_name in split_file_names:
    # dedup each file
    with open(split_file_name, 'r') as f:
        unique_lines = set(f.readlines())
    with open(split_file_name, 'w') as f:
        f.write(''.join(unique_lines))

output_file = "output.txt"
with open(output_file, 'w') as of:
    for split_file_name in split_file_names:
        with open(split_file_name, 'r') as f:
            of.write(f.read())
Note: To make this a lot faster, you should keep the file handles open at all times and probably spin up a few worker threads with a queue to pass lines among them (this avoids waiting on I/O as well as repeatedly opening and closing the files). I can add this later if anyone wants it.
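As a rough illustration of the first suggestion only (keeping the file handles open), here is a minimal sketch of the distribution step; the open_files cache and the get_handle helper are hypothetical names, not part of the script above, and the threaded/queue version is left out:

import os

input_file_name = 'input.txt'
split_folder = 'splits'
prefix_size = 2

if not os.path.exists(split_folder):
    os.makedirs(split_folder)

# hypothetical cache of open bucket file handles, keyed by prefix
open_files = {}

def get_handle(prefix):
    # open each bucket file once and reuse the handle for every later write
    if prefix not in open_files:
        open_files[prefix] = open(os.path.join(split_folder, prefix), 'a')
    return open_files[prefix]

with open(input_file_name) as f:
    for line in f:
        if not line.endswith('\n'):
            line += '\n'
        get_handle(line[:prefix_size]).write(line)

# close every bucket file once the whole input has been distributed
for handle in open_files.values():
    handle.close()

The dedup and merge steps from above stay exactly the same.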