
I have a txt file, and here is a snippet of the first few lines:

C    A10231    A1    171|171    HER
C    B23098    A1    171|171    HEF
C    A03295    A2    171|171    HAF

I want to create a running list of every time the third column reads something other than "A1", and also keep track of how many times "A1" appears. Is there a way to import this file into a pandas df without causing a memory error?

If not, how can I process the txt file using the following rules:

  • Keep a running count of every time the third column reads "A1"
  • If the third column is not "A1", append the value to a list.
  • Count the total number of rows in the txt file

I essentially want to create three outputs: the count of "A1", a list of everything that isn't "A1" (e.g. non_A1 = ['A2', 'B3', 'B4', 'V6', ...]), and the total number of rows.

    You don't need a data frame at all; you simply need to process a tab(?)-delimited file as you read it. – chepner Feb 19 '23 at 17:45

4 Answers

1

All you need to do is process each line as you read it; no need to store anything more than your accumulated results and the current line in memory at any given time, and certainly no need to build a full dataframe from the contents of the file.

row_count = 0
a1_count = 0
non_a1 = []

with open("file.tsv") as f:
    for line in f:
        row = line.strip().split('\t')
        row_count += 1
        if row[2] == 'A1':
            a1_count += 1
        else:
            non_a1.append(row[2])
chepner
  • Thanks; I was originally using `csv.reader`, but decided to do the field-splitting manually to keep it simple. – chepner Feb 19 '23 at 17:57
  • I would also start with `csv.reader`, as it implies splitting by `delimiter='\t'` and strips the newline when the file is opened with `newline=''` – RomanPerekhrest Feb 19 '23 at 18:41
  • *How* one iterates over the file is, I think, less relevant to the question being asked than what to do with each line, but yes. – chepner Feb 19 '23 at 19:27
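Following the comments, the `csv.reader` variant would look like the sketch below (the sample file is created inline from the question's snippet so the example is self-contained; the real path is an assumption):

```python
import csv

# Sample data mirroring the question's snippet (assumption: tab-delimited).
with open("file.tsv", "w", newline="") as f:
    f.write("C\tA10231\tA1\t171|171\tHER\n"
            "C\tB23098\tA1\t171|171\tHEF\n"
            "C\tA03295\tA2\t171|171\tHAF\n")

row_count = 0
a1_count = 0
non_a1 = []

# newline="" lets the csv module handle line endings;
# delimiter="\t" does the field-splitting.
with open("file.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        row_count += 1
        if row[2] == "A1":
            a1_count += 1
        else:
            non_a1.append(row[2])
```

This stays just as memory-frugal as the manual split: only one row and the accumulated results live in memory at a time.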
1

Since you tagged your question with pandas, you can process the file in chunks:

count_A1 = 0
non_A1 = set()
num_rows = 0

for chunk in pd.read_csv('/home/damien/data.txt', sep='\t', usecols=[2], header=None, chunksize=1):
    count_A1 += chunk[2].eq('A1').sum()
    non_A1 |= set(chunk.loc[chunk[2].ne('A1'), 2].unique().tolist())
    num_rows += chunk.shape[0]

Output:

>>> count_A1
2

>>> list(non_A1)
['A2']

>>> num_rows
3
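Note that `chunksize=1` reads one row per chunk, so pandas overhead dominates; with a larger chunk size the same aggregation can use `value_counts` per chunk. A sketch, using an in-memory copy of the question's sample rows instead of the real path:

```python
import io
import pandas as pd

# Assumption: the three sample rows from the question, tab-delimited.
data = ("C\tA10231\tA1\t171|171\tHER\n"
        "C\tB23098\tA1\t171|171\tHEF\n"
        "C\tA03295\tA2\t171|171\tHAF\n")

count_A1 = 0
non_A1 = set()
num_rows = 0

# Only column 2 is loaded; each chunk holds up to 100,000 rows.
for chunk in pd.read_csv(io.StringIO(data), sep="\t", usecols=[2],
                         header=None, chunksize=100_000):
    counts = chunk[2].value_counts()
    count_A1 += counts.get("A1", 0)
    non_A1 |= set(counts.index) - {"A1"}
    num_rows += len(chunk)
```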
Corralien
1

Using pandas for this trivial task is overkill:

a1_count = 0
line_count = 0
others = []

with open('foo.tsv') as tsv:
    for line in tsv:
        if (ax := line.split()[2]) == 'A1':  # walrus operator, Python 3.8+
            a1_count += 1
        else:
            others.append(ax)
        line_count += 1
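A `collections.Counter` over the third column produces the same three results in one pass (a sketch, assuming the sample rows from the question; the file name is an assumption):

```python
from collections import Counter

# Sample rows from the question (assumption: tab-delimited file foo.tsv).
with open("foo.tsv", "w") as f:
    f.write("C\tA10231\tA1\t171|171\tHER\n"
            "C\tB23098\tA1\t171|171\tHEF\n"
            "C\tA03295\tA2\t171|171\tHAF\n")

# Tally every value seen in the third column.
with open("foo.tsv") as tsv:
    counts = Counter(line.split()[2] for line in tsv)

line_count = sum(counts.values())
a1_count = counts.pop("A1", 0)       # remove A1, keeping its count
others = list(counts.elements())     # remaining non-A1 values, with repeats
```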
DarkKnight
0

In a similar vein to @Corralien's answer, but using the categorical dtype, which saves memory when a large amount of data falls into a limited number of categories:

import pandas as pd

# Create some test data
fname = "reading_tsv_in_chunks.tsv"
with open(fname, "w") as fid:
    for i in range(1000):
        fid.write("C\tA10231\tA1\t171|171\tHER\nC\tB23098\tA1\t171|171\tHEF\nC\tA03295\tA2\t171|171\tHAF\nC\tA02225\tA3\t171|171\tHAX\nC\tA012325\tA4\t171|171\tHAY\n")
# Read as categorical
df = pd.read_csv(fname, sep="\t", header=None, names=["category",], usecols=[2,], dtype="category")

print(f"Contents of df:\n{df.describe()}\n")
print(f"Memory usage with categorical dtype:\n{df.memory_usage()}\n\n")
# Read as non-categorical
df2 = pd.read_csv(fname, sep="\t", header=None, names=["category",], usecols=[2,])
print(f"Contents of df2:\n{df2.describe()}\n")
print(f"Memory usage WITHOUT categorical dtype:\n{df2.memory_usage()}\n\n")

# Process as necessary e.g.
a1_count = sum(len(values) for category, values in df.groupby("category")["category"] if category == "A1")
non_a1_count = sum(len(values) for category, values in df.groupby("category")["category"] if category != "A1")
print(f"A1 count: {a1_count}\n")
print(f"Non-A1 count: {non_a1_count}")
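Once the single categorical column is loaded, `value_counts` gives both numbers without a groupby; a minimal sketch with an inline stand-in for the df built above:

```python
import pandas as pd

# Stand-in for the df read above: one categorical column.
df = pd.DataFrame({"category": pd.Categorical(["A1", "A1", "A2", "A3", "A4"])})

counts = df["category"].value_counts()
a1_count = int(counts.get("A1", 0))
non_a1_count = int(counts.drop("A1", errors="ignore").sum())
```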
KolaB