
I have a txt file, and here is a snippet of the first few lines:

C    A10231    A1    171|171    HER
C    B23098    A1    171|171    HEF
C    A03295    A2    171|171    HAF

I want to create a running list of every time the third column reads something other than "A1", and also keep track of how many times "A1" appears. Is there a way to import this file into a pandas df without causing a memory error?

If not, how can I process the txt file using the following rules:

  • Keep a running count of every time the third column reads "A1"
  • If the third column is not "A1", append the value to a list.
  • Count the total number of rows in the txt file

I essentially want to create three outputs: the count of "A1", a list of everything that isn't "A1" (e.g. non_A1 = ['A2', 'B3', 'B4', 'V6', ...]), and the total number of rows.

    You don't need a data frame at all; you simply need to process a tab(?)-delimited file as you read it. – chepner Feb 19 '23 at 17:45

4 Answers

1

All you need to do is process each line as you read it; no need to store anything more than your accumulated results and the current line in memory at any given time, and certainly no need to build a full dataframe from the contents of the file.

row_count = 0
a1_count = 0
non_a1 = []

with open("file.tsv") as f:
    for line in f:
        row = line.strip().split('\t')
        row_count += 1
        if row[2] == 'A1':
            a1_count += 1
        else:
            non_a1.append(row[2])
chepner
  • Thanks; I was originally using `csv.reader`, but decided to do the field-splitting manually to keep it simple. – chepner Feb 19 '23 at 17:57
  • I would also start with `csv.reader`, as it implies splitting by `delimiter='\t'` and strips the newline when the file is opened with `newline=''` – RomanPerekhrest Feb 19 '23 at 18:41
  • *How* one iterates over the file is, I think, less relevant to the question being asked than what to do with each line, but yes. – chepner Feb 19 '23 at 19:27
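Following the comments, the `csv.reader` variant would look like the sketch below (the sample file is created inline from the question's snippet so the example is self-contained; the real path is an assumption):

```python
import csv

# Sample data mirroring the question's snippet (assumption: tab-delimited).
with open("file.tsv", "w", newline="") as f:
    f.write("C\tA10231\tA1\t171|171\tHER\n"
            "C\tB23098\tA1\t171|171\tHEF\n"
            "C\tA03295\tA2\t171|171\tHAF\n")

row_count = 0
a1_count = 0
non_a1 = []

# newline="" lets the csv module handle line endings;
# delimiter="\t" does the field-splitting.
with open("file.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        row_count += 1
        if row[2] == "A1":
            a1_count += 1
        else:
            non_a1.append(row[2])
```

This stays just as memory-frugal as the manual split: only one row and the accumulated results live in memory at a time.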
1

Since you tagged your question with pandas, you can process the file in chunks:

count_A1 = 0
non_A1 = set()
num_rows = 0

for chunk in pd.read_csv('/home/damien/data.txt', sep='\t', usecols=[2], header=None, chunksize=1):
    count_A1 += chunk[2].eq('A1').sum()
    non_A1 |= set(chunk.loc[chunk[2].ne('A1'), 2].unique().tolist())
    num_rows += chunk.shape[0]

Output:

>>> count_A1
2

>>> list(non_A1)
['A2']

>>> num_rows
3
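Note that `chunksize=1` reads one row per chunk, so pandas overhead dominates; with a larger chunk size the same aggregation can use `value_counts` per chunk. A sketch, using an in-memory copy of the question's sample rows instead of the real path:

```python
import io
import pandas as pd

# Assumption: the three sample rows from the question, tab-delimited.
data = ("C\tA10231\tA1\t171|171\tHER\n"
        "C\tB23098\tA1\t171|171\tHEF\n"
        "C\tA03295\tA2\t171|171\tHAF\n")

count_A1 = 0
non_A1 = set()
num_rows = 0

# Only column 2 is loaded; each chunk holds up to 100,000 rows.
for chunk in pd.read_csv(io.StringIO(data), sep="\t", usecols=[2],
                         header=None, chunksize=100_000):
    counts = chunk[2].value_counts()
    count_A1 += counts.get("A1", 0)
    non_A1 |= set(counts.index) - {"A1"}
    num_rows += len(chunk)
```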
Corralien
1

Using pandas for this trivial task is overkill:

a1_count = 0
line_count = 0
others = []

with open('foo.tsv') as tsv:
    for line in tsv:
        if (ax := line.split()[2]) == 'A1':  # walrus operator, Python 3.8+
            a1_count += 1
        else:
            others.append(ax)
        line_count += 1
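A `collections.Counter` over the third column produces the same three results in one pass (a sketch, assuming the sample rows from the question; the file name is an assumption):

```python
from collections import Counter

# Sample rows from the question (assumption: tab-delimited file foo.tsv).
with open("foo.tsv", "w") as f:
    f.write("C\tA10231\tA1\t171|171\tHER\n"
            "C\tB23098\tA1\t171|171\tHEF\n"
            "C\tA03295\tA2\t171|171\tHAF\n")

# Tally every value seen in the third column.
with open("foo.tsv") as tsv:
    counts = Counter(line.split()[2] for line in tsv)

line_count = sum(counts.values())
a1_count = counts.pop("A1", 0)       # remove A1, keeping its count
others = list(counts.elements())     # remaining non-A1 values, with repeats
```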
DarkKnight
0

In a similar vein to @Corralien's answer, but using the categorical dtype, which saves memory when a large amount of data falls into a limited number of categories:

import pandas as pd

# Create some test data
fname = "reading_tsv_in_chunks.tsv"
with open(fname, "w") as fid:
    for i in range(1000):
        fid.write("C\tA10231\tA1\t171|171\tHER\nC\tB23098\tA1\t171|171\tHEF\nC\tA03295\tA2\t171|171\tHAF\nC\tA02225\tA3\t171|171\tHAX\nC\tA012325\tA4\t171|171\tHAY\n")
# Read as categorical
df = pd.read_csv(fname, sep="\t", header=None, names=["category",], usecols=[2,], dtype="category")

print(f"Contents of df:\n{df.describe()}\n")
print(f"Memory usage with categorical dtype:\n{df.memory_usage()}\n\n")
# Read as non-categorical
df2 = pd.read_csv(fname, sep="\t", header=None, names=["category",], usecols=[2,])
print(f"Contents of df2:\n{df2.describe()}\n")
print(f"Memory usage WITHOUT categorical dtype:\n{df2.memory_usage()}\n\n")

# Process as necessary e.g.
a1_count = sum(len(values) for category, values in df.groupby("category")["category"] if category == "A1")
non_a1_count = sum(len(values) for category, values in df.groupby("category")["category"] if category != "A1")
print(f"A1 count: {a1_count}\n")
print(f"Non-A1 count: {non_a1_count}")
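Once the single categorical column is loaded, `value_counts` gives both numbers without a groupby; a minimal sketch with an inline stand-in for the df built above:

```python
import pandas as pd

# Stand-in for the df read above: one categorical column.
df = pd.DataFrame({"category": pd.Categorical(["A1", "A1", "A2", "A3", "A4"])})

counts = df["category"].value_counts()
a1_count = int(counts.get("A1", 0))
non_a1_count = int(counts.drop("A1", errors="ignore").sum())
```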
KolaB