
I'm having a problem reading a .txt file. The file contains a huge amount of data (88,604,154 lines, about 2695.79 MB), and I have to analyze the data and then plot a histogram of it.

The problem is that it takes ages to read that much data, so I thought I could read the data in parts and join the parts together. After a little searching I came up with this code:

file_name = '/home/lam/Downloads/C3--Trace--00001.txt'

# Line numbers to keep; a set makes the membership test below fast
lines_num = set(range(1, 50001))

with open(file_name, 'r') as fp:
    lines = []
    for i, line in enumerate(fp):  # note: enumerate counts from 0
        if i in lines_num:
            lines.append(line.strip())
        elif i > 50001:
            break

With this I can read a certain range of lines (for example, lines 1 to 50000), but I would need to repeat the code about 1775 times to read all the data and then append it into one list. How can I write a function for this?

1 Answer


You need to read in chunks until there are no more chunks available:

with open(r"/home/lam/Downloads/C3--Trace--00001.txt", 'r') as src, open("sink.txt", 'w') as sink:
    chunk_size = 1024 * 1024  # 1024 * 1024 bytes = 1 MiB
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # read() returns an empty string at end of file
            break
        sink.write(chunk)

Here I'm reading one chunk at a time and writing that data into another file.

The read function advances the file position automatically, so you don't need to track an offset yourself.
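
One caveat with fixed-size chunks: a read can stop in the middle of a line. If you want to process whole lines rather than raw text, here is a minimal sketch (using the same path as above) that carries the incomplete tail of each chunk over to the next one:

with open(r"/home/lam/Downloads/C3--Trace--00001.txt", 'r') as src:
    chunk_size = 1024 * 1024  # 1 MiB of text per read
    leftover = ""             # partial line carried over from the previous chunk
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        pieces = (leftover + chunk).split("\n")
        leftover = pieces.pop()  # the last piece may be an incomplete line
        for line in pieces:
            pass  # process one complete line here
    if leftover:
        pass  # process the final line (the file may not end with a newline)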

You could also use the code you shared, but remove the line-number check and the break:

file_name = "/home/lam/Downloads/C3--Trace--00001.txt"

with open(file_name, 'r') as fp:
    lines = []
    for line in fp:
        lines.append(line.strip())
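
If you do want the 50,000-line batches from your question, a small generator saves you from repeating the code 1775 times. This is only a sketch (read_in_batches and batch_size are names I'm introducing here), but itertools.islice is the standard-library way to pull a fixed number of lines from an open file:

from itertools import islice

def read_in_batches(path, batch_size=50000):
    # Yield lists of stripped lines, batch_size lines at a time,
    # until the file is exhausted
    with open(path, 'r') as fp:
        while True:
            batch = [line.strip() for line in islice(fp, batch_size)]
            if not batch:
                break
            yield batch

all_lines = []
for batch in read_in_batches(file_name):
    all_lines.extend(batch)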

Example of how to calculate the mean of the whole data set from per-line means (a weighted mean of means):

import statistics

means = []
total_nums = 0

with open(r"./info.txt", 'r', newline="\n") as src:
    for line in src:
        # Parse one comma-separated row of integers
        nums = [int(num) for num in line.split(",")]
        # Store this row's mean together with how many values produced it
        means.append({"num": len(nums), "mean": statistics.mean(nums)})
        total_nums += len(nums)

# The weighted mean of the per-row means equals the mean of the whole data set
total_mean = 0
for mean in means:
    total_mean += mean["mean"] * (mean["num"] / total_nums)
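
The same idea works for the histogram: keep a fixed set of bins and accumulate the counts line by line, so the full data set never has to fit in memory. A sketch, assuming the same comma-separated rows as above; the bin range (0 to 100 here) is a placeholder you would adjust to your data:

import numpy as np
import matplotlib.pyplot as plt

bin_edges = np.linspace(0, 100, 101)  # placeholder range; adjust to your data
counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)

with open(r"./info.txt", 'r', newline="\n") as src:
    for line in src:
        values = [int(num) for num in line.split(",")]
        # Bin this row's values and add them to the running totals
        hist, _ = np.histogram(values, bins=bin_edges)
        counts += hist

plt.stairs(counts, bin_edges)  # plt.stairs needs matplotlib >= 3.4; plt.bar works too
plt.show()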
  • @Kia if it's crashing, it's probably because you are running out of memory. By reading in chunks you can do some kind of filtering or aggregation to reduce the number of lines or data points you have, and then plot that new, smaller list. I wrote to another file just as an example; you can basically do whatever you want with this information. Another thing you may want to look into when dealing with data that crashes your computer is pyspark. That library lets you create a lazy DataFrame, and then you can use its built-in methods for plotting. Please provide an example row of your data – Kushim Jun 13 '23 at 14:22
  • Yeah got the problem and got it fixed – Kia Jun 14 '23 at 08:14
  • But when I want to calculate the mean of my data I run into this error: MemoryError: Unable to allocate 10.2 GiB for an array with shape (2613,) and data type – Kia Jun 14 '23 at 08:17
  • Mathematically speaking, you can calculate the mean of each chunk, store these means in an array, and then calculate the mean of the entire data set by taking the mean of the means. In your case the chunks should probably have the same length, but just to be sure you can store both the number of values used to calculate each mean and the actual mean, and keep a variable where you sum the number of values; at the end you can calculate each weight by dividing the number of values used for that mean by the total number of values. I will update my answer with an example – Kushim Jun 20 '23 at 23:50