I'm a bit of a beginner when it comes to Python, but one of my school projects requires me to run classification algorithms on this Reddit popularity dataset. The files are huge .zst archives and can be found here: https://files.pushshift.io/reddit/submissions/ Anyway, I'm just not sure how to extract this into a database, since the assignments we've had so far used .csv datasets that I could easily load into a pandas DataFrame. I stumbled upon a different post and tried using this code:
import zstandard as zstd

def transform_zst_file(self, infile):
    zst_num_bytes = 2**22  # read 4 MiB of decompressed data at a time
    lines_read = 0
    # max_window_size is needed for the large pushshift archives
    dctx = zstd.ZstdDecompressor(max_window_size=2**31)
    with dctx.stream_reader(infile) as reader:
        previous_line = b""
        while True:
            chunk = reader.read(zst_num_bytes)
            if not chunk:
                break
            # split on raw bytes so a multi-byte UTF-8 character cut in
            # half at a chunk boundary does not crash decode()
            lines = (previous_line + chunk).split(b"\n")
            for line in lines[:-1]:
                self.appendData(line.decode("utf-8"), self.type)
                lines_read += 1
                if self.max_lines_to_read and lines_read >= self.max_lines_to_read:
                    return
            # carry the trailing partial line over to the next chunk
            previous_line = lines[-1]
But I am not entirely sure how to get this into a pandas DataFrame, or how to load only a certain percentage of the data points if the file is too big. Any help would be much appreciated!
The following code just crashes my computer every time I try to run it:

import zstandard as zstd

your_filename = "..."
with open(your_filename, "rb") as f:
    data = f.read()  # reads the entire compressed file into memory

dctx = zstd.ZstdDecompressor()
decompressed = dctx.decompress(data)  # then the entire decompressed file too
That might be because the file is too big; is there any way to extract just a percentage of this file into a pandas DataFrame?