
I am working with a text data file which I have been able to extract data from as CSV/XLSX using:

import pandas as pd

# Read the file and split each line on whitespace.
with open('file.txt', 'r') as token:
    resulttoken = [line.split() for line in token]

df = pd.DataFrame(resulttoken)
df.to_csv('file.csv', index=None, header=None)

print('done!')

But my file contains keywords that distinguish data from different sources (such as different batches of experiments). The structure of the data is as follows:

Keyword1

Column 1 Column2 Column3 Column4 ....

Keyword2

Column 1 Column2 Column3 Column4 ....

Keyword3

Column 1 Column2 Column3 Column4 ....

and so on...

With the code above I can only extract the data into columns. What I want instead is to create one sheet per keyword in an Excel workbook and write each keyword's data into its own sheet. I would highly appreciate any help in this regard.

Thanks!

marc_s
AsiQur

2 Answers


A CSV file does not have sheets, so you can't do this with to_csv(). Use to_excel() instead, where tabname is the sheet name you want:

df.to_excel('file.xlsx', index=None, header=None, sheet_name=tabname)

According to Pandas documentation:

Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it is necessary to save the changes.
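To actually get more than one sheet into the same workbook, the to_excel() calls must share a single ExcelWriter; a bare to_excel('file.xlsx', ...) call overwrites the file each time. A minimal sketch, with made-up DataFrames and sheet names standing in for the per-keyword data:

```python
import pandas as pd

# Two example DataFrames standing in for the per-keyword data.
df1 = pd.DataFrame([[1, 2], [3, 4]])
df2 = pd.DataFrame([[5, 6], [7, 8]])

# A shared ExcelWriter keeps both sheets in one workbook; each
# to_excel() call writes under its own unique sheet_name.
with pd.ExcelWriter('file.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Keyword1', index=False, header=False)
    df2.to_excel(writer, sheet_name='Keyword2', index=False, header=False)
```

Saving happens automatically when the with block exits.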

Jeff Learman

Assuming each source (keyword) can recur at various points in the file, the first task is to group all data lines by source. Once that's done, we can write each source's data to a separate CSV file (because, as noted, CSV files can't have sheets).

This Python 3 program assumes your data file fits into RAM:

import collections
import csv
import os

tally = collections.defaultdict(list)

with open('file.txt', 'r') as f:
    # Skip blank lines so that keyword lines and data lines alternate.
    lines = [line for line in f if line.strip()]

for i in range(0, len(lines), 2):
    # lines[i] is the source keyword; .rstrip() removes the trailing newline.
    # lines[i+1] is one row of data, as a single string.
    tally[lines[i].rstrip()].append(lines[i + 1])

# Output files go into an already-existing subdirectory called "subdirectory".
for source, data_rows in tally.items():
    with open(os.path.join('subdirectory', '%s.csv' % source), 'w', newline='') as csvfile:
        filewriter = csv.writer(csvfile)    # Delimiter etc. can be tailored here.
        for data_row in data_rows:
            filewriter.writerow(data_row.split())

Using the csv package is a lighter-weight solution than using Pandas. With larger data sets you can open CSV files as needed, and keep them open as you parse the input file. But for quick-and-dirty scripts that involve grouping things, I find collections.defaultdict( list ) to be very handy.
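If you do want the original goal of one Excel sheet per keyword rather than separate CSV files, the same grouped dict can be handed to Pandas instead of the csv writer. A sketch under one extra assumption: each keyword is a valid Excel sheet name (at most 31 characters, no characters such as / or *). The sample tally contents below are made up for illustration:

```python
import collections

import pandas as pd

# Grouped data as built by the loop above: keyword -> list of raw data lines.
tally = collections.defaultdict(list)
tally['Keyword1'].append('1 2 3 4\n')
tally['Keyword1'].append('5 6 7 8\n')
tally['Keyword2'].append('9 10 11 12\n')

# One sheet per keyword, all in a single workbook.
with pd.ExcelWriter('by_keyword.xlsx') as writer:
    for source, data_rows in tally.items():
        rows = [row.split() for row in data_rows]
        pd.DataFrame(rows).to_excel(writer, sheet_name=source,
                                    index=False, header=False)
```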

Iron Pillow