Reading a text file with pandas to get specific lines

Question

I'm trying to read a text log file in pandas with the read_csv method, and I need to get the line that comes before each ---- delimiter. I have defined column names to make it easier to select the data by column, but I can't find a way to achieve this.

My raw log data:

myserer143
-------------------------------
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.

Are you sure you want to continue [Yy/Nn]?

Uninstalling dependant solutions...
Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
Unregistering the Script Task Plugin...
Unregistering the Power Control Task Plugin...
Unregistering the Service Control Task Plugin...
Unregistering the Web Service Task Plugin...
Unregistering the Reset Task Agent Task Plugin...
Unregistering the Agent Control Task Plugin...
Unregistering solution...
Unregistering the SMF cli plug-in...
Unregistering the Software Management Framework Agent sub-agent...
Removing wrapper scripts and links for applications...
Unregistering the Software Management Framework Agent Plugins...
Removing wrapper scripts and links for applications...
Unregistering solution...
Unregistering the CTA cli plug-in...
Unregistering the Client Task Scheduling sub-agent...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
Remove the wrapper script and link for the Task Util application...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
Unregistering the Client Task Scheduling Plugin...
Unregistering the Alert User Task Plugin...
Unregistering the shared library...
Unregistering solution...
Unregistering the Inventory Rule Agent...
Removing wrapper scripts and links for applications...
Unregistering the Inventory Rule Agent Plugin...
Removing wrapper scripts and links for applications...
Unregistering solution...
Uninstalling dependant solutions finished.

Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
Removing wrapper scripts and links for applications...
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
Remove non packaged files.
Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
  Removing aex-* links in /usr/bin
  Removing RC init links and scripts
Cleaning up after final package removal.
Removal finished.

Uninstallation has finished.
dbserer144
-------------------------------
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
This will remove the Symantec Management Agent for UNIX, Linux and Mac software from your system.

Are you sure you want to continue [Yy/Nn]?

Uninstalling dependant solutions...
Unregistering the Altiris Base Task Handlers for UNIX, Linux and Mac sub-agent...
Unregistering the Script Task Plugin...
Unregistering the Power Control Task Plugin...
Unregistering the Service Control Task Plugin...
Unregistering the Web Service Task Plugin...
Unregistering the Reset Task Agent Task Plugin...
Unregistering the Agent Control Task Plugin...
Unregistering solution...
Unregistering the SMF cli plug-in...
Unregistering the Software Management Framework Agent sub-agent...
Removing wrapper scripts and links for applications...
Unregistering the Software Management Framework Agent Plugins...
Removing wrapper scripts and links for applications...
Unregistering solution...
Unregistering the CTA cli plug-in...
Unregistering the Client Task Scheduling sub-agent...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac sub-agent...
Remove the wrapper script and link for the Task Util application...
Unregistering the Altiris Client Task Agent for UNIX, Linux and Mac Plugin...
Unregistering the Client Task Scheduling Plugin...
Unregistering the Alert User Task Plugin...
Unregistering the shared library...
Unregistering solution...
Unregistering the Inventory Rule Agent...
Removing wrapper scripts and links for applications...
Unregistering the Inventory Rule Agent Plugin...
Removing wrapper scripts and links for applications...
Unregistering solution...
Uninstalling dependant solutions finished.
Removing Symantec Management Agent for UNIX, Linux and Mac package from the system...
Removing wrapper scripts and links for applications...
Stopping Symantec Management Agent for UNIX, Linux and Mac: [  OK  ]
Remove non packaged files.
Symantec Management Agent for UNIX, Linux and Mac Configuration utility.
  Removing aex-* links in /usr/bin
  Removing RC init links and scripts
Cleaning up after final package removal.
Removal finished.

Uninstallation has finished.

The DataFrame currently looks like this:

>>> data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a", "b", "c"], engine="python")
>>> data
                                                       a   b   c
0                                              myserer143 NaN NaN
1                        ------------------------------- NaN NaN
2      Stopping Symantec Management Agent for UNIX, L... NaN NaN
3      This will remove the Symantec Management Agent... NaN NaN
4             Are you sure you want to continue [Yy/Nn]? NaN NaN
5                    Uninstalling dependant solutions... NaN NaN
6      Unregistering the Altiris Base Task Handlers f... NaN NaN
7                Unregistering the Script Task Plugin... NaN NaN
8         Unregistering the Power Control Task Plugin... NaN NaN
9       Unregistering the Service Control Task Plugin... NaN NaN

Expected result:

myserer143
dbserer144

Or, if it is doable:

myserer143 Uninstallation has finished
dbserer144 Uninstallation has finished

@BernardL, thanks for the reply, I just updated the question. — krock1516, Nov 30 '18 at 06:24
Where did you get `dbserer144 Uninstallation has finished` from? Is it in your full text file? — BernardL, Nov 30 '18 at 06:34
I don't see the `Uninstallation has finished` message for `dbserer144`. — BernardL, Nov 30 '18 at 06:42
BTW, you should also consider the case where the log file is too large for memory. — BernardL, Nov 30 '18 at 06:58
I added a solution that takes "too large for memory" into consideration. — BernardL, Dec 01 '18 at 03:12
It takes away the overhead of creating a dataframe and storing unwanted data in memory, thus the generator. — BernardL, Dec 01 '18 at 05:12

jezrael · Accepted Answer · 2018-11-30T08:08:01.950

Use shift with str.startswith to build boolean masks, then filter with boolean indexing:

data = pd.read_csv("alt_1.logs", sep='delimiter', names=["a"], engine="python")

m1 = data['a'].shift(-1).str.startswith('----', na=False)
m2 = data['a'].shift(-2).str.startswith('----', na=False)
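
To see what the two masks select, here is a minimal sketch on a made-up eight-line series (the sample values are illustrative, not the real log). `shift(-1)` aligns each row with the line that follows it, so `m1` is true on the server-name rows; `shift(-2)` looks two lines ahead, so `m2` is true on the final line of the preceding block.

```python
import pandas as pd

# Tiny stand-in for the log: two blocks of "name, delimiter, body, final line".
s = pd.Series([
    "srv1", "-----", "working...", "done.",
    "srv2", "-----", "working...", "done.",
])

# True where the NEXT line is a delimiter, i.e. on the server-name rows.
m1 = s.shift(-1).str.startswith("----", na=False)
# True where the line TWO rows ahead is a delimiter, i.e. on the final
# line of the preceding block.
m2 = s.shift(-2).str.startswith("----", na=False)

name_rows = s.index[m1].tolist()
closing_rows = s.index[m2].tolist()
print(name_rows)     # [0, 4]
print(closing_rows)  # [3]
```

The final line of the last block has no delimiter after it, so neither mask catches it; that is why the last row has to be added separately in the next step.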

Filter the rows and add the last row of the DataFrame with concat (DataFrame.append was removed in pandas 2.0):

data = pd.concat([data[m1 | m2], data.iloc[[-1]]])
print (data)
                               a
0                     myserer143
44  Uninstallation has finished.
45                    dbserer144
89  Uninstallation has finished.

Reshape the values into pairs and join the text together:

df = pd.DataFrame(data.values.reshape(-1,2)).apply(' '.join, 1).to_frame('data')
print (df)
                                      data
0  myserer143 Uninstallation has finished.
1  dbserer144 Uninstallation has finished.
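
If you also want to confirm that each server's block really ends with the finishing message, rather than assuming the line before the next server name is the right one, a grouping variant is possible. This is only a sketch and not the approach used above: the sample lines and the column names `server`, `status` and `finished` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature log; the second block deliberately ends differently.
lines = [
    "srv1", "-----", "working...", "Uninstallation has finished.",
    "srv2", "-----", "working...", "Removal finished.",
]
data = pd.DataFrame({"a": lines})

# A row is a block header when the line after it is a delimiter.
is_header = data["a"].shift(-1).str.startswith("----", na=False)
# A running count of headers gives every row the id of its block.
block_id = is_header.cumsum()

# First line of each block is the server name, last line is its final message.
summary = data.groupby(block_id)["a"].agg(server="first", status="last")
summary["finished"] = summary["status"].eq("Uninstallation has finished.")
print(summary)
```

Because every row carries its block id, the last line is taken per server, so a block that stops early shows up as `finished == False` instead of silently borrowing a line from its neighbour.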

EDIT:

For better performance, or when working with a large file, you can loop over the lines into a list, collect the values into a list of dictionaries, and create the DataFrame from that. Finally, shift column b and fill in the last value:

L = []
with open('alt_1.logs', 'r') as f:
    for line in f:
        line = line.strip()
        # keep only non-empty lines
        if line:
            L.append(line)
# prepend the file's last line so that L[i-2] for the first delimiter is that line
L = L[-1:] + L

# a = line before the delimiter (server name), b = line before that
out = [{'a':L[i-1], 'b':L[i-2]} for i, x in enumerate(L) if x.startswith('---') ]
print (out)
[{'a': 'myserer143', 'b': 'Uninstallation has finished.'}, 
 {'a': 'dbserer144', 'b': 'Uninstallation has finished.'}]

df = pd.DataFrame(out)
df['b'] = df['b'].shift(-1).fillna(df.loc[0,'b'])
df = df.apply(' '.join, 1).to_frame('data')
print (df)
                                      data
0  myserer143 Uninstallation has finished.
1  dbserer144 Uninstallation has finished.

Thanks. However, can we make sure that `myserer143` has `Uninstallation has finished.` at the end of its block before the next server, `dbserer144`, starts? As of now we are just appending it after the match. — krock1516, Nov 30 '18 at 06:43
Is there a better way to read a txt file in pandas than `read_csv`? — Karn Kumar, Nov 30 '18 at 07:37
Many thanks again for your time and the nice solution, @jezrael! — krock1516, Nov 30 '18 at 08:10

BernardL · Answer 2 · 2018-12-01T05:06:23.437

Considering that there are many lines in the data you do not need, I think it would be better to prepare the data before loading it into the dataframe.

Based on the file, the pieces of information you need are always separated by a delimiter line of '-------...', so it makes sense to look ahead in the generator for those lines and load only the 2 lines before each delimiter.

We do that by taking the first 2 lines out at the start, and then looping through the rest of the file to get the information you need.

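from itertools import tee, islice, zip_longest

import pandas as pd

results = []
n = 2  # number of lines to keep before each delimiter

with open('sample.txt', 'r') as f:
    first = next(f)  # first server name
    delim = next(f)  # delimiter line, reused to spot the later delimiters
    results.append(first)

    # two independent iterators over the rest of the file
    peek, lines = tee(f)

    for idx, val in enumerate(lines):
        if val == delim:
            # look back n lines: the previous block's last line and the next server name
            for prev in islice(peek.__copy__(), idx - n, idx):
                results.append(prev)
        last = idx

    # the last line of the file closes the final block
    for line in islice(peek.__copy__(), last, last + 1):
        results.append(line)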

results
>> ['myserer143\n',
 'Uninstallation has finished.\n',
 'dbserer144\n',
 'Uninstallation has finished.\n',
 'dbserer144\n',
 'Uninstallation has finished.']

At this point the returned list holds only the information you need (the first line, the two lines before each delimiter, and the last line), so none of the unused lines are loaded into the DataFrame.
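
If memory is the main concern, the same look-back idea can also be sketched with a fixed-size window instead of `tee`. This is a swapped-in technique, not the code above: the function name `lines_around_delimiters`, its parameters, and the sample text are all hypothetical. It assumes delimiter lines start with `----` and skips blank lines, and it keeps at most `n` lines in memory at any time.

```python
from collections import deque
import io


def lines_around_delimiters(handle, delim_prefix="----", n=2):
    """Yield the (up to) n lines before each delimiter, then the final line."""
    window = deque(maxlen=n)  # remembers only the last n non-delimiter lines
    last = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information here
        if line.startswith(delim_prefix):
            yield from window
            window.clear()
        else:
            window.append(line)
            last = line
    if last is not None:
        yield last  # closing line of the final block


# Illustrative input standing in for the real log file.
sample = io.StringIO(
    "srv1\n-----\nworking...\ndone.\nsrv2\n-----\nworking...\ndone.\n"
)
picked = list(lines_around_delimiters(sample))
print(picked)  # ['srv1', 'done.', 'srv2', 'done.']
```

The resulting list alternates server name and closing line, so it can be grouped in pairs in the same way as `results`.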

Then you can group the results in pairs, ready to be loaded into the dataframe, using the grouper recipe from the itertools documentation.

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

results = [i.strip() for i in results]
data = list(grouper(results, n))

df = pd.DataFrame(data, columns = ['Name','Status'])
df

>>
         Name                        Status
0  myserer143  Uninstallation has finished.
1  dbserer144  Uninstallation has finished.
2  dbserer144  Uninstallation has finished.

BernardL, thank you so much for taking the time to provide an alternative solution with such a detailed explanation. My +1. — krock1516, Dec 01 '18 at 04:44
As I'm not an expert in Python, could you tell me what `peek, lines = tee(f)` is doing here? — krock1516, Dec 01 '18 at 04:58
No problem, my bad for not adding the library and method names. Anyway, `tee` returns *n* independent iterators from a single iterable. The idea is that, instead of loading the whole file, we look ahead in the generator to find the line we want and then return a slice of it. But because of how generators work, you can't retrieve values that you have already consumed; that's why we keep a copy to refer back to once we have the index of `delim`. Similar to [here](https://stackoverflow.com/questions/1517862/using-lookahead-with-generators) — BernardL, Dec 01 '18 at 05:05
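
As a small illustration of what `tee` does, here is a sketch on a made-up four-item iterator (the names `ahead` and `behind` are only for this example). Advancing one of the returned iterators does not move the other.

```python
from itertools import islice, tee

source = iter(["a", "b", "c", "d"])
ahead, behind = tee(source)  # two independent iterators over the same items

first_three = list(islice(ahead, 3))  # consume three items from `ahead`
still_first = next(behind)            # `behind` has not moved

print(first_three)  # ['a', 'b', 'c']
print(still_first)  # a
```

Items that one iterator has read and the other has not are held in an internal buffer by `tee`, which is what makes looking back possible.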
