
I have the following files with the following content (one line per file):

<189>162: CSR-1000V: *Sep 27 06:17:02: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback317, changed state to up
<189>165: CSR-1000V: *Sep 27 06:17:07: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback320, changed state to up
<189>164: CSR-1000V: *Sep 27 06:17:06: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback319, changed state to up
<189>161: CSR-1000V: *Sep 27 06:16:59: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loopback316, changed state to up
<189>163: CSR-1000V: *Sep 27 06:17:04: %LINEPROTO-5-UPDOWN: Line protocol on Interface Loop

I want to create a Python script that appends those to a single file (output.txt), but I am stuck: I am using a for loop and the script keeps adding the existing lines over and over.

Any ideas?

Thank you

Fudgy
Adrian Cincu
  • Show us your code. – Standard Sep 27 '19 at 06:59
  • #!/usr/bin/python import subprocess subprocess.call('cd /home/adrian/from_hdfs; for f in *; do (cat "${f}"; echo) >> finalfile.txt; done', shell=True) – Adrian Cincu Sep 27 '19 at 07:06
  • So do you want to append the lines to the existing file or replace them? – Standard Sep 27 '19 at 07:11
  • finalfile.txt should be created; then, for each new file that contains a new line, this line should be appended to finalfile.txt, without re-adding the existing lines from existing files. – Adrian Cincu Sep 27 '19 at 07:14
  • So you only want to add the new lines from your log file? – Standard Sep 27 '19 at 07:20
  • Yes :) A new file with a new line comes into the folder => append this new line to finalfile.txt. If this new line already exists in finalfile.txt, then skip it; and so on for any other incoming files. – Adrian Cincu Sep 27 '19 at 07:21
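In other words, the requirement from the comment thread is: append each incoming line to finalfile.txt unless it is already there. A minimal sketch of that line-level check (untested; the directory and file names are taken from the comments above):

import os

directory = "/home/adrian/from_hdfs/"  # incoming files, one log line each
output = "finalfile.txt"

# Collect the lines already in the output file, so duplicates can be skipped.
seen = set()
if os.path.exists(output):
    with open(output) as f:
        seen = set(line.rstrip("\n") for line in f)

with open(output, "a") as out:
    for filename in os.listdir(directory):
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            continue  # ignore subdirectories
        with open(path) as src:
            for line in src:
                line = line.rstrip("\n")
                if line and line not in seen:  # skip empty and duplicate lines
                    out.write(line + "\n")
                    seen.add(line)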

3 Answers


(Attached image: "Flows", a screenshot of the NiFi data pipeline.) As you can see in the attachment, there is a data pipeline in Apache NiFi with an "ExecuteScript" processor, where I run the above Python code. The problem, as I described, is that the existing lines from the files keep getting added continuously.

Adrian Cincu

There is more than one way this can be handled, but it depends on your environment:

First one: Read through the files in the directory and append their data to your output file. Then record the already-read files in a dictionary and save it to disk, using pickle or json. The next time your code gets called, parse that dictionary and skip the files recorded in it. (PS: Use Python for the file handling; this is exactly its use case.)

Second one: Pass the newly created files as arguments, if that is suitable for you (I don't know anything about apache-nifi).

Third one: Compare the incoming lines with the lines already in your output file, but that costs a lot of performance and could be very unreliable.

Fourth one: Move the already-read files into a subdirectory (a sketch of this follows below).

I would choose method one, as it's quite simple and straightforward.
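For illustration, a minimal sketch of the fourth method (untested; the "processed" subdirectory name is made up here):

import os
import shutil

directory = "/home/adrian/from_hdfs/"
processed = os.path.join(directory, "processed")  # hypothetical holding area for read files
if not os.path.isdir(processed):
    os.makedirs(processed)

with open("finalfile.txt", "a") as outfile:
    for filename in os.listdir(directory):
        file_abs = os.path.join(directory, filename)
        if not os.path.isfile(file_abs):
            continue  # skip the processed subdirectory itself
        with open(file_abs) as src:
            outfile.write(src.read())
        # Move the file out of the way so the next run will not pick it up again.
        shutil.move(file_abs, os.path.join(processed, filename))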

edit: I made a piece of code (did not test it); if it doesn't work out of the box, it should be clear what to do anyway.

import json
import os

directory = "/home/adrian/from_hdfs/"
state_file = "result.json"  # remembers which files have already been read

# load the state from the previous run, if there is one
parsed = {}
if os.path.exists(state_file):
    with open(state_file) as json_file:
        parsed = json.load(json_file)

# open output file
with open("finalfile.txt", "a") as outfile:

    # loop through src directory
    for filename in os.listdir(directory):
        if filename in parsed:
            continue  # skip file if already read

        file_abs = os.path.join(directory, filename)

        # print("Reading file: " + file_abs)
        with open(file_abs, "r") as src_file:
            outfile.write(src_file.read())  # append data from src to dest
            parsed[filename] = 1

# save the state for the next run
with open(state_file, 'w') as fp:
    json.dump(parsed, fp)
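Note that the script loads its state from and saves it to the same file (result.json), and starts with an empty state on the very first run, when that file does not exist yet; every later run then only appends files that are not yet recorded in it.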
Standard
#CODE:

#!/usr/bin/python

import subprocess
import json
import os


# I am using this to generate "data.txt" from your example:
subprocess.call('cd /home/adrian/from_hdfs; for f in *; do (cat "${f}"; echo) >> notfinal.txt; done', shell=True)

directory = "/home/adrian/from_hdfs/"

parsed = {}
with open('/home/adrian/from_hdfs/notfinal.txt') as json_file:
    parsed = json.load(json_file)


#open output file
with open("finalfile.txt", "a") as outfile:

    #loop through src directory
    for filename in os.listdir(directory):
        if filename in parsed: 
            continue # skip file if already read

        file_abs = os.path.join(directory, filename)

        #print("Reading file: "+file_abs)
        with open(file_abs, "r") as src_file:
            myfile.write(src_file.read()) #append data from src to dest
            parsed[filename] = 1



with open('result.json', 'w') as fp:
    json.dump(parsed, fp)



Traceback (most recent call last):
  File "./script.py", line 14, in <module>
    parsed = json.load(json_file)
  File "/usr/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
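The ValueError means json.load was given something that is not JSON: here it is pointed at notfinal.txt, which is the concatenated log text, not the saved state. The file passed to json.load must be the JSON state file written at the end of the previous run (result.json in the answer above), and the load must be guarded for the first run, when that file does not exist yet. A minimal sketch of the guarded load (untested; file name taken from the answer):

import json
import os

state_file = "result.json"  # JSON state from the previous run, never the log data

parsed = {}
if os.path.exists(state_file):
    with open(state_file) as json_file:
        parsed = json.load(json_file)  # only ever parse the JSON state file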
Adrian Cincu