
I am trying to grep a string out of a specific .gz file inside a .tar archive.

The tar file contains one compressed log per host, looking something like:

APPLOG/cp13ap011/logs/domeus.log.2021-07-09.gz
APPLOG/cp15ap043/logs/domeus.log.2021-07-09.gz
APPLOG/cp14ap411/logs/domeus.log.2021-07-09.gz
APPLOG/cp11ap231/logs/domeus.log.2021-07-09.gz

I located the tar file and first tried running zgrep on it directly:

find /backup/tmp/ -type f -name 'APPLOG-P10-2021-07-09.tar' |xargs zgrep -F 'communicationId=6700409965' >> ~/tmp/2021_07_09.txt

Then I realised it is a .tar file holding the records in the domeus logs, so I tried:

tar -tf APPLOG-P10-2021-07-09.tar -O |find APPLOG/ -type f -name 'domeus.log.2021-07-09*' | xargs zgrep -E "Id=6700409965" >> ~/tmp/2021_07_09.txt

The file is located at APPLOG/domeus.log.2021-07-09.gz. Multiple machines hold the record, and every machine has the same file name, domeus.log.2021-07-09*. The file is massive, so the search needs to be narrowed to the string "Id=6700409965".

In the end, I get no results from those files; the output file is empty:

-rw-r--r-- 1 0 Nov 15 16:58 2021_07_09.txt

The catch: I do not want to unpack the file unless there is no other option.

Charles Duffy
Haitham
  • `tar | find` doesn't make sense. `find` doesn't read from stdin so it won't see anything `tar` is sending it. – Charles Duffy Nov 15 '21 at 18:05
  • How important is it to do this efficiently? If it's very important, I'd use a different language, like Python, where the `tarfile` module lets you do all this in a single pass. The easy ways to do it in bash involve reading your input file twice (once to get the available names, once to extract content with the specific name(s) you care about). – Charles Duffy Nov 15 '21 at 18:06
  • @CharlesDuffy, I'd love to do that, but I barely have access to do anything on our backup – Haitham Nov 15 '21 at 18:07
  • I don't know what you mean re: not having access. Do you mean you don't have a Python interpreter available? (That would be surprising on a modern-ish system; Python has been built into most Linux distros for something like 20 years now). – Charles Duffy Nov 15 '21 at 18:08
  • ...anyhow, if you're okay with a slower approach that reads your input files twice, that's fine, we can implement that in bash easily enough. – Charles Duffy Nov 15 '21 at 18:09
  • In addition to @Charles Duffy, instruct tar to only extract the specific gz file you are after and then pipe directly to zgrep. – Martin Schapendonk Nov 15 '21 at 18:10
  • Also, if you don't have Python, that raises concerns about how old your version of bash is. What version are you working with? – Charles Duffy Nov 15 '21 at 18:10
  • @MartinSchapendonk, right, the question is if the OP has the full filename of that specific gz file before starting. (If they did, why would they use `find` at all?) – Charles Duffy Nov 15 '21 at 18:10
  • @Haitham, to be clear, this _might_ be as easy as `tar -xf APPLOG-P10-2021-07-09.tar -O domeus.log.2021-07-09.gz | gunzip -c >output-location`. But you haven't given enough information in the question for us to be sure. The output of `tar -tf` would be useful as a starting point. – Charles Duffy Nov 15 '21 at 18:12
  • @CharlesDuffy, there are multiple machines that hold the record, but all machines would have a duplicated file name `domeus.log.2021-07-09*` – Haitham Nov 15 '21 at 18:12
  • @Haitham, please put that information inside the question itself. Ideally, with an example of the actual output from `tar -tf` – Charles Duffy Nov 15 '21 at 18:14
  • @CharlesDuffy, also the file is huge, it's 8GB; I just need the specific string `"Id=6700409965"` – Haitham Nov 15 '21 at 18:14
  • @Haitham, ...that changes nothing substantive; then just `tar -xf file.tar -C file-to-extract | zgrep ... >output-location`. – Charles Duffy Nov 15 '21 at 18:15
  • @Haitham, ...and again, how to fill out `file-to-extract` depends on details you have persisted, thus far, in not giving us -- despite being asked multiple times. – Charles Duffy Nov 15 '21 at 18:16
  • (when you say "multiple machines" -- does each machine have a different _directory_ like `APPLOG`? Does each machine have a different tar file? We don't know what "multiple machines" means in the context of this question). – Charles Duffy Nov 15 '21 at 18:17
  • (mind, no matter what, `tar` is reading the tar file from the front until it gets to the data you're searching for -- it isn't an indexed format; having a footer that describes where to seek to find different files is one of the innovations that made `zip` special when it first came out, but `tar` is older). – Charles Duffy Nov 15 '21 at 18:19
  • I have updated the question. To answer yours specifically, the paths are `APPLOG/cp13ap011/logs/domeus.log.2021-07-09.gz`, `APPLOG/cp15ap043/logs/domeus.log.2021-07-09.gz`, `APPLOG/cp14ap411/logs/domeus.log.2021-07-09.gz` and `APPLOG/cp11ap231/logs/domeus.log.2021-07-09.gz`. I would just add `*` after the APPLOG to get the list – Haitham Nov 15 '21 at 18:22
  • I'm going to [edit] that comment into the question -- comments should only be supplemental information, but those filenames are critical for producing a correct answer. – Charles Duffy Nov 15 '21 at 18:24
  • And to repeat another question I asked you earlier: Which **specific version of bash** does this need to support? Is an answer that works with bash 4.0 and later acceptable? – Charles Duffy Nov 15 '21 at 18:26
  • GNU bash, version 5.0.3(1)-release (x86_64-pc-linux-gnu) – Haitham Nov 15 '21 at 18:27
  • One other thing. Are you using `>>` because you want to append to the file _within the same run of your script_, or do you want the script, when run multiple times, to concatenate _all_ those runs into your output file next to each other? – Charles Duffy Nov 15 '21 at 18:30
  • The `>>` is to get all the results into a file to be able to read after – Haitham Nov 15 '21 at 18:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239249/discussion-between-charles-duffy-and-haitham). – Charles Duffy Nov 15 '21 at 18:31

1 Answer

Unfortunately, doing this in bash is going to require multiple passes -- it would be much more efficient in Python, where the tarfile module lets you both decide which files you want to inspect, and read the content of those files, in one pass.

while IFS= read -r -d '' tarfile; do
  tar -xf "$tarfile" -T <(
    tar -tf "$tarfile" |
      grep -E 'APPLOG/(.*)/logs/domeus[.]log[.]2021-07-09[.]gz'
  ) -O |
    gunzip -c |
    grep 'Id=6700409965'
done < <(find /backup/tmp/ -type f -name 'APPLOG-P10-2021-07-09.tar' -print0) \
     >~/tmp/2021_07_09.txt

Providing documentation for the individual techniques used:

  • while read loops are discussed in detail in BashFAQ #1.
  • <(...) is process substitution syntax -- it expands to a filename from which the output of ... can be read, which is on modern platforms implemented with a named pipe or equivalent (so that content doesn't need to be written to disk and the processes can run in parallel).
  • tar -T expects the next argument to be a list of filenames to operate on.
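
For comparison, the Python approach mentioned at the top of this answer might look roughly like the sketch below. It is not something posted in the original thread: the function name `grep_tar` is made up for illustration, the member pattern is modeled on the `grep -E` expression above, and the search string is the one from the question. It assumes the tar file itself is uncompressed and that each matching member is a gzip file.

```python
import gzip
import re
import tarfile

# Assumptions taken from the question; adjust both for other dates or IDs.
MEMBER_RE = re.compile(r"APPLOG/[^/]+/logs/domeus\.log\.2021-07-09\.gz")
NEEDLE = b"Id=6700409965"


def grep_tar(tar_path, member_re=MEMBER_RE, needle=NEEDLE):
    """Yield (member_name, line) for every line containing `needle`
    in every archive member whose full name matches `member_re`."""
    with tarfile.open(tar_path, mode="r:") as tar:
        for member in tar:  # walks the archive headers front to back
            if not member.isfile() or not member_re.fullmatch(member.name):
                continue
            raw = tar.extractfile(member)
            # Decompress on the fly; nothing is unpacked to disk.
            with gzip.open(raw, mode="rb") as lines:
                for line in lines:
                    if needle in line:
                        yield member.name, line.rstrip(b"\n")
```

Calling `grep_tar("/backup/tmp/APPLOG-P10-2021-07-09.tar")` (the real location would come from the `find` command above) yields one `(member name, matching line)` pair per hit, so the host directory in the member name shows which machine each line came from. Lines are handled as bytes to avoid guessing the log encoding.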
Charles Duffy
  • tar: option requires an argument -- 'C' Try 'tar --help' or 'tar --usage' for more information. gzip: stdin: unexpected end of file – Haitham Nov 15 '21 at 19:00
  • Sorry, that should have been `-O`, not `-C`. – Charles Duffy Nov 15 '21 at 19:54
  • @Haitham, it's been a few days, so I'm curious where you're at -- if your devops staff had a better approach, any chance it could turn up as an answer here? And have you tried rerunning after fixing the `-O`/`-C` thinko? – Charles Duffy Nov 18 '21 at 21:10