
I have a very long file (yes, this is DNA in FASTA format) that is actually a batch of several files concatenated together and written to stdout. E.g.:

>id1
ACGT
>id2
GTAC
=
>id3
ACGT
=
>id4
ACCGT
>id6
AACCGT

I want to split this stream according to a pattern (here shown as =) and perform actions on each piece individually.

I've looked into something like

myprogram | while IFS= read -r -d '=' STRING; do
  # do something with "$STRING"
done

but I'm concerned that putting a large amount of data into a variable will be very inefficient. In addition, I've read that read (...) is inefficient per se.

I'd like to find something like csplit that outputs the pieces into a loop, but I couldn't come up with anything smart. Ideally something like this very bad pseudocode:

myprogram | csplit - '=' | while csplit_outputs; do
  # do something with csplit_outputs
done

I'd like to avoid writing temporary files as well, as I fear that would also be very inefficient.

Does that make any sense?

Any help appreciated!

Lionel Guy

1 Answer


I would use awk, and set the record separator to =.

awk '{do something}' RS='=' input.file
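Applied to the stream in the question, it might look like the sketch below, where myprogram stands for the producer from the question and the printf action is just a placeholder for whatever should happen to each piece:

myprogram | awk '
  BEGIN { RS = "=" }                           # each = starts a new record
  { printf "--- piece %d ---\n%s", NR, $0 }    # placeholder action per piece
'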
hek2mgl
  • Thanks for that. It sounds good, but then the 'do something' is trapped inside awk. I'd like to get the output back into bash. Say the 'do something' is five different scripts linked by pipes; how would that work? – Lionel Guy Jul 06 '15 at 12:32
  • I'm not sure what your scripts are doing. If they are working with text (grepping, filtering, etc.), you might use just one tool - awk - to replace them. Otherwise, you might use the `system()` function in awk, which allows executing commands *in a shell*. *In a shell* means that you can do: `system("cmd1 | cmd2 ...");` (see the sketch after these comments). – hek2mgl Jul 06 '15 at 12:52
  • Right, it would probably do the job, but it would mean calling a subshell from awk, which might make it more difficult to handle errors. – Lionel Guy Jul 06 '15 at 13:32
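
A minimal sketch of that per-piece pipeline idea, splitting on = and feeding each piece to a shell pipeline from inside awk. It uses awk's print-to-pipe plus close() rather than system(), so that each piece actually reaches the commands; cmd1 | cmd2 is the placeholder pipeline from the comments, and myprogram is the producer from the question:

myprogram | awk '
  BEGIN { RS = "="; pipeline = "cmd1 | cmd2" }   # = separates records; pipeline is a placeholder
  {
    print $0 | pipeline                          # send the current piece to the shell pipeline
    close(pipeline)                              # restart the pipeline so each piece is handled separately
  }
'

Closing the pipe after every record means the downstream scripts are restarted once per piece, which keeps the pieces separate at the cost of relaunching the pipeline each time.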