I wrote a bash script to split a file that looks like this:

@<TRIPOS>MOLECULE
ZINC32514653
....
....

@<TRIPOS>MOLECULE
ZINC982347645
....
....

Here is the script I wrote:

#!/bin/bash
# split the file into files named xx##.mol2
csplit -b '%02d.mol2' ./Zincpharmer_ligprep_1.mol2 '/@<TRIPOS>MOLECULE/' '{*}'
# rename each xx##.mol2 file after its 2nd line, which is ZINC######
for filename in ./xx*.mol2; do
    newFilename=$(sed -n 2p "$filename")
    if [ ! -e "./$newFilename.mol2" ]; then
        mv -i "$filename" "./$newFilename.mol2"
    else
        # name already taken: append _2, _3, ... until a free name is found
        num=2
        while [ -e "./${newFilename}_$num.mol2" ]; do
            num=$((num+1))
        done
        mv "$filename" "./${newFilename}_$num.mol2"
    fi
done

I have two questions:

1) Is there a way to pass the prefix option to csplit and tell it that the prefix is the line after the separator?

2) The first file created by csplit, xx00, is empty, because the separator is on the first line of the input. How can I avoid this?

The expected output would be files named ZINC32514653.mol2 and ZINC982347645.mol2. In case there are two entries with the same ZINC###, the second one should be named ZINC982347645_2.mol2.

ta8
  • If you can add the expected output and a sample input, folks here will try and give you a possibly more effective solution! – Inian Jul 28 '16 at 11:58

2 Answers

This can't be done with csplit. I recommend something along the lines of:

awk  '/@<TRIPOS>MOLECULE/ { getline file; next } {print $0 > file }'
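
For what it's worth, here is a slightly fuller sketch of the same idea. As written, the one-liner drops the separator and name lines from the output and does not add the .mol2 extension. The version below keeps both lines, appends .mol2, and numbers repeated names as NAME_2.mol2, NAME_3.mol2, and so on. The sample.mol2 file and its "atom data" lines are made up for illustration; substitute the real input file.

```shell
#!/bin/bash
# Hypothetical sample input; in practice this would be Zincpharmer_ligprep_1.mol2.
cat > sample.mol2 <<'EOF'
@<TRIPOS>MOLECULE
ZINC32514653
atom data A

@<TRIPOS>MOLECULE
ZINC982347645
atom data B

@<TRIPOS>MOLECULE
ZINC982347645
atom data C
EOF

awk '
/^@<TRIPOS>MOLECULE/ {
    header = $0
    if ((getline name) <= 0) exit      # no name line follows: stop
    if (file != "") close(file)        # avoid running out of open files
    n = ++seen[name]                   # how often this name has occurred
    file = (n == 1) ? name ".mol2" : name "_" n ".mol2"
    print header > file                # keep the separator line
    print name > file                  # keep the molecule name line
    next
}
file != "" { print > file }            # body lines go to the current file
' sample.mol2
```

Since the separator is matched rather than used as a split point, no empty leading file is created. Note that the duplicate counter only tracks names seen in this run; it does not check for files that already exist in the directory.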
Michael Vehrs

All you need to know is available from the csplit man page:

To tell csplit to change the prefix:

-f, --prefix=PREFIX
       use PREFIX instead of 'xx'

To exclude empty files:

-z, --elide-empty-files
       remove empty output files
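
As a rough illustration of those two options together, assuming GNU csplit (-z, -b and '{*}' are GNU extensions). The prefix mol_ is arbitrary and sample.mol2 is a made-up input:

```shell
#!/bin/bash
# Hypothetical two-molecule input, with the separator on the very first line.
printf '%s\n' \
    '@<TRIPOS>MOLECULE' 'ZINC32514653' 'atom data A' '' \
    '@<TRIPOS>MOLECULE' 'ZINC982347645' 'atom data B' > sample.mol2

# -s  suppress the byte counts csplit normally prints
# -z  remove the empty piece before the first separator
# -f  use the fixed prefix "mol_" instead of "xx"
# -b  printf-style suffix format, giving mol_00.mol2, mol_01.mol2, ...
csplit -s -z -f mol_ -b '%02d.mol2' sample.mol2 '/@<TRIPOS>MOLECULE/' '{*}'
```

This should leave mol_00.mol2 and mol_01.mol2 with no empty leading file. The prefix is still a fixed string, so naming each file after its second line needs a separate rename step.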
Inian
  • In other words, the answer to the first question is no. – Michael Vehrs Jul 28 '16 at 12:04
  • @MichaelVehrs: I'm not sure why you said that; as far as I can see, the `-f` flag is meant for that – Inian Jul 28 '16 at 12:14
  • Thank you for your answers so far. I did not see the -z option. I am aware of the -f option, but how do I tell the system that the prefix should be the line after the separator, i.e. ZINC#####? – ta8 Jul 28 '16 at 12:31
  • @AndreasTosstorff: You need to specifically add the prefix after the `-f` flag – Inian Jul 28 '16 at 12:34
  • Yes, I understand that, but the prefix is a variable: for every file created, it would be the ZINC#### after the separator used to create this file. So in the example given, the filenames should be something like ZINC32514653.mol2 and ZINC982347645.mol2. I am looking for a way to write: csplit -f MOLECULE> – ta8 Jul 28 '16 at 12:39
  • @Inian The point is that the prefix argument is fixed, whereas the OP wants to extract the name of each file from the input file itself. – Michael Vehrs Jul 28 '16 at 12:57