I'm trying to copy part of a line and insert it again further along the same line:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz

becomes:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz

I have tried:

sed 's/\(.*(GCA_\)\(.*\))/\1\2\2)'
Sundeep
Sam Lipworth
  • A simpler version of the question would be "How to change `ftp://one/two/three_four/five` to `ftp://one/two/three_four/three/five`?" – George Vasiliou Sep 12 '17 at 09:10
  • I think it would be better if the OP explained how the new version is arrived at... it could be as simple as `xyz.5_foo.bar.baz` to `xyz.5/xyz_foo.bar.baz` – Sundeep Sep 12 '17 at 09:17

2 Answers

$ f1=$'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz'

$ echo "$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz

$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\1\2\3\/\2\4/' <<<"$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz

sed -E (or -r on some systems) enables extended regular expressions in sed, so you don't need to escape the grouping parentheses ( ).
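To make the difference concrete, here is a small sketch of one substitution written both ways. The pattern is a tightened variant (dots escaped, no leading group), not the exact expression from this answer, and the variable names url, bre and ere are only illustrative.

```shell
url='ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz'

# Basic regex (no -E): grouping parentheses must be escaped as \( \)
bre=$(printf '%s\n' "$url" | sed 's/\(GCA_[^._]*\)\(\.[^_]*\)/\1\2\/\1/')

# Extended regex (-E): plain ( ) group, everything else is identical
ere=$(printf '%s\n' "$url" | sed -E 's/(GCA_[^._]*)(\.[^_]*)/\1\2\/\1/')

printf '%s\n%s\n' "$bre" "$ere"
```

Both commands should print the same rewritten URL; only the escaping of the parentheses changes.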

The group (GCA_.[^.]*) means "match GCA_ and everything after it up to, but not including, the next dot":

$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\2/' <<<"$f1"
GCA_900169985

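Similarly, (.[^_]*) matches everything up to, but not including, the next _ character. The leading . is unescaped, so it matches any single character (here, the literal dot). A negated character class like this is the usual way to emulate a non-greedy (lazy) match in sed, which has no lazy quantifiers; in Perl regex the same thing could be written as .*? followed by _.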

$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\3/' <<<"$f1"
.1
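
As a rough illustration of the greedy versus negated-class behaviour described above, the sketch below runs both forms on the file-name part of the URL. The variable names name, greedy and first are made up for this example.

```shell
name='900169985.1_IonXpress_024_genomic.fna.gz'

# Greedy: (.*) grabs as much as it can, so the match runs to the LAST underscore
greedy=$(printf '%s\n' "$name" | sed -E 's/^(.*)_.*$/\1/')

# Negated class: ([^_]*) cannot cross an underscore, so it stops at the FIRST one
first=$(printf '%s\n' "$name" | sed -E 's/^([^_]*)_.*$/\1/')

printf '%s\n%s\n' "$greedy" "$first"
# prints:
# 900169985.1_IonXpress_024
# 900169985.1
```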
George Vasiliou

Short sed approach:

s="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz"
sed -E 's/(GCA_[^._]+)\.([^_]+)/\1.\2\/\1/' <<< "$s"

The output:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
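
If there are many URLs, the same expression can be used as a filter over several lines. The following is only a sketch: the function name insert_accession_dir and the second, non-matching URL are invented for illustration. Lines that contain no GCA_ accession should pass through unchanged.

```shell
# Hypothetical helper: reads lines on stdin, rewrites those containing GCA_<id>.<version>
insert_accession_dir() {
  sed -E 's/(GCA_[^._]+)\.([^_]+)/\1.\2\/\1/'
}

# One real-looking URL from the question and one made-up line with no accession
result=$(printf '%s\n' \
  'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz' \
  'ftp://example.org/no_match_here.txt' | insert_accession_dir)

printf '%s\n' "$result"
```

With a list stored in a file, the same function could be fed with a redirect, for example insert_accession_dir < urls.txt, where urls.txt is an assumed file name.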
RomanPerekhrest