I have a few hundred .txt files in a directory that have the following format:

<DOC>
<DOCNO> 33 </DOCNO>
<SOURCE> URL v.01 </SOURCE>
<URL> www.url.com/extension.html </URL>
<DATE> 2019/12/29/ </DATE>
<TIME>  </TIME>
<AUTHOR>  </AUTHOR>
<HEADLINE>
        The title is here 
</HEADLINE>
<TEXT>
        Text that I want
</TEXT>
</DOC>

I would like to modify every file so that it contains only the text between the <TEXT> and </TEXT> tags (i.e. "Text that I want").

I have tried the following command, but it only prints the matching lines (tags included) and leaves the files unchanged:

find /root/Desktop/data/data -type f | xargs sed -n '/<TEXT/,/<\/TEXT/p'

How can I do this using a bash script (preferably using sed)?

KaanTheGuru

2 Answers

You want to remove everything but the text between the TEXT tags from your files, right? With GNU sed you can do that in place:

find /root/Desktop/data/data -type f -execdir sed -i '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d' {} +
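
Because -i rewrites the files, it may be worth previewing the result first. The following is a minimal sketch, assuming GNU sed (the 0,/regex/ address is a GNU extension); the scratch directory and sample.txt are made up for illustration, and -i is left out so the output goes to the terminal instead of the file:

```shell
#!/bin/bash
# Hypothetical scratch directory and file, used only for this dry run.
demo_dir=$(mktemp -d)
cat > "$demo_dir/sample.txt" <<'EOF'
<DOC>
<DOCNO> 33 </DOCNO>
<HEADLINE>
        The title is here
</HEADLINE>
<TEXT>
        Text that I want
</TEXT>
</DOC>
EOF

# 0,/<TEXT>/d            deletes from the top of the file through the <TEXT> line.
# /<\/TEXT>/,/<TEXT>/d   deletes from </TEXT> through the next <TEXT> line,
#                        or to the end of the file if there is none.
result=$(sed '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d' "$demo_dir/sample.txt")
printf '%s\n' "$result"

rm -rf "$demo_dir"
```

Only the lines between the tags are printed, with their original indentation. Once the preview looks right, the find command with -i applies the same script to every file.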
oguz ismail

If there is at most one pair of the tags you are looking for and you don't want newline characters in the text, this prints the text of each file:

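#!/bin/bash

for file in /root/Desktop/data/data/*.txt; do
  # Join the lines, then keep only what sits between the tags; the leading
  # and trailing .* discard the rest of the document.
  text=$(tr -d '\n' < "$file" | sed -nE 's/.*<TEXT>[[:space:]]*(.*)<\/TEXT>.*/\1/p')
  printf '%s\n' "$text"
done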
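
The loop above only prints the text. To overwrite each file instead, as the question asks, one option is to capture the text in a variable and write it back. This is a rough sketch rather than a tested production script: the scratch directory stands in for /root/Desktop/data/data, sample.txt is a made-up file, and the leading and trailing .* in the sed expression are there to drop everything outside the tags.

```shell
#!/bin/bash
# Hypothetical scratch directory standing in for the real data directory.
data_dir=$(mktemp -d)
printf '%s\n' '<DOC>' '<HEADLINE>' '        The title is here' '</HEADLINE>' \
  '<TEXT>' '        Text that I want' '</TEXT>' '</DOC>' > "$data_dir/sample.txt"

for file in "$data_dir"/*.txt; do
  # Read the whole file first, so it is safe to overwrite it afterwards.
  text=$(tr -d '\n' < "$file" | sed -nE 's/.*<TEXT>[[:space:]]*(.*)<\/TEXT>.*/\1/p')
  # Overwrite only when a TEXT block was found, so other files are left alone.
  if [ -n "$text" ]; then
    printf '%s\n' "$text" > "$file"
  fi
done

result=$(cat "$data_dir/sample.txt")
printf '%s\n' "$result"

rm -rf "$data_dir"
```

Running this on a copy of the data first is a sensible precaution, since the original contents are not kept.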
Tyler Marshall