Get specific text between a certain tag in all files in a directory

Question

I have a few hundred .txt files in a directory that have the following format:

<DOC>
<DOCNO> 33 </DOCNO>
<SOURCE> URL v.01 </SOURCE>
<URL> www.url.com/extension.html </URL>
<DATE> 2019/12/29/ </DATE>
<TIME>  </TIME>
<AUTHOR>  </AUTHOR>
<HEADLINE>
        The title is here 
</HEADLINE>
<TEXT>
        Text that I want
</TEXT>
</DOC>

I would like to modify every file so that it contains only the text between the <TEXT> and </TEXT> tags (i.e., Text that I want).

I have tried the following code but it does not seem to do what I need:

find /root/Desktop/data/data -type f | xargs sed -n '/<TEXT/,/<\/TEXT/p'

How can I do this using a bash script (preferably using sed)?

Yes, that is correct. For some reason they were not showing as a part of plaintext. @kabanus — KaanTheGuru, Nov 19 '18 at 19:38
That is because you can embed HTML in your post so `<>` should always be in a code block. — kabanus, Nov 19 '18 at 19:40
Works for me. Did you try running `find` and making sure you actually get a hit with the tags? — kabanus, Nov 19 '18 at 19:46
Also, consider dropping `xargs` for a pure `find` solution `-execdir sed -n '/ — kabanus, Nov 19 '18 at 19:47
Thank you for that! Unfortunately, the tags are still in place, and the changes are not made on the files. — KaanTheGuru, Nov 19 '18 at 19:52
Possible duplicate of [How to select lines between two patterns?](https://stackoverflow.com/questions/38972736/how-to-select-lines-between-two-patterns) — Wiktor Stribiżew, Nov 19 '18 at 19:53

oguz ismail · Accepted Answer · 2018-11-19T20:21:37.797

2

You want to remove everything but the text between TEXT tags from your files, right? This is how you do that.

find /root/Desktop/data/data -type f -execdir sed -i '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d' {} +
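To see what each half of that sed script does without touching real data, here is a small sketch that runs it on a scratch file. It assumes GNU sed (the `0,/regexp/` address form and `-i` without a backup suffix are GNU extensions), and the temporary directory and the `sample.txt` name are made up for the demo.

```shell
#!/bin/bash
# Demo only: build a throwaway file shaped like the one in the question.
demo_dir=$(mktemp -d)
cat > "$demo_dir/sample.txt" <<'EOF'
<DOC>
<DOCNO> 33 </DOCNO>
<HEADLINE>
        The title is here
</HEADLINE>
<TEXT>
        Text that I want
</TEXT>
</DOC>
EOF

# 0,/<TEXT>/d           deletes from the start of the file through the <TEXT> line
# /<\/TEXT>/,/<TEXT>/d  deletes from </TEXT> through the next <TEXT> line, or to EOF
sed -i '0,/<TEXT>/d;/<\/TEXT>/,/<TEXT>/d' "$demo_dir/sample.txt"

# Only the line between the tags should be left.
cat "$demo_dir/sample.txt"
```

Trying the command on a copy like this first is a cheap safeguard, since `-i` overwrites the files and keeps no backup unless a suffix is given (for example `-i.bak`).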

edited Nov 19 '18 at 20:21

answered Nov 19 '18 at 19:54



That's it! Thanks Oguz! – KaanTheGuru Nov 19 '18 at 19:57
Won't this leave the last `</TEXT>` and everything after it in the file, since there won't be a matching opening `<TEXT>`? – Tyler Marshall Nov 19 '18 at 20:02

@TylerMarshall nope. sed processes its input line by line, so once the second range has started at a line matching `/<\/TEXT>/`, it deletes everything until it either finds a line matching `/<TEXT>/` or hits EOF. – oguz ismail Nov 19 '18 at 20:10

Tyler Marshall · Answer 2 · 2018-11-19T20:13:00.277

1

If there is at most one pair of the tags you are looking for and you don't want newline characters in the text:

#!/bin/bash

for file in /root/Desktop/data/data/*.txt; do
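  # Join the lines, then print only the text between the tags;
  # the leading and trailing .* drop everything outside them.
  printf '%s\n' "$(tr -d '\n' < "$file" | sed -nE 's/.*<TEXT>(.*)<\/TEXT>.*/\1/p')"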
done
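The loop above prints the extracted text rather than changing the files. If the files themselves should be rewritten, as the question asks, one possible sketch is below. The function name `extract_text_in_place` and the demo file are hypothetical, and the pattern is padded with `.*` on both sides so that only the captured text survives.

```shell
#!/bin/bash
# Hypothetical helper: replace a file's contents with the text between its TEXT tags.
extract_text_in_place() {
  local file=$1 text
  # The whole file is read into $text before anything is written back,
  # so redirecting to the same file afterwards is safe.
  text=$(tr -d '\n' < "$file" | sed -nE 's/.*<TEXT>(.*)<\/TEXT>.*/\1/p')
  # Leave files without a TEXT pair (or with an empty one) untouched.
  [ -n "$text" ] || return 0
  printf '%s\n' "$text" > "$file"
}

# Demo on a throwaway file, not on real data.
demo_file=$(mktemp)
printf '<DOC>\n<TEXT>\n        Text that I want\n</TEXT>\n</DOC>\n' > "$demo_file"
extract_text_in_place "$demo_file"
cat "$demo_file"
```

To apply it to the real directory, the function could be called inside the same `for file in /root/Desktop/data/data/*.txt` loop, ideally after making a backup copy.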

edited Nov 19 '18 at 20:13

answered Nov 19 '18 at 19:57

