I have this .xml file:

<docs>
<doc>
Some text
</doc>
<doc>
here some
</doc>
<doc>
text here
</doc>
</docs>

I am trying to use csplit in order to get only the text parts. This is what I came up with.

$ csplit docs.xml '%^<docs>%1' '/^<\/doc/1' '{*}'
imre

1 Answer

If the file is structured like the one you included, you can extract the content with grep -v "^<" x or, more conveniently, with cat x | sed -e 's/<[^>]*>//g' | grep -v '^$'. To do it the csplit way, based on the comments below, you can do it like this:

cat doc.xml | egrep -v '<?xml version="1.0" \?>|<docs>|</docs>' | csplit -q -z - '/<doc/' '{*}' --prefix=out-
Saddam Abu Ghaida
  • This works, but csplit creates different files with the content between the tags, right? Cat just prints it to the terminal. Any way to get this functionality to your approach? – imre Feb 12 '14 at 14:01
  • You can redirect the whole output to a file like this: cat x | sed -e 's/<[^>]*>//g' | grep -v '^$' > output.txt – Saddam Abu Ghaida Feb 12 '14 at 14:02
  • The point of doing this is having multiple text files (docs), each with the content between the tags. So the first one would contain "Some text", the second one "here some" and the third one "text here". Is that possible? – imre Feb 12 '14 at 14:17
  • If you want to use csplit, it will split the XML itself, not the body of the XML, e.g. csplit x '/^/' '{*}'. You then filter the resulting files using sed, grep, etc. – Saddam Abu Ghaida Feb 12 '14 at 14:29
  • OK, say I just wanted to create text files (via csplit), each containing the respective text between the tags. Please, could you take a look at what I came up with and tell me where the mistake is? It's telling me it couldn't find "". – imre Feb 12 '14 at 14:32
  • If you want to do it like this, execute this command: cat doc.xml | egrep -v '||' | csplit -q -z - '/ – Saddam Abu Ghaida Feb 12 '14 at 14:55
  • This gives me an error saying `csplit: illegal option -- q` – imre Feb 12 '14 at 15:11
  • For me it works fine, but you can remove -q: it only enables quiet mode, so that nothing is shown on the screen. – Saddam Abu Ghaida Feb 12 '14 at 15:13
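
The goal stated in the comments (one file per <doc> block, containing only the text between the tags) can also be reached without csplit. The `illegal option -- q` error suggests a csplit that is not the GNU one, so options such as -q, -z and --prefix may not be available there. Below is a minimal sketch using awk instead. The function name split_docs and the default out- prefix are made up for illustration, and it assumes every tag sits alone on its own line, as in the sample file.

```shell
#!/bin/sh
# split_docs FILE [PREFIX]: hypothetical helper, not part of csplit.
# Writes the lines between each <doc> and </doc> pair to PREFIX00, PREFIX01, ...
# Assumption: every tag is on a line of its own, as in the sample docs.xml.
split_docs() {
  awk -v prefix="${2:-out-}" '
    /^<doc>/   { file = sprintf("%s%02d", prefix, n++); next }  # pick the next output file
    /^<\/doc>/ { close(file); file = ""; next }                 # end of this document
    file != "" { print > file }                                 # copy body lines only
  ' "$1"
}
```

Calling split_docs docs.xml out- on the sample file should leave three files, out-00, out-01 and out-02, each holding one line of text. The <docs> and </docs> lines are skipped because no output file is open when they are read.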