I have a single file that contains several XML documents concatenated together, each with its own XML declaration and DOCTYPE declaration. I am trying to use csplit to break the file into one chunk per document, but I am running into an error. Here is a sample of the file I am working with:

<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">

<pmc-articleset><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
  <?properties open_access?>
  <front>
    <p>
    Apple
    </p>
  </front>
</article>
</pmc-articleset>
<?xml version="1.0" ?>
<!DOCTYPE pmc-articleset PUBLIC "-//NLM//DTD ARTICLE SET 2.0//EN" "https://dtd.nlm.nih.gov/ncbi/pmc/articleset/nlm-articleset-2.0.dtd">
<pmc-articleset><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
  <?properties open_access?>
  <front>
    <p>
    Banana
    </p>
  </front>
</article>
</pmc-articleset>

Here is my command:

csplit -z --prefix output_file --suffix-format '%02d.xml' handSurgery.xml '/^<[?]xml[ ]/' '{*}'

Here is the error:

csplit: illegal option -- z

Any solution would be appreciated. Thank you!

  • I don't know csplit, but you're aware, I hope, that there is no regular expression that will detect an XML declaration in the middle of a file with 100% reliability? It's easy to bury something that looks like an XML declaration in a comment or CDATA section, and it's not at all unlikely that this will be done unintentionally. This is therefore a very poor choice of file format. – Michael Kay Nov 11 '20 at 17:48
  • Unfortunately, this is the only file format I have to work with. I need to find a way to parse it, and I figured breaking it up would be the best way. – justin viola Nov 11 '20 at 19:06
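
For illustration, here is a minimal Python sketch of the splitting step described above. It does not depend on which options the installed csplit accepts. It assumes the file is UTF-8 and that every document starts with an XML declaration at the beginning of a line, the same assumption the csplit pattern makes. The function names `split_concatenated_xml` and `write_chunks` are hypothetical, and the output names simply mirror the prefix and suffix format from the command in the question. As noted in the first comment, a declaration-like string inside a comment or CDATA section would also match, so this is not fully reliable.

```python
import re
from pathlib import Path

# Same idea as the csplit pattern in the question: an XML declaration
# that begins a line. Text that merely looks like a declaration inside
# a comment or CDATA section would match too.
XML_DECL = re.compile(r"^<\?xml\s", re.MULTILINE)


def split_concatenated_xml(text):
    """Return one string per document, cutting before each declaration."""
    starts = [m.start() for m in XML_DECL.finditer(text)]
    if not starts:
        # No declaration found: return the text as one chunk, or nothing if blank.
        return [text] if text.strip() else []
    chunks = []
    # Keep any non-blank content that precedes the first declaration.
    if text[:starts[0]].strip():
        chunks.append(text[:starts[0]])
    # Each chunk runs from one declaration up to the next (or to end of text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunks.append(text[begin:end])
    return chunks


def write_chunks(source, prefix="output_file"):
    """Write each chunk to <prefix>NN.xml and return the paths written."""
    # Assumption: the input is UTF-8 encoded.
    text = Path(source).read_text(encoding="utf-8")
    paths = []
    for i, chunk in enumerate(split_concatenated_xml(text)):
        out = Path(f"{prefix}{i:02d}.xml")
        out.write_text(chunk, encoding="utf-8")
        paths.append(out)
    return paths
```

Calling `write_chunks("handSurgery.xml")` on the sample above should write `output_file00.xml` and `output_file01.xml` to the current directory, each of which can then be handed to an ordinary XML parser.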
