Is there a Linux command line utility to remove sections (not sure if that's the correct term) from an XML file?

Question

I am trying to do some manipulation of an XMLTV format file that contains TV schedule information. Within the file are sections that look like this:

  <programme start="20141215220000 -0500" stop="20141216060000 -0500" channel="someid.someaddress.com">
    <title lang="en">Local Programming</title>
    <length units="hours">1</length>
    <episode-num system="common">S00E00</episode-num>
    <episode-num system="dd_progid">SH00019112.0000</episode-num>
    <previously-shown />
  </programme>

As you can see the second line contains this:

    <title lang="en">Local Programming</title>

What I would like to find is some kind of command line utility that runs in Linux, that can look for that specific line and if it exists, remove everything between and including the programme tags. I am not very familiar with XML files so I don't know if there is a specific name for a block of data such as this, but I just want to remove that entire section whenever the title is "Local Programming".

It would actually work better for my purposes if I could remove the block only when the title is "Local Programming" AND the channel value in the first line is a certain specific value, since I only need to remove these for a specific channel, but it would not hurt anything to remove all of the "Local Programming" blocks on any channel, and to look for two values would probably make this a much more difficult problem. It has to be a command line utility because it will be called from a short shell script.

Basically I'm just trying to identify the best tool for the job. I'm not a programmer (unless you count making a bash shell script of a few lines, that just runs several things sequentially, as programming) so I'd like to stick with an existing command line tool if possible, but I'm not averse to pulling in something new with apt-get either. Any suggestions?

EDIT: What worked was the xmlstarlet tool suggested by Charles Duffy, but only if I did not attempt to use the --var option and instead specified the values directly. For example, this removed all blocks with the title "Local Programming" from a file xmltv.xml:

xmlstarlet ed --delete "//programme[title='Local Programming']" <xmltv.xml >newfile.xml

And if I want to remove the block only when the title is "Local Programming" AND the channel value in the first line is a certain specific value, then it appears that this works:

xmlstarlet ed --delete "//programme[title='Local Programming'][@channel='someid.someaddress.com']" <xmltv.xml >newfile.xml

This is exactly what I was looking for, so I consider the problem solved. Thank you to all who replied.

This is quite straightforward, but your specification of how to decide what to remove strikes me as less clear than ideal. Do you want to remove any title, whatever its language? Only English-language titles? Only inside that one program? Inside any program? Please try to specify your problem unambiguously with a minimum of superfluous text. – Charles Duffy, Dec 17 '14 at 00:20
Charles Duffy: I'm not sure how I could possibly make it less ambiguous, but here goes: I want to remove every "programme" block that contains the specific line "Local Programming" in whatever XML file I specify. It is an XML file, not a program, that I am trying to modify, and there are no variations on that line - if the "title" line is anything else then I don't want to remove the "programme" block. I hope that clears it up for you. – Skyviewer, Dec 17 '14 at 01:03
So -- the version I gave removed programs where _any_ title is `Local Programming`; based on the clarification, it sounds like you only want to remove programs where the _English_ title is `Local Programming`. I'll amend my answer appropriately. — Charles Duffy, Dec 17 '14 at 01:40
I used the word "program" to refer to a TV `programme` -- in reference to the problem domain. Perhaps "channel" would be the better term, but that has a specific meaning in your schema already. Anyhow -- thanks again for the clarification; it was helpful (requiring parsing through less text to find the important parts than is the case for the question as originally written). — Charles Duffy, Dec 17 '14 at 01:53
Why are people voting to close this? Making programmatic edits is very much in the domain of programming. If this were a question of how to do the same operation using a human-driven editor, that would be a very different matter. — Charles Duffy, Dec 17 '14 at 14:09
Editing answers into questions is considered poor form here. If an answer resolves your question, click the "Accept" box for it. If your problem was solved some other way, add that as your own answer -- and, after a delay, you'll be allowed to accept it yourself. — Charles Duffy, Dec 17 '14 at 19:16
Charles Duffy: If you are the owner of this service, then you should probably delete my account because you have too many stupid rules, and eventually I will probably go off on you because of this stupidity. If you are NOT the owner then it's none of your doggone business how I choose to post what worked for me. I don't do well in forums where there is overbearing moderation and too many idiotic rules (especially ones that are in direct conflict with what is considered acceptable in other forums) so unless you are the site owner I would appreciate it if you would keep this to yourself. — Skyviewer, Dec 19 '14 at 00:04
The rules are a matter of public debate and consensus -- you're welcome to head over to http://meta.stackoverflow.com/ and help to shape them yourself. That said, while I often _do_ get a bit rules-lawyer-y, I don't see where I'm doing so in this current thread. (Also -- high-rep users get a subset of moderator powers; it's how the site works. Thus, I'm one among the many folks who _do_ help in its care and feeding, though it's the folks with the diamonds by their names who wield the really big sticks). — Charles Duffy, Dec 19 '14 at 00:05
A bit? You THINK? You complained about (and removed) my change to add SOLVED in the subject line, something encouraged and appreciated in many other forums. Then you get on me about editing my first post to say what worked, something also encouraged in other places. No, I am not going to debate it with you, I will just ignore you, and if I do something that really offends the site owner(s) then they can identify themselves as such and maybe I will take that into consideration, but MY idea of "poor form" is you getting on my case about something this trivial. — Skyviewer, Dec 19 '14 at 00:13
"high-rep users get a subset of moderator powers; it's how the site works." Well, as we can see here, some people abuse those powers to beat up on new users. Maybe that's a flaw in the way the site works. — Skyviewer, Dec 19 '14 at 00:15
Removing "SOLVED" is by consensus -- go look on meta; there's firm agreement there. Same for everything else. If you don't like the rules, you're welcome to join the community that sets them and have that discussion in the appropriate place. Likewise, if you think I've been heavy-handed, go file a post on meta asking for review; if the community there -- including the elected moderators -- thinks I'm out-of-line, they'll act appropriately, and I'll honor any corrections in my behavior thus suggested. — Charles Duffy, Dec 19 '14 at 00:15
I don't have the time or inclination to debate the site rules; and I doubt that would be in the slightest bit appreciated since I just got an account here. I can accept that they don't want "SOLVED", seems stupid, but okay. But if someone is going to nitpick and complain about little things that make sense, like putting the solution up at the top so people don't have to wade through a wall of text to read it, then I am simply not going to respect that. Accept it or kick me out, I really don't care at this point, but I do not suffer foolishness gladly. — Skyviewer, Dec 19 '14 at 00:23
Clicking the checkbox by your selected answer *does* bring it up to the top; it also adds highlighting to the question in the list showing that it has a successful answer. See http://meta.stackexchange.com/questions/86278/detect-edits-to-add-solved-or-resolved-to-the-title-and-direct-the-user-to-a as a relevant proposal. — Charles Duffy, Dec 19 '14 at 00:26
Alternately, see http://meta.stackexchange.com/questions/116101/is-it-ok-to-add-solved-to-the-title-of-a-question -- discussing whether it is acceptable to put SOLVED in a question title. — Charles Duffy, Dec 19 '14 at 00:27
Do you really not understand that I do not care about this whatsoever? You want to make a big deal out of it and I'm telling you that in this case I do not respect your opinion and honestly don't care about this. Just this little exchange has totally soured me on this site. If you deleted my post and banned me from the site it would not matter because there are many other help sites that do not have the proverbial pickle up their posterior about stupid stuff like this. This whole conversation is ridiculous and I am done with it. Sorry I ever made an account here. — Skyviewer, Dec 19 '14 at 00:58

Charles Duffy · Answer 1 · 2014-12-17T14:13:03.583

5

To delete any program having both the English-language title Local Programming and the channel someid.someaddress.com:

xmlstarlet ed \
  --var chan "'someid.someaddress.com'" \
  --var name "'Local Programming'" \
  --delete '//programme[title[@lang="en"]=$name][@channel=$chan]' \
  <in.xml >out.xml && mv out.xml in.xml

If you're targeting an older XMLStarlet release, you may need to do the substitutions yourself -- using "Local Programming" in place of $name and "someid.someaddress.com" in place of $chan -- but the above is known to work against the 1.5.0 release.

This requires the tool XMLStarlet, which should be available for installation in your distribution vendor's repository.

Note that you didn't show your document's namespace declarations -- if xmlns='...' has been specified in a parent, some adjustment may be called for.
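
If XMLStarlet can't be installed, the same deletion can be sketched with Python's standard library. This is only a rough equivalent of the XPath above, not something taken from XMLStarlet itself: the function name remove_programmes, the <tv> wrapper in the sample, the second channel id and the "Evening News" title are all made up for illustration, and the sketch assumes the file declares no default namespace.

```python
import xml.etree.ElementTree as ET


def remove_programmes(root, title, channel=None, lang="en"):
    """Remove <programme> elements that have a <title lang=...> equal to
    `title` and, when `channel` is given, a matching channel attribute.
    Returns the number of elements removed."""
    removed = 0
    # Snapshot both lists so removing children does not disturb iteration.
    for parent in list(root.iter()):
        for prog in list(parent.findall("programme")):
            if channel is not None and prog.get("channel") != channel:
                continue
            titles = [t.text for t in prog.findall("title") if t.get("lang") == lang]
            if title in titles:
                parent.remove(prog)
                removed += 1
    return removed


# Hypothetical sample data; only the first block mirrors the question.
SAMPLE = """<tv>
  <programme start="20141215220000 -0500" stop="20141216060000 -0500" channel="someid.someaddress.com">
    <title lang="en">Local Programming</title>
  </programme>
  <programme start="20141215220000 -0500" stop="20141216060000 -0500" channel="other.example.com">
    <title lang="en">Local Programming</title>
  </programme>
  <programme start="20141216060000 -0500" stop="20141216070000 -0500" channel="someid.someaddress.com">
    <title lang="en">Evening News</title>
  </programme>
</tv>"""

root = ET.fromstring(SAMPLE)
removed = remove_programmes(root, "Local Programming", "someid.someaddress.com")
remaining = [(p.get("channel"), p.findtext("title")) for p in root.findall("programme")]
print(removed, remaining)
```

For a real file, ET.parse("xmltv.xml") followed by tree.write("newfile.xml", encoding="utf-8", xml_declaration=True) would take the place of the inline sample. Be aware that ElementTree does not keep a DOCTYPE line when writing, and that with a default namespace the tag names would need the {namespace-uri} prefix.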

edited Dec 17 '14 at 14:13

answered Dec 17 '14 at 00:22

Charles Duffy


You might also want to look at the xsltproc utility. – Mike Dec 17 '14 at 00:41
This looks ideal for the purpose, thank you for the suggestions! I thought there was probably a command line utility suited for this task and I have never heard of xmlstarlet before (or xsltproc) so I will definitely check both of those out. I can't access the system I want to run this on until later this evening but as soon as I can I will give this a try. – Skyviewer Dec 17 '14 at 01:08
Incidentally, `xmlstarlet sel` (a different subcommand) can generate XSLT suitable for use with xsltproc, so you can use xmlstarlet on your development machine to generate scripts you can then run with xsltproc (if that's already installed on your target machines). – Charles Duffy Dec 17 '14 at 01:51
The original suggestion you posted here, for which I thanked you above, actually does work. Unfortunately you then edited it to use the --var option and at least on my system this does NOT work. I get an error message: **failed to load external entity "chan"** and if I remove the references to chan it then complains about name in the same way: **failed to load external entity "name"** – Skyviewer Dec 17 '14 at 10:20
@Skyviewer: I've added some notes about adaptation for versions of XMLStarlet older than 1.5.0 (which this was developed and tested against), and against which I've validated it to continue to work. – Charles Duffy Dec 17 '14 at 14:13

score 2 · Answer 2 · answered Dec 17 '14 at 08:39

In addition to the proper XML handling, as exemplified in the other answer, one can always resort to the old-fashioned way: by handling the XML as plain text. In Perl:

cat fancy.xml |
perl -ne 'BEGIN{$/=undef;} print grep { /^<programme/ ? !m{<title\s+lang="en">Local\s+Programming</title>} : 1 } split qr{(<programme.*?</programme>)}s'

That reads the whole input XML at once (by undefining the input record separator), splits it into a flat list of programme blocks and the text between them (the split()), and then filters out the programme blocks that contain the sought string (the grep()).
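
For readers who find the Perl hard to follow, the same split-and-filter idea can be written out in Python. This is a loose translation rather than the one-liner above, with the same weakness: it treats the XML as plain text. The function name and the sample channels "a" and "b" are invented for the example.

```python
import re

# A capturing group makes re.split() keep the matched blocks in its result,
# just as the parentheses do in the Perl split().
PROGRAMME_BLOCK = re.compile(r"(<programme.*?</programme>)", re.DOTALL)
LOCAL_TITLE = re.compile(r'<title\s+lang="en">Local\s+Programming</title>')


def drop_local_programming(text):
    """Return `text` without the programme blocks whose English title
    is "Local Programming"; everything else is passed through untouched."""
    pieces = PROGRAMME_BLOCK.split(text)
    kept = [
        piece
        for piece in pieces
        if not (piece.startswith("<programme") and LOCAL_TITLE.search(piece))
    ]
    return "".join(kept)


# Hypothetical input for illustration.
sample = (
    "<tv>\n"
    '  <programme channel="a">\n'
    '    <title lang="en">Local Programming</title>\n'
    "  </programme>\n"
    '  <programme channel="b">\n'
    '    <title lang="en">Evening News</title>\n'
    "  </programme>\n"
    "</tv>\n"
)
result = drop_local_programming(sample)
print(result)
```

Because the match is purely textual, it depends on the title being written exactly this way (attribute order, quoting, no entities), which is the usual trade-off against a real XML parser.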

Dummy00001


I want to thank you for this, because had the xmlstarlet tool not worked I would have tried this. But as I stated above, I am not a programmer, and what I have found out about perl is that while it can perform a huge number of tasks, the code is almost totally indecipherable to most people (and I'm one of them). I can see a definite advantage to using perl, in that it is available for just about every platform, but given a choice between using a command I can sort of understand and a scripting language that's about as understandable to me as Greek hieroglyphics, I'll prefer the former. – Skyviewer Dec 17 '14 at 11:01
I gave that only as an example, or an idea, for the case when the XML utilities are not readily available (or you are bogged down by the `XPath` quirks). In the past I did something similar in the `VIM` editor using its built-in script facility, and picked Perl for the example more or less arbitrarily. If one can tweak the `sed`'s `$IFS`, it could be done even with `sed`. – Dummy00001 Dec 17 '14 at 12:55
