I have XML files containing scrolling lyrics for karaoke songs that we are acquiring from another company. I need to remove every <pg> tag that contains a multiline phrase like:

8
BAR
INSTRUMENTAL
BREAK

They are always on their own separate page within a <pg> tag. The company told us that the words common to every such page are BAR and BREAK. Matching on both should (hopefully) keep actual lyrics in the remaining page tags from being deleted. There may be multiple instances of these tags throughout the XML as well. I need to find and delete all of them.

I’m able to select the opening <pg and all the code up until the next opening <pg one at a time with this regex in Notepad++:

(<pg)(.+?)(?=<pg)

Is there a way to extend the regex above so that it matches only the full <pg> tags that contain both BAR and BREAK, and deletes every such tag in a file? I could then switch to Find in Files for a bulk search and replace.
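For illustration, here is a minimal sketch of the kind of pattern being asked for, written in Python rather than Notepad++. It assumes the real files use straight double quotes and that the marker words appear as `s="BAR "` and `s="BREAK "` attributes; the function name `strip_break_pages` is hypothetical.

```python
import re

# Tempered dot: any character, as long as we are not at the page's own </pg>.
# This keeps every part of the match inside a single <pg> ... </pg> block.
_INSIDE = r'(?:(?!</pg>).)*?'

# One <pg> block that contains both an s="BAR " and an s="BREAK " attribute.
# The two lookaheads make the order of the words irrelevant.
PG_WITH_BAR_AND_BREAK = re.compile(
    r'<pg\b'
    r'(?=' + _INSIDE + r's="BAR\s*")'
    r'(?=' + _INSIDE + r's="BREAK\s*")'
    + _INSIDE + r'</pg>\s*',
    re.DOTALL,  # let "." cross line breaks, like ". matches newline" in Notepad++
)

def strip_break_pages(xml_text: str) -> str:
    """Return xml_text with every BAR/BREAK page removed."""
    return PG_WITH_BAR_AND_BREAK.sub("", xml_text)
```

Pages that contain only one of the two words, or neither, are left untouched because one of the lookaheads fails before the closing `</pg>` is reached.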

Below is an example of three consecutive <pg> tags. I need the second complete tag found and deleted, and then the search should continue, deleting any further matching <pg> tags until it reaches the end of the file (rinse and repeat).

I have about 24 files to test with, and 7000 to follow. I’m hoping that BAR and BREAK are always the common denominator between the <pg> tags that need to be removed.

Thank you so much for any help and advice.

<pg id="lyrics.16" t="157.09,15.88">
<ln>
<lyr s="I’M " t="161.28,.24"/>
<lyr s="ON " t="161.52,.43"/>
<lyr s="MY " t="161.95,.37"/>
<lyr s="OWN " t="162.32,1.05"/>
</ln>
<ln>
<lyr s="I’M " t="164.57,.26"/>
<lyr s="ON " t="164.83,.42"/>
<lyr s="MY " t="165.25,.43"/>
<lyr s="OWN " t="165.68,1.07"/>
</ln>
<ln>
<lyr s="I’M " t="167.91,.24"/>
<lyr s="ON " t="168.15,.38"/>
<lyr s="MY " t="168.53,.42"/>
<lyr s="OWN " t="168.95,.62"/>
</ln>
<ln>
<lyr s="NO " t="169.57,.48"/>
<lyr s="NO " t="170.05,.19"/>
<lyr s="NO " t="170.24,.41"/>
<lyr s="NO " t="170.65,.43"/>
<lyr s="NO " t="171.08,.56"/>
</ln>
<ln>
<lyr s="YEAH " t="171.64,.23"/>
<lyr s="EH " t="171.87,.42"/>
<lyr s="YEAH " t="172.29,.58"/>
</ln>
</pg>
<pg id="lyrics.17" t="172.97,7.93">
<ln>
<lyr s="8 " t="174.16,.21"/>
<lyr s="BAR " t="174.37,.24"/>
</ln>
<ln>
<lyr s="INSTRUMENTAL " t="174.61,4.52"/>
</ln>
<ln>
<lyr s="BREAK " t="179.13,1.67"/>
</ln>
</pg>

<pg id="lyrics.18" t="180.9,9.72">
<count c="pt.1" t="184.92,1.27" n="4"/>
<ln>
<lyr s="WOAH " t="186.55,.25"/>
<lyr s="OH " t="186.8,.39"/>
<lyr s="WOAH " t="187.19,.41"/>
</ln>
<ln>
<lyr s="I " t="187.6,.21"/>
<lyr s="CAN’T " t="187.81,.38"/>
<lyr s="LET " t="188.19,.28"/>
<lyr s="YOU " t="188.47,.38"/>
<lyr s="GO " t="188.85,.6"/>
</ln>
<ln>
<lyr s="MY " t="189.45,.44"/>
<lyr s="LITTLE " t="189.89,.6"/>
<lyr s="GIRL " t="190.49,.03"/>
</ln>
</pg>

I'm unable to create the additional part of the Notepad++ search needed and I'm asking for advice.

jeffmic
  • I'm happy to help but your question is a bit confusing. Simplify it with expected input and output – Steve Tomlin Apr 01 '23 at 16:16
  • Regex is not the right tool to process XML. [You can't parse xml with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Toto Apr 01 '23 at 17:20
  • It’s a messy solution (just like my question was), but this is working on the 30 files I tested, plus none of the actual lyrics are removed by mistake. `(?s-i:<pg(?:(?!</pg>).)*?BAR(?:(?!</pg>).)*?BREAK(?:(?!</pg>).)*?</pg>)\r\t` – jeffmic Apr 01 '23 at 18:06
  • Please [edit] the question to add extra information. Please do not add code (or Regexs etc) in comments as that makes them hard to read and makes it hard for you to explain them and format them. – AdrianHHH Apr 01 '23 at 20:16
  • You were unclear. Do you want <pg>s deleted that contain **both** "bar" and "break" or **either** "bar" or "break"? – Chris Maurer Apr 01 '23 at 21:40
  • Both bar and break in the same tag. Uppercase – jeffmic Apr 01 '23 at 21:54
  • It is better to use XSLT for the task. Notepad++ has **XML Tools** plugin for that. – Yitzhak Khabinsky Apr 02 '23 at 01:03
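As the comments suggest, a parser-based approach avoids regex altogether. Below is a minimal sketch using Python's standard `xml.etree.ElementTree` instead of XSLT (a swapped-in technique, not what the commenters proposed verbatim). It assumes each file is well-formed XML with a single root element; the function name `remove_break_pages` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Both words must appear on the same page for it to be removed.
MARKERS = {"BAR", "BREAK"}

def remove_break_pages(root: ET.Element) -> int:
    """Remove every <pg> whose <lyr s="..."> words include BAR and BREAK.

    Returns the number of pages removed.
    """
    removed = 0
    # Snapshot the element list first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for pg in list(parent.findall("pg")):
            # Collect the words on this page, ignoring trailing spaces.
            words = {(lyr.get("s") or "").strip() for lyr in pg.iter("lyr")}
            if MARKERS <= words:  # subset test: both markers present
                parent.remove(pg)
                removed += 1
    return removed
```

A file could then be processed with `tree = ET.parse(path)`, `remove_break_pages(tree.getroot())`, and `tree.write(path, encoding="utf-8")`. Note that ElementTree may normalise details such as attribute quoting when writing, so it is worth comparing a few outputs by hand before running a bulk job.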

1 Answer

I recommend not relying on guesses about what each page contains; instead, do it in steps (each line below is a search pattern, replaced with nothing):

  1. Remove the entries you are sure you won't need
    <lyr s="(8|BAR|INSTRUMENTAL|BREAK) " t="[\d.,]+"/> -> nothing

  2. This leaves some <ln>s empty; remove them
    <ln>\s*</ln> -> nothing

  3. This leaves some <pg>s empty; remove them
    <pg[^>]*>\s*</pg> -> nothing
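The three steps above can be sketched outside Notepad++ as well. This is a rough Python equivalent, assuming the real files use straight double quotes; the names `STEPS` and `clean` are hypothetical.

```python
import re

# The three search patterns from the steps above; each is replaced with nothing.
STEPS = [
    # 1. Marker entries that are certainly not lyrics.
    re.compile(r'<lyr s="(?:8|BAR|INSTRUMENTAL|BREAK) " t="[\d.,]+"/>'),
    # 2. <ln> elements left empty by step 1.
    re.compile(r'<ln>\s*</ln>'),
    # 3. <pg> elements left empty by step 2.
    re.compile(r'<pg[^>]*>\s*</pg>'),
]

def clean(xml_text: str) -> str:
    """Apply the three removals in order and return the result."""
    for pattern in STEPS:
        xml_text = pattern.sub("", xml_text)
    return xml_text
```

Because step 1 matches the words wherever they occur, a real lyric consisting of exactly `BREAK ` or `8 ` would also be removed, and a break of a different length (say `4 ` or `16 `) would leave its page in place. Test on a few files before running it in bulk.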

Dimava