
I'm using TextWrangler to extract specific information from an XML file. I need to find the file names it contains and output only those file names.

An excerpt of the XML is below:

<file id="file_1045280">
    <name>SKY_A026C032_150707_R4RO.mov</name>
    <pathurl>file://localhost/M:/FPL_MEDIA/04_MEZZANINE/SKY/SKY-EP03/SKY-0312_20150707_AA_A026/SKY_A026C032_150707_R4RO.mov</pathurl>
    <duration>1796</duration>
    <timecode>
        <rate>
            <ntsc>false</ntsc>
            <timebase>25</timebase>
        </rate>
        <frame>0</frame>
        <displayformat>NDF</displayformat>
    </timecode>
    <media>
        <video>
            <duration>1796</duration>
            <samplecharacteristics>
                <width>1920</width>
                <height>1080</height>
            </samplecharacteristics>
        </video>
    </media>
</file>
                            <sourcetrack>
                                <mediatype>video</mediatype>
                            </sourcetrack>
                            <link>
                                <linkclipref>clipItem_1045280</linkclipref>
                                <mediatype>video</mediatype>
                                <trackindex>1</trackindex>
                            </link>
                        </clipitem>
                        <enabled>TRUE</enabled>
                        <locked>FALSE</locked>
                    </track>
                </video>
            </media>
        </clip>
        <clip id="clip_1045282">
            <name>SKY_A026C018_150707_R4RO</name>
            <duration>958</duration>
            <rate>
                <ntsc>false</ntsc>
                <timebase>25</timebase>
            </rate>
            <in>-1</in>
            <out>-1</out>
            <masterclipid>clip_1045282</masterclipid>
            <ismasterclip>TRUE</ismasterclip>
            <media>
                <video>
                    <track>
                        <clipitem id="clipitem_1045282">
                            <name>SKY_A026C018_150707_R4RO</name>
                            <duration>958</duration>
                            <masterclipid>clip_1045282</masterclipid>
                            <rate>
                                <ntsc>false</ntsc>
                                <timebase>25</timebase>
                            </rate>
                            <in>0</in>
                            <out>958</out>
                            <start>0</start>
                            <end>958</end>
<file id="file_1045282">
    <name>SKY_A026C018_150707_R4RO.mov</name>
    <pathurl>file://localhost/M:/FPL_MEDIA/04_MEZZANINE/SKY/SKY-EP03/SKY-0312_20150707_AA_A026/SKY_A026C018_150707_R4RO.mov</pathurl>
    <duration>958</duration>
    <timecode>
        <rate>
            <ntsc>false</ntsc>
            <timebase>25</timebase>
        </rate>
        <frame>0</frame>
        <displayformat>NDF</displayformat>
    </timecode>
    <media>
        <video>
            <duration>958</duration>
            <samplecharacteristics>
                <width>1920</width>
                <height>1080</height>
            </samplecharacteristics>
        </video>
    </media>
</file>
                            <sourcetrack>
                                <mediatype>video</mediatype>
                            </sourcetrack>
                            <link>
                                <linkclipref>clipItem_1045282</linkclipref>
                                <mediatype>video</mediatype>
                                <trackindex>1</trackindex>
                            </link>
                        </clipitem>
                        <enabled>TRUE</enabled>
                        <locked>FALSE</locked>
                    </track>
                </video>
            </media>
        </clip>
        <clip id="clip_1045283">
            <name>SKY_A026C033_150707_R4RO</name>
            <duration>1202</duration>
            <rate>
                <ntsc>false</ntsc>
                <timebase>25</timebase>
            </rate>
            <in>-1</in>
            <out>-1</out>
            <masterclipid>clip_1045283</masterclipid>
            <ismasterclip>TRUE</ismasterclip>
            <media>
                <video>
                    <track>
                        <clipitem id="clipitem_1045283">
                            <name>SKY_A026C033_150707_R4RO</name>
                            <duration>1202</duration>
                            <masterclipid>clip_1045283</masterclipid>
                            <rate>
                                <ntsc>false</ntsc>
                                <timebase>25</timebase>
                            </rate>
                            <in>0</in>
                            <out>1202</out>
                            <start>0</start>
                            <end>1202</end>

At the moment, I am using the following Grep:

.*?(\<name\>)(.*)(.mov).*

This manages to find the strings that I need. However, I need to replace all of the remaining text with nothing, so that I'm left with only a list of file names.

Can anyone advise how I may go about this?

Thanks in advance, Matt

2 Answers


Using TextWrangler, a quick way is to first use Text -> Process Lines Containing... to search for <name>.+\.mov</name> with both Grep and Copy to new document checked.
The resulting file can then be cleaned up by searching for something along the lines of ^\s*<name>(.+\.mov)</name>\s*$ and replacing it with \1, again with Grep checked.
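
For readers without TextWrangler at hand, the same two steps (keep only the lines holding a `.mov` name, then strip the tags) can be sketched in Python. This is only an illustration, assuming each `<name>` element sits on its own line as in the question's sample; the `extract_mov_names` helper and the shortened sample string are made up for the example.

```python
import re

# Same pattern as the clean-up step: optional indentation, the <name> tag,
# a captured file name ending in .mov, the closing tag, optional trailing space.
NAME_LINE = re.compile(r"^\s*<name>(.+\.mov)</name>\s*$")


def extract_mov_names(xml_text):
    """Return the .mov names found on their own <name> lines (hypothetical helper)."""
    names = []
    for line in xml_text.splitlines():
        match = NAME_LINE.match(line)
        if match:
            names.append(match.group(1))
    return names


# Made-up, shortened sample in the shape of the question's XML.
sample = """\
<clip id="clip_1045282">
    <name>SKY_A026C018_150707_R4RO</name>
    <file id="file_1045282">
        <name>SKY_A026C018_150707_R4RO.mov</name>
        <duration>958</duration>
    </file>
</clip>
"""

print("\n".join(extract_mov_names(sample)))
```

Because the pattern requires `.mov` right before `</name>`, the clip name without an extension is skipped and only the file name is kept.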

Abecee

How about this? There's a bit of overlap, but it means:

"match everything, as if it's a single line, that
[comes after </name> and before <name>], or
[is between the beginning and <name>], or
[is the <name> or </name> tag itself]."

(?ms)(?<=<\/name>)(.*?)(?=<name>)|(^.*?<name>)|(<.?name>)

https://regex101.com/r/vV4xZ6/2
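
To see what this pattern removes, here is a small Python sketch that applies it unchanged to a made-up sample. The sample text and the choice to replace each match with a newline are assumptions for illustration; TextWrangler's Grep engine may differ in details from Python's `re` module.

```python
import re

# The answer's pattern, copied unchanged. (?ms) lets "." span newlines and
# "^" match at the start of every line.
PATTERN = re.compile(r"(?ms)(?<=<\/name>)(.*?)(?=<name>)|(^.*?<name>)|(<.?name>)")

# Made-up sample: a clip name followed by a file name, as in the question.
sample = (
    "<clip>\n"
    "  <name>CLIP_A</name>\n"
    "  <file>\n"
    "    <name>CLIP_A.mov</name>\n"
    "  </file>\n"
    "</clip>\n"
)

# Replace every match with a newline so the surviving names stay on separate lines.
result = PATTERN.sub("\n", sample)
leftover = [line.strip() for line in result.splitlines() if line.strip()]
print(leftover)
```

In this sketch the contents of every `<name>` element survive, with or without a `.mov` extension, and the markup after the last `</name>` is left in place because no further `<name>` follows it.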
ergonaut
  • That's very close @ergonaut, thank you. Would there be a way to only list the instances of the file name that end in .mov? So the instance of the file name after the tag, but not the tag? That way, the actual file name is only listed once per file. – matttickner Oct 14 '15 at 21:06
  • It's quite complex; the best solution would be to use an actual parser. – ergonaut Oct 14 '15 at 22:07
  • I'm afraid I don't know what that means? Essentially, I just want to find the results of that Grep you've done that end in .mov. I have tried modifying your expression to reflect that, but it always finds anything contained between the tags. Is there any way to modify it so that its lookahead/lookbehind means that the string between the tags has to end in .mov to remain after the replace? – matttickner Oct 14 '15 at 22:14
  • Or, finding the string after the final forward slash before e.g. /SKY_A026C032_150707_R4RO.mov would give the same result? – matttickner Oct 14 '15 at 22:16
  • Sorry, I don't have an answer for you. However, my answer is the negative case of what you had originally. It might be easier to take the positive case and extract those matches into a separate file. – ergonaut Oct 15 '15 at 02:21
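
For readers wondering what "an actual parser" could look like in practice, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. It assumes the complete, well-formed export is available (the excerpt in the question is only a fragment), so the sample below is a cut-down, made-up document with a hypothetical `<project>` root and a shortened `pathurl`. The helpers `mov_names` and `names_from_pathurl` are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample in the shape of the question's XML.
sample = """\
<project>
  <clip id="clip_1045282">
    <name>SKY_A026C018_150707_R4RO</name>
    <file id="file_1045282">
      <name>SKY_A026C018_150707_R4RO.mov</name>
      <pathurl>file://localhost/M:/FPL_MEDIA/SKY_A026C018_150707_R4RO.mov</pathurl>
    </file>
  </clip>
</project>
"""


def mov_names(root):
    """Collect the <name> of each <file> element ending in .mov, once per name."""
    seen = []
    for file_elem in root.iter("file"):
        name = (file_elem.findtext("name") or "").strip()
        if name.endswith(".mov") and name not in seen:
            seen.append(name)
    return seen


def names_from_pathurl(root):
    """Alternative: keep whatever follows the final slash of each <pathurl>."""
    return [p.text.rsplit("/", 1)[-1] for p in root.iter("pathurl") if p.text]


root = ET.fromstring(sample)
print("\n".join(mov_names(root)))
```

Reading only the `<name>` children of `<file>` elements skips the clip and clipitem names, so each file name is listed once. The `pathurl` variant corresponds to the idea of taking the string after the final forward slash.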