1

I have been searching for a solution for a very simple task : filter XML results based on multiple criteria and save it as a new XML file. By filtering I mean, to select the values for the output. So, only output the XML whom meet the conditions of the filter. Not just these values of the filter, that doesn't make sense, but all elements of the XML where it matches the filter. Those who don't meet the filter criteria should be removed from the XML. Basically, the filter should remove data from the input and save it to a new and smaller output XML.

But until now I did not manage to find any solution for this. I did see a lot of writing about various tools, such as XMLstarlet (is this abandonned?), XMLint, XSLT and Xidel. But honestly I have no idea where to start. Somehow most questions are about extracting one value out of the XML instead of filtering it - by throwing away the elements that don't match the filter - and make a new, more compact version, of the XML.

This is what I want :

  • Filter XML on specific values - if it contains this value, use the whole output of all elements of the XML that matches that filter.

Take this single XML entry of the an example XML.

<item>
<g_id>5e4e8249-fb12-43e6-8f7e-ccef1b242097</g_id>
<g_title>A Bathing Ape Court Sta Beige</g_title>
<g_description>A Bathing Ape Court Sta Beige</g_description>
<g_google_product_category>Apparel & Accessories > Shoes</g_google_product_category>
<g_condition>new</g_condition>
<g_availability>in stock</g_availability>
<g_price>425.00 EUR</g_price>
<g_gtin>747883771947</g_gtin>
<g_brand>BAPE</g_brand>
<g_identifier_exists>TRUE</g_identifier_exists>
<g_gender>male</g_gender>
<g_age_group>adult</g_age_group>
<g_color>Beige/Light Brown/Blue</g_color>
<g_size>8</g_size>
<g_is_bundle>FALSE</g_is_bundle>
<g_adult>FALSE</g_adult>
<g_custom_label_0>sneakers</g_custom_label_0>
<g_custom_label_1>Other Brands A Bathing Ape</g_custom_label_1>
<g_custom_label_2>0ZXSHM191X30GBGK</g_custom_label_2>
</item>

Imagine the XML has 10.000 entries / records and the g_custom_label_0 can contain different values. Also g_custom_label_2 could be empty. Now I want this entire XML - with those 10.000 entries - to be filtered on these conditions :

g_custom_label_0 = sneakers g_custom_label_3 = not empty

How can I do this using a bash script and save the filtered results to a new XML file?

And if possible... there is a relation between the g_gtin and g_custom_label_2 (many:one). Is it also possible to only save in the new XML the first unique g_custom_label_2? –

The XML file is just too big (200MB - compressed with GZ), I only want to use the relevant stuff, this could be a great way to reduce the XML file size.

Help is really appreaciated!

Mark

PS. I would prefer I could use a bash script after downloading the original XML to filter it and save to a new one.

--- UPDATE ---

Based on the suggestions XSLT should be the way to go. I created this template :


<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="@*|item()">
        <xsl:copy>
            <xsl:apply-templates select="@*|item()" />
        </xsl:copy>
    </xsl:template>
    <xsl:template match="item[g_custom_label_0 = 'sneakers']" />
</xsl:stylesheet>

Just try to get the filtering on the g_custom_label_0 right first. Unfortunately after executing this command :

xsltproc --output output.xml template.xslt test.xml

I get an error :

error xsltCompileIdKeyPattern : expecting 'key' or 'id' or node type compilation error: file template.xslt line 2 element template xsltCompilePattern : failed to compile 'item()'

The original XML looks like this :

<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
    <channel>
        <title>Some title</title>
        <link></link>
        <description>Some description</description>
    <item>
        <g:id>74cce63d-1523-460f-b59f-4a42a94be350</g:id>
        <g:title>100 Thieves 2019 Hoodie Cream</g:title>
        <g:description>Released in late 2019, this 100 Thieves 2019 Hoodie Cream is one of the hottest pieces of gaming merch out there. The clean, all white hoodie features a stitched 100 Thieves logo on the chest in black. This hoodie released on November 23, 2019, and sold out immediately upon it's release, generating considerable interest on StockX. It carried an original retail price of $90. Shop now by placing a Bid. Looking to get your gaming setup in check? Shop the coveted PS5 console &lt;a href= "  https://stockx.com/sony-ps5-playstation-5-blu-ray-edition-console-white">here&lt;/a> on StockX.</g:description>
        <g:google_product_category>Apparel &amp; Accessories</g:google_product_category>
        <g:condition>new</g:condition>
        <g:availability>in stock</g:availability>
        <g:price>179.00 GBP</g:price>
        <g:gtin>765208957340</g:gtin>
        <g:brand>100 Thieves</g:brand>
        <g:age_group>adult</g:age_group>
        <g:color>Cream</g:color>
        <g:size>M</g:size>
        <g:size_system>US</g:size_system>
        <g:shipping>
            <g:country>GB</g:country>
            <g:price>11.41 GBP</g:price>
        </g:shipping>
        <g:is_bundle>FALSE</g:is_bundle>
        <g:custom_label_0>streetwear</g:custom_label_0>
        <g:custom_label_1>Other Brands 100 Thieves</g:custom_label_1>
    </item>
    <item>
        <g:id>6576ba88-a874-48b2-a8a9-e4b82db0ac64</g:id>
        <g:title>100% Soft Dumpster Fire Kawaii Galaxy Trash Vinyl Figure</g:title>
        <g:google_product_category>Arts &amp; Entertainment > Hobbies &amp; Creative Arts > Collectibles</g:google_product_category>
        <g:condition>new</g:condition>
        <g:availability>in stock</g:availability>
        <g:price>178.00 GBP</g:price>
        <g:gtin>747883533613</g:gtin>
        <g:brand>100% Soft</g:brand>
        <g:gender>unisex</g:gender>
        <g:age_group>adult</g:age_group>
        <g:size_system>US</g:size_system>
        <g:item_group_id>6060eb7c-a3aa-4b52-a4a3-834b5da1ba22</g:item_group_id>
        <g:shipping>
            <g:country>GB</g:country>
            <g:price>11.41 GBP</g:price>
        </g:shipping>
        <g:is_bundle>FALSE</g:is_bundle>
        <g:custom_label_0>collectibles</g:custom_label_0>
        <g:custom_label_1>Other Artists Other</g:custom_label_1>
    </item>
    <item>
        <g:id>c3885e7e-7cfc-426c-922c-c7fde307da73</g:id>
        <g:title>100% Soft Dumpster Fire Figure Chrome</g:title>

--- UPDATE 2 ---

Made some (bad) progress here. I adjusted the template to this :

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="item[g_custom_label_0('sneaker')]" />
</xsl:stylesheet>

and got a file that has no XML markers in it anymore. Totally wrong. Don't do this..


<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="item[g_custom_label_0 = 'sneakers']">
  <xsl:copy>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>
</xsl:stylesheet>

Leads to the same results.. pfff..

--- UPDATE 3 ---

The suggestion made by Yiztak was perfect in getting rid of the elements that do not match the "sneakers" criteria. Thanks for that.

He proposed the following XSLT template :


<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:g="http://base.google.com/ns/1.0">
    <xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!--remove items where g:custom_label_0 != 'sneakers'-->
    <xsl:template match="item[not(g:custom_label_0='sneakers')]"/>
</xsl:stylesheet>

Next step is to remove duplicates on the g_custom_label_3 values (not present in the XML examples) and remove empty or missing values of the g_custom_label_3. I will do some experiments on this.

<xsl:if test=..."> would be the next step. Have no clue how to combine with previous statements. Any further ideas?

So g:custom_label_3 should be present and have a value to be in the newly exported XML.

  • It is a task for XSLT with parameters. XSLT should use a so called Identity Transform. You can call XSLT transformation from PowerShell. – Yitzhak Khabinsky Aug 26 '22 at 20:25
  • If you want to use XSLT for this then read up on XSLT and try, if you really fail show us your attempt and how it failed. Xidel does support XQuery, no? Also easy with XQuery, if you want to do it with XQuery read up on XQuery and give it a try, if you really fail, show us how and how it failed. – Martin Honnen Aug 26 '22 at 20:28
  • @MartinHonnen check the update... any ideas? – Mark van den Ing Aug 26 '22 at 21:06
  • What does "filter on" mean for you, to select those elements for the output or to remove them? Your attempt at an empty template with e.g. `` is fine to eliminate those `item` element having a child `g_custom_label_0 = 'sneakers'`. For copying the rest you would just set up the identity transformation template https://en.wikipedia.org/wiki/Identity_transform#Using_XSLT. And using `item` to select or match an element named `item` is the right syntax, your other attempt with `item()` is nonsense and therefore gives you an error. – Martin Honnen Aug 26 '22 at 21:47
  • @MartinHonnen clear point. I edited the question to make it more sound. Yes, with filtering I mean to select those elements for the output, or to remove the elements that does not meet the filter criteria. I don't want to eliminate the items that match the criteria, I want to keep them, the others should be removed. Guess I was thinking the wrong way here.. I will check out the Identity Transform. But, honestly, this should not be that complicated right? Thanks for your help so far.. – Mark van den Ing Aug 26 '22 at 21:58
  • So you want certain `item` elements in the output? Only those? Or do you want to copy the `channel`, `title`, `description` as well? What would the result look like, still have the `rss` root and the `channel` etc? Or just a new root containing the `item` elements you want? – Martin Honnen Aug 26 '22 at 22:14
  • @MartinHonnen, for additional questions you need to ask a separate brand new question. – Yitzhak Khabinsky Aug 26 '22 at 22:37

3 Answers3

0

This

match="item[g_custom_label_0('sneaker')]"

is not valid XSLT 1.0 syntax. I think you want

match="item[g_custom_label_0 = 'sneakers'] 

I don't know where you got the idea for your attempt, but XSLT is not a language you can learn by trial and error.

Michael Kay
  • 156,231
  • 11
  • 92
  • 164
  • Thanks, I updated it in the update above. Unfortunately still a error : xsltCompileIdKeyPattern : expecting 'key' or 'id' or node type compilation error: file template.xslt line 2 element template xsltCompilePattern : failed to compile 'item()'. As for my trial / error efforts, I am completely new to XSLT and I was suggested to show my (failing) attempts. – Mark van den Ing Aug 26 '22 at 21:25
  • I think it should be an easy task to find something about this, but I just can't find a simple tutorial how to filter a XML file based on some filter criteria and export it to a new file. – Mark van den Ing Aug 26 '22 at 21:26
0

Please try the following solution.

Notable points:

  • Input XML has a namespace. It should be declared and used in the XSLT too.
  • 2nd template is eliminating what is not needed via [not(...)] predicate.

Input XML

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
    <channel>
        <title>Some title</title>
        <link></link>
        <description>Some description</description>
        <item>
            <g:id>74cce63d-1523-460f-b59f-4a42a94be350</g:id>
            <g:title>100 Thieves 2019 Hoodie Cream</g:title>
            <g:description>Released in late 2019, this 100 Thieves 2019 Hoodie Cream is one of the hottest pieces of gaming merch out there. The clean, all white hoodie features a stitched 100 Thieves logo on the chest in black. This hoodie released on November 23, 2019, and sold out immediately upon it's release, generating considerable interest on StockX. It carried an original retail price of $90. Shop now by placing a Bid. Looking to get your gaming setup in check? Shop the coveted PS5 console &lt;a href= "  https://stockx.com/sony-ps5-playstation-5-blu-ray-edition-console-white">here&lt;/a> on StockX.</g:description>
            <g:google_product_category>Apparel &amp; Accessories</g:google_product_category>
            <g:condition>new</g:condition>
            <g:availability>in stock</g:availability>
            <g:price>179.00 GBP</g:price>
            <g:gtin>765208957340</g:gtin>
            <g:brand>100 Thieves</g:brand>
            <g:age_group>adult</g:age_group>
            <g:color>Cream</g:color>
            <g:size>M</g:size>
            <g:size_system>US</g:size_system>
            <g:shipping>
                <g:country>GB</g:country>
                <g:price>11.41 GBP</g:price>
            </g:shipping>
            <g:is_bundle>FALSE</g:is_bundle>
            <g:custom_label_0>streetwear</g:custom_label_0>
            <g:custom_label_1>Other Brands 100 Thieves</g:custom_label_1>
        </item>
        <item>
            <g:id>6576ba88-a874-48b2-a8a9-e4b82db0ac64</g:id>
            <g:title>100% Soft Dumpster Fire Kawaii Galaxy Trash Vinyl Figure</g:title>
            <g:google_product_category>Arts &amp; Entertainment > Hobbies &amp; Creative Arts > Collectibles</g:google_product_category>
            <g:condition>new</g:condition>
            <g:availability>in stock</g:availability>
            <g:price>178.00 GBP</g:price>
            <g:gtin>747883533613</g:gtin>
            <g:brand>100% Soft</g:brand>
            <g:gender>unisex</g:gender>
            <g:age_group>adult</g:age_group>
            <g:size_system>US</g:size_system>
            <g:item_group_id>6060eb7c-a3aa-4b52-a4a3-834b5da1ba22</g:item_group_id>
            <g:shipping>
                <g:country>GB</g:country>
                <g:price>11.41 GBP</g:price>
            </g:shipping>
            <g:is_bundle>FALSE</g:is_bundle>
            <g:custom_label_0>sneakers</g:custom_label_0>
            <g:custom_label_1>Other Artists Other</g:custom_label_1>
        </item>
    </channel>
</rss>

XSLT

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:g="http://base.google.com/ns/1.0">
    <xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!--remove items where g:custom_label_0 != 'sneakers'-->
    <xsl:template match="item[not(g:custom_label_0='sneakers')]"/>
</xsl:stylesheet>

Output XML

<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <channel>
    <title>Some title</title>
    <link/>
    <description>Some description</description>
    <item>
      <g:id>6576ba88-a874-48b2-a8a9-e4b82db0ac64</g:id>
      <g:title>100% Soft Dumpster Fire Kawaii Galaxy Trash Vinyl Figure</g:title>
      <g:google_product_category>Arts &amp; Entertainment &gt; Hobbies &amp; Creative Arts &gt; Collectibles</g:google_product_category>
      <g:condition>new</g:condition>
      <g:availability>in stock</g:availability>
      <g:price>178.00 GBP</g:price>
      <g:gtin>747883533613</g:gtin>
      <g:brand>100% Soft</g:brand>
      <g:gender>unisex</g:gender>
      <g:age_group>adult</g:age_group>
      <g:size_system>US</g:size_system>
      <g:item_group_id>6060eb7c-a3aa-4b52-a4a3-834b5da1ba22</g:item_group_id>
      <g:shipping>
        <g:country>GB</g:country>
        <g:price>11.41 GBP</g:price>
      </g:shipping>
      <g:is_bundle>FALSE</g:is_bundle>
      <g:custom_label_0>sneakers</g:custom_label_0>
      <g:custom_label_1>Other Artists Other</g:custom_label_1>
    </item>
  </channel>
</rss>
Yitzhak Khabinsky
  • 18,471
  • 2
  • 15
  • 21
  • Perfect. That did a lot of the trick. I now have a XML only containing the elements where the g:custom_label_0 matches with "sneakers". Next step is to filter on a string being set in the g:custom_label_2 (not shown in example) and remove the elements that have a duplicate in the g:custom_label_2 – Mark van den Ing Aug 26 '22 at 22:23
0

g_custom_label_0 = sneakers g_custom_label_3 = not empty

With xidel you could use x-replace-nodes:

$ xidel -s input.xml -e '
  x:replace-nodes(
    //item[not(g:custom_label_0 = "sneakers" and g:custom_label_3 != "")],
    ()
  )
' --output-format=xml --output-node-indent

Is it also possible to only save in the new XML the first unique g_custom_label_2?

Is this what you mean?

$ xidel -s input.xml -e '
  (//item[g:custom_label_2])[1]
' --output-format=xml --output-node-indent
Reino
  • 3,203
  • 1
  • 13
  • 21