Generate XML format from TXT file

Question

I have the input txt file below and I'm trying to generate the XML file below. I'm trying to do it with awk, but I think I'm reinventing the wheel. How do you suggest I do it? Thanks

Input txt file (sample, this input could be bigger)

Usw 1:1 Desktop
Usw 1:2 Netbooks
Usw 1:3 Servers, mainframes and supercomputers
Usw 1:4 Smart devices
Usw 1:5 Embedded devices
Usw 1:6 Gaming
Usw 1:7 Specialized uses
Usw 2:1 Precursors
Usw 2:2 Creation
Usw 2:5 Naming
Usw 2:6 Commercial and popular uptake
Usw 2:9 Current development
Des 1:1 User interface
Des 1:2 Video input infrastructure
Des 1:3 Hardware
Des 2:1 Community
Des 2:2 Programming on Linux

xml file desired

<?xml version="1.0" encoding="utf-8"?>

<XMLRT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SomeSchema.xsd" bename="The name" status="v" version="1.4" revision="1" type="x-rt">
<INTRO>
    <title>Some title</title>
    <creator>
    </creator>
    <subject>Some subject</subject>
    <description>Some description</description>
    <date>2010-05-12</date>
    <type>Some text</type>
</INTRO>
<RTBLOCK bname="Usw" bnumber="1" bsname="1U">
    <CTR cnumber="1">
    <ES vnumber="1">Desktop</ES>
    <ES vnumber="2">Netbooks</ES>
    <ES vnumber="3">Servers, mainframes and supercomputers</ES>
    <ES vnumber="4">Smart devices</ES>
    <ES vnumber="5">Embedded devices</ES>
    <ES vnumber="6">Gaming</ES>
    <ES vnumber="7">Specialized uses</ES>
    </CTR>
    <CTR cnumber="2">
    <ES vnumber="1">Precursors</ES>
    <ES vnumber="2">Creation</ES>
    <ES vnumber="5">Naming</ES>
    <ES vnumber="6">Commercial and popular uptake</ES>
    <ES vnumber="9">Current development</ES>
    </CTR>
</RTBLOCK>
<RTBLOCK bname="Des" bnumber="1" bsname="1D">
    <CTR cnumber="1">
    <ES vnumber="1">User interface</ES>
    <ES vnumber="2">Video input infrastructure</ES>
    <ES vnumber="3">Hardware</ES>
    </CTR>
    <CTR cnumber="2">
    <ES vnumber="1">Community</ES>
    <ES vnumber="2">Programming on Linux</ES>
    </CTR>
</RTBLOCK>
</XMLRT>

Please don't post work requests here. In its current form, this post is not a question, this is a task assignment. At the very least, post your current code and describe where you are stuck. Read [ask]. — Tomalak, Apr 28 '18 at 17:13
That being said, awk is a bad tool choice for this task. Use an XML-aware tool. For example Python to parse the text file and the `lxml` module to generate the XML tree. — Tomalak, Apr 28 '18 at 17:15
Thank you Tomalak for the answer. Actually I'm not a student, this is not a task assignment. I have this kind of file and am looking for help choosing a better tool. — Ger Cas, Apr 28 '18 at 17:16
I thought AWK was a bad choice, but I use it because it is the tool I know a little bit. — Ger Cas, Apr 28 '18 at 17:22
What I mean by "task assignment" is: You try to assign a task *to us*. *"I have X and I need Y"* is not a question. It's what a boss would give to an employee, and this is not how Stack Overflow works. Also, *"I want to keep using the wrong tool because I know it a bit"* is not a good excuse. Learn how to use the right tool. — Tomalak, Apr 28 '18 at 17:25
The goal is that you add some code of your own to your question to show at least the research effort you made to solve this yourself. — Cyrus, Apr 28 '18 at 17:51
This answer might help to fill an empty XML file with content: https://stackoverflow.com/a/48061566/3776858 — Cyrus, Apr 28 '18 at 18:17
You can use the TXR tool; see [kaz's answer](https://stackoverflow.com/a/43440063/4767343) — Jose Ricardo Bustos M., Apr 28 '18 at 18:33
Tomalak, I'm not your boss. You answer if you want, nobody is forcing you. It's not an excuse to use awk. I'm not a professional programmer like many of you on this site. I know a bit of something and try to look for suggestions; if you look, I asked for a suggestion about a better tool, not for somebody to write the complete script for me. Don't assume things you're not sure of. — Ger Cas, Apr 28 '18 at 18:44
Thanks Jose Ricardo and Cyrus, I'll take a look at what you shared. — Ger Cas, Apr 28 '18 at 18:47
I think you misunderstood @Tomalak's comments. He's not chastising you - he's advising you on how to get help with your question. Of course, to paraphrase your own comment - you can take the advice if you want, nobody is forcing you. See [ask] for more information. Oh, and awk would be perfectly fine for this task - using awk to **parse** XML is the thing that would be questionable but that's not what you're doing, you're generating XML from a simple text file. — Ed Morton, Apr 28 '18 at 18:48
@EdMorton Since awk has no concept of how XML encoding works, it is bound to produce syntactically invalid XML at some point. Simple rule: Use XML-aware tools to consume and produce XML, no exceptions. "Good enough, fingers crossed" is not sensible, especially since XML-aware tools are practically everywhere. — Tomalak, Apr 28 '18 at 20:01
@Tomalak there are no XML tools that come as standard on every UNIX installation. Awk does. An awk script is not bound to produce syntactically invalid XML at some point because you write the script and the XML you're trying to produce is always some small subset of all possible XML constructs so it's usually extremely simple and robust to write a script to generate it, as in this case. — Ed Morton, Apr 28 '18 at 22:33
@EdMorton This is purely academic. Tell me one situation where one doesn't have Python *and* cannot do anything about it. That's exceedingly unlikely and not a basis for using the wrong tool. People thinking *"Ahh, what can happen, and look, it works for my test data"* are the reason why broken UTF-8 characters are still common in this day and age, and why we still see input boxes that forbid the use of "special characters", because they apparently break something, because somebody couldn't be bothered to use the right tools. It makes me sad to see that hand-waved away by s/o with experience. — Tomalak, Apr 29 '18 at 06:50
@Tomalak I've worked on UNIX systems for 30+ years that had neither python nor perl, nor could we install them. The main administrative boxes for the networks of telecoms computers we provided had only standard UNIX tools and so even the UNIX boxes in the labs that we used for testing them were only allowed to have standard UNIX and we could not install any other software. It's not that unusual - you're projecting your own experience. Broken UTF-8 characters - give me a break. The OP has plain text, no reason to complicate things trying to solve a problem he doesn't have. — Ed Morton, Apr 29 '18 at 12:39
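
For comparison, the XML-aware route suggested in the comments above could look roughly like the following Python sketch. It rests on several assumptions: it uses the standard library's `xml.etree.ElementTree` instead of the `lxml` module that was named, the function name `build_xmlrt` is made up for illustration, the `INTRO` block and the root attributes are left out, and `bnumber` is hard-coded to `1` as in the desired output.

```python
import xml.etree.ElementTree as ET


def build_xmlrt(lines):
    """Build an XMLRT tree from lines like 'Usw 1:3 Some text' (hypothetical helper)."""
    root = ET.Element("XMLRT")
    block = ctr = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Field 1: block name, field 2: "cnumber:vnumber", remainder: text
        bname, numbers, text = line.strip().split(None, 2)
        cnum, vnum = numbers.split(":")
        if block is None or block.get("bname") != bname:
            # A new block name starts a new RTBLOCK and resets the current CTR
            block = ET.SubElement(root, "RTBLOCK", bname=bname, bnumber="1",
                                  bsname="1" + bname[0])
            ctr = None
        if ctr is None or ctr.get("cnumber") != cnum:
            ctr = ET.SubElement(block, "CTR", cnumber=cnum)
        ET.SubElement(ctr, "ES", vnumber=vnum).text = text
    return root


sample = ["Usw 1:1 Desktop", "Usw 2:1 Precursors", "Des 1:1 R&D <labs>"]
# encoding="unicode" returns a str; markup characters in the text are escaped
print(ET.tostring(build_xmlrt(sample), encoding="unicode"))
```

The point of the sketch is that the serializer, not the script author, takes care of escaping `&`, `<` and `>` in the text.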

Ed Morton · Accepted Answer · score 6 · 2018-04-28T23:33:29.537

Just to show you don't need an XML-aware tool to generate the specific XML you need for any given purpose, here's one way to do it for your example:

$ cat tst.awk
BEGIN {
    print    "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
    print    ""
    print    "<XMLRT xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"SomeSchema.xsd\" bename=\"The name\" status=\"v\" version=\"1.4\" revision=\"1\" type=\"x-rt\">"
    print    "<INTRO>"
    print    "    <title>Some title</title>"
    print    "    <creator>"
    print    "    </creator>"
    print    "    <subject>Some subject</subject>"
    print    "    <description>Some description</description>"
    print    "    <date>2010-05-12</date>"
    print    "    <type>Some text</type>"
    print    "</INTRO>"

    rtBeg  = "<RTBLOCK bname=\"%s\" bnumber=\"1\" bsname=\"1%s\">\n"
    ctrBeg = "    <CTR cnumber=\"%d\">\n"
    esBody = "    <ES vnumber=\"%d\">%s</ES>\n"
    ctrEnd = "    </CTR>\n"
    rtEnd  = "</RTBLOCK>\n"
    xmlEnd = "</XMLRT>\n"
}
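{
    # Field 1 is the block name, e.g. "Usw"
    bname = $1

    # Field 2 is "<cnumber>:<vnumber>", e.g. "1:3"
    split($2,tmp,/:/)
    cnum = tmp[1]
    vnum = tmp[2]

    # The text is the whole line minus the first two fields
    text = $0
    sub(/([^[:space:]]+[[:space:]]+){2}/,"",text)
}

# New block name: close any open CTR and RTBLOCK, then open a new RTBLOCK
bname != prevBname {
    if (prevCnum  != "") printf ctrEnd
    if (prevBname != "") printf rtEnd
    printf rtBeg, bname, substr(bname,1,1)
    prevCnum = ""
    prevBname = bname
}

# New cnumber within the block: close the open CTR, then open a new one
cnum != prevCnum {
    if (prevCnum != "") printf ctrEnd
    printf ctrBeg, cnum
    prevCnum = cnum
}

# Every input line produces one ES element
{ printf esBody, vnum, text }

# Close whatever is still open, then the root element
END {
    if (prevCnum  != "") printf ctrEnd
    if (prevBname != "") printf rtEnd
    printf xmlEnd
}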


$ awk -f tst.awk file
<?xml version="1.0" encoding="utf-8"?>

<XMLRT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SomeSchema.xsd" bename="The name" status="v" version="1.4" revision="1" type="x-rt">
<INTRO>
    <title>Some title</title>
    <creator>
    </creator>
    <subject>Some subject</subject>
    <description>Some description</description>
    <date>2010-05-12</date>
    <type>Some text</type>
</INTRO>
<RTBLOCK bname="Usw" bnumber="1" bsname="1U">
    <CTR cnumber="1">
    <ES vnumber="1">Desktop</ES>
    <ES vnumber="2">Netbooks</ES>
    <ES vnumber="3">Servers, mainframes and supercomputers</ES>
    <ES vnumber="4">Smart devices</ES>
    <ES vnumber="5">Embedded devices</ES>
    <ES vnumber="6">Gaming</ES>
    <ES vnumber="7">Specialized uses</ES>
    </CTR>
    <CTR cnumber="2">
    <ES vnumber="1">Precursors</ES>
    <ES vnumber="2">Creation</ES>
    <ES vnumber="5">Naming</ES>
    <ES vnumber="6">Commercial and popular uptake</ES>
    <ES vnumber="9">Current development</ES>
    </CTR>
</RTBLOCK>
<RTBLOCK bname="Des" bnumber="1" bsname="1D">
    <CTR cnumber="1">
    <ES vnumber="1">User interface</ES>
    <ES vnumber="2">Video input infrastructure</ES>
    <ES vnumber="3">Hardware</ES>
    </CTR>
    <CTR cnumber="2">
    <ES vnumber="1">Community</ES>
    <ES vnumber="2">Programming on Linux</ES>
    </CTR>
</RTBLOCK>
</XMLRT>

The above will work efficiently, robustly and portably with any POSIX awk in any shell on any UNIX box.
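
One assumption behind the script above is that the text never contains characters that are markup in XML. If it can, `&` and `<` have to be replaced with entities before printing, and `&` has to be replaced first so that the ampersands of freshly inserted entities are not escaped again. The following Python sketch only illustrates that order dependence; the two helper names are hypothetical, while `xml.sax.saxutils.escape` is the standard-library function for the same job.

```python
from xml.sax.saxutils import escape


def escape_by_hand(text):
    """Escape XML text content; '&' is handled first (illustrative helper)."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_wrong_order(text):
    """Deliberately wrong: '&' is escaped last, so '&lt;' turns into '&amp;lt;'."""
    return text.replace("<", "&lt;").replace("&", "&amp;")


raw = "R&D <beta>"
print(escape_by_hand(raw))       # correct order
print(escape_wrong_order(raw))   # double-escaped '<'
print(escape(raw))               # standard-library equivalent of the first call
```

The same ordering applies to a pair of `gsub()` calls in awk.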

edited Apr 28 '18 at 23:33

answered Apr 28 '18 at 22:54

Ed Morton


Breaks for data that contains any of `"`, `<`, `>` and `&`. Also, which guarantee is there that the output actually will be UTF-8 encoded (with any POSIX awk in any shell on any UNIX box)? – Tomalak Apr 29 '18 at 06:57
@Tomalak you can escape these characters, e.g. use `gsub(/&/, "\\&amp;", text); gsub(/</, "\\&lt;", text);` (the others don't need escaping in content). Re UTF-8: all awk implementations (nawk, gawk, mawk) will pass valid UTF-8 input to the output because that's how UTF-8 is designed: you can treat UTF-8 strings as byte sequences without conflict when scanning for ASCII characters. You'd only need UTF-8-aware tools for counting character lengths (as opposed to byte lengths), splitting strings at character boundaries, etc. gawk *will* also honor the locale if it's set to e.g. `LANG=en_US.UTF-8` – imhotap Apr 29 '18 at 11:08
That doesn't change anything about the input file, does it? If the input isn't UTF-8, the output won't be, end of story. And that's a problem when you generate a file that is supposedly `encoding="utf-8"`. – Tomalak Apr 29 '18 at 12:08
@Tomalak you're inventing a problem that the OP simply doesn't have. He's got text input, and wants the same text in the output but wrapped in XML tags, that's all. There's absolutely no indication in the question that he needs any conversion done between the input and the output. Even IF he needs UTF-encoded output and IF his input isn't already UTF-8 encoded, then it'd be trivial to do so but again - there's simply no indication that that's an issue **for this task**. – Ed Morton Apr 29 '18 at 12:45
You are missing my point. If the input is in a single-byte encoding and contains special (non-ASCII) characters, the output of your program will be broken. It's as simple as that. (The "for this task" excuse has about as much virtue as "works on my machine". You're making assumptions on top of a reduced sample.) – Tomalak Apr 29 '18 at 12:49
I'm not missing your point at all, I'm saying that you're imagining a problem that does not exist **for this task** and suggesting a solution to it (installing non-standard tools) that may not be possible while a trivial solution exists with standard UNIX tools **if the problem existed**. – Ed Morton Apr 29 '18 at 12:52
How do you know that the problem does not exist? Do you know the file encodings the OP works with? Do you know what environment variables are set? Have you seen all data that can be in the input file? You are generalizing from **20 lines of sample input**, with the OP themselves saying "could be bigger". I'm pointing out under which real-world, fully expectable conditions your approach will fail, and you say I'm making things up? – Tomalak Apr 29 '18 at 12:56
Right. What you're doing is similar to trying to optimize a solution for performance when there hasn't been any indication that there is a performance problem. You're saying we need to provide a solution that does UTF-8 conversions when there hasn't been any indication that UTF-8 conversion is necessary AND saying that to do so we should use a tool that the OP simply may not have! The OP asked for an awk script (see the question text and tag) that wraps his text in XML tags. that's all. Tell you what - feel free to post a solution and the OP can decide which works best for him/her. – Ed Morton Apr 29 '18 at 13:11
Thanks so much Ed Morton. It works just fine. I see that for an expert in awk this is kind of easy to do. I was doing it with different logic that complicated things a lot, which is why I thought of asking for a suggestion for another tool. – Ger Cas Apr 29 '18 at 20:54
You're welcome and if you ever do need to worry about UTF-8 or any other encodings and can install/use non-standard tools then see the man page for `iconv` and you can probably just preprocess the input to the awk script with that. – Ed Morton Apr 30 '18 at 01:14
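
The preprocessing step mentioned in the last comment amounts to decoding the input with its real encoding and re-encoding it as UTF-8. A minimal Python sketch of that conversion follows. It assumes the source encoding is known in advance (here ISO-8859-1, chosen only for illustration), and the function name `to_utf8` is hypothetical.

```python
def to_utf8(raw, source_encoding):
    """Re-encode raw input bytes as UTF-8; raises if the bytes do not fit the encoding."""
    return raw.decode(source_encoding).encode("utf-8")


# "Café" stored as ISO-8859-1 bytes: the single byte 0xE9 is not valid UTF-8
latin1_bytes = b"Des 1:4 Caf\xe9"

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("input is not valid UTF-8 as it stands")

# After conversion the same text decodes cleanly as UTF-8
print(to_utf8(latin1_bytes, "latin-1").decode("utf-8"))
```

Whether such a step is needed depends entirely on the encoding of the actual input files, which the question does not state.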

score 1 · Answer 2 · answered Apr 29 '18 at 18:47

How do you suggest I do it?

I suggest using an XSLT 2.0+ processor like Saxon by Saxonica to output the desired XML file. Other XSLT 2.0 processors work as well.

The following XSLT 2.0 stylesheet works in three steps:

1. Retrieve the unparsed text into an <xsl:variable>
2. Parse this (plain) text variable with a RegEx via <xsl:analyze-string>
3. Group the resulting flat XML nodes with <xsl:for-each-group>

So the stylesheet could look like this:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
    <xsl:output method="xml" />    
    <xsl:param name="text-encoding" as="xs:string" select="'utf-8'"/>
    <xsl:param name="text-uri"      as="xs:string" select="'file:///home/kubuntu/Downloads/input.txt'"/>

    <xsl:template match="/">
        <XMLRT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SomeSchema.xsd" bename="The name" status="v" version="1.4" revision="1" type="x-rt">
            <!-- Step 1 ### get unparsed text -->
            <xsl:variable name="input-text" select="unparsed-text($text-uri, $text-encoding)"/>
            <!-- Step 2 ### Apply RegEx to every line to create <Line...> elements -->
            <xsl:variable name="xmlStepOne">
                <xsl:for-each select="tokenize($input-text,'&#xa;')">
                    <xsl:if test=".!=''">                  <!-- Skip empty lines -->
                        <xsl:analyze-string select="." regex="([^\s]+)\s([^:]+):([^\s]+)\s(.*)$">
                            <xsl:matching-substring>       <!-- Parse line with RegEx and create <Line...> XML -->
                                <Line str="{regex-group(1)}" idx1="{regex-group(2)}" idx2="{regex-group(3)}"><xsl:value-of select="regex-group(4)"/></Line>
                            </xsl:matching-substring>
                            <xsl:non-matching-substring>   <!-- Output an error if a line cannot be processed -->
                                <xsl:message terminate="yes">Error processing line &#xa;<xsl:value-of select="current()"/>&#xa;</xsl:message>
                            </xsl:non-matching-substring>
                        </xsl:analyze-string>                
                    </xsl:if>
                </xsl:for-each>
            </xsl:variable>
            <!-- Step 3 ### Group the linear flow of <Line...> elements -->
            <xsl:for-each-group select="$xmlStepOne/Line" group-by="@str">
                <RTBLOCK bname="{current-grouping-key()}" bnumber="1" bsname="{concat('1',substring(current-grouping-key(),1,1))}">
                    <xsl:for-each-group select="current-group()" group-by="@idx1">
                        <xsl:sort select="@idx1" />
                        <CTR cnumber="{@idx1}"> 
                            <xsl:for-each select="current-group()">
                                <xsl:sort select="@idx2" />
                                <ES vnumber="{@idx2}"><xsl:value-of select="."/></ES>
                            </xsl:for-each>
                        </CTR>
                    </xsl:for-each-group>
                </RTBLOCK>
            </xsl:for-each-group>
        </XMLRT>
    </xsl:template>

</xsl:stylesheet>

You can set the input filename and encoding with the two parameters at the beginning.

The output from the sample file above is:

<?xml version="1.0" encoding="UTF-8"?>
<XMLRT xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SomeSchema.xsd" bename="The name" status="v" version="1.4" revision="1" type="x-rt">
    <RTBLOCK bname="Usw" bnumber="1" bsname="1U">
        <CTR cnumber="1">
            <ES vnumber="1">Desktop</ES>
            <ES vnumber="2">Netbooks</ES>
            <ES vnumber="3">Servers, mainframes and supercomputers</ES>
            <ES vnumber="4">Smart devices</ES>
            <ES vnumber="5">Embedded devices</ES>
            <ES vnumber="6">Gaming</ES>
            <ES vnumber="7">Specialized uses</ES>
        </CTR>
        <CTR cnumber="2">
            <ES vnumber="1">Precursors</ES>
            <ES vnumber="2">Creation</ES>
            <ES vnumber="5">Naming</ES>
            <ES vnumber="6">Commercial and popular uptake</ES>
            <ES vnumber="9">Current development</ES>
        </CTR>
    </RTBLOCK>
    <RTBLOCK bname="Des" bnumber="1" bsname="1D">
        <CTR cnumber="1">
            <ES vnumber="1">User interface</ES>
            <ES vnumber="2">Video input infrastructure</ES>
            <ES vnumber="3">Hardware</ES>
        </CTR>
        <CTR cnumber="2">
            <ES vnumber="1">Community</ES>
            <ES vnumber="2">Programming on Linux</ES>
        </CTR>
    </RTBLOCK>
</XMLRT>

Another advantage of this approach is that you can handle everything with XML/XSLT, so it is aware of character encodings and everything else that isn't covered by simpler solutions with awk or similar.
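
For readers who do not know XSLT, the parse and group steps of the stylesheet can be sketched in Python as below. This is an illustration under assumptions, not a translation of the stylesheet: the names `LINE_RE`, `parse_lines` and `group_records` are made up, the input is a string literal instead of a file read with an explicit encoding, groups keep first-seen order instead of being sorted, and no XML is produced.

```python
import re

# Same shape as the stylesheet's regex: name, idx1 ":" idx2, then the text
LINE_RE = re.compile(r"(\S+)\s([^:]+):(\S+)\s(.*)")


def parse_lines(text):
    """Step 2: turn each non-empty line into a flat (name, idx1, idx2, text) record."""
    records = []
    for line in text.splitlines():
        if not line:
            continue  # skip empty lines, as the stylesheet does
        match = LINE_RE.fullmatch(line)
        if match is None:
            # Mirrors the terminating xsl:message for unparseable lines
            raise ValueError(f"Error processing line: {line!r}")
        records.append(match.groups())
    return records


def group_records(records):
    """Step 3: group by name, then by idx1, keeping first-seen order."""
    grouped = {}
    for name, idx1, idx2, text in records:
        grouped.setdefault(name, {}).setdefault(idx1, []).append((idx2, text))
    return grouped


# Step 1 would read the file; a literal stands in for its content here
input_text = "Usw 1:1 Desktop\nUsw 1:2 Netbooks\n\nDes 2:1 Community\n"
print(group_records(parse_lines(input_text)))
```

Keeping the flat records separate from the grouping is the same design choice the stylesheet makes with its intermediate `<Line>` elements.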

Thanks zx485, I've tried your solution after installing Saxon .NET, but the command hangs when I run it: `Transform -s:test\input.txt -xsl:test\style.xslt -o:output.txt` — Ger Cas, Apr 29 '18 at 20:22
You do not need to specify `input.txt` as source. Use an empty XML file (here `a.xml`) as source like ` ABCD ` and give the `text-uri` as string parameter: `Transform -s:test/a.xml -xsl:test/style.xslt -o:output.txt text-uri=input.txt`. Then it should work as desired. — zx485, Apr 29 '18 at 22:14
