4

I'm trying to distinguish between several serialized text formats, mainly XBRL, XML, CSV, and JSON.

My plan is to check in steps. If an XML parser parses the document without throwing an exception, then it's a valid XML document, and a further check is needed to tell whether it is regular XML or XBRL.

If the first check fails, try parsing it as CSV. If parsing as CSV throws an exception, try parsing it as JSON. If none of the above works, it's an invalid document.

Would this be an acceptable way of identifying which text format the document is in? Or is there a better way (e.g. reading the first few bytes of the document)?

thanks

sincreadys
    It might be that some data can be interpreted in more than one way (e.g. a single string in double quotes could be valid CSV *and* valid JSON), but in that case there's no "perfect" answer anyway. If it's OK to go with any valid format, then reading a few bytes and ordering your tests accordingly (e.g. lotsa `<` suggests trying XML first) will save time -- just go with the first one that doesn't give an exception. Finally note that there are many "parameters" for CSV (e.g. types of quoting, are embedded newlines allowed etc.) -- IOW a huge variety of slightly incompatible CSV formats. :( – j_random_hacker Sep 08 '15 at 11:48
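    A minimal sketch of the "sniff first, then try parsers in that order" idea, written in JavaScript to match the answer code below. The names `orderCandidates` and `detectFormat` and the `parsers` map are hypothetical: the caller is assumed to supply one function per format that throws on invalid input (for example `JSON.parse` for JSON).

```javascript
// Hypothetical helper: rank the formats to try after a cheap look at the first character.
function orderCandidates(text) {
    var first = text.trim().charAt(0);
    if (first === "<") return ["xml", "json", "csv"];
    if (first === "{" || first === "[") return ["json", "csv", "xml"];
    return ["csv", "json", "xml"];
}

// Try each caller-supplied parser in the suggested order.
// The first parser that does not throw decides the format.
function detectFormat(text, parsers) {
    var order = orderCandidates(text);
    for (var i = 0; i < order.length; i++) {
        var parse = parsers[order[i]];
        if (!parse) continue; // no parser supplied for this format
        try {
            parse(text);
            return order[i];
        } catch (e) {
            // not this format; move on to the next candidate
        }
    }
    return "unknown"; // none of the supplied parsers accepted the text
}
```

    Returning `"unknown"` rather than guessing keeps the "none of the formats" case explicit, so the caller can decide whether to raise an error.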

3 Answers

1

If you know the JSON will be an object or array, and that the content HAS to be one of those four...

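// Sniff the first non-whitespace character (trim also drops a leading byte-order mark).
var first = content.trim().charAt(0);
if(first === "[" || first === "{") {
    // JSON
} else if(first === "<") {
    // Match the namespace URI alone so prefixed declarations (xmlns:xbrli="...") also count.
    // 2003 is the XBRL 2.1 instance namespace; the 2001 URI is kept for older documents.
    if(content.indexOf("http://www.xbrl.org/2003/instance") >= 0 || content.indexOf("http://www.xbrl.org/2001/instance") >= 0) {
        // XBRL
    } else {
        // XML
    }
} else {
    // CSV ?...
    // first remove strings
    var testCSV = content.replace(/""/g, ""); // remove every escaped quote (a string pattern only replaces the first)
    testCSV = testCSV.replace(/"[\s\S]*?"/g, ""); // match-remove quoted strings, including ones that span lines
    var lines = testCSV.trim().split(/\r?\n/);
    var columns = lines[0].split(",").length;
    if(lines.length === 1 && columns > 1) {
        // only 1 row so we can only verify if there are two or more columns
        // CSV
    } else if(lines.length > 1 && columns > 1 && columns === lines[1].split(",").length) {
        // we know there are multiple lines with the same number of columns
        // CSV
    } else {
        // can't be sure what it is
        // ???
    }
}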

The above will give you a reasonable amount of certainty.

EDIT I added a quick CSV test as well.

Louis Ricci
  • I believe the content will be one of the 4 choices. But if we wanted to account for the chance that it is none of the 4, how should we go about that? I thought about parsing the document with a JSON, an XML, and a CSV parser. If all three parsers fail, we would throw an exception stating that the document is none of these formats. But I'm limited to what's available in the Java library and cannot use any external libraries for parsing. I believe there's a JSON and an XML parser built in, but no CSV parser yet. – sincreadys Sep 09 '15 at 02:35
  • @sincreadys - Verifying CSV format isn't too difficult, I added some code to demonstrate. The trickiest part of the CSV spec is how to deal with strings and quotes, so first I remove them using a simple replace, then a non-greedy regex to match the surrounding quotes and string content. Once all of the strings are removed we just split on the line break and each line is split on commas (column delimiter). Now it's just a matter of making sure each line contains the same number of columns (which matches the CSV spec). My code only tests the first 2 lines, but you can be more thorough by looping – Louis Ricci Sep 09 '15 at 11:56
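The "more thorough by looping" variant mentioned in the comment could look roughly like the sketch below. `hasConsistentColumns` is a hypothetical name, and the check assumes comma delimiters and double-quote quoting; other CSV dialects would need different rules.

```javascript
// Hypothetical helper: true when every row has the same number of columns (at least two).
function hasConsistentColumns(content) {
    var stripped = content
        .replace(/""/g, "")            // drop escaped quotes
        .replace(/"[\s\S]*?"/g, "");   // drop quoted fields, including ones spanning lines
    // Ignore trailing line breaks so a final newline does not count as an empty row.
    var lines = stripped.replace(/(\r?\n)+$/, "").split(/\r?\n/);
    var expected = lines[0].split(",").length;
    if (expected < 2) return false;    // a single column is too weak a signal
    return lines.every(function (line) {
        return line.split(",").length === expected;
    });
}
```

For example, `hasConsistentColumns("a,b\n1,2,3")` is `false` because the second row has three columns while the header has two.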
0

XBRL is no longer seen as a "language" by its users; it has become a semantic standard for financial business documents. Initially, XML was widely adopted by companies because JSON did not even exist at the time (we are talking about the '90s).

Today, XML is used mainly because it makes it easy to create large amounts of linked data (through XLinks, schemas, and linkbases). However, you are not tied to the XML format; an XBRL file can be represented in XML, JSON, or CSV.

If you already have an XBRL-XML file, you can convert it to the XBRL-JSON format with free, open-source tools, e.g. https://youtu.be/Xr6v4jL535w.

Manglu
0

I would like to specifically address the difference between XML and XBRL.

XML is a syntax. An XML parser may be tasked with parsing out the elements, checking the elements against a schema, and performing other syntax-level validations against the structure of the document. For the most part, parsing XML is a syntax check against the structure of the document.

XBRL leverages the XML format, so all XBRL documents are also XML documents. However, the XBRL specification goes above and beyond an XML parser to ensure that the semantics of the data encoded in the XML format are correct. An XBRL parser, for example, loads a calculation linkbase, if one is defined, and ensures that the numeric values that participate in the calculation add up correctly as defined by the calculation linkbase. Tools such as Gepsio perform this XBRL-specific semantic check work to ensure that the data encoded in the XML format conforms to all of the rules defined in the XBRL Specification.

XBRL is a set of semantic rules applied to XML-encoded data. Valid XBRL is also valid XML, but the reverse is not necessarily true.
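
For the narrower task of telling an XBRL instance apart from other XML, a rough heuristic is to look at the root element. In XBRL 2.1 the root is an `xbrl` element in the `http://www.xbrl.org/2003/instance` namespace. The sketch below is only an illustration: `classifyXml` is a hypothetical name, it uses regular expressions rather than a namespace-aware XML parser, it assumes the text has already been confirmed to be well-formed XML, and it performs none of the semantic validation described above.

```javascript
var XBRL_INSTANCE_NS = "http://www.xbrl.org/2003/instance";

// Hypothetical helper: classify text that is already known to be well-formed XML.
function classifyXml(xmlText) {
    // Skip the XML declaration, processing instructions, comments and a simple DOCTYPE.
    var body = xmlText
        .replace(/<\?[\s\S]*?\?>/g, "")
        .replace(/<!--[\s\S]*?-->/g, "")
        .replace(/<!DOCTYPE[^>]*>/i, "");
    // First start tag: optional prefix, local name, then the raw attribute text.
    var root = body.match(/<([A-Za-z_][\w.\-]*:)?([A-Za-z_][\w.\-]*)([^>]*)>/);
    if (!root) return "unknown";
    var localName = root[2];
    var attributes = root[3];
    // The root's namespace has to be declared on the root itself, so look for the URI there.
    var isXbrlRoot = localName === "xbrl" && attributes.indexOf(XBRL_INSTANCE_NS) >= 0;
    return isXbrlRoot ? "xbrl" : "xml";
}
```

A namespace-aware parser that compares the root element's namespace URI and local name would be more robust than this regex, for instance when attribute values contain `>`.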

JeffFerguson