
I intend to build a RESTful service that will return a custom text format. Given my very large volumes of data, XML/JSON is too verbose. I'm looking for a row-based text format.

CSV is an obvious candidate. I'm wondering, however, if there isn't something better out there. The only ones I've found through a bit of research are CTX and Fielded Text.

I'm looking for a format which offers the following:

  • Plain text, easy to read
  • Very easy to parse on most software platforms
  • Column definitions can change without requiring changes in software clients

Fielded Text is looking pretty good, and I could definitely build a specification myself, but I'm curious to know what others have done, given that this must be a very old problem. It's surprising that there isn't a better standard out there.

What suggestions do you have?

srmark

5 Answers


I'm sure you've already considered this, but I'm a fan of tab-delimited files (\t between fields, newline at the end of each row).
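
If it helps, here is a minimal sketch of producing and parsing tab-delimited rows with Python's standard csv module; the column names and values are just placeholders:

    # Minimal sketch: the built-in csv module handles tab-delimited data
    # via the delimiter argument; columns/values here are made up.
    import csv
    import io

    rows = [
        ["id", "name", "price"],   # header row
        ["1", "widget", "9.99"],
        ["2", "gadget", "14.50"],
    ]

    # Serialize to a tab-delimited string (one row per line).
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    payload = buf.getvalue()

    # Parse it back; clients only need to split on tabs and newlines.
    parsed = list(csv.reader(io.StringIO(payload), delimiter="\t"))
    header, data = parsed[0], parsed[1:]
    print(header, data)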

Brian Driscoll
  • Could you please show me the official spec for this. How to encode Unicode? How to quote a tab in tab-delimited files? How to encode binary data like PDF/PNG in tab-delimited files? – guettli Jan 04 '19 at 13:38
  • @guettli It's known as [tab-separated values (TSV)](https://en.wikipedia.org/wiki/Tab-separated_values). – Meyti Jul 02 '22 at 02:18

I would say that since CSV is the standard, and since everyone under the sun can parse it, use it.

If I were in your situation, I would take the bandwidth hit and use GZIP+XML, just because it's so darn easy to use.

And, on that note, you could always require that your users support GZIP and just send it as XML/JSON, since that should do a pretty good job of removing the redundancy across the wire.
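
As a rough sketch of that approach using only the standard library (the record shape is invented for illustration; in a real service the web framework would normally handle the Content-Encoding header):

    # Compress a JSON payload with gzip; the repeated key names are
    # exactly the redundancy that compresses well.
    import gzip
    import json

    records = [{"id": i, "name": "item-%d" % i, "value": i * 1.5} for i in range(1000)]

    raw = json.dumps(records).encode("utf-8")
    compressed = gzip.compress(raw)
    print(len(raw), "bytes raw ->", len(compressed), "bytes gzipped")

    # A client reverses the process.
    restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
    assert restored == records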

John Gietzen
  • Or rather, no one under the sun can parse it. – danielm Mar 11 '16 at 20:47
  • Could you please show me the official spec for csv. How to encode unicode? How to quote a comma in csv files? How to encode binary data like PDF/PNG in csv? – guettli Jan 04 '19 at 13:39

You could try YAML; its overhead is relatively small compared to formats such as XML or JSON.

Examples here: http://www.yaml.org/

Surprisingly, the website's text itself is YAML.
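
For illustration, here is a small sketch of parsing row-oriented YAML from Python, assuming the third-party PyYAML package is installed (pip install pyyaml); the field names below are made up:

    import yaml  # PyYAML, a third-party package

    doc = """
    - {name: aaa, group: xxx, value: 1}
    - {name: bbb, group: yyy, value: 2}
    - {name: ccc, group: zzz, value: 3}
    """

    # Each list entry is one row; the keys act as column names.
    rows = yaml.safe_load(doc)
    for row in rows:
        print(row["name"], row["value"])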

SirDarius
  • AFAIK YAML is not tabular. Every row can have different attributes/columns. – guettli Sep 20 '19 at 08:45
  • @guettli It could be said that YAML is a superset of tabular formats, as everything which can be stored in CSV can also be stored in YAML. The opposite is of course not true. (example: https://gist.github.com/noirotm/e2eaa5f40a346910901285584dca75c2) – SirDarius Sep 20 '19 at 12:44
  • YAML has a ton of issues; this is discussed [here](https://noyaml.com/) in detail. The implicit conversion of "NO" to a boolean, for example ("NO" could also mean Norway). – asynts Nov 05 '22 at 10:55

I have been thinking about that problem for a while. I came up with a simple format that could work very well for your use case: JTable.

    {
      "header": ["Column1", "Column2", "Column3"],
      "rows":   [
                  ["aaa", "xxx", 1],
                  ["bbb", "yyy", 2],
                  ["ccc", "zzz", 3]
                ]
    }

If you wish, you can find a complete specification of the JTable format, with details and resources. But it is pretty self-explanatory, and any programmer would know how to handle it. The only thing you really need to state is that the payload is JSON.
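
For example, a client could turn such a payload into per-row dictionaries with nothing but the standard library (a minimal sketch; "payload" stands in for whatever the service actually returns):

    import json

    payload = """
    {
      "header": ["Column1", "Column2", "Column3"],
      "rows": [
        ["aaa", "xxx", 1],
        ["bbb", "yyy", 2],
        ["ccc", "zzz", 3]
      ]
    }
    """

    table = json.loads(payload)
    # Zip each row against the header to get one dict per record.
    records = [dict(zip(table["header"], row)) for row in table["rows"]]
    print(records[0])  # {'Column1': 'aaa', 'Column2': 'xxx', 'Column3': 1}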

fralau

Looking through the existing answers, most struck me as a bit dated. Especially in terms of 'big data', noteworthy alternatives to CSV include:

  • ORC : 'Optimised Row Columnar' uses columnar storage, useful in Python/Pandas. Originated in Hive, optimised by Hortonworks. The schema is in the footer. The Wikipedia entry is currently quite terse (https://en.wikipedia.org/wiki/Apache_ORC), but Apache has a lot of detail.

  • Parquet : Similarly column-based, with similar compression. Often used with Cloudera Impala.

  • Avro : from Apache Hadoop. Row-based, but uses a JSON schema. Less capable support in Pandas. Often found in Apache Kafka clusters.

All are splittable, none are human-readable, all describe their content with a schema, and all work with Hadoop. The column-based formats are considered best where accumulated data are read often; for write-heavy workloads, Avro may be better suited. See e.g. https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/

Compression of the column formats can use Snappy (faster) or gzip (slower, but higher compression).
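
As a small sketch of how these formats are typically used from Python, assuming pandas plus a Parquet engine such as pyarrow is installed (the file name and columns are made up):

    import pandas as pd

    df = pd.DataFrame({
        "id": range(1000),
        "category": ["a", "b", "c", "d"] * 250,
        "value": [x * 0.5 for x in range(1000)],
    })

    # Columnar storage with Snappy compression.
    df.to_parquet("example.parquet", compression="snappy")

    # Reading back only the columns you need is where the format pays off.
    subset = pd.read_parquet("example.parquet", columns=["id", "value"])
    print(subset.head())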

You may also want to look into Protocol Buffers, Pickle (Python-specific) and Feather (for fast communication between Python and R).