
I intend to build a RESTful service that will return a custom text format. Given my very large volumes of data, XML/JSON is too verbose. I'm looking for a row-based text format.

CSV is an obvious candidate. I'm wondering, however, if there isn't something better out there. The only ones I've found through a bit of research are CTX and Fielded Text.

I'm looking for a format which offers the following:

  • Plain text, easy to read
  • Very easy to parse on most software platforms
  • Column definitions can change without requiring changes in software clients

Fielded Text is looking pretty good, and I could definitely build a specification myself, but I'm curious to know what others have done, given that this must be a very old problem. It's surprising that there isn't a better standard out there.

What suggestions do you have?

srmark

5 Answers


I'm sure you've already considered this, but I'm a fan of tab-delimited files (\t between fields, newline at the end of each row).
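
If it helps, here is a minimal sketch of producing and parsing tab-delimited rows with Python's standard csv module; the column names and values are just placeholders:

    # Minimal sketch: the built-in csv module handles tab-delimited data
    # via the delimiter argument; columns/values here are made up.
    import csv
    import io

    rows = [
        ["id", "name", "price"],   # header row
        ["1", "widget", "9.99"],
        ["2", "gadget", "14.50"],
    ]

    # Serialize to a tab-delimited string (one row per line).
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    payload = buf.getvalue()

    # Parse it back; clients only need to split on tabs and newlines.
    parsed = list(csv.reader(io.StringIO(payload), delimiter="\t"))
    header, data = parsed[0], parsed[1:]
    print(header, data)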

Brian Driscoll
  • Could you please show me the official spec for this. How to encode Unicode? How to quote a tab in tab-delimited files? How to encode binary data like PDF/PNG in tab-delimited files? – guettli Jan 04 '19 at 13:38
  • @guettli It's known as [tab-separated values (TSV)](https://en.wikipedia.org/wiki/Tab-separated_values). – Meyti Jul 02 '22 at 02:18

I would say that since CSV is the standard, and since everyone under the sun can parse it, use it.

If I were in your situation, I would take the bandwidth hit and use GZIP+XML, just because it's so darn easy to use.

And, on that note, you could always require that your users support GZIP and just send it as XML/JSON, since that should do a pretty good job of removing the redundancy across the wire.
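
As a rough sketch of that approach using only the standard library (the record shape is invented for illustration; in a real service the web framework would normally handle the Content-Encoding header):

    # Compress a JSON payload with gzip; the repeated key names are
    # exactly the redundancy that compresses well.
    import gzip
    import json

    records = [{"id": i, "name": "item-%d" % i, "value": i * 1.5} for i in range(1000)]

    raw = json.dumps(records).encode("utf-8")
    compressed = gzip.compress(raw)
    print(len(raw), "bytes raw ->", len(compressed), "bytes gzipped")

    # A client reverses the process.
    restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
    assert restored == records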

John Gietzen
  • Or rather, no one under the sun can parse it. – danielm Mar 11 '16 at 20:47
  • Could you please show me the official spec for csv. How to encode unicode? How to quote a comma in csv files? How to encode binary data like PDF/PNG in csv? – guettli Jan 04 '19 at 13:39

You could try YAML; its overhead is relatively small compared to formats such as XML or JSON.

Examples here: http://www.yaml.org/

Surprisingly, the website's text itself is YAML.
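
For illustration, here is a small sketch of parsing row-oriented YAML from Python, assuming the third-party PyYAML package is installed (pip install pyyaml); the field names below are made up:

    import yaml  # PyYAML, a third-party package

    doc = """
    - {name: aaa, group: xxx, value: 1}
    - {name: bbb, group: yyy, value: 2}
    - {name: ccc, group: zzz, value: 3}
    """

    # Each list entry is one row; the keys act as column names.
    rows = yaml.safe_load(doc)
    for row in rows:
        print(row["name"], row["value"])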

SirDarius
  • AFAIK YAML is not tabular. Every row can have different attributes/columns. – guettli Sep 20 '19 at 08:45
  • @guettli It could be said that YAML is a superset of tabular formats, as everything which can be stored in CSV can also be stored in YAML. The opposite is of course not true. (example: https://gist.github.com/noirotm/e2eaa5f40a346910901285584dca75c2) – SirDarius Sep 20 '19 at 12:44
  • YAML has a ton of issues; this is discussed [here](https://noyaml.com/) in detail. The implicit conversion of "NO" to a boolean, for example ("NO" could also mean Norway). – asynts Nov 05 '22 at 10:55

I have been thinking about that problem for a while. I came up with a simple format that could work very well for your use case: JTable.

    {
      "header": ["Column1", "Column2", "Column3"],
      "rows":   [
                  ["aaa", "xxx", 1],
                  ["bbb", "yyy", 2],
                  ["ccc", "zzz", 3]
                ]
    }

If you wish, you can find a complete specification of the JTable format, with details and resources. But it is pretty self-explanatory, and any programmer would know how to handle it. The only thing you really need to state is that the payload is JSON.
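
For example, a client could turn such a payload into per-row dictionaries with nothing but the standard library (a minimal sketch; "payload" stands in for whatever the service actually returns):

    import json

    payload = """
    {
      "header": ["Column1", "Column2", "Column3"],
      "rows": [
        ["aaa", "xxx", 1],
        ["bbb", "yyy", 2],
        ["ccc", "zzz", 3]
      ]
    }
    """

    table = json.loads(payload)
    # Zip each row against the header to get one dict per record.
    records = [dict(zip(table["header"], row)) for row in table["rows"]]
    print(records[0])  # {'Column1': 'aaa', 'Column2': 'xxx', 'Column3': 1}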

fralau

Looking through the existing answers, most struck me as a bit dated. Especially in terms of 'big data', noteworthy alternatives to CSV include:

  • ORC : 'Optimised Row Columnar' uses columnar storage, useful in Python/Pandas. Originated in Hive, optimised by Hortonworks. The schema is in the footer. The Wikipedia entry is currently quite terse (https://en.wikipedia.org/wiki/Apache_ORC), but Apache has a lot of detail.

  • Parquet : Similarly column-based, with similar compression. Often used with Cloudera Impala.

  • Avro : from Apache Hadoop. Row-based, but uses a JSON schema. Less capable support in Pandas. Often found in Apache Kafka clusters.

All are splittable, none are human-readable, all describe their content with a schema, and all work with Hadoop. The column-based formats are considered best where accumulated data are read often; for write-heavy workloads, Avro may be better suited. See e.g. https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/

Compression of the column formats can use Snappy (faster) or gzip (slower, but higher compression).
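
As a small sketch of how these formats are typically used from Python, assuming pandas plus a Parquet engine such as pyarrow is installed (the file name and columns are made up):

    import pandas as pd

    df = pd.DataFrame({
        "id": range(1000),
        "category": ["a", "b", "c", "d"] * 250,
        "value": [x * 0.5 for x in range(1000)],
    })

    # Columnar storage with Snappy compression.
    df.to_parquet("example.parquet", compression="snappy")

    # Reading back only the columns you need is where the format pays off.
    subset = pd.read_parquet("example.parquet", columns=["id", "value"])
    print(subset.head())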

You may also want to look into Protocol Buffers, Pickle (Python-specific) and Feather (for fast communication between Python and R).