
I have data with 5 different fields (a combination of ints, strings, and large strings) and I'd like to hold it in some sort of flat file container. I have tens of thousands of such entries but I don't have a need for any sort of database (at all, just need to iterate through the data, no need to query). All the formats I've examined (XML, JSON, YAML) require redundant field names for each entry even though my data is structured and homogeneous. Something like CSV would be great except I can't use commas or newlines as delimiters. Are there any formats you'd recommend?

Example of data format:

id | epoch | short string | url | large description

oxuser
  • I have a question: if you don't index your data, then every time you need to scan and load from element 1 to element Z? – ajreal Dec 17 '11 at 06:06
  • Could you put an example of your data format for us to give you a more specific solution? – Ludovic Kuty Dec 17 '11 at 06:07
  • @ajreal yep I scan this data probably 5 times or so total and never use it again. – oxuser Dec 17 '11 at 06:21
  • @lkuty Just added that to my post. large description is ~300 words or so. Probably doesn't contain newlines. – oxuser Dec 17 '11 at 06:23

3 Answers


A file format similar to CSV seems fitting, and of course you can use whatever delimiter you'd like in your file; you just need to "escape" that character wherever it appears in the data you are storing.

If you do not feel like escaping individual characters, you can use an encoding scheme that doesn't output any of the delimiters you have chosen to use, such as Base64.
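A minimal sketch of the Base64 route in Python (the function names here are made up for illustration): since the Base64 alphabet contains only `A–Z a–z 0–9 + / =`, an encoded field can never collide with a `;` delimiter or a newline.

```python
import base64

DELIM = ";"

def encode_row(fields):
    """Base64-encode each field so it can never collide with the delimiter."""
    return DELIM.join(
        base64.b64encode(str(f).encode("utf-8")).decode("ascii")
        for f in fields
    )

def decode_row(line):
    """Reverse the encoding back to the original field strings."""
    return [
        base64.b64decode(part).decode("utf-8")
        for part in line.strip().split(DELIM)
    ]

row = encode_row([123, 1324102800, "short string", "http://example.com", "large description"])
print(decode_row(row))
```

The trade-off is that the file is no longer human-readable and grows by roughly a third, but parsing becomes a trivial split.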


My dad is stronger than yours!

Which file schema is "the best" depends on so many circumstances. As an example, I'm in love with JSON when it comes to sending smaller chunks of data between a client and a server.

Though I'd think both once and twice before using it in a flat-file schema, especially if there is a lot of data to be contained in there.

JSON is to some extent human readable, which is great for debugging, though not as great for much else.

XML is a great format, and I like the idea behind it though it's way too complex.

CSV files, or similar formats following the same idea, are my five cents.


Sample flat-file schema

id | epoch | short string | url | large description  

 |            -> ; (delimiter)
 id           -> matching /^[0-9]+$/
 epoch        -> matching /^[0-9]+$/ (also known as a unix timestamp)
 url          -> URLs should not contain raw ';'
                 (explicitly check before input)
 short string -> Normalized
 large desc.  -> Normalized

Normalized in the above just means a method of sanitizing the data so that it doesn't interfere with parts of our schema.

Escaping ;\r\n is what we need to make this work, or just, as mentioned earlier, use an encoding algorithm such as Base64.
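As a sketch of the escaping route (Python, with hypothetical helper names): backslash-escape the delimiter and line breaks on the way in, and undo the escapes while splitting on the way out.

```python
DELIM = ";"
ESC = "\\"

def escape_field(s: str) -> str:
    """Escape backslash, the delimiter, and newlines so a field never breaks a row."""
    return (s.replace(ESC, ESC + ESC)
             .replace(DELIM, ESC + DELIM)
             .replace("\n", ESC + "n")
             .replace("\r", ESC + "r"))

def parse_line(line: str) -> list:
    """Split one stored line back into fields, honouring the escapes."""
    fields, cur, i = [], [], 0
    while i < len(line):
        c = line[i]
        if c == ESC and i + 1 < len(line):
            nxt = line[i + 1]
            cur.append({"n": "\n", "r": "\r"}.get(nxt, nxt))
            i += 2
        elif c == DELIM:
            fields.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(c)
            i += 1
    fields.append("".join(cur))
    return fields

line = DELIM.join(escape_field(f)
                  for f in ["1", "1324102800", "a;b", "http://x", "desc\nwith newline"])
print(parse_line(line))
```

Note that a plain `line.split(DELIM)` is no longer safe once escapes exist, which is why the parser walks the line character by character.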

You should also keep in mind the order in which you'd like to store your fields. If you'd like to parse out url more often than epoch, it could be a good idea to put that field as far to the left of the line as possible.

If you'd like to have easy/fast searching you could/should store all "large descriptions" in a separate file, and only fetch/process that data when it's required.

Filip Roséen - refp
  • This is not a bad idea considering my dataset will be at most 10 megabytes (tens of thousands of rows). I'm just not comfortable with using a random delimiter and escaping characters. But I should probably just get over that. – oxuser Dec 17 '11 at 06:25
  • @oxuser normalize your data by using an encoding algorithm such as Base64 (mentioned/linked in the post), then you don't have to worry about colliding delimiters and such. – Filip Roséen - refp Dec 17 '11 at 06:28
  • Oh nice I never knew about this. What is Base64 usually used for? Purposes like this? – oxuser Dec 17 '11 at 06:32
  • @oxuser Well, not really used for purposes such as this, though it's a simple method of getting control over the characters used to represent an arbitrary sequence of bytes. Base64 consists only of printable characters and is often used to represent raw binary data. A very popular example use case of Base64: http://www.gnu.org/usenet/usenet-gpg-key.txt – Filip Roséen - refp Dec 17 '11 at 06:47
  • @oxuser please mark the answer as accepted, unless you are looking for an alternative one? Then update your question and let me know so I can help you out. Thank you. – Filip Roséen - refp Dec 20 '11 at 15:55

You could use a JSON array instead of an object. This way you limit the noise to a minimum. It could be a single array or an array of arrays depending on the format of your data.

It would be less verbose than XML. I don't know about YAML.

For example, you could have:

[
    [123, 123456789, "short string", "http://url", "large ... description"],
    [123, 123456789, "short string", "http://url", "large ... description"],
    [123, 123456789, "short string", "http://url", "large ... description"],
    [123, 123456789, "short string", "http://url", "large ... description"]
]
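Reading such a file back is trivial with any JSON library; since each inner array keeps its fields in a fixed order, you can unpack them positionally. A Python sketch (the filename is hypothetical):

```python
import json

rows = [
    [123, 123456789, "short string", "http://url", "large ... description"],
    [124, 123456790, "another string", "http://url2", "more text"],
]

# Write the whole array of arrays once.
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(rows, f)

# Iterate later without any per-entry field names:
# position encodes meaning (id, epoch, short string, url, description).
with open("data.json", encoding="utf-8") as f:
    for entry_id, epoch, short, url, description in json.load(f):
        print(entry_id, url)
```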
Ludovic Kuty
  • My fields are not all of the same type (int, strings), so can I store them all in a JSON array? What would it look like? – oxuser Dec 17 '11 at 06:26
  • JSON can take care of that, no problem. The array doesn't have to be homogeneous. – Ludovic Kuty Dec 17 '11 at 06:28
  • I like this idea. Kind of a wonky way to use JSON, but makes sense. – oxuser Dec 17 '11 at 06:29
  • Note that in my example, the array values are not of the same type, but they are at the same position in each array: first is "id", second is "epoch", ... like in your example. If the position might vary too, then you need a way to recognize them. In that case a JSON object may be the way to go. – Ludovic Kuty Dec 17 '11 at 06:32
  • Implementing/using a proper JSON parser is much more complex than parsing a simple CSV though. And if you aren't going to use the full set of json features (such as representing objects etc etc) there really is no need for such a scheme. – Filip Roséen - refp Dec 17 '11 at 06:55
  • Keeping it simple is the golden rule, simple but fully functional. – Filip Roséen - refp Dec 17 '11 at 06:55
  • I ended up going with using multiple JSON files. – oxuser Dec 22 '11 at 00:56

You could use CSV with your own delimiter, like $%*,;. Otherwise you could escape the commas and newlines in your text.
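For instance, Python's `csv` module lets you pick the delimiter and handles quoting of embedded commas and newlines for you, so no hand-rolled escaping is needed (the filename is hypothetical):

```python
import csv

rows = [
    [123, 1324102800, "short string", "http://example.com",
     "text with, commas\nand newlines"],
]

# Write with a tab delimiter; fields containing the delimiter,
# quotes, or newlines are quoted automatically.
with open("data.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)

# The reader transparently reassembles quoted multi-line fields.
with open("data.tsv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)
```

Note that the reader returns every field as a string, so numeric fields need an explicit `int()` conversion on the way back in.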

Udo Held