76

What's the best separator/delimiter character(s) for a plaintext db file?

I considered using |, ,, <TAB>, ;, etc. But they all seem to be possible to break when the nearby entries have special enough characters.

So, the experienced database users, what delimiter character(s) do you suggest to use?

Meng Lu
  • 13,726
  • 12
  • 39
  • 47
  • 2
    What if I use a combination of 2 or more special characters as my separator? Chances to breach my structure will be less plus it will be simple as well. – Iqra. Jul 04 '17 at 05:30

12 Answers12

65

Well, there are few separator characters in US-ASCII, hex 1c, 1d, 1e and 1f. The plain text shouldn't contain them.

1c  FS  ␜  ^\  File Separator
1d  GS  ␝  ^]  Group Separator
1e  RS  ␞  ^^  Record Separator
1f  US  ␟  ^_  Unit Separator
Michas
  • 8,534
  • 6
  • 38
  • 62
  • 7
    In essence, that is what they're for, and it would be nice if this became a standard. But there are some issues with this. Not all import/export tools of DB technologies support non-printable separators. E.g. bcp of SQL Server. https://dba.stackexchange.com/questions/208982/use-non-printable-characters-in-sql-server-bcp-xml-format-file-as-terminator – Wouter Jun 07 '18 at 12:04
  • 1
    Additionally, if there are encoding issue in your source data, the data could potentially already contain those byte sequences. The only way to be sure is to first search your entire dataset for the separator you are planning to use. – Wouter Jun 07 '18 at 12:05
  • If you want to use these, require printable characters and can use Unicode, then the Unicode "Control Pictures" block (U+2400 to U+243F) has symbols for these four characters and the escape character. – Simon G. May 03 '23 at 14:39
49

No matter which character you choose as your separator, you'll want to escape any instance of that character in your data.

Perhaps tilde(~), or go to a high-ASCII character.

Either way, if there's any chance that it could sneak into your data, you'd want to escape it before writing to your plaintext file.

p.campbell
  • 98,673
  • 67
  • 256
  • 322
  • 5
    +1. Escaping is the only solution where _any_ character can be in the data. I usually prefer `|` myself (along with `\|` and `\\` as escapes) but tilde is as good as any other. – paxdiablo Jun 12 '11 at 01:55
  • 5
    This is supposed to be a practical question. I have the one-time freedom of choosing a delimiter character or string, but I don't have control over what can appear in the db cells. In that case, I want to make a decision on choosing a delimiter character/string, that will break at a small chance. – Meng Lu Jun 12 '11 at 02:57
  • 17
    If you have no idea what your data will be, then you cannot choose a delimiter. Any delimiter could break. Practically speaking: your problem is under constrained, it has no correct answer. – Chris Subagio Jun 13 '11 at 02:33
  • 1
    Could anyone specify what he means by High-ASCII characters? – Iqra. Jul 04 '17 at 05:29
  • 4
    @Eqra ASCII only contains 127 characters, but "High-ASCII" or "extended ASCII" extended contains 255 characters. https://en.wikipedia.org/wiki/Extended_ASCII (Better late than never.) – henrycjc Jul 13 '17 at 04:45
  • @henrycjc Yes, I have got the answer, later on, high thanks :-) – Iqra. Jul 13 '17 at 07:44
40

I think the best way to join string with a three cherries '@@@'.

Emis
  • 594
  • 1
  • 5
  • 11
  • 17
    in italian the call them chiocciola or snails – thermite Feb 01 '14 at 18:50
  • 17
    Downvote for lack of explanation/argumentation. What makes @@@ different from any other arbitrary string or character? – Wouter Jun 07 '18 at 12:06
  • 4
    @Wouter The "@" symbol visually stands out, filling more space than other common symbol characters (with most common typefaces). In real data, single at-symbols are extremely common (email addresses). Double at-symbols are common (email address typos). Three are unlikely, and easy to spot-check when debugging/testing/auditing char escaping (compared to e.g. newlines, spaces, tabs, commas, ampersands). – Max Terry Mar 22 '19 at 20:06
  • Possible explanation could be provided. Though it will still depend on the kind of data you put in. In case when you have raw collection of data it is hard that you will get the correctly formatted data whatever you choose your delimiter to be. For general use cases a combination of letters can work out well. – Rajan Chauhan Apr 22 '19 at 18:26
  • 3
    On a side note: in Austria we call the @ “Klammeraffe” or spider monkey. Easy to see why – Stefan Haberl Jul 14 '19 at 20:28
  • 1
    Wouldn't this add a decrease in big(O) performance if it has to match more than 1 character? – Fiddle Freak Jul 06 '20 at 13:54
17

For a particular data warehousing situation where we had control over the source file, but escaping and qualifying were onerous, we were able to make the business decision that one extended ASCII character would be stripped from the data (if it ever occurs, which it hasn't).

On creation of the delimited source file, we stripped out any instances of █ (alt+219) in the data and use that character for the delimiter. Bonus, that character is really easy to spot.

japage
  • 171
  • 1
  • 2
11

Actually, it depends on the type of data you are trying to separate, we needed a separator for the machine events data and a couple of them were proposed:

=) or ^_^.

We chose ^_^ because it actually worked based on the number of samples tested and it also looks cute!

Coder Absolute
  • 5,417
  • 5
  • 26
  • 41
5

I usually prefer non-printable characters like "\u0001", for instance I use this as a column delimiter in most of my Azure Data Analytics U-SQL Scripts. That is assuming you can use a multi-character custom delimiter

dim_user
  • 969
  • 1
  • 13
  • 24
4

Personally I like using « as a delimiter character to split data in CSV files, I don't think I've ever found a naturally occurring instance of « and » personally, so here are my two cents about it.

RedNet
  • 416
  • 5
  • 12
4

You could use the special separator characters (hex 1c -> 1f), yet they are non-printable, and some technologies have issues processing data containing them.

So, plan B, if your data is in UTF-8, you could pick a random UTF-8 character that is extremely unlikely to appear in any source data you receive.

Yet, even then, if you want to be sure you'll not run into issues, you better always scan your entire dataset for this character, and if it appears, simply pick another UTF-8 character.

I tend to hate encapsulation with a passion, and avoid it whenever possible, as explained in my post under the chapter 'encapsulation' here: https://theonemanitdepartment.wordpress.com/2014/12/15/the-absolute-minimum-everyone-working-with-data-absolutely-positively-must-know-about-file-types-encoding-delimiters-and-data-types-no-excuses/

Wouter
  • 1,829
  • 3
  • 28
  • 34
1

If you can't control the data being put into it, don't use a plain text db. There can be no generally right answer here. Without context or constraints this is a false question.

To wit: If I said I was only going to accept lower case letters as data, I could use any other symbol as a separator. Even, say, the number 9, and I'd be fine. No symbol other than a lower case character would be better than any other.

Conversely, if said I could accept any character, then I don't have any characters left for a separator, and I'd be left with a very sorry database that could only store a single value.

If you have to try too hard to get your db into plain text, you probably want a binary db. Have you looked at sqlite? It's pretty darned easy to use, is available in many contexts, and comes with a ton of benefits over a plain text db.

Chris Subagio
  • 6,099
  • 1
  • 15
  • 7
0

If you have the option of a string as column separator, use "" as delimiter. You can make up any string for that matter and gives you flexibility.

svargh
  • 9
  • 1
0

I've used an ePUB convertor before and the delimiter char was the notational quote character, anywhere it had been used it would be rewritten to file as @, simple but effective even if it did destroy the sample material being produced.

-1

I propose the interrobang character "‽". More details: https://en.wikipedia.org/wiki/Interrobang

  • 1
    Plaintext may have various problems using those unicode characters. – Markus Zeller Jan 30 '20 at 10:38
  • 4
    Sorry, downvoting because you have not explained why that character is particularly suited, nor does your linked Wikipedia article mention anything about its use as a delimiter character. I'm puzzled by the upvotes. I can only guess that people are simply upvoting it because it's out of ASCII range. Well, there are hundreds of thousands of other characters that also qualify according to that standard. – JoelFan Feb 25 '22 at 22:58