1

I have been having trouble finding the metacharacter for the 'Unit Separator' to replace the tabs in a flat file.

So far I have this:

File.WriteAllLines(outputFile,
    File.ReadLines(inputFile)
    .Select(t => t.Replace("\t", "\0x1f")));  //this does not work

I have also tried:

File.WriteAllLines(outputFile,
    File.ReadLines(inputFile)
    .Select(t => t.Replace("\t", "\u"))); //also doesn't work

AND

File.WriteAllLines(outputFile,
    File.ReadLines(inputFile)
    .Select(t => t.Replace("\t", 0x1f)));  //also doesn't work

How do I correctly use hex as a parameter? Also, what is the metacharacter for the 'Unit Separator'?

Jashaszun
J.S. Orris
  • 1
    The first variant is almost exactly that; only you don't need the `0` between `\` and `x`. Check section 2.4.4.4 of the C# Language Specification (downloadable at https://www.microsoft.com/en-us/download/details.aspx?id=7029) – ach Aug 11 '15 at 16:55

3 Answers

4

The Unicode code point for the unit separator is

U+001F

In a C# string literal you can write it as \u001f, so you should be able to use it like this:

File.WriteAllLines(outputFile,
File.ReadLines(inputFile)
.Select(t => t.Replace("\t", "\u001f")));
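
As a rough sketch that is not part of the original answer, the small program below compares the spellings discussed in this thread and shows why the first attempt in the question does something different. The class and member names are made up for illustration.

```csharp
using System;

static class UnitSeparatorEscapes
{
    // Three spellings of U+001F (Unit Separator) in C# source.
    public const string ViaUnicodeEscape = "\u001f"; // \u takes exactly four hex digits
    public const string ViaHexEscape = "\x1f";       // \x takes one to four hex digits
    public static readonly string ViaCast = ((char)0x1f).ToString();

    // The first attempt from the question: "\0" is the NUL character,
    // followed by the three ordinary characters 'x', '1' and 'f'.
    public const string NulThenText = "\0x1f";

    static void Main()
    {
        Console.WriteLine(ViaUnicodeEscape == ViaHexEscape); // True
        Console.WriteLine(ViaUnicodeEscape == ViaCast);      // True
        Console.WriteLine(NulThenText.Length);               // 4

        string converted = "a\tb\tc".Replace("\t", ViaUnicodeEscape);
        Console.WriteLine(converted == "a\u001fb\u001fc");   // True
    }
}
```

Because `\x` consumes up to four hex digits, a literal such as `"\x1fb"` is read as the single character U+01FB rather than a unit separator followed by `b`; the fixed-width `\u001f` form avoids that ambiguity.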

EDIT: Since a discussion about control characters started to happen, I'll add this definition for posterity's sake.

A special, non-printing character that begins, modifies, or ends a function, event, operation or control operation. The ASCII character set defines 32 control characters. Originally, these codes were designed to control teletype machines. Now, however, they are often used to control display monitors, printers, and other modern devices.

from here.

Also, here is a description of the unit separator:

The smallest data items to be stored in a database are called units in the ASCII definition. We would call them field now. The unit separator separates these fields in a serial data storage environment. Most current database implementations require that fields of most types have a fixed length. Enough space in the record is allocated to store the largest possible member of each field, even if this is not necessary in most cases. This costs a large amount of space in many situations. The US control code allows all fields to have a variable length. If data storage space is limited—as in the sixties—this is a good way to preserve valuable space. On the other hand is serial storage far less efficient than the table driven RAM and disk implementations of modern times. I can't imagine a situation where modern SQL databases are run with the data stored on paper tape or magnetic reels...

from here.
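
To make the variable-length-field idea concrete, here is a minimal sketch (my own illustration, with hypothetical class and method names) that joins fields of different lengths with the unit separator and splits them back apart:

```csharp
using System;

static class UnitSeparatedRecord
{
    const char UnitSeparator = '\u001f';

    // Joins variable-length fields into one record line.
    public static string Join(params string[] fields)
    {
        return string.Join(UnitSeparator.ToString(), fields);
    }

    // Splits a record line back into its fields; empty fields are kept.
    public static string[] Split(string record)
    {
        return record.Split(UnitSeparator);
    }

    static void Main()
    {
        string record = Join("42", "Ada Lovelace", "");
        string[] fields = Split(record);

        Console.WriteLine(fields.Length); // 3
        Console.WriteLine(fields[1]);     // Ada Lovelace
    }
}
```

No field needs padding to a fixed width, and because the separator is a control character it is unlikely to collide with ordinary text in the data.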

psoshmo
  • I'm giving it a go (9 mill records to load)....I will be using bcp to import into SQL Server, in my bcp statement, do I use -t"\u001f" or -t"u001f" or -`tu001f` for the field terminator? – J.S. Orris Aug 11 '15 at 16:41
  • @JeffOrris Read my edit about the Unicode symbol; \u001f may actually be the correct one. I believe ashes' answer didn't work because of the second \ in his `\\u001f`. Try it without it and see. As for the question in your comments, I am unsure. – psoshmo Aug 11 '15 at 16:43
  • This seems like it might have worked... do Unit Separators look like a clear rectangle (long side vertical)? – J.S. Orris Aug 11 '15 at 16:43
  • For me I get a square with a question mark within it when run in the console for Win10. – Martin Noreke Aug 11 '15 at 16:45
  • @MartinNoreke which unicode character are you using? – psoshmo Aug 11 '15 at 16:45
  • @JeffOrris did you use \u001f or \u241F – psoshmo Aug 11 '15 at 16:46
  • I used char unitSeperatorChar = (char)Convert.ToInt32("0x1f", 16); – Martin Noreke Aug 11 '15 at 16:46
  • @MartinNoreke what happens if you use 0x241F? – psoshmo Aug 11 '15 at 16:49
  • That evaluates to 9247, which in VS shows a small UE. Console has a ? as output. – Martin Noreke Aug 11 '15 at 16:51
  • 1
    @JeffOrris You should use \u001f. Keep in mind these are called control characters and are not meant to be printed or read by humans, but are intended to be a nice delimiter for the computer to use. – psoshmo Aug 11 '15 at 16:53
  • @psoshmo `\u001f` worked great! Please update the answer in the code block to reflect. – J.S. Orris Aug 11 '15 at 16:54
  • @JeffOrris just edited and added the definition for control characters as well – psoshmo Aug 11 '15 at 16:59
0

I think the correct way to encode Unicode characters in C# is to use the \unnnn format. You can try replacing the tab with the string \u001f, like so:

File.WriteAllLines(outputFile,
    File.ReadLines(inputFile)
    .Select(t => t.Replace("\t", "\u001f")));

Does that work?

ashes999
  • I'm giving it a go (9 mill records to load)....I will be using `bcp` to import into SQL Server, in my `bcp` statement, do I use `-t"\\u001f"` or `-tu001f` for the field terminator? – J.S. Orris Aug 11 '15 at 16:34
  • Why the double backslash? One backslash is enough. – ach Aug 11 '15 at 16:49
  • 1
    There is a little shorter equivalent form, `"\x1f"` (`\u` requires *exactly* 4 hex-digits, `\x` requires *up to* 4 hex-digits) – ach Aug 11 '15 at 16:50
0

This should get you where you need to be:

        char unitSeperatorChar = (char)Convert.ToInt32("0x1f", 16);
        string contents = File.ReadAllText(inputFile);
        string convertedContents = contents.Replace('\t', unitSeperatorChar);
        File.WriteAllText(outputFile, convertedContents);

I loaded into a string, converted, and re-saved. You can combine them for better memory efficiency in string management.
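
One possible way to combine the steps without holding the whole file in memory is sketched below. It streams line by line, as the code in the question does; the class and method names and the temp files are hypothetical and only there to make the example runnable. Note that, unlike the `ReadAllText`/`WriteAllText` version, this approach writes every line with `Environment.NewLine`, so the original line endings are not necessarily preserved.

```csharp
using System;
using System.IO;
using System.Linq;

static class TabToUnitSeparator
{
    // Streams the input line by line so the whole file is never held in memory.
    public static void ConvertFile(string inputFile, string outputFile)
    {
        File.WriteAllLines(outputFile,
            File.ReadLines(inputFile).Select(line => line.Replace('\t', '\u001f')));
    }

    static void Main()
    {
        // Hypothetical temp files, used only to demonstrate the call.
        string input = Path.GetTempFileName();
        string output = Path.GetTempFileName();
        File.WriteAllLines(input, new[] { "a\tb", "c\td" });

        ConvertFile(input, output);

        Console.WriteLine(File.ReadAllLines(output)[0] == "a\u001fb"); // True

        File.Delete(input);
        File.Delete(output);
    }
}
```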

Martin Noreke
  • I did not downvote you...I will be trying this next. – J.S. Orris Aug 11 '15 at 16:40
  • 1
    Just a question for those who are down-voting this answer: Why? I'm curious as to why you see it as incorrect or invalid so that I can improve my answers in the future. – Martin Noreke Aug 11 '15 at 16:46
  • 1
    1) `(char)0x1f` is equal to your first line. 2) `File.ReadLines` returns `IEnumerable<string>`, not `string`. 3) `File.WriteAllLines`'s second parameter is `IEnumerable<string>`. So before being curious, put your code into VS or LINQPad and test it. – EZI Aug 11 '15 at 16:57
  • Changed to use Text versions of methods. Minor lapse when converting a string based test to use the File.Load for posting answer. – Martin Noreke Aug 11 '15 at 17:01