4

I need to parse a file in TSV format (tab-separated values). I use a regex to break the file into lines, but I cannot find a satisfactory one to parse each line. For now I've come up with this:

(?<g>("[^"]+")+|[^\t]+)

But it does not work if an item in the line has more than 2 consecutive double quotes.

Here's how the file is formatted: elements are separated by tabs. If an item contains a tab, it is enclosed in double quotes. If an item contains a double quote, the quote is doubled. But sometimes an element contains 4 consecutive double quotes, and the above regex splits that element into 2 different ones.

Examples:

item1ok "item""2""oK"

is correctly parsed into 2 elements: item1ok and item"2"ok (after trimming of the unnecessary quotes), but:

item1oK "item""""2oK"

is parsed into 3 elements: item1ok, item and "2ok (after trimming again).

Does anyone have an idea how to make the regex handle this case? Or is there another simple way to parse TSV? (I'm doing this in C#.)
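For reference, here is a minimal sketch of how the pattern could be adapted to the rules described above. The original pattern fails because `[^"]+` needs at least one non-quote character between two quotes, so `""""` ends the quoted item early. The sketch instead treats a doubled quote as one unit inside a quoted item. The class name `TsvRegex`, the method `SplitLine` and the group names `quoted` and `plain` are made up for illustration, and the code assumes the file has already been split into lines with no tabs-in-quotes spanning several lines.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TsvRegex
{
    // An item starts at the beginning of the line or right after a tab.
    // It is either a quoted item (inner quotes doubled, closing quote
    // followed by a tab or the end of the line) or a run of non-tab characters.
    private static readonly Regex Field = new Regex(
        "(?<=^|\\t)(?:\"(?<quoted>(?:[^\"]|\"\")*)\"(?=\\t|$)|(?<plain>[^\\t]*))");

    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            Group quoted = m.Groups["quoted"];
            fields.Add(quoted.Success
                ? quoted.Value.Replace("\"\"", "\"") // collapse doubled quotes
                : m.Groups["plain"].Value);          // unquoted item, kept as-is
        }
        return fields;
    }
}
```

With this sketch, `TsvRegex.SplitLine("item1oK\t\"item\"\"\"\"2oK\"")` should give two elements, `item1oK` and `item""2oK`. The lookbehind is used rather than consuming the tab so that empty items at the start of a line are not skipped.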

Antoine

3 Answers

8

You could use the TextFieldParser. It technically lives in a VB assembly, but you can use it from C# as well by referencing the Microsoft.VisualBasic assembly and importing the Microsoft.VisualBasic.FileIO namespace.

The example at the link above even shows using it on a tab-separated file.
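A minimal sketch of what that could look like from C#, assuming the project references Microsoft.VisualBasic. The wrapper class `TsvReader` and its method `ReadAll` are hypothetical names; `TextFieldParser`, `SetDelimiters`, `HasFieldsEnclosedInQuotes`, `EndOfData` and `ReadFields` are the parser's own members.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TsvReader
{
    // Reads every row from the given reader and returns the fields of each row.
    public static List<string[]> ReadAll(TextReader input)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("\t");               // tab-separated
            parser.HasFieldsEnclosedInQuotes = true;  // quoted items may contain tabs
            while (!parser.EndOfData)
            {
                // ReadFields throws MalformedLineException on a line it cannot parse.
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

It could be called as `TsvReader.ReadAll(new StreamReader(path))`, where `path` is your file. Note that the parser trims whitespace around fields by default (`TrimWhiteSpace`) and skips blank lines, which may or may not be what you want.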

Adam Neal
  • 2
    +1 It's part of the .Net framework: it's supported by Microsoft, it doesn't need separate deployment. – MarkJ Mar 09 '10 at 17:47
  • 2
    Just be aware that this is not usable in .NET Core and .NET Standard, because the VisualBasic code is not open source and will not be ported... ever. – Piotr Kula Feb 28 '18 at 09:01
  • @PiotrKula I just created a VB.NET project in .NET 6, and I am able to use it. – Anirudha Gupta Aug 08 '22 at 03:43
  • Ahh cool, in .NET 6 it should be fine now because they implemented .NET Standard 2.1 for almost all of the code base. So that is awesome. Thanks! – Piotr Kula Aug 18 '22 at 08:58
6

Instead of trying to build your own CSV/TSV file parser (or using String.Split), I'd recommend you have a look at "Fast CSV Reader" or "FileHelpers library".

I'm using the first one, and am very happy with it (it supports any separator characters, e.g. comma, semicolon, tab).

M4N
  • I've used the Lumenworks CSV reader; it works well and would form a good base for a TSV reader. – Lazarus Mar 09 '10 at 16:53
  • That's surely a good solution, but I want to avoid additional dependencies to my code, so the .net class answer suits my needs better. – Antoine Mar 10 '10 at 12:49
  • M4N, Lumenworks' CSV reader works well, except that it's getting confused between CSV and TSV (I think, anyways) on a specific line, since commas and quotes are on the same line or something. Do you know how to get it to only look at tabs for separation? – TankorSmash Jan 07 '13 at 17:00
  • Disregard that, solved the problem: http://stackoverflow.com/questions/2425800/quotes-in-tab-delimited-file/14200980#14200980 – TankorSmash Jan 07 '13 at 17:29
1

Instead of using a regex, maybe you could try the String.Split(Char[]) method.
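A minimal sketch of that approach; the class name `NaiveTsv` and the method `SplitLine` are made up for illustration. It only works under the assumption that no item ever contains a tab: `Split` knows nothing about quoting, so a quoted item with a tab inside is cut in two and the surrounding quotes are left in place.

```csharp
using System;

public static class NaiveTsv
{
    // Splits on every tab character; quotes are not interpreted at all.
    public static string[] SplitLine(string line)
    {
        return line.Split(new[] { '\t' });
    }
}
```

For the file format described in the question, where quoted items can contain tabs, this would need extra handling on top.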

DaveB