I have a very large TSV file. The first line contains the headers. Each following line contains data fields separated by tabs; a blank field shows up as a double tab. Non-blank fields can contain alphanumerics, or alphanumerics plus punctuation marks.

For example:

Field1<tab>Field2<tab>FieldN<newline>

The fields may contain spaces, punctuation or alphanumerics. The only things that always hold are:

  1. Each field is followed by a tab, except the last one.
  2. The last field is followed by a newline.
  3. Blank fields are empty but, like all other fields, are still followed by a tab. This is what produces the double tab.

I've tried many combinations of pattern matching in Lua and never get it quite right. Typically the fields with punctuation (time and date fields) are the ones that trip me up.

I need the blank fields (the ones that show up as a double tab) preserved, so that the remaining fields always end up at the same index.

Thanks in advance!

Egor Skriptunoff
  • Any ideas of your own? Why do you have problems with punctuation if tabs separate your values? Please show us some of your attempts. We don't like to just give code away. – Piglet Dec 07 '19 at 08:44
  • You could use pattern `\t?[^\t]*` to match both empty and non-empty data. For example, if there are 3 columns in your file: `for col1, col2, col3 in tsv_string:gmatch"\t?([^\t]*)\t\t?([^\t]*)\t\t?([^\t]*)\n" do print(col1, col2, col3) end` – Egor Skriptunoff Dec 07 '19 at 09:06
  • Sorry, I should have included code. Here's where I am currently: `for header in line:gmatch("[(%g\t)$]+") do`. The problem is that it matches the whole line instead of a single tab-separated field. – Argh Tastic Dec 09 '19 at 18:21

2 Answers

2

Try the code below:

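function test(s)
    local n=0 -- field counter
    s=s..'\t' -- terminate the last field with a tab as well
    -- "(.-)\t" lazily captures everything up to the next tab,
    -- so an empty field is captured as an empty string
    for w in s:gmatch("(.-)\t") do
        n=n+1
        print(n,"["..w.."]")
    end
end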

test("10\t20\t30\t\t50")
test("100\t200\t300\t\t500\t")

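It adds a tab to the end of the string so that every field is followed by a tab, even the last one. The empty field between the two consecutive tabs is printed as `[]`, so the remaining fields keep their index. Note that the second test string already ends with a tab, so it yields one extra empty field at the end.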
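If the fields are needed in a table rather than printed, the same idea can be wrapped in a small helper. This is only a sketch: `split_tabs` is a hypothetical name, and the sample row is made up to resemble the date and time fields mentioned in the question.

```lua
-- Hypothetical helper (not part of the standard library):
-- split one TSV line into a table of fields.
-- Empty fields are returned as "" so column indexes stay stable.
local function split_tabs(line)
    local fields = {}
    -- Append a tab so the last field is terminated like all the others.
    for field in (line .. "\t"):gmatch("(.-)\t") do
        fields[#fields + 1] = field
    end
    return fields
end

-- Made-up sample row: date, time, a blank field, free text.
local row = split_tabs("2019-12-07\t08:44:00\t\tsome text, with punctuation!")
for i, v in ipairs(row) do
    print(i, "[" .. v .. "]")
end
```

Because only the tab is treated as a separator, spaces and punctuation inside a field need no special handling.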

lhf
  • This seems to be what I needed. Thank you very much for the correction. It's been several years since I've done any Lua, and I was never very good with its pattern matching. My apologies for not posting code with my initial post. – Argh Tastic Dec 09 '19 at 18:26
0

Read the file line by line and store the values column-wise, so that each value is addressed as tables[n_column][n_row]:

local filename = "big_tables.tsv"  -- tab separated values
-- local filename = "big_tables.csv" -- comma separated values

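local lines = io.lines(filename) -- iterate over the file line by line
local tables = {} -- columns and rows stored as tables[n_column][n_row] = value
for line in lines do -- row iterator
    local i = 1 -- first column
    -- Append a tab so every field, including the last, ends with one;
    -- "(.-)\t" then also captures empty fields as "" (a pattern such as
    -- "[^%s]+" would skip them and would split on spaces as well).
    for value in string.gmatch(line .. "\t", "(.-)\t") do -- tab separated values
--  for value in (string.gmatch(line, '%d[%d.]*')) do -- comma separated values
        tables[i] = tables[i] or {} -- create the column on first use
        tables[i][#tables[i] + 1] = tonumber(value) or value -- keep non-numeric and empty values as strings
        i = i + 1 -- next column
    end
end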
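Since the question says the first line holds the headers and that blank fields must keep their position, a variant that splits on tabs only and addresses columns by header name might look like the sketch below. The names `split_tabs` and `load_columns` are hypothetical, the sample data is made up, and an in-memory string stands in for `io.lines(filename)` so the example runs without a file.

```lua
-- Hypothetical helper: split one line on tabs, keeping empty fields as "".
local function split_tabs(line)
    local fields = {}
    for field in (line .. "\t"):gmatch("(.-)\t") do
        fields[#fields + 1] = field
    end
    return fields
end

-- Hypothetical helper: build columns[header_name][n_row] from a line iterator.
-- Returns the columns table and the number of data rows.
local function load_columns(lines)
    local first = lines() -- the first line holds the column names
    if not first then return {}, 0 end
    local header = split_tabs(first)
    local columns = {}
    for _, name in ipairs(header) do columns[name] = {} end
    local n = 0
    for line in lines do
        n = n + 1
        local fields = split_tabs(line)
        for i, name in ipairs(header) do
            columns[name][n] = fields[i] or "" -- short rows are padded with ""
        end
    end
    return columns, n
end

-- Made-up sample; with a real file, pass io.lines(filename) instead.
local sample = "date\ttime\tnote\n2019-12-07\t08:44\t\n2019-12-09\t\tsecond row\n"
local columns, n = load_columns(sample:gmatch("[^\n]+"))
print(n, columns.date[1], "[" .. columns.time[2] .. "]", columns.note[2])
```

Values are kept as strings here; applying `tonumber` per column is left to the caller, since date and time fields would not convert.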
darkfrei