
I have a large (>10GB) file which is an InfluxDB line protocol export. Line protocol format is roughly

measurement,tag1=value1,tag2=value2,... value=XXX timestamp

for example

deconz.0.Sensors.10.pressure value=998 1622621407241000000
deconz.0.Sensors.10.pressure value=999 1622621970836000000
deconz.0.Sensors.10.pressure value=999 1622624177180000000
deconz.0.Sensors.10.pressure value=999 1622625419255000000
deconz.0.Sensors.10.pressure value=998 1622625975843000000
deconz.0.Sensors.11.battery value=85 1622621407241000000
deconz.0.Sensors.11.battery value=88 1622623616070000000
deconz.0.Sensors.11.battery value=88 1622660536826000000
deconz.0.Sensors.11.battery value=85 1622663594301000000
deconz.0.Sensors.11.battery value=88 1622666692089000000
deconz.0.Sensors.11.temperature value=21.44 1622621407241000000
deconz.0.Sensors.11.temperature value=21.61 1622646781032000000
deconz.0.Sensors.11.temperature value=21.64 1622650221200000000
deconz.0.Sensors.12.humidity value=55.54 1622621407242000000
deconz.0.Sensors.12.humidity value=55.7 1622633302339000000
deconz.0.Sensors.12.humidity value=55.73 1622636722283000000
deconz.0.Sensors.12.humidity value=55.89 1622640061715000000
deconz.0.Sensors.12.humidity value=55.96 1622643481822000000
deconz.0.Sensors.13.battery value=85 1622621407242000000
deconz.0.Sensors.13.battery value=85 1622908043752000000
deconz.0.Sensors.13.temperature value=24.01 1622621407242000000
deconz.0.Sensors.13.temperature value=24.13 1622626969228000000
deconz.0.Sensors.13.temperature value=24.21 1622630216027000000
deconz.0.Sensors.13.temperature value=24.33 1622630974954000000
deconz.0.Sensors.14.humidity value=47.72 1622632937200000000
deconz.0.Sensors.14.humidity value=47.8 1622633311833000000
deconz.0.Sensors.14.humidity value=46.7 1622636659393000000
deconz.0.Sensors.15.pressure value=1002 1622673441206000000
deconz.0.Sensors.15.pressure value=1002 1622685777307000000
deconz.0.Sensors.15.pressure value=1003 1622686242842000000
deconz.0.Sensors.16.temperature value=23.47 1622654455194000000
deconz.0.Sensors.16.temperature value=23.55 1622655939005000000
deconz.0.Sensors.16.temperature value=23.57 1622655959670000000
energymeter_total,uuid=c4695262-624c-11ea-b2f7-374e5ccddc43 value=30436.6 1622594844107000000
energymeter_total,uuid=c4695262-624c-11ea-b2f7-374e5ccddc43 value=30436.6 1622594908800000000
energymeter_total,uuid=c4695262-624c-11ea-b2f7-374e5ccddc43 value=30436.6 1622594973493000000
energymeter_total,uuid=c4695262-624c-11ea-b2f7-374e5ccddc43 value=30436.6 1622595158917000000
energymeter_total,manual=true value=26984.9 1592641140000000000

I want to split this file by measurement, i.e. by everything up to the first comma or space, using the measurement name as the target filename.

This does the job (except for the comma as separator) but is dreadfully slow (runs for 8h on an Intel i5 with SSD storage):

cat ../influx_export | while read FILE VAL TS ; do echo "$FILE $VAL $TS" >> "$FILE" ; done

I'm sure there is a scripted solution (no compiled code) that is at least 10x faster. However, the source file is too big to fit entirely into RAM.

Are there any more efficient approaches using awk, perl, sed, ruby, whatever?

aynber
Jens
    There's a few different things what you've written so far might mean. If you [edit] your question to include a [mcve] with concise, testable sample input and expected output then we can help you. – Ed Morton Jun 25 '22 at 21:58
  • 1
    by any chance is the input file already sorted by `measurement`? if not, do you have a rough idea of how many unique `measurement`s (aka filenames) we'll have to deal with? can multiple lines have the same `measurement` and if so are all lines with the same `measurement` grouped together? – markp-fuso Jun 25 '22 at 22:28
  • 1
    please update the question with more sample data (eg, 5-10 lines, 3-4 different `measurements`; couple lines with multiple tag/value pairs); also update the question with the expected results (ie, the names and contents of all the files that should be generated - corresponding to the sample input) – markp-fuso Jun 25 '22 at 22:29
  • @markp-fuso, I was hoping the single line was enough, but I'll append a few more. Thanks for the link. The file is not sorted (actually, it's sorted by timestamp but I don't want to rely on that) and there are about 300 unique measurements. – Jens Jun 26 '22 at 08:46
  • 1
    Oh wow, also @ed-morton - thanks for the long explanations. Using awk my Haswell i5 was able to finish the task in ~5 minutes, even without any more tweaking. – Jens Jun 26 '22 at 08:55

2 Answers


bash is notoriously slow at iterating over a file, because `read` reads only one character at a time to ensure it doesn't consume anything after the newline that might be intended for a following command.

Use awk instead:

awk -F'[, ]' '{
   print $0 >> $1
}' ../influx_export

It's possible that, if there are many unique values for $1, you may run into a "too many open files" error. In that case, a simple (if inefficient) fix is to explicitly close each file immediately after writing to it. Even though awk then has to reopen a file for every line, this should still be much faster than pure bash.

awk -F'[, ]' '{
   print $0 >> $1; close($1)
}' ../influx_export
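To see the effect on a small scale, here's the close-as-you-go variant run against a few lines taken from the question's sample (the `influx_sample` filename is just for illustration):

```shell
# Build a tiny sample in line protocol format from the question's data
cat > influx_sample <<'EOF'
deconz.0.Sensors.10.pressure value=998 1622621407241000000
energymeter_total,uuid=c4695262-624c-11ea-b2f7-374e5ccddc43 value=30436.6 1622594844107000000
deconz.0.Sensors.10.pressure value=999 1622621970836000000
EOF

# Split on the first comma or space; append each full line to a file
# named after its measurement, closing the file again immediately
awk -F'[, ]' '{ print $0 >> $1; close($1) }' influx_sample

# Each measurement now has its own file
wc -l deconz.0.Sensors.10.pressure energymeter_total
```

Note the tagged energymeter_total line lands in the same file as untagged ones, with its tags preserved, since the whole original line is written out.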
chepner
  • 1
    `>>` in that context is shell syntax, not awk syntax. You're also missing a `'`, don't need the `$0` and could get a "too many open files" error from most awks depending on how many unique output files are generated. `( |,)` = `[ ,]`. – Ed Morton Jun 25 '22 at 22:00
  • But aside from that, what did the Romans do for u... :) – chepner Jun 25 '22 at 22:34
  • I'm Scottish and they failed to conquer Scotland so all they did for **us** was build Hadrians wall to try to stop us attacking them after they conquered England :-). – Ed Morton Jun 25 '22 at 22:36
  • One thing that's not clear to me: `awk` is guaranteed to leave a file open unless explicitly closed, right? So the only difference between `>` and `>>` is that `>` will truncate when the file is first opened; either way, text can *accumulate* in an opened file as the same value for `$1` is seen multiple times, right? – chepner Jun 25 '22 at 22:42
  • Correct. `awk '{print >> "file"}'` is equivalent to shell `while IFS= read -r line; do echo "$line"; done >> file` when what you almost always actually want for a problem like this is `awk '{print > "file"}'` which is equivalent to shell `while IFS= read -r line; do echo "$line"; done > file` so, among other things, you don't have to remember to manually remove the output files between runs if you run the tool twice. – Ed Morton Jun 25 '22 at 22:45
  • if you're going to close the output files as you go then you should add something like `!seen[$1]++ { printf "" > $1 }` so you can intermix `>` and `>>` as appropriate but the more concise, efficient solution is to sort the files by $1 first and then you can just close them every time $1 changes and only have 1 output file open at a time and always just use `>`. – Ed Morton Jun 25 '22 at 22:50

Don't use shell loops to manipulate text, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

Chances are this, using a DSU (decorate/sort/undecorate) approach, is close to what you want if not exactly correct:

awk -F'[, ]' '{print $1, NR, $(NF-1), $NF}' file |
sort -k1,1 -k2,2n |
awk '
    $1 != out {
        close(out)
        out = $1
    }
    { print $3, $4 > out }
'

but it's obviously untested as you didn't provide sample input and expected output we could test with.

The awk commands each handle just one line at a time, so they use almost no memory, and sort is designed for huge files (using demand paging, etc.), so it doesn't need to fit the whole input in RAM. The above should therefore have no problem handling your input file efficiently.
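Run end to end on a few lines from the question's sample (the `influx_sample` filename is just for illustration), the pipeline behaves like this; note that only the value and timestamp survive into the output files, since the decorate step drops the tags:

```shell
# A few lines from the question's sample, deliberately interleaved
cat > influx_sample <<'EOF'
deconz.0.Sensors.11.battery value=85 1622621407241000000
energymeter_total,manual=true value=26984.9 1592641140000000000
deconz.0.Sensors.11.battery value=88 1622623616070000000
EOF

# Decorate: measurement, original line number, value, timestamp
awk -F'[, ]' '{print $1, NR, $(NF-1), $NF}' influx_sample |
# Sort: group by measurement, preserving input order within each group
sort -k1,1 -k2,2n |
# Undecorate: one output file per measurement, with only a single
# file open at a time, closed whenever the measurement changes
awk '
    $1 != out {
        close(out)
        out = $1
    }
    { print $3, $4 > out }
'

cat deconz.0.Sensors.11.battery
```

Afterwards deconz.0.Sensors.11.battery holds the two battery readings in their original order, and energymeter_total holds the one tagged reading minus its tag.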

Ed Morton