
I need to add a header (a single line) to a huge number (>10k) of text files. Let us assume that the variable $HEADER contains the appropriate header. The command

find -type f -name 'tdgen_2012_??_??_????.csv' | xargs sed -i "1s/^/$HEADER\n/"

works well. The problem I face is that some of the data files (tdgen_2012_????????.csv) are empty, and sed(1) cannot address a line that does not exist in the file. I decided to manage the empty files in a separate way:

echo $HEADER | tee $(find -type f -name 'tdgen_2012_??_??_????.csv' -empty) > /dev/null

Due to the number of empty files, the command above does not work: tee(1) cannot write to an unlimited number of files, and the maximum number of command-line arguments can also be exceeded.

I do not want to use a for loop due to its low performance (tee(1) can write to many files at once).
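The argument-length limit can be sidestepped by letting xargs batch the file list, since xargs splits its input into chunks that fit the system limit and runs one tee per batch rather than one per file. A minimal sketch, assuming GNU find/xargs; the header value and data file here are hypothetical stand-ins:

```shell
# Sketch: xargs batches the empty files so tee never sees too many arguments.
HEADER='ID,VALUE'                      # assumed header content
export HEADER                          # make it visible to the child shell
: > tdgen_2012_01_02_0001.csv          # hypothetical empty data file
find . -type f -name 'tdgen_2012_??_??_????.csv' -empty -print0 |
  xargs -0 -r sh -c 'printf "%s\n" "$HEADER" | tee "$@" > /dev/null' sh
```

The fork count stays proportional to the number of batches, not the number of files.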

My questions:

  1. Is there a single solution for both kinds of data files (empty and non-empty) at once?
  2. If not: how can I manage the empty files effectively?
Jens
Jiří Polcar

2 Answers

echo "$HEADER" > header
find -type f -name 'tdgen_2012_??_??_????.csv' \
    -exec sh -c 'cat header {} > tmp && mv tmp {}' \; -print

Explanation:

1. -exec sh -c "..." - to be able to run more than one command per file

2. cat header {} > tmp && mv tmp {} - concatenate the header file and the found file into tmp, then rename tmp over the found file. The temporary file is needed because you can't do cat header {} > {} (the redirection would truncate the file before cat reads it)

3. -print - show the filename of every changed file
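A variant of the same approach that avoids splicing {} into the shell script (which misbehaves with unusual filenames) passes the filename as a positional parameter instead. A sketch with hypothetical sample files and an assumed header value:

```shell
printf '%s\n' 'ID,VALUE' > header           # assumed header content
: > tdgen_2012_01_02_0001.csv               # hypothetical empty data file
printf 'a,1\n' > tdgen_2012_03_04_0002.csv  # hypothetical non-empty data file
# Pass the filename as "$1" instead of embedding {} in the script text.
find . -type f -name 'tdgen_2012_??_??_????.csv' \
    -exec sh -c 'cat header "$1" > tmp && mv tmp "$1"' sh {} \; -print
```

Note that this handles both empty and non-empty files at once, since cat of an empty file simply contributes nothing.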

rzymek
    +1 You can also avoid the need for a `header` tempfile by using `{ echo '$HEADER'; cat {}; } > tmp` instead. – chepner May 14 '13 at 13:54
  • 1
    Ugh. This forks three times the number of files (sh, cat, mv)! – Jens May 14 '13 at 21:17
  • @Jens And why would that matter? It's not like these commands are spawned in parallel. Surely it's faster than searching for the files twice. – rzymek May 15 '13 at 06:53
  • 1
    @rzymek Because the question indicates performance is an issue: "I do not want to use the for-cycle due to low performance (the tee(1) can write many files at once)." `find -exec` is effectively a loop running the commands for each iteration. – Jens May 15 '13 at 07:39
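The fork overhead raised in the comments can be reduced with find's batched -exec ... {} + form, which hands many filenames to a single shell at once. A sketch under the same assumptions as the answer (the header value and sample file are hypothetical):

```shell
printf '%s\n' 'ID,VALUE' > header      # assumed header content
: > tdgen_2012_05_06_0003.csv          # hypothetical empty data file
# One sh per batch instead of one per file; cat and mv still run per file.
find . -type f -name 'tdgen_2012_??_??_????.csv' \
    -exec sh -c 'for f; do cat header "$f" > tmp && mv tmp "$f"; done' sh {} +
```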

What about divide and conquer:

echo "$HEADER" > header
find . -type f -size 0   -name 'tdgen_2012_??_??_????.csv' -exec cp header {} \;
find . -type f -size +0c -name 'tdgen_2012_??_??_????.csv' | xargs sed -i ...
rm header

This only execs cp for the empty files and keeps the performance of the xargs/sed for nonempty files. If you want it as a single command, just wrap it in a script.

And thinking outside the box: what is the point of dealing with empty files, especially when you're writing a header to a file that has no data? I would either try not to create the empty files in the first place, or remove them. It makes life so much simpler. Remember: only a deleted file is a good file :-)
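The delete-them option is a one-liner, assuming GNU find's -delete action is available (a hypothetical empty file stands in for the real data):

```shell
: > tdgen_2012_07_08_0004.csv   # hypothetical empty data file
# Remove every empty CSV matching the pattern instead of giving it a header.
find . -type f -name 'tdgen_2012_??_??_????.csv' -empty -delete
```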

Jens