
I have to merge thousands of large files (~200 MB each). I would like to know the best way to merge these files. Lines will be conditionally copied to the merged file. Should I use File.AppendAllLines or Stream.CopyTo?

Using File.AppendAllLines

for (int i = 0; i < countryFiles.Length; i++)
{
    string srcFileName = countryFiles[i];
    string[] countryExtractLines = File.ReadAllLines(srcFileName);
    File.AppendAllLines(actualMergedFileName, countryExtractLines);
}

Using Stream.CopyTo

using (Stream destStream = File.OpenWrite(actualMergedFileName))
{
    foreach (string srcFileName in countryFiles)
    {
        using (Stream srcStream = File.OpenRead(srcFileName))
        {
            srcStream.CopyTo(destStream);
        }
    }
}
LUIS PEREIRA
  • Use a `StreamWriter` for a new file, read all the files you want to merge with a `StreamReader`, and write to your writer. – Camo Oct 28 '15 at 12:06
  • I suspect many will answer with "try it and compare the two". – Wai Ha Lee Oct 28 '15 at 12:07
  • I believe you'd want a `StreamReader` and iterate over the file(s) line by line, as this way it won't store everything in memory all at once. – sab669 Oct 28 '15 at 12:08
  • Do you ONLY want to append the files? If so, use `Stream.CopyTo()` but open the existing file that you want to append to using `File.Open("filename", FileMode.Append)`. If you use `File.OpenWrite()` things will go HORRIBLY WRONG. – Matthew Watson Oct 28 '15 at 12:09
  • No, I'll need to manipulate each of the lines conditionally. Some of the lines might not be copied. – LUIS PEREIRA Oct 28 '15 at 12:13
  • Then you definitely do not want to use `ReadAllLines`, as that would load 200 MB of data into memory, as mentioned by sab669. – Zdeněk Jelínek Oct 28 '15 at 12:28
  • @LUISPEREIRA please put the fact that you will need to manipulate the lines conditionally in your question, most people aren't going to read the comments to find that out – BenVlodgi Oct 28 '15 at 14:11

3 Answers


You can write the files one after the other. For example:

static void MergingFiles(string outputFile, params string[] inputTxtDocs)
{
    // File.Create truncates any existing output file. File.OpenWrite would not,
    // so stale bytes could be left at the end of a pre-existing, longer file.
    using (Stream outputStream = File.Create(outputFile))
    {
        foreach (string inputFile in inputTxtDocs)
        {
            using (Stream inputStream = File.OpenRead(inputFile))
            {
                inputStream.CopyTo(outputStream);
            }
        }
    }
}
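For instance, with the question's own names (countryFiles holding the source paths and actualMergedFileName the destination), the call would be:

MergingFiles(actualMergedFileName, countryFiles);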

In my view the above code is high-performance, since Stream.CopyTo() uses a very simple algorithm, which makes the method highly effective. A decompiler such as Reflector renders the heart of it as follows:

private void InternalCopyTo(Stream destination, int bufferSize)
{
  int num;
  byte[] buffer = new byte[bufferSize];
  while ((num = this.Read(buffer, 0, buffer.Length)) != 0)
  {
     destination.Write(buffer, 0, num);
  }
}
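As a side note, Stream.CopyTo() also has an overload that takes an explicit buffer size, which may be worth experimenting with for thousands of large files:

// Copy using an explicit 1 MB buffer instead of the default size.
srcStream.CopyTo(destStream, 1024 * 1024);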
StepUp

sab669's answer is correct: you want to use a StreamReader and then loop over each line of the file... However, I would suggest writing the output after each file, as otherwise you are going to run out of memory pretty quickly with many 200 MB files.

For example:

foreach (string fileName in files)
{
    List<string> lines = new List<string>();
    string line;
    using (StreamReader reader = new StreamReader(fileName))
    {
        while ((line = reader.ReadLine()) != null)
        {
            // TODO : Put your conditions in here
            lines.Add(line);
        }
    }
    // The using block closes the reader; no explicit Close() is needed.
    // TODO : Append your lines here using StreamWriter
}
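To make this concrete, here is a minimal sketch that fills in both TODOs. KeepLine is a hypothetical predicate standing in for your condition, outputFile is an assumed destination path, and the usual System.IO / System.Collections.Generic usings are assumed:

static bool KeepLine(string line)
{
    // Hypothetical predicate; replace with your real condition.
    return line.Contains("Target");
}

static void Merge(IEnumerable<string> files, string outputFile)
{
    foreach (string fileName in files)
    {
        List<string> lines = new List<string>();
        string line;
        using (StreamReader reader = new StreamReader(fileName))
        {
            while ((line = reader.ReadLine()) != null)
            {
                if (KeepLine(line))
                    lines.Add(line);
            }
        }
        // Append this file's surviving lines before reading the next file,
        // so at most one file's worth of lines is held in memory at a time.
        using (StreamWriter writer = new StreamWriter(outputFile, true))
        {
            foreach (string kept in lines)
                writer.WriteLine(kept);
        }
    }
}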

Suppose you have a condition which must be true (i.e. a predicate) for each line in one file that you want to append to another file.

You can efficiently process that as follows:

var filteredLines = 
    File.ReadLines("MySourceFileName")
    .Where(line => line.Contains("Target")); // Put your own condition here.

File.AppendAllLines("MyDestinationFileName", filteredLines);

This approach scales to multiple files and avoids loading the entire file into memory.
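For example, to apply it across the question's countryFiles array (with actualMergedFileName as the destination, as in the question):

foreach (string srcFileName in countryFiles)
{
    var filteredLines =
        File.ReadLines(srcFileName)
        .Where(line => line.Contains("Target")); // Put your own condition here.

    File.AppendAllLines(actualMergedFileName, filteredLines);
}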

If instead of appending all the lines to a file, you wanted to replace the contents, you'd do:

File.WriteAllLines("MyDestinationFileName", filteredLines);

instead of

File.AppendAllLines("MyDestinationFileName", filteredLines);

Also note that there are overloads of these methods that allow you to specify the encoding, if you are not using UTF8.
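For example (picking UTF-16 purely as an illustration):

File.AppendAllLines("MyDestinationFileName", filteredLines, Encoding.Unicode);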

Finally, don't be thrown by the inconsistent method naming. File.ReadLines() does not read all lines into memory, but File.ReadAllLines() does. However, File.WriteAllLines() does NOT buffer all lines into memory, or expect them to all be buffered in memory; it takes an IEnumerable<string> as input.

Matthew Watson
  • Thanks. Just read from MSDN: The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings to be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient. – LUIS PEREIRA Oct 28 '15 at 13:01
  • @LUISPEREIRA Yep, so I recommend using this simple approach. Also note my last paragraph about Microsoft's inconsistent naming! – Matthew Watson Oct 28 '15 at 13:03