
I have a database with about 180,000 records. I'm trying to attach a PDF file to each of those records. Each PDF is about 250 KB in size. However, after about a minute my program is using about a GB of memory and I have to stop it. I tried removing the reference to each LINQ object once it's updated, but that doesn't seem to help. How can I clear the references?

Thanks for your help

Private Sub uploadPDFs(ByVal args() As String)
    Dim indexFiles = (From indexFile In dataContext.IndexFiles
                     Where indexFile.PDFContent = Nothing
                     Order By indexFile.PDFFolder).ToList
    Dim currentDirectory As IO.DirectoryInfo
    Dim currentFile As IO.FileInfo
    Dim tempIndexFile As IndexFile

    While indexFiles.Count > 0
        tempIndexFile = indexFiles(0)
        indexFiles = indexFiles.Skip(1).ToList
        ' currentDirectory = ...  (I set the directory that I need)
        ' currentFile = ...       (I get the file that I need)
        writePDF(currentDirectory, currentFile, tempIndexFile)
    End While
End Sub

Private Sub writePDF(ByVal directory As IO.DirectoryInfo, ByVal file As IO.FileInfo, ByVal indexFile As IndexFile)
    Dim bytes() As Byte
    bytes = getFileStream(file)
    indexFile.PDFContent = bytes
    dataContext.SubmitChanges()
    counter += 1
    If counter Mod 10 = 0 Then Console.WriteLine("     saved file " & file.Name & " at " & directory.Name)
End Sub


Private Function getFileStream(ByVal fileInfo As IO.FileInfo) As Byte()
    Dim fileStream = fileInfo.OpenRead()
    Dim bytesLength As Long = fileStream.Length
    Dim bytes(bytesLength) As Byte

    fileStream.Read(bytes, 0, bytesLength)
    fileStream.Close()

    Return bytes
End Function
Leon

3 Answers


I suggest you perform this in batches, using Take (before the call to ToList) to process a particular number of items at a time. Read (say) 10, set the PDFContent on all of them, call SubmitChanges, and then start again. (I'm not sure offhand whether you should start with a new DataContext at that point, but it might be cleanest to do so.)
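Here's a rough sketch of that approach, reusing the names from your code; `YourDataContext` and `resolveFile` are placeholders for your actual context type and your directory/file lookup, not real names:

Private Sub uploadPDFs()
    Const batchSize As Integer = 10
    While True
        ' A fresh context per batch, so the entities tracked for the
        ' previous batch can be garbage-collected.
        Using dc As New YourDataContext()
            Dim batch = (From indexFile In dc.IndexFiles
                         Where indexFile.PDFContent = Nothing
                         Order By indexFile.PDFFolder).Take(batchSize).ToList()
            If batch.Count = 0 Then Exit While

            For Each indexFile In batch
                indexFile.PDFContent = IO.File.ReadAllBytes(resolveFile(indexFile).FullName)
            Next

            ' Once PDFContent is set these rows stop matching the query,
            ' so the next iteration picks up the next unprocessed batch.
            dc.SubmitChanges()
        End Using
    End While
End Sub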

As an aside, your code to read the contents of a file is broken in at least a couple of ways - but it would be simpler just to use File.ReadAllBytes in the first place.
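For example, the whole of getFileStream can collapse to:

Private Function getFileStream(ByVal fileInfo As IO.FileInfo) As Byte()
    ' ReadAllBytes opens the file, reads it to the end, closes it,
    ' and sizes the array correctly.
    Return IO.File.ReadAllBytes(fileInfo.FullName)
End Function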

Also, your way of handling the list gradually shrinking is really inefficient - after fetching 180,000 records, you're then building a new list with 179,999 records, then another with 179,998 records etc.
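If you do keep a single up-front list, a plain For Each in place of the While loop avoids all that copying:

For Each tempIndexFile In indexFiles
    ' ... set currentDirectory and currentFile as before ...
    writePDF(currentDirectory, currentFile, tempIndexFile)
Next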

Jon Skeet
  • I was thinking of doing it this way but running the initial query that often would really slow down my program. I was hoping there was a more proper way of doing it. – Leon Nov 25 '10 at 16:18
  • @Leon: With only 180,000 records, a search should be blazingly fast. If it's not, you might want to look at your indexing (particularly on `PDFFolder`). Maybe go up from 10 records to 100... basically try to find an appropriate balance. – Jon Skeet Nov 25 '10 at 16:19

Does the DataContext have ObjectTrackingEnabled set to true (the default value)? If so, then it will try to keep a record of essentially all the data it touches, thus preventing the garbage collector from being able to collect any of it.

If that's the case, you should be able to fix the situation by periodically disposing of the DataContext and creating a new one, or by turning object tracking off.
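A minimal sketch of the recycle-the-context pattern; `YourDataContext`, `recordIds`, and the block size of 50 are placeholder assumptions, not names from the question:

Dim processed = 0
Dim dc As New YourDataContext()
For Each id In recordIds
    ' ... load the record with this id through dc and set its PDF content ...
    processed += 1
    If processed Mod 50 = 0 Then
        dc.SubmitChanges()
        dc.Dispose()                ' lets the tracked entities be collected
        dc = New YourDataContext()  ' fresh context with an empty identity map
    End If
Next
dc.SubmitChanges() ' flush the final partial block
dc.Dispose()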

Sean U
  • No wait, turning object tracking off won't cut it, because then you'll get an exception when you call SubmitChanges(). I think you'll have to follow Jon Skeet's idea of working through the data in pieces, plus disposing of the DataContext and creating a new one after each piece. – Sean U Nov 25 '10 at 16:29

OK. To use the smallest amount of memory, we have to update the DataContext in blocks. I've put sample code below; it might have syntax errors since I'm typing it in Notepad.

Dim DB As New YourDataContext()
Dim BlockSize As Integer = 25
' ToList() runs the query once; only this list is held in memory.
Dim AllItems = DB.Items.Where(Function(i) Not i.PDFfile.HasValue).ToList()

Dim count = 0
Dim tmpDB As New YourDataContext()

While count < AllItems.Count

    ' Fetch the record through the short-lived context, so only the
    ' current block of entities is being tracked.
    Dim _item = tmpDB.Items.Single(Function(i) i.recordID = AllItems(count).recordID)
    _item.PDF = GetPDF()

    count += 1

    If count Mod BlockSize = 0 OrElse count = AllItems.Count Then
        tmpDB.SubmitChanges()
        tmpDB.Dispose()
        tmpDB = New YourDataContext()
        GC.Collect()
    End If

End While

To further optimise the speed, you can pull just the recordIDs from AllItems into an array (e.g. via an anonymous type), and turn on delay loading for the PDF field so the initial query never fetches the binary content.
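For instance, a sketch of the ID-only query, reusing the sample's Items/PDFfile/recordID names:

' Pull only the keys up front; the PDF column never leaves the database here.
Dim ids = DB.Items.
              Where(Function(i) Not i.PDFfile.HasValue).
              Select(Function(i) i.recordID).
              ToArray()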

Dasith Wijes
  • You could also implement .Where(RowID > lastOffset and RowID < lastOffset+BlockSize).ToList to prevent 180,000 select calls. – Dasith Wijes Dec 01 '10 at 04:37