9

I have a situation with a parent-child relationship between two sets of data: a parent document collection and a child document collection. The requirement is that each parent and its corresponding children need to be exported into a single PDF document. A simple implementation of the above could look like this (Java-ish pseudocode below):

for(Document parentDocument : Documents){
    ExportToPdf(parentDocument);
    for(Document childDocument : parentDocument.children()){
        AppendToParentPdf(childDocument);
    }
}

Something like the above will probably solve the problem, but all of a sudden the requirements change: now each parent and each of its children need to go into separate PDFs. So the above snippet is modified by changing AppendToParentPdf() to ExportToPdf(), as follows:

for(Document parentDocument : Documents){
    ExportToPdf(parentDocument);
    for(Document childDocument : parentDocument.children()){
        ExportToPdf(childDocument);
    }
}

Going along this way, it will not take long before this seemingly trivial snippet suffers from some serious code smells.

My questions to SO are:

  1. Are there better representations of parent-child relationships such as the above, where instead of brute-forcing my way through all the documents and their children in an O(n^2) fashion, I can use a different data structure or traversal technique to process the entire structure more efficiently?

  2. In the scenario I described above, where the business rules about how the PDFs should be exported are rather fluid, is there a smarter way to code the export function? The export format is also transient: PDFs may give way to *.doc, CSV, XML, et al.

It would be great to get some perspective on this.

Thanks

sc_ray
  • As for your first question, it doesn't look like you are brute-forcing, because you are only retrieving parentDocument.children() for each parent, and this contains the list of all the child documents for that particular parent. So your time complexity is not O(n^2) but rather O(n+m), where n and m are the counts of parent and child documents respectively. – Santhosh Nov 29 '11 at 07:07
  • @sc_ray You should state whether it is possible for a `childDocument` to have more than one `parentDocument`. – Alderath Nov 29 '11 at 08:48
  • @Santhosh - I am not sure how this is an O(n+m) problem, since the loop needs to iterate through each of the m children n times, giving it a time complexity of O(n*m) or O(n^2). – sc_ray Nov 29 '11 at 13:07
  • @Alderath - Yes, it is quite possible for a child document to have more than one parent document. – sc_ray Nov 29 '11 at 13:09
  • The reason why people started saying the algorithm is O(n) is that, upon seeing the code snippet, they assumed that each child document had a unique parent. At least that was my first assumption when reading the code. – Alderath Nov 29 '11 at 13:49
  • @Alderath I still have a feeling that regardless of whether the child document has a unique parent or not, the worst-case runtime of the snippet I provided would still be O(n^2), unless my understanding of nested for loops is totally off. I am having trouble rationalizing how this could be O(n+m), since the outer loop pauses while the CPU completely processes the inner loop, which I would expect to make the runtime quadratic. I might be missing something here though. – sc_ray Nov 29 '11 at 20:39
  • 2

    If the amount of documents is n, and each child has a unique parent, then the complexity is O(n). Why? While looping, the documents will be visited in the order: a, aChild1, aChild2, aChild3, aChild4, b, bChild1, c, cChild1, cChild2. For each of these documents, exportToPdf() will be called exactly once, resulting in 10 calls total. If we add ten more documents (e.g. d with 9 children), exportToPdf() will be called 10 more times, adding up to 20 calls. If the algorithm had been O(n^2), we should have seen the number of calls increase by a factor of 4 when doubling the number of documents. – Alderath Nov 29 '11 at 22:23
  • The fact that you have a for loop within another for loop does not necessarily mean that the algorithm is O(n^2). While it might be rather common that O(n^2) algorithms are a for loop within another for loop, that is far from always the case. – Alderath Nov 29 '11 at 22:30
  • @Alderath - Thanks for the clarification. When I look at nested fors like that, I look at the worst case scenario where the parent-child relationships can be drawn as a full graph and as a result, n^2 runtime prevails but I do see your point why the runtime will be O(n) in some cases. – sc_ray Dec 01 '11 at 20:14

6 Answers

4

Are there better representations of parent-child relationships such as the above where instead of brute-forcing my way through all the documents and their children in an O(n^2) fashion.

This is not O(N^2). It is O(N), where N is the total number of parent and child documents. Assuming that no child has more than one parent document, you can't significantly improve the performance. Furthermore, the cost of the traversal is probably trivial compared with the cost of the calls that generate the PDFs.

The only case where you might want to consider optimizing is if child documents can be children of multiple parents. In that case, you may want to keep track of the documents that you've already generated PDFs for ... and skip them if you revisit them in the traversal. The test for "have I seen this document before" can be implemented using a HashSet.

Stephen C
  • Thanks for the answer. I do not understand how two nested for loops make the time complexity O(n+m) or O(n) instead of O(n^2). The execution of the outer loop will pause until the inner loop has finished executing, unless Java does some rather radical loop unrolling, turning an n^2 into n. – sc_ray Nov 29 '11 at 13:04
3

You could encapsulate what you want to do with a document in a handler. This will also allow you to define new handlers in the future that you can pass to existing code.

interface DocumentHandler {
    void process(Document d);
}

class ExportToPdf implements DocumentHandler { ... }
class AppendToParentPdf implements DocumentHandler { ... }

// Now you're just passing the interface whose implementation does something with the document
void handleDocument(DocumentHandler parentHandler, DocumentHandler childHandler) {
    for(Document parent : documents) {
        parentHandler.process(parent);

        for(Document child : parent.children()) {
            childHandler.process(child);
        }
    }
}

DocumentHandler appendToParent = new AppendToParentPdf();
DocumentHandler exportToPdf = new ExportToPdf();

// pass the child/parent handlers as needed
handleDocument(exportToPdf, appendToParent);
handleDocument(exportToPdf, exportToPdf);

As for efficiency, I'd say don't try to optimise unless you run into performance issues. In any case, the problem won't be with the nested loop but with the logic itself that processes the documents.

armandino
2

For your second question, you could use the provider pattern or an extension of it.

Provider pattern: This pattern has its roots in the Strategy pattern, and it lets you put your data and behavior behind an abstraction so that you can swap out the implementation at any time.
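As a rough sketch of how that might look here (the registry, format keys, and class names below are illustrative assumptions, not part of the original post):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the document type from the question.
class Document {
    final String title;
    Document(String title) { this.title = title; }
}

// The behavior being swapped out: how a document gets exported.
interface ExportProvider {
    String export(Document doc);
}

// A registry mapping a format key to its provider, so calling code
// never hard-codes a concrete exporter and new formats can be
// registered without touching the traversal logic.
class ExportProviders {
    private static final Map<String, ExportProvider> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("pdf", doc -> doc.title + ".pdf");
        REGISTRY.put("csv", doc -> doc.title + ".csv");
    }

    static ExportProvider forFormat(String format) {
        ExportProvider provider = REGISTRY.get(format);
        if (provider == null) {
            throw new IllegalArgumentException("No provider for format: " + format);
        }
        return provider;
    }
}
```

The traversal code would then ask the registry for whatever format the business rules currently demand, e.g. `ExportProviders.forFormat("pdf").export(parentDocument)`, and switching from PDF to CSV becomes a one-line change at the call site.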

Santhosh
1

I tried to fit this into a comment, but there's too much to say...

I don't see how the change you're talking about is a code smell. If the requirements change for this simple function, then they change. If you only need to make the change in one place, then it sounds like you've done a good job. If your client is going to need both ways of doing it (or more), then you might consider some sort of strategy pattern so you don't have to rewrite the surrounding code to do either function.

If you're making dozens of these changes per week, then it could get messy and you probably ought to make a plan for how to more effectively deal with a very busy axis of change. Otherwise, discipline and refactoring can help you keep it clean.

As to whether or not n² is a problem, it depends. How big is n? If you have to do this frequently (i.e. dozens of times per hour) and n is in the thousands, then you might have a problem. Otherwise I wouldn't sweat it, as long as you're meeting or exceeding demand and your CPU/disk utilization is out of the danger zone.

BillRobertson42
  • Thanks for the answer. The problem is that the potentially 1 million+ documents (parents + children) can be exported by the 300+ users at any given point, and although no one is expecting any kind of sub-second exporter magic to happen, the system has the potential to be taxed by this kind of projected load. – sc_ray Nov 29 '11 at 13:18
  • Even if there are a million documents, how many is a user going to ask for in a given request? How many users will be on the system at one time, and of those, how many will be generating documents at a time? Unless you have some unusual requirements, I would expect those numbers to work out to be something reasonable. Still though, if you're worried about massive work to be done, then you might want to consider job queues to throttle the amount of concurrent work that will occur. I wouldn't worry about that though until some load testing proved to me that I needed them. – BillRobertson42 Nov 29 '11 at 23:29
1

The second question's problem can be solved by simply creating an interface Exporter with the method export(Document doc), and then implementing it for each of the various formats, e.g. class DocExporterImpl implements Exporter.

The first one depends on your requirements, and no design pattern as such solves that problem. Can't help you there.
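A minimal sketch of that Exporter idea, assuming a void export method and illustrative class names (nothing here is from the original post):

```java
// Stand-in for the document type from the question.
class Document {
    private final String name;
    Document(String name) { this.name = name; }
    String name() { return name; }
}

// One abstraction, one implementation per output format.
interface Exporter {
    void export(Document doc);
}

// Hypothetical PDF implementation; a real one would call a PDF library.
class PdfExporter implements Exporter {
    @Override
    public void export(Document doc) {
        System.out.println("Exporting " + doc.name() + " as PDF");
    }
}

// Adding a new format later means adding a class, not editing callers.
class CsvExporter implements Exporter {
    @Override
    public void export(Document doc) {
        System.out.println("Exporting " + doc.name() + " as CSV");
    }
}
```

The traversal loop would receive an `Exporter` and stay untouched when PDFs give way to CSV or XML; only the implementation passed in changes.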

Varun Achar
1

Using a Set to keep track of which elements have already been exported might not be the most beautiful solution, but it will prevent the documents from being exported twice.

Set<Document> alreadyExported = new HashSet<Document>();

for(Document parentDocument : Documents){
    ExportToPdf(parentDocument);
    for(Document childDocument : parentDocument.children()){
        if(!alreadyExported.contains(childDocument)){
            ExportToPdf(childDocument);
            alreadyExported.add(childDocument);
        }
    }
}
Alderath