
I'm profiling an application of mine - it's a macro system built on top of Java. I'm using hprof to profile some basic examples, and here are the top 20 functions where time is spent:

rank   self  accum   count trace  method
   1 14.73% 14.73%     453 303755 java.lang.Object.<init>
   2  9.75% 24.48%     300 303754 java.lang.Object.<init>
   3  7.18% 31.66%     221 303641 java.lang.Object.<init>
   4  6.83% 38.49%     210 303621 java.util.ArrayList.<init>
   5  6.76% 45.25%     208 303620 java.util.ArrayList.<init>
   6  5.95% 51.20%     183 303833 java.lang.Character.toChars
   7  4.55% 55.75%     140 303809 java.lang.Object.<init>
   8  4.42% 60.18%     136 303791 java.lang.Object.<init>
   9  3.77% 63.95%     116 303756 java.lang.Object.<init>
  10  3.64% 67.59%     112 300509 java.lang.Object.<init>
  11  2.67% 70.25%      82 303789 java.lang.Object.<init>
  12  2.54% 72.79%      78 303790 java.lang.Object.<init>
  13  1.69% 74.48%      52 303688 java.lang.Object.<init>
  14  1.66% 76.14%      51 303644 java.lang.Object.<init>
  15  1.46% 77.60%      45 305935 java.lang.Object.<init>
  16  1.40% 79.00%      43 303758 java.lang.Object.<init>
  17  1.20% 80.20%      37 305324 java.lang.Object.<init>
  18  0.98% 81.18%      30 302559 java.lang.Object.<init>
  19  0.62% 81.79%      19 300006 java.util.Arrays.copyOf
  20  0.52% 82.31%      16 305214 java.lang.Object.<init>

As you can see, most of the time is spent in Object.<init>. This is somewhat obscure to me.

My intuition is that the time is taken by memory allocation. Working in C has left me with a strong sense that dynamic memory allocation (i.e. malloc()) is inefficient. However, in Java the gospel seems to be that the JVM deals efficiently with short-lived objects; and that there is consequently nothing to be gained from patterns such as object pools.

I should add that the most performance-hungry part of the application is the parser, which does indeed create a whole lot of short-lived objects as part of its operation.
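To give an idea of the allocation pattern, here is roughly what the hot path looks like. This is an illustrative sketch with made-up names, not the actual classes:

import java.util.ArrayList;
import java.util.List;

// Illustrative only: each attempt to match a parsing expression allocates a
// fresh result holder plus lists for child results and errors, and most of
// these objects die almost immediately.
final class IllustrativeParser {

    static final class MatchResult {                      // hypothetical stand-in for the real ParseData
        final int start;
        int end;
        final List<MatchResult> children = new ArrayList<MatchResult>();
        final List<String> errors = new ArrayList<String>();   // stand-in for ParseErrors

        MatchResult(int start) { this.start = start; }
    }

    // Called once per parsing expression, often many thousands of times per input.
    MatchResult tryMatch(int position) {
        MatchResult result = new MatchResult(position);   // a fresh allocation on every call
        // ... recurse into sub-expressions, merge their results into this one ...
        result.end = position;
        return result;
    }
}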

What do you think the time spent in Object.<init> is due to? Is it indeed linked to memory allocation? Could I benefit from using an object pool or some other trick to reduce memory allocation?

EDIT:

In reaction to Mike Dunlavey's answer, here is JPerfAnal's interpretation of the hprof output, giving inclusive times.

Method Times by Caller (times inclusive): 3076 ticks
  1: java.lang.Object.<init>: 71,26% (2192 inclusive / 2191 exclusive)
    2: com.sun.demo.jvmti.hprof.Tracker.ObjectInit: 0,03% (1 inclusive / 0 exclusive)
      3: java.lang.Thread.currentThread: 0,03% (1 inclusive / 1 exclusive)
  1: parser.ParseData.<init>: 34,33% (1056 inclusive / 0 exclusive)
    2: parser.ParseErrors.<init>: 13,98% (430 inclusive / 1 exclusive)
      3: java.lang.Object.<init>: 7,18% (221 inclusive / 221 exclusive)
      3: java.util.ArrayList.<init>: 6,76% (208 inclusive / 208 exclusive)
    2: java.lang.Object.<init>: 13,52% (416 inclusive / 416 exclusive)
    2: java.util.ArrayList.<init>: 6,83% (210 inclusive / 0 exclusive)
      3: java.util.ArrayList.<init>: 6,83% (210 inclusive / 210 exclusive)
  1: parser.Matcher.parse: 34,33% (1056 inclusive / 0 exclusive)
    2: parser.ParseData.<init>: 34,33% (1056 inclusive / 0 exclusive)
      3: parser.ParseErrors.<init>: 13,98% (430 inclusive / 1 exclusive)
        4: java.lang.Object.<init>: 7,18% (221 inclusive / 221 exclusive)
        4: java.util.ArrayList.<init>: 6,76% (208 inclusive / 208 exclusive)
      3: java.lang.Object.<init>: 13,52% (416 inclusive / 416 exclusive)
      3: java.util.ArrayList.<init>: 6,83% (210 inclusive / 0 exclusive)
        4: java.util.ArrayList.<init>: 6,83% (210 inclusive / 210 exclusive)
  1: java.util.ArrayList.<init>: 28,38% (873 inclusive / 419 exclusive)
    2: java.util.AbstractList.<init>: 14,76% (454 inclusive / 0 exclusive)
      3: java.util.AbstractCollection.<init>: 14,76% (454 inclusive / 0 exclusive)
        4: java.lang.Object.<init>: 14,76% (454 inclusive / 454 exclusive)

(JPerfAnal also generates an inverted tree where the children are the callers of the parents. I don't reproduce it here for brevity, but suffice it to say that about 40% of Object.<init> calls are made from the initialization of ArrayList, ParseData and ParseErrors.)

Now, that doesn't really change how I think about the issue, or my questions. I could change the algorithm so that it instantiates fewer objects; but for the time being, I'm looking for an orthogonal solution. So: could object pools help me?
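To make the question concrete, here is the kind of free-list pool I have in mind. It's a minimal sketch: not thread-safe, unbounded, and it assumes the pooled objects can be given a cheap reset method so they can be reused.

import java.util.ArrayDeque;
import java.util.Deque;

// Minimal free-list pool sketch: acquire() reuses a recycled instance when one
// is available, and only allocates when the pool is empty.
abstract class ObjectPool<T> {

    private final Deque<T> free = new ArrayDeque<T>();

    protected abstract T create();              // e.g. return new ParseData(...);

    T acquire() {
        T item = free.poll();
        return (item != null) ? item : create();
    }

    void release(T item) {
        free.push(item);                        // recycle instead of letting the GC reclaim it
    }
}

The open question is whether recycling like this actually beats the JVM's own handling of short-lived objects.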

Norswap

2 Answers


1) "self" time is almost always useless because the real problem is in your code that calls those low-level methods - get inclusive time (self + callees).

2) The "accum" column is even more useless because all it does is add up the "self" times.

The other columns are also telling you nothing helpful.

hprof takes stack traces. What you need is for those to be as deep as possible.
You don't need a large number of them (contrary to popular misinformation about statistics) - ten are likely more than enough.
However, you do need to look at them and understand them.
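For example, something like this (YourMainClass is a placeholder for your program) asks hprof for CPU samples with deep stacks:

java -agentlib:hprof=cpu=samples,depth=100 YourMainClass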

Here's some general info about profiling.

EDIT in response to your edit showing JPerfAnal output:

a) It looks like the stack depth is only 4. You need stacks sampled as deep as possible. I don't care for the way they present the output. I mainly just look at the stack samples themselves. Time is not the really important thing. The important thing is what is going on and why.

b) Stack samples actually contain line-level information. So they will point to the precise lines in your routines like parser.ParseData.<init>, parser.Matcher.parse, and parser.ParseErrors.<init> where the problematic calls are taking place. That will tell you exactly which allocations are costing all the time, and then you can decide if there's a better way to do them.

I get the impression that the parser runs off data structure tables. I know you don't want to redesign it, but if you do, you will get far better performance. Basically, if you can express it as a recursive-descent parser, and if the language does not change often, you write the parser directly in code - no tables - and it will be 10-100 times faster.
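To show what I mean by writing the parser directly in code, here is a toy sketch for a trivial sum-of-integers grammar (nothing like your actual PEG system): the grammar lives in the methods themselves rather than in tables.

// Toy hand-written recursive-descent parser: expr := term ('+' term)*, term := digit+
final class ExprParser {
    private final String input;
    private int pos;

    ExprParser(String input) { this.input = input; }

    int parseExpr() {
        int value = parseTerm();
        while (peek() == '+') {
            pos++;                              // consume '+'
            value += parseTerm();
        }
        return value;
    }

    private int parseTerm() {
        int value = 0;
        while (Character.isDigit(peek())) {
            value = value * 10 + (input.charAt(pos++) - '0');
        }
        return value;
    }

    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }
}

For example, new ExprParser("12+34").parseExpr() returns 46, with no tables and very little allocation.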

Mike Dunlavey
  • Thanks for your answer. I updated my question to reflect the feedback. As I explain there, I'm not looking to change the algorithm in the short run (I'll get to it eventually), but would like to know if there is a way to speed up the allocations. – Norswap Aug 15 '13 at 13:59
  • I'm not sure exactly what you mean by "data structure tables", but here is how it works. It is a packrat PEG parser - meaning a recursive-descent parser with memoization (the table you refer to?) using the PEG formalism instead of the more conventional CFG formalism. Memoization is definitely needed to make run times acceptable in some cases. The grammar is represented by a graph-like data structure. This is necessary because, being a syntactic macro system, the grammar can change while parsing. – Norswap Aug 15 '13 at 15:28
  • The stack trace is indeed limited to 4. But I'm well aware of exactly where and how new objects are instantiated. I'm not exactly sure how to get rid of them - the data is needed and has to be stored somewhere. Maybe I could manage arrays myself (but this is basically object pooling). Or I could simply save the data on the stack, but I'm afraid of stack overflows. – Norswap Aug 15 '13 at 15:32
  • @Norswap: pooling only makes sense if you're deleting and reallocating objects frequently. If the grammar can change on-the-fly then it may be more difficult to precompile it, which is what I was suggesting. I'd also not worry about stack overflow until it actually happens, but you know your application best. – Mike Dunlavey Aug 15 '13 at 15:51
  • I am deleting and reallocating objects frequently. Basically, the parser records some info for each "parsing expression" it processes, and stores this into an object. The object has to be kept while the children of the expression are being processed. When a child finishes parsing, its info is merged with the parent's info. When the parent expression has finished parsing, it is either discarded or memoized (depending on the memoization policy). – Norswap Aug 15 '13 at 16:03
  • As for stack overflow, the grammar can run a few hundred expressions deep. And of course, it is recursive, meaning it is always possible to build problematic input that causes stack overflows. For each expression, you have to make 3 or 4 recursive calls. So, as it stands, it is not rare to go more than a thousand calls deep. I do not know what the hard limits are (it's probably in terms of memory and not the number of calls anyway), but that does already sound like a whole lot. – Norswap Aug 15 '13 at 16:12
  • @Norswap: You clearly know what you're doing. My style is minimalist: a) precompile, b) minimize data structure as much as possible, c) use free lists as object pools for re-cycling. – Mike Dunlavey Aug 15 '13 at 18:40
  • @MikeDunlavey do you know how to tell hprof to show the inclusive time (self + callees)? Also your link has broken. – David Winiecki Jun 05 '14 at 18:44
  • @DavidWiniecki: The link wasn't broken. Somebody just deleted the question, so you need like 10k rep to see it. In any case, I replaced the link. Can *hprof* do the job? If you say cpu=samples, and depth=100, then it will do reasonably on the sampling end (if your program is CPU-bound). However, on the presentation end it is all balled up. It has no idea what is useful info to present, so it just presents a bunch of stuff that other profilers seem to do, more or less. I can't tell if it lets you see the raw stack samples, which yield the best information of what's actually taking time. – Mike Dunlavey Jun 05 '14 at 18:59
  • @MikeDunlavey Thanks! Is there an hprof output presentation program that can show me inclusive times? jperfanal shows only 14% CPU usage for a method that I think contains nearly 100% of runtime. – David Winiecki Jun 05 '14 at 19:03
  • @DavidWiniecki: I got my information from [*here*](http://docs.oracle.com/javase/7/docs/technotes/samples/hprof.html). Unfortunately it only seems to talk about self-time which, as you know, is basically useless. What you need to know is what fraction of the time method F is on the stack, which it could easily get from stack samples, but does not seem to. – Mike Dunlavey Jun 05 '14 at 19:09
  • @DavidWiniecki: I just looked at [*JPerfAnal*](http://jperfanal.sourceforge.net/JPerfAnal_manual.html). It seems to say you can get inclusive percent by method, and by line. You do need to specify a sufficient depth. I just look at the raw samples. That tells me what's actually going on, as in [*this video*](https://www.youtube.com/watch?v=xPg3sRpdW1U). – Mike Dunlavey Jun 05 '14 at 19:23

Short answer:

Time spent in Object.<init> is likely due to the CPU profile being generated together with a heap profile; the latter instruments all memory allocations, which makes allocations slow and causes them to surface to the top of the CPU profile.

Long answer:

I suspect the profile shown in the question was generated by asking for both a CPU profile AND a heap profile at the same time. In other words, by running something like:

java -agentlib:hprof=heap=sites,cpu=samples,depth=20 MyProg

Unfortunately, at least with the JVM's built-in hprof profiler, CPU sampling is useless in the presence of heap profiling. Heap profiling instruments all memory allocations, making the program run order(s) of magnitude slower. In a program instrumented to track memory allocations, the allocations are indeed the bottleneck, and that's why Object.<init> surfaces to the top of the CPU profile. In the real program, though, allocations/garbage collection may or may not be the bottleneck.

You typically want to generate just one profile at a time. First generate a CPU profile by running something like java -agentlib:hprof=cpu=samples,depth=20, and analyze its results. If the CPU profile points to memory allocation/garbage collection as the bottleneck, then, and only then, generate and analyze a heap profile to see where the majority of your allocations are coming from.
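Concretely, that means two separate runs (reusing the same MyProg placeholder as above).

First run, CPU samples only:

java -agentlib:hprof=cpu=samples,depth=20 MyProg

Second run, only if the first one points at allocation/GC, heap sites only:

java -agentlib:hprof=heap=sites,depth=20 MyProg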

The other answer makes a lot of salient points about analyzing the CPU profile. In particular, it is a lot easier to see what's going on using JPerfAnal than by looking at the raw hprof CPU profile. Also, for the inclusive percentages to be reasonably accurate, you may need to increase the stack sampling depth to match your application.