1

I have written some C++ code that is structured as follows:

double kernel(params)
{
  //code
}

void optimize(params)
{
  //some code
  double x = kernel();
  //some more code
}

int main()
{
  //some code
  optimize();
  //some more code
}

I tried profiling it with callgrind using the following commands:

g++ -O3 -g sgd.cpp
valgrind --tool=callgrind ./a.out commandline_args
callgrind_annotate callgrind.out.XXXX

I get the following output:

--------------------------------------------------------------------------------
            Ir 
--------------------------------------------------------------------------------
12,916,968,785  PROGRAM TOTALS

--------------------------------------------------------------------------------
           Ir  file:function
--------------------------------------------------------------------------------
5,862,783,191  /build/buildd/eglibc-2.15/string/../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:__memmove_ssse3 [/lib/i386-linux-gnu/libc-2.15.so]
2,847,653,393  /build/buildd/eglibc-2.15/malloc/malloc.c:_int_malloc [/lib/i386-linux-gnu/libc-2.15.so]
1,327,109,692  /build/buildd/eglibc-2.15/malloc/malloc.c:_int_free [/lib/i386-linux-gnu/libc-2.15.so]
  847,560,182  sgd.cpp:main [a.out]
  503,022,767  /build/buildd/eglibc-2.15/malloc/malloc.c:malloc [/lib/i386-linux-gnu/libc-2.15.so]
  235,458,068  /build/buildd/eglibc-2.15/malloc/malloc.c:free [/lib/i386-linux-gnu/libc-2.15.so]
  213,580,120  /build/buildd/eglibc-2.15/math/../sysdeps/i386/fpu/e_exp.S:__ieee754_exp [/lib/i386-linux-gnu/libm-2.15.so]
  203,349,602  ???:operator new(unsigned int) [/usr/lib/i386-linux-gnu/libstdc++.so.6.0.16]
  192,222,108  /build/buildd/eglibc-2.15/math/../sysdeps/ieee754/dbl-64/w_exp.c:exp [/lib/i386-linux-gnu/libm-2.15.so]
  128,438,068  /build/buildd/eglibc-2.15/string/../sysdeps/i386/i686/multiarch/strcat.S:0x0012ac73 [/lib/i386-linux-gnu/libc-2.15.so]
  128,431,176  ???:operator delete(void*) [/usr/lib/i386-linux-gnu/libstdc++.so.6.0.16]
  128,358,564  /usr/include/c++/4.6/ext/new_allocator.h:main
  117,645,255  /usr/include/c++/4.6/bits/stl_vector.h:main
  112,167,083  /usr/include/c++/4.6/bits/stl_algobase.h:main

Except main(), it doesn't show which parts of the source code take up most of the time. I know for a fact that most of the time is spent in optimize() function, and in turn a significant fraction of that time in kernel() function, but I can't see this from the output. How do I go about getting the details so that I can speed up my code?

If it helps, I am using std::vectors extensively in the code. I had implemented a similar code some time back using arrays, and callgrind seemed to work fine then. Could this be an issue?

If I disable the O3 flag, I get the following output:

--------------------------------------------------------------------------------
            Ir 
--------------------------------------------------------------------------------
19,026,610,083  PROGRAM TOTALS

--------------------------------------------------------------------------------
           Ir  file:function
--------------------------------------------------------------------------------
5,233,252,577  /build/buildd/eglibc-2.15/string/../sysdeps/i386/i686/multiarch/memcpy-ssse3.S:__memmove_ssse3 [/lib/i386-linux-gnu/libc-2.15.so]
2,542,000,057  /build/buildd/eglibc-2.15/malloc/malloc.c:_int_malloc [/lib/i386-linux-gnu/libc-2.15.so]
1,184,626,252  /build/buildd/eglibc-2.15/malloc/malloc.c:_int_free [/lib/i386-linux-gnu/libc-2.15.so]
  983,472,430  sgd.cpp:optimize(std::vector<double, std::allocator<double> >, std::vector<int, std::allocator<int> >, std::vector<double, std::allocator<double> >) [a.out]
  781,018,740  ???:std::vector<double, std::allocator<double> >::operator[](unsigned int) [a.out]
  772,117,839  sgd.cpp:kernel(std::vector<double, std::allocator<double> >, int, int, double) [a.out]
  476,616,742  ???:std::vector<double, std::allocator<double> >::vector(std::vector<double, std::allocator<double> > const&) [a.out]
  449,016,969  /build/buildd/eglibc-2.15/malloc/malloc.c:malloc [/lib/i386-linux-gnu/libc-2.15.so]
  324,200,916  ???:std::vector<double, std::allocator<double> >::size() const [a.out]
  305,705,504  ???:std::_Vector_base<double, std::allocator<double> >::_Vector_base(unsigned int, std::allocator<double> const&) [a.out]
  267,492,204  ???:std::_Vector_base<double, std::allocator<double> >::~_Vector_base() [a.out]
  238,309,873  /usr/include/c++/4.6/bits/stl_algobase.h:double* std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<double>(double const*, double const*, double*) [a.out]
  238,308,370  /usr/include/c++/4.6/bits/stl_algobase.h:double* std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*) [a.out]
  228,776,040  /usr/include/c++/4.6/bits/stl_algobase.h:std::_Miter_base<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > > >::iterator_type std::__miter_base<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > > >(__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >) [a.out]
  228,776,038  /usr/include/c++/4.6/bits/stl_algobase.h:double* std::copy<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*) [a.out]
  210,178,748  /build/buildd/eglibc-2.15/malloc/malloc.c:free [/lib/i386-linux-gnu/libc-2.15.so]
  210,172,446  ???:std::vector<double, std::allocator<double> >::~vector() [a.out]
  209,711,018  sgd.cpp:square(double) [a.out]
  190,646,380  /build/buildd/eglibc-2.15/math/../sysdeps/i386/fpu/e_exp.S:__ieee754_exp [/lib/i386-linux-gnu/libm-2.15.so]
  181,517,469  ???:operator new(unsigned int) [/usr/lib/i386-linux-gnu/libstdc++.so.6.0.16]
  171,582,030  /usr/include/c++/4.6/bits/stl_iterator_base_types.h:std::_Iter_base<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, true>::_S_base(__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >) [a.out]
  171,581,742  /build/buildd/eglibc-2.15/math/../sysdeps/ieee754/dbl-64/w_exp.c:exp [/lib/i386-linux-gnu/libm-2.15.so]
  152,853,344  ???:__gnu_cxx::new_allocator<double>::allocate(unsigned int, void const*) [a.out]
  152,852,752  ???:std::_Vector_base<double, std::allocator<double> >::_Vector_impl::_Vector_impl(std::allocator<double> const&) [a.out]
  152,517,360  /usr/include/c++/4.6/bits/stl_algobase.h:std::_Niter_base<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > > >::iterator_type std::__niter_base<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > > >(__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >) [a.out]
  152,517,360  /usr/include/c++/4.6/bits/stl_iterator_base_types.h:std::_Iter_base<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, false>::_S_base(__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >) [a.out]
  152,517,360  /usr/include/c++/4.6/bits/stl_iterator.h:__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >::__normal_iterator(double const* const&) [a.out]
  133,746,571  ???:std::_Vector_base<double, std::allocator<double> >::_M_deallocate(double*, unsigned int) [a.out]
  133,452,690  ???:std::vector<double, std::allocator<double> >::end() const [a.out]
  133,452,690  ???:std::vector<double, std::allocator<double> >::begin() const [a.out]
  131,134,604  sgd.cpp:sign(double) [a.out]
  123,920,353  /usr/include/c++/4.6/bits/stl_algobase.h:double* std::__copy_move_a<false, double const*, double*>(double const*, double const*, double*) [a.out]
  121,192,848  ???:std::vector<int, std::allocator<int> >::operator[](unsigned int) [a.out]
  114,649,360  /build/buildd/eglibc-2.15/string/../sysdeps/i386/i686/multiarch/strcat.S:0x0012ac73 [/lib/i386-linux-gnu/libc-2.15.so]
  114,642,456  ???:operator delete(void*) [/usr/lib/i386-linux-gnu/libstdc++.so.6.0.16]
  114,388,018  /usr/include/c++/4.6/bits/stl_uninitialized.h:double* std::__uninitialized_copy<true>::__uninit_copy<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*) [a.out]
  114,388,018  /usr/include/c++/4.6/bits/stl_uninitialized.h:double* std::uninitialized_copy<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*>(__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*) [a.out]
  114,388,018  /usr/include/c++/4.6/bits/stl_uninitialized.h:double* std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*, double>(__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, __gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >, double*, std::allocator<double>&) [a.out]
  105,086,674  /usr/include/c++/4.6/bits/stl_vector.h:std::_Vector_base<double, std::allocator<double> >::_M_allocate(unsigned int) [a.out]
   95,533,505  ???:std::_Vector_base<double, std::allocator<double> >::_M_get_Tp_allocator() [a.out]
   95,533,300  /usr/include/c++/4.6/bits/stl_construct.h:void std::_Destroy<double*>(double*, double*) [a.out]
   95,533,300  /usr/include/c++/4.6/bits/stl_construct.h:void std::_Destroy<double*, double>(double*, double*, std::allocator<double>&) [a.out]
   95,532,970  /usr/include/c++/4.6/bits/allocator.h:std::allocator<double>::allocator(std::allocator<double> const&) [a.out]
   95,323,350  /usr/include/c++/4.6/bits/stl_iterator.h:__gnu_cxx::__normal_iterator<double const*, std::vector<double, std::allocator<double> > >::base() const [a.out]
   76,594,040  /usr/include/c++/4.6/bits/allocator.h:std::allocator<double>::~allocator() [a.out]
   76,428,152  /usr/include/c++/4.6/bits/stl_algobase.h:std::_Niter_base<double*>::iterator_type std::__niter_base<double*>(double*) [a.out]
   76,426,584  /usr/include/c++/4.6/ext/new_allocator.h:__gnu_cxx::new_allocator<double>::deallocate(double*, unsigned int) [a.out]
   76,426,344  ???:std::_Vector_base<double, std::allocator<double> >::_Vector_impl::~_Vector_impl() [a.out]
   75,798,592  /usr/include/c++/4.6/bits/stl_algobase.h:__gnu_cxx::__enable_if<std::__is_scalar<double>::__value, double*>::__type std::__fill_n_a<double*, unsigned int, double>(double*, unsigned int, double const&) [a.out]
   47,768,335  /usr/include/c++/4.6/bits/stl_iterator_base_types.h:std::_Iter_base<double*, false>::_S_base(double*) [a.out]
   47,767,040  ???:__gnu_cxx::new_allocator<double>::max_size() const [a.out]
   47,662,045  ???:std::_Vector_base<double, std::allocator<double> >::_M_get_Tp_allocator() const [a.out]
   38,297,020  /usr/include/c++/4.6/ext/new_allocator.h:__gnu_cxx::new_allocator<double>::~new_allocator() [a.out]

This has more information than the previous output, but there are still two issues: One, output on unoptimized code doesn't help me make the optimized code faster. Two, most of the time (~50%) is taken up by libc functions, which I am not directly using in my code. How do I know which parts of the code map to these calls?

  • Without seeing what the missing code is (things like "some more code"), it would be hard to say. When the compiler optimizes, it does a lot of things. It may unroll loops, it may reorganize code (so, if nothing after kernel() uses x, it may jump to kernel at the end of the function, so you don't see that transition as a separate call), and in fact, if kernel or optimize are small enough, AND are in the same source file, it may inline one inside the other (assuming G++, other compilers are even smarter and look between source files). Also, much of STL are headers which get inlined. – digger May 11 '15 at 00:24
  • Simple statistical profilers usually work better than callgrind, in my experience... Try a search for "gperftools" and "flame graphs" for a couple of starting points. – Nemo May 11 '15 at 01:15
  • Try using kcachegrind to show you the results of callgrind - the GUI is great at showing hierarchical time. Caveat: I've only used it on C. – JohnH May 28 '15 at 19:02

0 Answers0