INTRO

I have a TCP/HTTP server that supports plugins in the form of shared libraries (DLL and .so). Its build system generates makefiles and .sln files via premake. When I start the application I feed it a configuration file like this, describing which libraries the server should load as plugins and which arguments it should pass to them. For some time I had two plugins and everything worked just fine, and it still works fine if I feed the server config files like this. But now I am developing a new plugin, and so I have a new config file.

SETUP

The steps required to set up my server on Linux are few and simple:

  • download the build script (from here, as described here)
  • run ./cloud_server_net_setup.sh (no superuser needed; requires curl, make and g++). In the regular, non-development case this is enough: it fetches Boost and the other libraries it needs into a local folder, builds all of them, and builds the server in release mode
  • now you can cd into cloud_server/install-dir/
  • call export LD_LIBRARY_PATH=./:./lib_boost
  • and run the server: ./CloudServer

But we need the debug version, so after running the script we:

  • cd cloud_server/CloudServer/projects/linux-gmake/
  • make
  • cd bin/debug
  • export LD_LIBRARY_PATH=./:(place from where we called our script)/cloud_server/install-dir/lib_boost

  • and now, finally, we can call gdb

PROBLEM

So we call it, and this is what we see:

 gdb ./CloudServer

GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/ole_jak/cloud_server/CloudServer/projects/linux-gmake/bin/debug/CloudServer...done.
(gdb) r
Starting program: /home/ole_jak/cloud_server/CloudServer/projects/linux-gmake/bin/debug/CloudServer
[Thread debugging using libthread_db enabled]
Cloud Server v0.5
Copyright (c) 2011 Cloud Forever. All rights reserved.

Type 'help' to see help messages.
Config file path: config.xml
[New Thread 0x7ffff5967700 (LWP 11516)]
[New Thread 0x7ffff5166700 (LWP 11517)]
[New Thread 0x7ffff4965700 (LWP 11518)]
[New Thread 0x7ffff4164700 (LWP 11519)]
[New Thread 0x7ffff3963700 (LWP 11520)]
[New Thread 0x7ffff3162700 (LWP 11521)]
[New Thread 0x7ffff2961700 (LWP 11522)]
[New Thread 0x7ffff2160700 (LWP 11523)]
[New Thread 0x7ffff195f700 (LWP 11524)]
[New Thread 0x7ffff115e700 (LWP 11525)]
[New Thread 0x7ffff095d700 (LWP 11526)]
[New Thread 0x7fffebfff700 (LWP 11527)]
[New Thread 0x7fffeb7fe700 (LWP 11528)]
[New Thread 0x7fffeaffd700 (LWP 11529)]
[New Thread 0x7fffea7fc700 (LWP 11530)]
[New Thread 0x7fffe9ffb700 (LWP 11531)]
Library libFileService.so opened.
[New Thread 0x7fffe953c700 (LWP 11532)]
Library libUsersFilesService.so opened.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) x/i $pc
0x0:    Cannot access memory at address 0x0

I am a Linux newbie, and all I know about segmentation faults comes from Wikipedia. But I do know one more thing about my server and this new service: it compiles and runs on Windows with no errors at all (VS2008 and VS2010 solutions can be created from the same premake script).

So I wonder how and where in these two files, .cpp and .h, I have introduced an error that does not show up on Windows at all but shows up so dramatically on Linux. Is it fixable, or visible to a fresh eye?

UPDATE: Valgrind output

ole_jak@dspproc:~/cloud_server/CloudServer/projects/linux-gmake/bin/debug$ valgrind ./CloudServer
==11682== Memcheck, a memory error detector
==11682== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==11682== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==11682== Command: ./CloudServer
==11682==
Cloud Server v0.5
Copyright (c) 2011 Cloud Forever. All rights reserved.

Type 'help' to see help messages.
Config file path: config.xml
Library libFileService.so opened.
Library libUsersFilesService.so opened.
==11682== Jump to the invalid address stated on the next line
==11682==    at 0x0: ???
==11682==    by 0x4D49BE: sqlite3_free (sqlite3.c:18155)
==11682==    by 0x102242D5: sqlite3OsInit (sqlite3.c:14162)
==11682==    by 0x1029EB28: sqlite3_initialize (sqlite3.c:107299)
==11682==    by 0x102A159F: openDatabase (sqlite3.c:108909)
==11682==    by 0x102A1B29: sqlite3_open (sqlite3.c:109156)
==11682==    by 0x1021CAB0: sqlite3pp::database::connect(char const*) (sqlite3pp.cpp:89)
==11682==    by 0x1021C6E3: sqlite3pp::database::database(char const*) (sqlite3pp.cpp:74)
==11682==    by 0x1020DDDF: users_files_service::create_files_table(std::string) (users_files_service.cpp:171)
==11682==    by 0x1020BAFC: users_files_service::apply_config(boost::shared_ptr<boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> > >) (users_files_service.cpp:38)
==11682==    by 0x4B5432: server_utils::parse_config_services(boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> >) (server_utils.cpp:156)
==11682==    by 0x4B6436: server_utils::parse_config(boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> >) (server_utils.cpp:208)
==11682==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==11682==
==11682==
==11682== Process terminating with default action of signal 11 (SIGSEGV)
==11682==  Bad permissions for mapped region at address 0x0
==11682==    at 0x0: ???
==11682==    by 0x4D49BE: sqlite3_free (sqlite3.c:18155)
==11682==    by 0x102242D5: sqlite3OsInit (sqlite3.c:14162)
==11682==    by 0x1029EB28: sqlite3_initialize (sqlite3.c:107299)
==11682==    by 0x102A159F: openDatabase (sqlite3.c:108909)
==11682==    by 0x102A1B29: sqlite3_open (sqlite3.c:109156)
==11682==    by 0x1021CAB0: sqlite3pp::database::connect(char const*) (sqlite3pp.cpp:89)
==11682==    by 0x1021C6E3: sqlite3pp::database::database(char const*) (sqlite3pp.cpp:74)
==11682==    by 0x1020DDDF: users_files_service::create_files_table(std::string) (users_files_service.cpp:171)
==11682==    by 0x1020BAFC: users_files_service::apply_config(boost::shared_ptr<boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> > >) (users_files_service.cpp:38)
==11682==    by 0x4B5432: server_utils::parse_config_services(boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> >) (server_utils.cpp:156)
==11682==    by 0x4B6436: server_utils::parse_config(boost::property_tree::basic_ptree<std::string, std::string, std::less<std::string> >) (server_utils.cpp:208)
==11682==
==11682== HEAP SUMMARY:
==11682==     in use at exit: 124,050 bytes in 1,083 blocks
==11682==   total heap usage: 1,814 allocs, 731 frees, 183,517 bytes allocated
==11682==
==11682== LEAK SUMMARY:
==11682==    definitely lost: 0 bytes in 0 blocks
==11682==    indirectly lost: 0 bytes in 0 blocks
==11682==      possibly lost: 46,248 bytes in 799 blocks
==11682==    still reachable: 77,802 bytes in 284 blocks
==11682==         suppressed: 0 bytes in 0 blocks
==11682== Rerun with --leak-check=full to see details of leaked memory
==11682==
==11682== For counts of detected and suppressed errors, rerun with: -v
==11682== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
Killed
ole_jak@dspproc:~/cloud_server/CloudServer/projects/linux-gmake/bin/debug$
Rella
  • The very first thing to do is download [valgrind](http://valgrind.org/) (you should be able to get it through your Linux distribution) and run "valgrind ./CloudServer". It should be able to give you the call stack when the segfault occurs. Also, just because your Windows build doesn't segfault doesn't mean that it doesn't have the error. It could be suffering in silence. – Darcy Rayner Nov 05 '11 at 10:16
  • Also, just a quick guess from your output, but you may be dereferencing a NULL pointer. – Darcy Rayner Nov 05 '11 at 10:22
  • Now that I read it, I am starting to see the problem... Thank you all!) – Rella Nov 05 '11 at 10:29
  • According to the gdb output, you are dereferencing a NULL pointer. You should use gdb to find out why this is happening. – ks1322 Nov 05 '11 at 10:30
  • As @DarcyRayner says, valgrind will find the bug. But this looks like an unassigned pointer being used. A quick and easy way to find this is to turn on warnings and set the compiler to treat warnings as errors (this should help you find the problem). Note that you should do this on Windows in DevStudio as well (turn up the warning level one notch). Warnings should be considered logical errors in your code, and your code should compile with zero warnings. – Martin York Nov 05 '11 at 11:45

2 Answers

This is a nasty one. I am unsure about the exact root cause, but this seems to be a multi-threading related issue. The immediate cause of the problem is that the sqlite3Config.m.xSize function pointer is NULL at the place and time the error happens.

This pointer is supposed to be initialized to point to a proper function the first time that sqlite3_initialize() is called, which normally happens the first time you open an SQLite database file. By setting breakpoints and watchpoints in GDB I was able to verify that the pointer is successfully set, yet at the time of the segmentation fault its value is NULL.

That could mean one of two things:

  • The new pointer value is not properly propagated to all threads. SQLite3 is supposed to be thread-safe, but well, threads can be nasty little buggers...

  • Something resets the pointer after it has been initialized. I considered this highly unlikely since the sqlite3Config structure is not usually modified after initialization.

I performed a simple test, which incidentally can be used as a temporary workaround: I added an explicit call to sqlite3_initialize() as the first statement in main(), allowing it to be executed before any threads are launched. As a result, the segmentation fault went away and I got a shell prompt for your server, which points to the first of the two alternatives. Note that this is a workaround at best, since sqlite3_initialize() normally does not need to be called explicitly: sqlite3_open() calls it automatically. The root cause of the issue may still be present and make itself known otherwise - or, worse, it could break things in subtle, hard-to-detect ways.

Since SQLite3 is supposed to be thread-safe (and the source code of the sqlite3_initialize() function seems correct in that regard), I am unsure what is happening. It could be a problem with the sqlite3pp wrapper or with the way the threads are launched.

thkala
  • Well, thread safety is irrelevant...) Anyway, we found a way to fix it...) See [commit](http://code.google.com/p/cloudobserver/source/detail?r=1579). The problem was simple: we have a main app and a .so that are both statically linked to SQLite. When we first call some functions from our main app and then from the .so, it is all OK, but the reverse order gives SIGSEGV. (Note: the .so is loaded at runtime, and it all happens in one single thread.) – Rella Nov 05 '11 at 23:43
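
The order dependence described in that comment can be modelled with a small, self-contained sketch. It is only an illustration under stated assumptions, not SQLite code and not the fix from the linked commit: `LibraryCopy`, `fake_size`, `initialize` and `plugin_open_succeeds` are hypothetical names, and the sketch assumes that the plugin's exported library calls end up bound to the executable's copy, which is one plausible reading of the valgrind trace (the `sqlite3_free` frame lies in the executable's address range while its caller `sqlite3OsInit` lies in the plugin's).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of one statically linked copy of a C library.
// Each copy has a private config whose function pointer is null until
// that copy's own initializer has run (loosely modelled on the
// sqlite3Config.m.xSize pointer mentioned in the answer).
struct LibraryCopy {
    std::size_t (*size_fn)(void*) = nullptr;
};

// Placeholder for the real size routine.
std::size_t fake_size(void*) { return 0; }

// Models an initializer that fills in its own config and then calls an
// exported routine. `bound_to` is the copy that the exported call actually
// resolves to at run time. Returns false where the real program would
// jump through a null pointer.
bool initialize(LibraryCopy& self, const LibraryCopy& bound_to) {
    self.size_fn = fake_size;
    return bound_to.size_fn != nullptr;
}

enum class Order { HostFirst, PluginFirst };

// Returns true if the plugin can initialize its copy without hitting
// the null pointer, for the given call order.
bool plugin_open_succeeds(Order order) {
    LibraryCopy host;    // copy linked into the executable
    LibraryCopy plugin;  // copy linked into the .so
    if (order == Order::HostFirst) {
        // The executable's own calls bind to its own copy.
        bool ok = initialize(host, host);
        assert(ok);
        (void)ok;
    }
    // Assumption: the plugin's exported calls resolve to the
    // executable's copy rather than the plugin's own.
    return initialize(plugin, host);
}
```

In this model `plugin_open_succeeds(Order::HostFirst)` returns true and `plugin_open_succeeds(Order::PluginFirst)` returns false. That matches both the comment's observation and the reason the explicit initialization call in main() hides the crash.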

Here are my suggestions.

  1. Turn off optimizations. Sometimes optimizations cause errors. Use -O0, for example.
  2. Remove dynamic loading: try linking your code in statically and see if the problem still occurs.
  3. Reduce the size of the problem. Make the smallest possible program that can reproduce the error and then post it here.

thanks, mike

h4ck3rm1k3
  • "Sometimes optimizations cause errors": unlikely. It is more likely that optimizations are exposing undefined behavior in your code. – Martin York Nov 05 '11 at 12:08