0

I designed data warehousing application, but I struggle with poor performance when fetching data from the source and saving them to db. - only approximately 150 kB/s.

Because of the limitations imposed by the customer, I am forced to use Django on 64bit Win machine and save data to MS SQL Express (exact versions below). I am using django-mssql (1.7) backend.

Original data are stored in .dbf file (Visual FoxPro), dbfread returns each row from a file as Python dict (this is not the issue, tested by running just reader discarding data). This dictionary is then checked for data quality (function sanitize_value_for_db() below), data are copied to Django data model (attributes of the object populated; tables are wide, hence each table/object has about 100 columns/attributes) and objects are saved to db. using Django objects.bulk_create() (in batches of 50-100).

I run the code through profiler using cProfile and pstats modules. Results are below. I see that most of the time is spent in PyWin. But I have no clue if there is something I can do. Any hints or opinions will be greatly appreciated. Thanks.

Configuration:

  • Xeon E5-2403 v2 @ 1,8 GHz, 30 GB RAM
  • Windows Server 2012 R2 (64bit)
  • MS SQL Express 64bit, v 11.02.2100.60
  • Python (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:44:40) [MSC v.1600 64 bit (AMD64)] on win32#
  • pywin32-219.win-amd64-py3.4
  • Django (1.7.10)
  • django-mssql (1.7)

Profile (on smaller data sample):

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      2306881  332.596    0.000  332.596    0.000 {built-in method compile}
      4592205  186.028    0.000  186.028    0.000 {method 'InvokeTypes' of 'PyIDispatch' objects}
     39558280  176.963    0.000  176.963    0.000 {method 'Bind' of 'PyITypeComp' objects}
      9889570  173.905    0.000  712.091    0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:390(_LazyAddAttr_)
     17464949  151.818    0.000  279.692    0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:151(_AddFunc_)
         7203  142.584    0.020 2215.686    0.308 c:\Python34\Lib\site-packages\sqlserver_ado\dbapi.py:587(execute)
            4  120.012   30.003  120.012   30.003 {built-in method sleep}
     11484250  100.385    0.000 1269.561    0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:444(__getattr__)
     12167658   67.229    0.000   67.229    0.000 {method 'Invoke' of 'PyIDispatch' objects}
     22100535   50.127    0.000   53.982    0.000 <string>:12(__init__)
      4599456   48.409    0.000  225.498    0.000 c:\Python34\Lib\site-packages\win32com\client\__init__.py:18(__WrapDispatch)
     27390427   45.257    0.000   65.391    0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:420(_ResolveType)
         7189   44.294    0.006 2296.538    0.319 c:\Python34\Lib\site-packages\django\db\models\query.py:911(_insert)
      9889570   43.621    0.000  759.155    0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:381(__LazyMap__)
      2306881   43.605    0.000  130.943    0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:303(MakeDispatchFuncMethod)
      5275612   33.300    0.000  402.786    0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:524(__setattr__)
     17464949   32.793    0.000   32.793    0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:58(__init__)
      4599456   32.494    0.000  124.536    0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:120(Dispatch)
      2292471   30.889    0.000  843.841    0.000 c:\Python34\Lib\site-packages\sqlserver_ado\dbapi.py:266(_configure_parameter)
      7007079   27.918    0.000   53.176    0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:516(MakePublicAttributeName)
      2306881   27.907    0.000   46.266    0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:483(_BuildArgList)
     15201179   25.593    0.000   25.593    0.000 {method 'GetTypeAttr' of 'PyITypeInfo' objects}
     68338972   24.271    0.000   24.758    0.000 {built-in method isinstance}
      2306881   23.056    0.000  538.037    0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:314(_make_method_)
     17464949   21.683    0.000   21.683    0.000 {method 'GetNames' of 'PyITypeInfo' objects}
      7007079   20.396    0.000   20.396    0.000 c:\Python34\Lib\site-packages\win32com\client\build.py:546(<listcomp>)
      4599456   18.166    0.000   18.166    0.000 c:\Python34\Lib\site-packages\win32com\client\dynamic.py:172(__init__)
      3422203   15.624    0.000   20.870    0.000 C:\MRI\mri\dwh\daq\daq_utils.py:152(sanitize_value_for_db)
Mr. Napik
  • 5,499
  • 3
  • 24
  • 18
  • It seems like some a good portion of time is spent building a query. Did you suggest using raw SQL? Since you are already using bulk_create, you don't get signals and other Django stuff executed, so raw SQL may be the way to go. – Назар Топольський Nov 07 '16 at 11:04
  • Another suggestion: if currently you try to insert all items in one bulk_create - you may want to consider splitting your data in batches of, say, 1000 items, and insert them like that. – Назар Топольський Nov 07 '16 at 11:07
  • Hi, thank you very much. For clarification: the bulk insert now works in batches of 50-100, since MS-SQL can handle at most 2100 parameters per single query. I do not need the signalling, so I will thest if raw query helps. – Mr. Napik Nov 08 '16 at 12:18

0 Answers0