I am a Python newbie and I have to do some simple work with it.

I use sklearn.mixture methods to process the data, but it takes too much time.

After reading some posts here, I decided to cythonize these functions.

I ran `python setup.py build_ext --inplace` on all *.py files from sklearn.mixture, as the tutorial described. However, the timings for calling these methods stayed exactly the same. I even renamed the *.py files to make sure that the compiled native libraries are the ones being loaded.
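One way to confirm which file Python actually imports is to look at the module's `__file__` attribute. The snippet below is only an illustrative sketch: `is_compiled_extension` is a hypothetical helper name (not part of sklearn or Cython), and `json` is used as a stand-in module so the example runs without any third-party package.

```python
import importlib
import importlib.machinery


def is_compiled_extension(module_name):
    """Return True if the module was loaded from a native extension file.

    Hypothetical helper for illustration only.
    """
    module = importlib.import_module(module_name)
    path = getattr(module, "__file__", None)
    if path is None:
        # Built-in modules have no file on disk.
        return False
    # EXTENSION_SUFFIXES lists the platform's native suffixes (.so, .pyd, ...).
    return path.endswith(tuple(importlib.machinery.EXTENSION_SUFFIXES))


# 'json' is a pure-Python package in the standard library.
print(is_compiled_extension("json"))
```

Passing the dotted name of a module that was rebuilt with Cython should show whether the native build or the original .py source is the one being picked up.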

My test application is presented below:

import datetime
import pickle

from sklearn import mixture

def process():
    with open('test_in', 'rb') as f:
        mfcc = pickle.load(f)
    time_start = datetime.datetime.now()
    print(time_start.strftime("%Y-%m-%d %H:%M:%S.%f"))
    gmm = mixture.GaussianMixture(n_components=10,  max_iter=150)
    voice_model = gmm.fit(mfcc)
    time_end = datetime.datetime.now()
    print(time_end.strftime("%Y-%m-%d %H:%M:%S.%f"))
    delta = time_end - time_start
    print('Delta: ' + str(delta))
    with open('test_out', 'wb') as f:
        pickle.dump(voice_model, f)
    return

process()

So, could someone show me what I am doing wrong?

Is there another way to improve performance?

E.Z
Donz
  • What do you mean by "cythonize these functions"? You don't seem to mention any .pyx file, nor which functions you "cythonized". – TomDLT Aug 04 '17 at 14:10
  • As I understand it, I can cythonize .py files too. It works, since I get a .pyd as the output file. My setup.py is from the tutorial; I specified *.py instead of app.py in it and ran it in the sklearn.mixture directory. `from distutils.core import setup` `from Cython.Build import cythonize` `setup(` `name = 'Hello world app',` `ext_modules = cythonize("app.py"),` `)` – Donz Aug 04 '17 at 14:14
  • You can't cythonize a `.py` file, you need a `.pyx` file. I suggest you get more familiar with Cython ([tutorial](http://conference.scipy.org/proceedings/SciPy2009/paper_1/full_text.pdf)) before trying to improve scikit-learn's performance, which should already be quite optimized. – TomDLT Aug 04 '17 at 14:40
  • Thx. But why did I get .so files with native code inside when I ran Cython on my .py files? And I see .so files in other packages of sklearn, but not in mixture, so I decided that for some reason they didn't do this optimization. Am I wrong? – Donz Aug 04 '17 at 14:54
  • @TomDLT You're wrong - you can Cythonize a `.py` file. The benefits are likely to be fairly small since you aren't providing type information but it does work. (You can often provide type information in an associated .pxd file - see "pure Python mode"). I agree that scikit learn is likely to be pretty well-optimized though, so there's little for OP to do. – DavidW Aug 04 '17 at 18:11
  • It does not matter how you name your file. What matters is what is inside the file. Actually, you do not cythonize the function mixture but the call of the function, and there is nothing to gain. You should cythonize the file where this function is defined. And do this only after having profiled the function and knowing for every line how much time it uses. – ead Aug 04 '17 at 18:14
  • @ead, I also cythonized the files from sklearn.mixture that contain this function's body, as described in the question. The timing remained the same. – Donz Aug 09 '17 at 13:37
  • Can you tell me in which line of sklearn.mixture you spend the most time? Why would you think that this line would profit from being compiled with cython? – ead Aug 09 '17 at 13:41
  • @ead, I know that the mixture.GaussianMixture function is the bottleneck in my project. Does it make sense to find the specific line inside it? Even if I find it, I can't cythonize only this line. – Donz Aug 09 '17 at 14:19
  • When I look into the code, GaussianMixture just calls some numpy functions and the whole work is done inside these functions. Cythonizing these calls won't make the program run faster. Optimizing without knowing what the bottleneck is is just a shot in the dark - it is unlikely to solve the problem. – ead Aug 09 '17 at 15:06
  • @ead, it seems you're right. The calls go into other libraries that are already cythonized. Thank you. I will think about another way of optimizing. – Donz Aug 10 '17 at 13:05
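
The profiling step suggested in the comments can be sketched with the standard library's `cProfile` and `pstats`. This is a minimal illustration, not a definitive recipe: `profile_call` and `stand_in_workload` are hypothetical names introduced here, and the workload is a placeholder so the snippet runs without sklearn or the `test_in` data.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return its result and a text report.

    Hypothetical helper for illustration only.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def stand_in_workload(n):
    # Placeholder computation standing in for the real model fit.
    return sum(i * i for i in range(n))


result, report = profile_call(stand_in_workload, 10000)
print(report)
```

With the question's code, the same helper could presumably be called as `profile_call(gmm.fit, mfcc)`. If the report shows most of the cumulative time inside numpy or scipy routines, compiling the Python wrappers around them is unlikely to help.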
