
I am doing a project for school in which I am supposed to make a set of scripts run faster, as they are very slow. For the past few months I had no access to the actual scripts, so I was testing on a dummy script I wrote that performs the same tasks. With that dummy script, I found that PyPy together with multiprocessing made it run at least 10x faster. So upon getting access to the actual script, I applied multiprocessing to it and ran it with PyPy. However, surprisingly, the code run with PyPy runs 2x SLOWER than without PyPy, instead of showing any performance improvement. What could the reason be? The actual script uses libraries like numpy and pandas, and makes a database connection to write the output of the process so it can be accessed by the web server later on. Is it that numpy or pandas runs faster under the standard CPython interpreter than under PyPy? If not, what else could explain this? Also, any suggestions to make it faster are welcome :)

P.S. Multiprocessing has already been applied, and it is only about 40 seconds faster than the original code, which is not sufficient.

Edit: Adding in the code. It is a script that generates who came into contact with whom, for how long, and where - contact tracing for a hospital. Basically, it reads in a CSV file with the positions of all the sensors at various timings, and then an algorithm generates all the contacts and writes them to the database to be picked up by the web server later.

Code is as follows. It is extremely long, which is probably why I didn't post it earlier :)

def resampleData(_beaconDevice, _timeInterval, _locationPtsX, _locationPtsY, database):
    # Update progress in the database, then resample the sensor readings
    # to one-second intervals per device.
    database.child("contact").child("progress").set(20)
    beaconData = pd.DataFrame({'timestamp': _timeInterval, 'Device': _beaconDevice, 'Beacon Longtitude': _locationPtsX, 'Beacon Latitude': _locationPtsY})
    beaconData.set_index('timestamp', inplace=True)
    beaconData.index = pd.to_datetime(beaconData.index)
    beaconData = beaconData.groupby('Device').resample('S')[['Beacon Longtitude', 'Beacon Latitude']].mean().ffill()
    return beaconData

def processTwo(connectedDev, temp, devicelist, increment, _patientlist, _beaconData, _start, _end, _devlist, _scale, database, user, _distance):
    # For each patient, walk through the resampled positions and record
    # every contact with another device within the given time window.
    for numPatients, patientName in enumerate(_patientlist):
        timestamp = _beaconData.loc[patientName, :].index.tolist()
        patientX = _beaconData.loc[patientName, 'Beacon Longtitude'].tolist()
        patientY = _beaconData.loc[patientName, 'Beacon Latitude'].tolist()
        progressUnit = (55 / len(timestamp))
        for t, timeNum in enumerate(timestamp):
            if timeNum >= _start and timeNum <= _end:
                for device, devNum in enumerate(_devlist):
                    if devNum != patientName:
                        if devNum in devicelist:
                            logger.log("Finding Contacts...", timeNum)
                            if increment < 55:
                                increment += progressUnit
                                try:
                                    database.child("contact").child("progress").set(30 + increment)
                                except:
                                    logger.info("exception")
                            isContact, contactLoc = inContact(patientName, patientX, patientY, devNum, t, _beaconData, _scale, _distance)
                            if isContact:
                                logger.log(patientName, "in contact with", devNum, "!")
                                temp.append(patientName)
                                temp.append(timeNum)
                                temp.append(int(devNum))
                                temp.append(patientX[t])
                                temp.append(patientY[t])
                                temp.append(contactLoc[0])
                                temp.append(contactLoc[1])
                                connectedDev.append(temp)
                                temp = []

The processTwo function is one of seven computationally intensive functions in the code. The for loops deal with DataFrames.

confused_kid
  • Possible duplicate of [Am I using PyPy wrong? It's slower 10x than standard Python](https://stackoverflow.com/questions/31992693/am-i-using-pypy-wrong-its-slower-10x-than-standard-python) – Montel Edwards Mar 12 '18 at 02:35
  • "The actual script uses libraries like numpy…" First question: are you using `numpypy`, or using standard `numpy` and making PyPy adjust to it? – abarnert Mar 12 '18 at 02:43
  • You're asking us to speculate about behavior of code we cannot see based on a vague, unclear problem description. Not how this site works. Spend some time taking the [tour] and reading the [help] pages, especially [ask] and [mcve]. This site is not for questions asking *Please provide a list of the millions of reasons that something might not be working as expected in this code I'm not going to show you*. – Ken White Mar 12 '18 at 02:45
  • More generally: In (well-written) numpy code, there's very little actual Python computation, and a whole lot of stuff happening inside numpy. There's nothing PyPy can do to speed up what happens inside numpy. And meanwhile, the `cpyext` wrapper that lets PyPy use C extensions like numpy is ["infamously slow"](http://doc.pypy.org/en/latest/faq.html#should-i-install-numpy-or-numpypy). While they've done some great work optimizing that out over the past few years, sharing arrays via multiprocessing and cpyext-numpy-ing them on both sides is one of the areas that's often still slow. – abarnert Mar 12 '18 at 02:47
  • @abarnert it's just numpy. Ah I see, that must be the reason. I will forward it to my professor. Thank you! – confused_kid Mar 12 '18 at 02:50
  • @KenWhite I am sorry if the question is vague. I can't help it as I am forced to work with something not done by me with zero explanation from the person who passed it to me. – confused_kid Mar 12 '18 at 02:51
  • You need to try to put together a [minimal, complete, verifiable example](https://stackoverflow.com/help/mcve): the smallest possible bit of code (ideally using as few of those libraries as possible) that runs slower in PyPy, and (as far as you can guess) does so for the same reason as your actual code. Then we can look at it and try to explain why it's running slower. Without that, all we can do is guess. – abarnert Mar 12 '18 at 02:52
  • That doesn't affect the guidelines of this site at all, I'm afraid. They don't say *unless you come up with some excuse why they don't apply to you*. – Ken White Mar 12 '18 at 02:55
  • The question is better with the code than without—but unless all of that code really is needed to demonstrate and debug the problem (doubtful), it's still nowhere near minimal. And at the same time, without the actual database and CSV files, it's not complete, either. – abarnert Mar 12 '18 at 03:38
  • @abarnert thanks for your suggestion on numba. I was looking for answers like that. I will try it out. – confused_kid Mar 12 '18 at 03:53
  • [You really expect us to go through that entire wall of code?](http://idownvotedbecau.se/toomuchcode/) Please see: [mcve], emphasis on **minimal**. – EJoshuaS - Stand with Ukraine Mar 12 '18 at 05:50

1 Answer


The actual script uses libraries like numpy, pandas and does a database connection…

If the vast majority of your time is spent in numpy, pandas, and database calls, rather than in Python loops or computation, there's virtually nothing for PyPy to speed up.

Both numpy and pandas are extension modules written in C (with a bit of C++, Fortran, and assembly), and the same goes for most database libraries. Extension modules get compiled to native code at install time. And that native code is going to run exactly the same, no matter what interpreter is driving it. In particular, it doesn't go through any kind of JIT in PyPy.* So, unless you have some significant Python computation somewhere, PyPy can't make anything faster.
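
For example, here's a minimal sketch of the kind of check you can run under both interpreters (the array sizes here are arbitrary). A pure-numpy operation like this will typically take about the same time under CPython and PyPy, because all the work happens inside numpy's compiled code:

import time
import numpy as np

# All the heavy lifting happens inside numpy's compiled routines,
# so the interpreter driving it barely matters.
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

start = time.perf_counter()
c = a.dot(b)
print("matmul took", time.perf_counter() - start, "seconds")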

And meanwhile, PyPy actually can make things slower. CPython can access C API extensions like numpy directly, but PyPy has to fake being CPython to talk to the extension code, which it does through a wrapper called CPyExt. The FAQ on numpy in PyPy says that CPyExt is "infamously slow". That's a bit unfair/self-deprecating, especially after all the work they've put in over the past 5 years; for many numpy programs, you won't even notice the difference. But there are still some cases where you will. And you mentioned multiprocessing; sharing arrays across processes is one of the cases that's often still slow.
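
Before blaming (or switching) the interpreter, it's worth checking where the time actually goes. Here's a minimal profiling sketch (I'm assuming your script has an entry-point function; `main` below is just a placeholder for whatever yours is called):

import cProfile
import pstats

# Profile the whole run and show the 20 functions with the most cumulative time.
# If the top entries are pandas/numpy internals or database calls, PyPy has
# little to work with; if they're your own loops (e.g. processTwo), a JIT like
# numba or vectorization is the more promising route.
cProfile.run("main()", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)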

Occasionally, using the numpypy fork (which reimplements the core of numpy in a PyPy-friendly way) is worth doing. As of 2018, that's a deprecated solution (and the last little incomplete bits will probably never be finished), but if for some reason you really need to use numpy and PyPy together, and you're running into one of those slow areas, it's still an option.


* If you need to JIT the numeric code, Jython or IronPython can be used with numeric libraries for the JVM or .NET runtime, which do run through a JIT. However, I don't know of any of them that are actually as fast as numpy for most use cases. Meanwhile, you might want to look at numba with numpy in CPython, which can often JIT the wrapper code that you write to drive your numpy work better than PyPy.
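
To give a rough idea of what that looks like (a sketch only, not your actual algorithm; the function, data, and threshold below are made up for illustration), a nested distance-checking loop over plain numpy arrays is exactly the kind of code numba's `@njit` handles well:

import numpy as np
from numba import njit

@njit
def count_contacts(xs, ys, other_xs, other_ys, threshold):
    # Pure-Python-style nested loops over numpy arrays: slow under CPython,
    # but numba compiles this to native code on the first call.
    count = 0
    for i in range(xs.shape[0]):
        for j in range(other_xs.shape[0]):
            dx = xs[i] - other_xs[j]
            dy = ys[i] - other_ys[j]
            if dx * dx + dy * dy < threshold * threshold:
                count += 1
    return count

# Example usage with random data; pandas columns can be passed in as
# numpy arrays via .values (or .to_numpy() on newer pandas).
xs = np.random.rand(1000)
ys = np.random.rand(1000)
print(count_contacts(xs, ys, xs, ys, 0.05))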

abarnert
  • Reading further into the code, I realised it doesn't spend a lot of time in these libraries. There are in fact extensive computations happening, with nested for loops, which I expected PyPy to speed up. I don't even see him use numpy explicitly, only pandas in the resampleData function. – confused_kid Mar 12 '18 at 03:19
  • @confused_kid If those extensive computations deal with numpy or pandas objects, even though they're looping in pure Python, then you definitely should look into using numba under CPython, even though I relegated that to a footnote. But really, this is exactly why you need to provide an MCVE rather than making us guess over code we can't see. – abarnert Mar 12 '18 at 03:23
  • I have already added the code to my post for your reference – confused_kid Mar 12 '18 at 03:33