I am doing a school project in which I am supposed to make a script run faster, as it is currently very slow. For the past few months I had no access to the actual script, so I was testing on a dummy script I wrote that performs the same tasks. With that dummy script I found that PyPy together with multiprocessing made it run at least 10x faster. Upon getting access to the actual script, I applied multiprocessing to it and ran it with PyPy. Surprisingly, the code runs 2x SLOWER under PyPy than without it, instead of showing any improvement. What could the reason be? The actual script uses libraries like numpy and pandas, and makes a database connection to write the output of the process, which is later read by the web server. Is it that numpy and pandas run faster under the standard CPython interpreter than under PyPy? If not, what else could explain this? Any suggestions to make it faster are also welcome :)
P.S. Multiprocessing has already been applied, and it is only about 40 seconds faster than the original code, which is not sufficient.
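For context, the multiprocessing was applied roughly like this (a simplified sketch; splitting the work one chunk per patient and the function names here are illustrative, not the real script):

```python
from multiprocessing import Pool

def find_contacts_for_patient(patient_rows):
    # Placeholder for the per-patient contact search done in processTwo;
    # the real version scans sensor positions, this just doubles values.
    return [r * 2 for r in patient_rows]

def run_parallel(chunks, workers=4):
    # Each chunk (one patient's rows) is processed in a separate worker.
    with Pool(workers) as pool:
        return pool.map(find_contacts_for_patient, chunks)

if __name__ == "__main__":
    print(run_parallel([[1, 2], [3, 4]]))
```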
Edit: adding in the code. It is a script that generates who came into contact with whom, for how long, and where: contact tracing for a hospital. Basically, it reads in a CSV file with the positions of all the sensors at various timings, then an algorithm generates all the contacts and writes them to the database to be picked up by the web server later.
The code is as follows. It is extremely long, which is probably why I didn't post it earlier :)
def resampleData(_beaconDevice, _timeInterval, _locationPtsX, _locationPtsY, database):
    database.child("contact").child("progress").set(20)
    beaconData = pd.DataFrame({'timestamp': _timeInterval,
                               'Device': _beaconDevice,
                               'Beacon Longtitude': _locationPtsX,
                               'Beacon Latitude': _locationPtsY})
    beaconData.set_index('timestamp', inplace=True)
    beaconData.index = pd.to_datetime(beaconData.index)
    # resample to 1-second bins per device, then forward-fill the gaps
    beaconData = (beaconData.groupby('Device')
                  .resample('S')[['Beacon Longtitude', 'Beacon Latitude']]
                  .mean().ffill())
    return beaconData
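For reference, here is the same groupby/resample/ffill pattern on self-contained dummy data of the kind I tested with (the column names mirror the script; the values are made up):

```python
import pandas as pd

beaconData = pd.DataFrame({
    'timestamp': ['2020-01-01 00:00:00', '2020-01-01 00:00:02',
                  '2020-01-01 00:00:00', '2020-01-01 00:00:02'],
    'Device': ['A', 'A', 'B', 'B'],
    'Beacon Longtitude': [1.0, 3.0, 5.0, 7.0],
    'Beacon Latitude': [2.0, 4.0, 6.0, 8.0],
})
beaconData['timestamp'] = pd.to_datetime(beaconData['timestamp'])
beaconData = beaconData.set_index('timestamp')

# 1-second bins per device; the missing second in between is
# forward-filled from the previous reading of the same device
out = (beaconData.groupby('Device')
       .resample('S')[['Beacon Longtitude', 'Beacon Latitude']]
       .mean().ffill())
print(out)
```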
def processTwo(connectedDev, temp, devicelist, increment, _patientlist, _beaconData, _start, _end, _devlist, _scale, database, user, _distance):
    for numPatients, patientName in enumerate(_patientlist):
        timestamp = _beaconData.loc[patientName, :].index.tolist()
        patientX = _beaconData.loc[patientName, 'Beacon Longtitude'].tolist()
        patientY = _beaconData.loc[patientName, 'Beacon Latitude'].tolist()
        progressUnit = 55 / len(timestamp)
        for t, timeNum in enumerate(timestamp):
            if _start <= timeNum <= _end:
                for device, devNum in enumerate(_devlist):
                    if devNum != patientName and devNum in devicelist:
                        logger.log("Finding Contacts...", timeNum)
                        if increment < 55:
                            increment += progressUnit
                            try:
                                database.child("contact").child("progress").set(30 + increment)
                            except Exception:
                                logger.info("exception")
                        isContact, contactLoc = inContact(patientName, patientX, patientY, devNum, t, _beaconData, _scale, _distance)
                        if isContact:
                            logger.log(patientName, "in contact with", devNum, "!")
                            temp.append(patientName)
                            temp.append(timeNum)
                            temp.append(int(devNum))
                            temp.append(patientX[t])
                            temp.append(patientY[t])
                            temp.append(contactLoc[0])
                            temp.append(contactLoc[1])
                            connectedDev.append(temp)
                            temp = []
The processTwo function is one of seven similarly computation-heavy functions in the code. The for loops iterate over DataFrames.
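For reference, this is the kind of vectorisation I have been experimenting with on the dummy script: replacing the innermost per-timestamp proximity check with a single numpy operation over whole arrays (a sketch only; the threshold and coordinates are made up, and the real inContact logic may differ):

```python
import numpy as np

def contact_mask(patient_xy, other_xy, distance):
    """Vectorised proximity test: one boolean per timestamp instead of a
    Python-level loop. patient_xy and other_xy are (N, 2) arrays aligned
    on the same resampled timestamps; distance is the contact radius."""
    diff = patient_xy - other_xy
    return np.hypot(diff[:, 0], diff[:, 1]) <= distance

p = np.array([[0.0, 0.0], [10.0, 10.0]])
q = np.array([[0.5, 0.0], [50.0, 50.0]])
print(contact_mask(p, q, 1.0))  # only the first timestamp is a contact
```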