
I have a list of issues (Jira issues):

listOfKeys = [id1,id2,id3,id4,id5...id30000]

I want to get the worklogs of these issues; for this I used the jira-python library and this code:

import pandas as pd                          # I used the pandas (pd) lib

listOfWorklogs = pd.DataFrame()              # the collected worklogs end up here
lst = {}                                     # helper dictionary, holds one worklog row at a time
for i in range(len(listOfKeys)):
    worklogs = jira.worklogs(listOfKeys[i])  # getting the list of worklogs
    if len(worklogs) == 0:
        continue                             # nothing logged on this issue, skip it
    else:
        for j in range(len(worklogs)):
            lst = {
                    'self': worklogs[j].self,
                    'author': worklogs[j].author,
                    'started': worklogs[j].started,
                    'created': worklogs[j].created,
                    'updated': worklogs[j].updated,
                    'timespent': worklogs[j].timeSpentSeconds
                }
            listOfWorklogs = listOfWorklogs.append(lst, ignore_index=True)
########### Below there is the recording to the .xlsx file ################

So I simply go into the worklog of each issue in a plain loop, which is equivalent to requesting the link https://jira.mycompany.com/rest/api/2/issue/issueid/worklog and retrieving the information from it.
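For reference, here is a minimal sketch of that raw REST call without jira-python, using the requests library; the basic-auth credentials and the get_worklogs function name are placeholders for illustration:

import requests

def get_worklogs(issue_key):
    # One GET per issue; basic auth is an assumption, use whatever your instance requires
    url = "https://jira.mycompany.com/rest/api/2/issue/" + issue_key + "/worklog"
    response = requests.get(url, auth=("user", "password"))
    response.raise_for_status()
    return response.json()["worklogs"]   # the endpoint wraps the list in a "worklogs" key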

The problem is that there are more than 30,000 such issues, and the loop is very slow (approximately 3 seconds per issue). Can I somehow run multiple loops / processes / threads in parallel to speed up getting the worklogs (maybe without the jira-python library)?

  • Does the listOfWorklogs have to be in order? Can it be ordered afterwards if it needs to be in some order? – Lauro Bravar Jan 22 '19 at 08:07
  • Perhaps I will answer your question with my comment: in the resulting listOfWorklogs there will be a "self" reference in which the issueid is stored, so I did not bother about organizing the dictionary as {issueid1: lst1, issueid2: lst2, ...}. Does listOfWorklogs have to be ordered? Not necessarily. Can it be ordered afterwards? Probably yes. – Vitaliy Gudukhin Jan 22 '19 at 08:37

3 Answers


I recycled a piece of code I made into your code; I hope it helps:

from multiprocessing import Manager, Process, cpu_count

def insert_into_list(worklog, queue):
    lst = {
        'self': worklog.self,
        'author': worklog.author,
        'started': worklog.started,
        'created': worklog.created,
        'updated': worklog.updated,
        'timespent': worklog.timeSpentSeconds
    }
    queue.put(lst)
    return

# Number of cpus in the pc
num_cpus = cpu_count()

# Manager and queue to hold the results
manager = Manager()

# The queue has controlled insertion, so processes don't step on each other
queue = manager.Queue()

# On Windows, everything below should live under an "if __name__ == '__main__':" guard
listOfWorklogs = pd.DataFrame()
for i in range(len(listOfKeys)):
    worklogs = jira.worklogs(listOfKeys[i])    # getting the list of worklogs
    if len(worklogs) == 0:
        continue
    else:
        index = 0

        # This loop replaces your "for j in range(len(worklogs))" loop
        while index < len(worklogs):
            processes = []
            elements = min(num_cpus, len(worklogs) - index)

            # Create a process for each cpu
            for j in range(elements):
                process = Process(target=insert_into_list, args=(worklogs[j + index], queue))
                processes.append(process)

            # Run the processes
            for j in range(elements):
                processes[j].start()

            # Wait for them to finish
            for j in range(elements):
                processes[j].join(timeout=10)

            index += num_cpus

# Dump the queue into the dataframe; DataFrame.append returns a new frame,
# so the result must be reassigned
while queue.qsize() != 0:
    listOfWorklogs = listOfWorklogs.append(queue.get(), ignore_index=True)

This should work and reduce the time by a factor of a little less than the number of CPUs in your machine. You can try changing that number manually for better performance. In any case, I find it very strange that it takes about 3 seconds per operation.

PS: I couldn't try the code because I have no sample data to run it against, so it probably has some bugs.
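Alternatively, since each call spends most of its time waiting on the network, a thread pool may speed things up just as well and avoids pickling the jira objects entirely. A minimal sketch, assuming jira, listOfKeys and pandas are set up as in the question; the worker count of 16 is an arbitrary starting point to tune:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def fetch_rows(key):
    # One REST round trip per issue; returns a list of plain row dicts
    return [{
        'self': w.self,
        'author': w.author,
        'started': w.started,
        'created': w.created,
        'updated': w.updated,
        'timespent': w.timeSpentSeconds
    } for w in jira.worklogs(key)]

# Threads share memory, so the jira client and the results need no pickling
with ThreadPoolExecutor(max_workers=16) as executor:
    results = executor.map(fetch_rows, listOfKeys)

# Flatten the per-issue lists and build the frame in one go
listOfWorklogs = pd.DataFrame([row for rows in results for row in rows])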

Lauro Bravar

I have run into some trouble :(

1) Indentation in the code where the first "for" loop appears and the first "if" statement begins (the "if" and everything below it should be inside the loop, right?):

for i in range(len(listOfKeys)-99):
    worklogs=jira.worklogs(listOfKeys[i])    #getting list of worklogs
    if len(worklogs) == 0:
    ....

2) cmd, the conda prompt and Spyder refused to run your code, giving a Python multiprocessing error: AttributeError: module '__main__' has no attribute '__spec__'. After some googling I had to set __spec__ = None a bit higher in the code (though I'm not sure this is correct), and the error disappeared. By the way, the code in Jupyter Notebook ran without this error, but listOfWorklogs comes out empty, and that is not right.

3) When I corrected the indentation and set __spec__ = None, a new error occurred at processes[i].start(): "PicklingError: Can't pickle: attribute lookup PropertyHolder on jira.resources failed"

If I remove the parentheses from the start and join methods, the code runs, but then I get no entries at all in listOfWorklogs (without the parentheses the methods are only referenced, never called, so the processes never run).

I'm asking for your help again!

  • 1) Yes, sorry about that. 2) I've never seen that error :( 3) Pickle sounds to me like the serialization library; I don't know what that has to do with this, honestly I'm lost on this one too. About the list of worklogs being empty: try printing a couple of queue.get() calls before the last two lines and see what you get. Also, I missed the initialization of the variable "index": an index = 0 is missing right after the else (before the first while loop). – Lauro Bravar Jan 23 '19 at 07:51
  • I think I managed to get around the "PicklingError" problem with some changes in the jira/resources.py file. Also, I declared the index variable and assigned it 0. When I add print(queue.get()) (q means queue, right?) before the last two lines, the program executes, executes, executes... and nothing happens. After 10 minutes of execution I killed the process. – Vitaliy Gudukhin Jan 23 '19 at 10:19
  • Well that's unexpected :( You might have to debug it line by line and see what's exactly going on... – Lauro Bravar Jan 23 '19 at 17:19

How about thinking about it not from a technical standpoint but from a logical one? You know your code works, but at a rate of 3 seconds per issue it would take about 25 hours to get through all 30,000. If you can split up the set of Jira issues that is passed into the script (by date or issue key, for example), you could create several .py files with basically the same code and just pass each one a different list of Jira tickets. Running, say, four of them at the same time would cut the time to 6.25 hours each; a sketch of this approach follows below.
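A sketch of one way to do that with a single parameterized script rather than several copies; the command-line arguments here are hypothetical, and the per-issue loop body would be the same as in the question:

import sys

# Run e.g. "python get_worklogs.py 0 4" ... "python get_worklogs.py 3 4"
# in four terminals; each instance handles a different quarter of the keys
chunk_index = int(sys.argv[1])    # which chunk this instance handles (0-based)
num_chunks = int(sys.argv[2])     # how many instances run in parallel

chunk_size = -(-len(listOfKeys) // num_chunks)    # ceiling division
myKeys = listOfKeys[chunk_index * chunk_size : (chunk_index + 1) * chunk_size]

for key in myKeys:
    worklogs = jira.worklogs(key)    # then process exactly as in the question's loop

Each instance would also want its own output file (e.g. worklogs_0.xlsx) so the parallel runs don't overwrite each other.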

guitarhero23
  • I thought about a similar course of action, but this is not a real solution: if I need to pull 60K worklogs, I would again have to break the retrieval into separate pieces by hand. At the same time I know that this data can be pulled by splitting the work across several threads with just one shell script, so I cannot believe and accept that it is impossible to do this with Python and its heap of libraries. – Vitaliy Gudukhin Jan 22 '19 at 20:28