I'm running multiple functions, each of which takes one section of a million-line .txt file. Each function has a for loop that runs through every line in its section of the file.

It takes info from each line and checks whether it matches info in 2 other files, one about 50,000-100,000 lines long, the other about 500-1,000 lines long. I check for a match by running for loops through the other 2 files. Once the info matches, I write the output to a new file; all functions write to the same file. The program produces about 2,500 lines a minute, but it slows down the longer it runs. Also, when I run one of the functions on its own, it produces about 500 lines a minute, but when I run it alongside 23 other processes, the total is only 2,500 a minute. Why is that?

Does anyone know why that would happen? Also, is there anything I could import to make the program run or read through the files faster? I am already using the `with ... as file1:` method.

Can the multiprocessing be restructured to run faster?

IAmInPLS
Snoopy_D
  • More threads != faster. It depends on how many cores you have. – Whitefret Apr 05 '16 at 14:08
  • If all your data can fit in memory, you could try processing it with the pandas module; it's very fast and very efficient. Don't forget about the slowest part, the disk I/O system: it will most probably be your bottleneck, not the number of threads. – MaxU - stand with Ukraine Apr 05 '16 at 14:18
  • 1. How many cores do you have? 2. What fraction of the total CPU is this process using? (If it's close to 100%, then more cores won't help.) 3. How do the threads get to the start of "their" section? If they have to read *n* lines first, I'm surprised this doesn't slow things down. – Martin Bonner supports Monica Apr 05 '16 at 14:19
  • I would also suggest opening a new question with a more detailed description, including an anonymized sample of your input data and the expected output. – MaxU - stand with Ukraine Apr 05 '16 at 14:26

1 Answer

Threads can only use the resources your machine has. With 4 cores, 4 threads can each run with a full core's resources. There are a few cases where having more threads than cores improves performance, but this is not one of them. So keep the thread count at the number of cores you have.
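
As a rough illustration of sizing the worker count to the machine, here is a minimal sketch. It assumes the work is CPU-bound and that the lines are already in a list; `split_ranges`, `count_matches`, and `run` are hypothetical names, and `count_matches` is only a stand-in for the real per-section work.

```python
import os
from multiprocessing import Pool


def split_ranges(total_lines, workers):
    """Split range(total_lines) into `workers` contiguous (start, stop) pairs."""
    base, extra = divmod(total_lines, workers)
    ranges, start = [], 0
    for i in range(workers):
        # The first `extra` chunks get one additional line each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def count_matches(chunk):
    # Hypothetical stand-in for the real matching work on one section.
    return sum(1 for line in chunk if "needle" in line)


def run(lines, workers=None):
    # Default to one worker per core instead of a fixed number such as 24.
    workers = workers or os.cpu_count() or 1
    chunks = [lines[a:b] for a, b in split_ranges(len(lines), workers)]
    with Pool(processes=workers) as pool:
        return sum(pool.map(count_matches, chunks))
```

`run(lines)` should be called from under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that start workers by re-importing the main module.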

Also, because you have concurrent access to one output file, you need a lock on that file, which will slow the process down a bit.
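
One way to sidestep the lock, sketched here under assumptions rather than taken from the question's code, is to have each worker return its output lines and let the parent process be the only writer. `find_matches`, `write_all`, and `run` are hypothetical names, and the matching rule inside `find_matches` is a placeholder.

```python
from multiprocessing import Pool


def find_matches(chunk):
    """Hypothetical worker: return output lines instead of writing them."""
    return [line.upper() for line in chunk if line.startswith("x")]


def write_all(result_batches, out):
    """Write every batch through a single file handle; return lines written."""
    written = 0
    for batch in result_batches:
        for line in batch:
            out.write(line + "\n")
            written += 1
    return written


def run(chunks, out_path, workers):
    # Only the parent touches the output file, so no lock is needed.
    with Pool(processes=workers) as pool, open(out_path, "w") as out:
        return write_all(pool.imap_unordered(find_matches, chunks), out)
```

With `imap_unordered`, batches arrive in whatever order the workers finish, so the output order is not the input order; `pool.imap` keeps the order at the cost of some waiting.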

What could be improved, however, is your code for comparing the strings, but that is another question.
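
As one possible direction for the comparison code: looping through the two smaller files for every line of the large one means up to roughly 100,000 + 1,000 comparisons per line. If the match is an exact comparison on some key, the two smaller files can be loaded into sets once, so each line costs two hash lookups instead. The sketch below assumes the key is the first comma-separated field and that a line must match both files; both are guesses about the data, and `load_keys` and `matching_lines` are hypothetical names.

```python
def load_keys(lines):
    """Collect the key field of every non-blank line into a set.

    Assumption: the key is the first comma-separated field.
    """
    return {line.split(",", 1)[0].strip() for line in lines if line.strip()}


def matching_lines(big_lines, keys_a, keys_b):
    """Yield lines from the large file whose key appears in both sets."""
    for line in big_lines:
        key = line.split(",", 1)[0].strip()
        # Set membership is a hash lookup, not a scan of the whole file.
        if key in keys_a and key in keys_b:
            yield line
```

For example, `load_keys` would be called once per smaller file before the main loop, and `matching_lines` would then replace the two inner for loops.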

Whitefret