I've been working on a "crawler" of sorts that goes through our repository and lists directories and files as it goes. For every directory it encounters, it creates a thread that does the same for that directory, and so on, recursively. Effectively this creates a very short-lived thread for every directory in the repo (requesting information on a single path doesn't take long; there are just tens of thousands of paths).
The logic looks as follows:
import threading
from pathlib import Path

from perforce import Perforce  # custom Perforce class

p4 = Perforce()
p4.connect()


class File():
    # File is like Dir, but with less stuff: just a path.
    def __init__(self, path):
        self.path = path


class Dir():
    def __init__(self, path):
        self.path = path
        self.dirs = []
        self.files = []
        self.crawlers = []

    def build_crawler(self):
        worker = Crawler(self)
        # keep a reference on the instance so we can join() the thread later
        self.crawlers.append(worker)
        worker.start()


class Crawler(threading.Thread):
    def __init__(self, dir):
        threading.Thread.__init__(self)
        self.dir = dir

    def run(self):
        depotdirs = p4.getdepotdirs(self.dir.path)
        depotfiles = p4.getdepotfiles(self.dir.path)
        for p in depotdirs:
            if Path(p).is_dir():
                self.dir.dirs.append(Dir(p))
        for p in depotfiles:
            if Path(p).is_file():
                self.dir.files.append(File(p))
        for child in self.dir.dirs:
            child.build_crawler()
        for child in self.dir.dirs:
            for worker in child.crawlers:
                worker.join()
Obviously this is not complete code, but it represents what I'm doing.
My question really is whether I can create an instance of this Perforce class in the __init__ method of the Crawler class, so that each thread makes its requests over its own connection. Right now I have to call join() on the created threads and wait for them to finish, to avoid concurrent Perforce calls over the single shared connection.
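To illustrate, this is roughly what I tried (just a sketch: I open the connection in run() rather than __init__ so it's created on the worker thread, and disconnect() is my guess at the cleanup method):

class Crawler(threading.Thread):
    def __init__(self, dir):
        threading.Thread.__init__(self)
        self.dir = dir

    def run(self):
        # per-thread connection instead of the shared module-level p4
        p4 = Perforce()
        p4.connect()
        try:
            depotdirs = p4.getdepotdirs(self.dir.path)
            depotfiles = p4.getdepotfiles(self.dir.path)
            # ... same crawl logic as above ...
        finally:
            p4.disconnect()  # assumption: the class exposes some cleanup method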
I've tried it out, but it seems there's a limit to how many connections you can create: I don't have a solid number, but at some point Perforce just started refusing connections outright, which I presume was due to the number of concurrent requests.
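If that really is just a cap on simultaneous connections, one workaround I'm considering is throttling connection creation with a semaphore, so no more than a fixed number exist at any one time (a sketch; MAX_CONNECTIONS is a guess, not a known Perforce limit):

import threading

MAX_CONNECTIONS = 16  # guess; the real cap is whatever the server tolerates
slots = threading.BoundedSemaphore(MAX_CONNECTIONS)

# inside Crawler:
def run(self):
    with slots:  # blocks until one of the connection slots is free
        p4 = Perforce()
        p4.connect()
        try:
            ...  # same crawl logic as above
        finally:
            p4.disconnect()  # assumed cleanup method, as above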
Really, I suppose what I'm asking is two-fold: is there a better way to build a data model of a repo with tens of thousands of files than the one I'm using, and is what I'm trying to do possible, and if so, how?
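For what it's worth, the alternative I've been sketching is a fixed pool of worker threads pulling paths off a queue instead of one thread per directory, so the number of connections equals the number of workers no matter how many directories there are (a sketch, assuming the same getdepotdirs/getdepotfiles API and that per-thread connections are fine):

import queue
import threading

from perforce import Perforce  # same custom class as above

def crawl(root_path, num_workers=8):
    work = queue.Queue()
    results = {}            # path -> (subdirs, files)
    lock = threading.Lock()

    def worker():
        p4 = Perforce()     # one connection per worker, opened on its own thread
        p4.connect()
        while True:
            path = work.get()
            try:
                subdirs = p4.getdepotdirs(path)
                files = p4.getdepotfiles(path)
                with lock:
                    results[path] = (subdirs, files)
                for d in subdirs:
                    work.put(d)  # discovered directories become new work items
            finally:
                work.task_done()

    work.put(root_path)
    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    work.join()  # returns once every queued path has been processed
    return results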
Any help would be greatly appreciated :)