
I've been working on a "crawler" of sorts that walks our repository and lists directories and files as it goes. For every directory it encounters, it creates a thread that does the same for that directory, and so on recursively. Effectively this creates a very short-lived thread for every directory encountered in the repo. (Requesting information on a single path doesn't take long; there are just tens of thousands of them.)

The logic looks as follows:

import threading
import perforce as Perforce  # custom Perforce class
from pathlib import Path

p4 = Perforce()
p4.connect()

class Dir():
    def __init__(self, path):
        self.dirs = []
        self.files = []
        self.path = path

        self.crawlers = []

    def build_crawler(self):
        worker = Crawler(self)
        # keep a reference on the instance so the thread isn't garbage collected
        self.crawlers.append(worker)
        worker.start()

class File():
    # like Dir, but with less stuff: just a path
    def __init__(self, path):
        self.path = path

class Crawler(threading.Thread):
    def __init__(self, dir):
        threading.Thread.__init__(self)
        self.dir = dir

    def run(self):
        depotdirs = p4.getdepotdirs(self.dir.path)
        depotfiles = p4.getdepotfiles(self.dir.path)

        for p in depotdirs:
            if Path(p).is_dir():
                _d = Dir(p)
                self.dir.dirs.append(_d)

        for p in depotfiles:
            if Path(p).is_file():
                f = File(p)
                self.dir.files.append(f)

        for child in self.dir.dirs:
            child.build_crawler()
            for worker in child.crawlers:
                worker.join()

Obviously this is not complete code, but it represents what I'm doing.

My question really is whether I can create an instance of this Perforce class in the __init__ method of the Crawler class, so that each thread can issue its requests independently. Right now I have to call join() on the created threads so that they wait for completion, to avoid concurrent Perforce calls.

I've tried it out, but it seems like there is a limit to how many connections you can create: I don't have a solid number, but somewhere along the line Perforce just started straight up refusing connections, which I presume is due to the number of concurrent requests.
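To be concrete, this is roughly what I mean by a per-thread connection, with a semaphore bolted on to cap how many connections exist at once. It's just a sketch, not my actual code: MAX_CONNECTIONS is an arbitrary number I'd have to tune against the server, and I'm assuming my Perforce wrapper has a disconnect() method to release the connection:

import threading

MAX_CONNECTIONS = 8  # arbitrary cap; would need tuning against the server
slots = threading.Semaphore(MAX_CONNECTIONS)

class Crawler(threading.Thread):
    def __init__(self, dir):
        threading.Thread.__init__(self)
        self.dir = dir

    def run(self):
        # block here until one of the MAX_CONNECTIONS slots frees up
        with slots:
            p4 = Perforce()  # per-thread connection instead of a shared global
            p4.connect()
            try:
                depotdirs = p4.getdepotdirs(self.dir.path)
                depotfiles = p4.getdepotfiles(self.dir.path)
            finally:
                p4.disconnect()  # assumed API: release the connection ASAP
        # ...build Dir and File objects from the results as before...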

Really, I suppose what I'm asking is two-fold: is there a better way of creating a data model representing a repo with tens of thousands of files than the one I'm using; and is what I'm trying to do possible, and if so, how?

Any help would be greatly appreciated :)

  • Can you add more precision to `Perforce just started straight up refusing connections, which I presume is due to the number of concurrent requests`? What precise error did you receive? Were there messages in the Perforce server log as well? etc. – Bryan Pendleton Sep 09 '16 at 13:43
  • I can't get it to give me the error anymore, unfortunately - but it had to do with connection timeouts. I fixed that by making the perforce connection bit as short-lived as possible, killing it as soon as it's done giving me the data I need. As far as I can tell there were no messages in the perforce server log, no - I did find out how to do what I need another way though, see the answer below. – MaVCArt Sep 09 '16 at 14:23

1 Answer


I found out how to do this (it's infuriatingly simple, as with all simple solutions to overly complicated problems):

To build a data model of Dir and File classes representing a whole depot with thousands of files, just call p4.run("files", "-e", path + "\\..."). This returns a list of every file under path, recursively (the -e flag skips deleted files). From there, all you need to do is iterate over the returned paths and construct the data model. See the sketch below.
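Here's a sketch of how the model gets built from that flat list, reusing the Dir and File classes from the question. I'm assuming run() returns dicts with a 'depotFile' key (as P4Python's run() does) and that every path starts with //depot/; adjust both to your setup:

root = Dir("//depot")

for result in p4.run("files", "-e", "//depot/..."):
    # P4Python returns dicts; fall back to plain strings for other wrappers
    depot_path = result["depotFile"] if isinstance(result, dict) else result
    parts = depot_path[len("//depot/"):].split("/")

    # walk (and create, where missing) a Dir node per intermediate component
    node = root
    for part in parts[:-1]:
        child = next((d for d in node.dirs if d.path.endswith("/" + part)), None)
        if child is None:
            child = Dir(node.path + "/" + part)
            node.dirs.append(child)
        node = child

    # the final path component is the file itself
    node.files.append(File(depot_path))

One pass over the server's answer replaces all the per-directory round trips, which is why it's so much faster than the threaded crawl.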

Hope this helps someone at some point.
