9

How can I make os.walk traverse the directory tree of an FTP database (located on a remote server)? The way the code is structured now is (comments provided):

import fnmatch, os, ftplib

def find(pattern, startdir=os.curdir): #find function taking variables for both desired file and the starting directory
    for (thisDir, subsHere, filesHere) in os.walk(startdir): #each of the variables change as the directory tree is walked
        for name in subsHere + filesHere: #going through all of the files and subdirectories
            if fnmatch.fnmatch(name, pattern): #if the name of one of the files or subs is the same as the inputted name
                fullpath = os.path.join(thisDir, name) #fullpath equals the concatenation of the directory and the name
                yield fullpath #return fullpath but anew each time

def findlist(pattern, startdir = os.curdir, dosort=False):
    matches = list(find(pattern, startdir)) #find with arguments pattern and startdir put into a list data structure
    if dosort: matches.sort() #isn't dosort automatically False? Is this statement any different from the same thing but with a line in between
    return matches

#def ftp(
#specifying where to search.

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print (name)

I am thinking that I need to define a new function (i.e., def ftp()) to add this functionality to the code above. However, I am afraid that the os.walk function will, by default, only walk the directory trees of the computer that the code is run from.

Is there a way that I can extend the functionality of os.walk to be able to traverse a remote directory tree (via FTP)?

Mazdak
warship
  • https://pypi.python.org/pypi/ftptool/0.5.1 – Joran Beasley Jul 16 '15 at 21:57
  • I'm trying to avoid any interfaces beyond `ftplib`. Is this possible to do? Disclaimer: I've already tried `ftptool` and could not get it to do what I want. As such, the code above is a Python rehash of the Linux `find` command. I'm trying to extend it by incorporating an FTP switch to `os.walk`. – warship Jul 16 '15 at 22:01
  • If someone can show me how to reimplement this in `ftptool` in a way that works for remote FTP databases, I will accept this as an answer as well. – warship Jul 16 '15 at 22:10
  • what are you trying to actually do? what do you mean "couldnt get it to do what you want"? what do you mean by remote ftp database? – Joran Beasley Jul 16 '15 at 22:30
  • When I use the `find` command in the Terminal, it by default searches the directory tree structure of my system (usually starting from the home directory). However, I am looking for a way to tell `find` to search the directory tree structure of a remote directory tree (such as any FTP database available on the web). Usually, you would need to open up a web browser and navigate to this FTP site. However, I would like to connect to it externally via a Python script and then use the `find` command to search it. – warship Jul 16 '15 at 22:51
  • My question though pertains only to the searching. I've already written code to connect to an FTP website from the Terminal. – warship Jul 16 '15 at 22:53
  • so you want to get a list of all paths that match a file pattern anywhere in the whole system? `locate fname` is much much faster and it runs on most linux machines – Joran Beasley Jul 16 '15 at 23:12
  • Yes, exactly. Should I use `locate fname` in the code of your answer anywhere? – warship Jul 16 '15 at 23:21

4 Answers

7

All you need is Python's ftplib module. os.walk() traverses a tree top-down and depth-first: at each step it lists the directory and file names of the current directory, yields them, and then recurses into each subdirectory in turn. You need to do the same over the FTP connection. I implemented this algorithm about two years ago as the heart of FTPwalker, a package for traversing extremely large directory trees through FTP.

import posixpath as ospath  # FTP paths always use "/", even when this runs on Windows


class FTPWalk:
    """
    This class contains the functions for traversing an FTP server's
    directory tree top-down and depth-first, in the style of os.walk.
    """
    def __init__(self, connection):
        self.connection = connection

    def listdir(self, _path):
        """
        return files and directory names within a path (directory)
        """

        file_list, dirs, nondirs = [], [], []
        try:
            self.connection.cwd(_path)
        except Exception as exp:
            print ("the current path is : ", self.connection.pwd(), exp.__str__(),_path)
            return [], []
        else:
            self.connection.retrlines('LIST', lambda x: file_list.append(x.split()))
            for info in file_list:
                ls_type, name = info[0], info[-1]
                if ls_type.startswith('d'):
                    dirs.append(name)
                else:
                    nondirs.append(name)
            return dirs, nondirs

    def walk(self, path='/'):
        """
        Walk through the FTP server's directory tree top-down and
        depth-first, yielding (path, dirs, nondirs) like os.walk.
        """
        dirs, nondirs = self.listdir(path)
        yield path, dirs, nondirs
        for name in dirs:
            path = ospath.join(path, name)
            yield from self.walk(path)
            # In python2 use:
            # for path, dirs, nondirs in self.walk(path):
            #     yield path, dirs, nondirs
            self.connection.cwd('..')
            path = ospath.dirname(path)

To use this class, create a connection object with the ftplib module, pass that object to FTPWalk, and loop over the walk() method:

In [2]: from test import FTPWalk

In [3]: import ftplib

In [4]: connection = ftplib.FTP("ftp.uniprot.org")

In [5]: connection.login()
Out[5]: '230 Login successful.'

In [6]: ftpwalk = FTPWalk(connection)

In [7]: for i in ftpwalk.walk():
   ...:     print(i)
   ...:
('/', ['pub'], [])
('/pub', ['databases'], ['robots.txt'])
('/pub/databases', ['uniprot'], [])
('/pub/databases/uniprot', ['current_release', 'previous_releases'], ['LICENSE', 'current_release/README', 'current_release/knowledgebase/complete', 'previous_releases/', 'current_release/relnotes.txt', 'current_release/uniref'])
('/pub/databases/uniprot/current_release', ['decoy', 'knowledgebase', 'rdf', 'uniparc', 'uniref'], ['README', 'RELEASE.metalink', 'changes.html', 'news.html', 'relnotes.txt'])
...
...
...
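
To tie this back to the question, the find function there could be pointed at a walker like this instead of os.walk. The following is only a minimal sketch under that assumption: ftp_find and FakeWalker are hypothetical names introduced for illustration, and FakeWalker merely replays two of the tuples printed above so the example runs without a server.

```python
import fnmatch
import posixpath


def ftp_find(walker, pattern, startdir='/'):
    """Yield remote paths under startdir whose base name matches pattern.

    `walker` is assumed to be any object with an os.walk-style
    walk(path) method, such as the FTPWalk class above.
    """
    for this_dir, subs_here, files_here in walker.walk(startdir):
        for name in subs_here + files_here:
            if fnmatch.fnmatch(name, pattern):
                # FTP paths use forward slashes regardless of the local OS.
                yield posixpath.join(this_dir, name)


class FakeWalker:
    """Hypothetical stand-in for FTPWalk so the sketch runs offline."""

    def walk(self, path='/'):
        yield '/', ['pub'], []
        yield '/pub', ['databases'], ['robots.txt']


matches = list(ftp_find(FakeWalker(), '*.txt'))
print(matches)  # ['/pub/robots.txt']
```

With a real connection you would pass FTPWalk(connection) in place of FakeWalker().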
Mazdak
  • It should be noted that using backslashes with FTP servers doesn't always work. Instead, you need to ensure that `os.path.join` doesn't join a path with `\ `. To do this, replace line 40: `path = ospath.join(path, name)` with `path = ospath.join(path, name).replace("\\", "/")`. Worth noting this is only an issue with windows, because the geniuses at Microsoft decided to use backslashes for directories, and `os.path.join` intelligently joins paths based on the OS. – Recessive Sep 30 '21 at 04:29
  • This doesn't seem to handle directories with spaces in the name – user1379351 Aug 31 '22 at 20:28
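
Both comments above (backslashes on Windows, names containing spaces) trace back to building paths with os.path and splitting LIST output on whitespace. One possible way around both, assuming the server supports the MLSD command (not every server does), is ftplib's FTP.mlsd() combined with posixpath. This is a sketch rather than a drop-in replacement: ftp_walk_mlsd and FakeFTP are hypothetical names, and FakeFTP only imitates the (name, facts) pairs that mlsd() yields so the example runs without a network.

```python
import posixpath


def ftp_walk_mlsd(connection, top='/'):
    """os.walk-style generator built on MLSD instead of parsing LIST."""
    dirs, nondirs = [], []
    for name, facts in connection.mlsd(top):
        kind = facts.get('type', '').lower()
        if kind == 'dir':
            dirs.append(name)
        elif kind not in ('cdir', 'pdir'):  # skip "." and ".." entries
            nondirs.append(name)
    yield top, dirs, nondirs
    for name in dirs:
        # posixpath always joins with "/", whatever the local OS is.
        yield from ftp_walk_mlsd(connection, posixpath.join(top, name))


class FakeFTP:
    """Hypothetical stand-in exposing only mlsd(), for a dry run."""

    tree = {
        '/': [('.', {'type': 'cdir'}),
              ('my docs', {'type': 'dir'}),
              ('a.txt', {'type': 'file'})],
        '/my docs': [('notes 1.txt', {'type': 'file'})],
    }

    def mlsd(self, path):
        return iter(self.tree[path])


result = list(ftp_walk_mlsd(FakeFTP()))
for entry in result:
    print(entry)
```

Because MLSD returns each name as a single field, names with spaces come through intact. For a real server, pass an ftplib.FTP instance that is already logged in.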
0

I needed a function like os.walk for FTP and there was not one, so I thought it would be useful to write it. For future reference, you can find the latest version here.

By the way, here is the code that does that:

def FTP_Walker(FTPpath, localpath):
    """Yield the path of every file below FTPpath, using the global `ftp` connection."""
    os.chdir(localpath)
    for item in ftp.nlst(FTPpath):
        if is_file(item):
            yield item
        else:
            # item is a directory: recurse into it
            yield from FTP_Walker(item, localpath)
    os.chdir(localpath)


def is_file(filename):
    current = ftp.pwd()
    try:
        ftp.cwd(filename)
    except Exception as e :
        ftp.cwd(current)
        return True

    ftp.cwd(current)
    return False

How to use:

First connect to your host:

import os
from ftplib import FTP

host_address = "my host address"
user_name = "my username"
password = "my password"


ftp = FTP(host_address)
ftp.login(user=user_name,passwd=password)

Now you can call the function like this:

ftpwalk = FTP_Walker("FTP root path", "path to local")  # "path to local" is not really used yet (I will improve it in future versions), so you can just pass '/' for it

Then, to print and download the files, you can do something like this:

for item in ftpwalk:
    # download the file into the current local directory
    local_name = os.path.join(os.getcwd(), item.split('/')[-1])
    with open(local_name, "wb") as fh:
        ftp.retrbinary("RETR " + item, fh.write)
    print(item)  # print the remote file path

(I will add more features soon, so if you need something specific or have an idea that would be useful for users, I'll be happy to hear it.)

hossein hayati
0

I wrote a library, walk-sftp (pip install walk-sftp). Even though it is named walk-sftp, it includes a WalkFTP class that lets you filter files by start_date and end_date. You can also pass in a processing_function that returns True or False, to check whether your process for cleaning and storing the data worked. It also has a log parameter (pass a filename) that uses pickle to keep track of progress, so you don't overwrite anything or have to track dates yourself, which makes backfilling easier.

https://pypi.org/project/walk-sftp/

-2

I'm going to assume this is what you want, although really I have no idea:

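import paramiko

ssh = paramiko.SSHClient()
ssh.load_system_host_keys()  # the server must already be in your known_hosts
ssh.connect(server, username=username, password=password)
ssh_stdin, ssh_stdout, ssh_stderr = ssh.exec_command("locate my_file.txt")
print(ssh_stdout.read().decode())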

This requires the remote server to have the mlocate package installed and its database built: `sudo apt-get install mlocate; sudo updatedb`

Joran Beasley
  • Some databases I'm connecting to have this error: `paramiko.ssh_exception.SSHException: Server 'ftp.server.org' not found in known_hosts`. Does this mean I can't ssh to them using paramiko? I will try the `mlocate` approach and post an update. – warship Jul 16 '15 at 23:31
  • @warship Errors like that are to be expected with this protocol: the essence of SSH is connecting securely. – Mazdak May 05 '17 at 07:47