
I am really new to reading files over FTP with Python. I searched all over Google and Stack Overflow but could not find a working solution for my question.

My question is below.

I have an FTP location - dummy details below.

  1. Location: ftp.xyz.com
  2. Username: random
  3. Password: password
  4. File name: testreport.csv
  5. File path: rootfolder/testfolder/

The file has 20 columns and 10,000 rows of data.

I am looking to create a dataframe 'df' that captures all the rows from the testreport.csv file using Python. Is there a way to read it directly into the dataframe rather than downloading it first?

Is there a code I can use for this? I am using Python 3. Any help is appreciated.

Vispal Junior
  • What have you tried? I can see plenty of leads just by typing "*read a file from FTP/SFTP python*" into Google – anky Jan 21 '20 at 07:40
  • What about this? [Read FTP file contents in Python and use it at the same time for Pandas and directly](https://stackoverflow.com/q/58320638/850848) – Martin Prikryl Jan 21 '20 at 07:51
  • Hi @anky_91, thank you for the help. I couldn't find any that were CSV related, or that involved dataframes or reading directly rather than downloading – Vispal Junior Jan 21 '20 at 07:51

1 Answer


20 columns and 10,000 rows is not so huge, so unless you are working on an embedded system with tiny storage, the simplest way is to just download the file with ftplib and then load it into a dataframe with read_csv (see NicolasDupuy's answer).
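
A minimal, untested sketch of that download-then-read approach, using the dummy connection details from the question, could be:

import ftplib
import pandas as pd

# Download the file to the local disk, then let read_csv do the parsing
ftp = ftplib.FTP('ftp.xyz.com', 'random', 'password')
ftp.cwd('rootfolder/testfolder')
with open('testreport.csv', 'wb') as f:
    ftp.retrbinary('RETR testreport.csv', f.write)
ftp.quit()

df = pd.read_csv('testreport.csv')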

If you really, for whatever reason, want to avoid storing it on the local disk, it will be a little trickier, because pandas read_csv is not able to read from a stream and requires a plain file. That means you will have to parse the file by hand, or with only the csv module, and then feed a dataframe from that.

Code could be:

import ftplib
import pandas as pd

ftp = ftplib.FTP('ftp.xyz.com', 'random', 'password')
ftp.cwd('rootfolder/testfolder')

first = True     # to identify the header line
data = []
columns = None

def process_row(line):
    # retrlines passes each line as a str with the trailing CRLF stripped
    global first, columns
    if first:
        columns = parse(line)
        first = False
    else:
        data.append(parse(line))

def parse(line):
    # Assume a trivial csv file here - use the csv module for more complex use cases
    return line.strip().split(',')

ftp.retrlines('RETR testreport.csv', process_row)

df = pd.DataFrame(data=data, columns=columns)
# or for automatic conversion to float:
# df = pd.DataFrame(data=data, columns=columns, dtype=float)

Beware:

  • the code above is untested and may contain typos...
  • the code above contains no exception handling and will be unsafe if given incorrect input
  • as you cannot use read_csv, you get no automatic guessing of the column types

Said differently: unless you have a strong reason to avoid it, do not use this approach and just download the file first....

Serge Ballesta
  • I'm not sure you are right – `read_csv` accepts a file-like object. See [Read FTP file contents in Python and use it at the same time for Pandas and directly](https://stackoverflow.com/q/58320638/850848) (I've already posted this link at the question). – Martin Prikryl Jan 21 '20 at 09:38
  • @MartinPrikryl: Well, it accepts a file-like object, but AFAIK requires it to be seekable, which is hard to obtain from an FTP stream. – Serge Ballesta Jan 21 '20 at 09:42
  • But you can download the contents to `BytesIO` or `StringIO`, and then you do not need all that `process_row`/`parse` machinery. This is what the code in the referenced question does. I believe it does the same as your code, but in only 4 lines (a sketch of this approach follows these comments). – Martin Prikryl Jan 21 '20 at 10:29
  • @MartinPrikryl: You are right. This solution is interesting only because it demonstrates how to process the stream *on the fly* without requiring the file to be fully downloaded. But it is almost useless when the goal is to put the data into a pandas dataframe... Anyway, I will leave the answer up for the *on the fly* part. – Serge Ballesta Jan 21 '20 at 13:56
  • Your answer would be useful had it processed the file line by line into Pandas. Then it would be more efficient, as it would prevent having the file in memory twice (once as unparsed text and once parsed in Pandas). – Martin Prikryl Jan 21 '20 at 13:59
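
For completeness, here is a minimal, untested sketch of the in-memory approach discussed in the comments above (again using the dummy connection details from the question): the whole file is downloaded into a BytesIO buffer, which is seekable and can therefore be passed straight to read_csv without touching the local disk.

import ftplib
import io
import pandas as pd

ftp = ftplib.FTP('ftp.xyz.com', 'random', 'password')
ftp.cwd('rootfolder/testfolder')

# Download the whole file into an in-memory, seekable buffer
buf = io.BytesIO()
ftp.retrbinary('RETR testreport.csv', buf.write)
ftp.quit()
buf.seek(0)

df = pd.read_csv(buf)

Note that this still holds the full unparsed file in memory alongside the parsed dataframe, which is the trade-off raised in the last comment.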