One very useful feature of fread, the file-reading workhorse of R’s data.table, is its cmd argument: you can build a shell command programmatically and pass it to fread, which reads the command’s output in as a data.table.

This is very powerful for interactive use because the command can be any string. For example, it can be an ssh command that runs on a remote host and defers basic parsing to a simple grep/sed/awk, all in one line, with no need to create temporary directories and files or take extra steps to fetch remote files.

From what I can tell looking at the latest pandas docs there does not appear to be an equivalent in any of the pd.read_* methods. Is it not a feature? Is there maybe an easy equivalent people use instead?

Ronak Shah
Palace Chan
  • 2
    This [blog post](https://op8867555.github.io/posts/2017-10-13-use-your-unix-toolbox-with-pandas.html) explains how to combine the subprocess module with pandas. Pandas does not have a built-in way to read shell output. Another option is to work in IPython and pass the results of your operation to pandas. [pydatatable](https://github.com/h2oai/datatable), which aims to replicate R’s data.table, offers the cmd option. – sammywemmy May 09 '20 at 22:07
  • @sammywemmy awesome, I just tried it and it seems I can get similar succinctness with import datatable as dt followed by dt.fread(cmd=...).to_pandas(). The blog post approach is also a good alternative, but it is more verbose, since you have to create the subprocess, and it is likely slower on large frames. – Palace Chan May 09 '20 at 23:04
  • If you have found a solution to the question you posted, feel free to self-answer it, as others might be looking for the same functionality. – jangorecki May 11 '20 at 10:43

1 Answer

As @sammywemmy pointed out, there are two alternatives. The first, slightly more verbose than the R equivalent, is to use subprocess like this:

import subprocess
import pandas as pd

# Run the shell command and parse its stdout as CSV
with subprocess.Popen("shell_cmd", shell=True, stdout=subprocess.PIPE) as p:
    df = pd.read_csv(p.stdout)
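
If this pattern comes up often, it can be wrapped in a small helper. The following is only a minimal sketch: read_cmd is a hypothetical name (it is not part of pandas), it assumes the command writes CSV-like text to stdout, and the printf command is just a stand-in for a real pipeline such as an ssh/grep/awk one-liner.

```python
import io
import subprocess

import pandas as pd


def read_cmd(cmd, **read_csv_kwargs):
    """Run `cmd` through the shell and parse its stdout with pandas.read_csv.

    `read_cmd` is a hypothetical helper name, not a pandas function.
    """
    # check=True raises CalledProcessError if the command exits non-zero,
    # so a failed command does not silently yield an empty or partial frame.
    result = subprocess.run(cmd, shell=True, check=True, stdout=subprocess.PIPE)
    # The captured output is bytes; wrap it in a file-like object for read_csv.
    return pd.read_csv(io.BytesIO(result.stdout), **read_csv_kwargs)


# Stand-in command that emits a header row and two data rows of CSV text.
df = read_cmd("printf 'a,b\\n1,2\\n3,4\\n'")
```

Unlike the Popen version, this buffers the whole output in memory before parsing, in exchange for surfacing a non-zero exit status as an exception. Extra keyword arguments such as sep or names are passed straight through to pd.read_csv.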

A more efficient and less verbose alternative is to use the datatable package and do something like this:

import datatable as dt
df = dt.fread(cmd="shell_cmd").to_pandas()

You can also opt to work natively with the datatable Frame type.

Pasha
Palace Chan