-2

Note: I have solve this problem as per below:

I can use to_csv to write to stdout in python / pandas. Something like this works fine:

final_df.to_csv(sys.stdout, index=False)

I would like to read in an actual excel file (not a csv). I want to output CSV, but input xlsx. I have this file

bls_df = pd.read_excel(sys.stdin, sheet_name="MSA_dl", index_col=None)

But that doesn't seem to work. Is it possible to do what I'm trying and, if so, how does one do it?

Notes:

  1. The actual input file is "MSA_M2018_dl.xlsx" which is in the zip file https://www.bls.gov/oes/special.requests/oesm18ma.zip.

I download and extract the datafile like this:

curl -o oesm18ma.zip'https://www.bls.gov/oes/special.requests/oesm18ma.zip'
7z x oesm18ma.zip
  1. I have solved the problem as follows, with script test01.py that reads from stdin and writes to stdout. NOTE the use of sys.stdin.buffer in the read_excel() call.

    import sys import os import pandas as pd

    BLS_DF = pd.read_excel(sys.stdin.buffer, sheet_name="MSA_dl", index_col=None)

    BLS_DF.to_csv(sys.stdout, index=False)

  2. I invoke this as:

    cat MSA_M2018_dl.xlsx | python3 test01.py

  3. This is a small test program to illustrate the idea while removing complexity. It's not the actual program I'm working on.

elbillaf
  • 1,952
  • 10
  • 37
  • 73
  • Is it safe to assume that the sys.stdin is a string of the excel file you would like to read? – MattR Jun 05 '19 at 13:52
  • I'm not sure what you mean by "string." It is the actual contents of the excel file. Say my program above was called foo.py, then I would like to invoke something like: cat excelfile.xlsx | python3 foo.py – elbillaf Jun 05 '19 at 13:57
  • string as in, the `type()` that returns string. You're going to need to provide us more background. Sample data, and more code. I don't understand when you say "it's the actual content of the excel file" – MattR Jun 05 '19 at 13:59
  • 1
    Of course. excel files are binary not text. – Stop harming Monica Jun 05 '19 at 16:12
  • 1
    I can read excel from stdin using pandas, see https://stackoverflow.com/q/59468669/1782641 My problem is that pandas only seems to successfully read from stdin when using `< /path/to/file.xlsx` and not `cat /path/to/file.xlsx |` – dnk8n Dec 25 '19 at 09:16

1 Answers1

1

Basing on this answer, a possibility would be:

import sys
import pandas as pd
import io

csv = ""
for line in sys.stdin:
    csv += line

df = pd.read_csv(io.StringIO(csv))
DLM
  • 555
  • 5
  • 10
  • The source file is an xlsx, not csv. – elbillaf Jun 05 '19 at 15:35
  • You can't simply use cat to output an xlsx file, you either need to convert it first, for example using xlsx2csv, and then treat it like a csv or need to find a way to provide it as a stream of bytes i suppose – DLM Jun 05 '19 at 16:05
  • I think the xlsx2csv would work, though I did find a way of doing it directly as I noted in my revised question above. – elbillaf Jun 05 '19 at 16:09