
I have to deal with some FITS files which contain UTF-8 text in their headers. This means basically none of the functions in the pyfits package work on them. Calling .decode does not work either, because the FITS header is a class, not a string or a list. Does someone know how to decode the header so I can process the data? The actual content is not so important, so something like ignoring the offending letters is fine. My current code looks like this:

hdulist = fits.open('Jupiter.FIT')
hdu = hdulist[0].header
hdu.decode('ascii', errors='ignore')

And I get: AttributeError: 'Header' object has no attribute 'decode'

Functions like:

print (hdu)

return:

ValueError: FITS header values must contain standard printable ASCII characters; "'Uni G\xf6ttingen, Institut f\xfcr Astrophysik'" contains characters/bytes that do not represent printable characters in ASCII.

I thought about overwriting the entry so I don't need to care about it. However, I can't even retrieve which entry contains the bad characters, and I would like a batch solution as I have several hundred files.

brium-brium
  • `'Uni G\xf6ttingen, Institut f\xfcr Astrophysik'` isn't UTF-8, it's (probably) the Latin1 encoding of `'Uni Göttingen, Institut für Astrophysik'` ; the equivalent UTF-8 is `'Uni G\xc3\xb6ttingen, Institut f\xc3\xbcr Astrophysik'` – PM 2Ring Sep 21 '16 at 12:21

2 Answers


As anatoly techtonik pointed out, non-ASCII characters in FITS headers are outright invalid, and a file that contains them is an invalid FITS file. That said, it would be nice if astropy.io.fits could at least read the invalid entries. Support for that is currently broken and needs a champion to fix it. Nobody has done so because it's an infrequent enough problem: most people encounter it in one or two files, fix those files, and move on. I would love for someone to tackle the problem, though.

In the meantime, since you know exactly which string this file is choking on, I would just open the file in raw binary mode and replace the string. If the FITS file is very large, you can read it a block at a time and do the replacement on each block. FITS files (headers included) are written in 2880-byte blocks, and each header card is exactly 80 bytes (2880 = 36 × 80), so a string inside a header card never straddles a block boundary and you don't have to do any parsing of the header format beyond that. Just make sure that the replacement string is no longer than the original, and that if it's shorter it is right-padded with spaces, because FITS headers are a fixed-width format and anything that changes the length of a header will corrupt the entire file. For this particular case, then, I would try something like this:

bad_str = 'Uni Göttingen, Institut für Astrophysik'.encode('latin1')
good_str = 'Uni Gottingen, Institut fur Astrophysik'.encode('ascii')
# In this case I already know the replacement is the same length, so I'm not worried about it.
# A more general solution would require fixing the header parser to deal with non-ASCII bytes
# in some consistent manner. I'm also looking for the full string instead of the individual
# characters so that I don't corrupt binary data in the non-header blocks.
in_filename = 'Jupiter.FIT'
out_filename = 'Jupiter-fixed.fits'

with open(in_filename, 'rb') as inf, open(out_filename, 'wb') as outf:
    while True:
        block = inf.read(2880)
        if not block:
            break
        block = block.replace(bad_str, good_str)
        outf.write(block)

This is ugly, and for a very large file it might be slow, but it's a start. I can think of better solutions, but they are harder to understand and probably not worth the time if you just have a handful of files to fix.
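For a batch of files where the offending string is not known in advance, the same block-by-block idea can be made generic. The following is only a sketch under stated assumptions: the function names are hypothetical, it touches only the primary header (the blocks up to and including the one holding the `END` card), it leaves any extension headers alone, and it replaces every byte outside printable ASCII with `?`. Because the substitution is byte-for-byte, the header length cannot change.

```python
BLOCK = 2880  # FITS files are organised in 2880-byte blocks
CARD = 80     # each header card is exactly 80 bytes


def sanitize_block(block):
    """Replace every byte outside printable ASCII (0x20-0x7E) with b'?'."""
    return bytes(bytearray(b if 0x20 <= b <= 0x7E else 0x3F
                           for b in bytearray(block)))


def has_end_card(block):
    """Return True if one of the 80-byte cards in this block is the END card."""
    return any(block[i:i + CARD].rstrip() == b'END'
               for i in range(0, len(block), CARD))


def sanitize_primary_header(in_filename, out_filename):
    """Copy a FITS file, cleaning only the blocks of the primary header."""
    with open(in_filename, 'rb') as inf, open(out_filename, 'wb') as outf:
        in_header = True
        while True:
            block = inf.read(BLOCK)
            if not block:
                break
            if in_header:
                # Check for END before cleaning, then stop cleaning after it
                last_header_block = has_end_card(block)
                block = sanitize_block(block)
                in_header = not last_header_block
            outf.write(block)
```

Looping over several hundred files is then a matter of calling `sanitize_primary_header` once per input path. Data blocks are copied unchanged, so binary pixel data is not affected.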

Once that's done, please give the originator of the file a stern talking to--they should not be publishing corrupt FITS files.

Iguananaut
  • Using this code I get: SyntaxError: Non-ASCII character '\xc3' in file ....py on line 103, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details. Basically I can't define the badstring in my python file. – brium-brium Sep 23 '16 at 15:30
  • I mean I would have just done this in the IPython interpreter or something, but if you want to put it in a file you can do as it says in PEP-263 and add `# coding=utf-8` at the top of the file. In Python 3 this is the default I think. – Iguananaut Sep 24 '16 at 12:04
  • Now I get: bad_str = 'Uni Göttingen, Institut für Astrophysik'.encode('latin1') UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128) Sorry if I seem dumb I am quite new to python. – brium-brium Sep 26 '16 at 15:46
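The `UnicodeDecodeError` in the last comment looks like Python 2 behaviour: there a plain `'...'` literal is already a byte string, so calling `.encode('latin1')` on it first triggers an implicit ASCII decode, which fails on the non-ASCII bytes. A minimal sketch that avoids the issue is to spell the byte strings with escapes so the source file stays pure ASCII. It assumes the bytes in the file are exactly the ones shown in the `ValueError` message from the question (`\xf6` and `\xfc`).

```python
# Latin-1 bytes as reported in the ValueError message; no encoding
# declaration is needed because the source contains only ASCII characters.
bad_str = b'Uni G\xf6ttingen, Institut f\xfcr Astrophysik'
good_str = b'Uni Gottingen, Institut fur Astrophysik'

# The replacement must not change the header length.
assert len(bad_str) == len(good_str)
```

These two definitions can stand in for the `bad_str` and `good_str` lines of the answer; the `b'...'` prefix is accepted by both Python 2.6+ and Python 3.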

Looks like PyFITS just doesn't support it (yet?)

From https://github.com/astropy/astropy/issues/3497:

FITS predates unicode and has never been updated to support anything beyond the ASCII printable characters for data. It is impossible to encode non-ASCII characters in FITS headers.

anatoly techtonik
  • is there a way to find out which entries are corrupted and just overwrite them? – brium-brium Sep 21 '16 at 18:06
  • @user6858243 it is possible to filter file contents with https://pymotw.com/2/codecs/#error-handling, but I suggest you ask the developers how to handle that properly - https://github.com/astropy/astropy/issues/3497 – anatoly techtonik Sep 22 '16 at 06:06
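As for finding which entries are corrupted, one option is to skip the FITS parser and scan the raw header cards directly. The sketch below is an assumption-laden illustration, not part of pyfits or astropy: `find_bad_cards` is a hypothetical helper that reads the primary header 80 bytes at a time, stops at the `END` card, and reports each card containing a byte outside printable ASCII. The keyword is decoded with the `'replace'` error handler so that the decode itself cannot fail.

```python
def find_bad_cards(filename):
    """Return (card_index, keyword, raw_card) for each primary-header card
    that contains a byte outside printable ASCII (0x20-0x7E)."""
    bad = []
    with open(filename, 'rb') as f:
        index = 0
        while True:
            card = f.read(80)  # header cards are exactly 80 bytes
            if len(card) < 80 or card.rstrip() == b'END':
                break
            if any(b < 0x20 or b > 0x7E for b in bytearray(card)):
                # The keyword occupies the first 8 bytes of the card
                keyword = card[:8].decode('ascii', 'replace').strip()
                bad.append((index, keyword, card))
            index += 1
    return bad
```

The raw card bytes are returned alongside the keyword so the offending value can be inspected, or used as the search string for a byte-level replacement.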