I'm trying to read XLSX files with PHP, using gneustaetter/XLSXReader to be exact. However, these XLSX files are generated by different companies using different software. So I wanted to check whether they have the right encoding, and every file I looked at was UTF-8.

Therefore my question as above: Are XLSX files UTF-8 encoded by definition? Or are there exceptions that could break the import script I'm working on?

Marco
  • They're XML so presumably there's a character set identifier in the XML header. – tadman Jul 19 '17 at 15:20
  • @tadman there is, but it is consistently set to utf-8 in all files I could find. The question is whether it has to be UTF-8 to call itself an XLSX file. – Marco Jul 19 '17 at 15:22
  • I see, so you'd recommend always checking that tag before importing. But if a customer sends me a Windows-1252 file, can I reject it, stating that this is not a valid Excel file? – Marco Jul 19 '17 at 15:26
  • Why not create a utf-16 XLSX and see if Excel can load it? Try creating a spreadsheet with non-ASCII (Chinese or Urdu) and see what the output encoding is then. – Neil Jul 19 '17 at 15:27
  • @Neil because it's not important whether Excel or anything else can load it. As said before, I need to be sure not to reject customers delivering legit XLSX files, and I need my script to tell me if there's one that is not legit. – Marco Jul 19 '17 at 15:36
  • @engor If the 'primary' creator of XLSX files creates them with utf-16 then I guess you must reject the notion that 'there aren't any non-utf8 files in the wild'. – Neil Jul 19 '17 at 15:38
  • @tadman thanks for the links, you may post them as an answer and I'd accept it. – Marco Jul 19 '17 at 15:38
  • btw please give a reason if you downvote. – Marco Jul 19 '17 at 15:44
  • I think someone's just upset that the question isn't *directly* related to programming. – tadman Jul 19 '17 at 15:53

1 Answer

It'd be risky to presume it's always UTF-8. I'd key your expectations to the encoding declared in the XML header of each part. In my experience, Windows-1252 encoded data shows up all the time when you least expect it. You might check the XLSX specification more closely to find out more.
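
As a rough illustration of checking the XML header, here is a minimal Python sketch (the question is about PHP, where the same idea should port to `ZipArchive`). It treats the XLSX file as a ZIP archive and reports the encoding each XML part declares. The helper names `declared_encoding` and `part_encodings` are made up for this example, and the detection is deliberately simple: it looks only at a byte-order mark and the `encoding` attribute of the XML declaration, and assumes UTF-8 when neither is present.

```python
import re
import zipfile

# Matches the encoding pseudo-attribute of an XML declaration, e.g.
# <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
_DECL_RE = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(head: bytes) -> str:
    """Best-effort guess of an XML part's encoding from its first bytes.

    Hypothetical helper: UTF-16 without a byte-order mark is not handled.
    """
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"  # UTF-16 byte-order mark
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]  # skip a UTF-8 byte-order mark
    match = _DECL_RE.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # XML default when nothing else is declared


def part_encodings(xlsx) -> dict:
    """Map every XML part inside the archive to its declared encoding.

    `xlsx` may be a file path or a binary file-like object.
    """
    result = {}
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            if name.endswith((".xml", ".rels")):
                with archive.open(name) as part:
                    result[name] = declared_encoding(part.read(200))
    return result
```

Calling `part_encodings("customer.xlsx")` (a placeholder file name) returns a dictionary of part names to encoding labels, so an import script could log or transcode any part that is not `utf-8` instead of failing silently.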

Here's a Chromium bug relating to a Windows-1252 encoded XLSX file, so these seem to exist in the wild. Maybe they're produced by programs other than Microsoft Office. With things like LibreOffice becoming more popular, older versions that may not have had the most robust XLSX support might end up interacting with your code. You probably don't want to have a bug like this show up in your code.

Try to be as accommodating as possible unless you have a concrete reason for rejecting an unexpected encoding. JSON, by strict definition, is UTF-8. XLSX seems to be XML by definition, but the encoding is not as nailed down. UTF-8 simply seems to be the default convention.

tadman
  • This page also seems to indicate it is what is described in the XML header... If you change the file extension to txt or open in a text editor you should be able to see... https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/How-to-check-for-encoding-or-formatting-issues-with-Excel/ta-p/397305 – andrew pate Oct 11 '22 at 07:20