I recently wrote a Python script that uploads local, newline-delimited JSON files to a BigQuery table. It closely follows the example in the official documentation. The problem is that non-ASCII characters in the file I'm trying to upload make my POST request fail with a UnicodeEncodeError.

Here's the relevant part of the script...

def upload(dataFilePath, loadJob, recipeJSON, logger):
    body = '--xxx\n'
    body += 'Content-Type: application/json; charset=UTF-8\n\n'
    body += loadJob
    body += '\n--xxx\n' 
    body += 'Content-Type: application/octet-stream\n\n'

    dataFile = io.open(dataFilePath, 'r', encoding = 'utf-8')
    body += dataFile.read()
    dataFile.close()

    body += '\n--xxx--\n'

    credentials = buildCredentials(recipeJSON['keyPath'], recipeJSON['accountEmail'])
    http = httplib2.Http()
    http = credentials.authorize(http)
    service = build('bigquery', 'v2', http=http)

    projectId = recipeJSON['projectId']

    url = BIGQUERY_URL_BASE + projectId + "/jobs"

    headers = {'Content-Type': 'multipart/related; boundary=xxx'}
    response, content = http.request(url, method="POST", body=body, headers=headers)

...and here's the stack trace I get when it runs...

Traceback (most recent call last):
  File "/usr/local/uploader/upload_data.py", line 179, in <module>
    main(sys.argv)
  File "/usr/local/uploader/upload_data.py", line 170, in main
    if (upload(unprocessedFile, loadJob, recipeJSON, logger)):
  File "/usr/local/uploader/upload_data.py", line 100, in upload
    response, content = http.request(url, method="POST", body=body, headers=headers)
  File "/usr/local/lib/python2.7/site-packages/oauth2client/util.py", line 128, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/oauth2client/client.py", line 490, in new_request
    redirections, connection_type)
  File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1570, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1317, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python2.7/site-packages/httplib2/__init__.py", line 1253, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/usr/local/lib/python2.7/httplib.py", line 973, in request
    self._send_request(method, url, body, headers)
  File "/usr/local/lib/python2.7/httplib.py", line 1007, in _send_request
    self.endheaders(body)
  File "/usr/local/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/usr/local/lib/python2.7/httplib.py", line 833, in _send_output
    self.send(message_body)
  File "/usr/local/lib/python2.7/httplib.py", line 805, in send
    self.sock.sendall(data)
  File "/usr/local/lib/python2.7/ssl.py", line 229, in sendall
    v = self.send(data[count:])
  File "/usr/local/lib/python2.7/ssl.py", line 198, in send
    v = self._sslobj.write(data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4586-4611: ordinal not in range(128)

I'm using Python 2.7 and the following libraries: distribute (0.6.36), google-api-python-client (1.1), httplib2 (0.8), oauth2client (1.1), pyOpenSSL (0.13), python-gflags (2.0), wsgiref (0.1.2).

Has anyone else had this problem?

It seems like httplib2's request method takes "body" as a string, which means that it later needs to be encoded before being sent over the wire. I've been searching for a way to override the encoding to UTF-8, but no luck so far.

Thanks in advance!

EDIT:

I was able to resolve this by doing two things:

1. Reading the contents of my file as raw bytes, with no decoding. (I could also have just encoded the whole "body" in my first attempt above.)
2. Encoding the URL and the headers to bytes.

The code ended up looking like this:

def upload(dataFilePath, loadJob, recipeJSON, logger):
    part_one = '--xxx\n'
    part_one += 'Content-Type: application/json; charset=UTF-8\n\n'
    part_one += loadJob
    part_one += '\n--xxx\n'
    part_one += 'Content-Type: application/octet-stream\n\n'

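    # Read the data file as raw bytes; no decoding is needed.
    dataFile = io.open(dataFilePath, 'rb')
    part_two = dataFile.read()
    dataFile.close()

    part_three = '\n--xxx--\n'

    # Encode the text parts so the whole body is a byte string.
    body = part_one.encode('utf-8')
    body += part_two
    body += part_three.encode('utf-8')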

    credentials = buildCredentials(recipeJSON['keyPath'], recipeJSON['accountEmail'])
    http = httplib2.Http()
    http = credentials.authorize(http)
    service = build('bigquery', 'v2', http=http)

    projectId = recipeJSON['projectId']

    url = BIGQUERY_URL_BASE + projectId + "/jobs"

    headers = {'Content-Type'.encode('utf-8'): 'multipart/related; boundary=xxx'.encode('utf-8')}
    response, content = http.request(url.encode('utf-8'), method="POST", body=body, headers=headers)
Caleb Rackliffe

1 Answer

io.open() will open the file as unicode text. Either use plain open(), or use binary mode:

dataFile = io.open(dataFilePath, 'rb')

You are sending the file contents straight out over the network, so you need to send bytes, not Unicode. As you found out, mixing Unicode and bytes leads to painful errors: when the two types are combined, Python 2 implicitly converts between them using the ASCII codec, which fails on any non-ASCII data. There is no need to decode to Unicode at all here.
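
As a minimal sketch of keeping everything as bytes (not the poster's actual code), the multipart body could be assembled as shown here. The function name build_multipart_body and its parameters are hypothetical; the boundary and part headers are copied from the question.

```python
import io


def build_multipart_body(load_job_json, data_file_path, boundary=b'xxx'):
    """Assemble a multipart/related body made only of byte strings.

    Hypothetical helper for illustration; the name and signature are
    assumptions, not part of any library.
    """
    # Encode the job configuration only if it arrived as text.
    if not isinstance(load_job_json, bytes):
        load_job_json = load_job_json.encode('utf-8')

    # Read the data file in binary mode so its bytes pass through untouched.
    with io.open(data_file_path, 'rb') as data_file:
        data = data_file.read()

    # Every piece below is bytes, so joining never triggers an implicit
    # ASCII conversion.
    return b''.join([
        b'--' + boundary + b'\n',
        b'Content-Type: application/json; charset=UTF-8\n\n',
        load_job_json,
        b'\n--' + boundary + b'\n',
        b'Content-Type: application/octet-stream\n\n',
        data,
        b'\n--' + boundary + b'--\n',
    ])
```

The result can then be passed as the body argument of http.request in place of the concatenated string from the question.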

Martijn Pieters
  • Yes, ultimately, it makes no sense to have to decode the contents of the file. I tried to read just the binary data into "body", but that generated: `... _send_request self.endheaders(body) File "/usr/local/lib/python2.7/httplib.py", line 969, in endheaders self._send_output(message_body) File "/usr/local/lib/python2.7/httplib.py", line 827, in _send_output msg += message_body UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 4586: ordinal not in range(128)` – Caleb Rackliffe May 21 '13 at 23:35
  • Make sure **everything** is bytes. Since you are using JSON data to build the credentials, for example, that could well mean `credentials` is Unicode too. The same applies to `url`. – Martijn Pieters May 21 '13 at 23:38
  • Thanks, @Martijn. I ended up converting the headers and url to bytes and it worked. It looks like the credentials object, which is of the type SignedJwtAssertionCredentials, wasn't causing a problem. – Caleb Rackliffe May 22 '13 at 00:06
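
The conversion of the URL and headers described in the comments above could be sketched roughly like this. The helper names to_bytes and bytes_headers are hypothetical, and the approach is assumed to apply to the Python 2.7 setup in this thread, where the request line, headers, and body end up concatenated into one byte string.

```python
def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding it only if it is text.

    Hypothetical helper for illustration, not part of httplib2.
    """
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def bytes_headers(headers):
    """Return a copy of headers with every name and value as bytes."""
    return dict((to_bytes(name), to_bytes(value))
                for name, value in headers.items())
```

With these, the call from the question would pass to_bytes(url) and bytes_headers(headers) to http.request instead of encoding each piece inline.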