I've been stuck on this for way too long. I tried to decode the byte object received from the request. When I try to decode to UTF-8 and print, I don't see the string representation of the byte object. What am I missing here?

import urllib.request

url = 'https://www2.census.gov/geo/docs/reference/codes/files/national_cousub.txt'

data = urllib.request.urlopen(url)

counter = 0
for line in data:

    print('byte string:')
    print(line)
    print('after decoding:')
    print(line.decode('utf-8'))

    counter = counter + 1
    if counter > 5:
        break

This is what I see on console:

byte string:
b'STATE,STATEFP,COUNTYFP,COUNTYNAME,COUSUBFP,COUSUBNAME,FUNCSTAT\r\r\n'
after decoding:


byte string:
b'AL,01,001,Autauga County,90171,Autaugaville CCD,S\r\r\n'
after decoding:


byte string:
b'AL,01,001,Autauga County,90315,Billingsley CCD,S\r\r\n'
after decoding:


byte string:
b'AL,01,001,Autauga County,92106,Marbury CCD,S\r\r\n'
after decoding:


byte string:
b'AL,01,001,Autauga County,92628,Prattville CCD,S\r\r\n'
after decoding:


byte string:
b'AL,01,003,Baldwin County,90207,Bay Minette CCD,S\r\r\n'
after decoding:

I am on Windows 10 with Python 3.5.5, installed via Anaconda, and I am running this in PyCharm.

sys.stdout.encoding is 'UTF-8'.

I get the same results with print(line.decode('utf-8'), file=sys.stderr).

J.Oh
  • I cannot reproduce your error. – DYZ Aug 05 '18 at 00:06
  • I am on Windows 10. Python version 3.5.5. I install python via anaconda. I am running this in PyCharm. sys.stdout.encoding = 'UTF-8' Same results with print(line.decode('utf-8'), file=sys.stderr) – J.Oh Aug 05 '18 at 00:22
  • As a side note, the tag `python-requests` is for the `requests` library, but you're not using that; you're using `urllib.request`. – abarnert Aug 05 '18 at 00:22

1 Answer

Your strings all end with \r\r\n. This is wrong, but (a) it's not your fault but the census website's fault, and (b) it shouldn't be causing this problem.

Assuming you're on Windows, the \r\n at the end is a normal newline. But the extra \r before it, without a \n, is a bare carriage return that moves the cursor back to the start of the current line. Printing the \r\n newline after that then overwrites the rest of the line.
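To check that the decoding itself is fine and only the display is off, it can help to print the repr of the decoded string instead of the string. A minimal sketch, using one of the byte strings from the question as hard-coded sample data (the names raw and decoded are just for illustration):

```python
# Sample line copied from the question's console output; no network needed.
raw = b'AL,01,001,Autauga County,90171,Autaugaville CCD,S\r\r\n'

decoded = raw.decode('utf-8')

# repr() renders control characters as escape sequences instead of sending
# them to the console, so the stray carriage return stays visible.
print(repr(decoded))

# The text is all there; it simply ends with \r\r\n.
print(decoded.endswith('\r\r\n'))
print(len(decoded) == len(raw))
```

If the repr shows the full text, the data was decoded correctly and the missing output is purely a matter of how the console handles the carriage return.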

That last part is what shouldn't happen. Printing a newline should just move to the next line. You can see that by running this program at the Windows command line, in a macOS or Linux terminal, or on repl.it.

But you're running in PyCharm, with your output going to the PyCharm debugging console. The PyCharm debugging console doesn't work like a complete terminal emulator, and one of the differences is, apparently, that it handles \r strangely. This question has more information about that. (And the same thing happens in other JetBrains IDEs, like printing the same text with Java in IntelliJ, just as you'd expect.)

There doesn't seem to be a fix for the debugging console; that's just how it works.

You can send output to PyCharm's terminal output instead of its debugging window, or run the program in its terminal, or use your Windows command prompt instead of PyCharm, or use a different IDE… but all of those mean you can't use the PyCharm debugging console for debugging, which may not be a tradeoff worth making.

If you want to work around the problem without changing your setup, the simplest solution is to remove those extra \r characters:

print(line.decode('utf-8').replace('\r\r\n', '\r\n'))

Or, better, as suggested by aldo in the comments, call either strip or rstrip to remove all those newline-ish characters. If you want the line to end with a proper newline (so you still get a blank line after each line):

print(line.decode('utf-8').rstrip()+'\n')

And if you don’t, it’s even simpler:

print(line.decode('utf-8').rstrip())
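One detail worth knowing: rstrip() with no arguments removes all trailing whitespace, including spaces and tabs, while rstrip('\r\n') removes only trailing carriage returns and newlines. A small sketch of the difference, using a sample line with two trailing spaces added purely for illustration (the real file may not contain any):

```python
# Hypothetical sample: a line from the question with two spaces added
# before the line ending, to show how the two calls differ.
raw = b'AL,01,001,Autauga County,90171,Autaugaville CCD,S  \r\r\n'
text = raw.decode('utf-8')

# No argument: strips every kind of trailing whitespace.
stripped_all = text.rstrip()

# Explicit character set: strips only trailing \r and \n characters.
stripped_eol = text.rstrip('\r\n')

print(repr(stripped_all))
print(repr(stripped_eol))
```

For this data either call should behave the same; the distinction only matters if trailing spaces are meaningful.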
abarnert
  • Calling strip() on the decoded string might be even easier and more flexible. – aldo Aug 05 '18 at 00:45
  • @aldo Yeah, if they don’t want a newline at the end in the first place, that’s definitely better—and, now that you bring it up, even if they _do_ want the newline, it might still be better to strip it and add a proper one… I’ll edit the answer; thanks. – abarnert Aug 05 '18 at 00:46
  • Good catch with the print() function and the newlines. Tricky! – aldo Aug 05 '18 at 00:57
  • I'd run the whole stream through `csv.reader`. It won't care about the odd newlines and will parse the columns as well. – Mark Tolonen Aug 07 '18 at 02:20
  • @MarkTolonen Good point. It clearly is a CSV, there doesn’t seem to be anything weird about its dialect, and if the OP wants to actually do anything with these values, they probably want them as a list of columns, not a string… – abarnert Aug 07 '18 at 02:56
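
The csv.reader suggestion from the comments could look roughly like this. It is a sketch rather than a tested solution for the full file: the byte strings are hard-coded samples copied from the question, and the names raw_lines, rows, header and records are made up for the example.

```python
import csv

# Sample lines copied from the question's console output.
raw_lines = [
    b'STATE,STATEFP,COUNTYFP,COUNTYNAME,COUSUBFP,COUSUBNAME,FUNCSTAT\r\r\n',
    b'AL,01,001,Autauga County,90171,Autaugaville CCD,S\r\r\n',
    b'AL,01,001,Autauga County,90315,Billingsley CCD,S\r\r\n',
]

# csv.reader accepts any iterable of strings, so decode each line lazily.
decoded_lines = (line.decode('utf-8') for line in raw_lines)

# The reader drops the trailing \r\r\n and splits each line into columns.
rows = list(csv.reader(decoded_lines))
header, records = rows[0], rows[1:]

for record in records:
    print(dict(zip(header, record)))
```

With the real download, the same generator expression should work over the response object from urllib.request.urlopen(url) in place of raw_lines, since iterating over it yields bytes lines, as the loop in the question shows.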