
I'm trying to pull units out of a data file to use in its post-processing. The file is a .csv and, after struggling with pandas, I've resorted to using pandas only for the channel names and the data itself, skipping the 2 rows after the names (units and "Raw").

I'm separately using np.genfromtxt to extract the units:

def get_df(f):
    df = pd.read_csv(os.path.join(pathname, f), skiprows=[0, 1, 2, 3, 4, 6, 7])
    units = np.genfromtxt(os.path.join(pathname, f), skip_header = 6, delimiter = ',', max_rows = 1, dtype = np.string_)

    return df, units

And, since some of these units contain '/', I'm changing them (these values end up being joined to the names of the channels and used in file names for the plots generated).

df, units = get_df(f)

unit_dict = {}
for column, unit in zip(df.columns, units):
    unit = unit.replace('/', ' per ')
    unit_dict[column] = unit

When I get to a channel name that has a degree symbol in it, I get an error:

CellAmbTemp �C
Traceback (most recent call last):
  File "filepath_omitted/Processing.py", line 112, in <module>
    df_average[column], column)
  File "path/Processing.py", line 30, in contour_plot
    plt.title(column_name)
  File "C:\Python27\lib\site-packages\matplotlib\pyplot.py", line 1465, in title
    return gca().set_title(s, *args, **kwargs)
  File "C:\Python27\lib\site-packages\matplotlib\axes\_axes.py", line 186, in set_title
    title.set_text(label)
  File "C:\Python27\lib\site-packages\matplotlib\text.py", line 1212, in set_text
    self._text = '%s' % (s,)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 12: ordinal not in range(128)

Process finished with exit code 1

I printed out the dictionary in which I'm pairing channels with the units and in this case, the entry looks like:

'CellAmbTemp': '\xb0C'
  • What encoding is that?
  • I've tried various things like string.decode() and unicode(string) and dtype = unicode_
  • Is there a better way to do what I need to do? Or at least cobble something together to fix it?

Added: chunk of the file

Logger description:                                     
Log period: 1 s                                 
Statistics period: 30 s                                 
Statistics window: 300 s                                    
Maximum duration:                                   
Time    Time    Time    ActSpeed    ActTorque   ActPower    FuelMassFlowRate    BarometricPress CellAmbTemp ChargeCoolerInPressG
Date    Time    ms  rev/min Nm  kW  g/h kPa °C  kPa
Raw Raw Raw Raw Raw Raw Raw Raw Raw Raw
1/12/2018   12:30:01 PM 153.4   600.0856308 132.4150085 7.813595703 2116.299996 97.76997785 11.29989827 0.294584802
1/12/2018   12:30:02 PM 153.4   600.1700702 132.7327271 7.989128906 2271.800016 97.76997785 11.29989827 0.336668345
1/12/2018   12:30:03 PM 153.4   600.0262537 128.7541351 7.427545898 2783.199996 97.78462672 11.29989827 0.241980373

ETA:

I ended up switching how I acquired the units to pandas:

def get_df(f):
    df = pd.read_csv(os.path.join(pathname, f), skiprows=[0, 1, 2, 3, 4, 6, 7])
    units = pd.read_csv(os.path.join(pathname, f), skiprows = 6, delimiter = ',')
    units = units.columns
    return df, units

Then I decoded / encoded outside:

df, units = get_df(f)

unit_dict = {}
for column, unit in zip(df.columns, units):
    encoding = chardet.detect(unit)['encoding']
    unit = unit.decode(str(encoding)).encode('utf-8')
    unit_dict[column] = unit

Now I'm getting the error when I'm trying to use that text as the title of a plot in matplotlib, but I'm getting farther into the code before the error.

mauve
  • I had a similar problem once. Take a look here: https://stackoverflow.com/a/37678518/7851470 – Georgy Jan 18 '18 at 21:29
  • This is starting to make sense, but I haven't figured out how to successfully implement it. Thank you! – mauve Jan 18 '18 at 21:44
  • Can you give the correct code and CSV file? With the code and file provided above, I get: TypeError: zip argument #2 must support iteration – aberry Jan 18 '18 at 21:44
  • What does the `units` array constructed by `np.genfromtxt` look like? – Grr Jan 18 '18 at 22:45
  • Your question seems to indicate that the value for °C in bytes is `\xb0C` when it should be `\xc2\xb0C` – Grr Jan 18 '18 at 23:04
  • @Grr: `\xb0` would be the degree symbol, `°`, in `latin-1` and `cp1252` codecs (and probably a few other one byte per character ASCII superset codecs). The `\xc2` is only needed if it's encoded in `utf-8`. Specifying the encoding as `latin-1` would likely fix it. – ShadowRanger Jan 18 '18 at 23:15

2 Answers


You have to know the encoding of your input file (or just try the common utf-8). If you don't, and utf-8 does not work, try using chardet on the file and use its result.
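
When chardet is not at hand, a rough stand-in is to try a short list of likely codecs in order and keep the first one that decodes cleanly. This is only a sketch in Python 3 using the standard library: the helper name `guess_decode` and the candidate list are assumptions, not part of the question, and a clean decode does not prove the guess is right (latin-1 accepts every byte, so it must come last).

```python
# Hypothetical helper: candidate codecs and their order are assumptions.
# utf-8 goes first because it rejects most non-UTF-8 input; latin-1 goes
# last because it never fails and would otherwise mask the other codecs.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def guess_decode(raw, candidates=CANDIDATES):
    """Return (text, encoding) for the first codec that decodes raw without error."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


# The unit from the question: a lone 0xb0 byte followed by 'C'.
text, enc = guess_decode(b"\xb0C")
print(enc, ascii(text))
```

For the `'\xb0C'` value in the question, utf-8 is rejected because 0xb0 cannot start a UTF-8 sequence, so the loop falls through to a single-byte codec.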

progmatico

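If you already had a byte string you would decode it with the encoding it was written in. A lone \xb0 for ° points to latin-1 or cp1252 (utf-8 would store ° as \xc2\xb0):

codecs.decode(s, 'latin-1')

But since you are reading a CSV into a dataframe, tell pd.read_csv your source encoding:

pd.read_csv(..., encoding='latin-1')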
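
For the two header rows in the question, the same idea also works without pandas: open the file with an explicit encoding and read the names and units rows as text. The sketch below is Python 3 and standard library only. The function name `read_names_and_units`, the `latin-1` default (guessed from the `\xb0` byte) and the trimmed-down sample file are assumptions for illustration; the row positions follow the `skiprows` list in the question.

```python
import csv
import io
import itertools
import os
import tempfile


def read_names_and_units(path, encoding="latin-1", names_row=5, units_row=6):
    """Return {channel name: unit} built from the two header rows of the log file.

    Duplicate channel names (such as the three 'Time' columns in the question)
    collapse to one key here, unlike pandas, which renames them.
    """
    with io.open(path, encoding=encoding, newline="") as fh:
        rows = list(itertools.islice(csv.reader(fh), units_row + 1))
    names, units = rows[names_row], rows[units_row]
    # '/' is awkward in file names, so spell it out as the question does.
    return {name: unit.replace("/", " per ") for name, unit in zip(names, units)}


# Illustrative sample only: a cut-down version of the file chunk in the question,
# written out as latin-1 bytes so the degree sign is the single byte 0xb0.
sample = u"\n".join([
    u"Logger description:,,",
    u"Log period: 1 s,,",
    u"Statistics period: 30 s,,",
    u"Statistics window: 300 s,,",
    u"Maximum duration:,,",
    u"ActSpeed,FuelMassFlowRate,CellAmbTemp",
    u"rev/min,g/h,\u00b0C",
    u"Raw,Raw,Raw",
    u"600.0856308,2116.299996,11.29989827",
]) + u"\n"

fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as fh:
    fh.write(sample.encode("latin-1"))

unit_dict = read_names_and_units(path)
os.remove(path)
```

Because the units come back as text rather than bytes, they can be joined to channel names and passed to plot titles without a later decode step. `pd.read_csv` takes the same codec name through its `encoding` argument.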

A technique I've also used for single-character issues that I didn't bother to solve properly is to just find and replace. Something like:

pd.read_csv(StringIO(open(path).read().replace('\xb0', '')))

This is the lazy option though.
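
A Python 3 flavour of the same shortcut might look like the sketch below, which works on raw bytes before any decoding. The helper name `strip_degree_byte` and the optional replacement text are assumptions. It is only safe when the file is in a single-byte encoding: in a UTF-8 file the degree sign is `\xc2\xb0`, and removing just `\xb0` would leave a stray byte behind.

```python
def strip_degree_byte(raw, replacement=b""):
    """Drop (or substitute) the single-byte degree sign, then decode as ASCII.

    Any other non-ASCII byte is turned into U+FFFD instead of raising.
    """
    return raw.replace(b"\xb0", replacement).decode("ascii", errors="replace")


label = strip_degree_byte(b"CellAmbTemp \xb0C")
# Optional variant that keeps the meaning in a file-name-friendly form.
label_deg = strip_degree_byte(b"CellAmbTemp \xb0C", replacement=b"deg ")
```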

Kyle