
I have a .txt file (23820 rows × 300 columns). It is '\t'-separated and the decimal mark is ','.

When reading it in with read_csv, nearly every column in my file should be a float, but the import totally messes it up. I don't get float data (with a dot as the decimal mark) but strings like '25,73234'. This leads to my problem when trying to convert it. See the error message:

ValueError                                Traceback (most recent call last)
<ipython-input-3-87b4fe6976d1> in <module>
----> 1 DF['T Ans v F_deg_C'] = [float(x) for x in DF['T Ans v F_deg_C']]

<ipython-input-3-87b4fe6976d1> in <listcomp>(.0)
----> 1 DF['T Ans v F_deg_C'] = [float(x) for x in DF['T Ans v F_deg_C']]

ValueError: could not convert string to float: '25,79243'

Here is a screenshot of my DF after import (DF example):

I could just skip the first 2 rows and the data would be fine, BUT I want to combine the 1st and 2nd rows element-wise as the header of my DF. I know I could read the headers in separately and then the data. I could also fix the data after importing by replacing ',' with '.' before converting each column from string to float, but I still want to know what is going on there.
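The replace-then-convert workaround mentioned above can be sketched like this. The column name is taken from the traceback; the DataFrame here is a hypothetical stand-in for the imported one:

```python
import pandas as pd

# Hypothetical stand-in for the imported DataFrame: the numeric column
# arrived as strings using ',' as the decimal mark.
df = pd.DataFrame({"T Ans v F_deg_C": ["25,73234", "25,79243"]})

# Swap ',' for '.' and cast the whole column to float in one step,
# instead of looping with a list comprehension.
df["T Ans v F_deg_C"] = (
    df["T Ans v F_deg_C"].str.replace(",", ".", regex=False).astype(float)
)

print(df["T Ans v F_deg_C"].dtype)  # float64
```

This avoids the `float(x)` loop from the traceback, but it is a per-column fix, not an answer to why the import goes wrong in the first place.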

Now here is the question I can't answer myself. I figured out that it is the 2nd line that causes my problem. When searching for answers I often read about a misinterpreted '\n' or regex issues, so I printed the 2nd line, including every piece of formatting, with:

with open(data_path) as f:
    my_list = list(f)
my_list[1]

'yymmdd\thh:mm:ss\trpm\tNm\tkW\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\t°C\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tbar\tbar\tmbar\tmbar\tmbar\tkg/h\t%\tkg\tg/h\tkg/h\tl/min\tppm\tppm\tppm\tppm\tppm\tppm\t%\tppm\t%\tppm\tppm\tppm\t%\tppm\t%\tppm\tppm\tppm\t%\tppm\t%\tNm\tkW\tkW\tppm\tppm\tppm\tg/h\tg/h\tg/h\tg/kWh\tg/kWh\tg/kWh\t%\t%\t%\tg/h\tg/h\tg/h\tg/kWh\tg/kWh\tg/kWh\tg/h\tg/h\tg/h\tg/kWh\tg/kWh\tg/kWh\tg/h\tg/h\tg/h\tg/kWh\tg/kWh\tg/kWh\tg/kWh\tg/kWh\tg/kWh\tkg/h\tg/kg\t°C\t-\t-\t-\t-\t-\tg/kWh\tg/kWh\tbar\t-\t-\t-\t-\tkPa\tmbar\tmbar\tmg/l\tmg/l\t-\tkWh\t°C\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\tmbar\t?\t?\tcbm\t1\t\t?\t?\t\t\t\t\t\t?\t?\t?\t?\t?\t?\t?\t?\t?\t?\t?\t\t?\t?\t?\t?\t1\t1\t1\t1\t1\tppm\t\t1\tmg/Asp\t1\t\tbar\tbar\tbar\tbar\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t%\t°C\tbar\t°C\t1\t°C\t°C\t°C\t\tmbar\tmbar\tppm\t°C\tbar\t°C\tbar\tbar\tppm\t°C\t\t°C\t1/min\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\tkg/s\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1\t1,0\t1,0\t1,0\t1,0\t1,0\t1,0\t1,0\tkW\t1,0\t1,0\t1,0\tppm\t1,0\t1,0\tg/s\t1,0\t1,0\t1,0\tppm\tppm\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0\n'

This 2nd row contains the units for the column headers. Can somebody explain to me why this line causes my problem?

  • Can you share the content of the file as an example? Better structured than it currently is – azro May 06 '21 at 09:05
  • I added a screenshot of the DF. The zeros at the bottom and at the right end will be removed when cleaning up the data. – Rabinzel May 06 '21 at 09:19
  • how do you read it? Do you use options in `read_csv()` like `sep='\t'`? – furas May 06 '21 at 09:42
  • always put the full error message (starting at the word "Traceback") in the question (not a comment) as text (not a screenshot, not a link to an external portal). It contains other useful information. – furas May 06 '21 at 09:43
  • `read_csv()` should also have an option to format float values. You should check the documentation. – furas May 06 '21 at 09:44
  • @furas `DF = pd.read_csv(data_path, delimiter = '\t', decimal = ',', error_bad_lines = False, low_memory = False)` for reading in my data – Rabinzel May 06 '21 at 09:54
  • @furas yes, I checked the `read_csv()` documentation but I didn't find a solution for how to format the columns as float values. Do I have to predefine a dict with all columns and their formats in `dtype`? What if I don't know the exact number and names of the columns but want all of the data from col 3 onward as floats (cols 1 and 2 are date and time)? – Rabinzel May 06 '21 at 10:01
  • When I test some data with `delimiter='\t', decimal=','`, it converts values to numbers with a dot, `25.73234` - maybe your file uses a char that looks like `,` but is a different char (with a different code) for the system, so it can't convert to `float`. You could grab the text `25,73234` from the file or dataframe and check whether `"," in '25,73234'` gives `True`. There could also be another char that looks like a `space` - invisible on screen - that the system can't convert. Since the text `°C` shows up as a strange char in your table, the file may not be `UTF-8` but some other encoding – furas May 06 '21 at 10:16
  • OR maybe you should use `read_csv(header=[0,1])` to turn the first and second rows into the header - maybe then it will convert correctly. The second row can cause a problem even when you try to convert to float manually - `°C` can't be converted to `float` – furas May 06 '21 at 10:21
  • yes, me too. Same file, same `read_csv`, just adding `header=None, skiprows=2`, and my numbers are also shown as `25.73234` with a dot. Tested two things now: both `','` and `''` give `True` for `25,73234` – Rabinzel May 06 '21 at 10:24
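The look-alike-character check suggested in the comments can be made more direct with `repr()` and `ord()`, which expose invisible or non-ASCII characters. A sketch on a hypothetical cell value (with the real file you would pull it from the DataFrame, e.g. `DF['T Ans v F_deg_C'].iloc[0]`):

```python
# Hypothetical cell value standing in for one read from the DataFrame.
value = "25,73234"

# repr() prints escapes for invisible characters; ord() gives each code
# point, so a look-alike comma (e.g. U+201A) would stand out immediately
# from the ordinary comma (code point 44).
print(repr(value))
print([(ch, ord(ch)) for ch in value])
```

If every character shows the expected ASCII code point, the problem is not a disguised separator but the parsing itself.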

1 Answer


How about adding `header=[0,1]` to the function call? This tells pandas to use the first two lines of the file as the header.

In your case: `pd.read_csv(data_path, delimiter='\t', decimal=',', header=[0,1])`
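A sketch of why this fixes the dtypes, using a tiny synthetic file standing in for the real one: with a single header row, the units line lands in the data body, so no column can be parsed as numeric and `decimal=','` never gets applied; with `header=[0,1]` both rows become the header and the data parses straight to float.

```python
import io
import pandas as pd

# Tiny stand-in for the real file: a name row, a units row, then data,
# tab-separated with ',' as the decimal mark.
raw = "rpm\tNm\n1/min\tNm\n25,73234\t1,5\n26,1\t2,0\n"

# Single header row: the units line sits in the data, the columns stay
# as text (object dtype) and the commas survive.
bad = pd.read_csv(io.StringIO(raw), sep="\t", decimal=",")
print(bad.dtypes)  # object, object

# Two header rows: pandas builds a MultiIndex header and the remaining
# rows are purely numeric, so decimal=',' kicks in.
good = pd.read_csv(io.StringIO(raw), sep="\t", decimal=",", header=[0, 1])
print(good.dtypes)  # float64, float64
```

If you later want the element-wise combined single header mentioned in the question, the two MultiIndex levels can be joined, e.g. `good.columns = ['_'.join(c) for c in good.columns]`.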

Hein Schnell
  • as @furas suggested in the comments above, I tried `header=[0,1]` and it worked. I thought I had tried this already, but it seems not. Thanks a lot to both of you! ... Tbh I'm still curious how this 2nd row causes the problem for all the data below. – Rabinzel May 06 '21 at 10:41
  • Could it be that the very first row is missing a linebreak `\n`? Because the columns at the far right of your image look like values to me, not column names. The first row would then have many more values than all the other rows. You might want to check whether 300 columns (as determined by pandas) is what you would expect for your table/use case. Best regards :D – Hein Schnell May 06 '21 at 12:00