
I have input in a CSV file like -0.02872239612042904, -0.19755002856254578... with 128 values per row, and when I read that array from the CSV file it comes back as the string '-0.02872239612042904, -0.19755002856254578...'. I have figured out a way to map all the strings to a specific datatype. Right now I am doing it like this:

result=list(map(float, re.findall(r'\d+', en)))  #en=string read from csv file

But these are face encodings, and when the distance is calculated it returns False every time, which I believe is because after the conversion the values come out like 1906684972345829.0 and so on.

I can't find a datatype to represent numbers like -0.02872239612042904, which is why I am converting to float when mapping, and that seems to be the wrong format. Can anyone please tell me the correct datatype for numbers like -0.02872239612042904 in Python 3? Much thanks, it is giving me a headache now.
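
For illustration, here is a minimal sketch (using a shortened sample string, not the real 128-value row) of why the `\d+` pattern produces such huge numbers: it matches only runs of digits, so the sign and the decimal point are dropped and every value is split at the dot.

import re

en = "-0.02872239612042904, -0.19755002856254578"   # shortened sample

# \d+ keeps only digit runs, dropping '-' and '.'
print(re.findall(r'\d+', en))
# ['0', '02872239612042904', '0', '19755002856254578']

# mapping those digit runs to float yields huge integers such as
# 2872239612042904.0 instead of the original fractions
print(list(map(float, re.findall(r'\d+', en))))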

EDIT: This is how I am reading the data from the CSV file:

def get_encodings():
    df=pd.read_csv('Encodings/encodings.csv')       #getting file
    with tqdm(total=len(list(df.iterrows()))) as prbar:
        encodings=[]
        images=[]
        for index, row in df.iterrows():
            r=[]
            en=df.loc[index,'Encoding']
            print(en)   #prints correctly
            print(type(en))   # prints <class 'str'>; I want the data back in its original form, which looks like what is shown below

"[-0.19053705  0.06230173  0.04058716 -0.08283613 -0.07159504 -0.10155849
  0.06008045 -0.06842063  0.1317966  -0.10250588  0.203399   -0.01436609
 -0.21249449 -0.09238856  0.0279788   0.08926097 -0.09177385 -0.1628615
 -0.03505187 -0.12979373  0.05772705  0.00208503 -0.06933809  0.00741822
 -0.17499965 -0.25000119 -0.0205064  -0.03139503  0.01130889 -0.1057417
  0.13554846  0.06285821 -0.18908061 -0.02082938  0.04383367  0.23148835
 -0.05068404 -0.00925579  0.1900605  -0.05617992 -0.12842563 -0.06219928
  0.07317995  0.26369438  0.10394366  0.05749369  0.02448226 -0.07668396
  0.1266536  -0.23425353  0.04819498  0.07290804  0.111645    0.08294459
  0.10209186 -0.21581331  0.07399686  0.07748453 -0.22381224  0.01746997
  0.0188249  -0.06403829 -0.07789861 -0.0249712   0.21001905  0.03979192
 -0.12171203 -0.06864078  0.21658717 -0.17392246 -0.06753681  0.09808435
 -0.0076007  -0.18134885 -0.23990698  0.07026891  0.3552466   0.17010394
 -0.16684352  0.03726491  0.02757547  0.01445537  0.10094975  0.04033324
 -0.10441576  0.0377433  -0.09693146  0.04404883  0.16759454  0.0402087
 -0.05915016  0.1369293   0.05408669  0.05787617  0.03509152  0.01340439
 -0.06379045  0.04323686 -0.09738267 -0.02683797  0.14505677 -0.10747927
  0.03247242  0.11747092 -0.18656668  0.22448684 -0.00474619 -0.00586929
 -0.05853979  0.06613642 -0.065335    0.02921261  0.08723848 -0.30918318
  0.23265852  0.20364268 -0.07978678  0.19747412  0.08048097  0.04772019
  0.06427031 -0.03703914 -0.14493702 -0.12132056 -0.01301065 -0.02351468
  0.10600268  0.06480799]"

One row of my data looks like this ^ and I just want all of it, without quotes, in this type dtype('
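
For reference, a minimal sketch (not from the answer below) of how a row like this could be converted back to floats, assuming the `Encoding` column really holds the printed representation of a NumPy array, i.e. space-separated values inside square brackets:

import numpy as np

def parse_encoding(en):
    # en is assumed to be the quoted string from the 'Encoding' column,
    # e.g. "[-0.19053705  0.06230173 ...  0.06480799]"
    # strip the brackets, split on whitespace, convert to float64
    return np.array(en.strip().strip('[]').split(), dtype=np.float64)

# encoding = parse_encoding(df.loc[index, 'Encoding'])   # shape (128,)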

– Asim

1 Answer


If you have a CSV, use the csv module to read it (or read up on pandas, which will auto-convert your values to suitable types):

Create demo file:

data =  """-0.02872239612042904, -0.19755002856254578, 0.31345692434, -0.0009348573822
-1.02872239612042904, -1.19755002856254578, 1.31345692434, -1.0009348573822
-2.02872239612042904, -2.19755002856254578, 2.31345692434, -2.0009348573822
-3.02872239612042904, -3.19755002856254578, 3.31345692434, -3.0009348573822
apple, prank, 0.23, nothing
"""

with open("datafile.csv","w") as f:
    f.write(data)

Read the demo file back in:

import csv

def safeFloat(text):
    try:
        return float(text)
    except ValueError:  # maybe even catch-all here
        return float("nan")

data = []
with open("datafile.csv", "r") as r:
    reader = csv.reader(r, delimiter=',')   # avoid shadowing the csv module
    for l in reader:
        data.append(list(map(safeFloat, l)))  # safeFloat to capture errors

print(data)

If you have non-floats in your data, you may want to use a `safeFloat(text)` function instead of plain `float` inside `map`, to guard against parsing errors when some text is not convertible to float.

Output:

[[-0.02872239612042904, -0.19755002856254578, 0.31345692434, -0.0009348573822], 
 [-1.028722396120429, -1.1975500285625458, 1.31345692434, -1.0009348573822], 
 [-2.028722396120429, -2.1975500285625458, 2.31345692434, -2.0009348573822], 
 [-3.028722396120429, -3.1975500285625458, 3.31345692434, -3.0009348573822], 
 [nan, nan, 0.23, nan]]

You could also use regex, but then your pattern needs to allow the optional sign as well as a dot and numbers before/after it:

r'[+-]?\d+\.\d+'  # would allow for 123.1245 - but not for 123 or .1234 
                  # would allow an optional +- before numbers

You can check patterns e.g. at http://regex101.com - this pattern with demo data can be found here: https://regex101.com/r/xSiyO1/1
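
A short sketch of that pattern in use, on a comma-separated line like the demo data above:

import re

line = "-0.02872239612042904, -0.19755002856254578, 0.31345692434, -0.0009348573822"

# keep the optional sign plus the digits before and after the dot
values = list(map(float, re.findall(r'[+-]?\d+\.\d+', line)))
print(values)
# [-0.02872239612042904, -0.19755002856254578, 0.31345692434, -0.0009348573822]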


pandas solution (only valid data):

data =  """-0.02872239612042904, -0.19755002856254578, 0.31345692434, -0.0009348573822
-1.02872239612042904, -1.19755002856254578, 1.31345692434, -1.0009348573822
-2.02872239612042904, -2.19755002856254578, 2.31345692434, -2.0009348573822
-3.02872239612042904, -3.19755002856254578, 3.31345692434, -3.0009348573822
"""

with open("datafile.csv","w") as f:
    f.write(data)

import pandas as pd
import numpy as np

df = pd.read_csv("datafile.csv", dtype={"a":np.float64,"b":np.float64,"c":np.float64,"d":np.float64},names=["a","b","c","d"] )
print(df) 

Output:

      a        b         c         d
0 -0.028722 -0.19755  0.313457 -0.000935
1 -1.028722 -1.19755  1.313457 -1.000935
2 -2.028722 -2.19755  2.313457 -2.000935
3 -3.028722 -3.19755  3.313457 -3.000935
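
As a follow-up sketch (not part of the answer above): once the columns are real floats, the DataFrame can be turned into NumPy arrays for the kind of distance calculation mentioned in the question. `df.to_numpy()` assumes a reasonably recent pandas; on older versions `df.values` does the same.

import numpy as np

encodings = df.to_numpy(dtype=np.float64)        # shape (n_rows, n_columns)

# Euclidean distance between the first two rows, as one would compute
# when comparing face encodings
print(np.linalg.norm(encodings[0] - encodings[1]))
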
– Patrick Artner

  • @Asim this: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas-read-csv should create a pandas dataframe with all the type converting already done for you - no need to do it line-wise. Added the `safeFloat()` for other readers – Patrick Artner May 26 '18 at 14:38
  • @Asim you have checked that what you see isn't only the _pretty-printed variable content_, have you? - try `pd.options.display.float_format = '${:,.20f}'.format` – Patrick Artner May 26 '18 at 20:58
  • @Asim Pandas automatically chooses the correct datatype _per column_ - as long as you have _one_ string in any field of a column, the whole column cannot be represented by `float` - the next candidate would be "str", then "object" (I think). See [change-data-type-of-columns-in-pandas](https://stackoverflow.com/questions/15891038/change-data-type-of-columns-in-pandas) - you might want `df.apply(pd.to_numeric, errors='coerce')` - that will replace str with NaN. Without you providing your dataframe it's hard to solve. Edit your question with your real data. – Patrick Artner May 26 '18 at 21:12
  • I have edited the question, please have a look at it now. Also much thanks for trying to solve my problem :) – Asim May 26 '18 at 21:20
  • @Asim - open another question. Your data is not CSV at all - it looks as if someone printed the `__repr__` of a numpy array (not a list) to a file. You might need to use `ast.literal_eval()` on each line to reconstruct the list from it - not parse it as CSV. See https://docs.python.org/3/library/ast.html#ast.literal_eval - you can feed it a string and get a Python obj returned, e.g. `k = ast.literal_eval("[1,2,3,4,5]")` will make `k` a Python list of integers 1 to 5 - also you are lacking `,` in between list elements - so it might be some kind of numpy output? – Patrick Artner May 26 '18 at 21:27
  • @Asim Probably easiest would be to create a MCVE that creates a pandas df and ask how to separate the numbers as floats into separate columns, tag it with python & pandas - that's beyond my panda-fu. – Patrick Artner May 26 '18 at 21:32