I have some very large txt files (> 2 GB) where the data quality is poor. In some columns (which should be numeric), values below 1000.00 use '.' as the decimal point (e.g. 473.71886), but values above 1000.00 look like 7.541,72419, that is, ',' is used as the decimal point and '.' as the thousands separator.
I have already read the text file using pd.read_csv with the command below:
df = pd.read_csv('mseg.txt', delimiter='#|#', nrows=1000, engine='python')
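A side note, based on my assumption that the separator in the file is literally the three characters #|#: with engine='python', a multi-character delimiter is treated as a regular expression, and '#|#' as a regex just means "a single #". Escaping the | makes it match the literal separator:

```python
import io
import pandas as pd

# Illustrative sample with a literal '#|#' separator (made-up data).
sample = io.StringIO("a#|#b\n1#|#2\n")

# Escape the '|' so the regex matches the literal three-character separator.
df = pd.read_csv(sample, sep=r"#\|#", engine="python")
print(df.columns.tolist())  # ['a', 'b']
```

With the unescaped sep='#|#', the same row would instead be split on every single '#', producing three columns.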
I tried to build a regular expression to match the thousands-separated form, but it doesn't work:
pattern = r"[0-9]+\.[0-9]+,[0-9]+"
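For reference, a slightly stricter sketch that matches only the thousands-separated form, so plain decimals such as 473.71886 are left alone (the groups-of-three-digits assumption is mine):

```python
import re

# '.'-separated thousands groups followed by a ',' decimal part,
# e.g. '7.541,72419' (assumes digits come in groups of three).
pattern = r"\d{1,3}(?:\.\d{3})+,\d+"

print(bool(re.search(pattern, "7.541,72419")))  # True
print(bool(re.search(pattern, "473.71886")))    # False: no ',' decimal part
```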
I was thinking of using the code below to correct the problem, but it doesn't work (to test the code, I used pattern2 = "," as the pattern):
df3 = []
for i in df.iloc[:, -5]:
    if re.search(pattern2, i):
        k = i.replace(".", "")
        print(k)
        df3.append(k)
    else:
        df3.append(i)  # append the original value, not k
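As an aside, the loop can usually be replaced with pandas' vectorized string methods; here is a sketch on made-up sample data (the real column would be df.iloc[:, -5]):

```python
import pandas as pd

# Made-up sample mixing the two observed formats.
s = pd.Series(["473.71886", "7.541,72419"])

# Rows containing ',' use the European format: strip the '.' thousands
# separators, then turn the ',' decimal point into '.'.
has_comma = s.str.contains(",", regex=False)
fixed = s.str.replace(".", "", regex=False).str.replace(",", ".", regex=False)
cleaned = s.where(~has_comma, fixed)

print(pd.to_numeric(cleaned).tolist())  # [473.71886, 7541.72419]
```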
The print(k) calls inside the loop look fine, but when I inspect df3 afterwards I get the output below:
['\x00 \x003\x004\x00\x006\x006\x005\x00,\x002\x001\x007\x006\x000\x00']
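A note on those interleaved \x00 bytes: they usually mean the file is UTF-16 encoded but was decoded with a single-byte codec. A small sketch of the symptom (the sample string and the exact byte order are my assumptions):

```python
# A value like the one in the output above, encoded the way it may sit on disk.
text = " 34 665,21760"
raw = text.encode("utf-16-be")

# Decoding with a single-byte codec interleaves a '\x00' before each character.
garbled = raw.decode("latin-1")
print(repr(garbled))  # '\x00 \x003\x004\x00 \x006...'

# Decoding with the right codec recovers the text.
print(raw.decode("utf-16-be") == text)  # True
```

If that is what is happening here, passing encoding='utf-16' (or the exact variant) to pd.read_csv should remove the null bytes at the source, before any replacement logic runs.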
Could anyone help?