Python : Changing the original data using a for loop

Question

I have some really big txt files (> 2 gb) where the quality of the data is not good. In some columns (that should be integer), for values below 1000.00 , '.' is used as the decimal point (e.g. 473.71886) but for values above 1000.00 then the form is like that 7.541,72419. So ',' is used as the decimal point and '.' for the thousands separator.

I have already read the text file using pd.read_csv with the below command

df = pd.read_csv('mseg.txt',delimiter=("#|#"),nrows=(1000),engine = 'python')

I tried to build the regular expression to be used but it doesn't work pattern = "[0-9]+[\.][0-9]+[,][0-9]+"

I was thinking of using the below code to correct the above problem but it doesn't work. (in the below code I used as pattern2 = "," to test the code)

for i in df.iloc[:,-5]:
    df3 = []
    if re.search(pattern2,i):
        k= i.replace(".","")
        print(k)
        df3.append(k)
    else:
        df3.append(k)
return dfe3

The print(k) in the loop seems to work fine but when I run df3 then I get the below output

['\x00 \x003\x004\x00\x006\x006\x005\x00,\x002\x001\x007\x006\x000\x00']

Could anyone help?

score 1 · Answer 1 · answered Jul 07 '21 at 14:56

I would suggest to do the following:

If there is a ',' in the number replace it with a '.' but get rid of the ',' before. So you would change a 1.234,567 to 1234,567 and then to 1234.567. Then all of your numbers should be in the same format.

df3 = []
for index,i in df.iloc[:,-5]:  
    if ',' in i:
        i= i.replace(".","").replace(',','.')
    df3[index] = i

Corralien · Answer 2 · 2021-07-07T15:08:16.097

0

You can try this:

>>> df
             0
0    473.71886
1  7.541,72419

>>> df[0].str.split(r'[^\d]') \
         .apply(lambda x: f"{''.join(x[:-1])}.{x[-1]}")

0      473.75410
1    71886.72419
dtype: float64

edited Jul 07 '21 at 15:08

answered Jul 07 '21 at 14:54

Corralien

109,409
8
28
52

Python : Changing the original data using a for loop

2 Answers2