
This is the raw distribution of the variable FREQUENCY:

NaN    22131161
1.0     4182626
7.0      218343
3.0      145863
1         59432
0.0       29906
2.0       28129
4.0       15237
5.0        4553
8.0        3617
3          2754
7          2635
9.0         633
2           584
4           276
0           112
8            51
5            42
6.0          19
A             9
I             7
9             6
Q             3
Y             2
X             2
Z             1
C             1
N             1
G             1
B             1
Name: FREQUENCY, dtype: int64
  1. Group 1.0 should be the same as 1. I wrote df['x'] = df['x'].replace({'1.0': '1'}), but it does not change anything. 9.0 vs. 9 and 3.0 vs. 3 show the same symptom.
  2. How can FREQUENCY be rendered as int64 when letters are present?
  3. Desired outcome 1: group all the letter groups plus NaN into one group, and consolidate the remaining numeric value groups (1.0 and 1 become 1, for example). In SAS I would just run y = 1*X and then assign a value of 10 to represent the character groups plus NaN. How do I do this in Python, ideally elegantly? (See the sketch after this list.)
  4. Desired outcome 2: extract a binary variable z = 1 if x is NaN, otherwise z = 0.
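
One way to get both outcomes in pandas (a minimal sketch, assuming df already holds the data and the column is named FREQUENCY as shown above; pd.to_numeric with errors='coerce' plays the role of SAS's y = 1*X by turning every non-numeric value into NaN):

    import pandas as pd

    # Parse '1', '1.0', 1 and 1.0 all to the same number; letters such
    # as 'A' or 'Q' become NaN -- the analogue of SAS's y = 1*X.
    x = pd.to_numeric(df['FREQUENCY'], errors='coerce')

    # Outcome 1: numeric groups consolidated, letter groups + NaN
    # lumped together under the value 10.
    df['y'] = x.fillna(10).astype(int)

    # Outcome 2: binary flag, 1 where the original value is missing.
    df['z'] = df['FREQUENCY'].isna().astype(int)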
opt135
  • This looks like a homework assignment? What have you done so far to solve it? – Roland Weber Jun 14 '19 at 06:08
  • It is not homework. It is serious advanced analytics work in a big enterprise, part of a process of moving code off SAS to Python. I have never seen 1.0 treated as separate from 1 before; this does not happen on mainframe JCL either. Since I am playing with Python, I would like to see if there is a quick way to write this. – opt135 Jun 14 '19 at 17:45
  • I am just seeking knowledge here. Work is not delayed. – opt135 Jun 14 '19 at 17:50
  • I first used sas7bdat in Python to read the raw data in SAS data format. It went very slowly and the job kept bogging down, so I ran SAS to export the data set to CSV and switched to pd.read_csv. That went very, very fast: regular Python 3.7.2 took ~10 minutes, and with pyspark it took about 2 minutes. It worked really well. I naturally have to audit the result. All other variables (~98 of them) turned out as expected, except this one. I have no idea what happened to it, but I do know what I want it to be. Coding away one int64 like this is proving not easy for me, so I thought somebody here might know. – opt135 Jun 15 '19 at 01:00

1 Answer


The first issue ("group 1.0 should be the same as 1 ... replace does not change anything") was fixed once I added dtype={'FREQUENCY': 'object'} while reading the CSV file. Group 1.0 collapsed into group 1, and after that replace works just fine.
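
For reference, the read looks like this (a sketch; the file name is hypothetical):

    import pandas as pd

    # Force the mixed column to be read as strings so that '1' and '1.0'
    # are kept as-is instead of being parsed into different types.
    # 'data.csv' stands in for the actual exported file.
    df = pd.read_csv('data.csv', dtype={'FREQUENCY': 'object'})

    # Now every value is a string and replace collapses the duplicates.
    df['FREQUENCY'] = df['FREQUENCY'].replace({'1.0': '1', '3.0': '3', '9.0': '9'})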

All the other issues are pretty much resolved, except issue 2: the output still reports int64 even though character values are present. My guess is that Python adopts something like a majority rule to decide the data type, and it is true that numeric values dominate the count.
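
One way to check what is actually stored (a sketch; note that the "dtype: int64" printed at the end of value_counts() describes the counts themselves, not the column):

    # value_counts() returns a Series of counts, and those counts are
    # int64 -- that is the dtype shown at the end of its output.
    print(df['FREQUENCY'].value_counts(dropna=False))

    # The column's own dtype, and the Python types actually present:
    print(df['FREQUENCY'].dtype)
    print(df['FREQUENCY'].map(type).value_counts())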

opt135