
This is the raw distribution of the variable FREQUENCY:

NaN    22131161
1.0     4182626
7.0      218343
3.0      145863
1         59432
0.0       29906
2.0       28129
4.0       15237
5.0        4553
8.0        3617
3          2754
7          2635
9.0         633
2           584
4           276
0           112
8            51
5            42
6.0          19
A             9
I             7
9             6
Q             3
Y             2
X             2
Z             1
C             1
N             1
G             1
B             1
Name: FREQUENCY, dtype: int64
  1. Group 1.0 should be the same as 1. I wrote df['x'] = df['x'].replace({'1.0': '1'}), but it does not change anything. 9.0 vs. 9 and 3.0 vs. 3 show the same symptom.
  2. How can FREQUENCY be rendered as int64 when letters are present?
  3. Desired outcome 1: group all the letter groups plus NaN into one group, and consolidate the remaining numeric value groups (1.0 and 1 become 1, for example). In SAS I would just run y = 1*X and then assign a value of 10 to represent the character groups plus NaN. How do I do this in Python, ideally elegantly? (See the sketch after this list.)
  4. Desired outcome 2: extract a binary variable z = 1 if x is NaN, otherwise z = 0.
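
One way to get both outcomes in pandas (a minimal sketch, assuming df already holds the data and the column is named FREQUENCY as shown above; pd.to_numeric with errors='coerce' plays the role of SAS's y = 1*X by turning every non-numeric value into NaN):

    import pandas as pd

    # Parse '1', '1.0', 1 and 1.0 all to the same number; letters such
    # as 'A' or 'Q' become NaN -- the analogue of SAS's y = 1*X.
    x = pd.to_numeric(df['FREQUENCY'], errors='coerce')

    # Outcome 1: numeric groups consolidated, letter groups + NaN
    # lumped together under the value 10.
    df['y'] = x.fillna(10).astype(int)

    # Outcome 2: binary flag, 1 where the original value is missing.
    df['z'] = df['FREQUENCY'].isna().astype(int)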
opt135
  • This looks like a homework assignment? What have you done so far to solve it? – Roland Weber Jun 14 '19 at 06:08
  • It is not homework. It is serious advanced analytics work in a big enterprise, part of a process of moving code off SAS to Python. I have never seen 1.0 treated as separate from 1 before; this does not happen on mainframe JCL either. Since I am playing with Python, I would like to see if there is a quick way to write this. – opt135 Jun 14 '19 at 17:45
  • I am just seeking knowledge here. Work is not delayed. – opt135 Jun 14 '19 at 17:50
  • I first used sas7bdat in Python to read the raw data in SAS data format. It went very slowly and the job kept bogging down, so I ran SAS to export the data set to CSV and switched to pd.read_csv. That went very, very fast: regular Python 3.7.2 took ~10 minutes, and with pyspark it took about 2 minutes. It worked really well. I naturally have to audit the result. All other variables (~98 of them) turned out as expected, except this one. I have no idea what happened to it, but I do know what I want it to be. Coding away one int64 like this is proving not easy for me, so I thought somebody here might know. – opt135 Jun 15 '19 at 01:00

1 Answer


The first issue ("group 1.0 should be the same as 1 ... replace does not change anything") was fixed once I added dtype={'FREQUENCY': 'object'} while reading the CSV file. Group 1.0 collapsed into group 1, and after that replace works just fine.
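
For reference, the read looks like this (a sketch; the file name is hypothetical):

    import pandas as pd

    # Force the mixed column to be read as strings so that '1' and '1.0'
    # are kept as-is instead of being parsed into different types.
    # 'data.csv' stands in for the actual exported file.
    df = pd.read_csv('data.csv', dtype={'FREQUENCY': 'object'})

    # Now every value is a string and replace collapses the duplicates.
    df['FREQUENCY'] = df['FREQUENCY'].replace({'1.0': '1', '3.0': '3', '9.0': '9'})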

All the other issues are pretty much resolved, except issue 2: the output still reports int64 even though character values are present. My guess is that Python adopts something like a majority rule to decide the data type, and it is true that numeric values dominate the count.
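
One way to check what is actually stored (a sketch; note that the "dtype: int64" printed at the end of value_counts() describes the counts themselves, not the column):

    # value_counts() returns a Series of counts, and those counts are
    # int64 -- that is the dtype shown at the end of its output.
    print(df['FREQUENCY'].value_counts(dropna=False))

    # The column's own dtype, and the Python types actually present:
    print(df['FREQUENCY'].dtype)
    print(df['FREQUENCY'].map(type).value_counts())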

opt135