0

The program retrieves JSON data from RESTApi

import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows',1000)
url = 'http://xxxxxxxxxxx:7180/api/v15/clusters/cluster/services/impala/impalaQueries?from=2018-07-23T14:00:00&to=2018-07-23T21:00:00&filter=(query_state+%3D+%22FINISHED%22)&offset=0&limit=1000'
username = 'xxxxx'
password = 'xxxxx'
result = requests.get(url, auth=(username,password))
outJSON = result.json()
df = pd.io.json.json_normalize(outJSON['queries'])
filename ="tempNew.csv"
df.to_csv(filename)

CSV data contains nulls for some fields and NaN for few fields.

Input:

Admitted immediately,,BLAHBLAH,0,NaN,0,0,0,0,0.0,,,,

While using fillna to replace all Nulls & NaN to 0 because they are number fields in the target Table.

Tried Codes:

for col in df:
   df[col].fillna(0,inplace=True)

df.fillna(0,inplace=True)

Output:

'Admitted immediately', '0', 'BLAHBLAH', '0', 'NaN', '0', '0', '0', '0', '0.0', '0','0','0'

How can I ensure that all NaN values are changed to 0 in my dataframe because the table they are loading to is rejecting values because of NaN values?

I switched from processing data line by line from RESTAPI to Dataframe with an impression that its easier to massage data using DF. Is there a better way to massage data in a df without iterating row by row if the fillna won't work?

Update:

df = pd.io.json.json_normalize(outJSON['queries'])
fname = "WithouFilna_1.csv"
df.to_csv(fname)
df.fillna(0,inplace=True)
filename ="fillna_1.csv"
df.to_csv(filename)

I tried to write the output of df.fillna before and after. Partial changes are seen for few fields, but not for all of them

Before:

859,Unknown,,,2,0,xxxx,RESET_METADATA,,,,,,,,,,,,,,
860,Admitted immediately,0,,1,2,xxxx,,0,,NaN,0,0,,0
861,Admitted immediately,0,,0,0,xxxx,,0,,NaN,0,0,,0

After:

859,Unknown,0,0,2,0,xxxx,RESET_METADATA,0,,0,0,0,0,0,0,0,0,0,0,0
860,Admitted immediately,0,0,1,2,xxx,0,0,,NaN,0,0,0,0,0,0,0,0,0
861,Admitted immediately,0,0,0,0,xxx,0,0,,NaN,0,0,0,0,0,0,0,0,0

df.dtypes Output

attributes.admission_result                              object
attributes.admission_wait                                object
attributes.bytes_streamed                                object
attributes.client_fetch_wait_time                        object
attributes.client_fetch_wait_time_percentage             object
attributes.connected_user                                object
attributes.ddl_type                                      object
attributes.estimated_per_node_peak_memory                object
attributes.file_formats                                  object
attributes.hdfs_average_scan_range                       object
attributes.hdfs_bytes_read                               object
attributes.hdfs_bytes_read_from_cache                    object
attributes.hdfs_bytes_read_from_cache_percentage         object
attributes.hdfs_bytes_read_local                         object
attributes.hdfs_bytes_read_local_percentage              object
attributes.hdfs_bytes_read_remote                        object
attributes.hdfs_bytes_read_remote_percentage             object
attributes.hdfs_bytes_read_short_circuit                 object
attributes.hdfs_bytes_read_short_circuit_percentage      object
attributes.hdfs_scanner_average_bytes_read_per_second    object

df.values[5:6, :15]

array([['Unknown', nan, nan, '1', '8', 'xxxxx',
        'SHOW_DBS', nan, '', nan, nan, nan, nan, nan, nan]], dtype=object)
Dvusrgme
  • 369
  • 1
  • 5
  • 13
  • The "Output" you've shown makes it look like every element is actually a _string_, which would explain why fillna isn't doing anything -- there are actually no nulls to fill -- but since that's not a standard output format it's hard to tell if that's actually what's going on or if it's something you introduced for display purposes. Please edit your question to include the output you get from a call like `print(df.values)`. – DSM Jul 23 '18 at 20:41
  • It would be really helpful if you make clear which code exactly produces this output. – christinabo Jul 23 '18 at 20:43
  • @christinabo Added the code that pulls the data. – Dvusrgme Jul 23 '18 at 21:31
  • `fillna` only fills `nan` values. so you either change empty cells to nan the fill them ie `df[df==""|df=='NaN] = np.nan; df.fillna(0)`or just fill the empty cells directly with zero. ie `df[df==""|df=='NaN] = 0`. Also note that `fillna` is vectorized. No need of the for loop. fillna will will the whole dataframe and not just one column – Onyambu Jul 23 '18 at 21:55
  • @Dvusrgme: you can slice to show the parts of the frame that matter, like `df.values[5:6, :5]`, etc. Basically we just need to confirm that your NaN is really just a string. – DSM Jul 23 '18 at 23:10
  • @Onyambu : I get this error when I tried your recommendation TypeError: Could not operate '' with block values unsupported operand type(s) for |: 'str' and 'int' – Dvusrgme Jul 24 '18 at 04:28
  • `df[df==""|df=='NaN'] = 0.` you should have seen the missing `'` after `NaN` – Onyambu Jul 24 '18 at 04:29
  • @DSM : Thanks.. Here is the output array([['Unknown', nan, nan, '1', '9']], dtype=object) – Dvusrgme Jul 24 '18 at 04:30
  • @Onyambu : Yup,Pycharm highlighted it. Error was after I added the quote. I feel like this is only happening for one field. I tried df['FieldName'] = df['FieldName'].str.replace('NaN','0').. still no go... is there a way I can ensure the columns do not have NaN. I don't think JSON object sends it like, but DF is doing this. – Dvusrgme Jul 24 '18 at 04:42
  • what does `df==""` or `df=='NaN'` give?? Do they give a logical dataframe?? – Onyambu Jul 24 '18 at 04:44
  • 0 False False False False False False False 1 False False False False False False False This is what I get.. – Dvusrgme Jul 24 '18 at 05:01

1 Answers1

0

Issue was due to the restAPI returning inconsistent data. The data from API for the affected field, came as 'NaN'

Which obviously was not considered during df.fillna(0, inplace=True)

I got around this using:

df.replace({'NaN': '0'}, regex=True, inplace=True)
Dvusrgme
  • 369
  • 1
  • 5
  • 13