i'm totally new to spark; i have a spark dataframe like so:
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
| time| hostname|group| mountpoint| inode| size| ctime| mtime| atime|
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...| 23| 92|2012-09-03 15:14:56|2012-04-10 19:08:05|2013-02-07 19:05:06|
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...| 22| 74|2012-09-03 15:14:56|2011-08-09 16:16:40|2013-02-07 19:05:06|
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...|189926604|167541|2012-09-19 05:47:48|2009-12-22 17:06:14|2013-03-11 20:35:19|
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...|189926608| 354|2012-09-19 05:47:49|2009-12-22 17:06:15|2013-03-11 20:35:23|
|2016-09-20 00:32:01|sysdev4500| scs|/zpool1/wain008/g...|189926601| 10580|2012-09-19 05:47:48|2009-12-22 17:06:14|2013-03-11 20:35:19|
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
only showing top 5 rows
i would like to get a graphical plot of the histogram distribution of the atime column. what is the most efficient way of doing this in pyspark (with jupyter)? i will have literally a billion rows of data.
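something like this is roughly what i'm imagining (i'm new to the API, so the bucket count of 100 is just a guess and df is my dataframe above -- i don't know if this is the efficient way):

```python
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

# convert atime to unix seconds so it can be binned numerically
atimes = df.select(F.unix_timestamp('atime').alias('ts'))

# compute the histogram on the cluster; only bucket edges/counts come back to the driver
buckets, counts = atimes.rdd.map(lambda row: row.ts).histogram(100)

# plot the (small) aggregated result locally in the notebook
plt.bar(buckets[:-1], counts, width=(buckets[1] - buckets[0]))
plt.xlabel('atime (unix seconds)')
plt.ylabel('file count')
plt.show()
```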
i would also like a CDF of the sum of the size of the files plotted, to show how much data (in bytes) has/has not been accessed based on when the file was last accessed.
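for the CDF, this is a rough sketch of what i'm after, binning by day of last access (again just my guess at how it should look, column names are from my dataframe above):

```python
import matplotlib.pyplot as plt
from pyspark.sql import functions as F

# total bytes per day of last access -- the aggregate is small, so toPandas() should be safe
by_day = (df.groupBy(F.to_date('atime').alias('day'))
            .agg(F.sum('size').alias('bytes'))
            .orderBy('day')
            .toPandas())

# cumulative sum gives "bytes last accessed on or before this date"
by_day['cum_bytes'] = by_day['bytes'].cumsum()

plt.plot(by_day['day'], by_day['cum_bytes'])
plt.xlabel('last access date')
plt.ylabel('cumulative bytes')
plt.show()
```

is this the right approach, or is there a better/faster way to do the aggregation at that scale?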