
I'm totally new to Spark; I have a Spark DataFrame like so:

+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
|               time|  hostname|group|          mountpoint|    inode|  size|              ctime|              mtime|              atime|
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|       23|    92|2012-09-03 15:14:56|2012-04-10 19:08:05|2013-02-07 19:05:06|
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|       22|    74|2012-09-03 15:14:56|2011-08-09 16:16:40|2013-02-07 19:05:06|
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|189926604|167541|2012-09-19 05:47:48|2009-12-22 17:06:14|2013-03-11 20:35:19|
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|189926608|   354|2012-09-19 05:47:49|2009-12-22 17:06:15|2013-03-11 20:35:23|
|2016-09-20 00:32:01|sysdev4500|  scs|/zpool1/wain008/g...|189926601| 10580|2012-09-19 05:47:48|2009-12-22 17:06:14|2013-03-11 20:35:19|
+-------------------+----------+-----+--------------------+---------+------+-------------------+-------------------+-------------------+
only showing top 5 rows

I would like to get a graphical plot of the histogram distribution of the atime values. What is the most efficient way of doing this in PySpark (with Jupyter)? I will have literally a billion rows of data.
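For reference, this is the kind of approach I was imagining for the histogram, though I'm not sure it's idiomatic (just a sketch; it assumes my DataFrame is called df and that atime is a timestamp column):

    import matplotlib.pyplot as plt
    from pyspark.sql import functions as F

    # Convert atime to epoch seconds on the cluster and let Spark compute
    # the histogram in a distributed way; only the bucket boundaries and
    # counts come back to the driver, never the billion raw rows.
    epochs = (df
              .select(F.unix_timestamp("atime").alias("atime_epoch"))
              .rdd
              .map(lambda r: r.atime_epoch)
              .filter(lambda x: x is not None))   # skip null atimes

    buckets, counts = epochs.histogram(50)        # 50 equal-width buckets

    widths = [b - a for a, b in zip(buckets, buckets[1:])]
    plt.bar(buckets[:-1], counts, width=widths, align="edge")
    plt.xlabel("atime (unix epoch seconds)")
    plt.ylabel("file count")
    plt.show()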

I would also like a CDF of the sum of the file sizes plotted, to show how much data (in bytes) has or has not been accessed, based on when the files were last accessed.
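Similarly, for the CDF I was thinking of something along these lines (again only a sketch; it assumes df is my DataFrame and size is in bytes):

    import matplotlib.pyplot as plt
    from pyspark.sql import functions as F

    # Sum sizes per atime day on the cluster, then compute the running
    # (cumulative) total in pandas once the grouped result is small
    # enough to collect to the driver.
    by_day = (df
              .groupBy(F.to_date("atime").alias("atime_day"))
              .agg(F.sum("size").alias("bytes"))
              .orderBy("atime_day")
              .toPandas())

    by_day["cumulative_bytes"] = by_day["bytes"].cumsum()

    plt.plot(by_day["atime_day"], by_day["cumulative_bytes"])
    plt.xlabel("last access time (atime)")
    plt.ylabel("cumulative bytes last accessed on or before this date")
    plt.show()

Is this a reasonable way to do it at this scale, or is there a better pattern?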

yee379
