
I am working on my thesis, and I have the opportunity to set up a working environment to test the functionality and see how it works.

The following points should be covered:

  • jupyterhub (within a private cloud)
  • pandas, numpy, SQL, nbconvert, nbviewer
  • get data into a DataFrame (CSV), analyze the data, store the data (RDD? HDF5? HDFS?)
  • spark for future analysis
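
The ingest/analyze/store step above can be sketched in pandas roughly like this (the sample CSV and its column names are made up for illustration; HDF5 storage via `to_hdf` additionally needs the PyTables package):

```python
import io
import pandas as pd

# Hypothetical sample standing in for one of the table extracts
# (column names are invented for illustration).
csv_data = io.StringIO(
    "doc_no,material,quantity,amount\n"
    "1,M1,10,100.0\n"
    "2,M1,5,55.0\n"
    "3,M2,8,80.0\n"
)

# Get the data into a DataFrame ...
df = pd.read_csv(csv_data)

# ... analyze it ...
totals = df.groupby("material")["amount"].sum()

# ... and store it. HDF5 suits large, mostly-numeric tables; pandas
# writes it via PyTables (`pip install tables`):
# df.to_hdf("extract.h5", key="extract", mode="w")
```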

The test scenario will consist of:

  • a multi-user environment with notebooks per user/topic
  • analyzing structured tables (RSEG, MSEG, EKPO) with several million rows in a 3-way match using pandas, numpy, Spark (Spark SQL), matplotlib, etc.; it's about 3 GB of data across those 3 tables
  • exporting notebooks with nbconvert/nbviewer to PDF, read-only notebooks, and/or reveal.js slides
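
The 3-way match over the three tables can be prototyped in pandas as a pair of merges. This is only a sketch: the join keys (EBELN/EBELP) and the quantity columns are assumptions for illustration, and the real RSEG/MSEG/EKPO extracts will have different and far more columns:

```python
import pandas as pd

# Tiny stand-ins for the three tables; EBELN/EBELP (PO number/item) are
# assumed as the join keys here - adjust to what your extracts contain.
ekpo = pd.DataFrame({"EBELN": [1, 2], "EBELP": [10, 10], "MENGE_PO": [100, 50]})
mseg = pd.DataFrame({"EBELN": [1, 2], "EBELP": [10, 10], "MENGE_GR": [100, 40]})
rseg = pd.DataFrame({"EBELN": [1, 2], "EBELP": [10, 10], "MENGE_IV": [100, 50]})

# 3-way match: join purchase-order items with goods receipts and invoices.
matched = (
    ekpo.merge(mseg, on=["EBELN", "EBELP"], how="outer")
        .merge(rseg, on=["EBELN", "EBELP"], how="outer")
)

# Flag lines where ordered, received, and invoiced quantities disagree.
matched["ok"] = (matched["MENGE_PO"] == matched["MENGE_GR"]) & (
    matched["MENGE_GR"] == matched["MENGE_IV"]
)
exceptions = matched[~matched["ok"]]
```

The same two-merge shape translates almost directly to Spark SQL joins once the data outgrows a single machine.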

Can you guys please give me some hints or share your experiences: how many nodes should I use for testing, and which Linux distribution is a good starting point? I am sure there are many more questions; I am having trouble finding information on how to evaluate the possible answers.

Thanks in advance!

Lanfear
  • "Which Linux distro" is itself a far too broad, opinionated question. What has convinced you that you need Linux (or Hadoop) at all? Spark does not need Hadoop. Pandas can very easily read 3 GB of data on its own (assuming you have the RAM available). Or you can load up some SQL database so not everything is in memory at once – OneCricketeer Dec 22 '16 at 03:32
  • yes, you are right. My problem is, the last time I used Linux I had to reprogram a keyboard driver for a system, and I decided to use Ubuntu because I had a tutorial which stated exactly how to compile the kernel afterwards. I have never really worked with Linux, let alone with clusters. – Lanfear Dec 22 '16 at 07:57
  • the possible use of Spark is a given from the project owner. The system has to be able to store data we get as files (DTD, CSV, XML) or as the return value of a function call (SAP RFC), as well as cover a wide range of possible future data like logfiles, scanned invoices and more. Those 3 GB of data are also just one first test data set; if the system is running in the future, there will most likely be hundreds of such data sets. – Lanfear Dec 22 '16 at 08:05
  • For the most part Ubuntu "just works" with most systems. But if you have little to no Linux experience, then you are probably in over your head, as even getting Hadoop + Spark set up requires very good Linux experience (in my opinion). – OneCricketeer Dec 22 '16 at 16:42
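
OneCricketeer's point about not needing everything in memory at once can be tried with pandas' chunked CSV reading; the small in-memory sample below stands in for a large file on disk:

```python
import io
import pandas as pd

# Stand-in for a multi-GB CSV file; in practice you would pass a file path.
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

# chunksize makes read_csv return an iterator of DataFrames, so only one
# chunk is resident in memory at a time (real chunks would be much larger).
total = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk["value"].sum()
```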

0 Answers