
I've just started administering a Hadoop cluster. We're using Bright Cluster Manager up to the O/S level (CentOS 7.1) and then Ambari together with Hortonworks HDP 2.3 for Hadoop.

I'm constantly getting requests for new python modules to be installed. Some modules were installed at setup using yum, and as the cluster has progressed, others have been installed using pip.
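For illustration, the two paths have looked roughly like this (the module names here are just examples, not the actual requests):

    # at cluster setup, from the distribution repositories
    yum install python-requests

    # later on, as requests came in
    pip install requests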

What is the "right" way to do this? Always use yum and not be able to provide the latest and greatest modules? Always use pip and not have one point of truth (yum) showing which packages are installed? Or is it fine to use both pip and yum together?

I'm just worried that I'm filling the system with junk and too many versions of python modules. Any suggestions?

ClusterAdmin
  • Better to use a separate python (not messing with the system python) and use pip on top of it to manage python modules with exact versions. Since you are managing clusters for Hadoop, you can automate installations too (see the sketch after these comments). – Murali Mopuru Jan 19 '16 at 10:14
  • What do you mean "separate python"? You mean installing python from scratch instead of using the yum packages that CentOS has available? And yes, we are automating installations. In the Bright Cluster Manager I can install software/modules in a base image and then update all nodes. – ClusterAdmin Jan 19 '16 at 11:45
  • "separate python" means using virtualenv, I guess. – HUA Di Jan 11 '17 at 09:52

1 Answer


Packages which are part of your distribution should be preferred, because they have been tested to work properly on your system. These packages are installed system-wide.
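For example (the package name is only an illustration), a module that your distribution does package is installed through yum and then shows up in the normal RPM inventory:

    yum install python-requests      # the distribution's build of the module
    rpm -q python-requests           # visible in the package database like any other RPM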

However, if a suitable RPM package is not provided, go ahead and install it from e.g. PyPI or GitHub with pip, but deploy virtual Python environments whenever possible. With virtual envs you don't have to install third-party packages system-wide, and you end up with several smaller sets of packages that are much easier to manage than one big set.
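A minimal sketch of that route, assuming the virtualenv tool is available and using an example directory and package name:

    # create an isolated environment; nothing below touches the system site-packages
    virtualenv /opt/project-env

    # install third-party modules from PyPI into that environment only
    /opt/project-env/bin/pip install some-module

    # run jobs with the environment's interpreter to pick up those modules
    /opt/project-env/bin/python my_script.py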

VPfB