
I am using ConfigParser to read key-value pairs that are passed to my PySpark program. The code works fine when I run it from an edge node of a Hadoop cluster with the config file in a local directory on the edge node. It does not work when the config file is uploaded to an HDFS path and I try to access it with the parser.

The config file para.conf has the following contents:

[tracker]
port=9801

In local client mode, with para.conf in the local directory, I access the values as follows:

from ConfigParser import SafeConfigParser
parser = SafeConfigParser()
parser.read("para.conf")
myport = parser.get('tracker', 'port')

The above works fine...

On the Hadoop cluster, I uploaded para.conf to the HDFS path bdc/para.conf:

parser.read("hdfs://clusternamenode:8020/bdc/para.conf")

This doesn't return anything, and neither does the escaped variant below:

parser.read("hdfs:///clusternamenode:8020//bdc//para.conf")

Using the SparkContext, however, I can read this file, and it returns a valid RDD:

sc.textFile("hdfs://clusternamenode:8020/bdc/para.conf")

I am not sure, though, whether ConfigParser can extract the key values from this RDD.

Can anyone advise whether ConfigParser can be used to read files from HDFS? Or is there an alternative?

Dhruv
  • The problem is that ConfigParser can't handle HDFS file paths. You could implement your own config reader, or read the file with `bla = sc.textFile("hdfs://clusternamenode:8020/bdc/para.conf").collect()`, which gives you a list of strings. ConfigParser can handle strings with [read_string](https://docs.python.org/3/library/configparser.html#configparser.ConfigParser.read_string). – cronoik Apr 10 '19 at 14:46
  • read_string is not an option as I am using Python 2.7. I tried the approach suggested in https://stackoverflow.com/questions/21766451/how-to-read-config-from-string-or-list: `buf = StringIO.StringIO(s_config)` `config = ConfigParser.ConfigParser()` `config.readfp(buf)` But this gives a NoSectionError! – Dhruv Apr 11 '19 at 07:36
  • Can you please extend your question with the code you used and the complete error message you got? – cronoik Apr 11 '19 at 12:22
  • Using the read_string option: `import ConfigParser` `credstr = sc.textFile("hdfs://clusternamenode:8020/bdc/cre.conf").collect()` `parse_str = ConfigParser.ConfigParser()` `parse_str.read_string(credstr)` Error received: `AttributeError: ConfigParser instance has no attribute 'read_string'` – Dhruv Apr 12 '19 at 07:01
  • Using the file buffer option: `import ConfigParser` `import StringIO` `credstr = sc.textFile("hdfs://clusternamenode:8020/bdc/cre.conf").collect()` `buf = StringIO.StringIO(credstr)` `parse_str = ConfigParser.ConfigParser()` `parse_str.readfp(buf)` `parse_str.get('tracker','port')` Error received: `raise NoSectionError(section)` `ConfigParser.NoSectionError: No section: 'tracker'` – Dhruv Apr 12 '19 at 07:01

1 Answer


I have copied most of the code you provided in the comments. You were really close to the solution. The problem is that sc.textFile produces one RDD element for every line of the file, so .collect() returns a list of strings, one per line of your document. StringIO expects a single string, not a list, so you have to restore the original document structure by joining the list with newline characters. See the working example below:

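import ConfigParser
import StringIO

# collect() returns a list with one string per line of the file
credstr = sc.textFile("hdfs://clusternamenode:8020/bdc/cre.conf").collect()
# Join the lines back into one string before wrapping it in a file-like buffer
buf = StringIO.StringIO("\n".join(credstr))
parse_str = ConfigParser.ConfigParser()
parse_str.readfp(buf)
parse_str.get('tracker','port')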

Output:

'9801'
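
On Python 3 the same join-then-parse idea should work with read_string, which was mentioned in the comments. The following is only a minimal sketch under a few assumptions: the hard-coded `lines` list stands in for what `sc.textFile(...).collect()` would return (so it runs without a cluster), and `parse_config_lines` is a hypothetical helper name, not part of any library.

```python
import configparser

# Stand-in for sc.textFile("hdfs://.../para.conf").collect():
# a list with one string per line of the config file (assumption for this sketch).
lines = ["[tracker]", "port=9801"]


def parse_config_lines(lines):
    """Hypothetical helper: rebuild the file text from its lines and parse it."""
    parser = configparser.ConfigParser()
    # read_string needs one string, so join the lines with newlines first.
    parser.read_string("\n".join(lines))
    return parser


parser = parse_config_lines(lines)
myport = parser.get("tracker", "port")
print(myport)
```

Note that `get` returns the value as a string; `parser.getint("tracker", "port")` can be used if an integer is needed.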
cronoik
  • Awesome! This works :) I had found quite a bad solution: taking the list elements and slicing substrings out of them to get my value. Will use this one now. Thanks! – Dhruv Apr 15 '19 at 06:17