
I have a JSON file whose `description` field I converted to a string in order to remove HTML tags, but the function outputs a list of single-character unicode strings, as shown below:

[u'', u'', u'', u'c', u'i', u's', u' ', u'b', u'y', u' ', u'd', u'e', u'l', u'o', u'i', u't', u't', u'e', u'']

I want to recover the text `cis by deloitte` from the output above. How can I do this? The code I have tried is shown below:

import re

def cleaning_data(input_json_data):
    jd = input_json_data['description']
    jd = [x.lower() for x in jd]
    jd = str(jd)
    jd = re.sub('<[^>]*>', '', jd)
    print jd
Rishabh Rusia
  • Why are you converting the `jd` list into a string with `jd = str(jd)`? – PM 2Ring Jan 22 '17 at 12:45
  • Because the re module works only on a buffer or string, I had to convert it to a string. Please let me know if there is another way. – Rishabh Rusia Jan 22 '17 at 13:54
  • Is `input_json_data['description']` a string, or is it a list of strings? If it's a single string you should've converted it to lowercase with `jd = input_json_data['description'].lower()`. But you can join a list of strings into a string with `''.join(jd)`, as shown in the answer below and in [the linked question](http://stackoverflow.com/questions/12453580/concatenate-item-in-list-to-strings). – PM 2Ring Jan 22 '17 at 16:12
  • @PM 2Ring `input_json_data` is loaded from a JSON file, and I am taking the data under its `description` key. The type of `input_json_data['description']` is unicode, which is why I converted it to a string. If there is a way to convert JSON data into a DataFrame, do let me know; it would be helpful for my task. – Rishabh Rusia Jan 22 '17 at 17:13
  • The re module works perfectly fine with a unicode object (aka unicode string). There's no need to convert it to `str`. – lenz Jan 23 '17 at 14:33

1 Answer


If it is a list, simply join its elements on the empty string:

a = [u'', u'', u'', u'c', u'i', u's', u' ', u'b', u'y', u' ', u'd', u'e', u'l', u'o', u'i', u't', u't', u'e', u'']
print(''.join(a))

If it is not a list but a string containing the list's repr, parse it first with `ast.literal_eval`:

from ast import literal_eval

a = """[u'', u'', u'', u'c', u'i', u's', u' ', u'b', u'y', u' ', u'd', u'e', u'l', u'o', u'i', u't', u't', u'e', u'']"""
a = literal_eval(a)
print(''.join(a))

Output:

cis by deloitte
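
As the comments point out, the character list can also be avoided at the source. The following is only a minimal sketch, assuming `input_json_data['description']` is a single (unicode) string as the asker states. The name `clean_description` and the sample input are hypothetical, chosen for illustration. It lowercases the string directly and passes it to `re.sub` without calling `str()`:

```python
import re

# Same tag-stripping pattern as in the question, compiled once.
TAG_RE = re.compile(r'<[^>]*>')

def clean_description(input_json_data):
    # Assumption: 'description' holds one string, not a list of strings.
    text = input_json_data['description']
    # Lowercase the whole string at once; iterating over a string
    # (as the list comprehension in the question does) yields single characters.
    text = text.lower()
    # The re module accepts unicode strings directly, so no str() call is needed.
    return TAG_RE.sub('', text)

# Hypothetical sample input for illustration.
sample = {'description': u'<p>CIS by <b>Deloitte</b></p>'}
print(clean_description(sample))
```

Because the value stays a string throughout, there is nothing to join afterwards.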
Mohammad Yusuf