I am trying to process a CSV file into a dict using a Dataflow template and Python.

As it is a template I have to use ReadFromText from the textio module, to be able to provide the path at runtime.

| beam.io.ReadFromText(contact_options.path)

All I need is to extract the first line of this text/CSV file, so that I can pass it to DictReader as the fieldnames.

If I use splitlines(), it brings back each element of the text file in a list:

return element.splitlines()

or

csv_data = []

split_element = element.split('\n')
for row in split_element:
    csv_data.append(row)

return csv_data

['phone_number', 'cid', 'first_name', 'last_name']
['          ', '101XXXXX', 'MurXXX', 'LevXXXX']
['3052XXXXX', '109XXXXX', 'MerXXXX', 'CoXXXX']
['954XXXXX', '10XXXXXX', 'RoXXXX', 'MaXXXXX']

However, if I then use, say, element[0], it just brings everything back without the list brackets. I have also tried splitting by '\n' and then using a for loop to produce a list object, but it produces almost the same result.

I cannot rely on using predetermined fieldnames as the csv files to be processed will all have different fieldnames and DictReader will not work effectively without fieldnames given.

EDIT:

The expected output is:

[{'phone_Number': '561XXXXX', 'first_Name': '', 'last_Name': 'BeXXXX', 'cid': '745XXXXX'}, {'phone_Number': '561XXXXX', 'first_Name': 'A', 'last_Name': 'BXXXX', 'cid': '61XXXXX'}]

EDIT:

Element contents:

"phone_Number","cid","first_Name","last_Name"
"5616XXXXX","745XXXX","","BeXXXXX"
"561XXXXXX","61XXXXX","A","BXXXXXXt"
"95XXXXXXX","6XXXXXX","A","BXXXXXX"
"727XXXXXX","98XXXXXX","A","CaXXXXXX"
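
For reference, here is a minimal sketch of pulling just the header row out of text shaped like the element contents above. It assumes the whole file is available as a single string and uses only the standard library; `header_fields` and `sample` are hypothetical names made up for this illustration, not part of Beam.

```python
import csv
import io


def header_fields(text):
    """Return the column names from the first line of CSV text."""
    # csv.reader handles the quoting, so '"cid"' comes back as 'cid'.
    reader = csv.reader(io.StringIO(text))
    # next() reads only the first row; fall back to [] for empty input.
    return next(reader, [])


# Sample shaped like the element contents shown in the question.
sample = (
    '"phone_Number","cid","first_Name","last_Name"\n'
    '"5616XXXXX","745XXXX","","BeXXXXX"\n'
    '"561XXXXXX","61XXXXX","A","BXXXXXXt"\n'
)

fields = header_fields(sample)
print(fields)  # ['phone_Number', 'cid', 'first_Name', 'last_Name']
```

The returned list could then be passed as the `fieldnames` argument of `csv.DictReader`.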
jmoore255
  • What is the expected output? Is it just the first row in dict format? what would be the values for the keys? – mad_ Jul 30 '18 at 17:28
  • Hi @mad_ thanks for the comment, I have put in my expected output. The only thing is that I am unable to extract the first line of the list element I have: if I try [0] it returns all the text, and the same with next(). If I can somehow retrieve the first line of text it would be easy to do, although I am unsure how to extract it within a ParDo function in Dataflow. – jmoore255 Jul 30 '18 at 17:50
  • If you wish to convert the first line which would be a list after splitting on any criteria you can get idea from my answer below in order to convert the list into dict. – mad_ Jul 30 '18 at 17:53
  • Yes I understand, but that will not give me my required result as I cannot single out the first list, if I try to return say element[0], it brings back the entire text file, without any brackets. – jmoore255 Jul 30 '18 at 17:59
  • Updated my answer. You can use pandas to load the values – mad_ Jul 30 '18 at 18:12
  • Thanks for your updated answer, although I can't seem to get to the stage of having the list elements, within one large list. – jmoore255 Jul 30 '18 at 18:47
  • What are the contents of split_element? – mad_ Jul 30 '18 at 18:51
  • I've added the elements contents – jmoore255 Jul 30 '18 at 19:01
  • read directly as csv then. `pd.read_csv()` – mad_ Jul 30 '18 at 19:04

2 Answers

Use pandas to load the values and use the first line as the column headers:

import pandas as pd
a_big_list=[['phone_number', 'cid', 'first_name', 'last_name'],
['          ', '101XXXXX', 'MurXXX', 'LevXXXX'],
['3052XXXXX', '109XXXXX', 'MerXXXX', 'CoXXXX'],
['954XXXXX', '10XXXXXX', 'RoXXXX', 'MaXXXXX']]

df=pd.DataFrame(a_big_list[1:],columns=a_big_list[0])

df.to_dict('records')
# [{'cid': '101XXXXX',
#   'first_name': 'MurXXX',
#   'last_name': 'LevXXXX',
#   'phone_number': '          '},
#  {'cid': '109XXXXX',
#   'first_name': 'MerXXXX',
#   'last_name': 'CoXXXX',
#   'phone_number': '3052XXXXX'},
#  {'cid': '10XXXXXX',
#   'first_name': 'RoXXXX',
#   'last_name': 'MaXXXXX',
#   'phone_number': '954XXXXX'}]
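
If pulling in pandas is not desirable, the same header-plus-rows conversion can be sketched with plain Python. This is only an illustration using sample rows from this answer; the names `rows`, `header`, `data`, and `records` are made up for the example.

```python
# First inner list is the header, the rest are data rows.
rows = [
    ['phone_number', 'cid', 'first_name', 'last_name'],
    ['3052XXXXX', '109XXXXX', 'MerXXXX', 'CoXXXX'],
    ['954XXXXX', '10XXXXXX', 'RoXXXX', 'MaXXXXX'],
]

# Split off the header, then pair each value with its column name.
header, *data = rows
records = [dict(zip(header, row)) for row in data]

print(records[0]['cid'])  # 109XXXXX
```

Note that `zip` stops at the shorter input, so a row with missing trailing values would silently produce a dict with fewer keys.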
mad_

I was able to figure this problem out with inspiration from @mad_'s answer, but that alone didn't give me the correct result initially, as I first needed to group my PCollection into one element. I found a way of doing this, inspired by this answer from Jiayuan Ma, and altered it slightly as follows:

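import apache_beam as beam
from apache_beam.io import ReadFromText

class Group(beam.DoFn):
  """Buffers the elements of a bundle and emits them as one list."""
  def __init__(self):
     self._buffer = []

  def process(self, element):
     # Collect each line instead of emitting it straight away.
     self._buffer.append(element)

  def finish_bundle(self):
     # Emit the buffered lines as a single element, then reset.
     if len(self._buffer) != 0:
        yield list(self._buffer)
        self._buffer = []

lines = (p
         | 'File reading' >> ReadFromText(known_args.input)
         | 'Group' >> beam.ParDo(Group()))
...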

Thus it grouped the entire CSV file as one object, and then I was able to apply mad_'s method to turn it into a dictionary.
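
As a rough sketch of that last step without pandas, the grouped list of lines could be handed to `csv.DictReader`, which takes its fieldnames from the first row when none are given. This assumes the grouped element is a list of strings with the header line first; bundle boundaries are chosen by the runner, so that is an assumption rather than a guarantee. `lines_to_dicts` and `grouped` are hypothetical names for illustration.

```python
import csv


def lines_to_dicts(lines):
    """Turn a list of CSV lines (header first) into a list of dicts."""
    # DictReader accepts any iterable of strings, so a list of lines works.
    # With no fieldnames argument it reads them from the first line.
    reader = csv.DictReader(lines)
    return [dict(row) for row in reader]


# Assumed shape of one grouped element: one string per line of the file.
grouped = [
    '"phone_Number","cid","first_Name","last_Name"',
    '"5616XXXXX","745XXXX","","BeXXXXX"',
    '"561XXXXXX","61XXXXX","A","BXXXXXXt"',
]

records = lines_to_dicts(grouped)
print(records[0]['last_Name'])  # BeXXXXX
```

In a pipeline, a function like this could presumably be applied to the grouped output with `beam.Map(lines_to_dicts)`.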

jmoore255