I have a document with the following structure:

CUSTOMERID1
    conversation-id-123
    conversation-id-123
    conversation-id-123
CUSTOMERID2
    conversation-id-456
    conversation-id-789

I'd like to parse the document to get a frequency distribution plot with the number of conversations on the X axis and the number of customers on the Y axis. Does anyone know the easiest way to do this with Python?

I'm familiar with the frequency distribution plot piece but am struggling with how to parse the data into the right data structure to build the plot. Thank you in advance for any help you can provide!

coco
  • You have two unique customers and three unique conversations. You cannot plot #customers vs. #conversations. Do you want the number of conversations per customer or the number of customers per conversation? Could you provide an example of the output plot you want for the example you gave? How does CUSTOMERID1 have the same conversation three times? – Jakub Jul 31 '20 at 20:11

1 Answer

You can try the following:


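    import pandas as pd

    dict_ = {}

    # Map each customer line to the list of conversation lines that follow it.
    with open('file.csv') as f:
        for line in f:
            if line.startswith('CUSTOMERID'):
                dict_[line.strip('\n')] = list_ = []
            else:
                list_.append(line.strip().split('-'))

    # One entry per (customer, conversation); keep the last ID segment,
    # then count the conversations for each customer and plot the counts.
    df = pd.DataFrame.from_dict(dict_, orient='index').stack()
    df.transform(lambda x: x[-1]).groupby(level=0).count().plot(kind='bar')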

Output:

[image: bar chart produced by the code above]

If you want only 1 and 2 on the X axis, change the line dict_[line.strip('\n')] = list_ = [] to dict_[line.strip('CUSTOMERID/\n')] = list_ = [].
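
The bar chart above shows the number of conversations for each customer. To get the distribution the question asks for (number of conversations on the X axis, number of customers on the Y axis), the per-customer counts would need to be tallied once more. A minimal sketch of that step, using only the standard library; the sample dictionary and the function name are hypothetical, and it assumes every conversation line counts once, including repeated IDs:

```python
from collections import Counter

# Hypothetical sample shaped like dict_ above: customer -> conversation IDs.
conversations_by_customer = {
    "CUSTOMERID1": ["conversation-id-123", "conversation-id-123", "conversation-id-123"],
    "CUSTOMERID2": ["conversation-id-456", "conversation-id-789"],
}


def conversation_count_distribution(by_customer):
    """Return {number of conversations: number of customers with that many}."""
    # Assumption: each listed line is one conversation. Use len(set(convs))
    # instead if repeated IDs should be counted only once.
    per_customer = {cust: len(convs) for cust, convs in by_customer.items()}
    return dict(sorted(Counter(per_customer.values()).items()))


distribution = conversation_count_distribution(conversations_by_customer)
print(distribution)  # {2: 1, 3: 1}
```

The keys of the result are the X values and the values are the bar heights, so they can be passed to whichever bar plot you already use.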

Sayandip Dutta
  • CUSTOMERID1 is just an example of an actual ID. An actual ID would not start with "CUSTOMERID". The same goes for the conversation ID. What you want is something like this: customers = [i.strip() for i in lines if not i.startswith('----')] and conversations = [i.strip() for i in lines if i.startswith('----')] to create your dictionary. Note: I was unable to format consecutive spaces in this comment, so the four dashes should be replaced with four spaces. – Jakub Jul 31 '20 at 20:29
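
The indentation-based idea in this comment could be sketched roughly as follows. This is only an illustration: the function name and sample text are hypothetical, and it assumes that customer lines start in the first column while conversation lines begin with whitespace.

```python
def parse_by_indentation(lines):
    """Group indented conversation lines under the preceding unindented customer line."""
    by_customer = {}
    current = None
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        if raw[0].isspace():
            # Indented line: a conversation belonging to the current customer.
            if current is None:
                raise ValueError("conversation line appears before any customer line")
            by_customer[current].append(raw.strip())
        else:
            # Unindented line: start of a new customer block.
            current = raw.strip()
            by_customer.setdefault(current, [])
    return by_customer


# Hypothetical sample mirroring the structure in the question.
sample = """\
CUSTOMERID1
    conversation-id-123
    conversation-id-123
    conversation-id-123
CUSTOMERID2
    conversation-id-456
    conversation-id-789
"""
parsed = parse_by_indentation(sample.splitlines())
```

Because the grouping relies on leading whitespace rather than a fixed prefix, it does not depend on what the real customer or conversation IDs look like.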