The goal is to write a csv file for each Invoice from a webpage. I'm trying to do this with a web scraper, mainly using Selenium.

Each Invoice has its own number, date, close_date, amount, and list of Records

Each Record in an Invoice has its own id, description, storage, weight, price, and quantity

I was able to print all of the data I need to the console, like so:

Going to: https://thewebsite/thing/my_account.whatever?is=checkout#invoices/429807/paid-invoices
Extracting...
------------------------------
ID: 30795 Description: YOGURT, BLUEBERRY, LOW FAT, DANNON
Storage:  35 Degree Cooler Weight:  110 Price:  $0.00 Quantity:  22
------------------------------
ID: 86546 Description: SWEET POTATOES, P/L
Storage:  55 Degree Cooler Weight:  240 Price:  $0.00 Quantity:  6
------------------------------
ID: 36446 Description: PINEAPPLE, FRESH, P/L
Storage:  55 Degree Cooler Weight:  560 Price:  $0.00 Quantity:  20

I did this with these:

class myRecord(object):
    id = ""
    description = ""
    storage = ""
    weight = ""
    price = ""
    quantity = ""

    def _init_(self, id, description, storage, weight, price, quantity):
        self.id = id
        self.description = description
        self.storage = storage
        self.weight = weight
        self.price = price
        self.quantity = quantity    

class myInvoice(object):
    number = ""
    date = ""
    close_date = ""
    amount = ""

    def _init_(self, number, date, close_date, amount, formatted_records_list = None):
        self.number = number
        self.date = date
        self.close_date = close_date
        self.amount = amount
        if formatted_records_list is None:
            self.formatted_records_list = []
        else:
            self.formatted_records_list = formatted_records_list

I assigned values to each attribute from html elements like this (I'll just use the "number" attribute as an example)

invoice_number_list = browser.find_elements_by_class_name("tranid")
i = 0
for invoice_link in invoice_link_list:     # invoice links are basically urls to each invoice
    invoice = myInvoice()
    invoice.number = invoice_number_list[i].get_attribute('innerHTML')
    i += 1

From what I've seen online, it's not super obvious how to make a csv file out of the objects I used

I found this: "Writing list of objects to csv file"

That guy basically says I should use namedtuples, which to my understanding are kind of like stripped-down objects made on a budget. With those, I (should) have an easier time making csv files. So I made this:

Record = namedtuple('Record', ['id', 'description', 'storage', 'weight', 'price', 'quantity'])
Invoice = namedtuple('Invoice', ['number', 'date', 'close_date', 'amount', 'Record_list'])

Already alarm bells are going off. Can I have a list of namedtuples be an attribute for a namedtuple? I need one csv file per Invoice. Each invoice has only one number, date, close_date, and amount. However, it can have a ton of Records. My thought process is telling me I need to have a list of Records attached to each Invoice.
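On the "alarm bells" point: a namedtuple field can hold any Python object, including a list of other namedtuples. A minimal sketch, reusing the Record and Invoice definitions above. The record values are taken from the console output earlier; the invoice number and amount are placeholders for illustration only.

```python
from collections import namedtuple

Record = namedtuple('Record', ['id', 'description', 'storage', 'weight', 'price', 'quantity'])
Invoice = namedtuple('Invoice', ['number', 'date', 'close_date', 'amount', 'Record_list'])

# Record values from the console output; invoice values are placeholders.
records = [
    Record('30795', 'YOGURT, BLUEBERRY, LOW FAT, DANNON', '35 Degree Cooler', '110', '$0.00', '22'),
    Record('86546', 'SWEET POTATOES, P/L', '55 Degree Cooler', '240', '$0.00', '6'),
]
invoice = Invoice('429807', None, None, None, records)

# The tuple itself is immutable, but the list it holds can still grow.
invoice.Record_list.append(
    Record('36446', 'PINEAPPLE, FRESH, P/L', '55 Degree Cooler', '560', '$0.00', '20'))

# Fields cannot be reassigned on an instance; _replace returns a new Invoice instead.
invoice = invoice._replace(amount='$0.00')
```

So the nesting itself is fine; the catch is that a namedtuple's own fields are fixed once it is created, which is why values have to be supplied up front or swapped in with _replace.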

I tried assigning values to an Invoice that is a namedtuple and had trouble:

Invoice_number_list = browser.find_elements_by_class_name("tranid")
i = 0
for Invoice_link in Invoice_link_list:
    #Invoice.number = Invoice_number_list[i].get_attribute('innerHTML') #doesn't work
    Invoice_list.extend(Invoice._make((Invoice_number_list[i].get_attribute('innerHTML'), None, None, None, None)))
    i += 1

The other Invoice values of date, close_date, and amount go into index [1], [2], and [3]. I leave [4] as None since that's where the list of Records for the Invoice should go.

The "extend()" ends up making my Invoice list into a string, which looks like it could be useful for making a dictionary (which I might need if I make a csv the hard way - I think with the right namedtuple it's almost as easy as saying "write this data to a csv"), but I need to be able to attach a list of Records to each individual Invoice - I can't do that with a string.
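A note on what extend() does here, as a minimal sketch: list.extend() iterates over its argument, and a namedtuple iterates over its fields, so the fields are added to the list one by one rather than the Invoice as a whole (the result is a flat list of values, not a string). list.append() keeps the Invoice intact. The invoice number below is a placeholder.

```python
from collections import namedtuple

Invoice = namedtuple('Invoice', ['number', 'date', 'close_date', 'amount', 'Record_list'])

# extend() unpacks the namedtuple into its individual fields.
flattened = []
flattened.extend(Invoice._make(('429807', None, None, None, None)))
print(flattened)  # ['429807', None, None, None, None]

# append() adds the namedtuple as a single element.
invoice_list = []
invoice_list.append(Invoice._make(('429807', None, None, None, [])))
print(invoice_list[0].number)  # 429807
```

With append(), each element of the list is still an Invoice, so a list of Records can be attached to it through the Record_list field.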

Here are what I think my options are:

  • make Invoice a regular object and make Record a regular object -> make csv out of these
  • make Invoice a namedtuple and Record a namedtuple -> make csv out of these
  • make one a namedtuple and the other a regular object -> make csv out of these

Nothing is glaringly obvious to me at the moment.

TL;DR: I'm trying to figure out how to write csv files from data. Should I stick with trying to make csvs out of namedtuples, or try and figure out how to do it with object attributes instead? How would I do either?

Glen Cote

1 Answer

I think one of the most straightforward solutions here is to implement to_dict and get_csv_fields methods on the myInvoice class and then write each object to a csv file with the built-in csv.DictWriter.

Of course, if you're on Python 3.7 or higher, data classes are a much more elegant way to do it, but the idea remains the same.

class myInvoice(object):
    number = ""
    date = ""
    close_date = ""
    amount = ""
    def __init__(self, number, date, close_date, amount, formatted_records_list = None):
        self.number = number
        self.date = date
        self.close_date = close_date
        self.amount = amount
        if formatted_records_list is None:
            self.formatted_records_list = []
        else:
            self.formatted_records_list = formatted_records_list

    @staticmethod
    def get_csv_fields():
        return ['number', 'date', 'close_date', 'amount', 'formatted_records_list']

    def to_dict(self):
        return {el: getattr(self, el) for el in self.get_csv_fields()}           

Then you can write the objects to csv as follows:

import csv

...

invoices = [] # list of myInvoice instances

with open('invoices.csv', 'w', newline='') as csvfile:
    # the header row and column order come from the class itself
    writer = csv.DictWriter(csvfile, fieldnames=myInvoice.get_csv_fields())

    writer.writeheader()
    for invoice in invoices:
        writer.writerow(invoice.to_dict())
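
Note that with the code above, formatted_records_list ends up as a single column holding the list's string representation. If the goal is one csv file per invoice with one row per record, something along the following lines may be closer. This is only a sketch using data classes (Python 3.7+): the write_invoice_csv helper, the file naming scheme, and the choice to repeat the invoice fields on every row are assumptions for illustration, not part of the original code.

```python
import csv
from dataclasses import dataclass, field, fields, asdict
from typing import List


@dataclass
class Record:
    id: str
    description: str
    storage: str
    weight: str
    price: str
    quantity: str


@dataclass
class Invoice:
    number: str
    date: str
    close_date: str
    amount: str
    records: List[Record] = field(default_factory=list)


def write_invoice_csv(invoice, path):
    """Hypothetical helper: one row per record, invoice fields repeated on each row."""
    invoice_fields = ['number', 'date', 'close_date', 'amount']
    record_fields = [f.name for f in fields(Record)]
    with open(path, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=invoice_fields + record_fields)
        writer.writeheader()
        for record in invoice.records:
            row = {name: getattr(invoice, name) for name in invoice_fields}
            row.update(asdict(record))
            writer.writerow(row)


# Placeholder invoice values; the record values come from the question's console output.
invoice = Invoice('429807', '', '', '', [
    Record('30795', 'YOGURT, BLUEBERRY, LOW FAT, DANNON', '35 Degree Cooler', '110', '$0.00', '22'),
])
write_invoice_csv(invoice, 'invoice_{}.csv'.format(invoice.number))
```

Calling the helper in a loop over all invoices would then give one file per invoice.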
Oleh Rybalchenko