This question is similar to: Populating a SQLite3 database from a .txt file with Python but my goal is different since I want to populate the database from a pandas dataframe rather than a text file.
My goal is to take an XML dump (StackExchange data dump) and populate my database based on some of the data contained in it. I started by using xml.etree.ElementTree to parse the file, then I pulled the data into a pandas data frame for easy indexing.
I need to iterate through the data frame to check for conditions on the questions, make some modifications on fields, and then populate the fields in my question model based on that. Here is my question model:
class Question(models.Model, HitCountMixin):
"""Model class to contain every question in the forum"""
title = models.CharField(max_length=200, blank=False)
description = MarkdownxField()
pub_date = models.DateTimeField('date published', auto_now_add=True)
tags = TaggableManager()
user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
closed = models.BooleanField(default=False)
positive_votes = models.IntegerField(default=0)
negative_votes = models.IntegerField(default=0)
total_points = models.IntegerField(default=0)
I know that I can start the django shell as follows:
python manage.py shell
And that I need to do something as follows to create the objects:
from your_app.models import Question
# If you're using different field names, change this list accordingly.
# The order must also match the column order in the CSV file.
fields = [<question fields>]
data = <stuff from PD dataframe>
Question.objects.create(**dict(zip(fields, data)))
But the issue is that I don't know how to then get my modified data from the data frame to the script above. I was looking at using subprocess.call, but I'm wondering if there's a way to do this directly without starting the shell. Maybe some direct call to populate the db from my main script?