I have a CSV file with 2 columns and 20,000 rows that I would like to import into Google Cloud Datastore. I'm new to Google Cloud and NoSQL databases. I have tried using Dataflow, but it requires a JavaScript UDF function name. Does anyone have an example of this? I will be querying this data once it's in Datastore. Any advice or guidance on how to do this would be appreciated.
2 Answers
Using Apache Beam, you can read a CSV file using the TextIO class. See the TextIO documentation.
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("gs://path/to/file.csv"));
Next, apply a transform that parses each row of the CSV file and returns an Entity object. Depending on how you want to store each row, construct the appropriate Entity object. This page has an example of how to create an Entity object.
.apply(ParDo.of(new DoFn<String, Entity>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String row = c.element();
    // TODO: parse row (split on commas) and construct an Entity object
    Entity entity = ...
    c.output(entity);
  }
}));
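For a two-column CSV, the body of processElement might look something like the sketch below. The kind name "MyKind" and the property name "value" are placeholders, and DatastoreHelper (from com.google.datastore.v1.client) is one convenient way to build keys and values:
String[] columns = row.split(",");
Entity.Builder entityBuilder = Entity.newBuilder();
// Use the first column as the key name under a placeholder kind.
entityBuilder.setKey(DatastoreHelper.makeKey("MyKind", columns[0]).build());
// Store the second column as a string property (placeholder property name).
entityBuilder.putProperties("value", DatastoreHelper.makeValue(columns[1]).build());
c.output(entityBuilder.build());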
Lastly, write the Entity objects to Cloud Datastore. See the DatastoreIO documentation.
.apply(DatastoreIO.v1().write().withProjectId(projectId));
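Chained together, the whole thing is one short pipeline. In this sketch, CsvToEntityFn stands for the anonymous DoFn above and projectId is assumed to hold your GCP project ID:
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("gs://path/to/file.csv"))
 .apply(ParDo.of(new CsvToEntityFn()))   // the DoFn shown above
 .apply(DatastoreIO.v1().write().withProjectId(projectId));
p.run().waitUntilFinish();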

Andrew Nguonly
Thank you Andrew. I have a couple of noob questions. I have taken a look at the TextIO documentation and have a question. Where will I be running TextIO? In Apache Beam or in Dataflow? Also, where will I be applying the transform and writing the entities to Cloud Datastore? I see that I can run jobs in Dataflow. Is this what you are referring to? – IamSule Jan 27 '18 at 20:19
Apache Beam is a programming model for defining pipelines. The pipelines can be run on an execution engine such as Cloud Dataflow. You don't actually "run" `TextIO`. You define a pipeline using the Apache Beam SDKs. For example, using the Java SDK, `TextIO.Read` is the input transform and `DatastoreV1.Write` is the output transform. You can apply any transform in between to implement ETL logic. Once the pipeline is defined/implemented, it can then be deployed/run. – Andrew Nguonly Jan 27 '18 at 21:46
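To make that concrete, deploying the pipeline above to Cloud Dataflow might look roughly like the following sketch (the project ID and GCS bucket are placeholders, and the Dataflow runner dependency must be on the classpath):
DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setProject("my-gcp-project");           // placeholder project ID
options.setTempLocation("gs://my-bucket/tmp");  // placeholder staging location
options.setRunner(DataflowRunner.class);

Pipeline p = Pipeline.create(options);
// ...apply TextIO.read(), the ParDo, and DatastoreIO.v1().write() as above...
p.run();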
Simple in Python, but easily adapted to other languages. Use the split() method to loop through the lines and comma-separated values:
from google.appengine.api import urlfetch
from my.models import MyModel

csv_string = 'http://someplace.com/myFile.csv'  # URL of the CSV file
csv_response = urlfetch.fetch(csv_string, allow_truncated=True)
if csv_response.status_code == 200:
    for row in csv_response.content.split('\n'):
        if not row:
            continue  # skip blank lines (e.g. a trailing newline)
        row_values = row.split(',')
        # CSV values are strings. Cast them if they need to be something else.
        new_entry = MyModel(
            property1=row_values[0],
            property2=row_values[1]
        )
        new_entry.put()
else:
    print 'cannot load file: {}'.format(csv_string)

GAEfan