I have a CSV file with 2 columns and 20,000 rows that I would like to import into Google Cloud Datastore. I'm new to Google Cloud and NoSQL databases. I have tried using Dataflow, but it requires a JavaScript UDF function name. Does anyone have an example of this? I will be querying this data once it's in Datastore. Any advice or guidance on how to do this would be appreciated.
2 Answers
Using Apache Beam, you can read a CSV file using the TextIO class. See the TextIO documentation.
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("gs://path/to/file.csv"));
Next, apply a transform that parses each row of the CSV file and returns an Entity object. Depending on how you want to store each row, construct the appropriate Entity object. This page has an example of how to create an Entity object.
.apply(ParDo.of(new DoFn<String, Entity>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String row = c.element();
    // TODO: parse row (split on commas) and construct an Entity object
    Entity entity = ...
    c.output(entity);
  }
}));
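For a two-column CSV, the body of processElement might look something like the sketch below. The kind name "MyKind" and the property name "value" are placeholders, and DatastoreHelper (from com.google.datastore.v1.client) is one convenient way to build keys and values:
String[] columns = row.split(",");
Entity.Builder entityBuilder = Entity.newBuilder();
// Use the first column as the key name under a placeholder kind.
entityBuilder.setKey(DatastoreHelper.makeKey("MyKind", columns[0]).build());
// Store the second column as a string property (placeholder property name).
entityBuilder.putProperties("value", DatastoreHelper.makeValue(columns[1]).build());
c.output(entityBuilder.build());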
Lastly, write the Entity objects to Cloud Datastore. See the DatastoreIO documentation.
.apply(DatastoreIO.v1().write().withProjectId(projectId));
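Chained together, the whole thing is one short pipeline. In this sketch, CsvToEntityFn stands for the anonymous DoFn above and projectId is assumed to hold your GCP project ID:
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("gs://path/to/file.csv"))
 .apply(ParDo.of(new CsvToEntityFn()))   // the DoFn shown above
 .apply(DatastoreIO.v1().write().withProjectId(projectId));
p.run().waitUntilFinish();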

Andrew Nguonly
Thank you Andrew. I have a couple of noob questions. I have taken a look at the TextIO documentation and have a question. Where will I be running TextIO? In Apache Beam or in Dataflow? Also, where will I be applying the transform and writing the entities to Cloud Datastore? I see that I can run jobs in Dataflow. Is this what you are referring to? – IamSule Jan 27 '18 at 20:19
Apache Beam is a programming model for defining pipelines. The pipelines can be run on an execution engine such as Cloud Dataflow. You don't actually "run" `TextIO`. You define a pipeline using the Apache Beam SDKs. For example, using the Java SDK, `TextIO.Read` is the input transform and `DatastoreV1.Write` is the output transform. You can apply any transform in between to implement ETL logic. Once the pipeline is defined/implemented, it can then be deployed/run. – Andrew Nguonly Jan 27 '18 at 21:46
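To make that concrete, deploying the pipeline above to Cloud Dataflow might look roughly like the following sketch (the project ID and GCS bucket are placeholders, and the Dataflow runner dependency must be on the classpath):
DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setProject("my-gcp-project");           // placeholder project ID
options.setTempLocation("gs://my-bucket/tmp");  // placeholder staging location
options.setRunner(DataflowRunner.class);

Pipeline p = Pipeline.create(options);
// ...apply TextIO.read(), the ParDo, and DatastoreIO.v1().write() as above...
p.run();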
Simple in Python, but easily adapted to other languages. Use the split() method to loop through the lines and comma-separated values:
from google.appengine.api import urlfetch
from my.models import MyModel

csv_string = 'http://someplace.com/myFile.csv'  # URL of the CSV file
csv_response = urlfetch.fetch(csv_string, allow_truncated=True)
if csv_response.status_code == 200:
    for row in csv_response.content.split('\n'):
        if not row:
            continue  # skip blank lines (e.g. a trailing newline)
        row_values = row.split(',')
        # CSV values are strings. Cast them if they need to be something else.
        new_entry = MyModel(
            property1=row_values[0],
            property2=row_values[1]
        )
        new_entry.put()
else:
    print 'cannot load file: {}'.format(csv_string)

GAEfan