
I am curious how a TSV file should look when ingesting data from a local TSV file into Druid.

Should it just look like this?

Please note this is just for testing:

quickstart/sample_data.tsv file:

name      lastname    email               time
Bob       Jones       bobj@gmail.com      1468839687
Billy     Jones       BillyJ@gmail.com    1468839769

Where the first line is the header with my dimensions (name, lastname, email) plus the time column,
and the lines below it are my actual data rows.

{
    "type" : "index_hadoop",
    "spec" : {
        "ioConfig" : {
            "type" : "hadoop",
            "inputSpec" : {
                "type" : "static",
                "paths" : "quickstart/sample_data.tsv"
            }
        },
        "dataSchema" : {
            "dataSource" : "local",
            "granularitySpec" : {
                "type" : "uniform",
                "segmentGranularity" : "hour",
                "queryGranularity" : "none",
                "intervals" : ["2016-07-18/2016-07-18"]
            },
            "parser" : {
                "type" : "string",
                "parseSpec" : {
                    "format" : "tsv",
                    "dimensionsSpec" : {
                        "dimensions" : [
                            "name",
                            "lastname",
                            "email"
                        ]
                    },
                    "timestampSpec" : {
                        "format" : "auto",
                        "column" : "time"
                    }
                }
            },
            "metricsSpec" : [
                {
                    "name" : "count",
                    "type" : "count"
                },
                {
                    "name" : "added",
                    "type" : "longSum",
                    "fieldName" : "deleted"
                }
            ]
        }
    }
}
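From what I can tell in the docs, a "tsv" parseSpec also expects the column order to be declared; my guess at what that would look like with the columns from my sample file is below (the "columns" and "delimiter" keys are my assumption from the documentation, not something I have tested).

    "parseSpec" : {
        "format" : "tsv",
        "columns" : ["name", "lastname", "email", "time"],
        "delimiter" : "\t",
        "dimensionsSpec" : {
            "dimensions" : ["name", "lastname", "email"]
        },
        "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
        }
    }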

I had some questions about my spec file as well, since I was not able to find the answers in the docs. I would appreciate it if someone could answer them for me :)!

1) I noticed that in the example spec the line "type" : "index_hadoop" is at the very top. What should I put for the type if I am ingesting a TSV file from my local computer in the quickstart directory? Also, where in the docs can I read about the different values for this "type" key? I couldn't find an explanation for it.

2) There is also a type field in the ioConfig: "type" : "hadoop". What should I put there if I am ingesting a TSV file from my local computer in the quickstart directory?

3) For the timestampSpec, the time in my TSV file is in GMT. Is there any way to use that as the format? I have read that it should be converted to UTC; is there a way to convert it to UTC while posting the data to the Overlord, or will I have to change all of the GMT times myself to something like "time":"2015-09-12T00:46:58.771Z"?
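For what it's worth, the time values in my sample file are epoch seconds (e.g. 1468839687), which are already timezone-independent, and GMT is effectively the same offset as UTC. My guess from the docs is that an explicit timestampSpec along the following lines would handle them; the "posix" format value is an assumption on my part, not something I have verified.

    "timestampSpec" : {
        "column" : "time",
        "format" : "posix"
    }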

1 Answer


Druid supports two ways of ingesting batch data:

  • Hadoop Index Task
  • Index Task

The spec you are referring to is for a Hadoop Index Task, hence its "type" is "index_hadoop" and the ioConfig type is "hadoop".

Here is a sample spec for an index task which can read from a local file:

{
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "wikipedia",
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {
                        "column": "timestamp",
                        "format": "auto"
                    },
                    "dimensionsSpec": {
                        "dimensions": ["page", "language"]
                    }
                }
            },
            "metricsSpec": [{
                "type": "count",
                "name": "count"
            }, {
                "type": "doubleSum",
                "name": "added",
                "fieldName": "added"
            }],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "NONE",
                "intervals": ["2013-08-31/2013-09-01"]
            }
        },
        "ioConfig": {
            "type": "index",
            "firehose": {
                "type": "local",
                "baseDir": "examples/indexing/",
                "filter": "wikipedia_data.json"
            }
        }
    }
}
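To adapt the ioConfig of that index task to the TSV file from the question, the local firehose would point at the quickstart directory; a minimal sketch (paths taken from the question, otherwise untested):

    "ioConfig": {
        "type": "index",
        "firehose": {
            "type": "local",
            "baseDir": "quickstart/",
            "filter": "sample_data.tsv"
        }
    }

The interval in the granularitySpec would also need to cover the sample timestamps; 1468839687 falls on 2016-07-18 (UTC), so an interval such as "2016-07-18/2016-07-19" would contain both rows.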

rohit kochar