We are doing a POC on Druid to check whether it fits our use cases. We are able to ingest data, but we are not sure about the following:

  1. How does Druid support schemaless input? Say the input dimensions are at the end user's discretion, so there is no defined schema. The onus is then on the application to identify each new dimension, determine its data type, and ingest it. Is there any way to achieve this?
  2. How does Druid support data type changes? Say that at some point (for example, after ingesting 100 GB of data) we need to change the data type of a dimension from string to long, or from long to string (or some other type). What is the recommended way to do this without hampering ongoing ingestion?

I looked over the docs but could not get a substantial overview of either use case.

KRS

2 Answers

For question 1, I'd ingest everything as strings and figure out the types later. It should be possible to query string columns in Druid as numbers.

The possible behaviours are explained in https://github.com/apache/incubator-druid/issues/4888:

  1. Treat the values as zeros and do not try to parse string values. This seems to be the current behaviour.

  2. Try to parse string values, and treat a value as zero if it is not parseable, null, or multi-valued.

One current inconsistency is that with expression-based column selectors (anything that goes through Parser/Expr) the behavior is (2). See IdentifierExpr + how it handles strings that are treated as numbers. But with direct column selectors the behavior is (1). In particular this means that e.g. a longSum aggregator behaves differently if it's "fieldName" : "x" vs. "expression" : "x" even though you might think they should behave the same.
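To make the difference concrete, here is a minimal Python sketch that simulates the two behaviours on a handful of values and then shows the two aggregator forms mentioned above. The helper functions and the sample rows are hypothetical and only mimic the description in the issue; they are not Druid code, and newer Druid versions may behave differently.

```python
import json


def direct_selector_as_long(value):
    """Behaviour (1): string values are never parsed; anything non-numeric counts as 0."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return int(value)
    return 0


def expression_as_long(value):
    """Behaviour (2): parse strings; null, multi-value, or unparseable values count as 0."""
    if value is None or isinstance(value, (list, tuple)):
        return 0
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


# Hypothetical values of a string column "x": two numeric strings,
# one unparseable string, one null, and one multi-value row.
rows = ["10", "7", "abc", None, ["1", "2"]]

sum_direct = sum(direct_selector_as_long(v) for v in rows)
sum_expression = sum(expression_as_long(v) for v in rows)

# The two aggregator forms from the issue: same column, different code path.
agg_by_field = {"type": "longSum", "name": "sum_x", "fieldName": "x"}
agg_by_expression = {"type": "longSum", "name": "sum_x", "expression": "x"}

print(json.dumps(agg_by_field), "->", sum_direct)
print(json.dumps(agg_by_expression), "->", sum_expression)
```

So if numeric values are stored as strings, the expression form is the one that would actually add them up under the behaviour described in the issue.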

You can follow the entire discussion here: https://github.com/apache/incubator-druid/issues/4888

For question 2, I think you need to reindex the data. See http://druid.io/docs/latest/ingestion/update-existing-data.html and http://druid.io/docs/latest/ingestion/schema-changes.html

I hope this helps

aovelhanegra

1) In such cases, you don't need to specify any dimension columns in the Druid ingestion spec; Druid will treat every column that is not the timestamp as a dimension.

More detail about this approach can be found here: Druid Schema-less Ingestion
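As a rough sketch of what such a spec fragment might look like, the Python below builds a `dataSchema` with an empty `dimensions` list and prints it as JSON. The datasource name `events`, the timestamp column `ts`, and the excluded column `internal_id` are made up for illustration, and the granularity settings are just placeholder choices. As far as I know, dimensions discovered this way are ingested as strings.

```python
import json


def schemaless_data_schema(datasource, timestamp_column, excluded=()):
    """Build a dataSchema fragment that lists no dimensions explicitly."""
    return {
        "dataSource": datasource,
        "timestampSpec": {"column": timestamp_column, "format": "auto"},
        "dimensionsSpec": {
            # An empty list leaves dimension discovery to Druid.
            "dimensions": [],
            # Columns named here are never picked up as dimensions.
            "dimensionExclusions": list(excluded),
        },
        "granularitySpec": {
            "type": "uniform",
            "segmentGranularity": "day",
            "queryGranularity": "none",
            "rollup": False,
        },
    }


# Hypothetical names, for illustration only.
spec = schemaless_data_schema("events", "ts", excluded=["internal_id"])
print(json.dumps(spec, indent=2))
```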

2) For the second question, you can change the schema and Druid will create new segments with the new data type, while your old segments will keep the old data type.
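For illustration, a type change of this kind comes down to how the dimension is declared in `dimensionsSpec`: a bare name is a string dimension, while an object with `type` and `name` declares a typed one. The helper and the column names below (`country`, `user_id`) are hypothetical; this is only a sketch of editing the list before resubmitting the ingestion spec.

```python
def with_dimension_type(dimensions, name, new_type):
    """Return a copy of a dimensions list with `name` declared as `new_type`."""
    updated = []
    found = False
    for dim in dimensions:
        # A dimension is either a bare name (string type) or a {"type", "name"} object.
        dim_name = dim if isinstance(dim, str) else dim["name"]
        if dim_name == name:
            updated.append({"type": new_type, "name": name})
            found = True
        else:
            updated.append(dim)
    if not found:
        # The dimension was not declared before; declare it with the new type.
        updated.append({"type": new_type, "name": name})
    return updated


old_dimensions = ["country", "user_id"]  # both ingested as strings
new_dimensions = with_dimension_type(old_dimensions, "user_id", "long")
print(new_dimensions)
```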

If you want all your segments to use the new data type, you can reindex them. See this link for more on reindexing segments: http://druid.io/docs/latest/ingestion/update-existing-data.html
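A reindexing task could look roughly like the sketch below, which builds the task JSON in Python. It assumes a recent Druid version where a native batch task can read existing segments through the `druid` input source (older versions used a different mechanism, so check the docs for your release). The datasource, interval, and dimension names are made up, and a datasource with metrics would also need a `metricsSpec`.

```python
import json


def reindex_task(datasource, interval, dimensions):
    """Build a batch task that re-reads `interval` of `datasource` with a new dimension list."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Existing segments expose their timestamp as __time in milliseconds.
                "timestampSpec": {"column": "__time", "format": "millis"},
                "dimensionsSpec": {"dimensions": dimensions},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                # Read back the datasource's own segments for the interval.
                "inputSource": {
                    "type": "druid",
                    "dataSource": datasource,
                    "interval": interval,
                },
                # Overwrite the interval rather than appending to it.
                "appendToExisting": False,
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


# Hypothetical datasource, interval, and dimensions.
task = reindex_task(
    "events",
    "2020-01-01/2020-02-01",
    ["country", {"type": "long", "name": "user_id"}],
)
print(json.dumps(task, indent=2))
```

The resulting JSON would be submitted to the Overlord like any other ingestion task. Reindexing one interval at a time keeps each task small and leaves ongoing ingestion into newer intervals alone.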

Additional info on schema changes can be found here: http://druid.io/docs/latest/ingestion/schema-changes.html

Jainik