Is there any way to create a dynamic container of arrow::ArrayBuilder objects? Here is an example:

int main(int argc, char** argv) {
  std::size_t rowCount = 5;
  arrow::MemoryPool* pool = arrow::default_memory_pool();  
  std::vector<arrow::Int64Builder> builders;
  for (std::size_t i = 0; i < 2; i++) {
    arrow::Int64Builder tmp(pool);
    tmp.Reserve(rowCount);
    builders.push_back(tmp);
  }

  return 0;
}

This yields error: variable ‘arrow::Int64Builder tmp’ has initializer but incomplete type

Ideally, I am trying to build a collection that holds various builders and construct a table from the row-wise data I am receiving. My guess is that this isn't the intended use for builders, but I couldn't find anything definitive in the Arrow documentation.

Will Ayd

1 Answer

What do your includes look like? That error message suggests you are not including the right files. The full definition of arrow::Int64Builder is in arrow/array/builder_primitive.h, but you can usually just include arrow/api.h to get everything.

The following compiles for me:

#include <iostream>
#include <utility>
#include <vector>

#include <arrow/api.h>

arrow::Status Main() {
  std::size_t rowCount = 5;
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::vector<arrow::Int64Builder> builders;
  for (std::size_t i = 0; i < 2; i++) {
    arrow::Int64Builder tmp(pool);
    // Reserve returns a Status, so propagate any failure to the caller.
    ARROW_RETURN_NOT_OK(tmp.Reserve(rowCount));
    // Builders cannot be copied, so move the builder into the vector.
    builders.push_back(std::move(tmp));
  }
  return arrow::Status::OK();
}

int main() {
  auto status = Main();
  if (!status.ok()) {
    std::cerr << "Err: " << status << std::endl;
    return 1;
  }
  return 0;
}

One small change from your example: builders don't have a copy constructor, so they can't be copied. I had to std::move each one into the vector.
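The copy-versus-move point can be seen in isolation with a minimal sketch that does not need Arrow. MoveOnlyBuilder and MakeBuilders are hypothetical stand-ins invented for this illustration, not Arrow classes; they only assume that the builder type is movable but not copyable.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for a builder: movable but not copyable.
class MoveOnlyBuilder {
 public:
  MoveOnlyBuilder() = default;
  MoveOnlyBuilder(const MoveOnlyBuilder&) = delete;
  MoveOnlyBuilder& operator=(const MoveOnlyBuilder&) = delete;
  MoveOnlyBuilder(MoveOnlyBuilder&&) = default;
  MoveOnlyBuilder& operator=(MoveOnlyBuilder&&) = default;

  void Append(int value) { values_.push_back(value); }
  std::size_t length() const { return values_.size(); }

 private:
  std::vector<int> values_;
};

// push_back(tmp) would need a copy constructor, which does not exist.
static_assert(!std::is_copy_constructible<MoveOnlyBuilder>::value,
              "the stand-in must not be copyable");
// push_back(std::move(tmp)) only needs a move constructor.
static_assert(std::is_move_constructible<MoveOnlyBuilder>::value,
              "the stand-in must be movable");

// Build `count` builders, moving each one into the vector.
std::vector<MoveOnlyBuilder> MakeBuilders(std::size_t count) {
  std::vector<MoveOnlyBuilder> builders;
  for (std::size_t i = 0; i < count; ++i) {
    MoveOnlyBuilder tmp;
    tmp.Append(static_cast<int>(i));
    builders.push_back(std::move(tmp));  // tmp is moved from, not copied
  }
  assert(builders.size() == count);
  return builders;
}
```

Calling MakeBuilders(3) yields a vector of three builders, each holding one value. Replacing std::move(tmp) with plain tmp makes the sketch fail to compile, because push_back would then need the deleted copy constructor.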

Also, if you want a single collection with many different types of builders then you probably want std::vector<std::unique_ptr<arrow::ArrayBuilder>> and you'll need to construct your builders on the heap.

One challenge you may run into is that the builders all have different signatures for the Append method (e.g. Int64Builder has Append(int64_t) but StringBuilder has Append(arrow::util::string_view)). As a result, arrow::ArrayBuilder doesn't really have any Append methods of its own (there are a few that take scalars, if you happen to already have your data as an Arrow C++ scalar). However, you can probably overcome this by casting to the appropriate builder type when you need to append.
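To make the pointer-to-base container and the cast-on-append idea concrete, here is a minimal sketch that compiles without Arrow. BuilderBase, IntBuilder, TextBuilder, MakeColumns and AppendRow are hypothetical stand-ins that only mimic the shape of arrow::ArrayBuilder and its subclasses; with Arrow you would hold std::unique_ptr<arrow::ArrayBuilder> and cast to arrow::Int64Builder or arrow::StringBuilder instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical abstract base, standing in for arrow::ArrayBuilder.
// It has no Append method because each subclass appends a different type.
class BuilderBase {
 public:
  virtual ~BuilderBase() = default;
  virtual std::size_t length() const = 0;  // pure virtual: base is abstract
};

// Hypothetical integer column builder.
class IntBuilder : public BuilderBase {
 public:
  void Append(std::int64_t value) { values_.push_back(value); }
  std::size_t length() const override { return values_.size(); }

 private:
  std::vector<std::int64_t> values_;
};

// Hypothetical text column builder.
class TextBuilder : public BuilderBase {
 public:
  void Append(std::string value) { values_.push_back(std::move(value)); }
  std::size_t length() const override { return values_.size(); }

 private:
  std::vector<std::string> values_;
};

// One heap-allocated builder per column, owned through pointers to the base.
std::vector<std::unique_ptr<BuilderBase>> MakeColumns() {
  std::vector<std::unique_ptr<BuilderBase>> columns;
  columns.push_back(std::make_unique<IntBuilder>());
  columns.push_back(std::make_unique<TextBuilder>());
  return columns;
}

// Append one row. The caller knows column 0 is integer and column 1 is text,
// so each base reference is downcast to its concrete type before appending.
void AppendRow(std::vector<std::unique_ptr<BuilderBase>>& columns,
               std::int64_t id, const std::string& name) {
  assert(columns.size() == 2);
  static_cast<IntBuilder&>(*columns[0]).Append(id);
  static_cast<TextBuilder&>(*columns[1]).Append(name);
}
```

The static_cast is only safe because the column layout is fixed in MakeColumns. If the layout comes from a schema at run time, the cast has to be chosen from the field's type rather than hard-coded.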

Update:

If you really want to avoid casting and you know the schema ahead of time you could maybe do something along the lines of...

std::vector<std::function<arrow::Status(const Row&)>> append_funcs;
std::vector<std::shared_ptr<arrow::ArrayBuilder>> builders;
for (std::size_t i = 0; i < schema.fields().size(); i++) {
  const auto& field = schema.fields()[i];
  if (isInt32(field)) {
    auto int_builder = std::make_shared<arrow::Int32Builder>();
    // Capture the builder and the column index by value so the lambda
    // stays valid after this loop iteration ends.
    append_funcs.push_back([int_builder, i](const Row& row) {
      int val = row.GetCell<int>(i);
      return int_builder->Append(val);
    });
    builders.push_back(std::move(int_builder));
  } else {
    // Other types go here
  }
}

// Later
for (const auto& row : rows) {
  for (const auto& append_func : append_funcs) {
    ARROW_RETURN_NOT_OK(append_func(row));
  }
}

Note: I made up Row because I have no idea what format your data is in originally. Also I made up isInt32 because I don't recall how to check that off the top of my head.

This uses shared_ptr instead of unique_ptr because each builder needs two owners: one in the capture of the lambda and the other in the builders vector.

Pace
  • To your second point about the casting. I am creating the schema up front which would already tell me the types each column needs. Are you aware of any facility within Arrow I can use to associate the Field type to its appropriate builder? Or is that something I can only do through casting? – Will Ayd Nov 22 '21 at 01:54
  • Note that I also cannot use ArrayBuilder in the above example as it would yield `error: variable type 'arrow::ArrayBuilder' is an abstract class` . Might be bordering on a separate SO question but figured I'd comment here first – Will Ayd Nov 22 '21 at 02:05
  • 1
    `std::unique_ptr` not `arrow::ArrayBuilder` In c++ you can have a vector of pointers-to-abstract-base but you cannot have a vector of abstract-base. `shared_ptr` (or any other pointer type) is fine too. I'll update the answer with some pseudo-code of an approach you can take that avoids the casting using the schema ahead of time. – Pace Nov 22 '21 at 08:06
  • That is totally awesome. Thanks as always! – Will Ayd Nov 23 '21 at 03:12