
I'm new to Iceberg, and I have a question about querying a big table.

We have a Hive table with 3.6 million records in total and 120 fields per record, and we want to transfer all the records in this table to other systems, such as PostgreSQL, Kafka, etc.

Currently we do it like this:

Dataset<Row> dataset = connection.client.read().format("iceberg").load("default.table");
// it gets stuck here for a very long time
dataset.foreachPartition((ForeachPartitionFunction<Row>) par -> {
    par.forEachRemaining(row -> {
        // process each row here (write it to the target database)
    });
});

but it can get stuck for a long time in the foreach process.

I also tried the following method. The process does not stay stuck for long, but the traversal is very slow, at about 50 records/second.

HiveCatalog hiveCatalog = createHiveCatalog(props);
Table table = hiveCatalog.loadTable(TableIdentifier.of("default", "table"));
CloseableIterable<Record> records = IcebergGenerics.read(table).build();
records.forEach(record -> {
    // process each record here (write it to the target database)
});

Neither of these two ways meets our needs. Does my code need to be modified, or is there a better way to traverse all the records? Thanks!

xujin
  • This process is running in Spark local mode. I think it takes a long time to generate the Spark tasks, and it eventually generates over 10,000 tasks. – xujin Jan 07 '22 at 07:47
  • Are you writing the data row by row? In most target databases that is much slower than writing in batches (a batched-write sketch follows below). – shay__ Jan 15 '22 at 09:19
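
Picking up on that last comment, here is a minimal sketch (not part of the original thread) of writing the Dataset to PostgreSQL in batches through Spark's built-in JDBC sink, rather than inserting row by row inside foreachPartition; the connection URL, table name, credentials, and batch size are hypothetical placeholders:

    // Hypothetical sketch: let Spark's JDBC data source batch the inserts
    // instead of handling rows one by one inside foreachPartition.
    dataset.write()
        .format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/mydb")  // placeholder connection URL
        .option("dbtable", "public.target_table")           // placeholder target table
        .option("user", "user")                              // placeholder credentials
        .option("password", "password")
        .option("batchsize", "10000")                        // rows per JDBC batch
        .mode("append")
        .save();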

1 Answer


In addition to reading row by row, here is another idea.

If your target database can import files directly, try retrieving files from Iceberg and importing them directly to the database.

Example code is as follows:

   Iterable<DataFile> files = FindFiles.in(table)
        .inPartition(table.spec(), StaticDataTask.Row.of(1))
        .inPartition(table.spec(), StaticDataTask.Row.of(2))
        .collect();

You can get the file path and format from each DataFile.
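
For illustration only (a sketch, not from the original answer), the path, format, and record count can be read off each returned DataFile roughly like this; how the files are then imported depends entirely on the target database:

    for (DataFile file : files) {
        String path = file.path().toString();  // location of the underlying data file
        FileFormat format = file.format();     // PARQUET, ORC, or AVRO
        long rowCount = file.recordCount();
        // Hand the file off to the target database's bulk-import tooling here.
        System.out.println(format + "\t" + rowCount + "\t" + path);
    }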

liliwei