2

I'm looking to extract some data from an Oracle database and transferring it to a remote HDFS file system. There appears to be a couple of possible ways of achieving this:

  1. Use Sqoop. This tool will extract the data, copy it across the network and store it directly into HDFS
  2. Use SQL to read the data and store in on the local file system. When this has been completed copy (ftp?) the data to the Hadoop system.

My question will the first method (which is less work for me) cause Oracle to lock tables for longer than required?

My worry is that that Sqoop might take out a lock on the database when it starts to query the data and this lock isn't going to be released until all of the data has been copied across to HDFS. Since I'll be extracting large amounts of data and copying it to a remote location (so there will be significant network latency) the lock will remain longer than would otherwise be required.

Stormcloud
  • 2,065
  • 2
  • 21
  • 41

1 Answers1

1
  • Sqoop issues usual select queries on the Oracle batabase, so it does the same locks as the select query would. No extra additional locking is performed by Sqoop.

  • Data will be transferred in several concurrent tasks(mappers). Any expensive function call will put a significant performance burden on your database server. Advanced functions could lock certain tables, preventing Sqoop from transferring data in parallel. This will adversely affect transfer performance.

  • For efficient advanced filtering, run the filtering query on your database prior to import, save its output to a temporary table and run Sqoop to import the temporary table into Hadoop without the —where parameter.

  • Sqoop import has nothing to do with copy of data accross the network. Sqoop stores at one location and based on the Replication Factor of the cluster HDFS replicates the data

hadooper
  • 726
  • 1
  • 6
  • 18
  • Thanks for the information, it's really helpful. I get that once the query has been completed the lock(s) will be released, but do you know if this is before or after the the data is transferred to HDFS? This is important as the transfer time could be significant and I don't want the locks held longer than they need to be. – Stormcloud Feb 17 '16 at 09:43