
I am facing an issue. In my case this behavior is random. After loading a Hive external table and running MSCK REPAIR successfully (Job 1), we have a subsequent Spark job that pulls the data from these tables and loads it into another table (Job 2). At random, Job 2 retrieves 0 records from the table loaded in Job 1. Some facts:

  • We pull the data using select *
  • We use Spark SQL for doing this
  • We run Hive on Tez engine
  • We are running on AWS EMR
  • The behavior is purely random and we have not been able to identify a pattern in any way
  • The same query on the same table gives the right results after some time, and then again returns no records at will.

Any help in this area would be much appreciated. We have been chasing this with no resolution.
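For context, the flow is roughly this (the table, partition, and database names below are placeholders, not our real ones):

```sql
-- Job 1: after writing the data files to S3, sync the metastore
MSCK REPAIR TABLE mydb.events;

-- Job 2: Spark SQL read of the partition that Job 1 just loaded
SELECT * FROM mydb.events WHERE dt = '2021-12-17';
```

Job 2 sometimes returns the expected rows and sometimes 0 rows for the same partition.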

Roy
  • You say you use select *, but what about your WHERE clause? – Matt Andruff Dec 17 '21 at 14:35
  • Is your table in S3 or HDFS? – Matt Andruff Dec 17 '21 at 14:38
  • Can you add a wait after job 1 before running job 2? – Matt Andruff Dec 17 '21 at 14:42
  • @MattAndruff : My answers 1) There is a where clause which selects all the data in that partition. The select statement is verified to work; in fact it is doing just fine in the case of some jobs and is fetching zero records in the case of others (which is why I am writing this post). 2) Our data sits on S3. 3) If we eventually figure out it is a time-dependent consistency issue, we may need to, but it is better not done now. We are still trying to understand why this is happening. – Roy Dec 21 '21 at 20:06
  • Did you have any luck finding the issue? – Matt Andruff Jan 21 '22 at 15:33

1 Answer


For a result of 0 records there are 3 possible causes:

  1. The Hive definition was in flux. The Hive logs know all. Find the problematic query in the Hive logs and look up and down to see what else was happening to that table during that time. This will also narrow the window for your look into the file system (HDFS/S3).

  2. There is no data. Use the Ranger/HDFS/S3 audit logs to see what was happening at the time you ran the query. S3 should now be strongly consistent, but it has historically been a consistent source of issues when working with Hadoop.

  3. A bug. This is far less likely. I keep an open mind that it's possible, but I wouldn't look at this until 1 and 2 have been ruled out.

When I say "no data", what I really mean is that the folder/files were not available at the time they were asked for. This doesn't mean the data is gone, just that it was not where it should have been according to the file system when it was requested. (S3 is great for long-term storage, but consistency has been a struggle, and data isn't always reported where it is supposed to be until later.)

FYI, MSCK REPAIR TABLE runs slowly on S3 (it uses lots of directory listings) and therefore takes a long time to complete, or sometimes times out. This could cause the inconsistency you're describing.

Use ALTER TABLE ... ADD PARTITION instead of MSCK REPAIR TABLE.
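A sketch of the workaround, assuming a table partitioned by a date column (the table, partition value, and bucket names are illustrative): instead of having MSCK REPAIR TABLE scan the whole S3 prefix, register only the partition that Job 1 just wrote:

```sql
-- Register just the newly written partition; IF NOT EXISTS makes it
-- safe to re-run if the partition was already registered
ALTER TABLE mydb.events
  ADD IF NOT EXISTS PARTITION (dt = '2021-12-17')
  LOCATION 's3://my-bucket/events/dt=2021-12-17/';
```

This touches only one S3 prefix per load rather than listing every partition directory, so it completes quickly and leaves much less of a window for Job 2 to see a half-repaired table.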

Matt Andruff
  • Thanks Matt. We have done some digging and narrowed the problem down to something along the lines of the first point you mentioned. But in this case the Hive definition is being repaired through MSCK REPAIR after the data is loaded to the S3 buckets, and even after the repair we are unable to see the partitions. We are trying to dig deeper into the logs. A question for you: Do you know where the logs for Spark connectivity to the Hive metastore can be found in EMR? Similarly, where could we get such logs for Ranger access? Thanks for your help in advance. – Roy Dec 21 '21 at 20:18
  • I could find them at one point. Now I'd have to look it up. According to Amazon the Ranger records are in CloudWatch.(https://aws.amazon.com/about-aws/whats-new/2021/05/amazon-emr-6-3-now-supports-apache-ranger-for-fine-grained-data-access-control/) – Matt Andruff Dec 22 '21 at 16:50
  • I did see an interesting comment that mentions using Glue for Hive metadata and alluded to the fact that it alleviated issues with different accounts writing data. Something to look into. (Does one account write the data and another read?) – Matt Andruff Dec 22 '21 at 20:37
  • I went looking at MSCK (on S3) and sure enough it's a known performance issue. That's likely your issue; see my updated answer for how to work around it. – Matt Andruff Dec 22 '21 at 20:57