57

Quoting the Spark DataFrames, Datasets and SQL manual:

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.

Being new to Spark, I'm a bit baffled by this for two reasons:

  1. Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?

  2. Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in in-memory databases, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.

I understand that indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.

I'm wondering, then, why the Spark SQL team considers indexes unimportant to the degree that they're off its roadmap. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?

hillel
  • `Spark` isn't a database. So what do you mean by `index`? You can write a process in Spark that indexes raw text files. You can take a raw text file and save it as a `Parquet` file with your data partitioned out as required. What more do you expect it to do for you? – David Griffin Apr 29 '16 at 13:40
  • @DavidGriffin The quoted section is about Hive optimizations. Spark SQL will execute over a Hive table. So the optimizations that indices provide to Hive are not supported by Spark SQL. That is what it means, not that Spark itself is a database. – OneCricketeer Apr 29 '16 at 14:00
  • @DavidGriffin Spark SQL provides a SQL abstraction. Disregarding implementation considerations and storage mechanisms, I think it's reasonable to hope for Spark SQL to provide functionality equivalent to the SQL CREATE INDEX statement without me needing to implement it myself. – hillel Apr 29 '16 at 14:13
  • Why is that reasonable? Can you create a Hive table in Spark and then go to Hive and create an index there on the table? If so, why does that need to be Spark's responsibility? I can't create a Kafka topic or add/remove partitions for Kafka topics in Spark. And it's not reasonable for me to expect to be able to. – David Griffin Apr 29 '16 at 14:22
  • @David Griffin, notice I didn't mention Hive in my reply and I don't really think it's relevant. All I'm saying is that indexing in general can save a lot of time and is very useful for speeding up SQL queries. From the passage I quoted it seems that the Spark SQL team doesn't consider indexing to be important enough to implement, and being a Spark/Big Data novice, I want to understand why. I'm looking for answers like "the implementation is hard because of the Spark architecture", "indexing is not useful enough for typical Spark SQL applications" or "indexing is covered by other, more suitable tools". – hillel Apr 29 '16 at 14:36
  • Spark can be used with a variety of data storage formats, some of which support indexing and others that do not. Spark + Postgres allows you to use Postgres indices. Spark + Parquet files doesn't let you use indices, because Parquet lakes don't support indexing. – Powers Jan 17 '21 at 15:32

2 Answers

53

Indexing input data

  • The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using, it cannot reliably monitor changes and, as a consequence, cannot maintain indices.
  • If a data source supports indexing, it can be utilized indirectly by Spark through mechanisms like predicate pushdown (see the sketch below).
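
For instance, a minimal sketch of how this plays out with a JDBC source (the connection details, table, and column names below are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()
import spark.implicits._

// Hypothetical Postgres table `big_table` with an index on column `y`.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "big_table")
  .option("user", "reader")
  .option("password", "secret")
  .load()

// The equality filter is pushed down to the source; a database like Postgres
// can answer it with its own index instead of Spark scanning every row.
df.filter($"y" === "z").select("x").explain()
// The predicate shows up under PushedFilters in the physical plan.
```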

Indexing distributed data structures

  • Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
  • A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing, and maintaining indices. This is a common pattern used by different in-memory columnar systems (see the sketch below).
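
A minimal sketch of that pattern, with made-up paths and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-layout-sketch").getOrCreate()
import spark.implicits._

// Toy data standing in for a large dataset; `country` is the column we expect to filter on.
val events = Seq(("US", 1L), ("DE", 2L), ("US", 3L)).toDF("country", "value")

// Columnar, compressed storage laid out by the expected filter column.
events.write.mode("overwrite").partitionBy("country").parquet("/tmp/events")

// Only the country=US directories are listed and read; the others are pruned away.
spark.read.parquet("/tmp/events").filter($"country" === "US").count()
```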

That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides a Data Skipping Index on its platform.

Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random access support.

Of course this raises the question: if you require efficient random access, why not use a system which is designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark is an evolving project, and the quote you used might not fully reflect future Spark directions.

zero323
  • @zero323: Regarding "Since it doesn't own the data it is using it cannot reliably monitor changes". Doesn't this contradict the fact that Spark supports partitioning? – shridharama Sep 14 '16 at 19:36
  • @shridharama What do you mean by partitioning in this context? – zero323 Sep 15 '16 at 13:33
  • @zero323 I was referring to Spark supporting functions such as repartition() as well as the ability for Spark to [discover partitioned parquet data](http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery) – shridharama Sep 16 '16 at 08:24
  • @shridharama If that's the case, the answer is negative. `repartition` doesn't monitor anything. It creates an immutable and effectively ephemeral data structure which is limited to a given application. Partition discovery just assumes that the structure is valid. It doesn't monitor data and doesn't validate beyond standard Parquet capabilities. – zero323 Sep 16 '16 at 08:47
  • @zero323 Bucketing means that the data may not be limited to an application, does it not? – thebluephantom Jul 31 '18 at 22:03
20

In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".

So what is an index, anyway?

Back in the days when permanent storage was crazy expensive (instead of essentially free), relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read entire tables to find what you want, that would be very costly, because you'd have to scan so many tables.

But this is not the only way to do things. If you don't need fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.

40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?

Do the math and think about it -- it means if you are reading less than 300KB per disk head move, you are throttling the throughput of your drive.
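(Roughly: 100 MB/s ÷ 350 seeks/s ≈ 0.29 MB, so each head movement "costs" about 300 KB of sequential reading you could have done in the same time.)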

Seriously. Think about that for a second.

The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.

If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.

Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement). That means I can read 1.8 MB of linear / de-normalized customer records and be just as efficient.
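(Roughly: 6 seeks ÷ 350 seeks/s ≈ 17 ms, and 17 ms of linear reading at 100 MB/s ≈ 1.7 MB.)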

And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.

What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
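
In Spark terms, a minimal sketch of that idea might look like the following (the table and column names are made up, and bucketing is just one of several ways to co-locate a customer's rows):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("colocate-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for a large customer history table.
val history = Seq(
  (42, "2016-01-01", "order"),
  (42, "2016-02-01", "return"),
  (7,  "2016-01-15", "order")
).toDF("customer_id", "event_time", "event")

// Bucketing (and sorting) by customer_id keeps each customer's rows together
// in the same files, so reading one customer's history is largely sequential.
history.write
  .mode("overwrite")
  .bucketBy(8, "customer_id")
  .sortBy("event_time")
  .saveAsTable("customer_history")

spark.table("customer_history").filter($"customer_id" === 42).show()
```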

Tell me again why you want indexes.

Indexes vs ___ ?

Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?

Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.

If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records, stored in a way that optimizes those searches. Like I said, use the disk like your in-memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact, don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
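
A minimal Spark sketch of that idea, with made-up column names and paths: the same records are written twice, each copy laid out for one of the two searches.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("precooked-layouts-sketch").getOrCreate()
import spark.implicits._

// Toy customer records.
val customers = Seq(
  ("US", 101, "alice"),
  ("DE", 102, "bob"),
  ("US", 102, "carol")
).toDF("country", "sales_rep_id", "name")

// Two "pre-cooked" copies of the same records, each optimized for one search.
customers.write.mode("overwrite").partitionBy("country").parquet("/tmp/customers_by_geo")
customers.write.mode("overwrite").partitionBy("sales_rep_id").parquet("/tmp/customers_by_rep")

// Each query reads only the copy (and only the partitions) it needs.
spark.read.parquet("/tmp/customers_by_geo").filter($"country" === "US").show()
spark.read.parquet("/tmp/customers_by_rep").filter($"sales_rep_id" === 102).show()
```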

If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!

But I digress...

Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.

David Griffin
  • @DavidGriffin, I understand your point that if your query requires a very large number of head moves you might be better off reading sequential data. But to be precise we should consider the speed of sequential reading: if the size of the data is large enough that reading it takes more time than the head moves, we would still benefit from indexing, wouldn't we? e.g. simple SELECT x WHERE y=z queries will require very few head moves and might operate on very large data sizes (terabytes). Wouldn't sequential reading be orders of magnitude slower in these cases? – hillel Apr 29 '16 at 16:02
  • Does your result return one row or thousands? If thousands or millions, all spread more or less evenly across your data, an index hurts, not helps. If only one row, then maybe a different database makes more sense (I never work on a single record in Spark, do you?) – David Griffin Apr 29 '16 at 16:12
  • And if you think about it, you never in real life work off of a single row (with no related rows joined to it). In testing, yes. But not in production. Data is totally meaningless by itself, and your example looks compelling until you realize it's a useless use case. – David Griffin Apr 29 '16 at 16:17
  • Thanks @DavidGriffin, I think I understand your point. I'll probably accept zero323's answer as it addresses more aspects of this issue, but it's been a very educational discussion for me. – hillel Apr 29 '16 at 16:31
  • I don't blame you -- my answer was only marginally on topic. – David Griffin Apr 29 '16 at 16:56
  • It's Friday afternoon here and I'm bored. I added some thoughts -- everyone have a nice weekend. – David Griffin Apr 29 '16 at 17:56
  • I disagree that "disk space is essentially free." Maybe for small websites / apps, but not for data science applications. – Jordan P Sep 12 '17 at 15:42
  • $35 a terabyte is essentially free, and more importantly the price point is asymptotic -- it approaches zero over time. The more important part being, that the economics of hard drive space is drastically different than it was 40 years ago. – David Griffin Sep 15 '17 at 10:49
  • "40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second." It should be pointed out that these figures are *drastically* different for SSDs, which alters the full table scan vs index scan calculation significantly. – Joe Stevens Sep 01 '18 at 15:41
  • @JoeStevens How do the figures change for SSDs, and how does that alter the full table scan versus the index scan calculation? – jstnchng Dec 06 '18 at 00:59
  • @jstnchng How many head movements can an SSD make per second? ;-) The "seek", as opposed to "scan", time for an SSD is essentially zero in comparison to spinning disks. This means that the total amount of data read is decisive. – Joe Stevens Aug 02 '19 at 04:02