
I have the following code that recurses and makes calls over the network as it goes. Since it goes over the network, I would like to add some optimizations; the very first one is to avoid going over the network for elements that I have already tried.

For example, in the code below, I call a URL, extract the hrefs found in that URL's response, call those URLs in turn and report their status. Since certain URLs might be fetched more than once, I would like to add the ones that failed to some global state, so that the next time I encounter such a URL I can avoid the network call.
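
For reference, `HttpResult` is used but not defined in the snippets below; reconstructed from its usage, it looks roughly like this (the defaults are an assumption):

case class HttpResult(
    source: String,
    isSuccess: Boolean,
    statusCode: Int,
    errorMessage: Option[String] = None, // assumed default; only populated on failures
    elems: Seq[String] = Seq.empty       // assumed default; the hrefs extracted on success
)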

Here is the code:

def callURLWithCache(url: String): Task[HttpResult] = {
    Task {
      Http(url).timeout(connTimeoutMs = 1000, readTimeoutMs = 3000).asString
    }.attempt.map {
      case Left(err) =>
        println(s"ERR happened ----------------- $url ************************ ${err.getMessage}")
        // Record the failure in the cache so this URL can be skipped next time
        val httpResult = HttpResult(source = url, isSuccess = false, statusCode = 1000, errorMessage = Some(err.getMessage))
        val cached: Try[Any] = httpResultErrorCache.put(url)(httpResult)
        httpResult
      case Right(doc) =>
        if (doc.isError) {
          HttpResult(source = url, isSuccess = doc.isSuccess, statusCode = doc.code)
        } else {
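          // Parse the response body and collect the distinct absolute links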
          val hrefs = (browser.parseString(doc.body) >> elementList("a[href]") >?> attr("href"))
            .distinct.flatten.filter(_.startsWith("http"))
          HttpResult(source = url, isSuccess = doc.isSuccess, statusCode = doc.code, elems = hrefs)
        }
    }
  }

You can see in the `case Left(...)` block that I add the failed result to the cache, which I define globally on the class enclosing this function:

val underlyingCaffeineCache: cache.Cache[String, Entry[HttpResult]] =
  Caffeine.newBuilder().maximumSize(10000L).build[String, Entry[HttpResult]]
implicit val httpResultErrorCache: Cache[HttpResult] = CaffeineCache(underlyingCaffeineCache)
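
The `Try` results seen above and discussed in the comments suggest scalacache's synchronous Try mode is in scope; a minimal sketch of the relevant imports and semantics (assumed, not taken from the project):

import scala.util.Try
import scalacache._
import scalacache.caffeine._
import scalacache.modes.try_._ // makes put/get return Try[...]

// Under this mode, get returns Try[Option[HttpResult]]:
// Success(None) for a URL never cached, Success(Some(result)) for a cached failure.
val lookup: Try[Option[HttpResult]] = httpResultErrorCache.get("http://example.com")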

Here is the function where I do the recursive operation:

def parseSimpleWithFilter(filter: ParserFilter): Task[Seq[HttpResult]] = {
    def parseInner(depth: Int, acc: HttpResult): Task[Seq[HttpResult]] = {
      import cats.implicits._
      if (depth > 0) {
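        // Keep only URLs with no cached failure, then fetch them in parallel batches of 30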
        val batched = acc.elems.collect {
          case elem if httpResultErrorCache.get(elem).toOption.exists(_.isEmpty) =>
            callURLWithCache(elem).flatMap(newElems => parseInner(depth - 1, newElems))
        }.sliding(30).toSeq
          .map(chunk => Task.parSequence(chunk))
        Task.sequence(batched).map(_.flatten).map(_.flatten)
      } else Task.pure(Seq(acc))
    }
    callURLWithCache(filter.url).map(elem => parseInner(filter.recursionDepth, elem)).flatten
  }

You can see that I'm checking whether the URL in my current elem is already in the cache, meaning that I have already tried it and it failed, so I would like to avoid making an HTTP call for it once again.
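
To spell out what that guard is meant to do, here is a small walkthrough (the URLs are hypothetical):

// For a URL that previously failed and was cached:
//   httpResultErrorCache.get(url)  == Success(Some(httpResult))
//   .toOption.exists(_.isEmpty)    == false -> the collect guard skips it
// For a URL never seen before:
//   httpResultErrorCache.get(url)  == Success(None)
//   .toOption.exists(_.isEmpty)    == true  -> the URL gets fetched
val shouldFetch = httpResultErrorCache.get("http://fresh.example").toOption.exists(_.isEmpty)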

But what happens is that the httpResultErrorCache always turns up empty. I'm not sure if the Task chunking is causing this behavior. Any ideas on how to get the cache to work?

joesan
  • You seem to only put stuff in the cache with `isSuccess=false`, but only `get` ones that have `isSuccess=true` ... Also, it looks like you are running 30 tasks in parallel, so those will likely not have access to results from one another. – Dima Apr 07 '21 at 17:31
  • I guessed that the concurrency is the culprit behind it. Is there a workaround for this? I think you are confusing the isSuccess in the case class with the isSuccess from the Try – joesan Apr 07 '21 at 17:36
  • Oh ... yeah ... Don't use `Try` like this, it's confusing as hell. Just do `if cache.get(elem).toOption.exists(_.isEmpty)` – Dima Apr 07 '21 at 17:41
  • @Dima Would passing the state as a reference to the callURLWithCache and getting it back in the parseInner work? What do you think? – joesan Apr 07 '21 at 18:24
  • I still don't understand what you are trying to do with that code. You are saying that you don't want to try the failed things again. So you put them in cache. But then in your `parseInner` func you basically _only_ ever look at those you got from cache - which are supposed to be exactly the ones you wanted to _skip_. What gives? – Dima Apr 07 '21 at 18:50
  • The problem domain here is URL checking. All I want to do is avoid checking failed URLs once again. In parseInner, I just check whether I already have that particular url in the cache, which would mean it already failed in a previous run. – joesan Apr 07 '21 at 19:00
  • I start with a base URL, get all the URLs that are found in the base URL and I basically repeat the process recursively (controlled by the depth). – joesan Apr 07 '21 at 19:01
  • You check if you got it in the cache, but then you don't do anything else: ``` acc.elems.collect { case elem if httpResultErrorCache.get(elem).toOption.exists(_.isEmpty) => ... ``` - this only returns `elems` that exist in the cache. – Dima Apr 07 '21 at 19:14
  • Yes, so assume I had an element xyz.com that failed previously (in the callURLWithCache function), so I add it to the cache. Sometime later I encounter the same URL, and in my collect condition I want to call callURLWithCache only if that particular URL does not exist in the cache. Hope it is clear? – joesan Apr 07 '21 at 19:28
  • "I want to call the callURLWithCache if that particular URL does not exist in the cache" -- yes, but you are calling it when it _does_ exist. – Dima Apr 07 '21 at 19:42
  • cache.get(elem).toOption.exists(_.isEmpty) -> returns false for the URL if it is in the cache. Is that correct? If so, then the guard condition in the collect case is false and it does not execute the body, which is the call to callURLWithCache. – joesan Apr 07 '21 at 19:54
  • Ugh, so `get(elem)` returns `Try[Option[String]]`? I didn't realize that ... Gotta catalogue this under my "completely nonsensical semantics" category :) In that case, a better way to unwrap that monster would be `.get(elem).toOption.flatten.isEmpty` (at least this makes it obvious that there are two layers of wrappers you are getting through, but obviously this has nothing to do with your actual problem, so you can just ignore me). – Dima Apr 07 '21 at 21:04
  • Try[Option] seems strange to me. Either would have been appropriate! Yes, you are correct, this has nothing to do with the actual problem. I'm thinking of using an Akka Actor in the mix to get to what I want! – joesan Apr 08 '21 at 00:14
  • Another thing is: are you sure you want to use `sliding` rather than `grouped`? You seem to be processing the same element numerous times ... Maybe that's your problem? If you are seeing multiple errors from the same url, this could be why - two different chunks are doing it in parallel ... (see the sliding-vs-grouped sketch after these comments) – Dima Apr 08 '21 at 11:23
  • Good point! Let me try that with grouped! And yes, I'm seeing multiple errors for the same urls! – joesan Apr 08 '21 at 11:29
  • @Dima You were spot on! I was indeed processing the same element several times with sliding. As soon as I changed it to grouped, it ran in under 31 seconds checking around 400 HTTP endpoints. – joesan Apr 08 '21 at 11:34
  • Glad I could help! It probably also makes sense to move the cache check inside `callURLWithCache`: it's better design to have cache access localized like that, and more importantly, it could increase the hit rate, because the same element may appear in multiple chunks being processed in parallel. Speaking of which, you may also want to consider adding a `.distinct` before `.grouped` ... and I am not sure why you only cache errors - caching successes seems more beneficial, because it would save you a lot on recursion. – Dima Apr 08 '21 at 11:38
  • For the successful ones, I have a bloom filter in place (not shown here); for the failed ones I need a cache, as every url that fails with a timeout drags down the response time. – joesan Apr 08 '21 at 11:40
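
To illustrate the `sliding` vs `grouped` point from the comments, a quick REPL sketch:

val urls = (1 to 6).toList

urls.sliding(3).toList
// List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5), List(4, 5, 6))
// overlapping windows: most elements get processed several times

urls.grouped(3).toList
// List(List(1, 2, 3), List(4, 5, 6))
// disjoint chunks: each element is processed exactly once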
