I am trying to run Spark NLP as an Azure Function.

I have a function app that runs in a Docker container. The function app code is written in Python, and I also install Java because I run PySpark within it. I use Python's Flask within one function to handle incoming requests.

Once the function app starts and the container is running, I get responses to my API calls for the first few seconds, but after about 15-20 seconds the API calls start timing out because the server stops responding.

The function app is running on a dedicated App Service plan and is set to 'Always On'.

What is the reason for this behavior?

Here is my function app code:

import logging
import azure.functions as func

# Imports for Spark-NLP
import os
import sys

sys.path.append('/home/site/wwwroot/contextSpellCheck/spark-2.4.7-bin-hadoop2.7/python')
sys.path.append('/home/site/wwwroot/contextSpellCheck/spark-2.4.7-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip')

import sparknlp
from pyspark.ml import Pipeline  # imported explicitly because Pipeline is used below
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

from flask import Flask, request

app = Flask(__name__)

spark = sparknlp.start()
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = RecursiveTokenizer().setInputCols(["document"]).setOutputCol("token").setPrefixes(["\"", "(", "[", "\n"]).setSuffixes([".", ",", "?", ")", "!", "'s"])
spellModel = ContextSpellCheckerModel.load("/home/site/wwwroot/contextSpellCheck/spellcheck_dl_en_2.5.0_2.4_1588756259065").setInputCols("token").setOutputCol("checked")
finisher = Finisher().setInputCols("checked")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, spellModel, finisher])
empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

@app.route('/api/testFunction', methods=['GET', 'POST'])
def annotate():
    global lp
    if request.method == 'GET':
        text = request.args.get('text')
    elif request.method == 'POST':
        req_body = request.get_json()
        text = req_body['text']
    return lp.annotate(text)


def main(req: func.HttpRequest, context: func.Context) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    return func.WsgiMiddleware(app).handle(req, context)
Furqan Rahamath

1 Answer

It may be that you are creating a pipeline per request. Your stack spans several languages (Python, Java, Spark), so even if your own code does not do this explicitly, one of the libraries might be doing it internally.

See the section on "Avoid creating lots of pipelines" in https://stanfordnlp.github.io/CoreNLP/memory-time.html#avoid-creating-lots-of-pipelines

Shiraz Bhaiji
  • Thanks for the suggestion @shirazbhaiji. But it doesn't appear to be the case. I have my pipeline declarations outside the functions in my flask app code and it only executes once. – Furqan Rahamath Nov 10 '20 at 20:22