
I have a setup where I need to extract data from Elasticsearch and store it in Azure Blob Storage. To get the data I am using Elasticsearch's _search and _scroll APIs. The indices are pretty well designed and are named like game1.*, game2.*, game3.*, etc.

I've created a worker.py file, which I store in a folder called shared_code as Microsoft suggests, and I have several Timer Trigger Functions which import and call worker.py. Because of the way ES is set up on our side, I had to create a VNET and a static outbound IP address, which we then whitelisted on ES; the data can only be extracted from ES on port 9200. So I've created an Azure Function App with that connection set up, and I am trying to create multiple Functions (game1-worker, game2-worker, game3-worker) that pull the data from ES in parallel at minute 5.

I've noticed that if I add the FUNCTIONS_WORKER_PROCESS_COUNT = 1 setting, the functions wait until the first triggered one finishes its task before the second one triggers. If I don't add this app setting, or if I increase the number, then once a function stops because it has finished its work and another run starts, I get an OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted. Is there a way I can make these run in parallel without the mentioned error?

Here is the code for worker.py:

#!/usr/bin/env python
# coding: utf-8

# # Elasticsearch to Azure Microservice
import json, datetime, gzip, os, re, logging
import importlib.util
import tempfile

from elasticsearch import Elasticsearch
import azure.storage.blob as azsb
import azure.identity as azi

def batch(game_name, env='prod'):

    # #### Global Variables
    env = env.lower()
    connection_string = os.getenv('conn_storage')
    lowerFormat = game_name.lower().replace(" ","_")
    azFormat = re.sub(r'[^0-9a-zA-Z]+', '-', game_name).lower()
    storageContainerName = azFormat
    stateStorageContainerName = "azure-webjobs-state"
    minutesOffset = 5
    tempFilePath = tempfile.gettempdir()
    curFileName = f"{lowerFormat}_cursor.py"
    curTempFilePath = os.path.join(tempFilePath,curFileName)
    curBlobFilePath = f"cursors/{curFileName}"
    esUrl = os.getenv('esUrl')

    # #### Connections
    es = Elasticsearch(
        esUrl,
        port=9200,
        timeout=300)

    def uploadJsonGzipBlob(filePathAndName, jsonBody):
        blob = azsb.BlobClient.from_connection_string(
            conn_str=connection_string,
            container_name=storageContainerName,
            blob_name=filePathAndName
        )
        blob.upload_blob(gzip.compress(bytes(json.dumps(jsonBody), encoding='utf-8')))

    def getAndLoadCursor(filePathAndName):
        # Get cursor from blob
        blob = azsb.BlobClient.from_connection_string(
            conn_str=os.getenv('AzureWebJobsStorage'),
            container_name=stateStorageContainerName,
            blob_name=filePathAndName
        )
        # Stream it to Temp file
        with open(curTempFilePath, "wb") as f:
            data = blob.download_blob()
            data.readinto(f)
        
        # Load it by path
        spec = importlib.util.spec_from_file_location("cursor", curTempFilePath)
        cur = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(cur)
        return cur
    
    def writeCursor(filePathAndName, body):
        blob = azsb.BlobClient.from_connection_string(
            conn_str=os.getenv('AzureWebJobsStorage'),
            container_name=stateStorageContainerName,
            blob_name=filePathAndName
        )
        blob.upload_blob(body, overwrite=True)

    # Parameter and state settings

    if os.getenv(f"{lowerFormat}_maxSizeMB") is None:
        maxSizeMB = 10 # Default to 10 MB
    else:
        maxSizeMB = int(os.getenv(f"{lowerFormat}_maxSizeMB"))
    
    if os.getenv(f"{lowerFormat}_maxProcessTimeSeconds") is None:
        maxProcessTimeSeconds = 300 # Default to 300 seconds
    else:
        maxProcessTimeSeconds = int(os.getenv(f"{lowerFormat}_maxProcessTimeSeconds"))

    try:
        cur = getAndLoadCursor(curBlobFilePath)
    except Exception as e:
        dtStr = f"{datetime.datetime.utcnow():%Y/%m/%d %H:%M:00}"
        writeCursor(curBlobFilePath, f"# Please use format YYYY/MM/DD HH24:MI:SS\nlastPolled = '{dtStr}'")
        logging.info(f"No cursor file. Generated {curFileName} file with date {dtStr}")
        return 0
    
    # # Scrolling and Batching Engine

    lastRowDateOffset = cur.lastPolled
    nrFilesThisInstance = 0

    while 1:
        # Offset the current time by -5 minutes to account for the 2-3 min delay in Elasticsearch
        initTime = datetime.datetime.utcnow()

        ## Filter lt (less than) endDate to avoid infinite loops.
        ## Filter lt manually when compiling historical based on 
        endDate = initTime-datetime.timedelta(minutes=minutesOffset)
        endDate = f"{endDate:%Y/%m/%d %H:%M:%S}"

        doc = {
            "query": {
                "range": {
                    "baseCtx.date": {
                        "gt": lastRowDateOffset,
                        "lt": endDate
                    }
                }
            }
        }

        Index = lowerFormat + ".*"
        if env == 'dev': Index = 'dev.' + Index

        if nrFilesThisInstance == 0:
            page = es.search(
                index = Index,
                sort = "baseCtx.date:asc",
                scroll = "2m",
                size = 10000,
                body = doc
            )
        else:
            page = es.scroll(scroll_id = sid, scroll = "10m")

        pageSize = len(page["hits"]["hits"])
        data = page["hits"]["hits"]
        sid = page["_scroll_id"]
        totalSize = page["hits"]["total"]
        print(f"Total Size: {totalSize}")
        cnt = 0
        
        # totalSize might be flawed as it returns at times an integer > 0 but array is empty
        # To overcome this, I've added the below check for the array size instead
        if pageSize == 0: break

        while 1:
            cnt += 1
            page = es.scroll(scroll_id = sid, scroll = "10m")
            pageSize = len(page["hits"]["hits"])
            sid = page["_scroll_id"]
            data += page["hits"]["hits"]

            sizeMB = len(gzip.compress(bytes(json.dumps(data), encoding='utf-8'))) / (1024**2)
            loopTime = datetime.datetime.utcnow()
            processTimeSeconds = (loopTime-initTime).seconds

            print(f"{cnt} Results pulled: {pageSize} -- Cumulative Results: {len(data)} -- Gzip Size MB: {sizeMB} -- processTimeSeconds: {processTimeSeconds} -- pageSize: {pageSize} -- startDate: {lastRowDateOffset} -- endDate: {endDate}")

            if sizeMB > maxSizeMB: break
            if processTimeSeconds > maxProcessTimeSeconds: break
            if pageSize < 10000: break

        lastRowDateOffset = max([x['_source']['baseCtx']['date'] for x in data])
        lastRowDateOffsetDT = datetime.datetime.strptime(lastRowDateOffset, '%Y/%m/%d %H:%M:%S')
        outFile = f"elasticsearch/live/{lastRowDateOffsetDT:%Y/%m/%d/%H}/{lowerFormat}_live_{lastRowDateOffsetDT:%Y%m%d%H%M%S}.json.gz"
        
        uploadJsonGzipBlob(outFile, data)
        writeCursor(curBlobFilePath, f"# Please use format YYYY/MM/DD HH24:MI:SS\nlastPolled = '{lastRowDateOffset}'")
        nrFilesThisInstance += 1
            
        logging.info(f"File compiled: {outFile} -- {sizeMB} MB\n")

        # If the while loop ran for more than maxProcessTimeSeconds then end it
        if processTimeSeconds > maxProcessTimeSeconds: break
        if pageSize < 10000: break
    
    logging.info(f"Closing Connection to {esUrl}")
    es.close()
    return 0

And these are 2 of the timer triggers I am calling:

game1-worker

import logging
import datetime

import azure.functions as func
#from shared_code import worker
import importlib.util


def main(mytimer: func.TimerRequest) -> None:
    utc_timestamp = datetime.datetime.utcnow().replace(
        tzinfo=datetime.timezone.utc).isoformat()

    if mytimer.past_due:
        logging.info('The timer is past due!')

    # Load a new instance of worker.py
    spec = importlib.util.spec_from_file_location("worker", "shared_code/worker.py")
    worker = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(worker)
    
    worker.batch('game1name')

    logging.info('Python timer trigger function ran at %s', utc_timestamp)

game2-worker

import logging
import datetime

import azure.functions as func
#from shared_code import worker
import importlib.util


def main(mytimer: func.TimerRequest) -> None:
    utc_timestamp = datetime.datetime.utcnow().replace(
        tzinfo=datetime.timezone.utc).isoformat()

    if mytimer.past_due:
        logging.info('The timer is past due!')

    # Load a new instance of worker.py
    spec = importlib.util.spec_from_file_location("worker", "shared_code/worker.py")
    worker = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(worker)
    
    worker.batch('game2name')

    logging.info('Python timer trigger function ran at %s', utc_timestamp)
1 Answer


TL;DR

Based on what you described, multiple worker processes share the underlying runtime's resources (sockets).

For your use case you just need to leave FUNCTIONS_WORKER_PROCESS_COUNT at 1. The default value is supposed to be 1, so not specifying it should mean the same as setting it to 1.


You need to understand how Azure Functions scale; it is quite counter-intuitive.

The following assumes the Consumption Plan.

Coding: You write Functions, say F1 and F2. How you organize them is up to you.

Provisioning:

  • You create a Function App.
  • You deploy F1 and F2 to this App.
  • You start the App (not the individual functions).

Runtime:

  1. At start:
  • Azure spawns one Function Host. Think of this as a container/OS.
  • Inside the Host, one worker-process is created. This worker-process hosts one instance of your App.
  • If you change FUNCTIONS_WORKER_PROCESS_COUNT to, say, 10 then the Host will spawn 10 processes and run your App inside each of them.
  2. When a Function is triggered (by a timer, a REST call, a message in a queue, ...):
  • Each worker-process services one request at a time, whether it is a request for F1 or F2. One at a time. (The small PID sketch after this list shows one way to observe this.)
  • Each Host can service one request per worker-process in it.
  • If the backlog of requests grows, the Azure load balancer triggers a scale-out and creates new Function Hosts.
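
To observe this model, a purely illustrative sketch (the function name and timer binding are assumptions, not something required for the fix) is to log the worker process ID from one of your functions and compare it across invocations:

import logging
import os

import azure.functions as func


def main(mytimer: func.TimerRequest) -> None:
    # With FUNCTIONS_WORKER_PROCESS_COUNT=1 every invocation on a given Host
    # logs the same PID; with a higher count you will see several PIDs.
    logging.info("Invocation handled by worker process %s", os.getpid())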

Based on the limited info, creating 3 separate functions seems like poor design. You could instead create a single timer-triggered function that sends out 3 messages to a queue (a Storage Queue is more than plenty for such minuscule traffic), which in turn triggers your actual implementation (a queue-triggered Function). The message would be something like {"game_name": "game1"}. A sketch of this fan-out pattern is below.
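
A minimal sketch of that pattern, assuming hypothetical names (a dispatcher timer function, a games-queue Storage Queue, and a games-worker queue-triggered function) and the usual function.json bindings for the timer and queue triggers:

# dispatcher/__init__.py -- a single timer-triggered function that only enqueues work
import json
import logging
import os

import azure.functions as func
from azure.storage.queue import QueueClient, TextBase64EncodePolicy


def main(mytimer: func.TimerRequest) -> None:
    # The Functions queue trigger expects Base64-encoded messages by default,
    # hence TextBase64EncodePolicy. The queue must already exist
    # (or call queue.create_queue() once).
    queue = QueueClient.from_connection_string(
        os.environ["AzureWebJobsStorage"],
        "games-queue",
        message_encode_policy=TextBase64EncodePolicy(),
    )
    for game in ("game1name", "game2name", "game3name"):
        queue.send_message(json.dumps({"game_name": game}))
        logging.info("Enqueued batch request for %s", game)

# games-worker/__init__.py -- queue-triggered function bound to "games-queue"
# in its function.json; each message becomes its own invocation
import json

import azure.functions as func

from shared_code import worker  # your existing worker.py


def main(msg: func.QueueMessage) -> None:
    payload = json.loads(msg.get_body().decode("utf-8"))
    worker.batch(payload["game_name"])

With this layout the queue trigger handles the fan-out (and poison-message retries) for you, each game is processed in its own invocation, and FUNCTIONS_WORKER_PROCESS_COUNT can stay at its default of 1.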
