
Before I had time to get an ingestion strategy and process set up, I started collecting data that will eventually go through a Stream Analytics job. Now I'm sitting on an Azure blob storage container with over 500,000 blobs in it (no folder organization), another with 300,000, and a few others with between 10,000 and 90,000.

The production collection process now writes these blobs to different containers using the YYYY-MM-DD/HH path format, but that only helps going forward. It's critical to get this archived data into my system, and I'd like to just modify the inputs a bit on the existing production ASA job so I can leverage the same logic in the query, functions, and other dependencies.

I know ASA doesn't like batches of more than a few hundred or a few thousand blobs, so I'm trying to figure out a way to stage my data so it works well with ASA. This would be a one-time run...

One idea was to write a script that reads every blob, looks at the timestamp within the blob, and re-creates the YYYY-MM-DD/HH folder setup, but in my experience the ASA job will fail when a blob's LastModified time doesn't match the folder it's in...
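For what it's worth, that script idea would look roughly like the sketch below, using the Python `azure-storage-blob` SDK. The connection string, container names, and the assumption that each blob is newline-delimited JSON with an ISO-8601 `timestamp` field are all placeholders, not my actual setup:

```python
# Rough sketch only: re-stage root-level blobs into YYYY-MM-DD/HH folders
# based on a timestamp found inside each blob. Assumes azure-storage-blob v12,
# that each blob is newline-delimited JSON, and that each event has an
# ISO-8601 "timestamp" field -- all placeholders for my real data.
import json
from datetime import datetime

from azure.storage.blob import BlobServiceClient

CONN_STR = "<storage-account-connection-string>"  # placeholder
SOURCE_CONTAINER = "archive"                      # the flat 500,000-blob container
DEST_CONTAINER = "archive-staged"                 # gets YYYY-MM-DD/HH/ paths

service = BlobServiceClient.from_connection_string(CONN_STR)
source = service.get_container_client(SOURCE_CONTAINER)

for props in source.list_blobs():
    blob_client = source.get_blob_client(props.name)
    data = blob_client.download_blob().readall()

    # Use the timestamp of the first event in the blob to pick the folder.
    first_event = json.loads(data.splitlines()[0])
    ts = datetime.fromisoformat(first_event["timestamp"])  # assumed field/format

    dest_name = f"{ts:%Y-%m-%d}/{ts:%H}/{props.name}"
    dest_client = service.get_blob_client(container=DEST_CONTAINER, blob=dest_name)

    # Re-uploading creates a new blob, so its LastModifiedTime becomes "now" --
    # which is exactly the folder-vs-LastModified mismatch I'm worried about.
    dest_client.upload_blob(data, overwrite=True)
```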

Any suggestions how to tackle this?

EDIT: Failed to mention: (1) there are no folders in these containers... all blobs live at the root of the container, and (2) the LastModifiedTime on the blobs is no longer useful or meaningful. The reason for the latter is that these blobs were collected from multiple other containers and merged together using the Azure CLI copy-batch command.

Andrew Connell

1 Answer


Can you please try the approach below?

  1. Do this processing in two different jobs: one for the folders with date partitioning (say, partitionedJob), and another for the old blobs without any date partitioning (say, RefillJob).
  2. Since RefillJob has a fixed number of blobs, put a predicate on System.Timestamp to make sure that it only processes old events. Start this job with at least 6 SUs and run it until all the events have been processed. You can confirm this by looking at LastOutputProcessedTime, by looking at the input event count, or by inspecting your output source (see the sketch after this list). After this check, stop the job; it is no longer needed.

  3. Start the partitionedJob with a timestamp greater than the range covered by RefillJob. This assumes the folders for the timestamps exist.
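For the "inspecting your output source" check in step 2, something like the following rough sketch could work, assuming the output sink is an Azure Table; the connection string, table name, and expected count are placeholders:

```python
# Rough sketch only: confirm the RefillJob has drained by counting rows in the
# output sink and comparing against the known input event count. Assumes the
# output is an Azure Table (azure-data-tables SDK); names below are placeholders.
from azure.data.tables import TableServiceClient

CONN_STR = "<storage-account-connection-string>"  # placeholder
OUTPUT_TABLE = "ArchivedEvents"                   # placeholder table name

service = TableServiceClient.from_connection_string(CONN_STR)
table = service.get_table_client(table_name=OUTPUT_TABLE)

# Count entities written so far; compare this to the input event count.
written = sum(1 for _ in table.list_entities())
print(f"Entities in {OUTPUT_TABLE}: {written}")
```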

Vignesh Chandramohan
  • I’m not clear on *partitionedJob* or *refillJob*... what are those? Currently nothing is in folders... it’s just 500,000 blobs within a container. Also, this data was collected from multiple sources, so the blobs all have LastModifiedTime values within a 2-hour window of each other. This is because the blobs were copied from multiple sources... so the LastModifiedTime is now meaningless. I’ve been exploring using partitions: moving batches of 500 blobs into a folder at a time (e.g. BATCH01), then writing the query so it treats these folders as partitions. – Andrew Connell Oct 24 '17 at 20:44
  • What is the application time in the events in those 500000 blobs? What is the maximum diff between that time and LastModifiedTime? – Vignesh Chandramohan Oct 25 '17 at 21:41
  • For the LastModifiedTime, only about 6-8 hours (like I said above, we had two or three Azure CLI `copy-batch` commands running at once from different containers aggregating them together... we did this before we considered ASA and only now realize the implications of it). Within each blob, there is a timestamp field that ranges over 6 months... I'd MUCH rather use that, but it's not an option here. – Andrew Connell Oct 25 '17 at 23:42
  • What does your query look like? I ask because using the timestamp field in a "timestamp by" expression is not an option, as you mentioned, since late-arrival tolerance is enforced and its maximum value is under 30 days. So depending on what the query looks like and whether the timestamp can be changed through an expression (again, this depends on business logic), there might be a way. I will need more details to explain further. – Vignesh Chandramohan Oct 26 '17 at 16:46
  • ATM the query isn't written... I'm just basing my concern on previous issues I ran into with other queries. I'm open to any suggestions. This will be a one-time process; the current pipeline handling live data doesn't run into this problem because it's partitioned into folders correctly. So I'm just trying to get this mass of archived data into the Azure Tables that currently store the same data held in these archived blobs. Once it's imported, I'll have no use for this job. – Andrew Connell Oct 26 '17 at 18:04