
I'm not experienced at all with Azure, yet I've been tasked with setting up the above. The serverless SQL external tables were set up by a contracted company and use SynapseDeltaFormat as the format, if that matters. One of the tables created in this manner has a column where we want the records that are at least 25 characters long to be processed for sentiment analysis.

All of the examples I've been able to find online have you use some other external file as the source, and in this case that isn't what we want to do; the data has already been pulled into the serverless SQL environment. I assume that a pipeline and one or more notebooks would be necessary to move data around, but I can't for the life of me find a resource that explains how such an engine would be set up. In fact, I can't even find a reference for adding the sentiment analysis results to the external tables, as it appears an external table can't be created without an already-existing data source. Does anyone have sources or information available to assist with what I'm trying to do?

Brad

2 Answers


Before starting with Synapse, you must first understand the requirement and architecture.

An external table points to data located in Hadoop, Azure Storage blob, or Azure Data Lake Storage. External tables are used to read data from files or write data to files in Azure Storage. With Synapse SQL, you can use external tables to read external data using dedicated SQL pool or serverless SQL pool.

You mentioned -

The data has already been pulled into the serverless SQL environment.

Where exactly is it located? Is it in Azure Data Lake, Cosmos DB, Dataverse, or a dedicated SQL pool?

Based on the source, you need to design the architecture.

If the data is in a dedicated SQL pool, I recommend using dedicated SQL pool analytics instead of the serverless pool. To know whether the data is in the dedicated pool, you need to check how the data was created. For example, when used in conjunction with the CREATE TABLE AS SELECT statement, selecting from an external table imports data into a table within the dedicated SQL pool.
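As an illustration only (the table and column names below are placeholders, not objects from your workspace), a CTAS import from an external table into a dedicated SQL pool table looks roughly like this:

-- Hypothetical CTAS sketch: imports rows from an existing external table
-- into an internal dedicated SQL pool table. All names are placeholders.
CREATE TABLE dbo.Reviews_Internal
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT ReviewId, ReviewText
FROM dbo.Reviews_External;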

Since you mentioned the serverless SQL pool, I'm assuming that you have created a data source in an Azure Storage account.

Any external tables in Synapse SQL pools follow the below hierarchy:

  • CREATE EXTERNAL DATA SOURCE to reference an external Azure storage and specify the credential that should be used to access the storage.
  • CREATE EXTERNAL FILE FORMAT to describe format of CSV or Parquet files.
  • CREATE EXTERNAL TABLE on top of the files placed on the data source with the same file format.

First, you can CREATE EXTERNAL DATA SOURCE using the below syntax:

CREATE EXTERNAL DATA SOURCE <data_source_name>
WITH
(    LOCATION         = '<prefix>://<path>'
     [, CREDENTIAL = <database scoped credential> ]
     , TYPE = HADOOP
)
[;]
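For example, a data source pointing at an Azure Data Lake Storage Gen2 container might look like the following (the account, container, and credential names are placeholders, and in a serverless SQL pool the TYPE = HADOOP clause is typically omitted):

-- Hypothetical example: storage account, container, and credential names
-- are placeholders for your own values (serverless SQL pool, no TYPE clause).
CREATE EXTERNAL DATA SOURCE MyDataSource
WITH
(   LOCATION   = 'abfss://mycontainer@mystorageaccount.dfs.core.windows.net',
    CREDENTIAL = MyWorkspaceCredential
);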

Secondly, to CREATE EXTERNAL FILE FORMAT you can refer to the below syntax:

-- Create an external file format for PARQUET files.  
CREATE EXTERNAL FILE FORMAT file_format_name  
WITH (  
    FORMAT_TYPE = PARQUET  
    [ , DATA_COMPRESSION = {  
        'org.apache.hadoop.io.compress.SnappyCodec'  
      | 'org.apache.hadoop.io.compress.GzipCodec'      }  
    ]);  

--Create an external file format for DELIMITED TEXT files
CREATE EXTERNAL FILE FORMAT file_format_name  
WITH (  
    FORMAT_TYPE = DELIMITEDTEXT  
    [ , DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec' ]
    [ , FORMAT_OPTIONS ( <format_options> [ ,...n  ] ) ]  
    );  

<format_options> ::=  
{  
    FIELD_TERMINATOR = field_terminator  
    | STRING_DELIMITER = string_delimiter
    | FIRST_ROW = integer
    | USE_TYPE_DEFAULT = { TRUE | FALSE }
    | ENCODING = {'UTF8' | 'UTF16'}
    | PARSER_VERSION = {'parser_version'}
}
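A concrete file format for Parquet files with Snappy compression could, for instance, be created like this (the format name is a placeholder):

-- Hypothetical example: a Parquet file format with Snappy compression.
CREATE EXTERNAL FILE FORMAT ParquetSnappyFormat
WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);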

Last, you can CREATE EXTERNAL TABLE using the below syntax:

CREATE EXTERNAL TABLE { database_name.schema_name.table_name | schema_name.table_name | table_name }
    ( <column_definition> [ ,...n ] )  
    WITH (
        LOCATION = 'folder_or_filepath',  
        DATA_SOURCE = external_data_source_name,  
        FILE_FORMAT = external_file_format_name
        [, TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}' ]
        [, <reject_options> [ ,...n ] ] 
    )
[;] 

<column_definition> ::=
column_name <data_type>
    [ COLLATE collation_name ]

<reject_options> ::=  
{  
    | REJECT_TYPE = value,  
    | REJECT_VALUE = reject_value,  
    | REJECT_SAMPLE_VALUE = reject_sample_value,
    | REJECTED_ROW_LOCATION = '/REJECT_Directory'
}
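Putting the three statements together, a minimal external table over the hypothetical data source and file format above might look like this (table, folder, and column names are placeholders):

-- Hypothetical example: table, folder, and column names are placeholders.
CREATE EXTERNAL TABLE dbo.Reviews_External
(
    ReviewId    INT,
    ReviewText  NVARCHAR(4000)
)
WITH (
    LOCATION    = '/reviews/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = ParquetSnappyFormat
);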

Refer to the official documents Analyze data with a serverless SQL pool and Use external tables with Synapse SQL.

Utkarsh Pal

I'm a bit torn between deleting this and posting what I did to complete this particular task. Ultimately, I created a Synapse notebook (C#, to be specific about the language) that would pull the required data from the serverless SQL instance that we are using in Synapse. It would then pull the already-processed records from the destination and remove any repeats from the new data that still needed processing. It would then call the Language Analysis endpoint for sentiment analysis and write the results back to the destination. The notebook was then added to a Synapse pipeline that would automate gathering the required values from a key vault and a database and passing them into the notebook so it could do its thing.
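For the selection step, the query the notebook runs against the serverless SQL endpoint could be as simple as the following sketch (the table and column names are hypothetical; the WHERE clause applies the 25-character minimum mentioned in the question):

-- Hypothetical sketch of the selection query; table and column names are placeholders.
SELECT ReviewId, ReviewText
FROM dbo.Reviews_External
WHERE LEN(ReviewText) >= 25;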

Brad