0

I am going to apologize ahead time in saying that I am pretty new to SSIS but I've been assigned a project that involves connecting to a table and verifying URL's that part of a products 'attributes'. The verification would basically be looking for something like 200=good and 404=bad and loading the ProductID and the results into either a goodURL or badURL table.

I did read one post that got me in the ballpark but it fell all apart as it was dealing with a specific URL and not thousands of them. So if anyone could point me in the right direction of how to set up the variables and returning the results from the 'HEAD' (from some posts i've read) method of retrieving URL connection info....it would be greatly appreciated.

so i have a SQL task that returns over 1000 records. ProductID, URL_Type, and URL. I then have a script task that i found the code for on stack that i would like to read the individual records and kick out the ProductID, URL_Type, URL, and a new column URL_Valid (true or false). here is the code for script task:

    public void Main()
    {
        Dts.Variables["Found"].Value = UrlFound(Dts.Variables["URLCheck"].Value.ToString());
        Dts.TaskResult = (int)ScriptResults.Success;
    }
    private bool UrlFound(string url)
    {
        try
        {
            //Creating the HttpWebRequest
            HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
            //Setting the Request method HEAD, you can also use GET too.
            request.Method = "HEAD";
            //Getting the Web Response.
            HttpWebResponse response = request.GetResponse() as HttpWebResponse;
            //Returns TRUE if the Status code == 200
            response.Close();
            return (response.StatusCode == HttpStatusCode.OK);
        }
        catch
        {
            return false;
        }

    }

I am just trying to get it to work with just the URL and kick out a true or false right now and that isn't even working. trying to start off small then get to the wanted result at the end. not sure i understand the parameter mapping from a full result set being fed into a script task and how to return results that i want. again, i am new but i am trying to get there. SSIS is the only tool i have to work with on this project. thanks ahead of time.

basic flow

billinkc
  • 59,250
  • 9
  • 102
  • 159
tom pratt
  • 35
  • 6
  • UPDATE: I have now gotten individual or a single URL to run through the script task. The problem is that ALL of them seem to come back as =false and get a 0 loaded into the bad_URL table i have set up. even URL's that i have verified as good run through the script and show as bad? hmmmm. – tom pratt Dec 17 '20 at 18:52
  • Might be that closing the response object kills out the status code. I'd add more logging in there to keep track of what's going on at each step https://billfellows.blogspot.com/2016/04/biml-script-task-test-for-echo.html Make liberal use of Dts.Events.FireInformation to keep track of your local variables. The link is for SSIS scoped variables but the principle is the same – billinkc Dec 17 '20 at 20:07
  • Thanks billinkc. I will look at the article and add logging. Not being a programming guru do i need to add any code in the script task to add a data table for the script to handle the thousands of URL's returned from the SQL task? Trying to figure out how that works. Though the logic isn't correct at this point it seems to return a single URL result but I am needing it to go through a table full. – tom pratt Dec 18 '20 at 03:35
  • I would suggest to do some redesign here: I would include a Dataflow, which loads the URL_Validation tabke instead of a SQL task. In this Dataflow you can include a C# transformation script component, which then redirects good and bad URLs to the corresponding outputs. Thiis way the script component has to handly only one URL at a time and the redirecting of the records is easier to manage. – Tyron78 Dec 18 '20 at 12:37

1 Answers1

0

As mentioned in my comment, I would load the URLs in a Dataflow. Then after the source (e.g. OleDB Source) you should add a Script Component as Transformation. This Script Component should take the URL as Input column and should have two Outputs - ValidUrl and FaultyUrl or similar - which are configured as asynchronous. Then - within this component - you can overwrite the following method:

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string currUrl = Row.URL;
    int statusCode = -1;

    if (!String.IsNullOrEmpty(currUrl))
    {
        try
        {
            statusCode = -1;

            HttpWebRequest request = WebRequest.Create(currUrl) as HttpWebRequest;
            request.Method = "HEAD";
            HttpWebResponse response = request.GetResponse() as HttpWebResponse;

            statusCode = (int)response.StatusCode;
            if (statusCode == 200)
            {
                UrlValidBuffer.AddRow();
                UrlValidBuffer.URL = currUrl;
                UrlValidBuffer.Code = statusCode;
            }
            else
            {
                UrlFaultyBuffer.AddRow();
                UrlFaultyBuffer.URL = currUrl;
                UrlFaultyBuffer.Code = statusCode;
            }
        }
        catch (Exception e)
        {
            ComponentMetaData.FireWarning(10, "Loading URL", "URL " + currUrl + " failed to open: " + e.Message, "", 0);
            UrlFaultyBuffer.AddRow();
            UrlFaultyBuffer.URL = currUrl;
            UrlFaultyBuffer.Code = statusCode;
        }
    }
    else
    {
        ComponentMetaData.FireWarning(10, "Loading URL", "URL empty:" + currUrl, "", 0);
    }
}

However, faulty URLs will already throw an exception during the http request - you will not get any further information from the response. In order to extract article info or whatever, you will have to extract it from the URL, not from the response.

Tyron78
  • 4,117
  • 2
  • 17
  • 32
  • Thanks as this looks like it could be the correct method to work on. Will this be able to pass through a variable the ID of the URL? This way I can join back on it to give us the info we need. This job is to verify which are good and which are bad and then have someone look at the bad ones to figure out and possibly get a new correct URL from a source. So at minimum i will need the Result out of the script task and the URL ID. – tom pratt Dec 18 '20 at 14:10
  • You can put as much info as you want / need / have to separate output columns and join it back as you wish. But as mentioned above: this is not a script task but a script component - the data flows through it and is then redirected either to the valid output or the faulty output. – Tyron78 Dec 18 '20 at 17:09
  • so not being a C# person other than very basic and limited knowledge why would i be getting errors on the UrlValidBuffer and UrlFaultyBuffer? stumbling along as best i can. – tom pratt Dec 18 '20 at 20:37
  • So I've tried to figure this out on my own with pulling what little hair I have left out on this but I can't get it to work. I'm trying to associate the code Tyron78 gave as an example but not sure how it works in my situation. I have three input columns in and really only need one output which is either a text or boolean or whatever to differentiate between a good and bad URL. i believe i am configuring everything correct but really not sure if i need synchronous or asynchronous. i am now not sure how to figure out the output. still don't understand the buffers 'URLValidBuffer and URLFaulty.. – tom pratt Dec 21 '20 at 20:17
  • Hello Tom. Sorry for the delay - currently rarely online due to holiday preparations. My example is for two outputs - one for correct URLs, one for faulty ones. If you use only one output you can define the Input/output as synchronous. If you choose my approach, you need to set it as asynchronous. What are the errors/problems you are currently facing? Maybe you could share your Code Here? – Tyron78 Dec 23 '20 at 00:43
  • No sorry needed. I appreciate the time you have spent. I guess for me I have used SSIS a bit. No pro. Script Task or Script component I haven't used at all and the C# part is hard for me to follow. Anyhow when I used your code that you showed earlier in this post everything looks good except for the UrlValidBuffer and UrlFaultyBuffer. I am not sure if this is the name of the outputs? I am left wondering am I configuring the script component correctly. It's just a bit confusing as I am researching other posts but that doesn't seem to help either. If you could possibly explain those buffers? – tom pratt Dec 24 '20 at 22:43
  • @tompratt When creating the component, you have to add a second output. In my case I renamed the outputs in order to match the above mentioned names. For each output you will have to add the output columns. As mentioned earlier, when using two outputs, they have to be configured as "asynchronous" - otherwise you will receive an error. – Tyron78 Dec 30 '20 at 11:00