-2

I am really new at sql and I want to extract 'SWAMP RIVER NEAR DOVER PLAINS NY' from the following string:

<a href='http://waterdata.usgs.gov/nwis/nwisman/?site_no=01199490'>01199490</a> SWAMP RIVER NEAR DOVER PLAINS NY</a>

The problem is that length of the range I want to extract varies in each row.

I tried the following:

select substring (name, 80 , char_length(name) - 4 ) from stream_gages; 

But I get: SWAMP RIVER NEAR DOVER PLAINS NY</a> not matter what number I put after the minus sign.

Is there a way I can do this?

So, what I am exactly trying to do, is to extract the code and name of each stream gage stations into new columns, contained in:

<a href='http://waterdata.usgs.gov/nwis/nwisman/?site_no=01199490'>01199490</a> SWAMP RIVER NEAR DOVER PLAINS NY</a>

in order to do some GIS queries.

The code is the number starting at position 57, and the name starting in postion 80 (in the case that the code length does not vary which is the most of the cases) however the length of name varies in each row. The length of the code also varies in a few row, but I can edit it manually if it is to hard to come up with a sql that does that. Howevery, for the name in all 240 rows the length is different. Thanks.

Alan Alves
  • 61
  • 1
  • 2
  • 9
  • The third parameter is the length of substring to extract, not the end position –  Apr 12 '14 at 07:38
  • Are you sure you don't want to use an external HTML parser? – Clodoaldo Neto Apr 12 '14 at 11:36
  • If the `length of the range I want to extract varies in each row`, obviously you need to specify *how* it varies. Examples or more explanation, please. – Erwin Brandstetter Apr 12 '14 at 13:10
  • Thanks for the clarifications. @ClodoaldoNeto I am so new in JS and SQL that I am not sure what a HTML parser is. So, yes. The length varies according to length of the station name, but I do know that every station name starts in postion 80 and ends one postion before the first character '<' located after postion 80. So basically the length I want to extract is the number of character in name (the Capital letters). – Alan Alves Apr 12 '14 at 16:42

1 Answers1

0

You could use regexp_replace() , avoiding the literal constants:

create TABLE bla
        ( id SERIAL NOT NULL PRIMARY KEY
        , body varchar
        );

select * from bla;

INSERT INTO bla(body) VALUES
(e'<a href=\'http://waterdata.usgs.gov/nwis/nwisman/?site_no=01199490\'>01199490</a> SWAMP RIVER NEAR DOVER PLAINS NY</a>' )
        ;

select id
  , regexp_replace(body, e'.+<\/a> ([^<]+)<\/a>.*', '\1')
from bla;

And the results:

CREATE TABLE
 id | body 
----+------
(0 rows)

INSERT 0 1
 id |          regexp_replace          
----+----------------------------------
  1 | SWAMP RIVER NEAR DOVER PLAINS NY
(1 row)
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • Thank you for you answer.It worked for this particular row, however it does not work for all of them: "> SWAMP RIVER NEAR DOVER PLAINS " "> TENMILE RIVER NEAR GAYLORDSVIL" "> BLIND BROOK AT RYE NY" "> BEAVER SWAMP BROOK AT MAMARONE" "> HUTCHINSON RIVER AT PELHAM NY<" – Alan Alves Apr 12 '14 at 16:55
  • Well, I can only answer the question that has been asked, can't I? If you have different rows, with different patterns in it, maybe another regexp could be used. (or maybe a totally different solution) – wildplasser Apr 12 '14 at 17:18
  • Sure, The solution you proposed worked for the problem in the original question. Only later than I realized I was not clear in my question, then I edited my question. Thank you very much for answer and time. I am steel trying to figure that out. Thank you one more time. – Alan Alves Apr 12 '14 at 18:14
  • Hi, Here one more time to say that actually the solution proposed works. I was just entering the code with a stupid minor err. If you don't mind, could explain please why did you entered the parameter you entered so can I apply this to other situations? – Alan Alves Apr 12 '14 at 18:41
  • I cannot and (should not) explain you regexp in one answer or comment. Read about them (maybe in the manual I linked). Basically, this particular regexp consist of three parts: the stuff you want `([^<]+)` and everything before `'.+<\/a> ` and after `<\/a>.*'` it. The parenthesised part is referenced by the `\1` in the third argument. – wildplasser Apr 12 '14 at 19:03
  • Ok...Thank you. I know you have helped me to much, but you tell me how I would extract the code :'01199490' instead. I have spent more then 2 hours reading and trying but could not put this to work. I am so grateful for your help. – Alan Alves Apr 12 '14 at 19:54