0

Working in Python 2.7. I'm attempting to remove from a string all things not databases and tablename combinations. I'm using regex for this, and unintentionally removing all the whitespace (which I need to keep to separate the values)

s = "replace view dw1.tbl1_st as select dw2.tbl1_st.col1, dw2.tbl1_st.col2, "
s = s + "dw2.tbl1_st.col3,  dw2.tbl1_st.col4  dw2.tbl1_st.col5, "
s = s + "dw2.tbl1_st.col6, dw2.tbl1_st.col7  dw2.tbl1_st.col15, dw2.tbl1_st.col8, "
s = s + "dw2.tbl1_st.col9, dw2.tbl1_st.col10,  dw2.tbl1_st.col11, dw2.tbl1_st.col12, "
s = s + "dw2.tbl1_st.col13, dw2.tbl1_st.col14 from dw2.tbl1_st;"

replaced = re.sub(r'((?!\w+\.\w+).)', '', s)

Result set is removing the "." between the database and tablenames. But I want "." and the whitespace to remain.

>> replaced
'dw1dw2tbl1_stdw2tbl1_stdw2tbl1_stdw2tbl1_stdw2tbl1_stdw2tbl1_stdw2tbl1_
 stdw2tbl1_stdw2tbl1_stdw2tbl1_stdw2tbl1_stdw2tbl1_stdw2tbl1_
 stdw2tbl1_stdw2tbl1_stdw2'

>> desired_results (Option 1)
'dw1.dw2.tbl1_st dw2.tbl1_st, dw2.tbl1_st, dw2.tbl1_st, dw2.tbl1_st, 
dw2.tbl1_st, dw2.tbl1_st, dw2.tbl1_st, dw2.tbl1_st, dw2.tbl1_st, 
dw2.tbl1_st, dw2.tbl1_st, dw2.tbl1_st, dw2.tbl1_st, dw2.tbl1_st, dw2.'

Or equally workable:

>> desired_results (Option 2)
'dw1 dw2tbl1_st dw2tbl1_st dw2tbl1_st dw2tbl1_st dw2tbl1_st 
dw2tbl1_st dw2tbl1_st dw2tbl1_st dw2tbl1_st dw2tbl1_st 
dw2tbl1_st dw2tbl1_st dw2tbl1_st dw2tbl1_st dw2tbl1_st dw2'
Lee_Str
  • 3,266
  • 2
  • 21
  • 32

1 Answers1

0

One option, which will work if you know the structure of your string and it is fairly regular, is instead of useing . to match everything, using negation to match anything BUT a space or comma:

>>> replaced = re.sub(r'((?!\w+\.\w+)[^, ])', '', s)
>>> replaced
'  dw1   dw2tbl1_st, dw2tbl1_st, dw2tbl1_st,  dw2tbl1_st  dw2tbl1_st,
dw2tbl1_st, dw2tbl1_st  dw2tbl1_st, dw2tbl1_st, dw2tbl1_st, dw2tbl1_st,
dw2tbl1_st, dw2tbl1_st, dw2tbl1_st, dw2tbl1_st  dw2'

Or better yet, use re.findall and a negative capture group: , finally joining the resulting list with a space or whatever you want:

>>> " ".join(re.findall(r'((?:\w+\.\w+))',s))
'dw1.tbl1_st dw2.tbl1_st dw2.tbl1_st dw2.tbl1_st dw2.tbl1_st
dw2.tbl1_st dw2.tbl1_st dw2.tbl1_st dw2.tbl1_st dw2.tbl1_st 
dw2.tbl1_st dw2.tbl1_st dw2.tbl1_st dw2.tbl1_st dw2.tbl1_st
dw2.tbl1_st dw2.tbl1_st'
juanpa.arrivillaga
  • 88,713
  • 10
  • 131
  • 172