0

I am using two datasets from the NYC MTA turnstile data and subway station location one contains the turnstile data collected at a particular subway station while the other contains the longitude and latitude of said subway station. There is no common key between the tables. I had hoped to use subway station name however there are many different stations containing the same name within a table and in addition the naming conventions are slightly different between the tables. To overcome this I would like to use a combination of the subway station name and the lines present at the station to join the tables based on substrings.

For example:

In the train station locations table one row contains

+------------------------+-----------------+
|Name                    |Line             |
+------------------------+-----------------+
|Lexington Ave - 59th St | 4-5-6-6 Express |
+------------------------+-----------------+

While in the train station data table one row may look like this

+---------+-----------------+
| Station | LineName        |
+---------+-----------------+
| 59 ST   | NQR456W         |
+---------+-----------------+

The best workaround I could think of is to do some kind of search using the LIKE keyword OR LOCATE function to return back singular rows that contain the same substrings of characters for the station and line i.e. LIKE("%59%") AND NQR456 . I'm hoping to ignore substrings like ST and AVE and characters like '''-'''.

Once I have these rows I would like to make a new column with a proper key of a shared unique id for each station that I can make a JOIN statement on.

Thank you in advance for all of your help

Ive tried the query below however it is not working as intended due to only searching for one substring within another

SELECT tsl.station, td.station, td.linename, tsl.line
FROM train_station_locations tsl, turnstile_data td
WHERE CONCAT('%',LOWER(tsl.station),'%')
 LIKE CONCAT('%', REPLACE(REPLACE(td.station," st","")," ",""),'%') 
 AND  CONCAT('%',LOWER(td.linename),'%') LIKE 
 REPLACE(CONCAT('%',LOWER(tsl.line),'%'),"-","");

I've referred to the following questions

https://stackoverflow.com/a/40140482/9367155

SQL: Join tables on substrings

zealous
  • 7,336
  • 4
  • 16
  • 36

1 Answers1

0

It must be frustrating to deal with data that doesn't have a PK...

Based on the data you are sharing above, it seems it would work to strip out the non-numeric characters of both fields and look for a match. 59 = 59.

MySQL 8 supports REGEXP_REPLACE: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-replace

Prior to MySQL 8, you can create a custom function: MySQL strip non-numeric characters to compare

yg-dba
  • 330
  • 1
  • 6