1

I have a table full of stock price data. Each row has a unique combination of Ticker symbols and dates. I load new data all the time by obtaining CSV files containing stock price data for everyday for every ticker. I know that there are duplicates in the CSV files. I only want to add that data that is not already in my data table. What is the quickest way to do this?

Should I try to add every row and catch each exception? Or, should I compare each row against my data table by reading my data table to see that line already exists? Or, is there another alternative?

Additional Info

This is what I have been doing. For each line in the CSV file I read my data table to see if it already exists.

Dim strURL As String
    Dim strBuffer As String
    strURL = "http://ichart.yahoo.com/table.csv?s=" & tickerValue
    strBuffer = RequestWebData(strURL)
    Dim sReader As New StringReader(strBuffer)
    Dim List As New List(Of String)
    Do While sReader.Peek >= 0
        List.Add(sReader.ReadLine)
    Loop
    List.RemoveAt(0)
    Dim lines As String() = List.ToArray
    sReader.Close()
    For Each line In lines
        Dim checkDate = line.Split(",")(0).Trim()
        Dim dr As OleDbDataReader
        Dim cmd2 As New OleDb.OleDbCommand("SELECT * FROM " & tblName & " WHERE Ticker = ? AND [Date] = ?", con)
        cmd2.Parameters.AddWithValue("?", tickerValue)
        cmd2.Parameters.AddWithValue("?", checkDate)
        dr = cmd2.ExecuteReader
        If dr.Read() = 0 Then
            Dim cmd3 As OleDbCommand = New OleDbCommand("INSERT INTO " & tblName & " (Ticker, [Date], [Open], High, Low, [Close], Volume, Adj_Close) VALUES (?, ?, ?, ?, ?, ?, ?, ?)", con)
            cmd3.Parameters.Add("@Ticker", OleDbType.VarChar).Value = tickerValue
            cmd3.Parameters.Add("@[Date]", OleDbType.VarChar).Value = checkDate
            cmd3.Parameters.Add("@[Open]", OleDbType.VarChar).Value = line.Split(",")(1).Trim
            cmd3.Parameters.Add("@High", OleDbType.VarChar).Value = line.Split(",")(2).Trim
            cmd3.Parameters.Add("@Low", OleDbType.VarChar).Value = line.Split(",")(3).Trim
            cmd3.Parameters.Add("@[Close]", OleDbType.VarChar).Value = line.Split(",")(4).Trim
            cmd3.Parameters.Add("@Volume", OleDbType.VarChar).Value = line.Split(",")(5).Trim
            cmd3.Parameters.Add("@Adj_Close", OleDbType.VarChar).Value = line.Split(",")(6).Trim
            cmd3.ExecuteNonQuery()
        Else
        End If

This is what I have switched to and it gives this exception: The changes you requested to the table were not successful because they would create duplicate values in the index, primary key, or relationship. Change the data in the field or fields that contain duplicate data, remove the index, or redefine the index to permit duplicate entries and try again. I could catch this exception every time and ignore it until I hit a line that is new.

Dim strURL As String = "http://ichart.yahoo.com/table.csv?s=" & tickerValue
    Debug.WriteLine(strURL)
    Dim strBuffer As String = RequestWebData(strURL)
    Using streamReader = New StringReader(strBuffer)
        Using reader = New CsvReader(streamReader)
            reader.ReadHeaderRecord()
            While reader.HasMoreRecords
                Dim dataRecord As DataRecord = reader.ReadDataRecord()
                Dim cmd3 As OleDbCommand = New OleDbCommand("INSERT INTO " & tblName & " (Ticker, [Date], [Open], High, Low, [Close], Volume, Adj_Close) VALUES (?, ?, ?, ?, ?, ?, ?, ?)", con)
                cmd3.Parameters.Add("@Ticker", OleDbType.VarChar).Value = tickerValue
                cmd3.Parameters.Add("@[Date]", OleDbType.VarChar).Value = dataRecord.Item("Date")
                cmd3.Parameters.Add("@[Open]", OleDbType.VarChar).Value = dataRecord.Item("Open")
                cmd3.Parameters.Add("@High", OleDbType.VarChar).Value = dataRecord.Item("High")
                cmd3.Parameters.Add("@Low", OleDbType.VarChar).Value = dataRecord.Item("Low")
                cmd3.Parameters.Add("@[Close]", OleDbType.VarChar).Value = dataRecord.Item("Close")
                cmd3.Parameters.Add("@Volume", OleDbType.VarChar).Value = dataRecord.Item("Volume")
                cmd3.Parameters.Add("@Adj_Close", OleDbType.VarChar).Value = dataRecord.Item("Adj Close")
                cmd3.ExecuteNonQuery()
            End While
        End Using
    End Using

I just want to use the most efficient method.

Update

Per the answers below, this is the code I have so far:

 Dim strURL As String = "http://ichart.yahoo.com/table.csv?s=" & tickerValue
    Dim strBuffer As String = RequestWebData(strURL)
    Using streamReader = New StringReader(strBuffer)
        Using reader = New CsvReader(streamReader)
            ' the CSV file has a header record, so we read that first
            reader.ReadHeaderRecord()

            While reader.HasMoreRecords
                Dim dataRecord As DataRecord = reader.ReadDataRecord()
                Dim cmd3 As OleDbCommand = New OleDbCommand("INSERT INTO " & tblName & "(Ticker, [Date], [Open], High, Low, [Close], Volume, Adj_Close) " & "SELECT ?, ?, ?, ?, ?, ?, ?, ? " & "FROM DUAL " & "WHERE NOT EXISTS (SELECT 1 FROM " & tblName & " WHERE Ticker = ? AND [Date] = ?)", con)
                cmd3.Parameters.Add("@Ticker", OleDbType.VarChar).Value = tickerValue
                cmd3.Parameters.Add("@[Date]", OleDbType.VarChar).Value = dataRecord.Item("Date")
                cmd3.Parameters.Add("@[Open]", OleDbType.VarChar).Value = dataRecord.Item("Open")
                cmd3.Parameters.Add("@High", OleDbType.VarChar).Value = dataRecord.Item("High")
                cmd3.Parameters.Add("@Low", OleDbType.VarChar).Value = dataRecord.Item("Low")
                cmd3.Parameters.Add("@[Close]", OleDbType.VarChar).Value = dataRecord.Item("Close")
                cmd3.Parameters.Add("@Volume", OleDbType.VarChar).Value = dataRecord.Item("Volume")
                cmd3.Parameters.Add("@Adj_Close", OleDbType.VarChar).Value = dataRecord.Item("Adj Close")
                cmd3.Parameters.Add("@Ticker", OleDbType.VarChar).Value = tickerValue
                cmd3.Parameters.Add("@[Date]", OleDbType.VarChar).Value = dataRecord.Item("Date")
                cmd3.ExecuteNonQuery()
            End While
        End Using
    End Using

It gives me this error Data type mismatch in criteria expression.

Raidri
  • 17,258
  • 9
  • 62
  • 65
gromit1
  • 577
  • 2
  • 14
  • 36

1 Answers1

1

Most DBMS support a (non-standard) clause for the INSERT command to ignore duplicates, e.g.:

MySQL: INSERT IGNORE INTO ...

SQLite: INSERT OR IGNORE INTO INTO ...

This is the quickest way in non-batch mode, as you don't have to read the database before you write.

You can do the same with standard SQL using:

INSERT INTO ... 
SELECT <your values> 
WHERE NOT EXISTS ( <query for your values by id> );

Or (when you explicitly need a FROM clause):

INSERT INTO ... 
SELECT <your values> 
FROM DUAL 
WHERE NOT EXISTS ( <query for your values by id> );

EDIT

MS Access does not have a built-in DUAL table (i.e., a table that always contains just one single row), but Access requires a FROM clause. So you have to build your own DUAL table:

CREATE TABLE DUAL (DUMMY INTEGER);
INSERT INTO DUAL VALUES (1);

You just do this once and for all. Then, in your code you would do inserts like

INSERT INTO MyTable (A,B,C,D)
SELECT 123, 456, 'Hello', 'World'
FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM MyTable WHERE A = 123 AND B = 456);

Thus, for your example, use:

Dim cmd3 As OleDbCommand = New OleDbCommand(_ 
    "INSERT INTO " & tblName &  _ 
    "(Ticker, [Date], [Open], High, Low, [Close], Volume, Adj_Close) " & _ 
    "SELECT ?, ?, ?, ?, ?, ?, ?, ? " & _ 
    "FROM DUAL " & _
    "WHERE NOT EXISTS (SELECT 1 FROM tblName WHERE Ticker = ? AND [Date] = ? AND ...)", con)

(WHERE clause depending on your key columns)

Fabian
  • 2,822
  • 1
  • 17
  • 22
  • I tried to run the code you suggested and I am getting this error `Syntax error (missing operator) in query expression '(?, ?, ?, ?, ?, ?, ?, ?) WHERE NOT EXISTS (SELECT 1 FROM Historical_Stock_Prices WHERE Ticker = goog AND [Date] = 2013-11-08)'.` Any ideas? – gromit1 Nov 11 '13 at 18:18
  • 1
    I have edited the post for MS Access. Unfortunately, Access neither has a DUAL table nor does it permit SELECT without FROM. So you have to build your own DUAL table. – Fabian Nov 11 '13 at 18:21
  • 1
    You just need it once and for all, it will always be exactly one row (it is a pre-defined "constant table" in other DBMS), you can create it manually in Access. It just needs to be there for your code to work. – Fabian Nov 11 '13 at 18:26
  • I'm not understanding this concept. – gromit1 Nov 11 '13 at 18:29
  • 1
    There are two kinds of INSERT statements in SQL: INSERT INTO ... VALUES and INSERT INTO ... SELECT. The latter inserts the results of a query into a table, this can be zero or any number of results. The "trick" (which is quite standard) in my post is to select a constant tuple (the row you want to insert) from a table (DUAL) that always contains one row. With the WHERE EXISTS ( ... ), the SELECT FROM DUAL either gives you one row (when the record is not yet there), so the INSERT will insert that row, or the SELECT FROM DUAL will give you zero rows, so no records are inserted. – Fabian Nov 11 '13 at 18:34
  • 1
    I have added a concrete example to my post (for table MyTable). – Fabian Nov 11 '13 at 18:40
  • Okay I understand the concept now. I still don't know the exact syntax that I need to create the DUAL table. I need to do it from within my program. Sorry for being dense. – gromit1 Nov 11 '13 at 18:57
  • 1
    You can create it like this: CREATE TABLE DUAL(DUMMY INT); INSERT INTO DUAL VALUES(1); Just make sure you do this only once - DUAL must contain exactly one record. So, I suggest, don't do the INSERT INTO DUAL if the CREATE TABLE DUAL fails because you already have it from a previous run. – Fabian Nov 11 '13 at 21:25
  • For my example, what data would I put into the DUAL table when I create it? – gromit1 Nov 12 '13 at 16:02
  • 1
    The DUAL table is application: It must always contain ONE row. Usually, it also has only one column (e.g., named DUMMY). The content of that single row does not matter, you never actually select that row. You can put '1' if you want. See my edited post. The sole purpose of DUAL is to have a SELECT that either returns ZERO or ONE rows. The columns of a SELECT FROM DUAL are usually constants or values derived from somewhere else. – Fabian Nov 12 '13 at 16:37
  • I've created the DUAL Table only once and when I run the code you suggested in your answer, I get this error message `Additional information: Syntax error (comma) in query expression '(?, ?, ?, ?, ?, ?, ?, ?)'.` – gromit1 Nov 12 '13 at 16:58
  • 1
    Oh, sorry, you have to remove the parentheses. SELECT ?,?,? instead of SELECT (?,?,?). I did not pay enough attention here. – Fabian Nov 12 '13 at 17:00
  • Now I'm getting `No value given for one or more required parameters.` – gromit1 Nov 12 '13 at 18:41
  • 1
    Compared to your previous INSERT INTO ... VALUES form, you will have a few more parameters, as you will have the parameters for the key columns repeated in the WHERE condition of the EXISTS sub-query. – Fabian Nov 12 '13 at 19:14
  • Solved that problem. Now I get `Data type mismatch in criteria expression.` – gromit1 Nov 12 '13 at 19:31
  • 1
    Probably you don't provide the parameters in the right order. Please add your current code to your question, so I can have a look. – Fabian Nov 12 '13 at 19:33
  • I have added my current code to my original post. Thanks for all of your help! – gromit1 Nov 12 '13 at 19:37
  • 1
    Two things to mention: (1) You use anonymous parameter ("?") in your query, but you add named values for named parameters in cmd3.Parameters.Add(). You should either use named parameters in your query, or, if you use anonymous parameters in your SQL, you should use the corresponding Add method (see [here](http://msdn.microsoft.com/de-de/library/system.data.oledb.oledbparametercollection%28v=vs.110%29.aspx). (2) You declare all your types as varchar, but some seem to be dates. Could be that your previous version applied implicit type conversions which are not applied automatically anymore now. – Fabian Nov 12 '13 at 19:58
  • Those things make sense. What is the best way to fix them. – gromit1 Nov 12 '13 at 20:12
  • I've tried to make some changes and they haven't worked. I stil get the Datatype Mismatch error. I've even tried converting the values to the correct Datatype. – gromit1 Nov 12 '13 at 20:29
  • 1
    I have checked th docs: Apparently, sql parameters *have* to be "?" in OleDb, and the names are just ignored in the parameters collection. Only the position counts,. So, in theory, your code is almost fine already, you just have set the proper [data types](http://msdn.microsoft.com/de-de/library/system.data.oledb.oledbtype%28v=vs.110%29.aspx) in Add(), according to your column types, e.g., Date for your date columns, instead of Varchar. – Fabian Nov 12 '13 at 20:30
  • In regards to my original question, which way do you think is the most efficient? – gromit1 Nov 12 '13 at 20:36
  • 1
    Hmm. In cmd3.Parameters.Add("@[Date]", OleDbType.VarChar).Value = dataRecord.Item("Date"), I would expect to use OleDbType.Date and a conversion to a date value on the right-hand side. If that does not work, I don't have a clue. Sorry, I know quite some SQL, but I don't know OleDb well. Maybe you should repost your current problem as a new question on SO. – Fabian Nov 12 '13 at 20:36
  • 1
    Regarding your original question: What I proposed is the fastest way using vanilla SQL. However, as MS Access is an in-process DB, querying the record first (to check that it exists) and then inserting it (if necessary) will only be slightly more expensive. So maybe you are investing too much time here. Sorry that I cannot be of more help on your OleDb issue. – Fabian Nov 12 '13 at 20:40
  • Thank you so much! You've been great! You have more patients than I. – gromit1 Nov 12 '13 at 20:42