0

I have a CSV file encoded as UTF-8, which I downloaded off IMDb.com. I would like to import this data into SSMS 2016 (or 2014) using the Import Wizard. Here is a sample of what the CSV looks like (note the director of Dallas Buyers Club is Jean-Marc Vallée):

"position","const","created","modified","description","Title","Title type","Directors","You rated","IMDb Rating","Runtime (mins)","Year","Genres","Num. Votes","Release Date (month/day/year)","URL"
"38","tt1636826","Tue Feb 16 00:00:00 2016","","","Project X","Feature Film","Nima Nourizadeh","6","6.7","88","2012","comedy, crime","155628","2012-03-01","http://www.imdb.com/title/tt1636826/"
"39","tt0119528","Tue Feb 16 00:00:00 2016","","","Liar Liar","Feature Film","Tom Shadyac","6","6.8","86","1997","comedy, fantasy, romance","217817","1997-03-18","http://www.imdb.com/title/tt0119528/"
"40","tt0790636","Tue Feb 16 00:00:00 2016","","","Dallas Buyers Club","Feature Film","Jean-Marc Vallée","7","8.0","117","2013","biography, drama","321602","2013-09-07","http://www.imdb.com/title/tt0790636/"

I select Flat File Source in the Import Wizard, select my file, and go with the default options (while adding a " as the text qualifier). However, this is an example of what I'm seeing: https://i.stack.imgur.com/nL4n8.jpg

The diacritic character é is being turned into é. I tried selecting Unicode next to "Locale" in the Import Wizard, but it converted everything to Chinese characters and placed it all in a single cell.

Any idea what is going on here?

Walker
  • 153
  • 2
  • 9
  • What format is the SSIS tool returning it? Do some research on the style of UTF-8. For example, 1252 is LATIN ANSI in DT_String – clifton_h Jun 29 '16 at 23:53
  • 1
    your flat file source in the image is stating that it is expecting code page 1252. But you are saying the flat file is UTF-8 the code page selected should match the file. I haven't tried but I have seen in other answers that UTF-8 is supposed to be available in SQL 2016. so because this is driving me nuts at this point can you provide the link to what you downloaded from IMDB.com, I would like to setup a test scenario with the same data for my own knowledge..... – Matt Jun 29 '16 at 23:59
  • 1
    also FYI you should post your images directly into the question rather than linking to your own website. – Matt Jun 30 '16 at 00:01
  • 1
    https://www.mssqltips.com/sqlservertip/3119/import-utf8-unicode-special-characters-with-sql-server-integration-services/ here is an article on it. change codepage to 65001 (UTF-8) and then ensure your datatype is UNICODE STRING DT_WSTR – Matt Jun 30 '16 at 00:05
  • @Matt Hi Matt, if you go to [basic links](www.imdb.com), create an account, and rate a few movies. Then, select your name in the top right corner, go to "Your Ratings", scroll to the bottom and click "Export this list" (right after the last movie on the page). It will save as a CSV. – Walker Jun 30 '16 at 15:39
  • @Matt That worked! Post as the answer to the question. Ultimately I need to do this with a BULK INSERT rather than through the Import Wizard, but this gets me a step closer. – Walker Jun 30 '16 at 16:21
  • 1
    @Walker posted comments as answer. Why the desire to use Bulk insert? you can create a SSIS package (which is what the import wizard does) and make it flexible enough to handle file name as a paramemter/variable and execute from t-SQL.... – Matt Jun 30 '16 at 17:02
  • @Matt Frankly, I don't know SSIS well. I think I'm really close. I'm trying without a format file, but I'm having issues with the rowterminator now. I tried '\n' but it is returning one row with the last column having nearly all of the data. Here is what I have: `BULK INSERT [Test].[dbo].[rawData] FROM 'C:\IMDbRatings2\Files\ratings.csv' WITH ( CODEPAGE = '65001' ,DATAFILETYPE = 'char' --,FIRSTROW = 2 ,FIELDTERMINATOR='","' ,ROWTERMINATOR='\n' );` I checked to see if the diacritic characters are pulling in correctly, and they are! I just need to get the row terminator. – Walker Jun 30 '16 at 17:07
  • 1
    try \r\n which is basically char(13) char(10) or Carriage Return Line Feed, you can open your file in notepad++ and tell it to show you the line endings which will let you know. CR = Carriage Return and LF= Line Feed in when looking at it in notepad++ – Matt Jun 30 '16 at 17:11
  • @Matt I tried \r\n but it did not work. I looked around online and tried ROWTERMINATOR='0x0a', and it worked! I'm so happy. Thank you so much for your help through this. – Walker Jun 30 '16 at 17:36

1 Answers1

2

Change the encoding on the flat file connection to codepage 65001 (UTF-8) and ensure that the data type is unicode string DT_WSTR.

This link has more of a step by step directions for the process. https://www.mssqltips.com/sqlservertip/3119/import-utf8-unicode-special-characters-with-sql-server-integration-services/

banavalikar
  • 272
  • 1
  • 7
Matt
  • 13,833
  • 2
  • 16
  • 28