
Possible Duplicate:
Why does Mercurial think my SQL files are binary?

I generated a complete set of scripts for the stored procedures in a database. When I created a Mercurial repository and added these files, they were all added as binary. I still get the benefits of versioning, but I lose the efficiency, diffing, and so on that come with text files. I verified that these files are indeed all just text.

Why is it doing this?

What can I do to avoid it?

Is there a way to get Hg to change its mind about these files?

Here is a snippet of changeset log:

   496.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFindCustomerByMatchCode.StoredProcedure.sql has changed
   497.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFindUnreconcilableChecks.StoredProcedure.sql has changed
   498.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFixBadLabelSelected.StoredProcedure.sql has changed
   499.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFixCCOPL.StoredProcedure.sql has changed
   500.1 Binary file SQL/SfiData/Stored Procedures/dbo.pFixCCOrderMoneyError.StoredProcedure.sql has changed

Thanks in advance for your help. Jim

Jim Reineri
  • Hg decides a file is binary if it contains a NUL byte; is this the case? – tonfa Sep 16 '10 at 11:16
  • Which charset do these .sql files use? Some charsets (like UTF-16 and UTF-32) are recognized as binary. – Rudi Sep 16 '10 at 11:47

2 Answers


In keeping with Mercurial's view of binary files, it does not actually track file types, which means there is no way for a user to mark a file as binary or text.

As tonfa and Rudi mentioned, Mercurial determines whether a file is binary by checking for a NUL byte anywhere in the file. In UTF-16 or UTF-32 encoded files, NUL bytes are all but guaranteed.
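The heuristic can be sketched in a few lines of Python (a simplified illustration, not Mercurial's actual code):

```python
def looks_binary(data: bytes) -> bool:
    """Simplified sketch of Mercurial's heuristic: a file is
    treated as binary if it contains a NUL byte anywhere."""
    return b"\x00" in data

# UTF-8 text contains no NUL bytes for ASCII characters...
print(looks_binary("SELECT 1;".encode("utf-8")))   # False
# ...but UTF-16 pads every ASCII character with a NUL byte.
print(looks_binary("SELECT 1;".encode("utf-16")))  # True
```

This is why an exported script that looks like plain text in an editor can still trip the binary check: the editor hides the encoding, but the NUL bytes are there on disk.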

To "fix" this, you would have to ensure that the files are encoded with UTF-8 instead of UTF-16. Ideally, your database would have a setting for Unicode encoding when doing the export. If that's not the case, another option would be to write a precommit hook to do it (see How to convert a file to UTF-8 in Python for a start), but you would have to be very careful about which files you were converting.
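Such a hook could perform the conversion along these lines; this is a minimal Python sketch (the function name and the BOM-based guard are my own assumptions, not a ready-made hook):

```python
import codecs

def convert_utf16_to_utf8(path: str) -> None:
    """Re-encode a UTF-16 file as UTF-8 in place, but only if it
    actually starts with a UTF-16 byte-order mark (BOM)."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")          # decoding strips the BOM
        with open(path, "wb") as f:
            f.write(text.encode("utf-8"))    # written back without a BOM
```

Guarding on the BOM is what makes it safer to run over a whole tree: files that are already UTF-8 or plain ASCII are left untouched.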

tghw
  • tghw has the right answer, but it's worth pointing out explicitly that "binary" and "text" files are handled identically by Mercurial internally. They only differ in which merge tools are launched (easily configured) and what is shown to users on diff/incoming/outgoing. The actual storage and merging are the same. – Ry4an Brase Sep 16 '10 at 14:06
  • The problem was indeed Unicode encoding. The db export only allows choosing Unicode or ANSI; it does not give any more specific Unicode choices. I changed the output to ANSI and got the behavior that I wanted. – Jim Reineri Sep 16 '10 at 14:18
  • Thank you all for your assistance. – Jim Reineri Sep 16 '10 at 14:20

I know it's a bit late, but I was evaluating Kiln and came across this problem. After a discussion with the folks at FogBugz, who couldn't give me an answer other than "File/Save As" from SSMS for every *.sql file (very tedious), I decided to write a quick script to convert the *.sql files.

Fortunately you can use one Microsoft technology (PowerShell) to (sort of) overcome an issue with another Microsoft technology (SSMS). Using PowerShell, change to the directory that contains your *.sql files, then copy and paste the following into the PowerShell shell (or save it as a .ps1 script and run it from PowerShell; make sure to run "Set-ExecutionPolicy RemoteSigned" before trying to run a .ps1 script):

function Get-FileEncoding
{
  [CmdletBinding()]
  Param (
    [Parameter(Mandatory = $True, ValueFromPipelineByPropertyName = $True)]
    [string]$Path
  )

  # Read the first four bytes and look for a known byte-order mark (BOM)
  [byte[]]$byte = Get-Content -Encoding Byte -ReadCount 4 -TotalCount 4 -Path $Path

  if ($byte[0] -eq 0xef -and $byte[1] -eq 0xbb -and $byte[2] -eq 0xbf)
  { Write-Output 'UTF8' }
  elseif ($byte[0] -eq 0xfe -and $byte[1] -eq 0xff)   # UTF-16 big endian
  { Write-Output 'Unicode' }
  elseif ($byte[0] -eq 0xff -and $byte[1] -eq 0xfe)   # UTF-16 little endian
  { Write-Output 'Unicode' }
  elseif ($byte[0] -eq 0 -and $byte[1] -eq 0 -and $byte[2] -eq 0xfe -and $byte[3] -eq 0xff)
  { Write-Output 'UTF32' }
  elseif ($byte[0] -eq 0x2b -and $byte[1] -eq 0x2f -and $byte[2] -eq 0x76)
  { Write-Output 'UTF7' }
  else
  { Write-Output 'ASCII' }
}

# Re-encode every UTF-16 .sql file in the current directory as UTF-8
$files = Get-ChildItem "*.sql"
foreach ($file in $files)
{
  $encoding = Get-FileEncoding $file
  if ($encoding -eq 'Unicode')
  {
    (Get-Content "$file" -Encoding Unicode) | Set-Content -Encoding UTF8 "$file"
  }
}

The function Get-FileEncoding is courtesy of http://poshcode.org/3227, although I had to modify it slightly to cater for UCS-2 little-endian files, which is what SSMS seems to have saved these as. I would recommend backing up your files first, as the script overwrites the originals. You could, of course, modify the script so that it saves a UTF-8 copy of each file instead, e.g. by changing the last line of code to:

(Get-Content "$file" -Encoding Unicode) | Set-Content -Encoding UTF8 "$file.new"

The script should be easy to modify to traverse subdirectories as well.

Now you just need to remember to run this if there are any new *.sql files, before you commit and push your changes. Any files already converted and subsequently opened in SSMS will stay as UTF-8 when saved.

misterjaytee
  • An alternative is to change the default SQL template file to UTF-8 (or whatever encoding you want), which will affect all future saves through SSMS. The gist is to open this file in notepad, then "Save" and change the Encoding: `C:\Program Files (x86)\\Common7\IDE\SqlWorkbenchProjectItems\Sql\SQLFile.sql`. Credit goes to https://joehanna.com/sql-server/changing-the-default-encoding-of-sql-files-in-ssms/ – Johann Mar 24 '20 at 14:28
  • Thanks for the comment - this was posted 8 years ago, so I'm not sure if this option was available back then. It might be worth you composing your own answer for folks that are only just coming across this issue. Although, having a default is great but doesn't solve the problem of all of the files encoded from yesteryear. – misterjaytee Apr 16 '20 at 19:20
  • Looks like I did post it as an answer on the duplicate question. https://stackoverflow.com/a/60833119 I wish I had just linked there to begin with :P – Johann Apr 16 '20 at 22:47