Unfortunately, BCP's batch_size argument does not control the size of the output file.
Ways I've done this type of splitting:
1 - Simple but non-repeatable: Create a command file (.cmd) that runs multiple BCPs over the table for specific row ranges. This probably requires an IDENTITY(1,1)-based primary key on the table so the ranges are contiguous.
bcp "SELECT * FROM MyTable WHERE Id BETWEEN 0 AND 10000000" queryout …
bcp "SELECT * FROM MyTable WHERE Id BETWEEN 10000000 AND 20000000" queryout …
2 - Simple and repeatable, uses a lot of disk: BCP out the entire table to a single file and use split to create as many new files as needed, each holding a given number of bytes (note: splitting by whole lines would be a better idea IMO; see the sketch after the example). Use 'Cygwin' (GnuWin32 is no longer maintained) to install split and any other utilities you want.
bcp MyDb.MySchema.MyTable out C:\MyFile.csv -T -w
split -b 10737418240 C:\MyFile.csv C:\MySplitFile_
This generates the following files (10 GB each):
C:\MySplitFile_aaa
C:\MySplitFile_aab
…
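If you would rather split on whole lines, so that no row is cut in half across two files, split can count lines instead of bytes. A sketch, assuming the file has already been converted to UTF-8 (see the final note); running a byte-oriented line split directly on the UTF-16LE output would slice its two-byte characters apart. The line count is illustrative:
# start a new file every 10,000,000 lines instead of every N bytes
split -l 10000000 C:\MyFile.csv C:\MySplitFile_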
3 - Complex but repeatable, requires possibly insecure T-SQL: Use the xp_cmdshell extended stored procedure (disabled by default, and a security risk when enabled) to call BCP from inside a stored procedure that iterates through the table.
DECLARE @loop AS INT;
--Use WHILE to loop as needed--
DECLARE @sql AS VARCHAR(MAX);
--Add code to generate dynamic SQL here--
DECLARE @bcp AS VARCHAR(MAX);
--@loop must be CAST: concatenating an INT into a VARCHAR raises a conversion error--
SELECT @bcp='BCP "'+@sql+'" queryout C:\MyFolder\MyFile_'+CAST(@loop AS VARCHAR(10))+'.csv';
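To make that concrete, here is a minimal runnable sketch of the loop, assuming the IDENTITY-keyed MyTable from option 1, an illustrative chunk size of 10,000,000 rows per file, and that xp_cmdshell has been enabled with sp_configure:
DECLARE @loop INT = 0;
DECLARE @chunk BIGINT = 10000000; -- rows per output file (illustrative)
DECLARE @maxId BIGINT = (SELECT MAX(Id) FROM MyDb.MySchema.MyTable);
DECLARE @sql VARCHAR(MAX);
DECLARE @bcp VARCHAR(8000); -- xp_cmdshell takes varchar(8000), not MAX
WHILE @loop * @chunk <= @maxId
BEGIN
    -- Export one contiguous, non-overlapping ID range per pass
    SET @sql = 'SELECT * FROM MyDb.MySchema.MyTable WHERE Id BETWEEN '
        + CAST(@loop * @chunk AS VARCHAR(20)) + ' AND '
        + CAST((@loop + 1) * @chunk - 1 AS VARCHAR(20));
    SET @bcp = 'BCP "' + @sql + '" queryout C:\MyFolder\MyFile_'
        + CAST(@loop AS VARCHAR(10)) + '.csv -T -w';
    EXEC master..xp_cmdshell @bcp;
    SET @loop += 1;
END;
Gaps in the IDENTITY values are harmless here; they just make some files smaller.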
FINAL NOTE: If you are using any NVARCHAR fields in your data then you need to use the -w flag and be aware that the output will be in UTF-16LE. I would strongly recommend converting that to UTF-8 using iconv (from 'Cygwin' again) before trying to do anything with it in Hadoop.
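A sketch from a Cygwin shell, where C:\ appears as /cygdrive/c (the _utf8 output name is just illustrative):
# re-encode BCP's -w output from UTF-16LE to UTF-8 for Hadoop
iconv -f UTF-16LE -t UTF-8 /cygdrive/c/MyFile.csv > /cygdrive/c/MyFile_utf8.csv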