Unfortunately, BCP's batch_size argument does not control the size of the output file.
Ways I've done this type of splitting:
1 - Simple but non-repeatable: Create a command file (.cmd) that runs multiple BCPs over the table for specific row ranges. This probably requires an IDENTITY(1,1)-based primary key on the table so the ranges are contiguous.
bcp "SELECT * FROM MyTable WHERE Id BETWEEN 0 AND 10000000" queryout …
bcp "SELECT * FROM MyTable WHERE Id BETWEEN 10000000 AND 20000000" queryout …
2 - Simple and repeatable, uses a lot of disk: BCP out the entire table to a single file and use split to create as many new files as needed, each holding a given number of bytes (note: splitting by whole lines would be a better idea IMO; see the sketch after the example). Use 'Cygwin' (GnuWin32 is no longer maintained) to install split and any other utilities you want.
bcp MyDb.MySchema.MyTable out C:\MyFile.csv -T -w
split -b 10737418240 C:\MyFile.csv C:\MySplitFile_
This generates the following files (10 GB each):
C:\MySplitFile_aaa
C:\MySplitFile_aab
…
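If you would rather split on whole lines, so that no row is cut in half across two files, split can count lines instead of bytes. A sketch, assuming the file has already been converted to UTF-8 (see the final note); running a byte-oriented line split directly on the UTF-16LE output would slice its two-byte characters apart. The line count is illustrative:
# start a new file every 10,000,000 lines instead of every N bytes
split -l 10000000 C:\MyFile.csv C:\MySplitFile_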
3 - Complex but repeatable, requires possibly insecure T-SQL: Use the xp_cmdshell extended stored procedure (disabled by default, and a security risk when enabled) to call BCP from inside a stored procedure that iterates through the table.
DECLARE @loop AS INT;
--Use WHILE to loop as needed--
DECLARE @sql AS VARCHAR(MAX);
--Add code to generate dynamic SQL here--
DECLARE @bcp AS VARCHAR(MAX);
--@loop must be CAST: concatenating an INT into a VARCHAR raises a conversion error--
SELECT @bcp='BCP "'+@sql+'" queryout C:\MyFolder\MyFile_'+CAST(@loop AS VARCHAR(10))+'.csv';
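To make that concrete, here is a minimal runnable sketch of the loop, assuming the IDENTITY-keyed MyTable from option 1, an illustrative chunk size of 10,000,000 rows per file, and that xp_cmdshell has been enabled with sp_configure:
DECLARE @loop INT = 0;
DECLARE @chunk BIGINT = 10000000; -- rows per output file (illustrative)
DECLARE @maxId BIGINT = (SELECT MAX(Id) FROM MyDb.MySchema.MyTable);
DECLARE @sql VARCHAR(MAX);
DECLARE @bcp VARCHAR(8000); -- xp_cmdshell takes varchar(8000), not MAX
WHILE @loop * @chunk <= @maxId
BEGIN
    -- Export one contiguous, non-overlapping ID range per pass
    SET @sql = 'SELECT * FROM MyDb.MySchema.MyTable WHERE Id BETWEEN '
        + CAST(@loop * @chunk AS VARCHAR(20)) + ' AND '
        + CAST((@loop + 1) * @chunk - 1 AS VARCHAR(20));
    SET @bcp = 'BCP "' + @sql + '" queryout C:\MyFolder\MyFile_'
        + CAST(@loop AS VARCHAR(10)) + '.csv -T -w';
    EXEC master..xp_cmdshell @bcp;
    SET @loop += 1;
END;
Gaps in the IDENTITY values are harmless here; they just make some files smaller.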
FINAL NOTE: If you are using any NVARCHAR fields in your data then you need to use the -w flag and be aware that the output will be in UTF-16LE. I would strongly recommend converting that to UTF-8 using iconv (from 'Cygwin' again) before trying to do anything with it in Hadoop.
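A sketch from a Cygwin shell, where C:\ appears as /cygdrive/c (the _utf8 output name is just illustrative):
# re-encode BCP's -w output from UTF-16LE to UTF-8 for Hadoop
iconv -f UTF-16LE -t UTF-8 /cygdrive/c/MyFile.csv > /cygdrive/c/MyFile_utf8.csv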