In this post, I hope to show how to load in a CSV (comma-separated values) flat-file with an optional double quote ("
) text qualifier and a XML format file using BULK INSERT
and OPENROWSET
.
The sample data we’ll work with looks like:
First,Last,Age,Address "Tim","Brooks",50,"105 Bond Street, New York" "Bob","Jones",45,"81 Main Road, Florida"
You can see that only the Age
column is not qualified with double quotes. If you were to only use ,
as a terminator, the data in the Address
column would overflow. The XML format file allows us to use single quotes for the TERMINATOR
attribute. This means we can specify the double quotes as the terminator. SQL Server also allows for non-XML format files, which are plain text-files. A XML format file is more beneficial because it’s backward compatible, flexible, and extendable.
The XML format file structure contains two main components; RECORD
and ROW
.
RECORD
describes the data as it is stored in the data file.
EachRECORD
element contains a set of one or moreFIELD
elements. These elements correspond to fields in the data file.ROW
describes how to construct data rows from a data file when the data from the file is imported into a SQL Server table.
AROW
element contains a set ofCOLUMN
elements. These elements correspond to table columns.
For the FIELD
attributes, the terminator specifies how each column is terminated. Let’s look at the first row of data:
"Tim","Brooks",50,"105 Bond Street, New York"
Let’s highlight the terminators:
"Tim","Brooks",50,"105 Bond Street, New York" ^^^ ^^ ^^ ^^
Each row is typically terminated using a non-visible new-line feed (\n
).
Column | Terminator |
---|---|
First | “,” |
Last | “, |
Age | ,” |
Address | “\n |
This will give us the following format file:
<?xml version="1.0"?> <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <RECORD> <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' /> <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' /> <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='",' /> <FIELD ID="4" xsi:type="CharTerm" TERMINATOR='"\n' /> </RECORD> <ROW> <COLUMN SOURCE="1" NAME="First" xsi:type="SQLVARYCHAR" /> <COLUMN SOURCE="2" NAME="Last" xsi:type="SQLVARYCHAR" /> <COLUMN SOURCE="3" NAME="Age" xsi:type="SQLINT" /> <COLUMN SOURCE="4" NAME="Address" xsi:type="SQLVARYCHAR" /> </ROW> </BCPFORMAT> |
In the ROW
, I used the SQLVARYCHAR
data type for the strings, and SQLINT
type for the integer. You can find the full list of xsi:type
s for the COLUMN
attribute here. You can also use the NAME
attribute to identify the column in the result set.
Using OPENROWSET
This query will retrieve the data using OPENROWSET
:
SELECT a.* FROM OPENROWSET( BULK 'c:\test\values.txt', FORMATFILE = 'c:\test\values.xml', FIRSTROW = 2) AS a; |
Result
First Last Age Address ----- ------- ---- -------------------------- "Tim Brooks 50 105 Bond Street, New York "Bob Jones 45 81 Main Road, Florida
You can see that the First
column has a leading double quote. This is because we did not handle the double quote in the format file. It’s possible to add another FILED
attribute for a fixed length to handle this, but it won’t be covered in this post. Once this data is loaded into a staging table, you can use a REPLACE
function to replace the double quote:
UPDATE #stage SET First = REPLACE(First, '"', '') |
Using BULK INSERT
This query will insert the data using BULK INSERT
:
BULK INSERT #import FROM 'c:\test\values.txt' WITH ( FIRSTROW = 2, FORMATFILE = 'c:\test\values.xml' ) |
After selecting the imported data from #import
, we get the same result set as above. Again, the First
column has a leading double quote and can be handled in the format file or using the REPLACE
function.
References
What if the unquoted age column was the FIRST column in the data. How woul dthe xml file look then?
“SQLVARYCHAR”?
Correct. See the following doc: https://docs.microsoft.com/en-us/sql/relational-databases/import-export/xml-format-files-sql-server