0

Importing CSV with optional double quote text qualifier

In this post, I hope to show how to load in a CSV (comma-separated values) flat-file with an optional double quote (") text qualifier and a XML format file using BULK INSERT and OPENROWSET.

The sample data we’ll work with looks like:

First,Last,Age,Address
"Tim","Brooks",50,"105 Bond Street, New York"
"Bob","Jones",45,"81 Main Road, Florida"

You can see that only the Age column is not qualified with double quotes. If you were to only use , as a terminator, the data in the Address column would overflow. The XML format file allows us to use single quotes for the TERMINATOR attribute. This means we can specify the double quotes as the terminator. SQL Server also allows for non-XML format files, which are plain text-files. A XML format file is more beneficial because it’s backward compatible, flexible, and extendable.

The XML format file structure contains two main components; RECORD and ROW.

  • RECORD describes the data as it is stored in the data file.
    Each RECORD element contains a set of one or more FIELD elements. These elements correspond to fields in the data file.
  • ROW describes how to construct data rows from a data file when the data from the file is imported into a SQL Server table.
    A ROW element contains a set of COLUMN elements. These elements correspond to table columns.

For the FIELD attributes, the terminator specifies how each column is terminated. Let’s look at the first row of data:

"Tim","Brooks",50,"105 Bond Street, New York"

Let’s highlight the terminators:

"Tim","Brooks",50,"105 Bond Street, New York"
    ^^^      ^^  ^^                         ^^

Each row is typically terminated using a non-visible new-line feed (\n).

Column Terminator
First “,”
Last “,
Age ,”
Address “\n

This will give us the following format file:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
  <FIELD ID="1" xsi:type="CharTerm" TERMINATOR='","' />
  <FIELD ID="2" xsi:type="CharTerm" TERMINATOR='","' />
  <FIELD ID="3" xsi:type="CharTerm" TERMINATOR='",' />
  <FIELD ID="4" xsi:type="CharTerm" TERMINATOR='"\n' />
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="First" xsi:type="SQLVARYCHAR" />
  <COLUMN SOURCE="2" NAME="Last" xsi:type="SQLVARYCHAR" />
  <COLUMN SOURCE="3" NAME="Age" xsi:type="SQLINT" />
  <COLUMN SOURCE="4" NAME="Address" xsi:type="SQLVARYCHAR" />
 </ROW>
</BCPFORMAT>

In the ROW, I used the SQLVARYCHAR data type for the strings, and SQLINT type for the integer. You can find the full list of xsi:types for the COLUMN attribute here. You can also use the NAME attribute to identify the column in the result set.

Using OPENROWSET

This query will retrieve the data using OPENROWSET:

SELECT a.* FROM OPENROWSET( BULK 'c:\test\values.txt', 
   FORMATFILE = 'c:\test\values.xml',
   FIRSTROW = 2) AS a;

Result

First  Last    Age  Address
-----  ------- ---- --------------------------
"Tim   Brooks  50   105 Bond Street, New York
"Bob   Jones   45   81 Main Road, Florida

You can see that the First column has a leading double quote. This is because we did not handle the double quote in the format file. It’s possible to add another FILED attribute for a fixed length to handle this, but it won’t be covered in this post. Once this data is loaded into a staging table, you can use a REPLACE function to replace the double quote:

UPDATE #stage SET First = REPLACE(First, '"', '')

Documentation

Using BULK INSERT

This query will insert the data using BULK INSERT:

BULK INSERT #import
   FROM 'c:\test\values.txt'
   WITH 
      (
        FIRSTROW = 2,
        FORMATFILE = 'c:\test\values.xml'
      )

After selecting the imported data from #import, we get the same result set as above. Again, the First column has a leading double quote and can be handled in the format file or using the REPLACE function.

Documentation

References

XML Format Files

Norbert Krupa

Technical Consultant

Leave a Reply

Your email address will not be published. Required fields are marked *