Bulk load between technologies

Data migration is a very common task in (big) data engineering. In the big data landscape, you would probably look for sqoop to handle such a task. But in the small to mid-size ecosystem, this task is actually not an obvious one. In this post, I will introduce a new bulk loading tool – Embulk – which can handle data loading between different technologies.

For a data migration or ETL project, you have multiple choices:

  • a generic-purpose ETL tool (e.g. Informatica, Talend) which needs a mimium of set-up and a bit of job design (reading-bucket size, insert batch size, temporary files, mapping, creation table options…)
  • hand-coded all-in-one script (e.g. in Python): basically you would have to do it from scratch
  • hand-coded set of scripts bulk dump from machine X database x with a dedicated tool (e.g. mysqldump), copy of the dump file x to the machine Y, bulk load into database y with another dedicated tool (e.g. bcp1)

Now, no matter which solution amongst these you consider, if your source or target database technology change you have to do it all over again. That’s annoying to say the least.

This is where Embulk comes in handy. This data transfer utility can bulk load between different database technologies and/or flat files using a very simple (yaml) configuration file. The basic idea is an input source, and an output target. This tool also supports filtering and some transformations (e.g. join two files on a key).

My particular use-case is data migration from a MySQL database to a Microsoft SQL Server database (running in Azure):

  max_threads: 1
  min_output_tasks: 1
  type: mysql
  host: xx.xx.xx.xx
  user: XXXX
  password: xxxxx
  database: DWH
  query: |
    from dim_products

  type: sqlserver
  driver_path: /tmp/sqljdbc_6.0/enu/sqljdbc42.jar
  host: xxxxx.database.windows.net
  user: XXXX
  password: xxx
  database: dwh
  table: legacy.dim_product
  mode: truncate_insert
  insert_method: native
    dwh_product_id: {type: 'INT'}
    product_name: {type: 'nvarchar(250)'}
    product_category: {type: 'nvarchar(100)'}
    source_system_name: {type: 'nvarchar(100)'}
    source_system_key: {type: 'INT'}

This turned out to be an interesting use-case ! The MySQL part is okay, nothing particular here. But the sqlserver part is all but a pleasant journey ! Why is that. Well for one you need two drivers: the standard JDBC driver that will be used by Embulk to proceed with the create table statements. Now, as you can see we use the ¨native¨ mode. The “normal” mode would actually produce a sequences of insert into statements (and therefore that would not be bulk loading). So we use the “native” mode that requires the “SQL Server Native Client (11.0)”. Tricky thing because the current version is the ODBC driver (currently in preview version 13.0). But the libraries calls are actually hard-coded in the output plugin meaning that we will need some soft links:

sudo ln -s /usr/lib64/libodbc.so /usr/lib/libodbc32.so

sudo ln -s /opt/microsoft/msodbcsql/lib64/libmsodbcsql-13.0.so.0.0 /opt/microsoft/msodbcsql/lib64/libsqlncli11.so

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/microsoft/msodbcsql/lib64/

Even the ODBC driver call is hard-coded in the plugin so we also needs to map what is called with what we have. You need to add the corresponding entry in /etc/odbcinst.ini

This is what the new entry should look like:

[SQL Server Native Client 11.0]
Description=Microsoft ODBC Driver 13 for SQL Server

Now we can call (finally) Embulk with our configuration file !

A few pitfalls though:

  • Embulk guess feature is not great with database, we have a very strict data schema here. Asking MySQL what the data is like should be enough to tell SQL Server which proper data types to use in the create statement. Unfortunately every integer fields were created as BIGINT and every non-numerical fields as TEXT. Not great, this is why a proper definition has been set in the configuration file.

  • Embulk is a parallel bulk loader. It uses multiple input/output tasks to dump and load data. To do so it creates n temporary data set that is file if the output is a file type. Even when dealing with relational database: first it bulk loads to temporary tables and uses multiple insert into table select from table_temporary ; statements to complete the integration. This is I believe a huge issue because it’s a bit counterproductive with the bulk load operation.

Basically Embulk has the potential to become a great tool but lack some implementation optimizations. For instance, with a single thread and a single output task, there is no need to bulk load into a temporary table, it could directly load to the target. Also I wonder if the ODBC driver part is really compulsory. I am confident that a bulk copy with the JDBC driver would be at least as efficient.

  1. “bulk copy program”: a bulk load command line utility developed by Microsoft for MS SQL Sever. Also exists as a FreeDTS flavor freebcp for Unix system.