String aggregation with databases and ETL tools


String aggregation is the process of concatenating strings (usualy as a list with a separator) based on a common key. Databases implement their own function to do so such as wm_concat or LISTAGG in Oracle, group_concat in MySQL, array_agg or string_agg with PostGres. Some databases doesn’t support it at all, such as MS SQL Server (before the 2017 version) and require workaround. The implementation on the ETL tools side also depends on the vendor.

String aggregation is a simple operation to do in any high level programming language. A possible implementation in Python 3 can be:

 
records = [
(1, 'Matt'),
(1, 'Rocks'),
(2, 'Stylus'),
(3, 'Foo'),
(3, 'Bar'),
(3, 'Baz')]

agg_data = dict()

for (k, v) in records:
if k in agg_data:
agg_data[k].append(v)
else:
agg_data[k] = list()
agg_data[k].append(v)

for k, v in agg_data.items():
print(k,",".join(v))

Which gives as expected:

 
1 Matt,Rocks
2 Stylus
3 Foo,Bar,Baz

The same can be obtained with SQL for databases that support string aggregation. For example with PostGres 9.x:

 
CREATE TEMPORARY TABLE records
(
k int,
v varchar(100)
) ;

INSERT INTO records(k, v)
values
(1, 'Matt'),
(1, 'Rocks'),
(2, 'Stylus'),
(3, 'Foo'),
(3, 'Bar'),
(3, 'Baz') ;

select k, string_agg(v, ', ') from records group by k ;

Or, for Oracle:

 
INSERT INTO records(k, v) values (1, 'Matt') ;
INSERT INTO records(k, v) values (1, 'Rocks') ;
INSERT INTO records(k, v) values (2, 'Stylus') ;
INSERT INTO records(k, v) values (3, 'Foo') ;
INSERT INTO records(k, v) values (3, 'Bar') ;
INSERT INTO records(k, v) values (3, 'Baz') ;

SELECT
k,
LISTAGG(v, ',') WITHIN GROUP (ORDER BY v)
from records
group by k ;

MS SQL Server prior to version 2017 requires a workaround, one of them is the well-known XML method:

 
select k
,Names = stuff((select ', ' + v as [text()]
from records xt
where xt.k = t.k
for xml path('')), 1, 1, '')
from records t
group by k ;

You can try all of these variations yourself with sqlfiddle.

Unfortunately, the same is true for ETL tools. Informatica (the market leader) has no built-in function for string aggregation and requires a workaround too.

With Talend, there are (at least) two ways to do it. Depending on your background, you may find one of them more intuitive. In Talend, you can aggregate strings using the tAggregateRow transformation or the tDenormalize transformation.

Example of a string aggregation with Talend

Giving the following input (randomly generated),

 
.---+----------.
|LogGeneratedData|
|=--+---------=|
|key|value |
|=--+---------=|
|2 |Woodrow |
|3 |Herbert |
|3 |Lyndon |
|4 |John |
|4 |Rutherford|
|4 |William |
|8 |Ronald |
|9 |Calvin |
|9 |Grover |
|9 |Millard |
'---+----------'

Both the aggregation on list method

Example of a string aggregation with Talend and tAggregateRow

 
.---+-----------------------.
| LogAggregatedData |
|=--+----------------------=|
|key|value |
|=--+----------------------=|
|2 |Woodrow |
|3 |Herbert,Lyndon |
|4 |John,Rutherford,William|
|8 |Ronald |
|9 |Calvin,Grover,Millard |
'---+-----------------------'

or the denormalization

Example of a string aggregation with Talend and tAggregateRow

will produce the same outputs:

 
.---+-----------------------.
| LogDenormalizedData |
|=--+----------------------=|
|key|value |
|=--+----------------------=|
|2 |Woodrow |
|3 |Herbert;Lyndon |
|4 |John;Rutherford;William|
|8 |Ronald |
|9 |Calvin;Grover;Millard |
'---+-----------------------'

When you need to perform a string aggregation, you have multiple choices. You can do it in the database if your database supports it or using an ETL tool for example. Talend offers ready to use built-in functions which is not the case of all vendors.