String aggregation is the process of concatenating strings (usualy as a list with a separator) based on a common key. Databases implement their own function to do so such as wm_concat or LISTAGG in Oracle, group_concat in MySQL, array_agg or string_agg with PostGres. Some databases doesn’t support it at all, such as MS SQL Server (before the 2017 version) and require workaround. The implementation on the ETL tools side also depends on the vendor.
String aggregation is a simple operation to do in any high level programming language. A possible implementation in Python 3 can be:
Which gives as expected:
The same can be obtained with SQL for databases that support string aggregation. For example with PostGres 9.x:
Or, for Oracle:
MS SQL Server prior to version 2017 requires a workaround, one of them is the well-known XML method:
You can try all of these variations yourself with sqlfiddle.
Unfortunately, the same is true for ETL tools. Informatica (the market leader) has no built-in function for string aggregation and requires a workaround too.
With Talend, there are (at least) two ways to do it. Depending on your background, you may find one of them more intuitive. In Talend, you can aggregate strings using the tAggregateRow transformation or the tDenormalize transformation.
Giving the following input (randomly generated),
Both the aggregation on list method
or the denormalization
will produce the same outputs:
When you need to perform a string aggregation, you have multiple choices. You can do it in the database if your database supports it or using an ETL tool for example. Talend offers ready to use built-in functions which is not the case of all vendors.