A SQL MERGE statement performs actions based on a RIGHT JOIN

RIGHT JOIN is an esoteric feature in the SQL language, and hardly ever seen in the real world, because almost every RIGHT JOIN can just be expressed as an equivalent LEFT JOIN. The following two statements are equivalent:

-- Popular
SELECT c.first_name, c.last_name, p.amount
FROM customer AS c
LEFT JOIN payment AS p ON c.customer_id = p.customer_id

-- Esoteric
SELECT c.first_name, c.last_name, p.amount
FROM payment AS p
RIGHT JOIN customer AS c ON c.customer_id = p.customer_id

It’s not unreasonable to expect these two statements to produce the same execution plan on most RDBMS, given that they’re logically equivalent. Since we’ve grown used to reading things from left to right and top to bottom, I don’t think RIGHT JOIN will become more popular any time soon.

There is, however, one place in the SQL language where RIGHT JOIN is surprisingly ubiquitous!

The MERGE statement

Why is it a surprise? Because that place doesn’t use the same syntax for joining two tables. But that’s exactly what happens in the MERGE statement. Let’s look at the following MERGE statement that takes:

A staging table (SOURCE table) to load data from
An ordinary table (TARGET table) to store data into

With a schema like this:

CREATE TABLE book_to_book_store (
  book_id BIGINT NOT NULL REFERENCES book,
  name TEXT NOT NULL REFERENCES book_store,
  stock INT NOT NULL,

  PRIMARY KEY (book_id, name)
);

CREATE TABLE book_to_book_store_staging AS 
SELECT * FROM book_to_book_store
WITH NO DATA;

A query that could be typical of an ETL job:

-- The target table
MERGE INTO book_to_book_store AS t

-- The source table
USING book_to_book_store_staging AS s

-- The RIGHT JOIN predicate
ON t.book_id = s.book_id AND t.name = s.name

-- The actions for each row, based on RIGHT JOIN matching
WHEN MATCHED THEN UPDATE SET stock = s.stock
WHEN NOT MATCHED THEN INSERT (book_id, name, stock) 
VALUES (s.book_id, s.name, s.stock);

This is simply taking all the rows from the BOOK_TO_BOOK_STORE_STAGING table, and merges them into BOOK_TO_BOOK_STORE:

If the row already exists (there’s a MATCH), then the STOCK is updated
If the row doesn’t already exist (there’s no MATCH), then the row is inserted

But we don’t use this source -> target syntactic order, we first specify the target table BOOK_TO_BOOK_STORE, and then we RIGHT JOIN the BOOK_TO_BOOK_STORE_STAGING table to it. Think about it this way:

SELECT *
FROM book_to_book_store AS t
RIGHT JOIN book_to_book_store_staging AS s
ON t.book_id = s.book_id AND t.name = s.name

And, if we think of a RIGHT JOIN not as a Venn diagram, but as a cartesian product as follows, then it can be seen easily what is done per MATCH or non-MATCH:

|t.name      |t.book_id|t.stock|s.name      |s.book_id|s.stock|
|------------|---------|-------|------------|---------|-------|
|            |         |       |Faraway Land|1        |9      | <-- NOT MATCHED
|Faraway Land|2        |10     |Faraway Land|2        |12     | <--     MATCHED
|Faraway Land|3        |10     |Faraway Land|3        |5      | <--     MATCHED
|            |         |       |Paper Trail |1        |1      | <-- NOT MATCHED
|Paper Trail |3        |2      |Paper Trail |3        |0      | <--     MATCHED

As always with a RIGHT JOIN, every row from right side of the join is matched with a matching row from the left side of the join, or an empty row of NULL values, if there’s no such match. After this MERGE, we want the resulting data to be updated as follows:

|t.name      |t.book_id|t.stock|s.name      |s.book_id|s.stock|
|------------|---------|-------|------------|---------|-------|
|Faraway Land|1        |9      |Faraway Land|1        |9      | <-- NOT MATCHED
|Faraway Land|2        |12     |Faraway Land|2        |12     | <--     MATCHED
|Faraway Land|3        |5      |Faraway Land|3        |5      | <--     MATCHED
|Faraway Land|1        |1      |Paper Trail |1        |1      | <-- NOT MATCHED
|Paper Trail |3        |0      |Paper Trail |3        |0      | <--     MATCHED

This is how the MERGE statement works.

Note, I said before that the JOIN is producing a cartesian product. Unlike with SELECT statements, however, there’s a limitation to MERGE where the cartesian product must not produce any duplicate matches per TARGET row, as the order of actions wouldn’t be defined if there were multiple SOURCE rows per TARGET row.

Deleting rows

MERGE is more powerful than just performing INSERT and UPDATE. It can also DELETE rows. Let’s assume that we want a staging table’s STOCK = 0 to mean that the row should be deleted, instead of the STOCK being set to 0. Then we can write:

MERGE INTO book_to_book_store AS t
USING book_to_book_store_staging AS s
ON t.book_id = s.book_id AND t.name = s.name
WHEN MATCHED AND s.stock = 0 THEN DELETE
WHEN MATCHED THEN UPDATE SET stock = s.stock
WHEN NOT MATCHED THEN INSERT (book_id, name, stock) 
VALUES (s.book_id, s.name, s.stock);

Now, with the above staging data, we’ll remove the last row instead of updating it:

|t.name      |t.book_id|t.stock|s.name      |s.book_id|s.stock|
|------------|---------|-------|------------|---------|-------|
|Faraway Land|1        |9      |Faraway Land|1        |9      | <-- NOT MATCHED : INSERT
|Faraway Land|2        |10     |Faraway Land|2        |12     | <--     MATCHED : UPDATE
|Faraway Land|3        |10     |Faraway Land|3        |5      | <--     MATCHED : UPDATE
|Paper Trail |1        |1      |Paper Trail |1        |1      | <-- NOT MATCHED : INSERT
|            |         |       |Paper Trail |3        |0      | <--     MATCHED : DELETE

The RIGHT JOIN semantics is still the same, just the action is different now, depending on the additional AND clause of the WHEN MATCHED clause.

Matching by source

Some RDBMS support an even more powerful vendor specific variant of MERGE, which should be added to the IEC/ISO 9075 standard, in my opinion. The BY TARGET / BY SOURCE clause. Let’s have a look at the following statement:

MERGE INTO book_to_book_store AS t
USING book_to_book_store_staging AS s
ON t.book_id = s.book_id AND t.name = s.name
WHEN MATCHED THEN UPDATE SET stock = s.stock
WHEN NOT MATCHED BY TARGET THEN INSERT (book_id, name, stock) 
VALUES (s.book_id, s.name, s.stock)
WHEN NOT MATCHED BY SOURCE THEN DELETE;

Adding a WHEN NOT MATCHED BY SOURCE clause has the simple effect of turning the RIGHT JOIN operation into a FULL JOIN operation. Think of it this way:

SELECT *
FROM book_to_book_store AS t
FULL JOIN book_to_book_store_staging AS s
ON t.book_id = s.book_id AND t.name = s.name

Now, the result might look something like this:

|t.name      |t.book_id|t.stock|s.name      |s.book_id|s.stock|
|------------|---------|-------|------------|---------|-------|
|            |         |       |Faraway Land|1        |9      | <-- NOT MATCHED BY TARGET
|Faraway Land|2        |10     |Faraway Land|2        |12     | <--     MATCHED
|Faraway Land|3        |10     |Faraway Land|3        |5      | <--     MATCHED
|            |         |       |Paper Trail |1        |1      | <-- NOT MATCHED BY TARGET
|Paper Trail |3        |2      |            |         |       | <-- NOT MATCHED BY SOURCE

The terms NOT MATCHED BY TARGET and NOT MATCHED BY SOURCE are quite self-explanatory when visualised as above, and probably less confusing to beginners than LEFT JOIN and RIGHT JOIN. I wouldn’t mind SQL syntax to be enhanced in a way that it would be possible to identify whether a NULL value originating from an OUTER JOIN is due to:

The source data containing the NULL value
The row being NOT MATCHED by the “other side” of the OUTER JOIN

Imagine a hypothetical syntax like this:

SELECT c.first_name, c.last_name, p.amount
FROM customer AS c
LEFT JOIN payment AS p ON c.customer_id = p.customer_id
WHERE p IS NOT MATCHED BY JOIN -- Effectively an ANTI JOIN

Anyway…

When deleting rows, this approach is much more convenient than having to rely on an interpretation of the semantics of data, such as STOCK = 0 meaning a deletion. We now have absent rows in the SOURCE table (the staging) table, which simply mean the row must be deleted, if that’s how we want to model things. So, after running the above MERGE statement, we’ll get this outcome again:

|t.name      |t.book_id|t.stock|s.name      |s.book_id|s.stock|
|------------|---------|-------|------------|---------|-------|
|Faraway Land|1        |9      |Faraway Land|1        |9      | <-- NOT MATCHED BY TARGET : INSERT
|Faraway Land|2        |12     |Faraway Land|2        |12     | <--     MATCHED           : UPDATE
|Faraway Land|3        |5      |Faraway Land|3        |5      | <--     MATCHED           : UPDATE
|Faraway Land|1        |1      |Paper Trail |1        |1      | <-- NOT MATCHED BY TARGET : INSERT
|            |         |       |            |         |       | <-- NOT MATCHED BY SOURCE : DELETE

At least the following RDBMS support the BY SOURCE and BY TARGET clauses:

Databricks
Firebird 5
PostgreSQL 17
SQL Server

Given how useful this is, I’ll expect more RDBMS to adopt this T-SQL syntax, soon. jOOQ 3.20 has added support for it, and a future version of jOOQ may emulate it by moving the FULL JOIN into the USING clause.

Please log in to like, share and comment!