PostgreSQL – search and replace within a character string

Here we explore the operators and functions PostgreSQL offers us to perform search and replace operations on a character string.

By way of an example, let’s say we have a table containing contact email addresses:-

    devg=> select email_address from contacts;
           email_address
    ----------------------------
     joe.bloggs@olddomain.com
     john.doe@otherdomain.org
     DAVID.JONES@OLDDOMAIN.COM
     jane.smith@otherdomain.net
    (4 rows)

We need to move any addresses on olddomain.com to mynewdomain.com. Now finding those addresses is pretty easy. For a case like this, old, reliable SQL LIKE will do the job nicely if we use the lower() function on the data first so we can do a caseless search:-

   devg=> select email_address from contacts
    devg-> where lower(email_address) like '%@olddomain.com';
           email_address
    ---------------------------
     joe.bloggs@olddomain.com
     DAVID.JONES@OLDDOMAIN.COM
    (4 rows)

LIKE may be an old, reliable way of doing it, but it is pretty limited. It only offers two wildcards – % for any number of characters and _ for a single one – so it’s useful to have something more powerful in our armoury. The SIMILAR TO operator works in a similar way to LIKE but allows more sophisticated SQL regular expressions to be used:-

    devg=> select email_address from contacts where
    devg-> lower(email_address) similar to E'%\@olddomain.com';
           email_address 
    ---------------------------
     joe.bloggs@olddomain.com
     DAVID.JONES@OLDDOMAIN.COM
    (2 rows)

SQL regular expressions aren’t quite the same as the regular expressions you’d encounter in other programming languages; note here that we’re still using % as a wildcard and we don’t need to escape the ‘.’ since it’s not being treat as a single character wildcard.

PostgreSQL does support the more familiar POSIX regular expressions too, via the ~ (tilde) operator. Here we’re using the caseless ~* version to remove the need for the lower() function:-

    devg=> select email_address from contacts where
    devg-> email_address ~* E'.*@olddomain\.com';
           email_address
    ---------------------------
     joe.bloggs@olddomain.com
     DAVID.JONES@OLDDOMAIN.COM
    (2 rows)

But what about updating the data? The SQL standard overlay() function isn’t especially helpful for this requirement as it wants character positions for the replacement:-

update contacts
   set email_address = overlay(email_address 
     placing 'mynewdomain.com'
     from (char_length(email_address) - char_length('olddomain.com')) + 1
     for char_length('mynewdomain.com'))
where email_address ~* E'.*@olddomain\.com';

It works, but boy is it ugly!

    devg=> select email_address from contacts;
           email_address
    -----------------------------
     john.doe@otherdomain.org
     jane.smith@otherdomain.net
     joe.bloggs@mynewdomain.com
     DAVID.JONES@mynewdomain.com
(4 rows)

To use it, we’re having to work out the starting point for the replacement by subtracting the length of olddomain.com from the length of an individual email address and also give it the length of the replacement string too.

PostgreSQL has its own replace() function, which is a bit more suited to our needs.

update contacts
   set email_address = replace(email_address,
         '@olddomain.com', '@mynewdomain.com');
where email_address ~* E'.*@olddomain\.com';

We can drop the where clause here since the replace will only change records with olddomain.com in the email address, but this highlights a problem. Run the above SQL and it will show two rows updated, but since the record with an upper case address in it does not contain olddomain.com (it contains OLDDOMAIN.COM) it doesn’t actually get changed:-

    devg=> select email_address from contacts;
           email_address
    ----------------------------
     john.doe@otherdomain.org
     jane.smith@otherdomain.net
     joe.bloggs@mynewdomain.com
     DAVID.JONES@OLDDOMAIN.COM
    (4 rows)

We do have a couple of cheesy fixes available. First of all, running the SQL twice – once for uppercase and once for lowercase – would do the job. Second, we could nest the replace functions:-

update contacts
   set email_address = replace(
          replace(email_address, '@OLDDOMAIN.COM', '@mynewdomain.com'),  
          '@olddomain.com', '@mynewdomain.com');

This does the job too, but it would miss any cases where our match text had mixed case rather than all upper or all lower. Fortunately PostgreSQL offers a slightly more powerful version of replace, regexp_replace() which can come to our rescue here:-

update contacts
  set email_address = regexp_replace(email_address,
       E'@olddomain\.com', '@mynewdomain.com', 'i')
where email_address ~* E'.*@olddomain\.com';

Note that the regular expression for the text to replace here is not a match for the whole string, simply the bit of the string we’re going to replace. Whilst it doesn’t matter in this example, retaining the where clause might be useful to ensure we’re only affecting the intended records. We’re also using the optional ‘i’ flag to ensure the text to replace is located using a case-insensitive search:-

    devg=> select email_address from contacts;                      
           email_address
    -----------------------------
     john.doe@otherdomain.org
     jane.smith@otherdomain.net
     joe.bloggs@mynewdomain.com
     DAVID.JONES@mynewdomain.com
    (4 rows)

Leave a Reply

Your email address will not be published. Required fields are marked *