Do not try to parse ISO-8601 with a regex

In 2017, I opened a pull request against Google’s ExoPlayer to properly support datetime strings in HLS manifests configured with a European locale. In short, it added support for the comma (,) as a decimal separator, in addition to the dot (.) that most people are used to.

It was a very simple change that was quickly accepted. A few months later I submitted a similar patch to Django. That one was ultimately accepted as well.

These were all just trivial regex changes to fix narrow use-cases.
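To give a sense of how small such a change tends to be, here is a minimal sketch. The two patterns are hypothetical and only cover a bare time-of-day value; they are not the actual ExoPlayer or Django code.

```python
import re

# Hypothetical patterns for illustration only; not the real ExoPlayer or Django code.
DOT_ONLY = re.compile(r"^\d{2}:\d{2}:\d{2}(?:\.\d+)?$")        # fraction after a dot only
DOT_OR_COMMA = re.compile(r"^\d{2}:\d{2}:\d{2}(?:[.,]\d+)?$")  # fraction after a dot or a comma

for value in ("10:20:30.400", "10:20:30,400"):
    before = bool(DOT_ONLY.match(value))
    after = bool(DOT_OR_COMMA.match(value))
    print(f"{value}: dot-only={before}, dot-or-comma={after}")
```

The whole "fix" is a one-character-class change, which is exactly why each of these patches looks harmless in isolation.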

Now I’ve just submitted another one to Django to fix yet another regex ISO-8601 parser bug.

I’m starting to see a pattern here.

Do not try to parse ISO-8601 datetime strings with a regex

Consider these two valid ISO-8601 datetime strings:

2012-04-23T10:20:30.400-0200
2012-04-23T10:20:30.400 -0200


>>> from django.utils.dateparse import parse_datetime
>>> parse_datetime("2012-04-23T10:20:30.400-0200")
datetime.datetime(2012, 4, 23, 10, 20, 30, 400000, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=79200), '-0200'))
>>> parse_datetime("2012-04-23T10:20:30.400 -0200")


>>> from dateutil.parser import parse as parse_datetime
>>> parse_datetime("2012-04-23T10:20:30.400-0200")
datetime.datetime(2012, 4, 23, 10, 20, 30, 400000, tzinfo=tzoffset(None, -7200))
>>> parse_datetime("2012-04-23T10:20:30.400 -0200")
datetime.datetime(2012, 4, 23, 10, 20, 30, 400000, tzinfo=tzoffset(None, -7200))

Notice that Django returns None for the second string (the REPL prints nothing), while dateutil parses both to the same value. Why?

This is why:

datetime_re = _lazy_re_compile(
    r'[T ](?P<hour>\d{1,2}):(?P<minute>\d{1,2})'

Look at this monstrous regex. Do you think this thing can possibly capture all 33 pages of the ISO-8601 datetime format specification? Up until a couple of years ago it couldn’t even handle the difference between en and fr locales, whose ISO-8601 datetime strings use different decimal separators (dot versus comma).
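The failure mode is easy to reproduce with the standard library alone. The patterns below are simplified and hypothetical (they are not Django's actual regex); they only assume, as in the session above, that the UTC offset is expected directly after the time.

```python
import re

# Hypothetical, simplified patterns for illustration; NOT Django's actual regex.
BASE = (
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"[T ](?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})"
    r"(?:[.,](?P<fraction>\d{1,6}))?"
)
OFFSET = r"(?P<tzinfo>Z|[+-]\d{2}:?\d{2})?$"

STRICT_RE = re.compile(BASE + OFFSET)            # offset must follow the time immediately
PATCHED_RE = re.compile(BASE + r"\s*" + OFFSET)  # one more narrow patch: optional whitespace

compact = "2012-04-23T10:20:30.400-0200"
spaced = "2012-04-23T10:20:30.400 -0200"

for name, pattern in (("strict", STRICT_RE), ("patched", PATCHED_RE)):
    for value in (compact, spaced):
        print(name, repr(value), bool(pattern.match(value)))
```

The strict pattern rejects the spaced string outright, and the patched one accepts it, but only until the next variant of the format turns up.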

What to do?

Don’t try to parse ISO-8601 datetime strings with a regex!

As far as I know, there is no regex that can parse the full specification accurately.

Use a library that doesn’t use (brittle) regexes to parse the strings.

If you’re using Python, use python-dateutil.