Ultimate Guide to Cleaning Dates and Time

Date and time are frequently used data types. So you would think they would be straightforward to work with, right? Right?

The ISO (International Organization for Standards) even has a module dedicated to how date and time should be written.

Image for post
Image for post

But like all data, date and time frequently appear in the wild in all different kinds of formats.

In this guide, we will walk through the most common date and time formats. There will also be detailed steps on how to best handle them in your Data Analytics project.

SHAMELESS PLUG: If you need to convert any date/time, feel free to check our tool www.cleanspreadsheets.com that lets you automatically clean any date format in spreadsheets.

Assumptions

  1. You follow the standard Gregorian Calendar
  2. The numbers are Arabic numerals, e.g., 1, 3, 23
  3. Your language is English

Why should I care about this?

There are a few problems you can run into if all your dates don’t conform

  1. Waste of resources — let’s say you want to send an email to all customers that purchased your product after a certain date.

If the purchase date is recorded as May 4th, 2012 for some customers and 4/27/2015 for others, then you can mistakenly end up with a wrong list. And waste both money and time emailing the wrong people. Or leave money on the table.

2. Wrong analysis — bad data can quickly screw up your analytics project. This could be something as simple as sorting a list of Customers based on dates they purchased.

Or something as severe as coming up with the wrong revenue for your quarterly report to investors because you couldn’t filter properly on customer sign up date.

How can these issues be fixed?

Standardize your date formats. We recommend one of the ISO standard if you are working across timezones

  • YYYY-MM-DDTHH:MM:SSZ (2018–01–24T13:02:45Z)
  • YYYY-MM-DDTHH:MM:SS+/-TH:TM (2018–01–24T13:02:45-05:00 or 2018–01–24T17:09:23+05:30)

The Z at the end of the first format signifies the time is in the UTC timezone. UTC is the time at Greenwich.

The +/-TH:TM at the end of the second format signifies the difference from UTC timezone. For example, New York is -04:00, Mumbai is +05:30.

If you are just working within 1 timezone, we recommend the following format. It’s not strictly ISO standard (since 2019) but still prevents any dirty data issues. Plus it looks a lot cleaner and less confusing without the T.

  • YYYY-MM-DD HH:MM:SS (2019–05–14 13:02:45)

And without the time format if you are just working with dates

  • YYYY-MM-DD (2019–05–14)

Let’s break down this format just a little bit more.

  • YYYY — year written in the full 4 year format like 2005. ISO technically allows all values from 0000 (1BC) to 9999, but years before 1583, and after/before 1BC/9999AD, are not automatically recognized.
  • MM — the month padded with a 0. So January is 01. 01–12 is the range.
  • DD — the day padded with a 0. So single digit dates are prefixed with a 0. Like 02, 05. 01–31 is the range depending on the month.
  • HH — the hour of time in 24-hour format, padded with a 0. For e.g., 04 for 4AM, 14 for 2PM etc. 00–24 is the range.
  • MM — the minute of time padded with a 0. For e.g., 06, 25, 59. 00–60 is the range.
  • SS — the second of time padded with a 0. Same example as above. 00–60 is the range. If you need more precision, you can add the appropriate decimals. For ex: 11:51 PM 32 Seconds and 30 Milliseconds will be 23:41:32.5

You can use different separator for the date and time if you like but for both visual and consistency reasons, we recommend using the hyphen (-) for date separation and the colon (:) for time separation.

Let’s see how to format inconsistent dates into this standard.

Date Formats

The world writes dates differently. There are three main arrangements of day, month and year to be aware of.

  1. Day, Month, Year
  2. Year, Month, Day
  3. Month, Day, Year

Separators

Various separators can be used to break up the day, month and year. No matter what format they are in.

This could be a punctuation mark

  • 3–26–1991
  • 2015/4/17
  • 14.2.78

Or a space

  • 08 14 88

Or a combination of punctuation marks and space

  • January 8, 2019

You can follow these steps to get the different components. This will work no matter how the date is formatted: Day, Month, Year or Year, Month, Day etc. BUT you will need to know the format of the date to eventually clean up each component.

We will use these 2 dates as an example.

Image for post
Image for post
Original Dates. Date1 (top) Date2 (bottom)

And yes there are extra spaces around the comma. Welcome to the real world.

  1. Replace any of the separator punctuation marks: [-][.][/] with a space.
Image for post
Image for post
Comma in Date1 [,] and slash in Date2 [/] is replaced with spaces

2. Split the date using space as the separator

Image for post
Image for post
Date2 will separate easily

3. Optional After splitting the date, depending on what tool you are using, you may end up with an extra space value. Delete that extra value

Image for post
Image for post
Delete the extra value to get the split

Now let’s dissect the month, day and year. And transform them so the final dates are in the YYYY-MM-DD format and look like this.

1998–01–05

2007–08–14

You need to know what the order of the components are before you begin transforming. For e.g., if it’s Day, Month, Year or Year, Month, Day etc.

Day

To transform the day component

  • Remove anything attached to the number
  • Validate that it is a 2 digit value

Remove any non-numerical characters like nd from 2nd or th from 30th

Add a 0 in front of single digit values like 7 -> 07

Image for post
Image for post
Date1 split into components
Image for post
Image for post
Date2 split into components
Image for post
Image for post
To clean Day, we remove the th and add a 0 at the beginning

Month

There are a lot of different ways to write months which make things hard.

  • The month could be written out fully. Such as January, February, March. This will be different in languages other than English.
  • The month could be written in an abbreviated form. Such as Jan, Feb, Mar.
  • The month could be written numerically. Either as a single or double digit.

If the month is a number, add a 0 to any single digit values like 2 -> 02.

If the month is written out, create a map of the word(s) with the month number, then Find & Replace. For example:

jan, january -> 01

feb, february -> 02

Image for post
Image for post
Result from cleaning Day component
Image for post
Image for post
To clean Month, we replace January by mapping to 01, and add a 0 in front of the 8

Year

The year is usually a 2 digit number or a 4 digit number. Ex: 1999 or 99. However it could also have punctuation like ‘99.

For a 2 digit number, you will have to rely on your business rules. If, for example, 99 is 1999 or 1899.

Remove any non-numerical characters like the apostrophe (‘) from ‘99

Convert any 2 digit year to 4 digit year depending on your data’s rules

Image for post
Image for post
Result from cleaning Month
Image for post
Image for post
Remove punctuation from the year and knowing that the dates are > 1900 we convert the year into a 4 digit number

Now we can rearrange the cleaned components to create the YYYY-MM-DD dates by joining them with the hyphen [-] separator.

1998–01–05

2007–08–14

Time Formats

Now let’s discuss time formats. The two main formats are the 12 hour clock with an AM/PM indicator or the 24 hour clock. Seconds can also be included but are optional. So can milliseconds and further divisions.

  • Hour:Minute:Seconds AM/PM (12 hour)
  • Hour:Minute:Seconds (24 hour)

Like date, time components are also separated from each other. The most common separators are

  • The colon [:], 6:45 PM
  • Full words, 18 hours 23 minutes 52 seconds
  • Abbreviations, 13h21m
  • None, 2137

We will now see how to clean up the time to be standardized in the ISO standard HH:MM:SS. We will use the 6 following times as examples.

Image for post
Image for post
Examples to start

Unlike dates, times mostly follow the hour, minute, second order so you don’t need to be aware of the format.

  1. Remove all possible separators: [:][hours][minutes][seconds][h][m][s][hh][mm][ss][hr][min][sec][hour][minute][seconds]. AND all spaces
Image for post
Image for post
First all separators are removed (left). Then all spaces are removed (right)

2. Check if your date has AM or PM. If it does, remove the AM or PM and count how many digits are left.

If it’s an odd number like 5, 7 etc., look at the first digit (the hour).

  • If it’s an odd number and the date has AM, place a 0 at the beginning. If the date has PM then add 12 to the first digit and replace the first digit with the result.
Image for post
Image for post
Make the hour a 2 digit number for any dates with AM/PM. This takes care of all the AM/PM dates.

3. Count the digits of all your dates. For any date has a dot/period [.], count the digits before the dot/period [.]. If the count is an odd number, then add a 0 at the beginning.

Image for post
Image for post
Add 0 to any date that has an odd number of digits

4. Now all the time should be formatted. We can split it into components by counting off by 2s from the start. After the second component (minutes), the result of the values will be seconds.

Image for post
Image for post
The dates are now cleaned and split into components

5. Join all the components with a colon to get the final cleaned times

  • 18:23
  • 18:19
  • 14:53:27
  • 05:27
  • 04:09:33.7

6. [Optional] if you are working across timezones then you need to know how far away from UTC each time is. Then you need decide whether you want to store timezone info or convert each time into UTC.

If you want to convert all times to UTC, get the UTC difference. For example, New York is -04:00 during Daylight Saving. Mumbai is +05:30.

The -04:00 and +05:30 value is called the UTC offset.

For New York, this means the times are 4 hours behind UTC. So add 4 hours to the hour component of your dates. If that goes above 24, subtract by 12.

18:23 in New York becomes 22:23 UTC.

If you are storing times in UTC, you can either add a Z or +00:00 at the end of the times.

22:23Z or 22:23+00:00

If you are storing in any other timezone, you need to add its UTC offset.

22:23–04:00

DateTime Format Together

A lot of times Date and Time are stored together. Before following the above guides for each Date and Time, you would need to split them.

The most common date time separators are spaces and the ISO standard T.

  • 4/23/1998 4:45PM
  • 1998–04–23T16:45Z

No mater what the separator is, use it to split the value into Date and Time, clean both parts individually and then recombine them using either the space separator or the T separator

  • 1998–04–23 16:45
  • 1998–04–23T16:45+00:00

Storing DateTime Values

Now that you have cleaned your DateTime value, you need to store it somewhere.

Mostly all databases, programming languages and spreadsheet software today has a DateTime data type. If you are going to be mostly using your data within just one database, programming language or spreadsheet software, you should store it as a DateTime data type.

However, if you are going to be moving your data around between multiple databases, programming languages or spreadsheet software, then you should store it as a text/string/char.

I know! I know! That seems sacrilegious. But in our experience moving data around, things are most likely to break when DateTime data types are involved. Each system usually has a different implementation of the DateTime data type.

If you have cleaned your date to the ISO standard YYYY-MM-DDTHH:MM:SS, then storing them as text shouldn’t cause any issues. You can easily compare between and sort these dates.

Where to go from here?

You can use this guide to come up with an implementation to clean dates from your database, programming language or spreadsheets.

If you know Python, there is already an easy way. We wrote about it here.

And if you need to clean dates in spreadsheets, you can check out our tool Clean Spreadsheets to automatically clean and transform any date in your spreadsheets to the ISO format.

Happy Data Cleaning!

Written by

Products and resources that magically simplify hard spreadsheet tasks. Check us out at www.lovespreadsheets.com

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store