Raise your hand if you have tried doing some analytics with company names and it led to a visualization like this one
WHY DID THIS HAPPEN?!
Dirty data strikes again. The same company name can be written a bunch of different ways.
- Just the company name — Apple
- The company name with abbreviated legal entity suffix — Apple Inc
- The company name with full legal entity — Apple Incorporated
Not to mention the usual dirty data suspects:
4. Inconsistent capitalization — apple Inc
5. Random commas and dots — Apple, Inc.
How do I clean these company names?
Let’s go step by step and use this list as an example
Step 1 — Standardize capitalization
Pick one type of capitalization to follow. You can make everything UPPERCASE, lowercase or Title Case (only the first letter in each word is capitalized).
We prefer Title Case because it looks better and follows the conventions of proper nouns. But it’s really up to you.
Step 2 — Remove all non-alphanumeric characters and extra text
In this step we are going to remove all non-alphanumeric characters such as commas (,) and dots (.)
We are also going to remove extra text such as (An Alphabet Co.) from Google Llc (An Alphabet Co.)
This whole step is a bit tricky and an inspection of your data is necessary to figure out which characters to remove. And what counts as extra text.
For example, Jeni’s Splendid Ice Creams is an ice cream company with an apostrophe (‘) in its name. You will not want to remove the apostrophe in this case.
Also depending on the company, things can be counted as extra text or not.
For example, both these company names have dashes ( — ) but the text after the dash in Johnson & Johnson is extra but not for Baskin — Robbins:
- Johnson & Johnson — a family company
- Baskin — Robbins
So you can’t reliably use the dash( — ) as the indicator for removing extra text in this case.
Generally, removing between parenthesis is fine in 99% of situations. But feel free to provide counter-examples in the comments!
So after removing non-alphanumeric characters and extra text, our list now looks like this
Step 3 — Remove/standardize full and abbreviated legal entity suffixes
This is the last step. We are now going to remove both the abbreviated and full legal entity suffixes.
You can also standardize the legal entity suffixes so each one appears consistently. For example instead of having a mix of Inc and Incorporated, all names only have Inc
This can be done in 2 ways:
- Inspect your data and make a list of all the legal entity names you see — best for small data
- Note down all the possible legal entity names for the countries in which your companies are based — best for big data
NOTE: you have to list both the abbreviated and the full legal entity suffix, such as Inc vs. Incorporated
Once you have your list, go through all the company names and remove words that occur in your legal entity list. Or replace the words with one standard.
We have chosen to remove the suffixes. Which leads us to the following
An Easier Way
As you can see, this gets out of hand really fast. Especially if you have company names from around the globe.
And this is why we created www.cleanspreadsheets.com
You simply upload your spreadsheet, choose the column with your company names, and click clean.
Our algorithms then auto-magically do all the cleaning for you and you get this as the result
As always, feel free to let us know if you have any questions, comments or feedback at email@example.com
Happy Data Cleaning!