How to Clean Company Names from your Spreadsheets

Raise your hand if you have tried doing some analytics with company names and it led to a visualization like this one

All 3 are the same company but because of improper formatting end up as different bars

WHY DID THIS HAPPEN?!

Dirty data strikes again. The same company name can be written a bunch of different ways.

Not to mention the usual dirty data suspects:

4. Inconsistent capitalization — apple Inc

5. Random commas and dots — Apple, Inc.

How do I clean these company names?

Let’s go step by step and use this list as an example

Step 1 — Standardize capitalization

Pick one type of capitalization to follow. You can make everything UPPERCASE, lowercase or Title Case (only the first letter in each word is capitalized).

We prefer Title Case because it looks better and follows the conventions of proper nouns. But it’s really up to you.

The first letter of each name is capitalized

Step 2 — Remove all non-alphanumeric characters and extra text

In this step we are going to remove all non-alphanumeric characters such as commas (,) and dots (.)

We are also going to remove extra text such as (An Alphabet Co.) from Google Llc (An Alphabet Co.)

This whole step is a bit tricky and an inspection of your data is necessary to figure out which characters to remove. And what counts as extra text.

For example, Jeni’s Splendid Ice Creams is an ice cream company with an apostrophe (‘) in its name. You will not want to remove the apostrophe in this case.

Also depending on the company, things can be counted as extra text or not.

For example, both these company names have dashes ( — ) but the text after the dash in Johnson & Johnson is extra but not for Baskin — Robbins:

So you can’t reliably use the dash( — ) as the indicator for removing extra text in this case.

Generally, removing between parenthesis is fine in 99% of situations. But feel free to provide counter-examples in the comments!

So after removing non-alphanumeric characters and extra text, our list now looks like this

Removed dots (.) commas (,) and text between parenthesis

Step 3 — Remove/standardize full and abbreviated legal entity suffixes

This is the last step. We are now going to remove both the abbreviated and full legal entity suffixes.

You can also standardize the legal entity suffixes so each one appears consistently. For example instead of having a mix of Inc and Incorporated, all names only have Inc

This can be done in 2 ways:

NOTE: you have to list both the abbreviated and the full legal entity suffix, such as Inc vs. Incorporated

Once you have your list, go through all the company names and remove words that occur in your legal entity list. Or replace the words with one standard.

We have chosen to remove the suffixes. Which leads us to the following

Spic and span company names ready for Data Analysis

An Easier Way

As you can see, this gets out of hand really fast. Especially if you have company names from around the globe.

And this is why we created www.cleanspreadsheets.com

You simply upload your spreadsheet, choose the column with your company names, and click clean.

Our algorithms then auto-magically do all the cleaning for you and you get this as the result

We break up the names and provide both full and abbreviated suffixes. We support all G20 countries

As always, feel free to let us know if you have any questions, comments or feedback at info@lovespreadsheets.com

Happy Data Cleaning!

Photo by Clarisse Meyer on Unsplash

Products and resources that magically simplify hard spreadsheet tasks. Check us out at www.lovespreadsheets.com

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store