A Practical, High-Level, No-Nonsense Guide to Learning Data Science (Part I)
The term Data Science has been muddled more than any other tech term over the past few years (maybe except for blockchain).
Different people define the activities that comprise Data Science differently.
But most of them would agree that the 4 steps we are about to lay out form the foundation of any Data Science project.
We know getting started is the hardest part, and that’s what this guide is for.
It is not meant to be in-depth, to court everyone’s opinion, or to touch upon adjacent fields and activities (like Data Engineering, Data Analysis, Business Requirements).
This is our opinionated view on how to get started learning the technical components of Data Science. It takes work but, like anything else, it is very doable with practice.
These are the 4 major steps of any Data Science project:
We will cover each step in a separate blog post. This is Part I and will cover Source.
Source refers to getting the data you need to do the project.
The method of getting the data varies widely depending on where the data lives.
However, you can safely say that Databases, APIs and Files dominate the source arena.
From Wikipedia itself:
A database is an organized collection of data, generally stored and accessed electronically from a computer system. Where databases are more complex they are often developed using formal design and modeling techniques.
The databases you will meet most often are relational databases. Even though there are various implementations of them, you can interact with any of them using SQL.
SQL stands for Structured Query Language and is pronounced “Ess-que-el” or “Sequel”. Depending on which database you are using, some syntax may need to change, but worry not: the differences are usually minor and not worth fretting over.
To learn SQL, you need to practice on a real database. An easy way to get one is to set up a small SQLite database and run SQL queries on it.
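To make that concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are made up for illustration, and the database lives in memory so nothing is left behind on disk.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained; pass a file path
# such as "practice.db" (hypothetical name) to keep the data on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for practice.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)
conn.commit()

# Question: how much has each customer spent in total?
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
conn.close()
```

Once this runs, try changing the question: filter with `WHERE`, sort with `ORDER BY`, or count rows with `COUNT(*)`.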
Even though everyone complains about how much data is stored in files, files are still the most widely used data store.
Spreadsheets, CSV files and text files are the most common.
In order to do anything meaningful with files, you need to learn a programming language. The most common Data Science languages are Python and R.
Eventually you should pick up both, but in the beginning we recommend starting with Python because of its versatility.
Pandas is a Python library that makes a lot of data processing tasks easier, especially reading and writing files.
The quickest way to get started with files is to install and run Python and start reading, manipulating and writing files.
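As a rough sketch of that read, manipulate, write loop, the example below uses only Python's standard `csv` module so it runs without installing anything. Pandas covers the same ground with `read_csv` and `DataFrame.to_csv`. The file names and the sample rows are assumptions made up for the example.

```python
import csv
import tempfile
from pathlib import Path

# Work in a temporary folder; the file names are hypothetical.
workdir = Path(tempfile.mkdtemp())
source = workdir / "sales.csv"
target = workdir / "sales_clean.csv"

# Create a small input file so the sketch runs on its own.
source.write_text("city,units\nPune,3\nOslo,5\nLima,0\n", encoding="utf-8")

# Read: each row becomes a dict keyed by the header line.
with source.open(newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Manipulate: convert text to numbers and drop rows with no units sold.
kept = [
    {"city": r["city"], "units": int(r["units"])}
    for r in rows
    if int(r["units"]) > 0
]

# Write: save the result as a new CSV file.
with target.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["city", "units"])
    writer.writeheader()
    writer.writerows(kept)
```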
You can even bring in your database to Python and use SQL within Python itself. That way you have all your data sources connected to a single script and can work off them together.
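Here is one way that could look, as a sketch rather than a recipe: file data is loaded into the same SQLite database as an existing table, SQL joins the two, and the result goes back out as CSV. The table names, columns and rows are all invented for illustration, and `io.StringIO` stands in for real files.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file on disk (hypothetical contents).
csv_text = "customer,region\nalice,north\nbob,south\n"

conn = sqlite3.connect(":memory:")

# Data that already lives in the database (hypothetical rows).
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# Bring the file data into the same database so SQL can join both sources.
conn.execute("CREATE TABLE customers (customer TEXT, region TEXT)")
file_rows = [
    (r["customer"], r["region"]) for r in csv.DictReader(io.StringIO(csv_text))
]
conn.executemany("INSERT INTO customers VALUES (?, ?)", file_rows)

result = conn.execute(
    """
    SELECT c.region, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.customer = o.customer
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
conn.close()

# Write the query result out as CSV text.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["region", "total"])
writer.writerows(result)
```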
API stands for Application Programming Interface. It is a documented set of rules and procedures that an application publishes so that other programs can interact with its data.
Some of your favorite applications, such as Twitter, Facebook, Google and Yelp, all have APIs to get the data you need from them.
You interact with an API from a programming language, and if you followed the above suggestion of using Python, you can get started right away.
The key to getting better at sourcing from APIs is to do something practical with the data. You can do this by reading the API documentation and contacting the API developers for any clarifications.
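The usual cycle with a web API is: build a URL with your query parameters, send the request, and pick the fields you need out of the JSON that comes back. The sketch below shows that shape using only the standard library. The endpoint, the parameters and the response layout are all hypothetical; a real API's documentation will tell you the actual URL, the fields, and whether you need an API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint: replace with the URL from the API's documentation.
BASE_URL = "https://api.example.com/v1/reviews"


def build_url(base_url, params):
    """Attach query parameters to the endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url):
    """Send a GET request and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def extract_names(payload):
    """Pull the 'name' field out of each item; this response shape is assumed."""
    return [item["name"] for item in payload.get("items", [])]


url = build_url(BASE_URL, {"city": "Oslo", "limit": 2})

# A canned response stands in for fetch_json(url) so the sketch runs offline.
sample = json.loads('{"items": [{"name": "Cafe A"}, {"name": "Cafe B"}]}')
names = extract_names(sample)
print(url)
print(names)
```

Against a real API you would call `fetch_json(url)` instead of using the canned `sample`.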
Data Sources can be super diverse, especially in today’s landscape where almost everything creates data. However, the three sources above will cover a huge chunk of any data you would need to access.
While going through the above, the key thing is to practice, practice, practice. Here are some things you can do to practice:
- Store data in a SQLite database
- Phrase questions you need answering from the database in SQL queries
- Query, query, query
- Read spreadsheets
- Query data in Python using SQL and output to a CSV file
- Read data from an API and build a small practical project
- Read data from an API and create a table in the SQLite database, all in Python
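
For the last exercise, a minimal sketch might look like the following. A canned JSON string stands in for the API response so the example runs offline, and the `places` table and its fields are made up for illustration.

```python
import json
import sqlite3

# Canned JSON standing in for an API response (hypothetical fields).
response_text = (
    '[{"id": 1, "name": "Cafe A", "rating": 4.5},'
    ' {"id": 2, "name": "Cafe B", "rating": 3.9}]'
)
records = json.loads(response_text)

conn = sqlite3.connect(":memory:")  # use a file path to persist the table
conn.execute(
    "CREATE TABLE IF NOT EXISTS places (id INTEGER PRIMARY KEY, name TEXT, rating REAL)"
)

# Named placeholders map each JSON object straight onto the table columns.
conn.executemany(
    "INSERT INTO places (id, name, rating) VALUES (:id, :name, :rating)",
    records,
)
conn.commit()

# Check the load by asking the new table a question.
best = conn.execute(
    "SELECT name, rating FROM places ORDER BY rating DESC LIMIT 1"
).fetchone()
print(best)
```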
Let us know if you have any questions!
Coming Up Next…
Part II: Clean