So you want to be a Data Analyst/ Scientist/ Engineer but don't know how? I've got your back
There is a lot of talk about Data in different forms: Big Data, Data Science, Data Analysis, Data Visualization. But what if all you want is a role in Data and all the talk is confusing? Don’t you worry, my child; I’m here for you.
I got into Big Data purely out of necessity. I was consulting with a utility company; there was a legal requirement to roll out smart meters by 2020. To do this, I needed to understand customers’ energy consumption. All that data was in Hadoop and no one could make sense of it. So I connected Power BI to Hadoop for data visualization. That, ladies and gentlemen, is how I got sucked into the world of data.
I know, everything I have written so far doesn’t make sense and you are too polite to tell me. But this is how it is in the real world; a lot of things about Data as a whole do not make sense, and you should not wait to understand it all fully before you jump in.
Now that we all agree that we don't know what we are doing when we first start, I'll explain 5 roles in Big Data and how to carve a niche for yourself in these roles:
1. Data Scientist – Uses statistical techniques to analyze raw data.
2. Data Analyst – Translates data into information that can be used to influence business decisions.
3. Data Engineer – Develops and maintains database infrastructure/ ecosystem.
4. Data Architect – Designs and manages the organisation's data architecture (how data flows, etc.).
5. Machine Learning Engineer – Creates algorithms to enable machines to act without being directed. Think driverless cars, automated credit card fraud detection and chat bots.
Now that we have established this, how do you get into Data Science/ Big Data/ all things data without seeming like a complete buffoon?
PS - Don't be afraid of seeming like a buffoon when you learn new things.
Skills for Data Science:
a. Programming language:
You can decide to learn Python, R, SQL or Java. Whatever you do, start with one language, understand it properly and when you are "fluent", you can decide to learn another language.
Learning a programming language can be challenging, especially when you have a full schedule (job, family and commitments). You should create time to learn and practice (a minimum of 30 minutes daily for 8 weeks). I learned Python on DataCamp.
b. Statistics:
You don't have to learn statistics just for Data Science, but I find it a useful tool. It helps with understanding data sets, running experiments, interpreting results and summarizing data. As @josh_wills says, "Data Scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician."
That's the balance you need to achieve and maintain. Here's a good introductory course on Udacity.
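If "summarizing data" and "interpreting results" sound abstract, here is a minimal sketch using only Python's built-in statistics module. The readings are made-up numbers and the summarize helper is a hypothetical name chosen for illustration, so treat this as one possible starting point rather than the way to do it.

```python
import statistics

# Hypothetical daily energy readings (kWh) for one household; made-up numbers.
readings_kwh = [8.2, 7.9, 9.1, 8.4, 15.6, 8.0, 8.3]


def summarize(values):
    """Return a few basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = summarize(readings_kwh)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the mean (about 9.36) sits well above the median (8.3) because a single unusually high day pulls it up. Spotting that kind of gap, and asking why it is there, is the "interpreting results" part.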
c. Machine Learning:
We generate so much data that it's almost impossible for a Data Scientist to keep track of it all and work with it. This is where ML (Machine Learning) comes into play. ML is how a system learns to process data sets and offer solutions without human intervention. For example, Netflix and Amazon UK recommend what you should watch based on your data/history (what you have watched before and how long you stuck with the movie/ series).
ML is part of Data Science and Artificial Intelligence, but AI is a totally different topic that I'll write about later. After you've completed the Python and Statistics courses, you can audit this Machine Learning course by Columbia University.
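To give a feel for "recommend based on your history", here is a toy sketch of a co-occurrence recommender, a very crude stand-in for the collaborative filtering idea. It is not how Netflix or Amazon actually do it; the users, titles, and the recommend function are all made up for illustration, and it ignores things like how long you watched.

```python
from collections import Counter

# Hypothetical watch histories: user -> set of titles watched. All names are made up.
watch_history = {
    "ada": {"Space Drama", "Robot Wars", "Cooking Duel"},
    "ben": {"Space Drama", "Robot Wars", "Alien Docs"},
    "cyd": {"Cooking Duel", "Baking Finals"},
    "dee": {"Space Drama", "Alien Docs"},
}


def recommend(user, history, top_n=2):
    """Suggest unseen titles watched by users whose history overlaps with `user`'s."""
    seen = history[user]
    scores = Counter()
    for other, titles in history.items():
        if other == user:
            continue
        overlap = len(seen & titles)  # number of shared titles = crude similarity
        if overlap == 0:
            continue
        for title in titles - seen:
            scores[title] += overlap
    # Sort by score (highest first), then by title so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:top_n]]


print(recommend("dee", watch_history))
```

Nobody wrote a rule saying "people who like Space Drama should watch Robot Wars". The suggestion falls out of the data, which is the core idea behind ML, even though real systems use far more sophisticated models.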
Or if you are feeling rather brave and can create 3 hours in a day, you can take all 3 courses simultaneously. NOT Recommended!
d. Advanced SQL:
Python is basically a scripting language and I recommended that you learn it first because it's not as daunting as other programming languages. Now that you understand why I recommended you learn Python first, you are ready to take on SQL (pronounced "sequel").
I already hear the questions..."do I really need to learn SQL?"
Short answer - Yes.
Long answer - Data is stored in a database. You need SQL to retrieve the data from the database in the first place.
You also need SQL to write and update/ insert data in a database. Think about it like going to Italy without learning Italian. There is only so much you can do with sign language. After a while, you'd need an interpreter to help you order Mondeghili in Milan. SQL is that interpreter you need to communicate with relational database systems. I learned SQL on W3Schools. However, I find the SQL course on Codecademy more user friendly.
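Here is a small sketch of the retrieve, insert, and update operations mentioned above, run from Python with the built-in sqlite3 module so you can try it without installing a database server. The readings table, its columns, and the values are hypothetical, and a real company database will look different.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table of meter readings; names are illustrative only.
cur.execute("CREATE TABLE readings (customer TEXT, month TEXT, kwh REAL)")

# INSERT: write new rows into the table.
cur.executemany(
    "INSERT INTO readings (customer, month, kwh) VALUES (?, ?, ?)",
    [
        ("alice", "2019-01", 310.5),
        ("alice", "2019-02", 280.0),
        ("bob", "2019-01", 150.0),
    ],
)

# UPDATE: correct a value in an existing row.
cur.execute(
    "UPDATE readings SET kwh = ? WHERE customer = ? AND month = ?",
    (155.0, "bob", "2019-01"),
)
conn.commit()

# SELECT: retrieve the total consumption per customer.
cur.execute(
    "SELECT customer, SUM(kwh) FROM readings GROUP BY customer ORDER BY customer"
)
totals = cur.fetchall()
print(totals)

conn.close()
```

The question marks are placeholders that sqlite3 fills in safely. That is a better habit than pasting values straight into the SQL string.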
Once you become fluent in all things Data Science, you need to rapidly improve your communication skills. It is already difficult to explain why people do what they do, and the data you analyze will be confusing to decision makers and senior stakeholders.
This is your opportunity to be a superstar, so ditch technical jargon and unrelatable scenarios. It's useful to learn how to tell stories from what the data shows. It's not enough to say "the data says x". Many decision makers don't respond to data; they are influenced by the stories the data tells and the impact on their operating costs/ earning potential.
If you really want to stand out, learn to tell relatable stories. Here's a course from Coursera on storytelling and influencing to guide you.
Now on to Data Analysis. Remember what a Data Analyst does? They translate data into information that can be used to influence business decisions.
To do this, a Data Analyst:
i. determines what question the business needs answered
ii. collects data that helps answer the question
iii. improves the quality of the data (data cleansing)
iv. manipulates the data (makes it easier to read)
v. interprets and presents the data in a way that helps stakeholders make informed decisions
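The five steps above can be sketched in a few lines of plain Python. Everything here is hypothetical: the business question, the made-up CSV text standing in for a real export, and the clean and average_by_region helpers. In practice you would likely reach for a library such as pandas, but the shape of the work is the same.

```python
import csv
import io
from collections import defaultdict

# Step i: the (hypothetical) question: which region uses the most energy on average?

# Step ii: collect data. A made-up CSV stands in for a real export.
raw_csv = """region,kwh
North,120
north ,130
South,
South,90
East,abc
East,200
"""


def clean(rows):
    """Step iii: drop rows with missing or non-numeric kwh and tidy region names."""
    cleaned = []
    for row in rows:
        region = row["region"].strip().title()
        try:
            kwh = float(row["kwh"])
        except (TypeError, ValueError):
            continue  # skip rows we cannot trust
        cleaned.append({"region": region, "kwh": kwh})
    return cleaned


def average_by_region(rows):
    """Step iv: reshape the data into one average per region."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["region"]].append(row["kwh"])
    return {region: sum(values) / len(values) for region, values in grouped.items()}


rows = list(csv.DictReader(io.StringIO(raw_csv)))
averages = average_by_region(clean(rows))

# Step v: present the answer in plain language for stakeholders.
top_region = max(averages, key=averages.get)
print(f"{top_region} has the highest average usage: {averages[top_region]:.1f} kWh")
```

Notice that two of the six rows get thrown away during cleaning. In real projects, deciding what to drop and what to repair is often where most of the time goes.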
Now on to Data Engineers. Remember what a Data Engineer does? They develop and maintain database infrastructure/ ecosystem.
Because they focus mainly on database infrastructure, the main requirements are architectural skills:
a. Hadoop based analytics - Hadoop is a framework used to store data and run applications on clustered systems. Many organizations use it because it stores and can quickly process an awful lot of structured and unstructured data. It's scalable, and tools in its ecosystem can analyze data in near real time. To understand Hadoop better, have a look at this
b. OS knowledge (Linux, UNIX, Solaris)
c. Database architecture and data warehousing
d. Database Systems (SQL and NoSQL)
e. Data modeling and mining
f. Machine Learning
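To make item a. a little more concrete: Hadoop's classic processing model is MapReduce, where work is split into a map step, a shuffle (grouping) step, and a reduce step that can run across many machines. The sketch below imitates those three steps on one machine in plain Python. It does not use Hadoop at all, the log lines are made up, and the function names are my own, so treat it as an illustration of the idea only.

```python
from collections import defaultdict

# Made-up log lines standing in for file chunks spread across a cluster.
chunks = [
    ["meter reading ok", "meter offline"],
    ["reading ok", "meter reading ok"],
]


def map_phase(lines):
    """Emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


# Each chunk is mapped independently; on a cluster this would happen in parallel.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each chunk is mapped on its own, adding more machines lets you handle more chunks at once, which is where the scalability comes from.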
To understand Data Engineering further, I find this course on data engineering useful, especially if you have no knowledge or experience at all.