Introduction to Data Science
Why?
Data is the closest thing we have to ground truth about the real world, and the world is a better place when decisions are made with care.
Now that you know why you should become a data scientist, here is a simple guide to becoming one.
Prerequisites:
Math: You can honestly get really far with little math. That said, far too many people don't understand the math and end up applying analysis techniques with little thought about why that model is appropriate. I will say this much: the more math you know, the better off you are. It is a beautiful field and directly relevant if you want to make the most of your foray into data science.
Beginner: High school algebra, maybe some calculus 1, high school statistics
Intermediate: College stats, Calc 1,2,3, Probability Theory, linear algebra, Statistical Inference
Advanced: Upper undergrad / grad level statistical inference, Bayesian inference, harder probability theory, linear optimization, convex optimization, real analysis, calculus of variations. Math is the limiting factor; remove those obstacles. Learn all the advanced math you can muster.
Reccs: Khan Academy if you've never seen the content. MIT OCW has really great self-paced math courses for calc 1-3, linear algebra, and much more. Eventually you will want to learn the art of grinding through math textbooks while doing problems. The only way to learn math is by doing math. Practice, practice, practice.
If you want to be a true OG, here is Michael I. Jordan's reading list for prospective grad students in ML:
Mike I. Jordan ML Books
Programming
Pick up a language and stick with it for now. It is far easier to pick up a second language once you are very good at one than it is to slowly work through being mediocre at a bunch of languages.
People live and die by their language of choice for some reason. Generally, my thoughts on this are the following:
R is fantastic. R is great for data analysis, and if you use the tidyverse, R is great for everything else too. R is an organically grown language that is not always syntactically consistent, so it can be a bit quirky. Because so many statistics researchers use R, you will generally find the most bleeding-edge statistical packages there first. R will make you think like a statistician.
Python is great for taking your data a bit further. It has packages that provide most of the functionality you would need from R, plus many more varied packages and capabilities related to software development. Python is a broad, general-purpose language that is beautifully designed and consistent. It will make you think more like a programmer.
Over a career, it is wise to know as many tools as you can, and to know them well. Since you will spend a good amount of time writing code, you should strive to learn both at some point.
Alrighty, now onto the core stuff.
How to:
1. Get Data
This can be done by scraping, downloading a data set, building a data set, making a survey, whatever. There is a lot of data floating around, and there are plenty of resources for learning how to scrape. Since data is usually stored in databases in the real world, it is wise to learn a bit about databases. Learn some SQL. Since we live in the age of big data, it would also be wise to learn about non-relational (NoSQL) databases such as MongoDB or Cassandra.
Some good spots for data: /r/datasets, Kaggle, data.gov
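If you go the Python route, a minimal sketch of pulling data in might look like this (the CSV file, the example.db database, and the users table are hypothetical placeholders, not real data sets):

```python
# A minimal sketch of loading data into Python.
import sqlite3
import pandas as pd

# Load a downloaded data set (e.g. something grabbed from Kaggle or data.gov).
df = pd.read_csv("survey_results.csv")

# Pull rows out of a relational database with plain SQL.
conn = sqlite3.connect("example.db")
users = pd.read_sql_query("SELECT id, age, city FROM users WHERE age >= 18", conn)
conn.close()

print(df.shape, users.shape)
```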
2. Clean it
Remember: garbage in, garbage out. The quality of the data is the foundation of all your work from here on out. Check out a book or a course that will take you from these early stages into actually playing with the data. For Python, Harvard CS109 does this, and it also has video lectures. For R, Hadley Wickham (Chief Scientist at RStudio) has a bunch of amazing books out; check out R for Data Science. It wouldn't hurt to eventually read a book on data cleaning and imputation.
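To give a flavor of what cleaning looks like in practice, here is a minimal pandas sketch; the file and column names are made up for illustration:

```python
# A minimal pandas cleaning sketch on hypothetical survey data.
import pandas as pd

df = pd.read_csv("survey_results.csv")

df = df.drop_duplicates()                                  # remove exact duplicate rows
df["age"] = pd.to_numeric(df["age"], errors="coerce")      # force numeric; bad values become NaN
df = df[df["age"].between(0, 120)]                         # drop impossible ages
df["city"] = df["city"].str.strip().str.title()            # normalize text labels
df["income"] = df["income"].fillna(df["income"].median())  # simple imputation for missing values
```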
3. Take some time to familiarize yourself with your data
What kind of variables are there? Quantitative (numbers)? Categorical / qualitative (labels)? Text? Images? Geographical? The types of data present will determine what can be modeled.
It's a good idea to look at the data in list form and plot it out using R, Python, Excel, or whatever. If using other types of data (e.g. images, geographical), then find the best ways to visualize them. Now would be a good time to learn about knitr and rmarkdown if using R, or Jupyter notebooks if using Python.
This is the time to do some descriptive statistics. Look at the shape of the data in the plots. What relationships do you think are there? Are there any outliers? Plot, plot, plot, and plot some more. Make the plots look good. Check out ggplot2 in R or matplotlib / seaborn in Python.
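For instance, a quick first look in Python might go something like this (continuing the hypothetical survey data frame from the earlier sketches):

```python
# A quick first look at distributions and relationships.
import matplotlib.pyplot as plt
import seaborn as sns

print(df.describe())              # basic descriptive statistics
print(df["city"].value_counts())  # counts for a categorical variable

sns.histplot(df["age"])                        # distribution of one variable
plt.show()
sns.scatterplot(data=df, x="age", y="income")  # relationship between two variables
plt.show()
```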
Learn to dissect your data. Check out feature engineering. Learn to combine data sets to expand what can be done.
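As a tiny feature-engineering sketch, here is one way to join a second (hypothetical) table and derive new columns from existing ones; city_stats.csv and city_avg_income are placeholders:

```python
# Combining data sets and deriving features with pandas.
import pandas as pd

city_stats = pd.read_csv("city_stats.csv")        # e.g. average income per city
df = df.merge(city_stats, on="city", how="left")  # combine data sets on a shared key

df["income_ratio"] = df["income"] / df["city_avg_income"]  # income relative to the city average
df["is_adult"] = (df["age"] >= 18).astype(int)             # derive a flag from a number
```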
If you want to be a true OG, learn some graphic design principles. Learn to use blank space and color. There is nothing more influential than a beautiful graph.
Reccs:
Language Agnostic: The Visual Display of Quantitative Information
ggplot2 book
A book on seaborn / matplotlib if using python.
4. Figure out the questions you want to ask
This is perhaps the hardest part. The questions you ask will determine what you find in the data. When thinking about questions, think ahead about which techniques you would use to answer them. This is where a broad set of modeling tools comes in handy. Those are covered in the next section.
5. Analysis
The most satisfying part, in my opinion. When you wrote out your questions, hopefully you formed some idea of which techniques you would use. That said, there are a lot of techniques to choose from, and each technique has its own special niches. You will learn to select models properly by seeing a lot of data and practicing. Like everything else: practice, practice, practice. Try many models, run them on held-out test sets using cross validation, and see what works best. Learn about the bias-variance trade-off.
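To make that concrete, here is a sketch of comparing two models with cross validation in scikit-learn, using a built-in data set so it runs on its own:

```python
# Comparing models with 5-fold cross validation in scikit-learn.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

for model in [LinearRegression(), RandomForestRegressor(random_state=0)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")  # 5-fold CV
    print(type(model).__name__, round(scores.mean(), 3))
```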
Generally, statistical analysis is broken into two parts:
In descriptive statistics we want to describe the patterns and structure of the data. If you did part 3 well, this will be partially taken care of. You can expand on this as the analysis comes along.
In inferential statistics we want to predict from the data. This is where we are now.
Broadly speaking, this is the content that seems to be spread around the most. I will post some reccs to different courses / books that you will see floating around this subreddit (because they are great).
Make sure you learn and understand cross validation extremely well.
Make sure to understand the assumptions behind each modeling technique you use. There is no one-size-fits-all technique.
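As one concrete example of checking an assumption: ordinary linear regression assumes the residuals are structureless noise around zero, which you can eyeball with a plot (again just a sketch, reusing scikit-learn's built-in diabetes data):

```python
# Eyeballing a linear-regression assumption with a residual plot.
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

plt.scatter(model.predict(X), residuals)  # a visible pattern suggests a poor fit
plt.axhline(0, color="red")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()
```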
Since this is such a broad topic, go learn data analysis from books and courses. Learn models, figure out why you would want to use them, and then apply them.
Once you have completed your modeling, make sure to create a visualization. If others will see it, learn to explain things in simple terms, as if non-experts will be reading it. When you are being paid to be a data scientist, it is not about how intelligent you come across; it is about results and understanding for everyone.
More Stats focused:
Introduction to Statistical Learning
More CS focused:
Andrew Ng Machine Learning
Think Stats: Data Analysis in Python
Harvard CS109 and R for Data Science linked in part 2.
Any other data analysis book you can find.
6. The Great Beyond
After step 5, you are essentially a (good) data analyst. What will differentiate you now will be the extras. Here's some other stuff that you should probably look into:
- Databases: SQL / NoSQL / any query language, really.
- Data Engineering: create pipelines and automate a lot of the legwork to make raw data usable.
- Shell: Linux is cool, and bash scripting is useful.
- Hadoop, MapReduce, Spark, AWS, etc.: big data.
- Deep learning: see what the hype is all about.
- Time series analysis: given the ubiquity of time series data, a book on TSA can't hurt.
- Computer Science: learn CS rigorously. Check out Structure and Interpretation of Computer Programs; CLRS's Introduction to Algorithms; Concepts, Techniques, and Models of Computer Programming; and more. If you learn the concepts underlying the technologies you see, you won't become outdated nearly as easily. Tools may change, but the concepts and math stay the same.
- Learn to love practicing. This list is daunting, but it is completely achievable one small step at a time. Setting aside time to learn and practice daily will get you farther than you could ever imagine.
Guide to Learning
Like all other things in life, there are exciting things and less exciting things. What you find exciting is unique to you, but for the sake of this guide, here is how I would learn the content (assuming you know basic math and programming):
- Learn some data analysis. If using Python, Think Stats is nice. If using R, R for Data Science is nice.
- Learn how to clean and dissect data. If you read one of the books in the bullet above, you will have learned this.
- Now learn some visualization. Make it pretty with a book on ggplot2 or seaborn/matplotlib.
- Now learn some better tools for analysis: Read and work through Intro to Statistical Learning or Andrew Ng's Machine Learning course. Ideally you would do both.
- Learn database query languages. You will come across databases, so it's best to be a wizard with them.
- Learn Hadoop / MapReduce / Spark for big data.
- Learn the theory on a deeper level. Probability theory is great, statistical inference is great, Bayesian inference and multilevel modeling are great.
- Never stop learning
Please let me know if there are errors; I wrote this quite late at night. Make sure to give your own reccs and advice in the comments! Data science is a vast field that can be really intimidating to look into at first, and any input in the comments is helpful for me and others!