How to become a data scientist

Data science is one of the most buzzed-about fields right now, and data scientists are in high demand. And with good reason: data scientists are doing everything from creating self-driving cars to automatically captioning images. Given all the interesting applications, it makes sense that data science is a very sought-after career.

Data science is applied in many fields, including in developing self-driving cars.

If you’re reading this post, I’m assuming that you’d like to learn how to become a data scientist. If you’ve already done some research, you’ve probably read dozens of guides that start with “learn linear algebra”, and end 5 years later with “learn Spark”. When I was learning, I tried to follow these guides, but I ended up bored, without any actual data science skills to show for my time. The guides were like a teacher at school handing me a bunch of books and telling me to read them all – a learning approach that’s never appealed to me.

The unfortunate part about all the “become a data scientist in 5 easy years” guides is that they’re written by people who’re already expert data scientists. They look at themselves and say “what would someone need to learn to do what I do every day?” They forget what it’s like to struggle to learn something on your own, and what it’s like to need motivation to push you over the next hurdle.

As I learned data science, I realized that I learn most effectively when I’m working on a problem I’m interested in. Instead of learning a checklist of skills, I decided to focus on building projects around real data. Not only did this learning method motivate me, it also closely mirrors the work you’ll do in a data scientist role.

In this post, I’ll share a few steps that will help you in your journey to becoming a data scientist. The journey won’t be easy, but it will be infinitely more motivating than following the conventional wisdom.

1. Question Everything

The appeal of data science is that you get to answer interesting questions using actual data and code. These questions can range from “can I predict whether any flight will be on time?” to “how much does the US spend per student on education?”. To be able to ask and answer these questions, you need to develop an analytical mindset.

The best way to develop this mindset is to practice on news articles. Find articles, like this one on whether running makes you smarter and this one on whether sugar is actually bad for you. Think about:

  • How they reach their conclusions given the data they discuss
  • How you might design a study to investigate further
  • What questions you might want to ask if you had access to the underlying data

Some articles, like this one on gun deaths in the US and this one on online communities supporting Donald Trump, actually have the underlying data available for download. When the data is available:

  • Download the data, and open it in Excel or an equivalent tool
  • See what patterns you can find in the data by eyeballing it
  • Do you think the data supports the conclusions of the article? Why or why not?
  • What additional questions do you think you can use the data to answer?
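Once eyeballing a spreadsheet stops scaling, the same first look can be taken in a few lines of Python. What follows is only a rough sketch: the column names (`year`, `category`, `count`) and every row are made-up placeholders, not data from any of the articles mentioned above, and `total_by_column` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a downloaded CSV file. With a real download you
# would pass open("your_file.csv", newline="") instead of io.StringIO(...).
SAMPLE_CSV = """year,category,count
2012,A,10
2012,B,4
2013,A,12
2013,B,
2014,A,15
2014,B,6
"""


def total_by_column(csv_file, group_col, value_col):
    """Sum value_col for each distinct value of group_col, skipping blanks."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        raw = (row[value_col] or "").strip()
        if not raw:  # skip missing values rather than guessing them
            continue
        totals[row[group_col]] += int(raw)
    return dict(totals)


totals_by_category = total_by_column(io.StringIO(SAMPLE_CSV), "category", "count")
totals_by_year = total_by_column(io.StringIO(SAMPLE_CSV), "year", "count")
print(totals_by_category)
print(totals_by_year)
```

Grouping by a second column is often enough to see whether the pattern an article describes shows up in the raw numbers.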

Here are some good places to find data-driven articles:

After you’ve read articles for a few weeks, reflect on whether you enjoyed coming up with questions and answering them. Becoming a data scientist is a long road, and you need to be very passionate about the field to make it all the way. Data scientists constantly come up with questions and answer them using mathematical models and data analysis tools.

If you don’t enjoy the process of reasoning about data and asking questions, you should think about trying to find the overlaps between data and things that you do enjoy. For example, maybe you don’t enjoy the process of coming up with questions in the abstract, but maybe you really enjoy analyzing health data or education data. I personally was very interested in stock market data, which motivated me to build a model to predict the market.

Before you move on to the next step, make sure that there’s something about the process of data science that you’re passionate about. I can’t emphasize this point enough. If your goal is to become a data scientist, but you don’t have a specific passion, you’re probably not going to put in the months of hard work that you’ll need to learn.

An infographic from FiveThirtyEight.

2. Learn The Basics

Once you’ve figured out how to come up with questions, you’re ready to learn the technical skills to answer them. I’d start by learning the basics of programming in Python. Python is a programming language that has consistent syntax, and is often recommended for beginners. Luckily, it’s also versatile enough for extremely complex data science and machine learning work, such as deep learning.

A lot of people worry about language choice, but the key points to remember are:

  • Data science is about being able to answer questions and drive business value, not about tools
  • Learning the concepts is more important than learning the syntax
  • Building projects and sharing them is what you’ll do in an actual data science role, and learning this way will give you a head start

As the above points illustrate, the key isn’t to learn all the data science tools. It’s to learn enough of the technical side to start building projects. Some good places to do this are:

  • Dataquest – Dataquest teaches you the fundamentals of Python and data science through analyzing interesting datasets, like data on NBA scoring or CIA covert actions.
  • Codecademy – Codecademy teaches you the basics of Python, and how to build programs.

The key is to learn the basics, and start answering some of the questions you came up with in the past few weeks as you learn. This will help you solidify your learning, and start building a portfolio.

3. Build Projects

As you’re learning the basics of coding, you should start building projects that answer interesting questions and showcase your data science skills. Projects don’t have to be extremely complex. For example, you could analyze Super Bowl winners to find patterns. The key is to find interesting datasets, ask questions about the data, then answer those questions with code. If you need help finding datasets, check out this post for a good list of places to find them.

As you’re building projects, remember that:

  • Most data science work is data cleaning.
  • One of the most common machine learning techniques is linear regression.
  • Everyone starts somewhere. Even if you feel like what you’re doing isn’t impressive, it’s still worth working on.
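To make the first two points concrete, here is a minimal sketch that cleans a messy list of values and then fits a straight line to what is left using ordinary least squares. The `RAW` data is invented for illustration, and `clean` and `fit_line` are hypothetical helper names. A real project would more likely lean on a library, but writing it by hand shows how little is going on underneath.

```python
def clean(rows):
    """Keep only (x, y) pairs where both values parse as numbers."""
    cleaned = []
    for x, y in rows:
        try:
            cleaned.append((float(x), float(y)))
        except (TypeError, ValueError):
            # Blank cells, None, and text like "n/a" are dropped here.
            continue
    return cleaned


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up raw data, as it might look when read straight from a text file.
RAW = [("1", "3"), ("2", "5"), ("3", "n/a"), ("", "9"),
       ("4", "9"), (None, "2"), ("5", "11")]

points = clean(RAW)
slope, intercept = fit_line(points)
print(f"kept {len(points)} of {len(RAW)} rows")
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

Note that nearly half of the rows never reach the model, which is typical: the cleaning decisions (drop, fill in, or fix by hand) usually deserve more thought than the fit itself.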

Not only does building projects help you understand real data science work and practice your skills, it also helps you build a portfolio to show to potential employers. Here are some more detailed guides on building projects on your own:

Once you’ve built some smaller projects, it’s good to find one interest area that you can go deep in. For me, this was trying to predict the stock market. The nice thing about predicting the stock market is that you can start with very little knowledge of Python and try to make trades every month or week. As your skills grow, you can make the problem more complicated by adding nuances like minute-by-minute prices and more accurate predictions.

Some other examples of projects that you can develop iteratively are:

  • Health tracking. You can start by manually entering and analyzing your data, and keep adding more correlations and predictive elements as time goes on.
  • Predicting NBA game winners. You can start by manually entering scores and making predictions with a heuristic, but you can keep acquiring more data and making more accurate predictions over time.
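As one possible starting point for the second idea, the sketch below predicts a winner with a very simple heuristic: pick the team with the better win rate so far, and give ties to the home team. The team names and scores are invented placeholders, and the heuristic itself is just an assumption chosen for illustration, not a recommended model.

```python
from collections import defaultdict

# Hypothetical, manually entered results:
# (home_team, away_team, home_points, away_points)
GAMES = [
    ("Team A", "Team B", 101, 95),
    ("Team B", "Team C", 110, 99),
    ("Team C", "Team A", 90, 104),
    ("Team A", "Team B", 98, 102),
]


def win_rates(games):
    """Return each team's fraction of games won so far."""
    wins = defaultdict(int)
    played = defaultdict(int)
    for home, away, home_points, away_points in games:
        played[home] += 1
        played[away] += 1
        winner = home if home_points > away_points else away
        wins[winner] += 1
    return {team: wins[team] / played[team] for team in played}


def predict_winner(home, away, rates):
    """Pick the team with the higher win rate; ties go to the home team."""
    if rates.get(home, 0.0) >= rates.get(away, 0.0):
        return home
    return away


rates = win_rates(GAMES)
print(rates)
print(predict_winner("Team C", "Team B", rates))
```

From here the project can grow in small steps: record more games, compare the heuristic's picks against actual results, then swap in a better rule once you can measure the improvement.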

An example of a data science project — this map shows racial diversity in the US.

4. Share Your Work

Once you’ve built a few projects, you should share them with others! It’s a good idea to upload them to GitHub, where others can view them. You can read a good post on uploading projects to GitHub here, and more about assembling a portfolio here. Uploading projects will:

  • Force you to think about how to best present them, which is what you’d do in a data science role
  • Allow your peers to view your projects and comment
  • Allow employers to view your projects

Along with uploading your work to GitHub, you should also think about publishing a blog. When I was learning data science, writing blog posts helped me:

  • Get inbound interest from recruiters
  • Learn concepts more thoroughly (the process of teaching really helps you learn)
  • Connect with peers

You can read a good guide on how to publish a blog here. Some good topics for blog posts are:

  • Explaining data science and programming concepts
  • Discussing your projects and walking through your findings
  • Discussing the process of learning data science, and how you’re doing it

An infographic from [my blog](http://www.vikparuchuri.com/blog/how-do-simpsons-characters-feel-about-each-other/) that shows how much each Simpsons character likes the others.

5. Learn From Others

After you’ve started to build an online presence, it’s a good idea to start engaging with other data scientists. You can do this in person, or in online communities. Some good online communities are:

I personally was very active on Quora and Kaggle when I was learning, which helped me immensely. Engaging in online communities is a good way to:

  • Find other people to learn with
  • Enhance your profile, and find opportunities
  • Strengthen your knowledge by learning from others

You can also engage with people in person through Meetups. In-person engagement can help you meet and learn from more experienced data scientists in your area.

6. Push Your Boundaries

Companies want to hire data scientists who find those critical insights that save them money or make their customers happier. You have to apply the same process to learning – keep searching for new questions to answer, and keep answering harder and more complex questions. If you look back on your projects from a month or two ago, and aren’t embarrassed about something you did, you probably aren’t pushing your boundaries enough. You should be making strong progress every month, and it should be reflected in your work.

Some ways to push your boundaries are:

  • Try working with a larger dataset than you’re comfortable with
  • Start a project that requires knowledge you don’t have
  • Try making your project run faster
  • See if you can teach what you did in a project to someone else
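For the "run faster" idea, a good habit is to measure before and after a change while checking that the answer stays the same. The sketch below is a small, assumed example rather than a general recipe: it counts how many IDs appear in a lookup collection, first with a list and then with a set, and times both with the standard library's `timeit`. The function names and the data are hypothetical.

```python
import timeit


def count_matches_list(ids, lookup_ids):
    """Membership test against a list: each check scans the list."""
    lookup = list(lookup_ids)
    return sum(1 for i in ids if i in lookup)


def count_matches_set(ids, lookup_ids):
    """Membership test against a set: each check is a hash lookup."""
    lookup = set(lookup_ids)
    return sum(1 for i in ids if i in lookup)


# Made-up data: 2,000 IDs checked against the even numbers below 4,000.
IDS = range(2000)
LOOKUP_IDS = range(0, 4000, 2)

# Confirm both versions agree before comparing their speed.
assert count_matches_list(IDS, LOOKUP_IDS) == count_matches_set(IDS, LOOKUP_IDS)

slow = timeit.timeit(lambda: count_matches_list(IDS, LOOKUP_IDS), number=3)
fast = timeit.timeit(lambda: count_matches_set(IDS, LOOKUP_IDS), number=3)
print(f"list version: {slow:.4f}s, set version: {fast:.4f}s")
```

The exact timings will vary by machine, so treat the printed numbers as a comparison between the two versions, not as a benchmark.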

You’ve Got This

Learning data science isn’t easy, but the key is to stay motivated and enjoy what you’re doing. If you’re consistently building projects and sharing them, you’ll build your expertise, and get the data scientist job that you want.

I haven’t given you an exact roadmap to learning data science, but if you follow this process, you’ll get farther than you imagined you could. Anyone, including you and me, can become a data scientist with enough motivation.

After years of being frustrated with how conventional sites taught data science, I recently created Dataquest, a better way to learn data science online. Dataquest solves the problems of MOOCs, where you never know what course to take next, and you’re never motivated by what you’re learning. Dataquest leverages the lessons I’ve learned from helping thousands of people learn data science, and focuses on making the learning experience engaging. On Dataquest, you’ll build dozens of projects, and learn all the skills you need to be a successful data scientist. Dataquest students have been hired at companies like Accenture and SpaceX.

Good luck becoming a data scientist, and please let others know in the comments if you have any tips on how to learn!