Apache Spark (TM) SQL for Data Analysts

Below are the top discussions from Reddit that mention this online Coursera course from Databricks.

Offered by Databricks. Apache Spark is one of the most widely used technologies in big data analytics. In this course, you will learn how to ...

Taught by
Kate Sullivan
Technical Curriculum Developer
and 10 more instructors

Offered by
Databricks

Reddit Posts and Comments

0 posts • 1 mention • top 1 shown below

r/dataengineering • comment
2 points • ka_steve

I know I'm a bit late to the game, but I've seen the answers and your conclusion to stick with pandas, and I may disagree with that a bit.

Don't get me wrong. I love pandas. You should know pandas like the back of your hand. But I would consider it an introductory course to data engineering. It's very good for big data analysis, assuming your data is in a structured format.

However, for HUGE data and for unstructured data you are going to need Spark. As people already mentioned, it can handle bigger datasets by delegating tasks to multiple computers (I'm simplifying here), but it can also handle unstructured data, streaming data, etc. And while you can tell yourself "Okay, simple big data is good enough for me", it may not be good enough for your future employer. I would recommend searching for data engineer job postings (use LinkedIn, with the location set to Worldwide and Remote turned on, to see the big picture) and counting the occurrences of Pandas and Spark. Currently, the demand for Spark engineers is overwhelming, something like 20:1 or 50:1. One of the reasons, of course, is that Pandas is already assumed for many Python data jobs without being mentioned, and another is that a lot more people are familiar with Pandas than with Spark. But the way the landscape is changing right now, it's the data scientists and ML engineers with their fancy ML-tweaking specializations who use pandas while experimenting with models on small(ish) samples of data. When that model goes to production and has to run on live data, that is going to need a data engineer and Spark (or something similar).
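If you want to do that counting with a script instead of by eye, here is a rough sketch. It assumes you have already copied the posting texts into a list of strings; the sample postings and the `count_keyword_postings` helper are made up for illustration, and the numbers they produce say nothing about the real job market.

```python
import re
from collections import Counter


def count_keyword_postings(postings, keywords):
    """Count how many postings mention each keyword at least once.

    Matching is case-insensitive and whole-word only, so "PySpark"
    on its own does NOT count as a mention of "spark".
    """
    counts = Counter({kw: 0 for kw in keywords})
    for text in postings:
        for kw in keywords:
            if re.search(rf"\b{re.escape(kw)}\b", text, flags=re.IGNORECASE):
                counts[kw] += 1
    return counts


# Hypothetical posting snippets, invented for this example.
sample_postings = [
    "Data Engineer: build Spark pipelines on AWS, strong SQL required.",
    "Senior Data Engineer with PySpark, Spark Streaming and Kafka experience.",
    "Analytics Engineer: Python, pandas, dbt.",
    "ML Engineer: prototype models in pandas, deploy with Spark.",
]

counts = count_keyword_postings(sample_postings, ["pandas", "spark"])
print(counts)
```

Counting postings rather than raw word occurrences keeps one very repetitive job ad from skewing the ratio. If you also want "PySpark" to count, add it as a separate keyword.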

And fortunately, you don't need a distributed cluster (that sounds scary when you don't even know what it is) to be able to start with Spark. You can sign up on Databricks.com for a free-forever community plan. Or you can download and run Spark on your own laptop, directly or in a Docker container. Yes, you are not going to get any performance benefit at this level (other than actually being able to process larger datasets than Pandas can handle), but your code will run. That means you can learn everything, and you can write the same enterprise-level code you would write if you had enterprise-level resources, and it will run the same (just slower). But it will be a game-changer when you start applying for jobs. I would know, as I already ran into that, and it is the single most important thing I regret not starting sooner.

And it's even pretty easy to start with. I would highly recommend the Apache Spark (TM) SQL for Data Analysts course by Databricks on Coursera (the 7-day free trial should be enough to complete it) to learn Spark SQL (even if you are not comfortable with SQL, trust me on this one), and Big Data Analytics Using Spark from UCSanDiegoX on edX to learn the Spark Python API (the course is free to audit, only the certificate costs money). Both of them provide dummy data and help you set up your starting environment (the Coursera one in the cloud, the free edX one locally, optionally in Docker), so you don't have to be afraid of not even knowing how to start.
