We are constantly inundated with the latest and greatest tools. Over and over again we see friends, LinkedIn acquaintances, and former classmates posting about new certifications and technologies being used on the data science front. It can feel like being stuck in a looping GIF, never quite reaching completion in our data science careers.
So how do you cut through this massive tidal wave of tech stacks to focus on what is important? How do you separate the skills that genuinely matter from the tools that amount to a worthless certification you will never use on the job?
Over the past few years, a handful of tools have transcended industries and shown a clear signal of being key to a high-performing data science team in the near future. Mainly, these technologies revolve around the concept of big data. Now, you may be asking yourself, what does “big data” really mean? Is it gigabytes, terabytes, or petabytes of data? Is it streaming, large batch jobs, or a hybrid? Or is it anything that doesn’t fit in an Excel spreadsheet?
What is big data?
We are used to hearing the three Vs of big data: volume, velocity, and variety. But these are opaque and hard to grasp, and they don’t translate cleanly across organizations, from a mom-and-pop analytics start-up to a fully fledged Fortune 500 firm leveraging analytics solutions. A better definition of big data comes from Jesse Anderson’s recent book Data Teams. In it, he states that big data is simply when your analytics team says they cannot do something due to data processing constraints. I like to think of it with an analogy.
My own take: imagine that you grew up in Enterprise, Oregon, a rural farming town of approximately 2,000 people. If you were to venture across the border to Washington state, you would think that Seattle is the biggest city on earth. Conversely, if you grew up in New York City, you would think that Seattle is small in comparison to your home city.
This analogy applies to the size of an organization’s data: when you work with small data, everything seems big, and when you work with large data, it takes a far greater magnitude of data ingestion to become overwhelmed. But every organization has a breaking point, the point where the data team throws up their hands and says, “we can’t go any further.” This is where the future of data science lies. To leverage massive amounts of data (the three Vs again: volume, velocity, and variety), we need tools that allow us to compute at scale.
Soaring through the clouds
It is no wonder that scaling up data science workloads involves using someone else’s machine. Your (and my) dinky MacBook Air or glorified HP netbook can only hold about 8 GB of information in local memory (RAM), 16 if you are lucky enough to stuff another memory card into the thing. This creates myriad problems when trying to conduct anything beyond basic querying with larger datasets (over 1 GB). Once we enter this space, a simple Pandas GroupBy command, or its R data.frame equivalent, can render your machine worthless until a reboot.
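Before reaching for someone else’s machine, a common stopgap is out-of-core processing: reading a file in chunks and combining partial aggregates so you never hold the full dataset in RAM. Here is a minimal sketch with pandas; the file name and columns are hypothetical, and a tiny stand-in file is written first so the example is self-contained.

```python
import pandas as pd

# Write a small stand-in file so the sketch runs on its own; in practice
# "events.csv" would be the multi-gigabyte file crushing your laptop.
pd.DataFrame({
    "user": ["a", "b", "a", "c", "b", "a"],
    "spend": [10.0, 5.0, 7.5, 3.0, 2.5, 4.0],
}).to_csv("events.csv", index=False)

# Read the file a few rows at a time and aggregate each chunk separately.
partials = []
for chunk in pd.read_csv("events.csv", chunksize=2):  # tiny chunks for demo
    partials.append(chunk.groupby("user")["spend"].sum())

# Combine the per-chunk sums into one final aggregate.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This keeps peak memory proportional to the chunk size rather than the file size, though it only works for aggregations that can be combined piecewise; it is exactly the kind of constraint that eventually pushes a team to the cloud.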
To gain compute capacity, you need to rent another machine, which is referred to as spinning up a compute instance. These compute instances go by different names, and there are wide varieties of hardware and software that can be leveraged when increasing our data science workload capacity. However, when it comes to whom to rent from, there are really only three major players on the market: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). Of course, Oracle and IBM have their own dedicated clouds, but they don’t come close to competing with the big girls and boys.
When venturing into the cloud, it is best to begin with the two primary services a data scientist will interact with: storage and compute. When developing data science products on the cloud, you will need to store your data somewhere. In fact, what you really want is somewhere you can store your data with as few rules as possible. This is where the concept of the data lake comes into play. Data lakes give you the opportunity to dump data, regardless of file type, into a central location where you can then grab that data and place it into an experiment. The respective offerings here are the following:
- AWS Simple Storage Service (S3)
- Microsoft Azure Blob Storage
- GCP Cloud Storage
Now, don’t get me wrong, being able to store data in its proper home, be it a relational database based on SQL or an unstructured store using NoSQL, will always be preferable to dumping your data in a data lake. Moreover, for those data sources that you want on hand at a moment’s notice, creating a data warehouse will be preferable. But to get up and running with data science experimentation and skill building on the cloud, data lakes are fantastic.
Once you have your data in a safe location, you can start looking at renting compute power. Again, the offerings from our three main players look like this:
- AWS Elastic Compute Cloud (EC2)
- Microsoft Azure Virtual Machines
- GCP Compute Engine
When getting started, you want to learn a few things:
- How do I make sure that I don’t get charged too much money if I accidentally (you will) leave this thing on? There are free-tier compute instances that you can select to avoid unnecessary shocks when looking at your cloud bill.
- How do I ensure the correct ports are open? You will need to open the correct ports when creating a compute instance; otherwise, data science tools like Jupyter Notebook will not be reachable. But leave open ports that you aren’t supposed to, and the whole world can see your instance.
- How do I make things talk to each other? Ensure that you have the correct permissions set up. AWS, Azure, and GCP all have access management portals where the minimum permissions and rules need to be set for your storage and compute instances.
- If I had two more hours to spend on this, what would be the next steps? Knowing what the next level of complexity is can be as important as knowing how to do it. Being able to see how your application naturally builds upon itself and ties into other services will allow you to see the world as a true solutions architect.
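To make the permissions point concrete, here is a sketch of an AWS IAM policy that grants read-only access to a single S3 bucket, the kind of minimal rule you would attach to a compute instance’s role. The bucket name is hypothetical, and Azure and GCP have analogous role-assignment mechanisms in their own access management portals.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-data-lake-bucket",
        "arn:aws:s3:::my-data-lake-bucket/*"
      ]
    }
  ]
}
```

Note that the policy names only the actions and the single bucket it needs; starting this narrow and widening deliberately is far safer than granting broad access and trimming later.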
Abstracting the noise
Oftentimes, you will find yourself on a team that does not want to worry about managing a cluster or HDFS (the Hadoop Distributed File System). Naturally, when jumping into the world of big data and scaling out data science workflows, you need to answer the question of how involved you want to be in the day-to-day management of the big data ecosystem. This is why there has been a recent rise in products that abstract the more nuanced data engineering and system administration aspects of the cloud into a platform that is more user friendly.
Databricks is one mover and shaker on the market that has been receiving significant fanfare and accolades for its ease of use. But don’t let its new-kid-on-the-block persona dissuade you from using it. Started by the original researchers from UC Berkeley who created Apache Spark, the Databricks team has built a product that brings together cloud computing and a familiar notebook environment with ease. For a nominal fee, depending on how much power you seek, you can set the cluster size with the click of a button, create instant autoscaling groups, and serve as your own cluster admin team with little upkeep. To sweeten the pot, there is a free Community Edition, plus a legit Coursera course by Databricks staff that dropped two weeks ago.
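As a rough illustration of how little configuration this takes, a cluster definition with autoscaling and auto-termination might look like the sketch below. The field names follow the Databricks Clusters API, but the values are hypothetical, and the node type shown is AWS-specific; on Azure or GCP you would pick a node type from that cloud instead.

```json
{
  "cluster_name": "ds-experiments",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "autotermination_minutes": 30
}
```

The `autoscale` block is what replaces hand-managed capacity planning, and `autotermination_minutes` is the guard against the “accidentally left it on” bill shock mentioned earlier.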
Apache Zeppelin notebooks are another data science tool that operates on top of distributed systems. Normally a little more difficult to implement, due to the manual creation of HDFS clusters and the permission granting between cloud applications, Zeppelin notebooks let you leverage a Spark environment through PySpark and SparkR. When a company’s requirements impose constraints, such as a contract with a cloud support service, Zeppelin notebooks allow you to operate within your already-existing cluster setup.
While there are more tools that let your company leverage big data compute, distributed file systems, and all the buzzwords that lie between, the prime movers are going to be built on the big three cloud computing platforms (AWS, Azure, GCP). Learning one easily translates to learning the others. Oftentimes, within the data science space, you will find yourself moving from one service to another, and simply knowing the mapping of related terms is often enough to ensure consistency in thought and action when creating new applications across cloud platforms. Personally, I have found that keeping a chart of the naming conventions used across the clouds is quite helpful and ensures continued success in both learning and developing new data science products.
- AWS S3 : Azure Blob Storage
- AWS EC2 : GCP Compute Engine
- AWS SageMaker : Azure ML Studio
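A chart like this can even live in code. Below is a toy lookup table sketching the idea; the category labels and the `equivalent` helper are my own invention, not any cloud’s API, and the table only covers the services named above.

```python
# Map (cloud, service category) -> that cloud's product name.
# Hypothetical structure for illustration; extend with your own rows.
SERVICE_MAP = {
    ("aws", "storage"): "S3",
    ("azure", "storage"): "Blob Storage",
    ("gcp", "storage"): "Cloud Storage",
    ("aws", "compute"): "EC2",
    ("azure", "compute"): "Virtual Machines",
    ("gcp", "compute"): "Compute Engine",
}

def equivalent(service: str, cloud: str, target: str) -> str:
    """Return the target cloud's name for the same category of service."""
    for (c, category), name in SERVICE_MAP.items():
        if c == cloud and name.lower() == service.lower():
            return SERVICE_MAP[(target, category)]
    raise KeyError(f"unknown service {service!r} on {cloud!r}")

print(equivalent("S3", "aws", "azure"))  # Blob Storage
```

It is a trivial structure, but writing the mapping down, in a notebook cell or on a sticky note, is what makes moving between clouds feel like translation rather than relearning.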