Yesterday at the Impact Hub Zürich - Viadukt, we listened to Philipp Brunenberg, an Apache Spark enthusiast, who gave a talk titled "Inside Spark Core: Understanding Spark to Write Better-Performing Code". The talk introduced the Meetup group to Spark configuration options, data decomposition, and common performance problems.
Philipp Brunenberg supports his clients as a freelance data science and big data consultant, solving data-driven problems, creating innovative applications, and teaching teams how to write scalable applications. As a speaker, he presents at various events to help people gain a better understanding of how Spark is designed and how it works.
Apache Spark is an open-source distributed big data analytics engine. It is written mainly in Scala and offers APIs in Scala, Java, Python, and R. The execution framework works with the file system to distribute data across the cluster and process it in parallel. Like MapReduce, it takes a set of instructions from an application written by a developer. Apache Spark is considered cutting-edge technology and might well be the future of analytics; it appeals to early adopters and to people who are passionate about the latest and greatest in technology.
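To make this concrete, here is a minimal sketch of the classic word count in Scala, Spark's native language. The application name and input path are placeholder assumptions; the point is that the transformations are lazy and only the final action triggers parallel execution across partitions.

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Entry point to Spark; "local[*]" runs on all local cores.
    // On a real cluster, the master is set via spark-submit instead.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path; Spark splits the file into partitions
    // that are processed in parallel across the cluster
    val lines = sc.textFile("hdfs:///data/input.txt")

    // Transformations are lazy; nothing runs until an action is called
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // take() is an action and triggers the distributed computation
    counts.take(10).foreach(println)
    spark.stop()
  }
}
```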
When operations run on Spark, intermediate results can be kept in memory instead of being written to disk, so developers can keep operating on the same data quickly. This yields dramatic performance improvements: Spark excels at programming models involving iteration or interactivity, and it integrates with HDFS for storage and YARN for resource management. Because it uses memory efficiently, the benchmark results are impressive: with binary data and an in-memory HDFS instance, Spark beats Hadoop by a factor of 20, and still by a factor of 10 when memory is unavailable and it has to fall back to disk.

The core of Spark is not only the cluster-computing framework, but also the amazing Spark community that shares, teaches, and learns together. At the heart of this community are the Spark Meetup organizers who continue to dedicate their time, resources, and effort, like Onedot's Tobias Widmer or Wolfram Willuhn, Head of Data Science at FlavorWiki. These Meetups bring top tech talent together in one place.
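Returning to Spark's in-memory model: here is a minimal caching sketch in Scala. The input path and the toy iteration are assumptions for illustration; the point is that persist() keeps the parsed partitions in RAM, so repeated passes avoid re-reading the file from disk.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object IterativeCaching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativeCaching")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical dataset: one number per line
    val numbers = sc.textFile("hdfs:///data/numbers.txt").map(_.toDouble)

    // Keep the parsed partitions in RAM; without this, every
    // pass below would re-read and re-parse the file from disk
    numbers.persist(StorageLevel.MEMORY_ONLY)

    // A toy iterative job: each pass scans the cached data again
    for (i <- 1 to 5) {
      val m = numbers.mean()                     // action: scans the cached data
      val above = numbers.filter(_ > m).count()  // another scan, served from memory
      println(s"pass $i: mean=$m, values above mean=$above")
    }

    numbers.unpersist()
    spark.stop()
  }
}
```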
After a short introduction, Philipp Brunenberg walked through different configurations, explaining how to break a workload into small tasks that can be parallelized, and discussed the sorting of intermediate results, performance, and stragglers. He addressed bad data modeling, document length, buffer sizes, and GC overhead. He closed with an explanation of how to handle slow shuffles and how to use SparkLint as a monitoring solution. Questions about specific Spark issues were answered in a wide group discussion, and after the presentation in smaller groups or individually. People chatted about technology, opportunities, and how to improve their Spark skills.
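The exact settings Philipp recommended are not reproduced here, but a sketch of the kind of tuning discussed (partitioning to mitigate stragglers and slow shuffles, serializer buffers) could look like the following. The numeric values are illustrative assumptions, not recommendations; the right numbers depend on cluster size and data volume.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShuffleTuning")
      .master("local[*]")
      // More, smaller shuffle tasks can smooth out stragglers on skewed data
      .config("spark.sql.shuffle.partitions", "400")
      // Default parallelism for RDD-level operations such as reduceByKey
      .config("spark.default.parallelism", "400")
      // Kryo is faster than Java serialization; a larger buffer limit
      // helps avoid serialization buffer overflow errors on large records
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("spark.kryoserializer.buffer.max", "256m")
      .getOrCreate()

    // Repartitioning before a wide operation spreads skewed keys
    // across more tasks, which mitigates slow shuffles
    val df = spark.range(0, 1000000).toDF("id")
    println(df.repartition(400).rdd.getNumPartitions) // prints 400

    spark.stop()
  }
}
```

A tool like SparkLint can then be used to check how evenly tasks are actually distributed across the cluster.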
The presentation was on point, with a very appealing approach to both the topic and the participants, and everyone could take something home from the Meetup. Philipp Brunenberg's visit and his introduction to understanding Apache Spark enriched the Zürich tech landscape.