Apache Spark (http://spark.apache.org) is currently among the fastest-growing projects in the Big Data ecosystem. It makes processing large data sets faster and easier than existing solutions. This workshop will jump-start your work with Spark and help you transition from analyst or developer to Big Data engineer.
Introduction to Big Data
  Definition
  What is Big Data?
  History of Big Data
  Big Data problems
Apache Spark
  Introduction
  History
  Spark vs Hadoop
  Resilient Distributed Datasets (RDDs)
  Architecture
  Deployment modes
  Administration
Spark Core
  Introduction
  Java vs Scala vs Python
  Connecting to a cluster
  Dataset distribution
  RDD operations
  Shared variables
  Execution and testing
Spark SQL
  Introduction
  Spark SQL vs Hive
  Basic operations
  Data and schema
  Queries
  Hive integration
  Execution and testing
Latency analysis is the act of blaming components for causing user-perceptible delay. In today's world of microservices, this can be tricky, as requests can fan out across polyglot components and even data centers. In many cases, the root source of latency isn't a component, but rather a link between components.
This session will show how to debug latency problems using call graphs created by Zipkin. We'll trace Zipkin itself, setting it up from scratch using Docker. Along the way, we'll discuss how the data model works and how to safely trace production. We'll then survey the ecosystem, including tools to trace Ruby, C#, Java, and Spring Boot apps, and wrap up with a look at simulation with Spigo and future work in distributed context propagation.
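For reference, the from-scratch setup mentioned above can be as simple as running the official Zipkin Docker image (this assumes Docker is installed; 9411 is Zipkin's default port):

```shell
# Pull and start a self-contained Zipkin server in the background.
docker run -d -p 9411:9411 openzipkin/zipkin
# The Zipkin UI is then reachable at http://localhost:9411
```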
When you leave, you'll at least know something about distributed tracing, and hopefully be on your way to blaming things for causing latency!