Voxxed Days Berlin has ended
Thursday, January 28 • 11:30 - 12:20
Recommendation system on Spark- follow your own way

“Recommendation system on Spark- follow your own way” is a technical description of building a Spark-based recommendation solution for a large media company. The presentation compares the developed solution with the default recommendations provided by Spark MLlib and shows the pros and cons of both. It describes the processing-performance problems encountered with MLlib on medium-size data (hundreds of thousands of items, a million users) and gives a detailed comparison of scalability, performance and resource allocation against the developed system. The presentation also contains tips on how to find bottlenecks in the presented solution, as well as general aspects of optimizing a recommendation system according to Spark architecture and business needs.

Latency is an important factor for any system that interacts with a user, e.g. a website. Therefore, the first part of the presentation explains why at least a subset of recommendations has to be precalculated and saved in a high-availability database. The presentation highlights that this requirement does not fit the MLlib solution. It then gives an overview of the architecture of Agora’s recommendation system and explains particular architecture decisions, not only those related to latency.
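The precalculation idea above can be sketched in a few lines. This is only an illustration under stated assumptions, not Agora's implementation: the function names `precompute_top_n` and `serve`, the score layout, and the use of a plain dictionary as a stand-in for the high-availability database are all hypothetical.

```python
from typing import Dict, List


def precompute_top_n(scores: Dict[str, Dict[str, float]], n: int) -> Dict[str, List[str]]:
    """Batch step: for each user keep only the n best-scoring item ids, best first.

    `scores` maps user id -> {item id -> score}. In a real deployment the result
    would be written to a key-value store; a dict stands in for it here.
    """
    table: Dict[str, List[str]] = {}
    for user, item_scores in scores.items():
        # Sort by descending score; break ties by item id for deterministic output.
        ranked = sorted(item_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        table[user] = [item for item, _ in ranked[:n]]
    return table


def serve(table: Dict[str, List[str]], user: str, fallback: List[str]) -> List[str]:
    """Online step: a single lookup, with a fallback list for unknown users."""
    return table.get(user, fallback)


if __name__ == "__main__":
    batch_scores = {"u1": {"a": 0.9, "b": 0.1, "c": 0.5}}
    store = precompute_top_n(batch_scores, n=2)
    print(serve(store, "u1", fallback=["popular"]))
    print(serve(store, "u2", fallback=["popular"]))
```

The point of the split is that the expensive ranking runs offline, while the request path does no model computation at all, only a key lookup.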

Agora is one of the largest media companies in Poland and has been listed on the Warsaw Stock Exchange since 1999. Its portfolio includes newspapers, outdoor advertising, a network of cinemas, Internet and radio operations, magazines and an on-line bookstore.

The following part of the presentation focuses on algorithm design. First, the presentation explains how the algorithm solves the business need and how to elegantly combine the processing of item-to-item and user-to-item recommendations. It also hints at which Spark functions are helpful when implementing a recommendation algorithm. Afterwards, the presentation scrutinizes the algorithm’s internals. It focuses on the process of optimizing an item-to-item collaborative filtering algorithm so that data processing is fast and efficient without breaking business requirements. The presentation shows how to monitor Spark to maintain proper resource allocation on Hadoop YARN and to reveal bottlenecks in the stages of the algorithm.
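As a rough illustration of item-to-item collaborative filtering, here is a minimal single-machine sketch in plain Python. It assumes binary (user, item) interactions and cosine similarity over co-occurrence counts; the abstract does not state which similarity measure or data layout the speaker used, and the function names are hypothetical. On Spark, the per-user grouping and the pair counting would plausibly map to operations such as `groupByKey` and `reduceByKey`, but that mapping is an assumption as well.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(interactions):
    """Return {(item_a, item_b): cosine similarity} with item_a < item_b.

    `interactions` is an iterable of (user, item) pairs, treated as binary signals.
    """
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)

    item_count = defaultdict(int)   # number of users who touched each item
    pair_count = defaultdict(int)   # number of users who touched both items
    for items in items_by_user.values():
        for item in items:
            item_count[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_count[(a, b)] += 1

    return {
        (a, b): count / sqrt(item_count[a] * item_count[b])
        for (a, b), count in pair_count.items()
    }


def recommend_for_user(user_items, sims, n):
    """Derive user-to-item recommendations from item-to-item similarities.

    Each unseen item is scored by the sum of its similarities to the user's items.
    """
    scores = defaultdict(float)
    for (a, b), sim in sims.items():
        if a in user_items and b not in user_items:
            scores[b] += sim
        elif b in user_items and a not in user_items:
            scores[a] += sim
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]


if __name__ == "__main__":
    data = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u2", "c"), ("u3", "c")]
    sims = item_similarities(data)
    print(sims)
    print(recommend_for_user({"a"}, sims, n=2))
```

The pair-counting loop is quadratic in the size of each user's history, which is one reason such an algorithm typically needs tuning on larger data; reusing the same similarity table for both item-to-item and user-to-item output is one simple way the two kinds of recommendations can share a computation.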

The final part of the presentation contains detailed results achieved by Agora’s solution compared to MLlib (adapted to the chosen architecture). The comparison covers scalability with respect to data set size, processing time and resource allocation.
The presentation is a solid case study on how to design and create a scalable recommendation system using proper components. A chosen solution has to be monitored and optimized, so the presentation also contains hints and advice on how to do this.

Wojciech Indyk

Big Data Engineer / Data Scientist, Agora S.A.
My professional interests are massive dataset processing (DAG, MapReduce, Bulk Synchronous Parallel), relational machine learning, graph structures and algorithms, data mining, software architecture in the big data ecosystem and distributed databases. SUMMARY of experience...

Thursday January 28, 2016 11:30 - 12:20
Kosmos - Room 02
