Weka - Data Mining Tool

Weka is a tool for big data and data mining. It is used for various classification tasks, experiments, and analyses over large data sets.

Installation Guide - Weka

You can download Weka from here and follow the normal installation procedure. After installation completes you will see the main window, where you can begin your classification or experiments on different data sets with Weka.
Analysis of a Distributed System - Yahoo! S4
S4 was designed in the context of the Yahoo! search engine to support data mining and machine learning algorithms, drawing on the MapReduce model. It makes it possible to parallelize and distribute processing tasks and operations across immense clusters with little or no human intervention for issues like failover management. S4 is a low-latency, scalable stream processing engine that processes the event flow automatically at the incoming data rate.
Unlike Hadoop (the popular batch processing system built on MapReduce), S4 processes unbounded streams of events; MapReduce systems typically operate on static data by scheduling batch jobs. To approximate stream processing, a batch platform must partition the input data into fixed-size segments to be processed as MapReduce jobs, where latency is proportional to the length of the segment plus the overhead of segmentation and job startup; this amounts to a tradeoff between latency and segmentation cost.
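The latency tradeoff described above can be sketched numerically. The formulas and figures below are illustrative assumptions, not measurements from S4 or Hadoop: in a segmented batch pipeline an event waits, on average, for half a segment to accumulate and then pays a fixed per-segment setup overhead, while a streaming engine pays only a per-event cost.

```java
public class LatencyTradeoff {
    /** Average latency for events batched into fixed-size segments. */
    static double batchLatencyMs(int segmentLength, double eventIntervalMs,
                                 double segmentationOverheadMs) {
        // An event waits, on average, for half a segment to accumulate,
        // then for segmentation/job-startup overhead before processing.
        double avgWait = (segmentLength / 2.0) * eventIntervalMs;
        return avgWait + segmentationOverheadMs;
    }

    /** Per-event latency in a streaming engine: only the per-event cost. */
    static double streamLatencyMs(double perEventCostMs) {
        return perEventCostMs;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: one event every 10 ms, 500 ms of job overhead.
        System.out.println("batch (100-event segments): "
                + batchLatencyMs(100, 10.0, 500.0) + " ms");
        System.out.println("stream (per event):         "
                + streamLatencyMs(2.0) + " ms");
    }
}
```

Shrinking the segment lowers the waiting term but pays the fixed overhead more often, which is exactly the tradeoff the batch approach cannot escape.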
S4 shares its big data and data mining purpose with IBM's Stream Processing Core (SPC) middleware, but the two have primary architectural design differences: while IBM SPC is derived from a subscription model, S4 is a combination of the MapReduce and Actor models.
S4 has a simple and elegant cluster management system, with no single centralized node in the cluster; this is accomplished by leveraging ZooKeeper, whose coordination service can also be shared by multiple users across a data center.
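The symmetric cluster model above can be sketched in a few lines. This is a self-contained simulation, not the real ZooKeeper API: every node is identical, there is no master, and when a node disappears its task partitions are deterministically re-mapped onto the survivors, which is the kind of failover that ZooKeeper's group membership makes automatic in S4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SymmetricCluster {
    /** Map each task partition onto a live node by position; no coordinator. */
    static Map<Integer, String> assign(List<String> liveNodes, int partitions) {
        Map<Integer, String> assignment = new TreeMap<>();
        for (int p = 0; p < partitions; p++) {
            assignment.put(p, liveNodes.get(p % liveNodes.size()));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>(List.of("node-a", "node-b", "node-c"));
        System.out.println("before failure: " + assign(nodes, 6));
        nodes.remove("node-b");          // node-b dies; membership change observed
        System.out.println("after failure:  " + assign(nodes, 6)); // re-mapped
    }
}
```

Because every node runs the same assignment function over the same membership view, no node needs to be told what to do when another fails.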
An S4 stream is defined as a sequence of events in the form of tuples of keyed attribute values; routing events by these key values is what allows input to be consumed with minimal latency.
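A keyed event of this kind might look like the sketch below. The stream and attribute names are hypothetical, not taken from the S4 API; the point is that the partition an event lands on is derived from its key alone.

```java
import java.util.Map;

public class KeyedEvent {
    final String stream;                  // e.g. "QueryCounts" (hypothetical)
    final String key;                     // the keyed attribute value
    final Map<String, Object> attributes; // remaining tuple attributes

    KeyedEvent(String stream, String key, Map<String, Object> attributes) {
        this.stream = stream;
        this.key = key;
        this.attributes = attributes;
    }

    /** Pick a target partition for this event from its key alone. */
    int partition(int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        KeyedEvent e = new KeyedEvent("QueryCounts", "weka", Map.of("count", 1));
        System.out.println(e.stream + "[" + e.key + "] -> partition "
                + e.partition(8));
    }
}
```

Two events with the same key always hash to the same partition, so all state for one key value stays in one place.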
The S4 design (for the Yahoo! search engine) includes Processing Elements (PEs), the basic computational units that consume and emit events; Processing Nodes (PNs), the logical hosts that run processing elements and listen for events; a Communication Layer, which maps logical nodes to physical nodes, automatically re-maps them on failure, and coordinates between nodes using ZooKeeper; and Configuration Management, the point of human intervention for setting up and tearing down clusters for S4 tasks.
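The processing element abstraction can be sketched with the classic word-count example: a PE instance is created per key value, and the processing node dispatches each event to the instance that owns its key. The class and method names below are illustrative, not the actual S4 API.

```java
import java.util.HashMap;
import java.util.Map;

public class ProcessingNodeSketch {
    /** One PE instance per key value; this one just counts events for its key. */
    static class CountPE {
        final String key;
        long count = 0;
        CountPE(String key) { this.key = key; }
        void processEvent() { count++; }   // user-defined event handler
    }

    // The node's PE table: key value -> PE instance (created on first event).
    final Map<String, CountPE> peInstances = new HashMap<>();

    void dispatch(String key) {
        peInstances.computeIfAbsent(key, CountPE::new).processEvent();
    }

    public static void main(String[] args) {
        ProcessingNodeSketch node = new ProcessingNodeSketch();
        for (String word : new String[]{"s4", "stream", "s4", "s4"}) {
            node.dispatch(word);
        }
        System.out.println("s4 -> " + node.peInstances.get("s4").count);         // 3
        System.out.println("stream -> " + node.peInstances.get("stream").count); // 1
    }
}
```

Creating PE instances lazily, on the first event for a key, is what lets the key space stay unbounded without pre-allocating anything.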
As for the programming model, the S4 processing element APIs are written in Java, while the communication layer APIs come with bindings for several programming languages (e.g., Java, C++).
Furthermore, the implementation of S4 has been exercised in online parameter optimization, which tunes the search advertising system to surface more favorable content results.
Overall, the Yahoo! S4 architecture is simple and elegant for its search engine, and it has run successfully on real traffic slices of a search advertising system, where slices are partitioned by user space over thousands of users per day. It still has open issues: the tradeoff between latency and segmentation needs improvement, and processing element migration is fragile and needs to be made more robust and sustainable. Moreover, S4 uses static routing and lacks dynamic load balancing.