Weka - Data Mining Tool

Weka is a tool for big data and data mining. It is used for various classification tasks, experiments, and analyses over large data sets.

Installation Guide - Weka

You can download Weka from here and follow the normal installation procedure. After installation completes you will see the main window, where you can begin your classification or experiments on different data sets with Weka.
Analysis of a Distributed System - Yahoo! S4
S4 was designed in the context of the Yahoo! search engine to support data mining and machine learning algorithms, drawing on the MapReduce model. It makes it possible to parallelize and distribute processing tasks and operations across immense clusters with little or no human intervention for issues like failover management. S4 is a low-latency, scalable stream processing engine that processes the event flow automatically at the incoming data rate.
Unlike Hadoop (the popular batch processing system built on MapReduce), S4 processes unbounded streams of events; MapReduce systems typically operate on static data by scheduling batch jobs. To approximate stream processing, a batch platform must partition the input data into fixed-size segments to be processed as MapReduce jobs, where latency is proportional to the length of the segment plus the overhead of segmentation and job startup; this amounts to a tradeoff between latency and segmentation cost.
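The latency tradeoff described above can be sketched numerically. The formulas and figures below are illustrative assumptions, not measurements from S4 or Hadoop: in a segmented batch pipeline an event waits, on average, for half a segment to accumulate and then pays a fixed per-segment setup overhead, while a streaming engine pays only a per-event cost.

```java
public class LatencyTradeoff {
    /** Average latency for events batched into fixed-size segments. */
    static double batchLatencyMs(int segmentLength, double eventIntervalMs,
                                 double segmentationOverheadMs) {
        // An event waits, on average, for half a segment to accumulate,
        // then for segmentation/job-startup overhead before processing.
        double avgWait = (segmentLength / 2.0) * eventIntervalMs;
        return avgWait + segmentationOverheadMs;
    }

    /** Per-event latency in a streaming engine: only the per-event cost. */
    static double streamLatencyMs(double perEventCostMs) {
        return perEventCostMs;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: one event every 10 ms, 500 ms of job overhead.
        System.out.println("batch (100-event segments): "
                + batchLatencyMs(100, 10.0, 500.0) + " ms");
        System.out.println("stream (per event):         "
                + streamLatencyMs(2.0) + " ms");
    }
}
```

Shrinking the segment lowers the waiting term but pays the fixed overhead more often, which is exactly the tradeoff the batch approach cannot escape.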
S4 shares its big data and data mining purpose with IBM's Stream Processing Core (SPC) middleware, but the two have primary architectural design differences: while IBM SPC is derived from a subscription model, S4 is a combination of the MapReduce and Actor models.
S4 has a simple and elegant cluster management system, with no single centralized node in the cluster; this is accomplished by leveraging ZooKeeper, whose coordination service can also be shared by multiple users across a data center.
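The symmetric cluster model above can be sketched in a few lines. This is a self-contained simulation, not the real ZooKeeper API: every node is identical, there is no master, and when a node disappears its task partitions are deterministically re-mapped onto the survivors, which is the kind of failover that ZooKeeper's group membership makes automatic in S4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SymmetricCluster {
    /** Map each task partition onto a live node by position; no coordinator. */
    static Map<Integer, String> assign(List<String> liveNodes, int partitions) {
        Map<Integer, String> assignment = new TreeMap<>();
        for (int p = 0; p < partitions; p++) {
            assignment.put(p, liveNodes.get(p % liveNodes.size()));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>(List.of("node-a", "node-b", "node-c"));
        System.out.println("before failure: " + assign(nodes, 6));
        nodes.remove("node-b");          // node-b dies; membership change observed
        System.out.println("after failure:  " + assign(nodes, 6)); // re-mapped
    }
}
```

Because every node runs the same assignment function over the same membership view, no node needs to be told what to do when another fails.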
An S4 stream is defined as a sequence of events in the form of tuples of keyed attribute values; routing events by these key values is what allows input to be consumed with minimal latency.
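A keyed event of this kind might look like the sketch below. The stream and attribute names are hypothetical, not taken from the S4 API; the point is that the partition an event lands on is derived from its key alone.

```java
import java.util.Map;

public class KeyedEvent {
    final String stream;                  // e.g. "QueryCounts" (hypothetical)
    final String key;                     // the keyed attribute value
    final Map<String, Object> attributes; // remaining tuple attributes

    KeyedEvent(String stream, String key, Map<String, Object> attributes) {
        this.stream = stream;
        this.key = key;
        this.attributes = attributes;
    }

    /** Pick a target partition for this event from its key alone. */
    int partition(int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        KeyedEvent e = new KeyedEvent("QueryCounts", "weka", Map.of("count", 1));
        System.out.println(e.stream + "[" + e.key + "] -> partition "
                + e.partition(8));
    }
}
```

Two events with the same key always hash to the same partition, so all state for one key value stays in one place.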
The S4 design (for the Yahoo! search engine) includes Processing Elements (PEs), the basic computational units that consume and emit events; Processing Nodes (PNs), the logical hosts that run processing elements and listen for events; a Communication Layer, which maps logical nodes to physical nodes, automatically re-maps them on failure, and coordinates between nodes using ZooKeeper; and Configuration Management, the point of human intervention for setting up and tearing down clusters for S4 tasks.
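The processing element abstraction can be sketched with the classic word-count example: a PE instance is created per key value, and the processing node dispatches each event to the instance that owns its key. The class and method names below are illustrative, not the actual S4 API.

```java
import java.util.HashMap;
import java.util.Map;

public class ProcessingNodeSketch {
    /** One PE instance per key value; this one just counts events for its key. */
    static class CountPE {
        final String key;
        long count = 0;
        CountPE(String key) { this.key = key; }
        void processEvent() { count++; }   // user-defined event handler
    }

    // The node's PE table: key value -> PE instance (created on first event).
    final Map<String, CountPE> peInstances = new HashMap<>();

    void dispatch(String key) {
        peInstances.computeIfAbsent(key, CountPE::new).processEvent();
    }

    public static void main(String[] args) {
        ProcessingNodeSketch node = new ProcessingNodeSketch();
        for (String word : new String[]{"s4", "stream", "s4", "s4"}) {
            node.dispatch(word);
        }
        System.out.println("s4 -> " + node.peInstances.get("s4").count);         // 3
        System.out.println("stream -> " + node.peInstances.get("stream").count); // 1
    }
}
```

Creating PE instances lazily, on the first event for a key, is what lets the key space stay unbounded without pre-allocating anything.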
As for the programming model, the S4 processing element APIs are written in Java, while the communication layer APIs come with bindings for several programming languages (e.g., Java, C++).
Furthermore, the implementation of S4 has been exercised in online parameter optimization, which tunes the search advertising system to surface more favorable content results.
Overall, the Yahoo! S4 architecture is simple and elegant for its search engine, and it has run successfully on real traffic slices of a search advertising system, where slices are partitioned by user space over thousands of users per day. It still has open issues: the tradeoff between latency and segmentation needs improvement, and processing element migration is fragile and needs to be made more robust and sustainable. Moreover, S4 uses static routing and lacks dynamic load balancing.