Latest Apache Storm Interview Questions

Which Components Are Used For The Stream Flow Of Data?

Three components are used to stream data through a Storm topology:

Bolt :-

Bolts represent the processing logic in Storm. Bolts can do any kind of processing, such as filtering, aggregating, joining, interacting with data stores, and talking to external systems. Bolts can also emit tuples (data messages) for subsequent bolts to process. Additionally, bolts are responsible for acknowledging tuples once they have finished processing them.

Spout :-

Spouts represent the source of data in Storm. You can write spouts to read data from sources such as databases, distributed file systems, and messaging frameworks. Spouts can broadly be classified into the following:

Reliable –

These spouts can replay tuples (a tuple is a unit of data in the stream). This lets applications achieve ‘at least once’ message processing semantics: if a failure occurs, tuples can be replayed and processed again. Spouts that fetch data from messaging frameworks are generally reliable, because these frameworks provide a mechanism for replaying messages.

Unreliable –

These spouts cannot replay tuples. Once a tuple is emitted, it cannot be replayed, whether or not it was processed successfully. This type of spout follows ‘at most once’ message processing semantics.
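The difference between the two kinds of spout comes down to whether emitted messages are remembered until they are acknowledged. The following is a minimal plain-Java sketch of that idea, not Storm's spout interface: ReplayingSource and all of its method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a replay-capable source; not Storm's spout interface.
final class ReplayingSource {
    private final Deque<String> queue = new ArrayDeque<>();     // messages waiting to be emitted
    private final Map<Long, String> pending = new HashMap<>();  // emitted but not yet acked
    private long nextId = 0;

    void offer(String message) {
        queue.addLast(message);
    }

    // Emit the next message and remember it under a fresh id until it is acked.
    // Returns the id, or -1 when nothing is waiting.
    long emit() {
        String message = queue.pollFirst();
        if (message == null) {
            return -1;
        }
        long id = nextId++;
        pending.put(id, message);
        return id;
    }

    // Processing succeeded: the message can be forgotten.
    void ack(long id) {
        pending.remove(id);
    }

    // Processing failed: put the message back so that it is emitted again.
    void fail(long id) {
        String message = pending.remove(id);
        if (message != null) {
            queue.addFirst(message);
        }
    }

    int pendingCount() { return pending.size(); }
    int queuedCount() { return queue.size(); }
}
```

An unreliable source would skip the pending map entirely, which is why it has nothing to replay after a failure.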

Tuple :-

The tuple is the main data structure in Storm. A tuple is a named list of values, where each value can be any type. Tuples are dynamically typed — the types of the fields do not need to be declared.

Tuples have helper methods like getInteger and getString to get field values without having to cast the result. Storm needs to know how to serialize all the values in a tuple. By default, Storm knows how to serialize the primitive types, strings, and byte arrays. If you want to use another type, you’ll need to implement and register a serializer for that type.
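As a rough illustration of the "named list of values" idea, the plain-Java sketch below pairs a list of field names with a list of values and adds typed getters that do the cast for the caller. SimpleTuple and its methods are hypothetical stand-ins written for this example; they are not Storm's Tuple class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; this is not Storm's Tuple class.
final class SimpleTuple {
    private final List<String> fields;  // field names, in order
    private final List<Object> values;  // one value per field, any type

    SimpleTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one value is required per field");
        }
        this.fields = fields;
        this.values = values;
    }

    // Look a value up by field name; the caller would have to cast the result.
    Object getValueByField(String field) {
        int i = fields.indexOf(field);
        if (i < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(i);
    }

    // Typed helpers that do the cast on the caller's behalf.
    String getStringByField(String field) {
        return (String) getValueByField(field);
    }

    Integer getIntegerByField(String field) {
        return (Integer) getValueByField(field);
    }

    public static void main(String[] args) {
        SimpleTuple t = new SimpleTuple(
                Arrays.asList("word", "count"),
                Arrays.<Object>asList("storm", 3));
        System.out.println(t.getStringByField("word") + " -> " + t.getIntegerByField("count"));
    }
}
```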

What Are The Key Benefits Of Using Storm For Real Time Processing?

Easy to operate :

Operating Storm is quite easy.

Real fast :

A benchmark clocked it at over a million tuples processed per second per node.

Fault Tolerant :

Storm detects failures automatically and restarts the affected workers.

Reliable :

It guarantees that each unit of data will be processed at least once or exactly once.

Scalable :

It runs across a cluster of machines.

Does Apache Act As A Proxy Server?

Yes. The Apache HTTP Server (a separate project from Apache Storm) can act as a proxy through the mod_proxy module, which implements a proxy, gateway or cache for Apache. It provides proxying capability for AJP13 (Apache JServ Protocol version 1.3), FTP, CONNECT (for SSL), HTTP/0.9, HTTP/1.0, and (since Apache 1.3.23) HTTP/1.1. The module can be configured to connect to other proxy modules for these and other protocols.

What Is The Use Of Zookeeper In Storm?

Storm uses Zookeeper to coordinate the cluster. Zookeeper is not used for message passing, so the load that Storm places on Zookeeper is quite low. A single-node Zookeeper cluster should be sufficient for most cases, but if you want failover or are deploying large Storm clusters you may want a larger Zookeeper cluster. Instructions for deploying Zookeeper are in the ZooKeeper Administrator's Guide.

A few notes about Zookeeper deployment :

It’s critical that you run Zookeeper under supervision, since Zookeeper is fail-fast and will exit the process if it encounters any error case.

It’s critical that you set up a cron to compact Zookeeper’s data and transaction logs. The Zookeeper daemon does not do this on its own, and if you don’t set up a cron, Zookeeper will quickly run out of disk space.

What Is ZeroMQ?

ZeroMQ is “a library which extends the standard socket interfaces with features traditionally provided by specialized messaging middleware products”. Early Storm releases relied on ZeroMQ primarily for task-to-task communication in running topologies; later releases use Netty as the default transport instead.

How Many Distinct Layers Are There In Storm’s Codebase?

There are three distinct layers to Storm’s codebase:

First :

Storm was designed from the very beginning to be compatible with multiple languages. Nimbus is a Thrift service and topologies are defined as Thrift structures. The usage of Thrift allows Storm to be used from any language.

Second :

All of Storm’s interfaces are specified as Java interfaces. So even though there’s a lot of Clojure in Storm’s implementation, all usage must go through the Java API. This means that every feature of Storm is always available via Java.

Third :

Storm’s implementation is largely in Clojure. Line-wise, Storm is about half Java code, half Clojure code. But Clojure is much more expressive, so in reality the great majority of the implementation logic is in Clojure.

What Does It Mean For A Message To Be “Fully Processed”?

A tuple coming off a spout can trigger thousands of tuples to be created based on it. Consider, for example, the streaming word count topology:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com",
                                               22133,
                                               "sentence_queue",
                                               new StringScheme()));
builder.setBolt("split", new SplitSentence(), 10)
        .shuffleGrouping("sentences");
builder.setBolt("count", new WordCount(), 20)
        .fieldsGrouping("split", new Fields("word"));

This topology reads sentences off a Kestrel queue, splits the sentences into their constituent words, and then emits for each word the number of times it has seen that word before. A tuple coming off the spout triggers many tuples being created based on it: a tuple for each word in the sentence and a tuple for the updated count for each word.

Storm considers a tuple coming off a spout “fully processed” when the tuple tree has been exhausted and every message in the tree has been processed.

A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout. This timeout can be configured on a topology-specific basis using the Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS configuration and defaults to 30 seconds.
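To make the idea of an exhausted tuple tree concrete, here is a small plain-Java bookkeeping sketch. It only counts how many tuples in a tree are still unacknowledged. TupleTreeTracker and its methods are hypothetical names for this illustration, Storm's own acker works differently internally, and the timeout is not modeled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping sketch; Storm's own acker is implemented differently.
final class TupleTreeTracker {
    // For each spout tuple id, the number of tuples in its tree still awaiting an ack.
    private final Map<String, Integer> outstanding = new HashMap<>();

    // A spout tuple starts a tree with exactly one outstanding tuple: itself.
    void spoutEmitted(String rootId) {
        outstanding.put(rootId, 1);
    }

    // A bolt emitted a new tuple anchored to this tree.
    void anchored(String rootId) {
        outstanding.merge(rootId, 1, Integer::sum);
    }

    // A bolt acked one tuple of the tree.
    void acked(String rootId) {
        outstanding.merge(rootId, -1, Integer::sum);
    }

    // The spout tuple is fully processed once nothing in its tree is outstanding.
    boolean fullyProcessed(String rootId) {
        Integer n = outstanding.get(rootId);
        return n != null && n == 0;
    }
}
```

For a two-word sentence, the split step would call anchored twice and then acked once for the sentence itself; the tree is exhausted only after both word tuples have been acked as well.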

When Do You Call The Cleanup Method?

The cleanup method is called when a Bolt is being shut down and should clean up any resources that were opened. There’s no guarantee that this method will be called on the cluster: for instance, if the machine the task is running on blows up, there’s no way to invoke the method.

The cleanup method is intended for when you run topologies in local mode (where a Storm cluster is simulated in process), and you want to be able to run and kill many topologies without suffering any resource leaks.

How Can We Kill A Topology?

To kill a topology, simply run:

storm kill {stormname}

Give the same name to storm kill as you used when submitting the topology.

Storm won’t kill the topology immediately. Instead, it deactivates all the spouts so that they don’t emit any more tuples, and then Storm waits Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS seconds before destroying all the workers. This gives the topology enough time to complete any tuples it was processing when it got killed.

What Is CombinerAggregator?

A CombinerAggregator is used to combine a set of tuples into a single field. It has the following signature:

public interface CombinerAggregator<T> extends Serializable {
    T init(TridentTuple tuple);
    T combine(T val1, T val2);
    T zero();
}

Storm calls the init() method with each tuple, and then repeatedly calls the combine() method until the partition is processed. The values passed into the combine() method are partial aggregations, the result of combining the values returned by calls to init(). If the partition contains no tuples, the output of zero() is emitted.
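A counting aggregator is a common way to see how the three methods fit together. The sketch below is a self-contained approximation, not Storm code: Combiner is a local stand-in for the interface above with a generic input type in place of TridentTuple, and CountCombiner and CombinerDemo are hypothetical names. Starting the fold from zero() is a simplification chosen for this example.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in for the interface above; the input type I replaces TridentTuple
// so that the sketch compiles without Storm on the classpath.
interface Combiner<I, T> {
    T init(I input);
    T combine(T val1, T val2);
    T zero();
}

// Counts inputs: each input contributes 1, and partial counts are added together.
final class CountCombiner<I> implements Combiner<I, Long> {
    public Long init(I input) { return 1L; }
    public Long combine(Long val1, Long val2) { return val1 + val2; }
    public Long zero() { return 0L; }
}

final class CombinerDemo {
    // Folds one partition: init() per input, then combine() with the running value.
    static <I, T> T aggregate(Combiner<I, T> agg, List<I> partition) {
        T result = agg.zero();
        for (I input : partition) {
            result = agg.combine(result, agg.init(input));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> partition = Arrays.asList("a", "b", "c");
        System.out.println(aggregate(new CountCombiner<String>(), partition)); // prints 3
    }
}
```

Because combine() only ever sees partial results, the same aggregator could in principle be applied to partial counts from several partitions and then combined again.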

What Are The Common Configurations In Apache Storm?

There are a variety of configurations you can set per topology. The full list can be found in the Javadoc for Storm’s Config class. The ones prefixed with “TOPOLOGY” can be overridden on a topology-specific basis (the other ones are cluster configurations and cannot be overridden).

Here are some common ones that are set for a topology:

1 Config.TOPOLOGY_WORKERS :

This sets the number of worker processes to use to execute the topology. For example, if you set this to 25, there will be 25 Java processes across the cluster executing all the tasks. If the components in the topology have a combined parallelism of 150, each worker process will have 6 tasks running within it as threads.

2 Config.TOPOLOGY_ACKER_EXECUTORS :

This sets the number of executors that will track tuple trees and detect when a spout tuple has been fully processed. If this variable is not set, or is set to null, Storm sets the number of acker executors equal to the number of workers configured for this topology.

If this variable is set to 0, then Storm will immediately ack tuples as soon as they come off the spout, effectively disabling reliability.

3 Config.TOPOLOGY_MAX_SPOUT_PENDING :

This sets the maximum number of spout tuples that can be pending on a single spout task at once (pending means the tuple has not been acked or failed yet). It is highly recommended you set this config to prevent queue explosion.

4 Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS :

This is the maximum amount of time a spout tuple has to be fully completed before it is considered failed. This value defaults to 30 seconds, which is sufficient for most topologies.

5 Config.TOPOLOGY_SERIALIZATIONS :

You can register more serializers to Storm using this config so that you can use custom types within tuples.
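The settings above can be pictured as entries in a key-value map. The sketch below uses a plain Java map with the string keys that correspond to the Config constants, and reproduces the 150 / 25 = 6 arithmetic from the first item. TopologyConfigSketch is a hypothetical class for illustration, and the numeric values are examples only, not recommendations; a real topology would use Storm's Config class instead.

```java
import java.util.HashMap;
import java.util.Map;

final class TopologyConfigSketch {
    // Builds a plain map with the string keys behind the Config constants above.
    // The numbers are illustrative, not recommendations.
    static Map<String, Object> exampleConfig() {
        Map<String, Object> conf = new HashMap<>();
        conf.put("topology.workers", 25);               // Config.TOPOLOGY_WORKERS
        conf.put("topology.acker.executors", 25);       // Config.TOPOLOGY_ACKER_EXECUTORS
        conf.put("topology.max.spout.pending", 1000);   // Config.TOPOLOGY_MAX_SPOUT_PENDING
        conf.put("topology.message.timeout.secs", 30);  // Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS
        return conf;
    }

    // Tasks per worker when the total parallelism divides evenly, as in the example above.
    static int tasksPerWorker(int totalParallelism, int workers) {
        return totalParallelism / workers;
    }

    public static void main(String[] args) {
        int workers = (Integer) exampleConfig().get("topology.workers");
        System.out.println(tasksPerWorker(150, workers)); // prints 6
    }
}
```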

Is It Necessary To Kill The Topology While Updating The Running Topology?

Yes, to update a running topology, the only option currently is to kill the current topology and resubmit a new one. A planned feature is to implement a Storm swap command that swaps a running topology with a new one, ensuring minimal downtime and no chance of both topologies processing tuples at the same time.

How Can Storm UI Be Used With A Topology?

Storm UI is used to monitor topologies. It provides information about errors happening in tasks and fine-grained stats on the throughput and latency performance of each component of each running topology.

Why Does Apache Not Include SSL?

This question dates from the Apache 1.3 era; Apache HTTP Server 2.0 and later ship with the mod_ssl module. SSL (Secure Socket Layer) data transport requires encryption, and many governments have restrictions upon the import, export, and use of encryption technology.

If Apache 1.3 had included SSL in the base package, its distribution would have involved all sorts of legal and bureaucratic issues, and it would no longer have been freely available. Also, some of the technology required to talk to clients using SSL was patented at the time by RSA Data Security, who restricted its use without a license.

Does Apache Include Any Sort Of Database Integration?

Apache is a Web (HTTP) server, not an application server. The base package does not include any such functionality. The PHP project and the mod_perl project allow you to work with databases from within the Apache environment.

How To Check httpd.conf For Consistency And Errors?

The syntax of the httpd configuration file can be checked with httpd -t. To see how Apache parsed the file, use the following command.

httpd -S

This command dumps out a description of how Apache parsed the configuration file. Careful examination of the IP addresses and server names may help uncover configuration mistakes.
