This lab demonstrates leveraging and implementing Kafka services for static data along with real-time Twitter streaming.
Apache Kafka is a streaming message platform: a publish-subscribe based, durable messaging system. Kafka is designed to be high-performance, highly available, and redundant, and is used to collect, process, store, and integrate data at scale. A messaging system sends messages between processes, applications, and servers.
Its basic use cases include:
- Stream Processing
- Messaging
- Website Activity Tracking
- Log aggregation
- Event Sourcing
- Application health monitoring
There are four main parts in a Kafka system:
Broker: Handles all requests from clients (producer, consumer, and metadata requests) and keeps data replicated within the cluster. There can be one or more brokers in a cluster.
Zookeeper: Keeps track of the status of the Kafka cluster (brokers, topics, users).
Producer: Sends records to a broker.
Consumer: Consumes batches of records from the broker.
Experiment setup
Prerequisites:
- Installing Oracle Virtual VM Box
Specifications:
- 4 GB RAM
- 25 GB Hard Drive
- Downloading the Ubuntu ISO file
[Screenshot: Oracle VM VirtualBox Manager]
- Installing VirtualBox Guest Additions
sudo apt install build-essential dkms linux-headers-$(uname -r)
- Able to copy and paste content easily
- Full-screen mode available
- Certain built-in headers/packages available for additional functionality
- Installing Python
Installing the latest version of Python:
sudo apt install python3
sudo apt install python3-pip
python3 --version
- Installing AWS CLI
AWS CLI helps to access multiple AWS services and functionalities from the command line.
sudo apt install curl
curl https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip -o awscliv2.zip
unzip awscliv2.zip
sudo ./aws/install
/usr/local/bin/aws --version
- Connecting with AWS
Connecting the machine to an AWS account by entering the Access Key ID and Secret Access Key:
aws configure
aws s3 ls
- Installing the Java JDK
The Java JDK is required to start the Kafka broker and services.
sudo apt update
sudo apt list
sudo apt install default-jre
sudo apt install default-jdk
javac --version
- Installing PyCharm in Ubuntu
Test Results
- Installing Kafka
Download the Apache Kafka binaries from the official downloads page (https://kafka.apache.org/downloads)
Extract the Kafka binaries using tar -xzvf
pip3 install kafka-python
- Starting the Zookeeper service and Kafka broker
Navigate to the directory where the downloaded files are unzipped and start the Zookeeper service
bin/zookeeper-server-start.sh config/zookeeper.properties
Start the Kafka broker in a new terminal
bin/kafka-server-start.sh config/server.properties
Use Cases
Collecting real-time sampled tweets from Twitter and publishing them to our Kafka broker
- producer.py
Running the script producer.py for generating events
- consumer.py
Running the script consumer.py to consume the events published by the producer.
- twitter-stream.py
Using the twitter-stream.py script to fetch tweets from Twitter's API in real time.
Entering our bearer token in the twitter-stream.py script under the BEARER_TOKEN parameter.
Tweets are published to the Kafka Broker.
On running consumer.py again, we can see all the published events that are collected by the consumer.
Lessons learned
- Learnt how to configure Oracle VM VirtualBox with the Ubuntu operating system
- Learnt the fundamentals of Apache Kafka
- Implemented real-time data streaming in Apache Kafka using the Twitter API