This is it: a Docker multi-container environment with Hadoop (HDFS), Spark and Hive. But without the large memory requirements of a Cloudera sandbox. (On my Windows 10 laptop (with WSL2) it seems to consume a mere 3 GB.)
The only thing lacking is that the Hive server doesn't start automatically. To be added when I understand how to do that in docker-compose.
To deploy the HDFS-Spark-Hive cluster, run:
docker-compose up
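Once the containers are up, you can check their status from the directory holding the compose file:
docker-compose ps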
docker-compose creates a docker network that can be found by running docker network list, e.g. docker-hadoop-spark-hive_default.
Run docker network inspect on the network (e.g. docker-hadoop-spark-hive_default) to find the IP the hadoop interfaces are published on (see the example after the list below). Access these interfaces with the following URLs:
- Namenode: http://<dockerhadoop_IP_address>:9870/dfshealth.html#tab-overview
- History server: http://<dockerhadoop_IP_address>:8188/applicationhistory
- Datanode: http://<dockerhadoop_IP_address>:9864/
- Nodemanager: http://<dockerhadoop_IP_address>:8042/node
- Resource manager: http://<dockerhadoop_IP_address>:8088/
- Spark master: http://<dockerhadoop_IP_address>:8080/
- Spark worker: http://<dockerhadoop_IP_address>:8081/
- Hive server (JDBC, no web UI): jdbc:hive2://<dockerhadoop_IP_address>:10000
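For example, to list the container names and their IPs on that network (assuming the default network name above; adjust if yours differs):
docker network inspect docker-hadoop-spark-hive_default | grep -E '"Name"|"IPv4Address"'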
Copy breweries.csv to the namenode.
docker cp breweries.csv namenode:breweries.csv
Open a bash shell on the namenode container:
docker exec -it namenode bash
Create an HDFS directory /data/openbeer/breweries.
hdfs dfs -mkdir /data
hdfs dfs -mkdir /data/openbeer
hdfs dfs -mkdir /data/openbeer/breweries
Copy breweries.csv to HDFS:
hdfs dfs -put breweries.csv /data/openbeer/breweries/breweries.csv
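A quick sanity check that the file landed where we expect it:
hdfs dfs -ls /data/openbeer/breweries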
Go to the command line of the Hive server and start hiveserver2:
docker exec -it hive-server bash
hiveserver2
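Note that hiveserver2 keeps the foreground; if you'd rather keep this shell free for the checks below, you can push it to the background instead (the log path is just a suggestion):
nohup hiveserver2 > /tmp/hiveserver2.log 2>&1 &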
Maybe a little check that something is listening on port 10000 now:
netstat -anp | grep 10000
tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN 446/java
Okay. Beeline is the command-line interface to Hive. Let's connect to hiveserver2 now.
beeline
!connect jdbc:hive2://127.0.0.1:10000 scott tiger
Didn't expect to encounter scott/tiger again after my Oracle days. But there you have it. Definitely not a good idea to keep that user in production.
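By the way, the same connection also works as a beeline one-liner, passing the URL and the scott/tiger credentials as options:
beeline -u jdbc:hive2://127.0.0.1:10000 -n scott -p tiger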
Not a lot of databases here yet.
show databases;
+----------------+
| database_name |
+----------------+
| default |
+----------------+
1 row selected (0.335 seconds)
Let's change that.
create database openbeer;
use openbeer;
And let's create a table.
CREATE EXTERNAL TABLE IF NOT EXISTS breweries(
NUM INT,
NAME CHAR(100),
CITY CHAR(100),
STATE CHAR(100),
ID INT )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/data/openbeer/breweries';
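If you want to double-check that the table really points at the HDFS directory we just filled, describe it first (the exact output varies per Hive version):
describe formatted breweries;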
And have a little select statement going.
select name from breweries limit 10;
+----------------------------------------------------+
| name |
+----------------------------------------------------+
| NorthGate Brewing |
| Against the Grain Brewery |
| Jack's Abby Craft Lagers |
| Mike Hess Brewing Company |
| Fort Point Beer Company |
| COAST Brewing Company |
| Great Divide Brewing Company |
| Tapistry Brewing |
| Big Lake Brewing |
| The Mitten Brewing Company |
+----------------------------------------------------+
10 rows selected (0.113 seconds)
There you go: your private Hive server to play with.
Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop) to see the status of the Spark master.
Go to the command line of the Spark master and start spark-shell.
docker exec -it spark-master bash
spark/bin/spark-shell --master spark://spark-master:7077 --conf "spark.hadoop.hive.metastore.uris=thrift://hive-metastore:9083" --conf "spark.hadoop.hive.metastore.schema.verification=true" --conf "spark.hadoop.hive.metastore.schema.verification.record.version=true"
If you want to use PySpark, you can run the following command instead. I will be using Scala for the rest of the tutorial.
/spark/bin/pyspark --master spark://spark-master:7077 --conf "spark.hadoop.hive.metastore.uris=thrift://hive-metastore:9083" --conf "spark.hadoop.hive.metastore.schema.verification=true" --conf "spark.hadoop.hive.metastore.schema.verification.record.version=true"
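For completeness: the same metastore settings can also be set programmatically, which is handy if you later package a standalone Scala job instead of using the shell. A minimal sketch (the app name is made up):
import org.apache.spark.sql.SparkSession

// Equivalent of the --conf flags passed to spark-shell above
val spark = SparkSession.builder()
  .appName("breweries")  // hypothetical app name
  .master("spark://spark-master:7077")
  .config("spark.hadoop.hive.metastore.uris", "thrift://hive-metastore:9083")
  .enableHiveSupport()
  .getOrCreate()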
Load breweries.csv from HDFS.
val df = spark.read.csv("hdfs://namenode:9000/data/openbeer/breweries/breweries.csv")
df.show()
+----+--------------------+-------------+-----+---+
| _c0| _c1| _c2| _c3|_c4|
+----+--------------------+-------------+-----+---+
| 0| NorthGate Brewing | Minneapolis| MN| 0|
| 1|Against the Grain...| Louisville| KY| 1|
| 2|Jack's Abby Craft...| Framingham| MA| 2|
| 3|Mike Hess Brewing...| San Diego| CA| 3|
| 4|Fort Point Beer C...|San Francisco| CA| 4|
| 5|COAST Brewing Com...| Charleston| SC| 5|
| 6|Great Divide Brew...| Denver| CO| 6|
| 7| Tapistry Brewing| Bridgman| MI| 7|
| 8| Big Lake Brewing| Holland| MI| 8|
| 9|The Mitten Brewin...| Grand Rapids| MI| 9|
| 10| Brewery Vivant| Grand Rapids| MI| 10|
| 11| Petoskey Brewing| Petoskey| MI| 11|
| 12| Blackrocks Brewery| Marquette| MI| 12|
| 13|Perrin Brewing Co...|Comstock Park| MI| 13|
| 14|Witch's Hat Brewi...| South Lyon| MI| 14|
| 15|Founders Brewing ...| Grand Rapids| MI| 15|
| 16| Flat 12 Bierwerks| Indianapolis| IN| 16|
| 17|Tin Man Brewing C...| Evansville| IN| 17|
| 18|Black Acre Brewin...| Indianapolis| IN| 18|
| 19| Brew Link Brewing| Plainfield| IN| 19|
+----+--------------------+-------------+-----+---+
only showing top 20 rows
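Spark made up the column names (_c0 ... _c4) because we didn't give it a schema. If your copy of breweries.csv carries a header row, you can let Spark pick up the names (and guess the types) instead:
val dfNamed = spark.read.option("header", "true").option("inferSchema", "true").csv("hdfs://namenode:9000/data/openbeer/breweries/breweries.csv")
dfNamed.printSchema()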
Let's check if our Spark session can connect to the Hive metastore.
spark.sql("show databases").show(10, false)
+---------+
|namespace|
+---------+
|default |
|openbeer |
+---------+
Looks good. Can we access table data?
spark.sql("SELECT * FROM openbeer.breweries").show()
+---+--------------------+--------------------+--------------------+---+
|num| name| city| state| id|
+---+--------------------+--------------------+--------------------+---+
| 0|NorthGate Brewing...|Minneapolis ...| MN ...| 0|
| 1|Against the Grain...|Louisville ...| KY ...| 1|
| 2|Jack's Abby Craft...|Framingham ...| MA ...| 2|
| 3|Mike Hess Brewing...|San Diego ...| CA ...| 3|
| 4|Fort Point Beer C...|San Francisco ...| CA ...| 4|
| 5|COAST Brewing Com...|Charleston ...| SC ...| 5|
| 6|Great Divide Brew...|Denver ...| CO ...| 6|
| 7|Tapistry Brewing ...|Bridgman ...| MI ...| 7|
| 8|Big Lake Brewing ...|Holland ...| MI ...| 8|
| 9|The Mitten Brewin...|Grand Rapids ...| MI ...| 9|
| 10|Brewery Vivant ...|Grand Rapids ...| MI ...| 10|
| 11|Petoskey Brewing ...|Petoskey ...| MI ...| 11|
| 12|Blackrocks Brewer...|Marquette ...| MI ...| 12|
| 13|Perrin Brewing Co...|Comstock Park ...| MI ...| 13|
| 14|Witch's Hat Brewi...|South Lyon ...| MI ...| 14|
| 15|Founders Brewing ...|Grand Rapids ...| MI ...| 15|
| 16|Flat 12 Bierwerks...|Indianapolis ...| IN ...| 16|
| 17|Tin Man Brewing C...|Evansville ...| IN ...| 17|
| 18|Black Acre Brewin...|Indianapolis ...| IN ...| 18|
| 19|Brew Link Brewing...|Plainfield ...| IN ...| 19|
+---+--------------------+--------------------+--------------------+---+
only showing top 20 rows
How cool is that? Your own Spark cluster to play with.
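And because the session is wired to the metastore, any SQL over the Hive table works from Spark too, e.g. a quick count of breweries per state (numbers obviously depend on your copy of the data):
spark.sql("SELECT state, count(*) AS n FROM openbeer.breweries GROUP BY state ORDER BY n DESC").show(5)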
The configuration parameters can be specified in the hadoop.env file or as environment variables for specific services (e.g. namenode, datanode, etc.):
CORE_CONF_fs_defaultFS=hdfs://namenode:9000
CORE_CONF corresponds to core-site.xml. fs_defaultFS=hdfs://namenode:9000 will be transformed into:
<property><name>fs.defaultFS</name><value>hdfs://namenode:9000</value></property>
To define a dash inside a configuration parameter name, use a triple underscore, such as YARN_CONF_yarn_log___aggregation___enable=true (yarn-site.xml):
<property><name>yarn.log-aggregation-enable</name><value>true</value></property>
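So a (hypothetical) hadoop.env fragment like
CORE_CONF_fs_defaultFS=hdfs://namenode:9000
HDFS_CONF_dfs_webhdfs_enabled=true
YARN_CONF_yarn_log___aggregation___enable=true
ends up as fs.defaultFS in core-site.xml, dfs.webhdfs.enabled in hdfs-site.xml and yarn.log-aggregation-enable in yarn-site.xml.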
The available configurations are:
- /etc/hadoop/core-site.xml CORE_CONF
- /etc/hadoop/hdfs-site.xml HDFS_CONF
- /etc/hadoop/yarn-site.xml YARN_CONF
- /etc/hadoop/httpfs-site.xml HTTPFS_CONF
- /etc/hadoop/kms-site.xml KMS_CONF
- /etc/hadoop/mapred-site.xml MAPRED_CONF
If you need to extend some other configuration file, refer to the base/entrypoint.sh bash script.