By: Walker Rowe, April 03, 2017 (06:17 AM)

Securing Big Data Databases

Last year tens of thousands of MongoDB databases were emptied and left with nothing but a ransom note. These databases were configured with public IP addresses and no authentication, so hackers were able to find them easily using Shodan.

Big data databases usually do not have any authentication set up by default. That means anyone who can get command line access to the server on which they run can get access to the data. Here we discuss some of the security issues around big data databases.
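To show how little effort it takes to find out whether a database port answers from the outside, here is a minimal sketch in Python. The function name port_is_reachable is made up for illustration, and 27017 is MongoDB's default port. It only tests whether a TCP connection succeeds, so it says nothing about whether authentication is enabled behind that port.

```python
import socket


def port_is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call (hypothetical host): port_is_reachable("db.example.internal", 27017)
```

Run it from a machine outside your network against your own servers only. If it returns True for a database port, that port is exposed.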

What are Big Data Databases?

Big data databases can generally be defined as those that run across a distributed architecture. Below we discuss some of the major ones: Hadoop, MongoDB, Cassandra, and Spark.


Hadoop

Hadoop is more of a storage mechanism than a database. You cannot update objects in Hadoop, only add or delete them. Some other big data databases use Hadoop as part of their ecosystem. Hadoop is also used for analytics, since you can run map and reduce operations across files in Hadoop, meaning you can work with data sets and transform them.

Unlike MySQL or Oracle, with Hadoop you do not log in. Instead you interface with it using the OS command line. So if someone can get access to the command line, they can access Hadoop data. And it does not matter which server they log into, as Hadoop is distributed and runs across a cluster. In other words, you do not need to know where the data is located, because you can get to it from any server in the cluster.

For example, in Hadoop you create a directory like this:

hadoop fs -mkdir /data

Then you can look at any data using:

hadoop fs -cat /data/file.txt

The default configuration of Hadoop is no authentication: it simply trusts the user name reported by the client's operating system.
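The authentication mode is controlled by the hadoop.security.authentication property in core-site.xml, which is simple (no authentication) unless it is set to kerberos. As a rough sketch, the following Python reads that property from the XML text. The function name hadoop_auth_mode and the SAMPLE content are made up for illustration, and the sketch ignores any other files or overrides a real cluster might use.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content, used only for illustration.
SAMPLE = """
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
</configuration>
"""


def hadoop_auth_mode(core_site_xml):
    """Return the configured authentication mode, or "simple" if it is unset."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        name = prop.findtext("name", "").strip()
        if name == "hadoop.security.authentication":
            # An empty <value/> is treated like the default.
            return prop.findtext("value", "").strip() or "simple"
    return "simple"
```

A result of simple means the cluster is still running without real authentication.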


MongoDB

MongoDB is popular with JavaScript programmers because it has a JavaScript command line and stores its records as JSON-like documents (JSON stands for JavaScript Object Notation). It has grown more important as JavaScript has become a server-side language with Node.js, so JavaScript is no longer used just to render pages in the browser and interact with users.

The problem for the ransomware victims was that they had exposed their databases on public IP addresses and kept the default authentication setting, which is no authentication. Below is where you expose MongoDB across the internet:

vim /etc/mongod.conf

# /etc/mongod.conf

# Listen to local, LAN and Public interfaces.
bind_ip = (loopback IP address),(LAN IP address),(public IP address)
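To check a setting like this automatically, here is a minimal sketch in Python that reads a bind_ip line and reports which addresses are reachable from outside the host and LAN. The function name exposed_addresses and the sample addresses are made up for illustration. The sketch treats the catch-all address 0.0.0.0 as exposed, since it listens on every interface.

```python
import ipaddress


def exposed_addresses(bind_ip_line):
    """Return the addresses in a 'bind_ip = a,b,c' line that are not
    loopback or private, plus any catch-all (unspecified) address."""
    _, _, value = bind_ip_line.partition("=")
    exposed = []
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        addr = ipaddress.ip_address(token)
        # Check the catch-all first: 0.0.0.0 means "every interface".
        if addr.is_unspecified or not (addr.is_loopback or addr.is_private):
            exposed.append(token)
    return exposed


# Example call with made-up addresses:
# exposed_addresses("bind_ip = 127.0.0.1,192.168.1.10,8.8.8.8")
```

Any address this returns is one that MongoDB would answer on from the internet, so authentication must be enabled before it is used.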


Cassandra

As with Hadoop, with Cassandra it does not matter which server the hacker logs in to, as the data is available across the whole cluster. And unlike Hadoop, there is no master-slave design. All nodes work the same way. So compromise one node and you can access all of them.

Cassandra is a structured NoSQL database. It has a SQL-like query language, CQL. But being NoSQL means, among other things, that you cannot create new data tables using join, union, or other set operations.

Cassandra is a column-oriented database (more precisely, a wide-column store) rather than a row-oriented one like MySQL. That means you can write data at row r, column x in one operation, then write to row r, column x+1 in another operation.
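As a toy model of that write pattern, the Python below treats a table as a map from row key to a map of columns, so each write touches a single cell. This is only an illustration of the idea, not how Cassandra stores data on disk. The names new_table and write_cell are made up.

```python
from collections import defaultdict


def new_table():
    """A toy table: row key -> {column name -> value}."""
    return defaultdict(dict)


def write_cell(table, row_key, column, value):
    """Write one cell without touching the rest of the row."""
    table[row_key][column] = value


table = new_table()
write_cell(table, "r", "x", "first value")     # one operation
write_cell(table, "r", "x+1", "second value")  # a separate operation
```

Both writes land in the same row r, but neither needs to know about the other column.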

Cassandra has a shell, cqlsh. Gain access to the host and you can log in to the shell. As with the others, the default is no authentication. But you can enable authentication, and then you would run:

cqlsh -u userid -p password

You manage users with the familiar CREATE USER, ALTER USER, and DROP USER commands that Oracle uses.
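Whether cqlsh asks for a userid and password depends on the authenticator setting in cassandra.yaml, which is AllowAllAuthenticator by default and PasswordAuthenticator when password logins are enabled. Here is a minimal sketch in Python that reads that setting from the file text. The function names are made up, and the line-based parsing is a simplification that assumes the setting sits on one top-level line.

```python
def cassandra_authenticator(yaml_text):
    """Return the value of the top-level 'authenticator:' setting, or None."""
    for line in yaml_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        stripped = line.split("#", 1)[0].strip()
        if stripped.startswith("authenticator:"):
            return stripped.split(":", 1)[1].strip()
    return None


def requires_login(yaml_text):
    """True if an authenticator other than AllowAllAuthenticator is set."""
    name = cassandra_authenticator(yaml_text)
    if not name:
        return False  # unset: treated here as the permissive default
    # Accept both short and fully qualified class names.
    return name.rsplit(".", 1)[-1] != "AllowAllAuthenticator"
```

If requires_login returns False for your cassandra.yaml, anyone who reaches the node can open cqlsh without credentials.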


Spark

Spark is replacing Hadoop for MapReduce operations because, as an in-memory processing engine, it runs many times faster than Hadoop.

Spark works with structured data, meaning you cannot just copy whole files there as you can with Hadoop. You have to write a program to load data into Spark.

To gain access to Spark you run spark-shell for the Scala command line interpreter or pyspark for the Python one. And to run jobs in Spark you use spark-submit.

Spark supports Kerberos and shared secret authentication, but again, neither is configured by default. Both are for node-to-node authentication, not user authentication. Kerberos issues a session ticket. A shared secret is like a key. But Spark automatically pushes the key out to other nodes that connect to it. That does not seem logical: with security you are supposed to put the key in place first, not give it away for free.
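To make the shared secret idea concrete, here is a minimal sketch in Python of one common way two nodes can prove they hold the same key without sending it: a challenge-response using an HMAC. This is a generic illustration, not Spark's actual protocol, and all function names are made up.

```python
import hashlib
import hmac
import secrets


def make_challenge():
    """The verifying node sends a fresh random challenge."""
    return secrets.token_bytes(16)


def answer_challenge(shared_secret, challenge):
    """The connecting node answers with an HMAC keyed by the shared secret."""
    return hmac.new(shared_secret, challenge, hashlib.sha256).digest()


def verify_answer(shared_secret, challenge, answer):
    """The verifying node recomputes the HMAC and compares in constant time."""
    expected = answer_challenge(shared_secret, challenge)
    return hmac.compare_digest(expected, answer)
```

A node that does not know the secret cannot produce a valid answer, which is why the secret should be distributed ahead of time over a trusted channel.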

Spark also has a web browser interface. It has no authentication unless you understand how to write javax servlet filters. So to protect it, do not give it an internet-routable IP address, or if you do, learn about javax servlet filters.

Jupyter Notebooks

Jupyter (formerly IPython Notebook) is not a database. Instead it is an interactive web page where programmers can write Python, R, or Scala code to work with Apache Spark. And since these languages have connectors to Cassandra and other databases, you can access those too. With Jupyter you can also use markdown to format the pages nicely, and you can make graphs using widgets.

But you can also write bash commands there. That is highly dangerous, as it means you can run shell commands directly from the web interface. So it is important to enable authentication there. You can use LDAP through a third-party plugin.

Apache Zeppelin is another product similar to Jupyter.

In sum, you can see that not much thought has been given to protecting big data databases from a userid and password point of view. They are generally protected by keeping them off the internet and setting permissions on file folders. But a user does not need to be root to run any of these things. To protect the web interfaces to these databases, you could use nginx or Apache as a reverse proxy server, turn on authentication, and put that in front of them.
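To illustrate the kind of check such a proxy performs when authentication is turned on, here is a minimal sketch in Python that validates an HTTP Basic Authorization header. It is not how nginx or Apache are implemented. The function name check_basic_auth is made up, and the plain-text password dictionary is a simplification (real servers store password hashes).

```python
import base64
import hmac


def check_basic_auth(header_value, users):
    """Return True if an 'Authorization: Basic ...' header value matches
    a username/password pair in the users dict."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes.
        return False
    username, sep, password = decoded.partition(":")
    if not sep:
        return False
    expected = users.get(username)
    if expected is None:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), password.encode("utf-8"))
```

Requests that fail this check never reach the database's web interface, which is the whole point of putting the proxy in front of it.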
