Cassandra is a free and open source database management system that can be used to manage large amounts of data across multiple servers.
It is reliable, clustered, and specially designed to handle enormous amounts of structured data. Cassandra also supports replication, including replication across multiple data centers, for redundancy, failover, and disaster recovery.
The Cassandra database is one of the most widely used NoSQL databases. It scales performance well and can distribute a dataset, partitioned across the cluster nodes, at no licensing cost.
This database can be installed on both Linux and Windows, and the installation process is simple: download and decompress the software archive, set a few directory paths used to store data on disk in the cassandra.yaml configuration file, and run the cassandra executable.
The only unfamiliar concept is likely to be the keyspace, which resembles a schema in an Oracle database and holds information about how data must be replicated in the Cassandra cluster.
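As an illustration of how a keyspace carries its replication settings, here is a minimal sketch that builds a CQL CREATE KEYSPACE statement. The helper name create_keyspace_cql, the keyspace name demo_app, and the choice of SimpleStrategy with three replicas are assumptions made for this example, not values from the text.

```python
def create_keyspace_cql(name: str, replication_factor: int) -> str:
    """Build a CQL statement that creates a keyspace with SimpleStrategy.

    Hypothetical helper: it only produces the statement text, it does not
    connect to a cluster or execute anything.
    """
    if not name.isidentifier():
        raise ValueError(f"invalid keyspace name: {name!r}")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} "
        "WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )


if __name__ == "__main__":
    # Assumed example values: keyspace "demo_app", three copies of each row.
    print(create_keyspace_cql("demo_app", 3))
```

The resulting statement could be pasted into cqlsh or passed to a client driver. The replication options live on the keyspace, so every table created inside it inherits them.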
To scale the performance of an application, Cassandra allows nodes to be grouped into clusters. By using clusters, applications can benefit from high availability, linear scaling of performance, and simple access to a multitude of low-cost servers with no single point of failure.
Within a cluster, it is possible to define the share of data that a particular node handles (through its token), which also determines which cluster node serves a specific record. The degree of availability of the cluster can be adjusted through the replication strategy, which specifies the number of copies (replicas) of each row in a table that must exist in the cluster at any time.
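The relationship between tokens, nodes, and replicas can be sketched with a toy token ring. This is a simplified model, not Cassandra's implementation: it hashes keys with MD5 into a 32-bit token space as a stand-in for the real partitioner, and it places replicas on the next nodes around the ring, similar in spirit to a simple replication strategy. The class name TokenRing, the node names, and the token values are all hypothetical.

```python
import bisect
import hashlib
from typing import Dict, List

RING_SIZE = 2 ** 32  # toy token space, much smaller than a real partitioner's


def token_for(key: str) -> int:
    """Map a row key to a token (MD5 is a stand-in chosen for this sketch)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class TokenRing:
    """Toy ring: each node owns the tokens up to and including its own."""

    def __init__(self, node_tokens: Dict[str, int]) -> None:
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def replicas_for_token(self, token: int, replication_factor: int) -> List[str]:
        """Return the owning node followed by the next nodes around the ring."""
        count = min(replication_factor, len(self._ring))
        # First node whose token is >= the key's token; wrap around past the end.
        start = bisect.bisect_left(self._tokens, token) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def replicas(self, key: str, replication_factor: int) -> List[str]:
        return self.replicas_for_token(token_for(key), replication_factor)


if __name__ == "__main__":
    # Three hypothetical nodes with evenly spaced tokens.
    step = RING_SIZE // 3
    ring = TokenRing({"node-a": 0, "node-b": step, "node-c": 2 * step})
    print(ring.replicas("user:42", replication_factor=2))
```

With a replication factor of 2, each row lives on two of the three nodes, so losing any single node still leaves one copy reachable.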
To create a simple Cassandra cluster, edit the cassandra.yaml file.
Once the configuration file is complete, install the Cassandra package and copy the cassandra.yaml file to all nodes in the cluster. The cluster_name value must be the same on all machines. The seed nodes are used by each node in the cluster to discover the cluster topology.
Set the listen_address field to the IP address of each machine; this address is used to communicate with the other nodes.
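The per-node settings described above can be sketched as a small script that renders one configuration fragment per machine. The cluster name, the IP addresses, and the helper names are hypothetical, and the fragment covers only the three settings discussed here (cluster_name, the seed list, and listen_address), laid out the way they appear in a stock cassandra.yaml. A real file contains many more options.

```python
from typing import Dict, List

# Minimal fragment: only the settings discussed in the text.
NODE_TEMPLATE = """\
cluster_name: '{cluster_name}'
listen_address: {listen_address}
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "{seeds}"
"""


def render_node_settings(cluster_name: str, seeds: List[str], listen_address: str) -> str:
    """Render the fragment for one node; seeds are joined with commas."""
    return NODE_TEMPLATE.format(
        cluster_name=cluster_name,
        listen_address=listen_address,
        seeds=",".join(seeds),
    )


def render_cluster(cluster_name: str, seeds: List[str], node_addresses: List[str]) -> Dict[str, str]:
    """Same cluster_name and seeds everywhere; listen_address differs per node."""
    return {
        address: render_node_settings(cluster_name, seeds, address)
        for address in node_addresses
    }


if __name__ == "__main__":
    # Hypothetical three-node cluster with a single seed node.
    fragments = render_cluster(
        "demo_cluster", ["10.0.0.1"], ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    )
    for address, text in fragments.items():
        print(f"# settings for {address}\n{text}")
```

Generating the fragments from one place makes it harder for cluster_name or the seed list to drift between machines, which is the main thing that must stay identical.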
After all nodes have been configured and started, we can check that each machine started successfully by searching for “Listening for thrift clients …” in the Cassandra startup console output.
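The startup check can be automated with a few lines that scan captured console output for that message. This is a minimal sketch: the function name node_started is hypothetical, and the sample lines below are invented placeholders, not real Cassandra log output.

```python
from typing import Iterable

# Text quoted in the article; matched as a plain substring.
STARTUP_MARKER = "Listening for thrift clients"


def node_started(log_lines: Iterable[str]) -> bool:
    """Return True if any line of the console output contains the marker."""
    return any(STARTUP_MARKER in line for line in log_lines)


if __name__ == "__main__":
    # Invented placeholder lines standing in for captured console output.
    sample_output = [
        "(placeholder) node is starting",
        "(placeholder) Listening for thrift clients...",
    ]
    print(node_started(sample_output))
```

In practice the lines would come from a log file or a captured console stream for each machine, and the check would be repeated per node.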
The nodetool utility allows adding nodes to and removing nodes from the cluster, as well as performing maintenance and management tasks.
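A script can call nodetool in the same way an operator would from a shell. The sketch below assumes nodetool is on the PATH and that the node's JMX port is the default 7199. The wrapper names and the host address are hypothetical, and the wrapper simply returns None when nodetool is not installed.

```python
import shutil
import subprocess
from typing import List, Optional


def nodetool_command(subcommand: str, *args: str,
                     host: str = "127.0.0.1", port: int = 7199) -> List[str]:
    """Build the argument list for a nodetool call against one node."""
    return ["nodetool", "-h", host, "-p", str(port), subcommand, *args]


def run_nodetool(subcommand: str, *args: str,
                 host: str = "127.0.0.1", port: int = 7199) -> Optional[str]:
    """Run nodetool and return its standard output, or None if it is unavailable."""
    if shutil.which("nodetool") is None:
        return None
    result = subprocess.run(
        nodetool_command(subcommand, *args, host=host, port=port),
        capture_output=True, text=True, check=False, timeout=60,
    )
    return result.stdout


if __name__ == "__main__":
    # "status" lists the nodes in the cluster and their state.
    print(" ".join(nodetool_command("status")))
    output = run_nodetool("status")
    print(output if output is not None else "nodetool not found on PATH")
```

Other subcommands, such as decommission for retiring the node being addressed, can be passed the same way.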
Although the Cassandra database offers many advantages compared to classic relational solutions, it also has some significant disadvantages. The weaknesses of this solution are:
No transactions;
No JOINs;
No complex queries.
Fortunately, most NoSQL solutions allow Hadoop integration, and in this specific case Cassandra can be integrated with two frameworks that allow the execution of queries: Apache Hive and Apache Pig.
NoSQL is an essential step in the evolution of data persistence mechanisms, and it requires a change in the way we think about applications. Whether we do it out of necessity or out of curiosity as programmers, we believe that NoSQL is a technology worth learning.