With the constant growth in the volume and complexity of application data in recent years, there has been a lot of innovation, and many new database options have emerged or become more popular and accessible than before. With so many options, it is difficult to choose the best database for each use case. The purpose of this post is to help you with that challenge.
In the whitepaper Best Practices for Cloud Architecture, AWS suggests some important questions that can help you better understand your application's requirements and then serve as a guide for choosing the best database.
- Is your application read-heavy, write-heavy, or balanced?
- How many reads and writes per second will you need?
- How might these read and write volumes change if the number of users increases?
- How much data will you need to store and for how long?
- How fast is the volume of data expected to grow?
- Is there an upper bound in the near future?
- What is the size of each object (average, minimum and maximum)?
- How will these objects be accessed?
- What are the requirements in terms of data durability?
- Will this data store be your “source of truth”, or is it data replicated from another source?
- What are your latency requirements?
- How many concurrent users do you need to support?
- What is your data model and how will you query the data?
- Are your queries relational in nature (e.g., JOINs across multiple tables)?
- Can you denormalize your schema to make data structures flatter and easier to scale?
- Do you need strong integrity controls, or are you looking for more flexibility (e.g., schema-less data stores)?
- Do you need sophisticated reporting or search capabilities?
- Are your developers more familiar with relational databases than with NoSQL?
Relational Databases
Relational databases (RDBMS/SQL) normalize data into tables, which consist of rows and columns.
They provide a powerful query language, flexible indexing capabilities, strong integrity controls, and the ability to combine data from multiple tables quickly and efficiently.
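As a quick illustration of normalized tables and a relational JOIN, here is a minimal sketch using SQLite (the table and column names are invented for the example):

```python
import sqlite3

# In-memory database purely for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), total REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bruno')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.9), (11, 1, 10.0), (12, 2, 55.5)")

# A relational query: combine data from two normalized tables with a JOIN
rows = conn.execute("""
    SELECT c.name, COUNT(o.id) AS order_count
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Ana', 2), ('Bruno', 1)]
```

The same pattern applies to any RDBMS on RDS; only the connection setup changes.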
On AWS you can install your databases on EC2 servers, however, the most suitable model is to use RDS, which is a managed service that facilitates the configuration, operation, and scaling of a relational database in the cloud. With RDS you won’t have to worry about backups, patches, operating systems, and replication, all of which can be automated by the service.
How to Scale Relational Databases
Relational databases can scale up by upgrading to a larger instance or adding more and faster storage.
Some relational databases like Amazon Aurora allow you to also scale horizontally by creating one or more read replicas.
Read replicas are separate DB instances that are replicated asynchronously. As a result, they are subject to replication delays and may be missing some of the latest transactions.
In this case, to use read replicas, the application needs to consider which queries are tolerant of using data that may be a little out of date.
Read replicas also cannot accept data changes.
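To make this concrete, here is a minimal sketch of how an application's data access layer might split traffic between the primary and read replicas. The endpoint names are hypothetical, and a real implementation would hold actual database connections:

```python
class QueryRouter:
    """Send writes to the primary and replica-tolerant reads to a read replica.

    Endpoint names are hypothetical; in practice these would be connection
    objects or RDS endpoint strings.
    """

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._i = 0

    def route(self, sql, tolerate_stale=False):
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and tolerate_stale and self.replicas:
            # Round-robin across replicas, but only for queries that can
            # tolerate replication lag.
            endpoint = self.replicas[self._i % len(self.replicas)]
            self._i += 1
            return endpoint
        # Writes, and reads that need the latest data, go to the primary.
        return self.primary

router = QueryRouter("primary.db.example.com", ["replica-1.db.example.com"])
print(router.route("SELECT * FROM products", tolerate_stale=True))
print(router.route("UPDATE accounts SET balance = 0"))
```

Note that the application, not the database, decides which queries may read slightly stale data.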
When a relational database needs to scale its write capacity beyond the constraints of a single instance, you will need a different approach called data partitioning, or sharding.
In this model, data is split across multiple database schemas, each running on its own instance. While RDS makes it easy to launch these instances, partitioning adds complexity to the application.
The application's data access layer needs to understand how the data is split so that it can direct queries to the correct instance.
Also, schema changes must be performed across multiple databases, so it’s worth investing in automating this process.
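A sketch of the routing logic such a data access layer might use, assuming a simple hash-based scheme and invented shard names:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a partition key to a shard deterministically.

    Uses MD5 (stable across processes, unlike Python's built-in hash())
    so that every application server agrees on the placement.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# The data access layer uses the shard id to pick a connection.
shards = ["orders-db-0", "orders-db-1", "orders-db-2", "orders-db-3"]
print(shards[shard_for("customer#42", len(shards))])
```

One design consequence: changing `num_shards` remaps most keys, which is why resharding (or a consistent-hashing scheme) also needs to be planned up front.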
High Availability with Relational Databases
For any production relational database, redundancy is important for high availability. In RDS you can enable Multi-AZ, which creates a synchronously replicated standby instance in a different Availability Zone.
If the primary instance fails, RDS automatically fails over to the other instance without the need for manual intervention.
Amazon Aurora, in turn, offers multi-master capabilities that allow reads and writes to scale across different Availability Zones, and it also supports replication between different regions.
Databases for DW (Data Warehouse)
A Data Warehouse is a type of relational database optimized for analyzing and reporting large amounts of data.
It is generally used to combine transactional data from different data sources (ERP, CRM, WMS, clickstreams, e-commerce, etc.) to make it available for analysis and decision making.
Traditionally, setting up, running, and scaling a data warehouse is complicated and very expensive. Redshift is a managed data warehouse service that lets you get your DW up and running faster and at a lower cost.
Scaling your DW
Amazon Redshift provides high performance for queries over large datasets through a combination of massively parallel processing (MPP), columnar data storage, and data compression.
You can scale by increasing the number of nodes in your data warehouse cluster.
Another very interesting option is Amazon Redshift Spectrum, which lets you run Amazon Redshift SQL queries against exabytes of data stored in Amazon S3, extending your data warehouse beyond the local disks of the cluster without the need to load or transform the data.
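One intuition behind the columnar storage and compression mentioned above is that values in the same column tend to be similar, so even a simple encoding such as run-length encoding shrinks them dramatically. A toy sketch (not Redshift's actual encoder):

```python
def run_length_encode(column):
    """Collapse consecutive repeated values into (value, count) pairs."""
    encoded = []
    for value in column:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded

# A column from a fact table: low cardinality, long runs of repeats.
country_column = ["BR"] * 5000 + ["US"] * 3000 + ["DE"] * 2000
encoded = run_length_encode(country_column)
print(encoded)       # [('BR', 5000), ('US', 3000), ('DE', 2000)]
print(len(encoded))  # 3 entries instead of 10,000 stored values
```

Row-oriented storage interleaves every column of every row, so runs like this never form; that is the core advantage of columnar layouts for analytics.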
Redshift continuously monitors cluster health, automatically re-replicates data from failed drives, and replaces nodes as needed to ensure availability.
NoSQL Databases
NoSQL, or non-relational, databases use more flexible data models and often scale out more easily than relational databases.
Some common data models used in NoSQL databases are graphs, key-value, and JSON documents.
Some NoSQL databases, such as DynamoDB, deliver very high levels of performance, scalability, and availability in a simple way.
On the other hand, if you need to run analytics and ad-hoc queries, you will typically have to extract the data later into an analytical database such as Redshift.
For the vast majority of applications that need to scale significantly, non-relational databases become fundamental tools. Naturally, they do not solve the problem end to end, and you will often need to combine them with other databases, such as in-memory stores for caching (Memcached, Redis) and relational databases for analytics (Oracle, Redshift).
Amazon DynamoDB is a fast and flexible NoSQL database service for applications that need consistent single-digit millisecond latency at any scale. It is a fully managed cloud database that supports both document and key-value storage models. Check out this full post on DynamoDB to learn more.
Horizontal scaling in NoSQL databases usually happens transparently and does not require data partitioning logic to be implemented in your application's data access layer (unlike relational databases).
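To illustrate the key-value access pattern, here is a sketch of a DynamoDB-style point read by primary key. The table name, key names, and item layout are invented for the example; real calls would go through boto3 (e.g. `boto3.client("dynamodb").get_item(**request)`):

```python
def build_get_item_request(customer_id: str, order_date: str) -> dict:
    """Build a point-read request addressed by the full primary key
    (partition key + sort key). All names here are hypothetical."""
    return {
        "TableName": "Orders",                       # hypothetical table
        "Key": {
            "pk": {"S": f"CUSTOMER#{customer_id}"},  # partition key
            "sk": {"S": f"ORDER#{order_date}"},      # sort key
        },
    }

request = build_get_item_request("42", "2021-06-01")
print(request["Key"]["pk"]["S"])  # CUSTOMER#42
```

Because every read and write is addressed by the partition key, DynamoDB can spread the data across storage partitions without any routing code in your application.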
Databases for Search
Many applications need functionality that lets users search text, or find items by context, taking into account the user's intent rather than requiring an exact match between what was searched and what is found, much like a Google or Facebook search.
A search service can be used to index and search both free-form and structured text, and it generally supports functionality not available in other databases, such as ranking, faceted search, synonyms, and language interpretation, among others.
To help you build search solutions, AWS offers Amazon CloudSearch and Amazon Elasticsearch Service (Amazon ES).
Of the two, CloudSearch is an AWS-proprietary managed service that requires little configuration and scales automatically.
Amazon ES is a managed service for Elasticsearch, an open-source solution, and gives you more autonomy over configuration.
Elasticsearch goes beyond simple search and can also be used for log analysis, real-time application monitoring, and clickstream analysis.
Both CloudSearch and Elasticsearch use data partitioning and replication to scale out. The advantage of CloudSearch is that you don’t have to worry about partitions and replicas because the service handles this automatically.
When it comes to High Availability, both include features that redundantly store data in different AZs.
Graph Databases
A graph database (also called a graph-oriented database) uses structures of nodes connected to each other by edges.
Graph databases are purposefully created to store and browse relationships and are typically used in use cases like social media, recommendation engines, and fraud detection where you need to create relationships between data and query those relationships quickly.
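A toy in-memory sketch of the kind of relationship traversal a graph database is built for — here a "friends of friends" recommendation. This is illustrative only, not Neptune's API (Neptune is queried with Gremlin or SPARQL):

```python
# Toy social graph as an adjacency list; a graph database stores nodes
# and edges like this natively and indexes the relationships.
friends = {
    "ana":   {"bruno", "carla"},
    "bruno": {"ana", "diego"},
    "carla": {"ana", "diego"},
    "diego": {"bruno", "carla", "elisa"},
    "elisa": {"diego"},
}

def recommend(user):
    """Recommend friends-of-friends the user is not yet connected to."""
    candidates = set()
    for friend in friends[user]:
        candidates |= friends[friend]
    return sorted(candidates - friends[user] - {user})

print(recommend("ana"))  # ['diego']
```

In a relational database this two-hop query becomes a self-JOIN that gets more expensive with each extra hop; graph databases keep traversal cost proportional to the edges visited.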
Recommendations
- Prefer managed databases over doing your own installations manually. Managed databases free your team from much of the work required to ensure scalability, availability, security, and performance.
- If your application indexes and queries data without the need for complex joins or transactions, or if you need a write rate that exceeds the constraints of a single instance, consider a NoSQL database.
- If you have binary files (audio, video, images), it is more efficient to store the actual files in Amazon S3 and keep only the files' metadata in your relational, key-value, or document database.
- If your schema cannot be denormalized, and your application requires complex joins or transactions, consider a relational database.
- For Business Intelligence (BI) and analytics, use a purpose-built database like Redshift, which uses columnar storage and compression to enable analysis of large volumes of data.
- For use cases where you run queries based on the relationships between objects, consider a graph-oriented database such as Amazon Neptune.
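The S3-plus-metadata recommendation above can be sketched as follows. The bucket name and key scheme are invented for the example, and the actual upload (roughly `boto3.client("s3").put_object(Bucket=..., Key=..., Body=data)`) is left as a comment:

```python
import hashlib

def build_media_record(user_id: str, filename: str, data: bytes) -> dict:
    """Metadata row to store in the database; the bytes themselves go to S3.

    Bucket and key scheme are hypothetical. The real upload would be
    something like:
        boto3.client("s3").put_object(Bucket=bucket, Key=s3_key, Body=data)
    """
    checksum = hashlib.sha256(data).hexdigest()
    s3_key = f"media/{user_id}/{checksum}/{filename}"
    return {
        "user_id": user_id,
        "filename": filename,
        "size_bytes": len(data),
        "sha256": checksum,
        "s3_bucket": "my-app-media",  # hypothetical bucket name
        "s3_key": s3_key,
    }

record = build_media_record("42", "avatar.png", b"\x89PNG...")
print(record["s3_key"])
```

Keeping only this small record in the database keeps rows cheap to store and replicate, while S3 handles the durability and bandwidth of the binary content.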