As a cloud database user, I'm always looking for ways to speed up queries and save costs. While cloud databases are powerful, it's important to be mindful of when and how you run queries to avoid unnecessary expenses. This article highlights key SQL tips for optimizing your queries, ensuring minimal runtime and cost efficiency.
Utilize stored procedures
In a cloud database, stored procedures enhance performance by compiling and caching code, leading to faster execution of frequently used queries. They also streamline processes for developers, serving as reusable pieces of code. Developers no longer need to rewrite the same code repeatedly; instead, they can utilize existing SQL functions in the form of a stored procedure.
You can create a stored procedure as shown below:
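A minimal sketch using SQL Server syntax; the procedure and table names (`get_active_customers`, `customer_details`) are hypothetical:

```sql
-- Wrap a frequently used query in a reusable stored procedure.
CREATE PROCEDURE get_active_customers
AS
BEGIN
    SELECT customer_id, customer_name
    FROM customer_details
    WHERE is_active = 1;
END;
```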
Then, you can run this procedure using the following command:
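Assuming a procedure named `get_active_customers` has already been created, executing it is a one-liner (SQL Server syntax shown; other databases use `CALL`):

```sql
-- Run the previously created stored procedure.
EXEC get_active_customers;
```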
You can also pass parameters into stored procedures by specifying the parameter name and its data type.
Simply declare the parameter with an @ sign followed by the data type you want it to accept. Then, to execute the procedure, you again specify the parameter and its value.
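Here's a sketch of a parameterized procedure, again using SQL Server syntax with hypothetical names (`get_customers_by_state`, `customer_details`, `@state_id`):

```sql
-- Declare the parameter with an @ sign and a data type.
CREATE PROCEDURE get_customers_by_state
    @state_id INT
AS
BEGIN
    SELECT customer_id, customer_name
    FROM customer_details
    WHERE state_id = @state_id;
END;

-- Execute it by specifying the parameter and its value.
EXEC get_customers_by_state @state_id = 3;
```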
This allows you to really customize your stored procedure for your specific use case while still reusing code that’s already been written and automated.
Use indexes
Consider indexes in SQL as similar to primary keys and mapping tables. In extensive data tables, identification codes or integers often link to another data table, offering an efficient method of data storage. This approach facilitates easy querying of wide tables and connecting values to retrieve more details on a data row. Additionally, primary keys function as indexes, enabling the selection of unique rows.
These indexing techniques significantly enhance the speed of data retrieval in tables. By storing data in specific columns, indexes allow quick locating of a single value or a range of values. For instance, when employing a WHERE clause in a query, an index prevents the need to scan the entire table, saving time by directly searching for a match condition.
It’s essential to note that cloud data warehouses like Redshift and Snowflake, being columnar, lack traditional indexes found in relational databases. They automatically partition data during loading based on data distribution. To optimize performance, consider loading data in a sorted order for frequent queries. Additionally, you can override the partition, triggering the database to reorganize and distribute the data accordingly.
Clustered indexes physically order the table's rows based on the values in the indexed column.
You only want to use a clustered index when your column values are in sequential or sorted order and there are no repeat values. This is because the index orders the rows based on the actual values within the column itself. Once this index is created, the table's rows are stored in the index's order, so the index effectively is the data rather than a pointer to it. Primary keys are, by default, a form of clustered index.
Non-clustered indexes create a separate structure with two parts—one holding the indexed column values and the other pointing to the rows that contain the data. This type of index is typically used for mapping tables or any type of glossary: certain column values point to a specific location. Unlike clustered indexes, the index points to the data rather than containing it.
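The two index types can be sketched like this, using SQL Server syntax and a hypothetical `orders` table:

```sql
-- Clustered: the table's rows are physically stored in order_id order.
CREATE CLUSTERED INDEX ix_orders_order_id ON orders (order_id);

-- Non-clustered: a separate structure that points back to the rows.
CREATE NONCLUSTERED INDEX ix_orders_customer_id ON orders (customer_id);
```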
If you’re choosing between these two indexes, clustered indexes are the way to go. They are faster and require less memory to run since they don’t exist in two separate locations. This practice optimizes your cloud data warehouse performance.
There are also full-text indexes, which are rarer but allow you to search through columns with a lot of text, like those that hold article or email content. This type of index stores the position of the terms found in the indexed field, making matching text much faster to find.
Refrain from using SELECT *; fetch only the columns that are essential.
Believe it or not, using SELECT * when checking different datasets in your data warehouse isn’t a good idea. These queries are not very efficient because they show all fields in a dataset, rather than just the ones you are interested in.
When writing queries in a data model, exclude columns that won’t be used by data analysts or business users. For reporting purposes, include only the columns that business users want to see. To prevent confusion and improve run-time, it’s always better to keep things minimal!
Choosing only the specific fields you want to see will make your models and reports neat and easy to navigate. Here’s an example of how that might look:
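A minimal sketch, assuming a hypothetical `customer_details` table:

```sql
-- Name only the columns the report actually needs, instead of SELECT *.
SELECT customer_id, customer_name, signup_date
FROM customer_details;
```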
Optimize JOIN operations
Joins can make or break complex queries. It’s imperative that you know the difference between an inner join, outer join, left join, and right join. Using the wrong join can create duplicates in your dataset and slow it down massively.
I recommend only using an outer join if you have a very specific use case that can't be solved otherwise. Outer joins return matched and unmatched rows from both of the tables you are joining. It essentially returns everything from both datasets in one dataset, which in my opinion basically defeats the purpose of a join. Outer joins can produce a lot of duplicates and return pieces of data you probably don't need, making them inefficient.
Inner joins return only the matching records from the two tables that you are joining. This is almost always preferred over an outer join.
Left and right joins return all records from one table and only matching records from the table being joined. For left joins, the resulting query contains all values from the first table and only the matching records from the second. For right joins, it is the opposite—the resulting query contains all values from the second table and only the matching records from the first.
I recommend always choosing a left join over a right. In order to make this work, simply change the order you are joining your tables. Left joins are a lot easier to read and understand as compared to right joins—making this type of join better for data governance, reliability, and data quality.
Lastly, with joins, make sure you are joining the two tables on a common field. If you are selecting a field that doesn't exist to join the tables together, you may get an extremely long-running query that will end up wasting your time and money. I recommend verifying that joins are utilizing primary and foreign keys that exist between two tables.
Here’s an example of a left join:
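This sketch assumes hypothetical `customer_details` and `order_details` tables sharing a `customer_id` key:

```sql
-- Returns every customer, with order info filled in where a match exists.
SELECT c.customer_id, c.customer_name, o.order_id
FROM customer_details c
LEFT JOIN order_details o
    ON c.customer_id = o.customer_id;
```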
Also, don’t be afraid to join on more than one column if need be. Sometimes this can help reduce resulting duplicates and speed up run-time in the case of multiple records with the same joining field.
Here’s an example with an inner join:
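This sketch joins on two columns at once; the table and column names (`customer_details`, `order_details`, `region_id`) are hypothetical:

```sql
-- Joining on two columns can prevent duplicates when one key alone
-- isn't unique across both tables.
SELECT c.customer_id, o.order_id
FROM customer_details c
INNER JOIN order_details o
    ON c.customer_id = o.customer_id
    AND c.region_id = o.region_id;
```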
Normalize database tables
The premise of normalization is to make sure the values in your database tables are easy to locate and query. Normalization at the layer closest to your raw data is important so that you can easily query the values downstream. We cover this idea more in our article on data modeling.
I often run into issues with JSON objects in my raw data. Parsing these JSON objects is a type of normalization that ensures no nested objects exist in your data. This is actually called First Normal Form (1NF). With this form of normalization, values must exist as atomic values (values that can’t be broken down into smaller values) and each row must have a primary key.
Second Normal Form (2NF) builds on 1NF and requires that every non-key column depend on the table's entire primary key, not just part of it. If a column depends on only part of a composite key, break it out into its own table so each value is fully determined by its key and can be accessed directly.
Third Normal Form (3NF) is another type of normalization. Second normal form is actually a prerequisite to this type of normalization. So, if using this form, make sure you follow the steps for 2NF first. Then, you want to look at your table and see if any column values are dependent on one another. For example, if you have a customers table with the customer name, phone number, state, and zip code, state is dependent on zip code. You can further break these values down into another table. Zip code and state would exist in their own table while customer name, phone number, and zip code exist in another.
If you remember the transitive property from high school geometry class, this is essentially what that does. It says “Hey, if this column depends on that column for its value, then that value can be moved to a separate table”.
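One way to sketch that 3NF split, modeling state as determined by zip code (all table and column names hypothetical):

```sql
-- The transitively dependent pair moves to its own table.
CREATE TABLE zip_codes (
    zip_code VARCHAR(10) PRIMARY KEY,
    state    VARCHAR(2)
);

-- The customers table keeps only columns that depend directly on its key.
CREATE TABLE customers (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    phone_number  VARCHAR(20),
    zip_code      VARCHAR(10) REFERENCES zip_codes (zip_code)
);
```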
While there is also fourth normal form and fifth normal form, these normalization techniques are less popular and not needed for the scope of this article.
Monitor query performance
Monitoring query performance is key when trying to optimize your SQL queries. If you don’t ever look at the run-time of your queries, you will never know which ones are taking the longest! This is key to determining which ones need to be optimized and will have the most cost savings for you in your cloud database.
One tool to optimize performance is query profiling. This allows you to pinpoint the source of performance issues by looking at statistics such as runtime and rows returned. Query profiling also includes query execution plans which give you insight into what code is running in what order before it runs. To optimize query performance you can also look at database logs, the server itself, and any external applications connected to your cloud database.
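Most engines expose an execution plan you can inspect before running a query; for example, in PostgreSQL or MySQL (the table name here is hypothetical):

```sql
-- Shows the planned scan/join strategy without executing the full query.
EXPLAIN SELECT customer_id FROM customer_details WHERE state_id = 3;
```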
Use UNION ALL instead of UNION
UNION is an operator used to join the outputs of two SQL queries. It comes in handy when you need to combine two datasets that have the same column names. However, it is important that you understand the difference between the two UNION operators: UNION and UNION ALL.
UNION ALL joins all of the rows from Table A with all of the rows from Table B. No deduplication occurs. UNION, however, joins all of the rows from Table A with all of the rows from Table B and then deduplicates rows that contain the same values. If you don't care about duplicates, UNION ALL is going to save you a lot of processing time compared to UNION. I typically always opt for UNION ALL because, even if there are duplicates, I would want to know about them and take the time to understand why that is happening.
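A quick sketch with hypothetical table names; both queries must return the same columns:

```sql
-- UNION ALL keeps every row, including duplicates, and skips
-- the deduplication step that UNION performs.
SELECT customer_id FROM current_customers
UNION ALL
SELECT customer_id FROM archived_customers;
```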
Optimize subquery performance
While I don't recommend using subqueries when trying to optimize performance, sometimes they are handy for a quick-and-dirty analysis. If you need to check whether values exist in another table or SQL subquery, it's best to use an EXISTS statement over an IN statement.
EXISTS returns a boolean value and can stop as soon as it finds a match, quickly moving on to the next row when a value is not present. IN compares every value since it returns the values themselves, slowing down the processing time of the query. However, IN is still more efficient than something like an OR statement, which scans a table for multiple different conditions.
Instead of this…
WHERE state_id=3 OR state_id=11 OR state_id=34
…do this:
WHERE state_id IN (3, 11, 34)
In this case, it is much more efficient to use the IN clause rather than the OR. However, in the following example, it makes more sense to use an EXISTS rather than an OR because two different tables are being compared.
Instead of this…
SELECT customer_id FROM customer_details WHERE order_id IN (SELECT order_id FROM order_details WHERE order_type_id=3)
…do this:
SELECT customer_id FROM customer_details WHERE EXISTS (SELECT 1 FROM order_details WHERE order_details.order_id = customer_details.order_id AND order_details.order_type_id = 3)
This returns all of the rows where the condition holds true, rather than scanning and comparing every value like an IN clause.
Leverage cloud database-specific features
One of the many benefits of using a cloud database is the database-specific features that come with it. Snowflake, for example, has a ton of SQL functions specific to Snowflake that make creating transformations easier. These include functions for parsing JSON values and working with different data types. Check with your cloud provider to see if they have specific optimizations they recommend.
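As one example, Snowflake's PARSE_JSON function turns a JSON string into a VARIANT, which you can then traverse with its colon path syntax; the `raw_events` table and `payload` column here are hypothetical:

```sql
-- Parse a JSON string, pull out a nested field, and cast it to a string.
SELECT PARSE_JSON(payload):customer:name::string AS customer_name
FROM raw_events;
```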
Leveraging cloud databases can significantly elevate your data environment. While they provide ample flexibility, that flexibility doesn't have to come at the cost of performance. By combining the features these databases offer with strategic SQL practices, you can optimize costs and maintain high performance.
The key lies in understanding when and how to employ specific features effectively. Equally crucial is recognizing which SQL patterns to avoid. With these insights, you now have the means to assess your SQL usage in your cloud database, enhancing data reliability, quality, and accessibility.