top of page

Common Mistakes in SQL

As a data scientist or a data analyst, we usually need to work with SQL when we need to pull data from SQL servers or to combine and manipulate data from several tables, therefore, it is good if we can spot and avoid common mistakes when working with it. Or even a beginner or someone who does not get used to SQL much, you might got some error running queries. Here, I summarize some common mistakes with examples, so we could easily aviod it and we are not a noob SQL anymore! Fingercross if you do not need to work with SQL, you have Pandas!!!


Import libraries and connect to databases

Through out this work, we will run SQL commands in Jupyter Notebook (with ipython-sql) and work on IMDB and WORLD data badabases, therefore, we need to install and config libraries as below,


Import libraries :

import numpy as np
import pandas as pd

%matplotlib inline
%load_ext sql
%config SqlMagic.displaylimit = 20
%config SqlMagic.autolimit = 30

Read credential information from a credential.json file under /data

import json
import urllib.parse

with open('data/credentials.json') as f:
    login = json.load(f)
    
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']

Connect to the database (One at a time)

%sql postgresql://{username}:{password}@{host}:{port}/imdb
%sql postgresql://{username}:{password}@{host}:{port}/world

#1 : Miss up clause orders and execution orders

If you try to run a query below to see the result of movies that have a number of votes more than 1000, you will find an error indicating that it could not find the column "NumberOfVotes" that we aliased in SELECT statement.

%%sql

SELECT
    title AS Title,
    start_year AS Year,
    runtime AS Duration,
    rating AS Rating
    nvotes AS NumberOfVotes
FROM
    movies
WHERE
    NumberOfVotes > 1000
;















Why is that?

This is because the execution orders and the clause orders are different!

In SQL, there are two orders that we need to be careful. First, the clause order, it is the order of clause statement that we need to write in this particular orders, it is the same as the syntax of SQL. But when we actually execute the query, SQL will execute the clauses with different orders since it wants to optimize the query to run faster and more efficient. Two orders are listed as below.



Order of clauses in a statement:

    SELECT
      |
     FROM
      |
     JOIN
      |
    WHERE
      |
  GROUP BY
      |
    HAVING
      |
   ORDER BY
      |
    LIMIT

Order of execution/processing:

FROM and JOIN
      |
    WHERE
      |
  GROUP BY
      |
    HAVING
      |
    SELECT
      |
   DISTINCT
      |
   ORDER BY
      |
    LIMIT

Therefore, in the query example above, the SQL will execute FROM first, then WHERE, and it will throw an error since it does not know "NumberOfVotes" column we derived it in SELECT clause. The SELECT clause is executed later than WHERE.


Therefore, do keep in mind with the difference of these two orders because you might get struck a long time with this tiny mistake, or even you got the wrong result and you don't know it!



#2 : Not beware of the logical statement

Many times, we need to filter some conditions to get the result we want, but we sometimes have a weird or wrong query result even though the statement looks good. Take a look at the query below that we want to have a result of movies that was published during 2010 and 2020 with a rating > 8 and number of votes > 1000,

%%sql

SELECT
    *
FROM
    movies
WHERE
    start_year = 2010
    OR 
    start_year = 2020
    AND
    rating > 8
    AND
    nvotes > 1000
LIMIT 15
;

Again, why there are some movies with rating < 8 and nvotes < 1000?

yeah, it is because we do not beware while putting several condition in WHERE clause.


Same as other programming languages, there is an order of logical operators as well, in this case OR got executed later than AND. Therefore, it is better if we put some parenthesis for each condition, so it runs in the correct orders as we want. Check the code below!

%%sql

SELECT
    *
FROM
    movies
WHERE
    (start_year = 2010
    OR 
    start_year = 2020)
    AND
    rating > 8
    AND
    nvotes > 1000
LIMIT 15
;

BOOM.. we got what we want!


#3 : Combine aggregration to non-aggregration data


Again, sometimes we might want to run a simple average value of a certain column, so we use aggregation functions, and yeah it look great as below,

%%sql

SELECT
    AVG(population)
FROM
    country
;

And yeah, why don't we just find the average population for a certain continent and show the name of that continent together, here we go ..

%%sql

SELECT
    AVG(population), name
FROM
    country
WHERE
    continent = 'North America'
;

WHATTTTT? Why it got an error?

Ohh yeah, in SQL, we can not have aggregation columns together with normal columns because SQL does not know what aspects it should group and apply aggregations, but we can solve this by using GROUP BY clause.

%%sql

SELECT
    AVG(population), name
FROM
    country
WHERE
    continent = 'North America'
GROUP BY
    name
LIMIT 10
;

By this way, SQL knows that it should group countries in North America and apply AVG for each country and show the result, so we have the correct result as we want.


One tiny thing to remember about aggregations is that we can not put it in WHERE clause because SQL processes rows by WHERE before aggregations.

%%sql

SELECT
    name
FROM
    country
WHERE
    population > AVG(population)
;

Some more examples using aggregations with GROUP BY and HAVING BY.

%%sql

SELECT
    countrycode, AVG(population), MAX(population)
FROM
    city
WHERE
    countrycode IN ('CAN', 'USA', 'THA')
GROUP BY
    countrycode
ORDER BY
    countrycode
;
%%sql

SELECT
    countrycode,
    AVG(population)::int,
    MAX(population)::int,
    COUNT(population) AS city_count
FROM
    city
GROUP BY
    countrycode
HAVING
    COUNT(*) > 10
ORDER BY
    city_count DESC
;
%%sql

SELECT
    continent, region, AVG(population)::INT
FROM
    country
GROUP BY
    continent, region
ORDER BY
    continent, region
;

These three common mistakes seem to be easy and naive, but believe me, it will come up a lot if you are not familiar or do not use SQL much. Try to keep this in mind so you know and aviod these common mistakes, save your time, and have the query result you want!


Cheers.

!!


53 views0 comments

Recent Posts

See All

Comments


bottom of page