This article covers best practices for moving data from Redshift into a Jupyter Workspace.

Before you start

Before you move data from Redshift, you will need:

  • access to a Jupyter Workspace
  • your Redshift credentials (username, password, host, and database name)
  • the required Python packages installed (covered in the next section)

Make sure the correct packages are installed

When you first open your Jupyter Workspace, make sure the following packages are installed:

  • sqlalchemy
  • psycopg2
  • python-dotenv (loads environment variables from a .env file)

These packages should already be installed, but if not, run:

pip install sqlalchemy psycopg2 python-dotenv

Import the packages into your script

Once the packages are installed, import them into your script:

from dotenv import load_dotenv
import os

from sqlalchemy import text, create_engine
import psycopg2
from psycopg2.extras import execute_values

Check your environment variables

Before you create your connection, make sure the required environment variables are set. If you keep them in a .env file, load them by running:

load_dotenv() 


This loads any variables defined in your .env file. To check that a variable is set, run:

os.environ.get('VARIABLE_NAME')

This should return the expected value. (Using .get avoids a KeyError if the variable is missing.) If it returns None or the wrong value, you can set it manually:

os.environ['VARIABLE_NAME'] = 'xxxx'
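Checking variables one at a time gets tedious, so you may want to wrap the check in a small helper that reports every missing variable at once. A minimal sketch, assuming the same variable names used in the connection function below (adjust the list to your setup):

```python
import os

# The variables the Redshift connection will need; adjust to your setup.
REQUIRED_VARS = ['REDSHIFT_USERNAME', 'REDSHIFT_PASSWORD', 'REDSHIFT_HOST', 'TENANT']

def check_env_vars(required=REQUIRED_VARS):
    """Return the names of required environment variables that are missing or empty."""
    return [name for name in required if not os.environ.get(name)]

missing = check_env_vars()
if missing:
    print('Missing environment variables:', ', '.join(missing))
```

Running this right after load_dotenv() tells you immediately whether your .env file covered everything.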

Connect to the Redshift cluster

After checking your environment variables, you can connect to the Redshift cluster you want to query. Build the connection string from the environment variables you have just checked.

Here is an example of using a function to do this:

def connect_to_redshift():
    connection_str = 'postgresql://{usr}:{pwd}@{host}:5439/{db}'.format(
                        usr=os.environ['REDSHIFT_USERNAME'],
                        pwd=os.environ['REDSHIFT_PASSWORD'],
                        host=os.environ['REDSHIFT_HOST'],
                        db=os.environ['TENANT']
                    )
    sql_engine = create_engine(connection_str)
    return sql_engine

You can also create a function to run queries, so you can query the database repeatedly without duplicating code:

import pandas as pd

def read_from_redshift(sql_engine, query):
    with sql_engine.connect() as conn:
        data = conn.execute(text(query))  # text() is required for raw SQL strings in SQLAlchemy 1.4+
        df = pd.DataFrame(data.fetchall(), columns=list(data.keys()))
    return df

Start querying your data with SQL!

Once your functions are ready, you can start querying your data!

For example, here the SQL query is entered as a string, then the functions above connect to the database and run it.

query = """
select * from stage.houseprices limit 10
"""

sql_engine = connect_to_redshift()
df = read_from_redshift(sql_engine, query)