In this article we will discuss best practices for moving data from Redshift into an RStudio Workspace.


Before you start

Before you move data from Redshift, you will need:

- An RStudio Workspace running on the Peak Platform
- Connection details for your tenant's Redshift cluster (on the Peak Platform these are typically already configured in an `odbc.ini` file)


Moving data from an outside source into your local environment can be difficult to set up, and in data science it's something we need to do most of the time. With the Peak Platform we can connect directly to the source of the data and start transforming with ease!


Using DBI

One way to do this is with the R package `DBI`, which can use different database connectors such as `RPostgreSQL` or `odbc`. These two connectors have their advantages and disadvantages, but they both work fairly similarly within a Workspace.



First of all you need to install the required packages. To do this, run:


install.packages("DBI")

## Install the connector you want to use
install.packages("odbc")

Now that you have the packages installed, you can connect to Redshift! If you are using the `odbc` connector you will need an `odbc.ini` file. This holds the connection details needed to pull data from Redshift, and it typically looks like this:


[ODBC Data Sources]

Amazon_Redshift_x64=Amazon Redshift (x64)
[tenantname-prod]
Driver=driver
Server=server
Port=port
Database=databasename
Username=username
Password=password
locale=en-GB

In the Peak Platform this should already be set up; however, you can check it exists by copying the hidden `.odbc.ini` file into your working directory using:

cp .odbc.ini odbc.ini
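
If you'd rather check from inside R, the `odbc` package can list the data sources it knows about, which is a quick sanity check that the DSN is configured:

library(odbc)

## List the data sources configured in odbc.ini
odbcListDataSources()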

Once you have checked the `odbc.ini` file, it's time to connect to Redshift. First, set up your connection using `DBI` with the `odbc` credentials. Note that the connection name used below is the same one defined in the `odbc.ini` file:


library(DBI)
library(odbc)

conn <- dbConnect(odbc(), 'tenantname-prod')
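
A quick way to confirm the connection works is to check it is valid and list the tables it can see:

## Check the connection is live, then list the available tables
dbIsValid(conn)
dbListTables(conn)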


If you want to connect to Redshift using `RPostgreSQL` instead, you just need to alter some of the code. With `RPostgreSQL` you can add your Redshift connection details directly into `dbConnect()`; it should look something like this:


library(DBI)
library(RPostgreSQL)

## HOST, PORT, USERNAME and PASSWORD should hold your Redshift details
conn <- dbConnect(
  dbDriver("PostgreSQL"),
  dbname = "tenantname-prod",
  host = HOST,
  port = PORT,
  user = USERNAME,
  password = PASSWORD
)
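
Rather than hardcoding credentials in a script, one option is to read them from environment variables. The variable names below are just placeholders; use whatever your Workspace provides:

library(DBI)
library(RPostgreSQL)

## Placeholder variable names -- set these in your Workspace environment
conn <- dbConnect(
  dbDriver("PostgreSQL"),
  dbname   = Sys.getenv("REDSHIFT_DBNAME"),
  host     = Sys.getenv("REDSHIFT_HOST"),
  port     = Sys.getenv("REDSHIFT_PORT"),
  user     = Sys.getenv("REDSHIFT_USER"),
  password = Sys.getenv("REDSHIFT_PASSWORD")
)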


We now have a connection to Redshift and can run SQL queries against the data. To check this, run something simple using the `DBI` package, such as:


df <- dbGetQuery(conn, "select * from stage.houseprices limit 10")

Here the SQL is just entered as a string, but you can save your SQL query as a file and run that instead, such as:


query <- readr::read_file("~/sql/get_data.sql")

df <- dbGetQuery(conn, query) 
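
`DBI` also supports parameterised queries, which keep values out of the SQL string. As a sketch, assuming the table has a `town` column (hypothetical here):

## The ? placeholder is filled in from params, avoiding string pasting
df <- dbGetQuery(
  conn,
  "select * from stage.houseprices where town = ? limit 10",
  params = list("Manchester")
)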


Using dbplyr


R has a package that allows you to query data in Redshift directly, without having to know or write SQL. First you need to install the packages `dbplyr` and `dplyr`, then create a connection to your database as we did earlier.


install.packages("dbplyr")
install.packages("dplyr")

library(DBI)
library(odbc)
library(dbplyr)
library(dplyr)

conn <- dbConnect(odbc(), 'tenantname-prod')


Once you have the connection, set up which table you wish to query using:


houseprices <- conn %>%
  tbl(in_schema("stage", "houseprices"))

Now you can transform the data using `dplyr` verbs, then use `collect()` to pull the data into the Workspace. To replicate the SQL from earlier, simply use:


df <- houseprices %>%
  head(10) %>%
  collect() 
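
The real benefit is that any `dplyr` verbs you apply before `collect()` are translated to SQL and run on Redshift, so only the result comes back. As a sketch, assuming the table has `town` and `price` columns (hypothetical here):

## Aggregate on Redshift, then pull only the summary into the Workspace
avg_prices <- houseprices %>%
  group_by(town) %>%
  summarise(avg_price = mean(price, na.rm = TRUE)) %>%
  collect()

You can also call `show_query()` in place of `collect()` to see the SQL that `dbplyr` generates, without running it.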

With both of these methods you can transform and aggregate the data on Redshift itself, or pull it all in and transform it in R.
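
Whichever method you use, it's good practice to close the connection once you have finished:

## Release the connection to Redshift
dbDisconnect(conn)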