In this article we discuss best practices for moving data from Redshift into an RStudio Workspace.
Before you start
Before you move data from Redshift, you will need:
- Data ingested onto the Peak platform
- Access to an RStudio Workspace
- To know where your data is stored in Redshift
Moving data from an outside source into a local environment can be difficult to set up, and in data science it is something we need to do most of the time. With Workspaces on the Peak Platform we can connect directly to the source of the data and read it in with ease, ready for developing data science workflows.
One way to do this is with the R package `DBI`, which can use different database drivers such as `RPostgreSQL` or `odbc`. These two drivers have their own advantages and disadvantages, but they both work in much the same way within a Workspace.
First of all, install the required packages:

```r
install.packages("DBI")

## Install the connector you want to use
install.packages("odbc")
```
Now that you have the packages installed, you can connect to Redshift! If you are using the `odbc` connector you will need an `odbc.ini` file. This holds the connection details you need to pull data from Redshift, and it typically looks like this:
```ini
[ODBC Data Sources]
Amazon_Redshift_x64=Amazon Redshift (x64)

[tenantname-prod]
Driver=driver
Server=server
Port=port
Database=databasename
Username=username
Password=password
locale=en-GB
```
On the Peak Platform this should already be set up; however, you can check that it exists by copying the `odbc.ini` file into your local directory:

```shell
cp .odbc.ini odbc.ini
```
Once you have checked the `odbc.ini` file, it's time to connect to Redshift. First, set up your connection by using `DBI` with the `odbc` credentials. Note that the connection name below matches the one from the `odbc.ini` file:
```r
library(DBI)
library(odbc)

conn <- dbConnect(odbc(), 'tenantname-prod')
```
If you want to connect to Redshift using `PostgreSQL` instead, you just need to alter some of the code. With `PostgreSQL` you can add your Redshift connection details directly into `dbConnect()`; it should look something like this:
```r
library(DBI)
library(RPostgreSQL) # provides the "PostgreSQL" driver

conn <- dbConnect(
  dbDriver("PostgreSQL"),
  dbname = "tenantname-prod",
  host = HOST,
  port = PORT,
  user = USERNAME,
  password = PASSWORD
)
```
We now have a connection to Redshift and can run SQL queries against the data. To check this, run something simple using the `DBI` package, such as:
```r
df <- dbGetQuery(conn, "select * from stage.houseprices limit 10")
```
Here I have entered the SQL directly as a string, but you can also save your SQL query in a file and run that instead:
```r
query <- readr::read_file("~/sql/get_data.sql")
df <- dbGetQuery(conn, query)
```
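If you need to pass values into a query at run time, `DBI` also supports parameterised queries via the `params` argument of `dbGetQuery()`, which binds values safely instead of pasting them into the SQL string. A minimal sketch, reusing the connection and table from above (the `price` column and the threshold value are hypothetical):

```r
library(DBI)

# "?" is the placeholder style used by the odbc driver; the bound value
# replaces it when the query is executed on Redshift.
df <- dbGetQuery(
  conn,
  "select * from stage.houseprices where price > ? limit 10",
  params = list(100000)  # hypothetical threshold
)
```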
R also has packages that let you query the data on Redshift directly, without having to know or write SQL. First install `dbplyr` and `dplyr`, then create a connection to your database as we did earlier:
```r
install.packages("dbplyr")
install.packages("dplyr")

library(DBI)
library(odbc)
library(dbplyr)
library(dplyr)

conn <- dbConnect(odbc(), 'tenantname-prod')
```
Once you have the connection, set up which table you wish to query:
```r
houseprices <- conn %>% tbl(in_schema("stage", "houseprices"))
```
Now you can transform the data using R, then use `collect()` to pull the result into the Workspace. To replicate the SQL query from earlier, simply use:
```r
df <- houseprices %>%
  head(10) %>%
  collect()
```
With both of these methods you can transform and aggregate your data on Redshift itself, or pull it all into R and transform it there.
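As a sketch of the first approach, an aggregation written in `dplyr` is translated to SQL by `dbplyr` and executed on Redshift, so only the summarised result is pulled into the Workspace (the `town` and `price` columns here are hypothetical, and `conn`/`houseprices` are the objects created above):

```r
library(dplyr)

# The grouping and summarising runs on Redshift; collect() pulls back
# only the aggregated rows, not the full table.
avg_prices <- houseprices %>%
  group_by(town) %>%                      # hypothetical column
  summarise(avg_price = mean(price)) %>%  # hypothetical column
  collect()

# show_query() prints the SQL that dbplyr generates, without running it,
# which is handy for checking what will be sent to Redshift.
houseprices %>%
  group_by(town) %>%
  summarise(avg_price = mean(price)) %>%
  show_query()

# Close the connection when you're finished
DBI::dbDisconnect(conn)
```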