
Perform ETL operations using Amazon Redshift RSQL


Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.

There are many ways to interact with Amazon Redshift. You can programmatically access data in your Amazon Redshift cluster by using the Amazon Redshift Data API, or you can use a web-based interface such as Amazon Redshift Query Editor V2 to author SQL queries. You can also interact with Amazon Redshift in interactive or batch mode via Amazon Redshift RSQL.

Previously, you had to use the PostgreSQL psql command line tool to interact with Amazon Redshift for simple use cases such as importing and exporting data to and from Amazon Redshift or submitting SQL in batch or interactive mode, or for advanced use cases such as performing ETL operations without writing complex ETL jobs.

Now you can use Amazon Redshift RSQL, a new command line tool to connect to an Amazon Redshift cluster and perform operations such as describing database objects, querying data, viewing and exporting query results in various output file formats, running scripts that include both SQL and complex business logic, performing data cleansing and transformation operations using familiar SQL, and writing ETL and ELT scripts using enhanced control flow and advanced error handling. Moreover, if you're migrating from self-managed data warehousing engines such as Teradata to Amazon Redshift and using Teradata BTEQ (Basic Teradata Query) scripting for data automation, ETL, or other tasks, you can now migrate to Amazon Redshift RSQL.

This post explains how to use Amazon Redshift RSQL for ETL, data cleansing and preparation, enhanced control flow, and exception and error handling.

Solution overview

Amazon Redshift RSQL allows you to connect to your Amazon Redshift cluster via a command line. It supports the capabilities of the PostgreSQL psql command line tool with an additional set of Amazon Redshift-specific capabilities:

  • Describe properties or attributes of external tables in an AWS Glue catalog or Apache Hive metastore, external databases in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, and tables shared using Amazon Redshift data sharing
  • Use enhanced control flow commands such as IF, ELSEIF, ELSE, ENDIF, GOTO, and LABEL
  • Use single sign-on (SSO) authentication using ADFS, PingIdentity, Okta, Azure AD, or other SAML/JWT based identity providers (IdPs), as well as browser-based SAML IdPs with multi-factor authentication (MFA)

The following are some commonly used commands in Amazon Redshift RSQL. We use these commands in this post to demonstrate different possible use cases using Amazon Redshift RSQL scripts. A minimal script that exercises several of them follows this list.

  • \echo – Prints the specified string to the output.
  • \remark – An extension of the \echo command that has the ability to break the output over separate lines.
  • \goto – Skips all intervening commands and SQL statements and resumes processing at the specified \label in sequence. The \label must be a forward reference. You can't jump to a \label that lexically precedes the \goto.
  • \label – Establishes an entry point for running the program, as the target for a \goto command.
  • \exit – Stops running Amazon Redshift RSQL. You can also specify an optional exit code.
  • \q – Logs off database sessions and exits Amazon Redshift RSQL.
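
For example, the following minimal sketch strings these commands together in one script (the FINISH label name is arbitrary):

\echo 'Script started'
\remark 'Running the demo script'
\goto FINISH
\echo 'This line is skipped by the goto above'
\label FINISH
\echo 'Script ended'
\exit 0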

Prerequisites

The following are the prerequisites for using Amazon Redshift RSQL and performing the steps in this post:

  • An AWS account
  • A Linux, Windows, or macOS operating system (Amazon Redshift RSQL is available for these operating systems)
  • An Amazon Redshift cluster
  • SQL knowledge

Additionally, complete the following prerequisite steps:

  1. Install Amazon Redshift RSQL on your local machine. For instructions, refer to Getting started with Amazon Redshift RSQL.
  2. Connect to the Amazon Redshift cluster (an example connection command follows these steps).
  3. Create the orders and orders_summary tables using the following DDL statements:
    create table orders (
      O_ORDERKEY bigint NOT NULL,
      O_CUSTKEY bigint,
      O_ORDERSTATUS varchar(1),
      O_TOTALPRICE decimal(18,4),
      O_ORDERDATE Date,
      O_ORDERPRIORITY varchar(15),
      O_CLERK varchar(15),
      O_SHIPPRIORITY Integer,
      O_COMMENT varchar(79))
    distkey (O_ORDERKEY)
    sortkey (O_ORDERDATE);
    
    CREATE TABLE orders_summary 
    ( o_orderkey bigint, 
     o_custkey bigint, 
     o_orderstatus character varying(1),
     o_totalprice integer,
     target_information character varying(14),
     rank character varying(15),
     description character varying(15)
    ) DISTSTYLE AUTO;
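
For step 2, you can connect from a terminal by passing the cluster endpoint, database user, database name, and port; the endpoint, user, and database below are placeholders for your own values:

rsql -h <cluster-endpoint>.<region>.redshift.amazonaws.com -U awsuser -d dev -p 5439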

Import data into the Amazon Redshift cluster

There are multiple ways to load data into Amazon Redshift tables, including using Amazon Redshift RSQL. In this section, we review the syntax and an example of an Amazon Redshift RSQL script that loads data into an Amazon Redshift table using the COPY command.

We use the following syntax:

COPY <TABLE> from <location> 
iam_role <arn>
region <region>;

We provide the following parameters:

  • <location> – The location of the source data to be loaded into the target table
  • <arn> – The AWS Identity and Access Management (IAM) role for accessing the data
  • <region> – The AWS Region of the source data

In the following example Amazon Redshift RSQL script, we load data from an Amazon Simple Storage Service (Amazon S3) bucket location into the orders table:

\echo 'Job started' 
  copy orders from 's3://redshift-immersionday-labs/data/orders/orders.tbl.'
  iam_role default
  region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;
\echo 'Job Ended'
\exit 0 

Enhanced control flow

You can use Amazon Redshift RSQL to define programmatic enhanced control flow and conditional blocks in your ETL script. We use the following syntax:

\if <condition> 
  <code_block1>
\else
  <code_block2>
\endif

The syntax consists of the following components:

  • <condition> – The conditional statement
  • <code_block1> – The code block to run when the condition is met
  • <code_block2> – The code block to run when the condition is not met

In the following example script, we perform conditional checks using \if, \elseif, and \else commands based on the count of records in the orders table, and we display messages based on the record count. The \gset command stores the result of the preceding query in a variable named after the output column (count in this case):

\echo 'Job started'
Select count(*) from orders \gset
select :count as count;
\if :count < 76000000 
  \echo 'Orders are less than target'
\elseif :count = 76000000
  \echo 'Order met the target'
\else
  \echo 'Orders exceeded the target'
\endif
\echo 'Job Ended' 
\exit 0  

Error handling

You can use Amazon Redshift RSQL to define exception handling blocks in your ETL script to handle various user-defined and system-generated error conditions that you might encounter while running the ETL process.

The following are some of the error handling options supported in Amazon Redshift RSQL:

  • :ACTIVITYCOUNT – This variable is similar to the psql variable ROW_COUNT, which returns the number of rows affected by the last SQL statement. However, ACTIVITYCOUNT reports the affected row count for SELECT, COPY, and UNLOAD statements, which ROW_COUNT does not. For SQL statements returning data, ACTIVITYCOUNT specifies the number of rows returned to Amazon Redshift RSQL from the database.
  • :ERRORCODE – This variable contains the return code for the last submitted SQL statement to the database. Error code 0 specifies that the SQL statement completed without any errors.
  • :ERRORLEVEL – This variable is used to assign severity levels to errors. You can use these severity levels to determine a course of action. The default value is ON.
  • :MAXERROR – This variable designates a maximum error severity level beyond which Amazon Redshift RSQL halts job processing. If SQL statements in Amazon Redshift RSQL scripts produce an error severity level greater than the designated maxerror value, Amazon Redshift RSQL immediately exits.
  • :LAST_ERROR_MESSAGE – This variable contains the error message of the most recent failed query.
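
As a quick illustration, the following sketch uses :ACTIVITYCOUNT to stop a job when a query returns no rows (it assumes the orders table created earlier):

select * from orders limit 5;
\if :ACTIVITYCOUNT = 0
  \remark 'No rows returned; stopping the job'
  \exit 1
\else
  \echo :ACTIVITYCOUNT
\endif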

We use the following syntax:

\if :ERROR <> 0 
  \echo :<ERRORCODE>
  \echo :<LAST_ERROR_MESSAGE>
  \goto <code_block1>
\else
  \goto <code_block2>
\endif

The syntax consists of the following information:

  • <ERRORCODE> – The error code number
  • <LAST_ERROR_MESSAGE> – The error message of the most recent failed query
  • <code_block1> – The code block to run when the error condition is met
  • <code_block2> – The code block to run when the error condition is not met

In the following example script, we create the orders_staging table and copy records into it from an Amazon S3 location. The script also contains an exception handling section for both the table creation and the copy process to handle potential errors encountered during the process.

\echo `date`
\echo 'Job started'
DROP TABLE IF EXISTS orders_staging;

create table orders_staging (
O_ORDERKEY bigint NOT NULL,
O_CUSTKEY bigint,
O_ORDERSTATUS varchar(1),
O_TOTALPRICE decimal(18,4),
O_ORDERDATE Date,
O_ORDERPRIORITY varchar(15),
O_CLERK varchar(15),
O_SHIPPRIORITY Integer,
O_COMMENT varchar(79))
distkey (O_ORDERKEY)
sortkey (O_ORDERDATE);

\if :ERROR <> 0 
  \echo :ERRORCODE
  \remark :LAST_ERROR_MESSAGE
  \goto QUIT_JOB
\else
  \remark '***Orders_Staging Table Created Successfully****'
  \goto COPY_DATA
\endif

\label COPY_DATA
 copy orders_staging from 's3://redshift-immersionday-labs/data/orders/orders.tbl.'
 iam_role default
 region 'us-west-2' lzop delimiter '|' COMPUPDATE PRESET;

\if :ERROR <> 0
  \echo :ERRORCODE
  \remark :LAST_ERROR_MESSAGE
  \goto QUIT_JOB
\else 
  \remark '****Data Copied Successfully****'
\endif

\echo `date`
\echo 'Job Ended'
\exit 0 

\label QUIT_JOB
 \echo `date`
 \echo 'Job Failed'
 \exit 1 

Data transformation and preparation

You can perform common data preparation and transformation operations on your dataset using SQL statements in the Amazon Redshift RSQL ETL script. In this section, we demonstrate data transformation and preparation operations such as casting, new data column creation, and splitting an input column into multiple columns.

We use industry-standard SQL statements for transforming and preparing data for downstream consumption.

In the following example script, we use a SQL statement to transform the data from the orders_staging table and insert it into the orders_summary table:

\echo `date`
\echo 'Job started'

insert into orders_summary 
(o_orderkey,
o_custkey,
o_orderstatus,
o_totalprice,
target_information,
rank,
description)
select
o_orderkey,
o_custkey,
o_orderstatus,
o_totalprice::int,
case 
when o_totalprice < 200000
then 'Target Missed'
when o_totalprice = 200000
then 'Target Met'
when o_totalprice > 200000
then 'Target Exceeds'
end as "Target_Information",
split_part (o_orderpriority,'-',1) as RANK, 
split_part (o_orderpriority,'-',2) as DESCRIPTION
from orders_staging; 

\if :ERROR <> 0 or :ACTIVITYCOUNT = 0
 \echo :ERRORCODE
 \remark :LAST_ERROR_MESSAGE
 \goto QUIT_JOB
\else
 \remark 'Data Inserted into Summary Orders Table'
\endif

\echo `date`
\echo 'Job Ended'
\exit 0 

\label QUIT_JOB
 \echo `date`
 \echo 'Job Failed'
 \exit 1 

Export data from an Amazon Redshift cluster and output file formatting options

You can use Amazon Redshift RSQL to extract data from one or multiple Amazon Redshift tables and write it to your disk for consumption by downstream applications. Amazon Redshift RSQL uses the \export option to export the result of a query to an output file.

The following are some of the useful output formatting options supported in RSQL:

  • \rset rformat on – This command is required for all the formatting commands to take effect.
  • \pset format – Formats can include aligned, AsciiDoc, CSV, HTML, LaTeX, LaTeX longtable, troff-ms, unaligned, and wrapped.
  • \pset border – This option specifies border information for output data. Value 0 means no border, 1 means internal dividing lines, and 2 means table frame.
  • \rset heading – This command adds the specified heading to the output result.
  • \rset rtitle – This command adds the specified heading as well as the current system date of the client computer.
  • \rset titledashes on/off – This command specifies whether to print a line of dash characters between the column names and column data returned for the SQL query.
  • \rset width – This command specifies the target width for each line in a report.
  • \t – This command turns off printing column names as well as the result row count at the end of the output (footers).

We use the following syntax:

\export report file=<'FilePath/Filename'>
\rset rformat on
\pset format wrapped
\pset border 2
\rset heading 'This is Heading'
\rset width 50
\rset titledashes on
<SQL Query>
\export reset

We provide the following information:

  • <'FilePath/Filename'> – The file name and path for the output file
  • <SQL Query> – The SQL query to run

In the following example script, we export the data from the orders_summary table using a SQL query and write it to the orders.txt text file on the local machine:

\echo `date`
\echo 'Job started'

\export report file="/<FilePath>/orders.txt"
\rset rformat on
\pset format wrapped
\rset width 50
\rset titledashes on
select * from orders_summary limit 100;

\export reset
\echo 'Job Ended'
\echo `date`
\exit 0 

Automate the Amazon Redshift RSQL ETL script

One of the options to automate Amazon Redshift RSQL scripts to run on a specific schedule is via shell scripting. You can schedule the shell script via a CRON job, a command line scheduling utility.

We use the following syntax:

#!/bin/sh
rsql -D awsuser -f <RSQL_SCRIPT> > <LOG_FILE>

We provide the following information:

  • <RSQL_SCRIPT> – The RSQL script to run
  • <LOG_FILE> – The output log file

In the following example shell script, we run the Amazon Redshift RSQL script that we created and write the output log to a log file on the local machine. You can schedule the shell script via a CRON job, as shown after the script.

#!/bin/sh
SCRIPTS_DIR="<SCRIPTS_DIR>"
LOG_DIR="<LOG_DIR>"

RSQL_SCRIPT="${SCRIPTS_DIR}/<RSQL_SCRIPT>.sql"
LOG_FILE="${LOG_DIR}/test_rsql.log"

touch $LOG_FILE

rsql -D awsuser -f ${RSQL_SCRIPT} > ${LOG_FILE}
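
For example, assuming the shell script above is saved as /home/ec2-user/scripts/run_rsql_etl.sh (a hypothetical path), the following crontab entry runs it every day at 1:00 AM:

0 1 * * * /home/ec2-user/scripts/run_rsql_etl.sh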

Clean up

To avoid incurring future charges, stop the Amazon Redshift cluster created for the purpose of this post.

Conclusion

In this post, we explained how to use Amazon Redshift RSQL to perform ETL operations. We also demonstrated how to implement advanced error handling and enhanced control flow in your Amazon Redshift RSQL ETL script.

If you're using scripts via the psql command line client on Amazon Redshift, you can run them on Amazon Redshift RSQL with no modification. If you're migrating your Teradata BTEQ scripts to Amazon Redshift RSQL, you can use the AWS Schema Conversion Tool (AWS SCT) to automatically convert BTEQ scripts to Amazon Redshift RSQL scripts.

To learn more, refer to Amazon Redshift RSQL.


About the Authors

Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performant analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.

Sudipta Bagchi is a Specialist Solutions Architect at Amazon Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performant analytics solutions. Outside of work, he loves running, traveling, and playing cricket.
