Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.
There are various ways to interact with Amazon Redshift. You can programmatically access data in your Amazon Redshift cluster by using the Amazon Redshift Data API, or you can use a web-based interface such as Amazon Redshift Query Editor V2 to author SQL queries. You can also interact with Amazon Redshift in interactive or batch mode through Amazon Redshift RSQL.
Previously, you had to use the PostgreSQL psql command line tool to interact with Amazon Redshift for simple use cases such as importing and exporting data to and from Amazon Redshift or submitting SQL in batch or interactive mode, or for advanced use cases such as performing ETL operations without writing complex ETL jobs.
Now you can use Amazon Redshift RSQL, a new command line tool, to connect to an Amazon Redshift cluster and perform operations such as describing database objects, querying data, viewing and exporting query results in various output file formats, running scripts that include both SQL and complex business logic, performing data cleansing and transformation operations using familiar SQL, and writing ETL and ELT scripts using enhanced control flow and advanced error handling. Moreover, if you're migrating from self-managed data warehousing engines such as Teradata to Amazon Redshift and using Teradata BTEQ (Basic Teradata Query) scripting for data automation, ETL, or other tasks, you can now migrate those scripts to Amazon Redshift RSQL.
This post explains how to use Amazon Redshift RSQL for ETL, data cleansing and preparation, enhanced control flow, and exception and error handling.
Amazon Redshift RSQL lets you connect to your Amazon Redshift cluster through a command line. It supports the capabilities of the PostgreSQL psql command line tool along with an additional set of Amazon Redshift-specific capabilities:
- Describe properties or attributes of external tables in an AWS Glue catalog or Apache Hive metastore, external databases in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, and tables shared using Amazon Redshift data sharing
- Use enhanced control flow commands such as \if, \elseif, \else, \endif, \goto, and \label
- Use single sign-on (SSO) authentication using ADFS, PingIdentity, Okta, Azure AD, or other SAML/JWT-based identity providers (IdPs), as well as browser-based SAML IdPs with multi-factor authentication (MFA)
The following are some commonly used commands in Amazon Redshift RSQL. We use these commands in this post to demonstrate different possible use cases with Amazon Redshift RSQL scripts.
- \echo – Prints the specified string to the output.
- \remark – An extension of the \echo command that can break the output over separate lines.
- \goto – Skips all intervening commands and SQL statements and resumes processing at the specified \label in sequence. The \label must be a forward reference; you can't jump to a \label that lexically precedes the \goto.
- \label – Establishes an entry point for running the program, as the target of a \goto command.
- \exit – Stops running Amazon Redshift RSQL. You can also specify an optional exit code.
- \q – Logs off database sessions and exits Amazon Redshift RSQL.
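As a quick illustration of how these commands fit together, the following is a minimal sketch (it assumes an open connection and an existing orders table; the '//' inside the \remark string is the marker that breaks output across lines):

```sql
\echo 'Starting the sample script'
\remark 'This output//spans two lines'

-- Jump forward past a statement we want to skip.
\goto REPORT
SELECT COUNT(*) FROM orders;   -- skipped by the \goto above

\label REPORT
\echo 'Resumed processing at the REPORT label'
\exit 0
```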
The following are the prerequisites for using Amazon Redshift RSQL and performing the steps in this post:
- An AWS account
- A Linux, Windows, or macOS operating system (Amazon Redshift RSQL is available for these operating systems)
- An Amazon Redshift cluster
- SQL knowledge
Additionally, complete the following prerequisite steps:
- Install Amazon Redshift RSQL on your local machine. For instructions, refer to Getting started with Amazon Redshift RSQL.
- Connect to the Amazon Redshift cluster.
- Create the orders and orders_summary tables using the following DDL statements:
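The post's DDL statements aren't reproduced here; as a placeholder, a minimal sketch of the two tables might look like the following (the column names and types are assumptions for illustration, not the post's actual schema):

```sql
-- Hypothetical schemas for illustration only.
CREATE TABLE orders (
    o_orderkey   BIGINT,
    o_custkey    BIGINT,
    o_totalprice DECIMAL(12,2),
    o_orderdate  DATE
);

CREATE TABLE orders_summary (
    o_orderkey   BIGINT,
    o_totalprice DECIMAL(12,2),
    order_year   INT,
    order_month  INT
);
```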
Import data into the Amazon Redshift cluster
There are multiple ways to load data into Amazon Redshift tables, including using Amazon Redshift RSQL. In this section, we review the syntax and an example Amazon Redshift RSQL script that loads data into an Amazon Redshift table using the COPY command.
We use the following syntax:
We provide the following parameters:
- <location> – The location of the source data to be loaded into the target table
- <arn> – The AWS Identity and Access Management (IAM) role for accessing the data
- <region> – The AWS Region of the source data
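With those parameters, the COPY syntax takes roughly the following shape (a sketch; <target_table> is a placeholder for the table being loaded):

```sql
COPY <target_table>
FROM '<location>'
IAM_ROLE '<arn>'
REGION '<region>';
```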
In the following example Amazon Redshift RSQL script, we load data from an Amazon Simple Storage Service (Amazon S3) bucket location into the orders table:
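A sketch of such a load script (the S3 path and IAM role ARN here are made-up placeholders, not values from the post):

```sql
\echo 'Loading data from Amazon S3 into the orders table'

COPY orders
FROM 's3://example-bucket/orders/'                              -- hypothetical S3 location
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'      -- hypothetical IAM role
REGION 'us-east-1';

\echo 'Load complete'
```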
Enhanced control flow
You can use Amazon Redshift RSQL to define programmatic enhanced control flow and conditional blocks in your ETL script. We use the following syntax:
The syntax consists of the following components:
- <condition> – The conditional statement
- <code_block1> – The code block to run when the condition is met
- <code_block2> – The code block to run when the condition is not met
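The conditional syntax can be sketched as follows (placeholders in angle brackets match the components above):

```sql
\if <condition>
    <code_block1>
\else
    <code_block2>
\endif
```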
In the following example script, we perform some conditional checks using \if, \elseif, and \else commands based on the count of records from the orders table, and we display some messages based on the record count value:
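One way such a check might look (a sketch; it branches on :ACTIVITYCOUNT, the RSQL variable holding the number of rows returned by the last statement):

```sql
SELECT * FROM orders;

\if :ACTIVITYCOUNT = 0
    \echo 'The orders table returned no records'
\elseif :ACTIVITYCOUNT = 1
    \echo 'The orders table returned exactly one record'
\else
    \echo 'The orders table returned multiple records'
\endif
```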
Error handling
You can use Amazon Redshift RSQL to define exception handling blocks in your ETL script to handle various user-defined and system-generated error conditions that you might encounter while running the ETL process.
The following are some of the error handling options supported in Amazon Redshift RSQL:
- :ACTIVITYCOUNT – This variable is similar to the psql variable ROW_COUNT, which returns the number of rows affected by the last SQL statement. However, ACTIVITYCOUNT reports the affected row count for SELECT, COPY, and UNLOAD statements, which ROW_COUNT does not. For SQL statements returning data, ACTIVITYCOUNT specifies the number of rows returned to Amazon Redshift RSQL from the database.
- :ERRORCODE – This variable contains the return code for the last SQL statement submitted to the database. Error code 0 specifies that the SQL statement completed without any errors.
- :ERRORLEVEL – This variable is used to assign severity levels to errors. You can use these severity levels to determine a course of action. The default value is ON.
- :MAXERROR – This variable designates a maximum error severity level beyond which Amazon Redshift RSQL halts job processing. If SQL statements in Amazon Redshift RSQL scripts produce an error severity level greater than the designated maxerror value, Amazon Redshift RSQL immediately exits.
- :LAST_ERROR_MESSAGE – This variable contains the error message of the most recent failed query.
We use the following syntax:
The syntax consists of the following information:
- <ERRORCODE> – The error code number
- <LAST_ERROR_MESSAGE> – The error message of the most recent failed query
- <code_block1> – The code block to run when the error condition is met
- <code_block2> – The code block to run when the error condition is not met
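Putting those pieces together, the error handling pattern can be sketched like this (placeholders in angle brackets):

```sql
<SQL statement>;

\if :ERRORCODE <> 0
    \echo 'The last statement failed:'
    \echo :LAST_ERROR_MESSAGE
    <code_block1>
\else
    <code_block2>
\endif
```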
In the following example script, we create the orders_staging table and copy records into it from an Amazon S3 location. The script also contains an exception handling section for both the table creation and the copy process to handle potential errors encountered during the process.
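A sketch of such a script (the staging columns, S3 path, and IAM role ARN are all hypothetical placeholders):

```sql
\echo 'Creating the orders_staging table'

CREATE TABLE orders_staging (
    o_orderkey   BIGINT,
    o_totalprice VARCHAR(20),
    o_orderdate  VARCHAR(10)
);
\if :ERRORCODE <> 0
    \echo 'Table creation failed:'
    \echo :LAST_ERROR_MESSAGE
    \exit 1
\endif

COPY orders_staging
FROM 's3://example-bucket/staging/'                             -- hypothetical S3 location
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'      -- hypothetical IAM role
REGION 'us-east-1';
\if :ERRORCODE <> 0
    \echo 'COPY into orders_staging failed:'
    \echo :LAST_ERROR_MESSAGE
    \exit 2
\endif

\echo 'Staging load completed successfully'
```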
Data transformation and preparation
You can perform some common data preparation and transformation operations on your dataset using SQL statements in the Amazon Redshift RSQL ETL script. In this section, we demonstrate data transformation and preparation operations such as casting, new data column creation, and splitting an input column into multiple columns.
We use industry standard SQL statements for transforming and preparing data for downstream consumption.
In the following example script, we use a SQL statement to transform the data from the orders_staging table and insert it into the orders_summary table:
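A sketch of such a transformation, assuming hypothetical staging columns (a VARCHAR o_totalprice and a 'YYYY-MM-DD' o_orderdate string), so the CAST and SPLIT_PART calls are illustrative rather than the post's exact SQL:

```sql
INSERT INTO orders_summary (o_orderkey, o_totalprice, order_year, order_month)
SELECT
    o_orderkey,
    CAST(o_totalprice AS DECIMAL(12,2)),               -- cast text to a numeric type
    CAST(SPLIT_PART(o_orderdate, '-', 1) AS INT),      -- split out the year as a new column
    CAST(SPLIT_PART(o_orderdate, '-', 2) AS INT)       -- split out the month as a new column
FROM orders_staging;
```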
Export data from an Amazon Redshift cluster and output file formatting options
You can use Amazon Redshift RSQL to extract data from one or multiple Amazon Redshift tables and write it to your disk for consumption by downstream applications. Amazon Redshift RSQL uses the \export option to export the result of a query to an output file.
The following are some of the useful output formatting options supported in RSQL:
- \rset rformat on – This command is required for all the formatting commands to take effect.
- \pset format – Formats include aligned, AsciiDoc, CSV, HTML, LaTeX, LaTeX longtable, troff-ms, unaligned, and wrapped.
- \pset border – This option specifies border information for output data. A value of 0 means no border, 1 means internal dividing lines, and 2 means table frame.
- \rset heading – This command adds the specified heading to the output result.
- \rset rtitle – This command adds the specified heading as well as the current system date of the client computer.
- \rset titledashes on/off – This command specifies whether to print a line of dash characters between the column names and the column data returned for the SQL query.
- \rset width – This command specifies the target width for each line in a report.
- \t – This command turns off printing of column names as well as the result row count at the end of the output (footers).
We use the following syntax:
We provide the following information:
- <'FilePath/Filename'> – The file name and path for the output file
- <SQL Query> – The SQL query to run
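The \export syntax can be sketched as follows, mirroring the information above:

```sql
\export report file=<'FilePath/Filename'>
<SQL Query>;
\export reset
```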
In the following example script, we export the data from the orders_summary table using a SQL query and write it into the orders.txt text file on the local machine:
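A sketch of such an export script, combining the formatting options with \export (the output path is a hypothetical local path):

```sql
\rset rformat on
\pset format wrapped
\pset border 1
\rset heading 'Orders Summary Report'
\rset titledashes on
\rset width 120

\export report file='/tmp/orders.txt'    -- hypothetical local path
SELECT * FROM orders_summary;
\export reset

\echo 'Export complete'
```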
Automate the Amazon Redshift RSQL ETL script
One option for automating Amazon Redshift RSQL scripts to run on a specific schedule is shell scripting. You can schedule the shell script via a cron job, a command line utility.
We use the following syntax:
We provide the following information:
- <RSQL_SCRIPT> – The SQL script to run
- <LOG_FILE> – The output log file
In the following example shell script, we run the Amazon Redshift RSQL script that we created and write the output log to a log file on the local machine. You can schedule the shell script through a cron job.
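A sketch of such a wrapper script plus a sample crontab entry (the paths and the DSN name are hypothetical; check the flags supported by your installed rsql client):

```shell
#!/bin/bash
# Hypothetical paths; adjust for your environment.
RSQL_SCRIPT=/home/ec2-user/scripts/redshift_etl.sql
LOG_FILE=/home/ec2-user/logs/redshift_etl.log

# Run the RSQL script against a configured DSN and capture stdout and stderr in the log.
rsql -D my_redshift_dsn -f "$RSQL_SCRIPT" > "$LOG_FILE" 2>&1

# Example crontab entry to run this wrapper daily at 02:00:
# 0 2 * * * /home/ec2-user/scripts/run_etl.sh
```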
To avoid incurring future charges, stop the Amazon Redshift cluster created for the purpose of this post.
In this post, we explained how to use Amazon Redshift RSQL to perform ETL operations. We also demonstrated how to implement advanced error handling and enhanced control flow in your Amazon Redshift RSQL ETL script.
If you're using scripts through the psql command line client on Amazon Redshift, you can run them on Amazon Redshift RSQL with no modification. If you're migrating your Teradata BTEQ scripts to Amazon Redshift RSQL, you can use the AWS Schema Conversion Tool (AWS SCT) to automatically convert BTEQ scripts to Amazon Redshift RSQL scripts.
To learn more, refer to Amazon Redshift RSQL.
About the Authors
Saman Irfan is a Specialist Solutions Architect at Amazon Web Services. She focuses on helping customers across various industries build scalable and high-performance analytics solutions. Outside of work, she enjoys spending time with her family, watching TV series, and learning new technologies.
Sudipta Bagchi is a Specialist Solutions Architect at Amazon Web Services. He has over 12 years of experience in data and analytics, and helps customers design and build scalable and high-performance analytics solutions. Outside of work, he loves running, traveling, and playing cricket.