
I am new to Python. My data is in Redshift and I want to process it faster in Python. I am using Python because I want to run various algorithms and do various calculations on this data, and that is not possible in Redshift. I have followed tutorials, but loading the data into Python takes too much time every time. Here is my code:

import psycopg2
import pandas as pd

# Connect to the Redshift cluster
con = psycopg2.connect(dbname="xxxx", host="redshifttest-icp.cooqucvshoum.us-west-2.redshift.amazonaws.com", port="5439", user="xxxx", password="xxxx")

# Pull the whole table into a pandas DataFrame
a = "select * from xxxx;"
df = pd.read_sql(a, con)
df = df.fillna(0)

This is working, but I want some method that could help me process the data faster. Can anyone please help me?

vihaa_vrutti
  • What do you intend to do here? If the data from your data source is too big, you can load it in chunks rather than transferring the whole dataset at once (see the sketch after these comments). – oim Dec 29 '17 at 06:46
  • I just want to know various methods to load data from Redshift; my data is in Amazon Redshift. – vihaa_vrutti Dec 29 '17 at 06:49
  • Redshift is vastly faster than Python on a single server. Recommend converting your logic to SQL and running it in Redshift. – Joe Harris Dec 29 '17 at 16:08
  • What do you mean by "process data faster"? What are you trying to do? Why do you wish to use pandas? If you provide more information, you'll get a better answer. Feel free to Edit your question to provide more details. – John Rotenstein Dec 29 '17 at 22:13
  • @JohnRotenstein, I have edited my question. – vihaa_vrutti Jan 02 '18 at 05:51
  • I want to do various calculations on the data and also run various algorithms, which is not possible in Redshift. I have GBs of data. @JoeHarris – vihaa_vrutti Jan 02 '18 at 05:53
  • What do you mean by "process data faster"? What commands are "slow"? Are you running the Python app on an EC2 instance in the same region (if so, what Instance Type are you using?), or is it on your own computer outside of AWS? If it is outside of AWS, then the slow speed is probably because you are doing a `SELECT *` on GB of data! – John Rotenstein Jan 02 '18 at 06:01
  • Look, I want my data in Python for some analysis that is not possible in Redshift, but any query to fetch the entire data, or even chunks of it, takes me an entire day. I want something like Hadoop that can connect to AWS, but I am not finding sources that explain how to connect my AWS data to Python using a Hadoop-like infrastructure. @JohnRotenstein – vihaa_vrutti Jan 02 '18 at 06:56
  • Are you running the Python app on an EC2 instance in the same region (if so, what Instance Type are you using?), or is it on your own computer outside of AWS? – John Rotenstein Jan 02 '18 at 20:20
  • I am running Python on my own computer. – vihaa_vrutti Jan 03 '18 at 05:18
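
For reference, a minimal sketch of the chunked loading idea mentioned in the comments above; the table name, credentials and chunk size are placeholders, and chunking mainly bounds memory use rather than speeding up the network transfer itself.

import psycopg2
import pandas as pd

con = psycopg2.connect(dbname="xxxx", host="xxxx", port="5439", user="xxxx", password="xxxx")

# Stream the result set in 100,000-row batches instead of one huge fetch
chunks = []
for chunk in pd.read_sql("select * from xxxx;", con, chunksize=100000):
    chunks.append(chunk.fillna(0))

df = pd.concat(chunks, ignore_index=True)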

2 Answers


If you are retrieving a large number of rows from Redshift (more than about 10,000), the fastest way is to use the UNLOAD command to extract them as CSV directly to S3. You can then retrieve the extract and manipulate it in Python.
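
A minimal sketch of that approach; the bucket, IAM role and table names are placeholders, reading s3:// paths with pandas assumes s3fs is installed, and PARALLEL OFF is used here only so that a single output file is produced.

import psycopg2
import pandas as pd

con = psycopg2.connect(dbname="xxxx", host="xxxx", port="5439", user="xxxx", password="xxxx")
cur = con.cursor()

# Ask Redshift to export the query result straight to S3 as CSV
cur.execute("""
    UNLOAD ('select * from my_table')
    TO 's3://my-bucket/exports/my_table_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    DELIMITER ',' ALLOWOVERWRITE PARALLEL OFF;
""")
con.commit()

# UNLOAD output has no header row, so supply column names yourself
df = pd.read_csv("s3://my-bucket/exports/my_table_000", header=None)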

If you are processing a much larger number of rows (millions), then I suspect you will be limited by the speed of Python. In that case I recommend using Spark / PySpark and the spark-redshift package. Spark will perform the UNLOAD for you behind the scenes, and your data frame calculations can be parallelized by Spark across a cluster of servers.

# Read data from a query
df = spark.read \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("query", "select x, count(*) my_table group by x") \
  .option("tempdir", "s3n://path/for/temp/data") \
  .load()
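
As a side note, the spark-redshift package has to be available on the cluster for the format above to resolve. Below is a minimal sketch of pulling it in from PySpark via the spark.jars.packages setting; the Maven coordinates and version are an assumption to check against the spark-redshift documentation, and the Redshift JDBC driver jar is also required.

from pyspark.sql import SparkSession

# Ask Spark to download the spark-redshift package at startup
# (coordinates/version are an assumption -- use the build matching your Spark/Scala)
spark = SparkSession.builder \
    .appName("redshift-load") \
    .config("spark.jars.packages", "com.databricks:spark-redshift_2.11:3.0.0-preview1") \
    .getOrCreate()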
Joe Harris
  • I did the same as the above code, but now it is giving me the error "Failed to find Spark jars directory. You need to build Spark before running this program." @JoeHarris – vihaa_vrutti Jan 10 '18 at 10:15
  • It could be because you don't have the `com.databricks.spark.redshift` package available for the format of your input source, or because you didn't specify the location of the additional jars at the command line or in a Spark context configuration. Try changing the format to one of [the supported native types](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataFrameReader.html#creating-dataframes-from-files). The format should be the same as the files you're working with in S3. – Adam Oct 06 '18 at 07:20

One package that may be worth investigating is awswrangler, which, though it has a large number of dependencies, offers a convenient wrapper for many Redshift (and other) AWS interactions.

You first need to establish a Redshift connection, e.g. by specifying the host, port, database name and credentials.

You then use this connection with built-in functions, including standard SQL queries and unloads, which, as mentioned by others, are more performant for larger queries.
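
A minimal sketch of that flow, assuming a recent awswrangler release whose Redshift functions accept a redshift_connector connection; the host, credentials, table, bucket and IAM role shown are placeholders.

import awswrangler as wr
import redshift_connector

# Establish the connection (host, database and credentials are placeholders)
con = redshift_connector.connect(
    host="redshifttest-xxxx.us-west-2.redshift.amazonaws.com",
    database="xxxx",
    user="xxxx",
    password="xxxx",
)

# Small result sets: run SQL directly into a pandas DataFrame
df = wr.redshift.read_sql_query("select * from my_table limit 1000", con=con)

# Larger result sets: UNLOAD to S3 behind the scenes, then read the files back
df_big = wr.redshift.unload(
    sql="select * from my_table",
    con=con,
    path="s3://my-bucket/unload/",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-role",
)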

kowpow