AWS AppFlow: Unlock your SaaS data

Yves Callaert · Published in ITNEXT · Aug 3, 2020


The SaaS industry has grown tremendously over the past couple of years, but with that growth comes an ever-growing demand to analyse the data these tools produce.

Most SaaS solutions bring their own “analytics” plugin, usually at a hefty price, but they still operate in silos (you usually can’t upload your own data into the SaaS solution). So how do we break the silos? By getting all the data into one location.

But how do we do this? Do we code everything from scratch, or do we look at the services that cloud providers offer? In most cases the discussion comes down to speed of delivery versus the availability of programmers. It’s a healthy practice to objectively assess the quality of a managed service before you decide to build from scratch. So, let’s see whether AppFlow is an awesome tool or a complete disaster.

AWS AppFlow: An intro

AWS AppFlow is a relatively new service, introduced on the 22nd of April 2020, that is used to get SaaS data into the AWS platform.

The full architecture of AWS AppFlow is described by Amazon in the picture below.

AppFlow Architecture

Quickly summed up, you have a source, you can do some operations on the fields and then you write it to a destination. Looks simple enough right? In the next sections we will use a real-life example to highlight the different sections of this architectural drawing.

Google Analytics data in AWS

In the following sections we will work out a use case that sources data from Google Analytics and writes it to AWS RDS.

Setup Access to Google

Before we can actually do anything, we need OAuth credentials on Google Cloud to be able to interact with the Google Analytics API. So let’s set that up first (the instructions are taken from the AWS documentation page).

1. On the Google API Console (https://console.developers.google.com), choose Library.
2. Enter analytics in the search field.
3. Choose Google Analytics API.
4. Choose Google Analytics Reporting API listed in the search results.
5. Choose ENABLE and return to the main page.
6. Choose OAuth consent screen.
7. Create a new Internal app (if you’re using your personal account, choose External).
8. Add amazon.com as an Authorized domain.
9. Choose Add scope.
10. Choose Save.
11. Choose Credentials.
12. Add OAuth client ID credentials.
13. Choose Web application.
14. Enter https://console.aws.amazon.com/ as an authorized JavaScript origins URL.
15. Enter https://AWSREGION.console.aws.amazon.com/appflow/oauth as an authorized redirect URL. (Replace AWSREGION with the Region you’re working in.)
16. Choose Save.

When you are done, the credentials will look something like the screenshot below.

OAuth Google

Set up your first AppFlow

With our credentials in place we are ready to get started in AWS. Log on to your AWS account and from the services, select “AppFlow”.

Introduction screen AppFlow

From the introduction screen, press “Create flow”. On the left side you will see that you need to navigate through 5 steps.

Left panel — side navigation

In step 1, the only thing we need to provide is the name of the flow. So, type in a meaningful name. For now, we will leave the other fields blank.

Flow name — Flow Details

When you are done, move to step 2 by pressing “Next” at the bottom of the page. In step 2 we define the data source we want to read from. There are quite a lot of services available, but since we are focusing on Google Analytics, choose “Google Analytics” from the drop-down.

Source selection

Next, we need to specify a valid connection. If you have never specified a connection, you will only get the option to create a new one. Once a connection is specified it can be re-used and will appear in the drop-down.

Once you have pressed “Create new connection” you will need to specify a “Client ID” and a “Client secret”. You can find both on your Google credentials page. Once entered, press continue and Google will attempt to validate the connection. Once validated, the connection will show up in the connection field.

Next we select the object, which is always Reports.

After selecting the object, we need to decide whether to extract the data from a specific view or from all of them. What you select here is use-case specific: either you pick a specific view or you select “All web site data”.

Finally, we arrive at the destination endpoint, which can only be S3. Choose an S3 bucket (or quickly create one and then return to this screen). If you want, you can specify a bucket prefix to group your data (a prefix called “raw” could be used to indicate this is the raw data from the source). Note that once you have selected an S3 bucket, the flow trigger window will appear. We will leave those settings at their default values.

In the additional settings you have some options that you might or might not want to use. For now, I suggest checking “Aggregate all records” (yes, I have a hidden agenda with selecting this option).

Once you are done press next.

Step 3 is where it gets interesting. Here we define the metrics and dimensions we want to use for reporting. It’s entirely up to you what you select here; my suggestion is to start small and work your way up.

Once you have selected the “Source field name” you can press the button “Map fields directly” to get the destination field name automatically generated.

Once you are happy with the fields you have selected, press next to go to step 4. In step 4 we can add filters to our data, but let’s skip this for now and continue to step 5. If you are happy with the overview in step 5, press “create flow”.

Data output

Before we continue, let’s have a look at what the data structure looks like. Select your flow by clicking on it, then press “Run flow”. This might take a while depending on the data size; if the flow is successful, it will generate a link to the file created in S3.

Let’s have a look at the content of this file (I will only show a small snippet of my file, yours might look different depending on the dimensions and metrics you have selected).

{
  "reports": [
    {
      "columnHeader": {
        "dimensions": [
          "ga:date",
          "ga:deviceCategory",
          "ga:mobileDeviceInfo",
          "ga:browser",
          "ga:eventCategory",
          "ga:eventAction"
        ],
        "metricHeader": {
          "metricHeaderEntries": [
            {
              "name": "ga:users",
              "type": "INTEGER"
            },
            {
              "name": "ga:newUsers",
              "type": "INTEGER"
            }
          ]
        }
      },
      "data": {
        "rows": [
          {
            "dimensions": [
              "20200724",
              "mobile",
              "(not set)",
              "Chrome",
              "Outbound links",

Great, a JSON-structured file, but what if you want this data in tabular format? After all, this data will be analyzed by the business, so handing them this file will just lead them right back to your desk, asking for help to load the data.

There are a couple of ways to achieve this transformation, but we will load the data into a PostgreSQL instance, making it easier for the business to query the data or put a reporting tool on top of it.

AWS Lambda to the rescue

If you have no idea what AWS Lambda is, I suggest you look here. The short explanation is that AWS Lambda is a serverless solution that lets you run code within a maximum time frame of 15 minutes. A Lambda function can be triggered in many ways, but for our solution we will trigger the function each time a file is written to the S3 bucket that we defined as the target in our AppFlow flow.

So, let’s get coding.

There are a couple of things our program needs to do:

  • Read the JSON file
  • Create a tabular format in a generic way
  • Write the data to RDS

We will highlight certain parts of the code; for the entire solution you can visit my GitHub repo.

In order to understand which file we need to process, we need to extract both the bucket name and the object key from the triggered event. Once we have this info we can uniquely identify the file.

bucket_name = event['Records'][0]['s3']['bucket']['name']
object_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
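
For context, these two lines live inside the Lambda handler. A minimal skeleton of the handler, with illustrative names (the full implementation is in the repo), could look like this:

import urllib.parse

import boto3

s3_client = boto3.client('s3')


def lambda_handler(event, context):
    # 1. identify the file that triggered this invocation
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    object_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # 2. load the JSON report from S3
    # 3. convert it to a tabular format with pandas
    # 4. write the result to the database with to_sql
    # (each of these steps is shown in the snippets below)
    ...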

Once we have loaded the JSON file, we can convert the file to a tabular format. To achieve this we need the column names and metric names, which AWS was kind enough to supply inside the JSON.

cols = [r for r in jsonf[0]['reports'][0]['columnHeader']['dimensions']]
...
metrics = [r['name'] for r in jsonf[0]['reports'][0]['columnHeader']['metricHeader']['metricHeaderEntries']]

Next we will start extracting the data.

data_rows = [r for r in jsonf[list_index]['reports'][0]['data']['rows']]

From here we construct a dictionary that has the dimension or metric name as its key and, as its value, a list containing all related values for that dimension or metric. The benefit of doing it this way is that at the end we can simply create a pandas DataFrame from the entire dictionary object.
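
A minimal sketch of how such a dictionary could be built from the cols, metrics and data_rows extracted above (this assumes the standard Reporting API row layout, where metric values sit under metrics[0]['values']; the exact implementation is in the repo):

from collections import defaultdict

dim_result_dict = defaultdict(list)

for row in data_rows:
    # one list entry per dimension column, in the same order as the header
    for col_name, value in zip(cols, row['dimensions']):
        dim_result_dict[col_name].append(value)
    # metric values come back as strings under metrics[0]['values']
    for metric_name, value in zip(metrics, row['metrics'][0]['values']):
        dim_result_dict[metric_name].append(value)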

pd_result = pd.DataFrame.from_dict(dim_result_dict)

There we go, pd_result now contains all the data in a tabular format. From here, writing the data to a database is really simple: a DataFrame has a “to_sql” method, which allows us to write directly to a database if a valid connection is specified.

from sqlalchemy import create_engine
...
pd_result.to_sql(name=db_tmp_table, con=db.create_engine(), index=False, if_exists='replace')

We are using the sqlalchemy library to construct a valid database connection. The data is now in the database.
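
For reference, constructing such an engine for a PostgreSQL target could look like the sketch below. The host, database, credentials and table name are placeholders, the connection string needs a PostgreSQL driver such as psycopg2 packaged with the Lambda, and in production you would pull the credentials from Secrets Manager, as noted in the deployment section.

import os

from sqlalchemy import create_engine

# placeholder values, read from environment variables for simplicity
db_user = os.environ['DB_USER']
db_password = os.environ['DB_PASSWORD']
db_host = os.environ['DB_HOST']
db_name = os.environ['DB_NAME']

engine = create_engine(f'postgresql://{db_user}:{db_password}@{db_host}:5432/{db_name}')

# 'ga_report_raw' is an illustrative table name
pd_result.to_sql(name='ga_report_raw', con=engine, index=False, if_exists='replace')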

Deploying a lambda

While this post is not intended as an exercise in AWS Lambda, I will quickly touch on a few points about how to successfully deploy the Lambda function.

Either you script the entire deployment (which I recommend, because you will inevitably need to redeploy to fix mistakes) or you set it up manually.

Before we start we need an IAM role that gives the lambda function the necessary execution rights, so let’s head over to IAM → Roles and create a new role with the following policies attached.

IAM Role Lambda Configuration

The managed policy is a policy you need to add yourself to give the Lambda full access to the S3 bucket you used to store your results.

On a side note, I have added Secrets Manager permissions, but in this example we are not storing our database credentials in Secrets Manager. In production you do want to use Secrets Manager.
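
For completeness, fetching the credentials from Secrets Manager inside the function is a small change; here is a hedged sketch (the secret name and keys are placeholders):

import json

import boto3

secrets_client = boto3.client('secretsmanager')

# placeholder secret name; the secret stores a JSON document with the credentials
response = secrets_client.get_secret_value(SecretId='prod/appflow-ga/postgres')
db_credentials = json.loads(response['SecretString'])  # e.g. {"user": "...", "password": "..."}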

Next, I run the script “packager.sh”, which is part of the solution’s repo, to zip the code and upload it to AWS Lambda. If you want to do this manually, that’s up to you.

So are we done? Almost: you need to navigate to your AWS Lambda function and add an S3 trigger. In the trigger configuration, select the bucket that is used as the target of the AppFlow flow.

Every time an object is written to S3, the lambda function will trigger and execute.
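
If you prefer to script this part as well, the same trigger can be wired up with boto3; the function and bucket names in the sketch below are placeholders:

import boto3

lambda_client = boto3.client('lambda')
s3_client = boto3.client('s3')

function_name = 'appflow-ga-loader'        # placeholder
bucket_name = 'my-appflow-target-bucket'   # placeholder
function_arn = lambda_client.get_function(FunctionName=function_name)['Configuration']['FunctionArn']

# allow S3 to invoke the function
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId='AllowS3Invoke',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn=f'arn:aws:s3:::{bucket_name}',
)

# invoke the function for every object created in the bucket
s3_client.put_bucket_notification_configuration(
    Bucket=bucket_name,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {'LambdaFunctionArn': function_arn, 'Events': ['s3:ObjectCreated:*']}
        ]
    },
)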

For those completely new to AWS Lambda: in order for a Lambda function to interact with a database, the function needs to be in the same VPC as your database. This is out of scope for this exercise, but you can find the documentation here.

Run and view the data

Right, we are finally done and ready to see our data. From AppFlow, manually trigger your flow; after a while you will get a message that the output has been generated on S3.

Now head over to AWS CloudWatch and let’s see what the logging has to say.

Cloudwatch for AWS Lambda

As you can see, the data was successfully converted and written to the database.
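
Those log lines come from plain logging calls in the handler; Lambda ships anything written through the logging module (or print) to CloudWatch automatically. As an illustration:

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# inside the handler, after the write to the database succeeds
logger.info('Wrote %s rows from s3://%s/%s to the database', len(pd_result), bucket_name, object_key)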

AWS AppFlow: Pitfalls, Errors & Missing things

While developing the solution I ran into a few issues. None of them was a major obstacle, but each needed a specific fix.

So here is the summary of issues I found.

Issue 1: No automation

This is not so much an issue as an annoyance. Because the service is so new, there is no way to create a flow through the CLI or Terraform. There is a GitHub request open to incorporate this service into the AWS CLI, but we will have to be patient.

Issue 2: Dimensions and metrics — Request Failed

While testing, I randomly selected some dimensions and metrics and tried running my flow. However, Google restricts which dimensions can be combined with which metrics (check the Dimensions & Metrics Explorer for all valid combinations), which resulted in the following:

AWS Code 400

Important to note: AWS will not validate whether you created a valid or invalid combination. So if you see this error, check your metrics/dimensions combination.

Issue 3: Aggregating Data over multiple files → JSONDecodeError

Google Analytics works with paging, meaning that if you go over 1,000 records, a new file is created with the next set of results. When we check “Aggregate all records”, AWS combines the multiple files into one. Great, right?

However, when I got a result of 1054 records I suddenly got the following issue in my code.

I was confused, as my original code did work on smaller datasets, so I looked into the file and found out that AWS takes the JSON objects from the multiple files and simply writes them one after the other, producing a final file that is not valid JSON. So I needed to extend my code to do some find-and-replace.

raw_object = s3_client.get_object(Bucket=bucket_name, Key=object_key)
raw_data = json.loads('[' + raw_object['Body'].read().decode('utf-8').replace('}\n{', '},{') + ']')

Now we have a valid JSON object.
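
To make the fix concrete, here is a toy illustration (the payload is made up) of what the aggregated file looks like and how the find-and-replace turns it into a valid JSON array:

import json

# two JSON documents written back to back, separated by a newline,
# which is not valid JSON on its own
raw_data = '{"page": 1}\n{"page": 2}'

fixed = json.loads('[' + raw_data.replace('}\n{', '},{') + ']')
print(fixed)  # [{'page': 1}, {'page': 2}]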

Issue 4: Invalid object key

As explained before, the AWS Lambda event contains both the bucket name and the object key (basically the full path to the file). However, in the beginning I was reading the object key with the following line of code:

object_key = event['Records'][0]['s3']['object']['key']

To understand why this doesn’t work, you need to look at the file name. It contains the time, with the parts separated by “:”, and S3 URL-encodes such characters in the event payload. If you just read the object key as-is, you get a value that is different from the actual file location. To reconstruct the original file name, you can use the “urllib.parse” library.

import urllib.parse
...
object_key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
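
As an illustration with a made-up key: the “:” characters in the timestamp arrive as “%3A” in the event payload, and unquote_plus restores the original name.

import urllib.parse

# hypothetical key as it appears in the S3 event payload
encoded_key = 'raw/flow-name-2020-08-03T10%3A15%3A00-result'
print(urllib.parse.unquote_plus(encoded_key))
# raw/flow-name-2020-08-03T10:15:00-result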

AWS AppFlow: Conclusion

While I only tested one source connector, AppFlow seems to be a well thought-out service. Here is my simple breakdown:

The positives

  • Easy integration with Google Cloud
  • Well documented setup for specific sources
  • The steps make it easy enough to understand what you are doing
  • It doesn’t really require a lot of technical knowledge, so you could give this to a business person
  • Good integration with other AWS services
  • No coding needed for different source integrations

The negatives

  • No automation to deploy AppFlow (at the time of writing)
  • No validation of metrics / dimension combination in Google Analytics
  • Only 1 target available (RDS out of the box would have been nice)
  • Still requires some coding to create tabular data or set up with Athena
  • Why create an invalid JSON object if AWS aggregates multiple files into 1?

The amount of engineering is heavily reduced, and if your Lambda functions are generic enough, you can even hand this tool to the business (along with a small guide on what they need to do). It will be interesting to see how this service evolves in the future.

Hope you enjoyed this article.
