Friday, April 27, 2018

Using Lambda and the new Firehose Console to transform data

Objective

Show you how to create a delivery stream that ingests sample data, transforms it, and stores both the source and the transformed data.

Intro

Amazon Kinesis Firehose is one of the easiest ways to prepare and load streaming data into the AWS ecosystem. Firehose was first released in October 2015 and has evolved from a simple solution that stored your data without any modification into a delivery stream with transformation features. In July 2017 the delivery stream console was updated to offer more options and reduce the amount of work required to store and transform your data. This post offers you a guide to set up a proof of concept that will:

  1. Filter and transform sample data with an AWS Lambda function and store the results in S3.
  2. Keep the raw sample data in S3 for future analysis.
  3. Check the capabilities of the console, like encryption and compression.
  4. Take advantage of the Firehose sample data producer (you won't need to create any script).

Prereqs

  • You will need an AWS account

Steps


Step 0: Access the Kinesis Firehose service


  1. Log in to the AWS console.
  2. Search for the Kinesis service using the "Find a service ..." text box, or find it under the "Analytics" category.
  3. Click on "Create delivery stream". If you have not created a Kinesis stream before, you will need to press "Get Started" first.


Step 1: Name and source


  1. Select a name for your delivery stream; for this demo I will use "deliveryStream2018".

  2. Choose a "Source", for this demo select "Direct PUT or other resources". Essentially you have two options here: Use a Kinesis Stream as the input for the delivery stream or you can send the records by other means:
  • PUT API: You will use this option if your custom application will feed the delivery stream directly with the AWS SDK.
  • Kinesis Agent: Use the agent to send information from logs produced by your applications, in other words, the agent will track the changes in your log files and send the information to the delivery stream.
  • AWS IoT: If you have an IoT ecosystem, you can use the rules to send messages to your Firehose stream. 
  • CloudWatch Logs: Sends any incoming log events that match a defined filter to your delivery stream.
  • CloudWatch Events: Deliver information of events when a CloudWatch rule is matched.
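If you go with the PUT API route, ingestion from your own application can be as simple as the following sketch with the AWS SDK for JavaScript. The stream name matches the one created in this tutorial; the region and record contents are illustrative:

'use strict';

const AWS = require('aws-sdk');
const firehose = new AWS.Firehose({ region: 'us-east-1' }); // use your stream's region

// One sample stock-ticker record, in the same shape the demo producer uses
const record = { ticker_symbol: 'NGC', sector: 'HEALTHCARE', change: -0.08, price: 4.73 };

firehose.putRecord(
    {
        DeliveryStreamName: 'deliveryStream2018',       // the stream created in this tutorial
        Record: { Data: JSON.stringify(record) + '\n' } // Firehose stores the bytes as-is
    },
    (err, data) => {
        if (err) console.error('Put failed:', err);
        else console.log('Record ID:', data.RecordId);
    }
);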
  1. Click "Next".

Step 2: Transform Records


  1. Once data is available in a delivery stream, we can invoke a Lambda function to transform it. To our relief, some ready-to-use blueprints are offered by AWS and you can adapt them according to your data format. In this tutorial, we will transform sample data offered by Firehose, so select "Enabled".

  2. Select "Create New".
  3. You will see a list of blueprints to use. We will process custom data, so select the first one, "General Firehose Processing". You will be taken to a new page; do not close the previous one, as we will come back to it.

  4. The Lambda "Create function" page will open.
  5. Choose a "Name" for your function. 
  6. In the "Role" dropdown, select "Create new role from template(s)", this will create a new role to allow this Lambda function to logging to CloudWatch. Choose a "Name role", you may want to remember this one to delete it quickly when we are done with the tutorial.
  7. Leave the "Policy templates" field empty.
  8. Once you are ready select "Create function" and wait for the editor to appear.

Step 2.1: Into the Lambda realm


  1. Scroll down until you see the "Function code" section

  2. Change "Runtime" to "Node.js 8.10".
  3. The "index.js" file should be available to edit, if it is not, open the file with a double click in the file name on the left side. 
  4. Remove all the code and copy the next function and paste it into the editor:


'use strict';
console.log('Loading function');

/* Stock ticker format parser (case-insensitive thanks to the /i flag) */
const parser = /^\{\"ticker_symbol\"\:\"[A-Z]+\"\,\"sector\"\:\"[A-Z]+\"\,\"change\"\:[-.0-9]+\,\"price\"\:[-.0-9]+\}/i;
// Example entry: {"ticker_symbol":"NGC","sector":"HEALTHCARE","change":-0.08,"price":4.73}


exports.handler = (event, context, callback) => {
    let success = 0; // Number of valid entries found
    let failure = 0; // Number of invalid entries found
    let dropped = 0; // Number of dropped entries

    /* Process the list of records and transform them */
    const output = event.records.map((record) => {
        const entry = Buffer.from(record.data, 'base64').toString('utf8');
        console.log("Entry: ", entry);
        const match = parser.exec(entry);
        if (match) {
            const parsed_match = JSON.parse(match[0]);
            const milliseconds = new Date().getTime();
            /* Add timestamp and convert to CSV */
            const result = `${milliseconds},${parsed_match.ticker_symbol},${parsed_match.sector},${parsed_match.change},${parsed_match.price}\n`;
            const payload = Buffer.from(result, 'utf8').toString('base64');
            if (parsed_match.sector !== 'RETAIL') {
                /* Dropped event: not part of the transformed set, leave the record intact */
                dropped++;
                return {
                    recordId: record.recordId,
                    result: 'Dropped',
                    data: record.data,
                };
            }
            else {
                /* Transformed event */
                success++;
                return {
                    recordId: record.recordId,
                    result: 'Ok',
                    data: payload,
                };
            }
        }
        else {
            /* Failed event: notify the error and leave the record intact */
            console.log("Failed event: " + record.data);
            failure++;
            return {
                recordId: record.recordId,
                result: 'ProcessingFailed',
                data: record.data,
            };
        }
    });
    console.log(`Processing completed. Ok: ${success}, Dropped: ${dropped}, ProcessingFailed: ${failure}.`);
    callback(null, { records: output });
};


  5. Go back to the function header and look for the test dropdown right before the "Test" button. Select "Configure Test Event" in the dropdown; a secondary window will appear.
  6. Select "Create new test event" and choose "Kinesis Firehose" as the "Event template".
  7. Enter an "Event name".
  8. Copy and paste the following JSON object into the editor to use it as the input for your test.

{
  "records": [
    {
      "recordId": "49583354031560888214100043296632351296610463251381092354000000",
      "approximateArrivalTimestamp": 1523204766865,
      "data": "eyJ0aWNrZXJfc3ltYm9sIjoiTkdDIiwic2VjdG9yIjoiSEVBTFRIQ0FSRSIsImNoYW5nZSI6LTAuMDgsInByaWNlIjo0LjczfQ=="
    }
  ],
  "region": "us-east-1",
  "deliveryStreamArn": "arn:aws:kinesis:EXAMPLE",
  "invocationId": "invocationIdExample"
}

The data attribute is base64-encoded; this is how Firehose hands records to the transformation Lambda. After decoding, the value of this data is:


{
   "ticker_symbol":"NGC",
   "sector":"HEALTHCARE",
   "change":-0.08,
   "price":4.73
}
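
If you want to craft additional test records, the following quick sketch (runnable with Node.js, the same runtime as the function) shows how the data attribute can be encoded and decoded; the record contents are just an example:

const record = { ticker_symbol: 'NGC', sector: 'HEALTHCARE', change: -0.08, price: 4.73 };

// Encode: this is the value you can paste into the "data" field of a test event
const encoded = Buffer.from(JSON.stringify(record), 'utf8').toString('base64');
console.log(encoded);

// Decode: back to the original JSON string
console.log(Buffer.from(encoded, 'base64').toString('utf8'));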


  1. Select "Create", you will be taken back to the Function editor.
  2. Make sure to press "Save" to save your changes in the editor.
  3. Now run your test by selecting your test in the dropdown and press "Test".
  4. You should get quick green results, check the details of the execution to know more.
    1. If you expand the "Details" section you will be able to see the output.
    2. You may want to  look at the Base64 decoded object. 
    3. In this case, we are filtering and transforming the stocks where price is 5.0 or greater. The one that we are using for testing has a 4.73 as price, so this record ends as a "Dropped" record, indicating that is not going to be part of the transformation set, but it did not provoke an error. 
    4. A record that will be part of the transformation set will have a result attribute of "OK".
    5. You can remove the filter if you want to transform all your data. 
  5. Now you can go back to the Kinesis Firehose tab, you can return to to this tab later if you want to dig deeper.
  6. Back into the Firehose delivery stream wizard, close the "Choose Lambda blueprint" dialog.
  7. Select your newly created function in the "Lambda function" dropdown, refresh if necessary.
  8. Ignore the timeout warning, this lambda function does not require too much time to execute, so keep going and select "Next".



Step 3: Choose a destination

We have configured a serverless function to transform our records, but we have not yet selected where to store them, nor whether we want to keep the raw records. In this case, we will do both.

  1. Select "Amazon S3" as destination for simplicity. This will be the service where we will store our transformed data.





  2. Select an existing bucket or create a new one.
  3. You may set a prefix for your files; I will use "transformed" to distinguish them from the source files. Firehose adds a timestamp prefix automatically in any case.
  4. In "S3 backup", select "Enabled" to store the raw data too.
  5. Select the destination bucket for the backup or create one; you may set a prefix for this too.
  6. Go ahead and press "Next".



Step 4: Configure settings


  1. Leave your S3 buffer conditions as they are. They indicate the maximum amount of time that may pass or the maximum amount of data that may be gathered before Firehose delivers a batch to S3. This is an OR condition: as soon as either threshold is reached, the delivery happens.





  2. If you want to save space and secure your data, you can select your desired compression and encryption options. I am using the defaults for this tutorial.





  3. Error logging is enabled by default; you can keep it that way in case you want to debug your code later.





  4. Firehose needs an IAM role to access the corresponding resources, such as S3. In "IAM role", choose "Create new, or Choose"; a new tab will open.




  5. As we selected S3 in the previous steps, the IAM policy that we need has already been prepared for us. Review it if you are interested and press "Allow". The role will be created and the tab will close.




  6. The new role will be listed in the "IAM role" dropdown; you can select a different one if needed.




  1. Select "Next" when ready.


Step 5: Review your configuration


  1. Take a moment to review the options you have selected; when ready, select "Create delivery stream".





  2. You will be taken to the "Firehose delivery stream" page; your new stream should become active after a few seconds.



Step 6: Test your work

Firehose allows you to send demo data to your stream; let's try it out.

  1. Select your stream radio button to enable the "Test with demo data" button.




  1. Click the "Test with demo data" button. You will see the "Test with demo data" section.
  2. Select "Start sending demo data"





  4. Do not leave this page until you complete the next steps, but be sure to stop the demo to save money once you see the results in your S3 bucket(s). If you close the tab, the demo data should stop too.
  5. On this same page, scroll down and check the "Monitoring" tab. Wait a couple of minutes and use the refresh button to see the changes in the metrics.







  6. Wait up to 5 minutes and then check your bucket for results; they will be inside folders representing the date. Download the files produced and review the results. Your "source_records" folder holds the backup (raw) data.





  7. What if something goes wrong? Where are the logs? You can check your logs in CloudWatch. In the "Monitoring" tab you will see a link to the CloudWatch console; once there, select "Logs" in the menu, then look for your Lambda or Firehose logs in the list.
  8. Go back to the Firehose tab and select "Stop sending demo data".



Cleaning up

Once you feel comfortable with the flow and the services used in this tutorial, it is a good idea to delete these resources.
If you are under the Free Tier, you will only incur costs when your Firehose delivery stream is being fed and you are outside the Lambda and S3 free tier limits, so as long as you are not producing and inserting data into the stream, you should not be charged. Still, it is a good idea to remove everything when you are done.

Delete the delivery stream


  1. Go to the Firehose console page.
  2. Select your delivery stream.
  3. Press on the "Delete" or "Delete Delivery Stream" button depending on your location.


Delete S3 files and/or bucket


  1. Go to the S3 console page.
  2. Note: To select an item in S3, do not click its link; select the row or checkbox instead.
  3. You may want to remove the files only; in that case, open the bucket, select the folders inside it and, in the "More" menu, select "Delete".
  4. If you want to delete the bucket too, go back to the S3 console, select the destination bucket that you used for this tutorial and press the "Delete bucket" button.

Delete the Lambda function


  1. Access the Lambda console.
  2. On the left menu, select "Functions". Select your Lambda function and in the "Actions" menu, select "Delete".
  3. You can also delete the function directly from the function editor using "Actions" and then "Delete function".


Delete the Roles


  1. Remember that you have created two roles during this tutorial, one for Lambda and one for Firehose.
  2. Access the IAM console.
  3. Select "Roles" on the left menu.
  4. You chose the name of the Lambda role in a previous step; look for it and select it. If you are not sure which one it is, you can display the creation time of the roles using the gear icon at the top of the list; this role and the Firehose role should have been created around the same time.
  5. The role created for Firehose should be named "firehose_delivery_role" unless you have chosen a different name.
  6. To delete them, select them using the checkboxes next to each item and then click "Delete Role". You will be shown information about the roles so you can confirm they are the ones you want to remove.


Conclusion

Firehose is a fully managed service that automatically scales to match your throughput requirements without any ongoing administration, and you can extend its capabilities with Lambda functions, as we have demonstrated in this tutorial: we ingested data from a system that produces sample stock records, filtered and transformed it into a different format, and kept a copy of the raw data in S3 for future analysis.


Thursday, March 15, 2018

AWS Certified Solution Architect Associate Tips

In this article, you will find relevant information about the topics and the sources that I have used to prepare and pass the AWS Certified Solution Architect (CSA) - Associate exam.

The reason to share

First of all, knowing about the experience of others is quite important to understand the current state of an exam where questions change frequently. Also, AWS is well known for improving services at a fast pace, so some questions in the exam are outdated relative to current functionality. I do not want this to be a full guide, because that would take millions of words; instead, I want to give you a guide that includes all the topics I found in the exam and the level of knowledge that will be useful for you to pass it.

Results

I took and passed the CSA Associate exam on Friday, February 9. One day before, I got certified as an AWS Certified Developer (CD) - Associate. I will only cover the Architect case in this post.

My overall score: 91%.

The path that I have followed to get a score of 91%

I started studying intermittently in December 2017. From then until the first week of February 2018 I completed the A Cloud Guru Solution Architect course. I also took a deep look into the FAQs and documentation for the topics that I found difficult; I will describe them later in this post.
I used the first two weeks of February to recap. This again included reading the documentation of topics related to EC2, EBS, Auto Scaling, ELB, and SQS, as well as some others.

During this time I found a free database of AWS questions. I studied two sets: SAA v1, and I started v2, but the latter contains a lot of questions from the Professional level, so I decided to focus on the Associate ones. By the way, even though the set I studied included around 400 questions, only one of them appeared in the real exam. Does this mean they are a waste of time? No; in my opinion they do help, and it is worth taking the time to answer them. You will learn which topics you need to work on harder, and the complexity of the questions is really similar to those in the exam. You will also find discussions on some of them, because not all the published answers are correct, and you can follow the threads to verify your answer.

At the beginning of February I took the three practice exams that the A Cloud Guru Solution Architect practice exam offers. Each exam is different and helps you validate knowledge in different areas. Here I got scores of:
  • First one: 80%
  • Second: 70%
  • Third: 75%

After that, I took the online practice exams offered by Amazon and its proctor:
  • Solution Architect: 80%
  • Developer: 85%

At IO Connect Services, as part of our career development, we formed a study group where we covered topics such as Cloud, Big Data, Systems Integration and Software Engineering, to mention a few. We organized a series of study sessions with my teammates who are already certified; the one that helped me the most in this exam was the session on VPCs, because around 25% of the questions are related to this topic.

Deep dive into the exam

Single topic vs. Combination of topics

Some of the questions focus directly on the benefits of a specific service (e.g., S3) or on a specific feature (e.g., Cross-Region Replication in S3). Be aware that these kinds of questions are the minority.

Most of the questions combine two or more solutions, and topics like access, security, and costs are frequently mixed into a single question to test your knowledge (and they change the final solution). This makes sense: you are not studying to be an expert in the isolated pieces of the puzzle, you should know how they fit together in order to provide an end-to-end solution.


The topics that I’ve found in the exam

In the following section, I will cover the topics that I found in my exam, with helpful comments and links so you know where to complement your study. My intention is not to give you the answers, but a sense of the level required in each topic I've identified.

CloudFormation vs. Elastic Beanstalk

  • The difference between these services should not be hard to grasp. While CloudFormation provides a common language to describe and provision all the infrastructure resources in your cloud environment, Elastic Beanstalk is a quick and easy-to-use service for deploying and scaling web applications and services developed in popular programming languages.

SQS

  • Know the limits and defaults of the SQS messages:
    • Message retention
    • Message Throughput
    • Message size
    • Message visibility timeout
    • You may check the full list in the SQS Developer Guide.
  • Understand how to move from a standard queue to a FIFO queue (you cannot convert an existing queue in place; you need to create a new one).

Identity Federation vs. IAM

  • You may need to answer questions related to the use cases for federation, cross-account access, and IAM: creating users and their defaults.
  • You will find at least one question related to the steps to use SAML-based federation.

DynamoDB

  • You do not need to know DynamoDB in depth, but you do need to learn about the marketable features like high scalability and flexibility. Take a look at the benefits here.

SNS

Route53

  • A records, CNAME records, and Alias records.
  • Check the main features of Route 53.

RDS

  • Know that Multi-AZ helps you obtain high availability by means of database failover.
  • Remember that you can use read replicas with some database engines to increase read performance.
  • Be aware that there are questions with a requirement to access the OS of the instance running the database; as you will need to use EC2 for that, it automatically eliminates all the RDS options, because with RDS you cannot have direct access to the OS.
  • You may find an ElastiCache question; remember that you have two options for it:
    • Redis: Helps to manage and analyze fast moving data with a versatile in-memory data store.
    • Memcached: Helps to build a scalable Caching Tier for data-intensive apps.

CloudWatch vs. CloudTrail

EC2 + EBS

  • Do not forget that it is advised to stop an EC2 instance before taking an EBS snapshot; to encrypt its content, copy the snapshot with encryption enabled and then create encrypted volumes from that copy.
  • A question that will appear is related to Spot Instances and cost, so remember that if Amazon stops a Spot Instance, you will not be charged for the partial hour of usage.

VPC

  • Know the difference between NAT Instances and NAT Gateways.
  • Review the bastion host topic; remember that a bastion allows access to your private instances, but you need to configure the security of both your private and public subnets.

VPC + EC2 + Security Group (SG) + Access Control List (ACL)

  • You need to fully understand the characteristics of SGs and ACLs and how they work. In a few points, you need to understand:
    • The default rules for the default SG and ACL.
    • The default rules for the custom SG and ACL that you create.
    • The meaning of stateful and stateless.
  • I recommend learning the topics described here

VPC + EC2 + IPs

  • Remember how to assign a public IP to an instance.
  • You cannot change a subnet CIDR.
  • The subnet CIDR block size can be from /16 to /28.
  • Do not forget that subnets within a VPC can route to each other by default.

VPC + ELB + Auto Scaling

ELB

  • Know when to use a Classic, Network or an Application ELB.
    • An Application Load Balancer (ALB) can redirect to different ports too.
    • An ALB can redirect traffic according to the requests (so you can handle different microservices).
    • Do you know when to use a Network Load Balancer (Layer 4) vs. an ALB (Layer 7)? Check this table.
    • The Classic ELB was the first version and is mostly used by older configurations created before AWS set up default VPCs. It is not deprecated, but it is not recommended for new designs.

API Gateway:

  • Remember that you need to enable CORS in order to make successful cross-origin requests (for example, from a web page served from a different domain).
  • You can use CloudTrail with API Gateway to capture REST API calls in your AWS account and deliver the log files to an Amazon S3 bucket.

IAM

  • IAM is a pretty solid topic in the exams, study this topic here.
  • Remember that it is advised to assign roles to EC2 Instances instead of storing credentials on them.
  • I got one question related to cross-account access where the development team wanted to access the production environment; check Delegate Access Across AWS Accounts Using IAM Roles for more information about this scenario.

S3

  • Know the Bucket URLs format
    • http://<bucket>.s3-<aws-region>.amazonaws.com
  • Multipart upload
    • You can upload objects of up to 5 GB in a single PUT; from 5 GB up to 5 TB you must use multipart upload to avoid the "EntityTooLarge" error (see the sketch after this list).
  • Glacier & Infrequent Access (IA)
  • Cross-Region Replication
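
As a minimal illustration of the multipart upload point with the AWS SDK for JavaScript (bucket, key and file name are placeholders), s3.upload() manages the multipart mechanics automatically for large bodies:

const AWS = require('aws-sdk');
const fs = require('fs');
const s3 = new AWS.S3();

s3.upload(
    {
        Bucket: 'my-bucket',                        // placeholder bucket
        Key: 'backups/big-file.bin',                // placeholder key
        Body: fs.createReadStream('big-file.bin')   // large local file
    },
    { partSize: 10 * 1024 * 1024, queueSize: 4 },   // 10 MB parts, 4 parts in flight
    (err, data) => {
        if (err) console.error('Upload failed:', err);
        else console.log('Uploaded to', data.Location);
    }
);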

Shared Responsibility Model

  • Know your security responsibilities (most of the time related to access and security patches of your EC2s) vs. those from AWS.
  • When a storage device has reached the end of its useful life, AWS procedures include a decommissioning process, check the Storage Device Decommissioning topic in the AWS security whitepaper.

Kinesis

OpsWorks

  • Remember that if you need to use Chef, you can use the AWS OpsWorks service.

Storage solutions: Connect your enterprise network to AWS

  • Direct Connect: It will increase the speed and security of your connection to AWS, but it may take a while to be fully implemented.
  • Storage Gateway: You have Cached Volumes and Stored Volumes architectures.
  • AWS Import/Export: A service that accelerates data transfer into and out of AWS using physical storage appliances, bypassing the Internet.
    • You cannot export data from Glacier directly; you need to restore it to S3 first.

Final advice

  • I ended the exam without being comfortable with 3 or 4 questions, so for those I had to eliminate options instead of being sure about the correct answer.
  • If you are like me, you will consume all the time and try to review the difficult questions several times. A note here: the questions and some answers are rather long, and you may not have enough time to go back and check them all in a second round, so use flags to mark the ones you want to review.
  • Amazon exams do not have a fixed passing score; it varies depending on the scores obtained by other applicants. A colleague of mine passed with 77% the day before I took the exam; in December, another colleague got 67% and did not pass. Aim to get at least 70%, but you should be OK with 75%.
  • Read the Amazon documentation for those topics that you do not understand. Even better, take a look at the re:Invent sessions on the AWS YouTube channel. People at Amazon repeat the popular ones every year while adding some of the new features. I watched the ones related to VPCs and Auto Scaling a couple of times.
  • I read almost all the base documentation related to VPCs.

I hope this information helps you and others to prepare for the exam, as it has been one of the most difficult ones I have taken. I invite you to check the topics discussed in this article to identify your weak points and reinforce them with the AWS FAQs, documentation and re:Invent videos on YouTube; I found these last two to be the most effective way to understand the difficult topics, as the documentation and the presenters are pretty good.



Monday, February 19, 2018

MCD - API Design Associate tips



As a MuleSoft partner, IO Connect Services cares about continuous education and certification for all its employees. In late January I took and passed the exam for the MuleSoft Certified Developer - API Design Associate certification. MuleSoft recommends the Anypoint Platform: API Design course, which costs $1,500 USD for 2 days. Here I'm sharing the findings that helped me pass this exam.

Preparation guide

It all starts with what is covered in the exam. As you may know by now, MuleSoft publishes a preparation guide for every certification they offer. For this particular exam, you can find the course guide here.

https://training.mulesoft.com/static/public_downloadables/datasheets/APApiDesign3.9_datasheet.pdf

This guide helps you know which topics you have to master in order to pass the exam. In summary, you will have to know the following:
  • RESTful basics.
  • HTTP details in order to implement such APIs in a RESTful approach.
  • SOAP basics.
  • API-led connectivity lifecycle.
  • RAML 1.0.
  • Design APIs.
  • Define APIs using RAML 1.0.
  • Document APIs.
  • Secure APIs.
  • Test APIs.
  • Publish APIs on Anypoint Exchange.
  • Version best practices.

RESTful and SOAP basics.

RESTful and SOAP have been around for some time now. Nevertheless, it’s good to go back to basics from time to time. I’ve found this website that gives clear statements about basics and best practices when designing a RESTful application.

http://www.restapitutorial.com/

Make sure you understand the effect of the HTTP specification on a RESTful endpoint. HTTP codes, headers, requests, responses and more are covered in the exam. Make yourself comfortable with this topic, as it is very important for the design of an API.

Unlike REST, SOAP is an industry standard; one good reference is the W3Schools tutorial:

https://www.w3schools.com/xml/xml_soap.asp 

RAML 1.0

RESTful API Modeling Language, or RAML, is a standard to document APIs for RESTful applications. MuleSoft uses this standard in order to design and define APIs in Anypoint Platform and in the Mule runtime.

The first place you should look at is the specification itself.

https://github.com/raml-org/raml-spec/blob/master/versions/raml-10/raml-10.md

But if you’re a tutorial person, you can use the RAML 1.0 tutorial.

https://raml.org/developers/raml-100-tutorial

Make sure you can write an API in RAML with nothing but a notepad. You will get some RAML snippets and will have to answer questions based on them. This means you have to know whether the syntax is correct and what those snippets mean in the context of the question. Expect lots of questions on syntax, design best practices, versioning best practices, and security. This is one of the most important topics in the exam, as it is the core of API design in Mule.

One more thing: I have found a lot of people who think that RESTful means JSON. This is not true at all. While JSON is widely used in RESTful APIs, remember that REST also supports XML and other payload formats via the Content-Type header. This is particularly relevant for RAML, as it can describe payloads based on the content type you specify in the document.

API-led connectivity and lifecycle

MuleSoft has a set of best practices for APIs. This is very well documented as API-led connectivity. You can get a quick view here.

https://www.mulesoft.com/lp/whitepaper/api/api-led-connectivity

Also, MuleSoft has very specific products and practices to manage the lifecycle of your APIs through the Anypoint Platform, such as API designer, API portal and Exchange. Make sure you know all these products inside out. As part of the lifecycle management, be sure you understand the role of each product in it. To start looking into these components, see this link:

https://www.mulesoft.com/ty/ebook/api-lifecycle-management

One resource I discovered recently is the API Notebook, a tool for writing API tutorials that you can share with your peers and that runs JavaScript snippets.

https://api-notebook.anypoint.mulesoft.com/

Be sure to know its API; you can find it here:

https://api-notebook.anypoint.mulesoft.com/help/api-guide

Summary

In my experience taking this exam, I noticed the HTTP and RAML specifications are covered extensively. On the HTTP side, I got a bunch of questions about status codes, request formats, responses and headers needed to define an API properly.

I strongly advise you to get familiar with the API lifecycle management products in Anypoint Platform. Moreover, do your own study projects with these products: design an API from scratch using API Designer, publish it and make it discoverable within your organization. This will help you to understand MuleSoft's practices and products while you study the specs as well, and it will save you some time.

Let me know about your experience with this exam. Leave a comment and let's help others looking for guidance on this topic.

Resources


IO Connect Services - https://www.ioconnectservices.com/

IO Connect Services - MuleSoft partnership - https://www.mulesoft.com/partner/io-connect-services

API Design course overview - https://training.mulesoft.com/instructor-led-training/apdev-api-design

API Design course guide - https://training.mulesoft.com/static/public_downloadables/datasheets/APApiDesign3.9_datasheet.pdf

REST API Tutorial website - http://www.restapitutorial.com/

W3C SOAP tutorial - https://www.w3schools.com/xml/xml_soap.asp

RAML Specification - https://github.com/raml-org/raml-spec/blob/master/versions/raml-10/raml-10.md

RAML 1.0 tutorial - https://raml.org/developers/raml-100-tutorial

API-led connectivity - https://www.mulesoft.com/lp/whitepaper/api/api-led-connectivity

API Lifecycle management - https://www.mulesoft.com/ty/ebook/api-lifecycle-management

API Notebook - https://api-notebook.anypoint.mulesoft.com/

API Notebook guide - https://api-notebook.anypoint.mulesoft.com/help/api-guide

Tuesday, September 26, 2017

AWS Step Functions in Enterprise Integration

The article Achieve enterprise integration with AWS depicts the orchestration of Lambdas using Amazon Simple Workflow (SWF) with outstanding results. As stated there, SWF requires a standalone application running in order to process the flows, and this time we wanted to migrate the application to a 100% serverless solution. The article also mentions a new service that looks very promising in the serverless scenario: Step Functions. Here, we want to show you how we took the previous approach and transformed it into a Step Functions-led approach.

AWS Step Functions is a service that helps you create a flow based on several units of work, often implemented as AWS Lambdas. The service is basically a state machine: given an input, an initial state computes what is required by the underlying implementation and generates an output. This output serves as the input for the next state, whose output might be used as the input for another step, and so on until the flow completes and the last state executes. Each state, or node in the visual representation in the AWS Step Functions console, is implemented with a Lambda, and the flow of the state machine is orchestrated by the logic specified in the transition definitions.


AWS Step Functions provides the following functionality:

  • Creates a message channel between your Lambdas.
  • Monitors the Lambdas by reporting the status of each step.
  • Automatically triggers each step.

The Scenario

At IO Connect Services we wanted to test this new AWS service with an enterprise integration use case based on the scenario described in the SWF implementation. We modified the file size according to the AWS Step Functions free tier for testing purposes:

  1. Reduced the CSV (comma-separated values) file stored in AWS S3 from 800K+ to 100K+ records, with 40 columns per record. This is because we wanted to be sure the number of state transitions would not exceed the 4,000 included in the free tier. Reducing the file to 100K+ records gives approximately 400+ pages to create; in the case of the "Parallel Branches" approach (explained below) that consumes 1,200+ transitions, which allows at least 3 runs before passing the free tier limit, versus the 3,200+ pages created by the original file, which would consume approximately 9,200+ transitions and generate a cost of about $0.14 USD for the first execution and $0.23 USD per execution afterwards.
  2. Create pages of the file according to the specified batch size. SQS has a limit of 256 KB per message, and using the UTF-8 charset with 250 records per page/message gives approximately 230 KB.
  3. Store the pages as individual files in a storage service like AWS S3.


The main idea of this approach is to use AWS Step Functions as an orchestrator in order to exercise all the features it provides to support enterprise integration - logs, visual tools, easy tracking, etc. The actual work is implemented using AWS Lambda. Because of the AWS Lambda limits, the units of work are kept very small to avoid reaching those limits; hence, a good flow written in AWS Step Functions requires a series of perfectly orchestrated steps.


What we can do with AWS Step Functions.

This was a completely new tool for us, so we did some due diligence to find out what it can do, what it cannot do, and other useful information:

Can.

  • Use a simple JSON format text to create the State Machine.
  • Use states to call AWS Lambda Functions.
  • Run a defined number of branches that execute steps in parallel.
  • Use different languages for your Lambdas within the same state machine.
  • Send serializable objects like POJOs through the message channel.
  • Create, Run and Delete State Machines using the API.
  • Use the logs and other visual tools in order to see execution messages.
Cannot.
  • Edit an already created state machine. You'll have to remove it and then create a new one.
  • Launch the state machine from a trigger event such as a file being created in S3. Instead, you'll need to write a Lambda to trigger it (see the sketch after these lists).
  • Create a dynamic number of branches of states to be run in parallel. It's always a pre-defined set of parallel tasks.
  • Use visual tools (like drag and drop) to create the state machine. All the implementation must be done by writing JSON. The console only shows you a graph representing the state transitions; you cannot use it to create the state machine, it only visualizes it.
  • Send non-serializable objects through the message channel. This is a big point, as you must be sure the objects you return from your Lambdas are serializable.
  • Resume the execution if one of the steps fails. Either it runs completely or fails completely.
Consider this. 
  • The free tier allows you 4,000 step transitions free per month.
  • All the communication between steps is made using JSON objects.
  • A state machine name must be 1-80 characters long.
  • The maximum length of the JSON used as input or result for a state is 32768 characters.
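
As an example of the trigger workaround mentioned above, here is a minimal Node.js sketch of an S3-triggered Lambda that starts the state machine; the environment variable, the ".csv" filter and the input shape are assumptions for illustration:

'use strict';

const AWS = require('aws-sdk');
const stepfunctions = new AWS.StepFunctions();

exports.handler = (event, context, callback) => {
    const record = event.Records[0];            // S3 "object created" event
    const key = record.s3.object.key;
    if (!key.endsWith('.csv')) {
        callback(null, 'Not a CSV file, ignoring');
        return;
    }

    // Start the state machine and pass the S3 object details as its input
    stepfunctions.startExecution({
        stateMachineArn: process.env.STATE_MACHINE_ARN,   // assumed environment variable
        input: JSON.stringify({ bucket: record.s3.bucket.name, key: key })
    }, (err, data) => {
        if (err) callback(err);
        else callback(null, data.executionArn);
    });
};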

Approach 1: Lambda Orchestration.

For the first approach, we wanted to test how Step Functions works. For this purpose, we set only two steps in order to see what we can examine using the Step Functions logs and graph.

The StateMachine JSON
{
  "StartAt": "FileIngestion",
  "States": {
    "FileIngestion": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaFileIngestion",
      "Next": "Paginator"
    },
    "Paginator": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaPaginator",
      "End": true
    }
  }
}

The Graph

  1. As mentioned before, a state machine cannot be triggered directly by an S3 event or similar, so we used a Lambda to trigger the state machine when an S3 object ending with the ".csv" extension is created in a certain bucket. This Lambda starts the state machine, passing the S3 object details as an input parameter.
  2. The FileIngestion step calls a Lambda that reads the information provided by the trigger event to locate and read the file created in S3, calculates the number of pages to create, and returns this number and the file location as output.
  3. The Paginator step calls a Lambda that reads the lines of a single page, stores them in a variable, and then calls another Lambda asynchronously to write a file with the page content. This process is repeated until the original file is completely read.
In this approach, the Lambdas have more control over the flow than the state machine, because one Lambda calls another and orchestrates the asynchronous executions. Also, if the Lambda that writes the pages fails, you cannot see it in the graph; you need to check the Lambda executions and manually identify which Lambda failed and why.

The Metrics. 
  • The total execution time is 4 minutes on average to process 100K+ records.

Approach 2: Linear Processing.

Taking into account the previous implementation, we wanted to create a state machine with more authority over the control of the flow execution. As a first step, we decided to implement a linear execution with no parallelization.

The StateMachine JSON
{
  "StartAt": "FileAnalizer",
  "States": {
    "FileAnalizer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaFileIngestion",
      "Next": "FileChecker"
    },    
    "FileChecker":{
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.writeComplete",
          "BooleanEquals": false,
          "Next": "PageCreator"
        }
      ],
      "Default": "QueueChecker"
    },
    "ReadSQStoS3":{
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaPageWriter",
      "Next": "QueueChecker"
    },    
    "QueueChecker": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.emptyQueue",
          "BooleanEquals": true,
          "Next": "SuccessState"
        }
      ],
      "Default": "ReadSQStoS3"
    },    
    "PageCreator": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaPaginator",
      "Next": "FileChecker"
    },    
    "SuccessState": {
      "Type": "Pass",
      "End": true
    }
  }
}

The Graph


  1. The FileAnalizer step calls a Lambda function that consumes the .csv file and creates a POJO with the start and end byte of each page to be created; these bytes are calculated based on the page size parameter specified in the Lambda. You can see it this way: FileAnalizer creates an index of the start and end byte of each page.
  2. FileChecker is a Choice state that checks a boolean variable to determine whether all the pages have been written. The pages themselves are stored in an SQS queue.
  3. PageCreator calls a Lambda that reads the start and end bytes of each page from the received POJO, reads only that portion of the S3 file, and creates an SQS message with the page content.
  4. QueueChecker is similar to FileChecker, but in this case it waits until no messages are left in the SQS queue.
  5. ReadSQStoS3 is a Task state that calls a Lambda function; it reads a message from the SQS queue, representing one page of the .csv file, and stores it in an S3 folder.
  6. SuccessState ends the state machine execution.
For this approach, the message channel always contains the POJO with the start and end bytes of each page.

The Metrics.
  • The total execution time is 15 minutes on average to process 100K+ records.

Approach 3: Batch Writing.

We took the same state machine as in the linear processing, but the Lambda resource in the ReadSQStoS3 step was modified with the intention of reducing the execution time of the previous approach. We added long-polling behavior to the Lambda with a maximum of 10 messages: the Lambda waits for up to 10 messages in SQS (if 20 seconds pass and 10 messages are not visible, it takes whatever is available at that moment), retrieves them, and calls another Lambda asynchronously to write those 10 messages, as sketched below.
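
A minimal Node.js sketch of that long-polling read; the queue URL and writer function name are illustrative, and the writer Lambda would be responsible for deleting the messages after storing them in S3:

'use strict';

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();
const lambda = new AWS.Lambda();

exports.handler = (event, context, callback) => {
    sqs.receiveMessage({
        QueueUrl: process.env.PAGE_QUEUE_URL,   // assumed environment variable
        MaxNumberOfMessages: 10,                // up to 10 pages per poll
        WaitTimeSeconds: 20                     // long polling: wait for messages to become visible
    }, (err, data) => {
        if (err) return callback(err);
        const messages = data.Messages || [];
        if (messages.length === 0) return callback(null, { emptyQueue: true });

        // Hand the whole batch to the writer Lambda asynchronously
        lambda.invoke({
            FunctionName: process.env.WRITER_FUNCTION,   // assumed environment variable
            InvocationType: 'Event',                     // fire-and-forget
            Payload: JSON.stringify(messages)
        }, (invokeErr) => {
            if (invokeErr) return callback(invokeErr);
            callback(null, { emptyQueue: false, batchSize: messages.length });
        });
    });
};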

The Metrics. 
  • The total execution time is 10 minutes on average to process 100K+ records.

Approach 4: Parallel Branches.

For this implementation, we added a set of 5 branches in order to read the start and end bytes of each page and send a message to SQS with the page content in parallel.

Here we faced two problems:
  1. We confirmed the Step Functions limitation that you cannot create a dynamic number of parallel branches. This means you have to define a fixed set of parallel jobs from the beginning.
  2. At the end of the parallel execution (the 5 tasks depicted below), Step Functions aggregates the results of all tasks and passes a single message with all the results in it. This is a problem with big JSON structures, which can turn into an even bigger JSON at the end of the parallel execution. If this JSON is larger than 32,768 characters, an error is thrown and the execution fails.
The StateMachine JSON
{
  "StartAt": "FileAnalizer",
  "States": {
    "FileAnalizer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaFileIngestion",
      "Next": "FileChecker"
    },
    "FileChecker":{
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.writeComplete",
          "BooleanEquals": false,
          "Next": "ParallelWritePage"
        }
      ],
      "Default": "QueueChecker"
    },
    "ParallelWritePage":{
      "Type": "Parallel",
      "Next": "DeleteRead",
      "Branches": [
        {
          "StartAt": "SetBatchIndex0",
          "States": {
            "SetBatchIndex0": {
              "Type": "Pass",
              "Result": 0,
              "Next": "PageCreator0"
           },
            "PageCreator0": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        },
        {
          "StartAt": "SetBatchIndex1",
          "States": {
            "SetBatchIndex1": {
              "Type": "Pass",
              "Result": 1,
              "Next": "PageCreator1"
           },
            "PageCreator1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        },
        {
          "StartAt": "SetBatchIndex2",
          "States": {
            "SetBatchIndex2": {
              "Type": "Pass",
              "Result": 2,
              "Next": "PageCreator2"
           },
            "PageCreator2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        },
        {
          "StartAt": "SetBatchIndex3",
          "States": {
            "SetBatchIndex3": {
              "Type": "Pass",
              "Result": 3,
              "Next": "PageCreator3"
           },
            "PageCreator3": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        },
        {
          "StartAt": "SetBatchIndex4",
          "States": {
            "SetBatchIndex4": {
              "Type": "Pass",
              "Result": 4,
              "Next": "PageCreator4"
           },
            "PageCreator4": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        }
      ]
    },
    "DeleteRead":{
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaGetLastIndex",
      "Next": "FileChecker"
    },
    "ReadSQStoS3":{
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPageWriter",
      "Next": "QueueChecker"
    },    
    "QueueChecker": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.emptyQueue",
          "BooleanEquals": true,
          "Next": "SuccessState"
        }
      ],
      "Default": "ReadSQStoS3"
    },    
    "SuccessState": {
      "Type": "Pass",
      "End": true
    }
  }
}


The Graph


  1. FileAnalizer is a Task state in which a Lambda is called to read the .csv file stored in AWS S3, create the index of start and end bytes for each page, and store these indexes in an external service such as a database, file, or cache service.
  2. FileChecker is a Choice state. It reads a boolean variable that indicates whether all the pages were already stored as SQS messages.
  3. SetBatchIndexX is a Pass state. It sets a variable used by "PageCreatorX" to know which index to read next.
  4. PageCreatorX, depending on the integer value passed by "SetBatchIndexX", extracts the page's start and end bytes, uses them to read only the portion of the CSV file defined by those indexes, and sends an SQS message with the page content. It returns the information it read so that the next step knows which pages are left.
  5. DeleteRead receives the output payload of the parallel process, determines which pages were already written to SQS, and deletes the information related to those pages from the external service (step 1).
  6. QueueChecker, ReadSQStoS3 and SuccessState work the same as in the "Batch Writing" approach.
The Metrics. 
  • The total execution time is 6 minutes on average to process 100K+ records.

Conclusion.

AWS Step Functions is a tool that allows you to create and manage orchestration flows based on small units of work. The simplicity of the language makes it perfect for quick implementations, as long as you already have the units of work identified.

Unfortunately, as this is a fairly new service in the AWS ecosystem, functionality is still quite limited. Proof of this is the fact that you need to maintain a fixed number of parallel steps, and if you end up having less work than parallel branches you must add control logic to avoid unexpected errors.

Moreover, given the limits found in AWS Lambda and Step Functions, computing high workloads of information can be very difficult if you do not give good thought to how your design decomposes the processing. We highly recommend reading our Microflows blog post to understand what this means.

On the plus side, if you want to transport small portions of data or run small processes in a serverless fashion, Step Functions is a good tool for the job.
In the future, we will evaluate combining other new AWS services, like AWS Glue and AWS Batch, with Step Functions to achieve outstanding big data processing and enterprise integration.

Thanks for taking the time to read this post. I hope it is helpful when you decide to use Step Functions, and do not hesitate to drop a comment if you have any questions.

Tuesday, August 22, 2017

Achieve Enterprise Integration with AWS Lambdas, SWF and SQS

Recently, we were asked to build an ETL flow using Amazon Web Services. Because we excel in enterprise integration, we had a particular design in mind to make it happen. The job was pretty simple:
  1. The trigger was a file placed in a particular S3 bucket.
  2. Take the S3 object metadata of the file as the input of the job.
  3. Read the file and package the records into pages; each page is sent asynchronously as a message. This technique increases parallelism in the job processing, since the files contain one million records on average.
  4. Consume all pages asynchronously and upload them as micro-batches of records into a third-party system via a Restful API.
  5. Other tasks to complete the use case, such as recording the completion of the job in a database.
On top of these basic requirements, we had to make sure the system was robust, resilient and as fast as possible, while keeping the costs of the different systems low.

We chose to use different services from Amazon Web Services for this: S3, Simple Workflow (SWF), Simple Queue Service (SQS) and Lambda.

Here is a diagram of the solution.

Solution diagram


Why Simple Workflow (SWF)?

As you can see in the diagram, every task is executed by a Lambda function, so why involve Simple Workflow? The answer is simple: we wanted to create an environment where the sequence of task executions is orchestrated by a single entity, and we also wanted to be able to share the execution context with the different tasks.

If you think about it, we wanted to have something similar to a flow in a Mule app (MuleSoft Anypoint Platform).

It is important to highlight that AWS has some specific limits on Lambda execution; for example, a Lambda function can only run for a maximum of 5 minutes. Due to these limits, we had to break the tasks into small but cohesive units of work while having a master orchestrator that could run longer than that. Here is where the shared context comes in useful.

Note: There is another service that plays very well in the serverless paradigm as opposed to SWF: Step Functions. However, at the time we were working on this task it was still in beta, hence not suitable for production. There is a follow-up post about full serverless integration that includes Step Functions.

Challenges and recommendations


While working with SWF and Lambdas, we learned some things that helped us a lot to complete this assignment. Here I'll show you the situations and the solutions that worked for us.

Invoke Lambdas from activities, not workflow workers


One thing you should know about working with SWF is that every output of an activity returns as a Promise to the workflow worker - very similar to a Promise in JavaScript. This Promise returns the output as a serialized object that you need to deserialize if you want to use it as the input for a Lambda function execution directly from the workflow worker. This overhead can be very cumbersome if you do it frequently. In your Lambdas you are supposed to work with objects directly, not serialized forms.

Here is my first piece of advice: even though you can invoke a Lambda function from within a workflow worker, don't do it; use an activity worker instead. This way each workflow worker implements a unit of work that calls an activity worker, which in turn calls a Lambda function internally. Why? Because in the activity worker you will be able to pass a proper object to the Lambda as the input parameter. This technique requires you to deal with some extra plumbing in your SWF code, since you'll need one activity per Lambda, but in the end it provides a very flexible and robust mechanism to exchange information between SWF and Lambdas.

See this sequence diagram to understand it.

Workflow, activity and lambda sequence diagram.


Wrap your payload in a Message object


After all, we are talking about enterprise integration, and one of its central pieces is the message. In order to share information uniformly between the workflow and the different Lambdas, it is better to standardize this practice by using a custom Message object. This Message must contain the workflow context you want to share and the payload. When the Lambda functions are called, they receive this Message object and use it to extract the information required to perform the task fully, with no external dependency. A rough sketch of the idea follows.
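
The field names below are illustrative, not the actual classes from this project; the point is simply that every Lambda receives the same envelope shape:

// Envelope shared between the workflow and every Lambda (illustrative names)
const message = {
    context: {                       // workflow-level information shared across tasks
        jobId: 'job-42',
        sourceBucket: 'incoming-files',
        startedAt: new Date().toISOString()
    },
    payload: {                       // task-specific data
        pageNumber: 7,
        records: [ /* ... */ ]
    }
};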

Decompose large loads of data into small pieces


As mentioned before, Lambdas are supposed to run small tasks quickly and independently; therefore, they have limits that you should be aware of, such as execution time, memory allocation, ephemeral disk capacity, and the number of threads, among others. These are serious constraints when working with large amounts of data and long-running processes.

In order to overcome these problems, we recommend decomposing the entire file content into small pieces to increase task parallelism and improve performance in a safe manner - actually, this was one of the main reasons to use Lambdas, since they auto-scale nicely as the parallel processing increases. For this, we divided the file content into packages of records, or pages, where each page can contain hundreds or thousands of records. Each page was placed as a message in an SQS queue; the page size must respect the limit of 256 KB per message in SQS. A sketch of this paging step follows.
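
A minimal Node.js sketch of the paging idea; the queue URL and page size are illustrative, and in practice you would also verify that each serialized page stays under the 256 KB limit:

'use strict';

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Split the records into pages and enqueue each page as one SQS message
function enqueuePages(records, queueUrl, pageSize, callback) {
    const pages = [];
    for (let i = 0; i < records.length; i += pageSize) {
        pages.push(records.slice(i, i + pageSize));
    }

    // Send pages one after another; a real implementation could batch or parallelize
    function sendNext(index) {
        if (index >= pages.length) return callback(null, pages.length);
        sqs.sendMessage({
            QueueUrl: queueUrl,
            MessageBody: JSON.stringify({ pageNumber: index, records: pages[index] })
        }, (err) => {
            if (err) return callback(err);
            sendNext(index + 1);
        });
    }
    sendNext(0);
}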

Keep long running processes in Activities, not Lambdas


As you can see in the diagram above, there is a poller constantly looking for new messages in the SQS queue. This can be a long-running process if you expect tens of thousands of pages. For cases like this, having activities in your flow is very convenient, as an activity can run for up to one year; this contrasts sharply with the 5-minute execution limit of a Lambda function.

Beware of concurrency limits


Consider the scenario where you have an Activity whose purpose is to read the queue and delegate the upload of the micro-batches to an external system. Commonly, to speed up the execution you make use of threads - note I'm talking about Java but other languages have similar concepts. In this Activity, you may use a loop to create a thread per micro-batch to upload.

Lambda has a limit of 1,024 concurrent threads, so if you plan to create a lot of threads to speed up your execution, like uploading micro-batches to the external system mentioned above, first and most importantly use a thread pool to control the number of threads. We recommend not creating instances of Thread or Runnable directly; instead, create Java lambda expressions for each asynchronous task you want to execute, and make sure you use the AWSLambdaAsyncClientBuilder interface to invoke the AWS Lambdas asynchronously.
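
The paragraph above describes the Java approach used in this project; for readers working in Node.js, the equivalent idea is to cap the number of in-flight asynchronous Lambda invocations instead of spawning unbounded work. This is only a rough sketch and the function name is a placeholder:

'use strict';

const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

// Invoke the uploader Lambda for every micro-batch, at most `concurrency` at a time
async function invokeInBatches(microBatches, functionName, concurrency) {
    for (let i = 0; i < microBatches.length; i += concurrency) {
        const inFlight = microBatches.slice(i, i + concurrency).map((batch) =>
            lambda.invoke({
                FunctionName: functionName,       // placeholder uploader function
                InvocationType: 'Event',          // asynchronous, fire-and-forget
                Payload: JSON.stringify(batch)
            }).promise()
        );
        await Promise.all(inFlight);              // wait before starting the next group
    }
}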


Conclusion


This approach was particularly successful for a situation where we were not allowed to use an integration platform like Mule. It is also a very nice solution if you just need to integrate AWS services and move lots of data among them.

AWS Simple Workflow and Lambda work pretty well together, although they have different goals. Keep in mind that an SWF application needs to be deployed on a machine as a standalone program, either in your own data center, on an EC2 instance, or on another IaaS.

This combo will help you orchestrate and share different contexts, either automatically through activities or manually by using signals. If what you need is isolated execution and chaining is not relevant to you, then you could use Lambdas only; however, chained execution will not truly isolate the functions from each other, and the starting Lambda may time out before the Lambda functions triggered later in the chain finish their execution.

Moreover, every time you work with resources that have limitations like those of AWS Lambda, always bear in mind the restrictions they come with and design your solution around these constraints, ideally as microflows. Have a read of the Microflows post by Javier Navarro-Machuca, Chief Architect at IO Connect Services.

To increase parallelism, we highly recommend using information exchange systems such as queues, transient databases or files. In AWS you can make use of S3, SQS, RDS or DynamoDB (although our preference is SQS for this task).

Stay tuned, as we are working on a solution that uses Step Functions with Lambdas rather than Simple Workflow for a fully serverless integration.

Happy reading!

Resources


Enterprise Integration Patterns - http://www.enterpriseintegrationpatterns.com/
Amazon Simple Workflow - https://aws.amazon.com/swf/
Amazon Lambda - https://aws.amazon.com/lambda/ 
Amazon Simple Queue Service - https://aws.amazon.com/sqs/