Thursday, March 15, 2018

AWS Certified Solutions Architect - Associate Tips

In this article, you will find the topics and the resources I used to prepare for and pass the AWS Certified Solutions Architect (CSA) - Associate exam.

The reason to share

First of all, knowing about the experience of others is quite important to understand the current state of an exam whose questions change frequently. Also, AWS is well known for improving services at a fast pace, so some exam questions lag behind the current functionality. I do not want this to be a full guide, because that would take millions of words; instead, I want to give you a guide that covers all the topics I found in the exam and the level of knowledge that will help you pass it.

Results

I took and passed the CSA - Associate exam on Friday, February 9. The day before, I got certified as an AWS Certified Developer (CD) - Associate. I will only cover the Architect exam in this post.

My overall score: 91%.

The path that I have followed to get a score of 91%

I started studying intermittently in December 2017. Between then and the first week of February 2018 I completed the A Cloud Guru Solutions Architect course. I also took a deep look into the FAQs and documentation for the topics I found difficult; I describe them later in this post.
I used the first two weeks of February to recap, which again included reading the documentation for EC2, EBS, Auto Scaling, ELB, and SQS, among others.

During this time I found a free database of AWS questions. I studied two sets: I finished SAA v1 and started v2, but the latter contains many Professional-level questions, so I decided to focus on the Associate set. By the way, even though the set I studied included around 400 questions, only one of them appeared in the real exam. Does this mean they are a waste of time? No. In my opinion they do help, and it's worth taking the time to answer them: you will learn which topics you need to work on harder, and the complexity of the questions is very similar to those in the exam. You will also find discussions on some of them, because not all the posted answers are correct; follow the threads to verify your answer.

At the beginning of February I took the 3 practice exams offered by the A Cloud Guru Solution Architect Practice Exam. The exams are different and help you validate knowledge in different areas each time. Here are my scores:
  • First one: 80%
  • Second: 70%
  • Third: 75%

After that, I took the official online practice exams offered by Amazon through its exam provider:
  • Solutions Architect: 80%
  • Developer: 85%.

At IO Connect Services, as part of our career development, we formed study groups covering topics such as Cloud, Big Data, Systems Integration, and Software Engineering, to mention a few. I organized a series of sessions with teammates who are already certified; the one that helped me the most for this exam was the session on VPCs, because around 25% of the questions are related to that topic.

Deep dive into the exam

Single topic vs. Combination of topics

Some of the questions focus directly on the benefits of a specific service (e.g., S3) or a specific feature (e.g., Cross-Region Replication in S3). Be aware that these kinds of questions are the minority.

Most of the questions combine two or more services, and topics like access, security, and cost are frequently woven into a single question to test your knowledge (and they change the final solution). This makes sense: you are not studying to be an expert in the isolated pieces of the puzzle, you should know how they fit together in order to provide an end-to-end solution.


The topics that I’ve found in the exam

In the following section, I cover the topics I found in my exam, with helpful comments and links so you know where to complement your study. My intention is not to give you the answers, but a sense of the depth required in each topic I've identified.

CloudFormation vs. Elastic Beanstalk

  • The difference between these services should not be hard to grasp. CloudFormation provides a common language to describe and provision all the infrastructure resources in your cloud environment, while Elastic Beanstalk is a quick and easy-to-use service for deploying and scaling web applications and services developed in popular programming languages.

SQS

  • Know the limits and defaults of the SQS messages:
    • Message retention
    • Message Throughput
    • Message size
    • Message visibility timeout
    • You may check the full list in the SQS Developer Guide.
  • Understand how to move from a standard queue to a FIFO queue. You cannot convert an existing standard queue in place; you create a new queue with the .fifo suffix and migrate to it (see the sketch below).
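
A minimal sketch of the FIFO point above, using the AWS SDK for Java (v1). The queue name and the content-based deduplication setting are assumptions for illustration; the key detail is that the queue is created with the .fifo suffix and the FifoQueue attribute, because an existing standard queue cannot be switched in place.

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.CreateQueueRequest;

public class CreateFifoQueue {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // A standard queue cannot be converted in place; create a new .fifo queue and migrate.
        CreateQueueRequest request = new CreateQueueRequest("my-queue.fifo")   // hypothetical name
                .addAttributesEntry("FifoQueue", "true")                       // marks the queue as FIFO
                .addAttributesEntry("ContentBasedDeduplication", "true");      // dedupe on the body hash

        String queueUrl = sqs.createQueue(request).getQueueUrl();
        System.out.println("FIFO queue created: " + queueUrl);
    }
}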

Identity Federation vs. IAM

  • You may need to answer questions about the use cases for federation, cross-account access, and IAM: creating users and their defaults.
  • You will find at least one question about the steps to use SAML-based federation (a sketch of the key API call follows).
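
For reference, a minimal sketch of the central API call in SAML-based federation, using the AWS SDK for Java (v1). The role ARN, the SAML provider ARN, and the way the base64-encoded assertion reaches the program are hypothetical; in a real flow the identity provider returns the assertion after authenticating the user, and STS exchanges it for temporary credentials.

import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.AssumeRoleWithSAMLRequest;
import com.amazonaws.services.securitytoken.model.Credentials;

public class SamlFederationSketch {
    public static void main(String[] args) {
        String base64Assertion = args[0]; // SAML response returned by the identity provider

        AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();

        Credentials creds = sts.assumeRoleWithSAML(new AssumeRoleWithSAMLRequest()
                .withRoleArn("arn:aws:iam::ACCOUNT_ID:role/SamlFederatedRole")     // role mapped to the IdP
                .withPrincipalArn("arn:aws:iam::ACCOUNT_ID:saml-provider/MyIdP")   // SAML provider in IAM
                .withSAMLAssertion(base64Assertion))
                .getCredentials();

        // Temporary credentials replace long-lived IAM user keys for the federated user.
        System.out.println("Temporary access key: " + creds.getAccessKeyId());
    }
}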

DynamoDB

  • You do not need to know DynamoDB in depth, but you do need to learn about the marketable features like high scalability and flexibility. Take a look at the benefits here.

SNS

Route53

  • A records, CNAME records, and Alias records.
  • Check the main features of Route 53.

RDS

  • Know that Multi-AZ gives you high availability through automatic database failover.
  • Remember that you can use read replicas with some database engines to increase read performance (see the sketch after this list).
  • Be aware that some questions include a requirement to access the OS of the instance running the database; since RDS does not give you direct access to the underlying OS, that requirement forces you to run the database on EC2 and automatically eliminates all the RDS options.
  • You may find an ElastiCache question; remember that you have two engine options for it:
    • Redis: helps you manage and analyze fast-moving data with a versatile in-memory data store.
    • Memcached: helps you build a scalable caching tier for data-intensive apps.
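
As a reference for the read replica bullet, a minimal sketch with the AWS SDK for Java (v1); the instance identifiers and the instance class are hypothetical.

import com.amazonaws.services.rds.AmazonRDS;
import com.amazonaws.services.rds.AmazonRDSClientBuilder;
import com.amazonaws.services.rds.model.CreateDBInstanceReadReplicaRequest;
import com.amazonaws.services.rds.model.DBInstance;

public class CreateReadReplica {
    public static void main(String[] args) {
        AmazonRDS rds = AmazonRDSClientBuilder.defaultClient();

        // Offload read traffic from the source instance to a replica.
        DBInstance replica = rds.createDBInstanceReadReplica(new CreateDBInstanceReadReplicaRequest()
                .withSourceDBInstanceIdentifier("prod-mysql")      // hypothetical source instance
                .withDBInstanceIdentifier("prod-mysql-replica-1")  // name of the new replica
                .withDBInstanceClass("db.t2.medium"));             // may differ from the source

        System.out.println("Replica status: " + replica.getDBInstanceStatus());
    }
}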

CloudWatch vs. CloudTrail

EC2 + EBS

  • Do not forget that it is advised to stop an EC2 instance before taking an EBS snapshot of its volume. To encrypt the content of an unencrypted volume, copy that snapshot with encryption enabled and then create encrypted volumes from the encrypted copy (see the sketch after this list).
  • A question about Spot Instances and cost will appear, so remember that if Amazon interrupts a Spot Instance, you are not charged for the partial hour of usage.
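
A minimal sketch of the snapshot-copy-volume sequence described above, using the AWS SDK for Java (v1). The volume ID, region, and Availability Zone are hypothetical, and the sketch skips waiting for the snapshot and the copy to reach the completed state, which a real implementation would have to do.

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.CopySnapshotRequest;
import com.amazonaws.services.ec2.model.CreateSnapshotRequest;
import com.amazonaws.services.ec2.model.CreateVolumeRequest;
import com.amazonaws.services.ec2.model.CreateVolumeResult;

public class EncryptVolumeViaSnapshot {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // 1. Snapshot the (stopped) instance's unencrypted volume.
        String snapshotId = ec2.createSnapshot(
                new CreateSnapshotRequest("vol-0123456789abcdef0", "Pre-encryption snapshot"))
                .getSnapshot().getSnapshotId();

        // 2. Copy the snapshot with encryption enabled (wait for the source snapshot to complete first).
        String encryptedSnapshotId = ec2.copySnapshot(new CopySnapshotRequest()
                .withSourceRegion("us-east-1")
                .withSourceSnapshotId(snapshotId)
                .withEncrypted(true))
                .getSnapshotId();

        // 3. Create an encrypted volume from the encrypted copy.
        CreateVolumeResult volume = ec2.createVolume(new CreateVolumeRequest()
                .withSnapshotId(encryptedSnapshotId)
                .withAvailabilityZone("us-east-1a"));

        System.out.println("Encrypted volume: " + volume.getVolume().getVolumeId());
    }
}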

VPC

  • Know the difference between NAT Instances and NAT Gateways.
  • Review the bastion host topic; remember that a bastion gives you access to your private instances, but you need to configure the security of both your private and public subnets.

VPC + EC2 + Security Group (SG) + Access Control List (ACL)

  • You need to fully understand the characteristics of SGs and ACLs and how they work. In a few points, you need to understand:
    • The default rules for the default SG and ACL.
    • The default rules for the custom SG and ACL that you create.
    • The meaning of stateful and stateless.
  • I recommend learning the topics described here

VPC + EC2 + IPs

  • Remember how to assign a public IP to an instance at launch (see the sketch after this list).
  • You cannot change a subnet CIDR.
  • The subnet CIDR block size can be from /16 to /28.
  • Do not forget that all subnets in a VPC can route to each other by default through the local route.
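
A minimal sketch of assigning a public IP at launch time with the AWS SDK for Java (v1); the AMI and subnet IDs are hypothetical. Specifying the network interface with AssociatePublicIpAddress overrides the subnet's auto-assign setting.

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.InstanceNetworkInterfaceSpecification;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;

public class LaunchWithPublicIp {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Request a public IP explicitly for the primary network interface.
        InstanceNetworkInterfaceSpecification nic = new InstanceNetworkInterfaceSpecification()
                .withDeviceIndex(0)
                .withSubnetId("subnet-0123456789abcdef0")   // hypothetical public subnet
                .withAssociatePublicIpAddress(true);

        RunInstancesResult result = ec2.runInstances(new RunInstancesRequest()
                .withImageId("ami-0123456789abcdef0")       // hypothetical AMI
                .withInstanceType("t2.micro")
                .withMinCount(1).withMaxCount(1)
                .withNetworkInterfaces(nic));

        System.out.println("Launched: " + result.getReservation().getInstances().get(0).getInstanceId());
    }
}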

VPC + ELB + Auto Scaling

ELB

  • Know when to use a Classic, Network, or Application ELB.
    • An Application Load Balancer (ALB) can route to different ports too.
    • An ALB can route traffic according to the content of the request, so you can serve different microservices behind one load balancer (see the sketch after this list).
    • Do you know when to use a Network ELB (Layer 4) vs. an ALB (Layer 7)? Check this table.
    • The Classic ELB was the first generation and is mostly used by older setups created before AWS introduced VPCs. It is not deprecated, but it is no longer recommended.
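
To illustrate the content-based routing point, a minimal sketch that adds a path-based listener rule to an ALB with the AWS SDK for Java (v1); the listener and target group ARNs are hypothetical placeholders.

import com.amazonaws.services.elasticloadbalancingv2.AmazonElasticLoadBalancing;
import com.amazonaws.services.elasticloadbalancingv2.AmazonElasticLoadBalancingClientBuilder;
import com.amazonaws.services.elasticloadbalancingv2.model.Action;
import com.amazonaws.services.elasticloadbalancingv2.model.ActionTypeEnum;
import com.amazonaws.services.elasticloadbalancingv2.model.CreateRuleRequest;
import com.amazonaws.services.elasticloadbalancingv2.model.RuleCondition;

public class AlbPathRouting {
    public static void main(String[] args) {
        AmazonElasticLoadBalancing elb = AmazonElasticLoadBalancingClientBuilder.defaultClient();

        // Send /orders/* requests to the target group that hosts the orders microservice.
        elb.createRule(new CreateRuleRequest()
                .withListenerArn("arn:aws:elasticloadbalancing:REGION:ACCOUNT_ID:listener/app/my-alb/123/456")
                .withPriority(10)
                .withConditions(new RuleCondition()
                        .withField("path-pattern")
                        .withValues("/orders/*"))
                .withActions(new Action()
                        .withType(ActionTypeEnum.Forward)
                        .withTargetGroupArn("arn:aws:elasticloadbalancing:REGION:ACCOUNT_ID:targetgroup/orders/789")));
    }
}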

API Gateway:

  • Remember that you need to enable CORS in order to make successful requests across different domains (for example, a web app calling your API from another origin).
  • You can use CloudTrail with API Gateway to capture REST API calls in your AWS account and deliver the log files to an Amazon S3 bucket.

IAM

  • IAM is a pretty solid topic in the exams, study this topic here.
  • Remember that it is advised to assign roles to EC2 Instances instead of storing credentials on them.
  • I got one question about cross-account access, where the development team wanted to access the production environment; check Delegate Access Across AWS Accounts Using IAM Roles for more information about this scenario (a sketch follows).
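
A minimal sketch of that cross-account scenario with the AWS SDK for Java (v1): a developer in the development account assumes a role defined in the production account and uses the temporary credentials it receives. The account ID, role name, and the S3 call are hypothetical.

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicSessionCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
import com.amazonaws.services.securitytoken.model.Credentials;

public class CrossAccountAccess {
    public static void main(String[] args) {
        AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();

        // Assume the role that the Production account exposes to the Dev account.
        Credentials creds = sts.assumeRole(new AssumeRoleRequest()
                .withRoleArn("arn:aws:iam::PROD_ACCOUNT_ID:role/DevCrossAccountRole")
                .withRoleSessionName("dev-session"))
                .getCredentials();

        // Use the temporary credentials to call services in the Production account.
        AmazonS3 prodS3 = AmazonS3ClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(new BasicSessionCredentials(
                        creds.getAccessKeyId(), creds.getSecretAccessKey(), creds.getSessionToken())))
                .build();

        prodS3.listBuckets().forEach(bucket -> System.out.println(bucket.getName()));
    }
}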

S3

  • Know the Bucket URLs format
    • http://<bucket>.s3-<aws-region>.amazonaws.com
  • Multipart upload
    • You can upload objects of up to 5 GB in a single PUT; for larger objects (up to 5 TB) you must use multipart upload to avoid the "EntityTooLarge" error (see the sketch after this list).
  • Glacier & Infrequent Access (IA)
  • Cross-Region Replication
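
For the multipart upload bullet, a minimal sketch with the AWS SDK for Java (v1) TransferManager, which switches to multipart upload automatically for large files and uploads the parts in parallel; the bucket name and file paths are hypothetical.

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

public class MultipartUploadExample {
    public static void main(String[] args) throws InterruptedException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        TransferManager tm = TransferManagerBuilder.standard().withS3Client(s3).build();

        // TransferManager decides when to use multipart based on the file size.
        Upload upload = tm.upload("my-bucket", "backups/huge-file.bin", new File("/tmp/huge-file.bin"));
        upload.waitForCompletion();   // blocks until every part is uploaded and assembled

        tm.shutdownNow(false);        // keep the underlying S3 client alive
        System.out.println("Upload finished");
    }
}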

Shared Responsibility Model

  • Know your security responsibilities (most of the time related to access and security patches of your EC2s) vs. those from AWS.
  • When a storage device has reached the end of its useful life, AWS procedures include a decommissioning process, check the Storage Device Decommissioning topic in the AWS security whitepaper.

Kinesis

OpsWorks

  • Remember that if you need to use Chef, you can use the AWS OpsWorks service.

Storage solutions: Connect your enterprise network to AWS

  • Direct Connect: it increases the speed and security of the connection between your network and AWS, but it may take a while to be fully provisioned.
  • Storage Gateway: know the Cached Volumes and Stored Volumes architectures.
  • AWS Import/Export: It is a service that accelerates data transfer into and out of AWS using physical storage appliances, bypassing the Internet.
    • You cannot export data from Glacier directly, you need to store it in S3 first.

Final advice

  • I ended the exam without being comfortable with 3 or 4 questions, so for those I had to eliminate options instead of being sure of the correct answer.
  • If you are like me, you will use all the available time and try to review the difficult questions several times. A note here: the questions and some answers are quite long, so you may not have time to recheck all of them in a second pass; use the flags to mark the ones you want to review.
  • Amazon exams do not have a fixed passing score; it varies depending on the scores obtained by other candidates. A colleague passed with 77% the day before I took the exam. In December, another colleague got 67% and did not pass. Aim for at least 70%, and you should be OK with 75%.
  • Read the Amazon documentation for the topics you do not understand. Even better, watch the re:Invent sessions on the AWS YouTube channel; presenters repeat the popular ones every year while adding the new features. I watched the ones related to VPCs and Auto Scaling a couple of times.
  • I read almost all the base documentation related to VPCs.

I hope this information helps you and others prepare for the exam, as it has been one of the most difficult ones I have taken. I invite you to review the topics discussed in this article to identify your weak points and reinforce them with the AWS FAQs, documentation, and re:Invent videos on YouTube; I found the last two to be the most effective way to understand the difficult topics, as the documentation and the presenters are quite good.

Resources

Where to study

Where to test your knowledge

Check other tips related to the AWS exams


Monday, February 19, 2018

MCD - API Design Associate tips



As MuleSoft partners, at IO Connect Services we care about constant education and certification for all our employees. In late January I took and passed the exam for the MuleSoft Certified Developer - API Design Associate certification. MuleSoft recommends the Anypoint Platform: API Design course, which costs $1,500 USD for 2 days. Here I'm sharing the findings that helped me pass this exam.

Preparation guide

It all starts with what is covered in the exam. As you may know by now, MuleSoft publishes a preparation guide for all of its certifications. For this particular exam, you can find the course guide here.

https://training.mulesoft.com/static/public_downloadables/datasheets/APApiDesign3.9_datasheet.pdf

This guide tells you which topics you have to know in order to pass the exam. In summary, you will need to know the following:
  • RESTful basics.
  • HTTP details in order to implement such APIs in a RESTful approach.
  • SOAP basics.
  • API-led connectivity lifecycle.
  • RAML 1.0.
  • Design APIs.
  • Define APIs using RAML1.0.
  • Document APIs.
  • Secure APIs.
  • Test APIs.
  • Publish APIs on Anypoint Exchange.
  • Version best practices.

RESTful and SOAP basics.

RESTful and SOAP have been around for some time now. Nevertheless, it’s good to go back to basics from time to time. I’ve found this website that gives clear statements about basics and best practices when designing a RESTful application.

http://www.restapitutorial.com/

Make sure you understand the effect of the HTTP specification on a RESTful endpoint. HTTP status codes, headers, requests, responses, and more are covered in the exam. Get comfortable with this topic, as it's very important for the design of an API.

Unlike REST, SOAP is a formal industry standard; one good reference is the W3Schools tutorial:

https://www.w3schools.com/xml/xml_soap.asp 

RAML 1.0

RESTful API Modeling Language, or RAML, is a standard to document APIs for RESTful applications. MuleSoft uses this standard in order to design and define APIs in Anypoint Platform and in the Mule runtime.

The first place you should look at is the specification itself.

https://github.com/raml-org/raml-spec/blob/master/versions/raml-10/raml-10.md

But if you’re a tutorial person, you can use the RAML 1.0 tutorial.

https://raml.org/developers/raml-100-tutorial

Make sure you can write an API in RAML with nothing but a notepad. You will get some RAML snippets and will have to answer questions based on them. This means you have to know whether the syntax is correct and what those snippets mean in the context of the question. Also, lots of questions will come up about syntax, design best practices, versioning best practices, and security. This is one of the most important topics in the exam, as it's the core of API design in Mule.

One more thing: I've found a lot of people who think that RESTful means JSON. This is not true at all. While JSON is widely used in RESTful APIs, remember that REST also supports XML and other payload formats via the Content-Type header. This is particularly relevant for RAML, as it can describe payloads based on the content type you specify in the document.

API-led connectivity and lifecycle

MuleSoft has a set of best practices for APIs. This is very well documented as API-led connectivity. You can get a quick view here.

https://www.mulesoft.com/lp/whitepaper/api/api-led-connectivity

Also, MuleSoft has very specific products and practices to manage the lifecycle of your APIs through the Anypoint Platform, such as API Designer, API Portal, and Exchange. Make sure you know all these products inside out. As part of lifecycle management, be sure you understand the role of each product in it. To start looking into these components, see this link:

https://www.mulesoft.com/ty/ebook/api-lifecycle-management

One resource I discovered recently is the API Notebook, a tool for writing API tutorials that you can share with your peers and that runs JavaScript snippets.

https://api-notebook.anypoint.mulesoft.com/

Be sure to know its API; you can find it here:

https://api-notebook.anypoint.mulesoft.com/help/api-guide

Summary

In my experience taking this exam, I noticed that the HTTP and RAML specifications are covered extensively. On the HTTP side, I got a bunch of questions about status codes, request formats, responses, and headers needed to define an API properly.

I strongly advise you to get familiar with the API lifecycle management products in the Anypoint Platform. Moreover, do your own study projects with these products: design an API from scratch using API Designer, publish it, and make it discoverable within your organization. This will help you understand MuleSoft's practices and products while you study the specs, and it will save you some time.

Let me know about your experience with this exam. Leave a comment and let's help others looking for guidance on this topic.

Resources


IO Connect Services - https://www.ioconnectservices.com/

IO Connect Services - MuleSoft partnership - https://www.mulesoft.com/partner/io-connect-services

API Design course overview - https://training.mulesoft.com/instructor-led-training/apdev-api-design

API Design course guide - https://training.mulesoft.com/static/public_downloadables/datasheets/APApiDesign3.9_datasheet.pdf

REST API Tutorial website - http://www.restapitutorial.com/

W3Schools SOAP tutorial - https://www.w3schools.com/xml/xml_soap.asp

RAML Specification - https://github.com/raml-org/raml-spec/blob/master/versions/raml-10/raml-10.md

RAML 1.0 tutorial - https://raml.org/developers/raml-100-tutorial

API-led connectivity - https://www.mulesoft.com/lp/whitepaper/api/api-led-connectivity

API Lifecycle management - https://www.mulesoft.com/ty/ebook/api-lifecycle-management

API Notebook - https://api-notebook.anypoint.mulesoft.com/

API Notebook guide - https://api-notebook.anypoint.mulesoft.com/help/api-guide

Tuesday, September 26, 2017

AWS Step Functions in Enterprise Integration

The article Achieve enterprise integration with AWS depicts the orchestration of Lambdas using Amazon Simple Workflow (SWF) with outstanding results. As stated there, SWF requires a standalone application running in order to process the flows, and this time we wanted to migrate the application to a 100% serverless solution. The article also mentions that a new service is available that looks very promising for the serverless scenario: Step Functions. Here we want to show you how we took the previous approach and transformed it into a Step Functions-led approach.

AWS Step Functions is a service that helps you create a flow based on several units of work, often implemented as AWS Lambda functions. The service is basically a state machine: given an input, an initial state computes what's required by the underlying implementation and generates an output. That output serves as the input for the next state, whose output might be used as the input for another state, and so on until the flow completes and the last state has executed. Each state, or node in the visual editor of the AWS Step Functions console, is implemented with a Lambda, and the flow of the state machine is orchestrated by the logic specified in the transition definitions.


AWS Step Functions provides the following functionality:

  • Create a message channel between your Lambdas.
  • Monitor the Lambdas by reporting the status of each one.
  • Automatically trigger each step.

The Scenario

At IO Connect Services we wanted to test this new AWS service with an enterprise integration use case based on the scenario described in the SWF implementation. We modified the file size according to the AWS Step Functions free tier for testing purposes:

  1. Reduced the CSV (comma-separated values) file stored in AWS S3 from 800K+ to 100K+ records, with 40 columns per record. We did this to be sure the number of state transitions would not exceed the 4,000 included in the free tier: 100K+ records yield approximately 400+ pages to be created, which in the "Parallel Branches" approach (explained below) consumes 1,200+ transitions, allowing at least 3 runs before passing the free tier limit. The original file, in contrast, produces 3,200+ pages and consumes approximately 9,200+ transitions, generating a cost of $0.14 USD for the first execution and $0.23 USD per execution afterwards.
  2. Create pages of the file according to the specified batch size. SQS has a limit of 256 KB per message, and using the UTF-8 charset with 250 records per page/message gives approximately 230 KB.
  3. Store the pages in individual files in a Storage Service like AWS S3.


For this approach, the main idea is to use AWS Step Functions as an orchestrator in order to exercise all the features it provides to support enterprise integration - logs, visual tools, easy tracking, etc. The actual units of work are implemented with AWS Lambda. Because of the AWS Lambda limits, the units of work are kept very small to avoid reaching those limits; hence a good flow written in AWS Step Functions requires a series of perfectly orchestrated steps.


What we can do with AWS Step Functions.

This was a completely new tool for us, so we did some due diligence to investigate what it can do, what it cannot do, and other useful information:

Can.

  • Use a simple JSON format text to create the State Machine.
  • Use states to call AWS Lambda Functions.
  • Run a defined number of branches that execute steps in parallel.
  • Use different languages for your Lambdas within the same state machine.
  • Send serializable objects like POJOs through the message channel.
  • Create, Run and Delete State Machines using the API.
  • Use the logs and other visual tools in order to see execution messages.
Can not.
  • Edit an already created State Machine. You'll have to remove it and then re-create a new state machine.
  • Launch the state machine directly from a trigger event, like the creation of a file in S3. Instead, you'll need to write a Lambda to trigger it.
  • Create a dynamic number of branches of states to be run in parallel. It's always a pre-defined set of parallel tasks.
  • Use visual tools (like drag and drop) to create the state machine. All the implementation must be done by writing JSON. The console only shows you a graph representing the state transitions; you cannot use it to create the state machine, it only visualizes it.
  • Send non-serializable objects in the message channel. This is a big point as you must be sure the objects you return in your Lambda are serializable.
  • Resume the execution if one of the steps fails. Either it runs completely or fails completely.
Consider this. 
  • The free tier allows you 4,000 step transitions free per month.
  • All the communication between steps is made using JSON objects.
  • A state machine name must be 1-80 characters long.
  • The maximum length of the JSON used as input or result for a state is 32768 characters.

Approach 1: Lambda Orchestration.

For the first approach, we wanted to test how Step Functions works. For this purpose, we set up only two steps in order to see what we could examine using the Step Functions logs and graph.

The StateMachine JSON
{
  "StartAt": "FileIngestion",
  "States": {
    "FileIngestion": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaFileIngestion",
      "Next": "Paginator"
    },
    "Paginator": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaPaginator",
      "End": true
    }
  }
}

The Graph

  1. As mentioned before, a state machine cannot be triggered directly by an S3 event or similar, so we used a Lambda to trigger the state machine whenever an S3 object ending with the ".csv" extension is created in a certain bucket. This Lambda starts the state machine and passes the S3 object details as the input (a sketch of this trigger Lambda follows the list).
  2. The FileIngestion step calls a Lambda that reads the information provided by the trigger event to locate and read the file created in S3, calculates the number of pages to create, and returns this number and the file location as output.
  3. The Paginator step calls a Lambda that reads the lines of a single page, stores them in a variable, and then calls another Lambda asynchronously to write a file with the page content. This process repeats until the original file is completely read.
In this approach, the Lambdas have more control over the flow than the state machine, because one Lambda calls another and orchestrates the asynchronous executions. Also, if the Lambda that writes the pages fails, you cannot see it in the graph; you need to check the Lambda executions and manually identify which Lambda failed and why.
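
To illustrate step 1, here is a minimal sketch of how such a trigger Lambda could look in Java with the AWS SDK (v1). The STATE_MACHINE_ARN environment variable and the shape of the input JSON are assumptions, not the exact code we used.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.stepfunctions.AWSStepFunctions;
import com.amazonaws.services.stepfunctions.AWSStepFunctionsClientBuilder;
import com.amazonaws.services.stepfunctions.model.StartExecutionRequest;

public class CsvUploadTrigger implements RequestHandler<S3Event, String> {

    private final AWSStepFunctions stepFunctions = AWSStepFunctionsClientBuilder.defaultClient();

    @Override
    public String handleRequest(S3Event event, Context context) {
        String bucket = event.getRecords().get(0).getS3().getBucket().getName();
        String key = event.getRecords().get(0).getS3().getObject().getKey();

        // Only react to .csv objects; anything else is ignored.
        if (!key.endsWith(".csv")) {
            return "Skipped " + key;
        }

        // Pass the S3 object details to the state machine as its input document.
        String input = String.format("{\"bucket\":\"%s\",\"key\":\"%s\"}", bucket, key);
        return stepFunctions.startExecution(new StartExecutionRequest()
                .withStateMachineArn(System.getenv("STATE_MACHINE_ARN"))
                .withInput(input))
                .getExecutionArn();
    }
}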

The Metrics. 
  • The total execution time is 4 minutes on average to process 100K+ records.

Approach 2: Linear Processing.

Taking into account the previous implementation, we wanted to create a state machine that has more authority over the control of the flow execution. As a first step, we decided to implement a linear execution with no parallelization.

The StateMachine JSON
{
  "StartAt": "FileAnalizer",
  "States": {
    "FileAnalizer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaFileIngestion",
      "Next": "FileChecker"
    },    
    "FileChecker":{
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.writeComplete",
          "BooleanEquals": false,
          "Next": "PageCreator"
        }
      ],
      "Default": "QueueChecker"
    },
    "ReadSQStoS3":{
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaPageWriter",
      "Next": "QueueChecker"
    },    
    "QueueChecker": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.emptyQueue",
          "BooleanEquals": true,
          "Next": "SuccessState"
        }
      ],
      "Default": "ReadSQStoS3"
    },    
    "PageCreator": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:functionLambdaPaginator",
      "Next": "FileChecker"
    },    
    "SuccessState": {
      "Type": "Pass",
      "End": true
    }
  }
}

The Graph


  1. The FileAnalizer step calls a Lambda function that consumes the .csv file and creates a POJO with the start and end byte of each page to be created; these bytes are calculated from the page size parameter specified in the Lambda. You can see it this way: FileAnalizer creates an index of the start and end bytes of each page.
  2. FileChecker is a Choice step that verifies a boolean variable indicating whether all the pages have been completed. This information is stored in an SQS queue.
  3. PageCreator calls a Lambda that reads the start and end bytes of each page from the received POJO, reads only that portion of the S3 file, and creates an SQS message with the page content.
  4. QueueChecker is similar to FileChecker, but in this case it waits until no messages are left in the SQS queue.
  5. ReadSQStoS3 is a Task step that calls a Lambda function; it reads the messages in the SQS queue that represent pages of the .csv file and stores them in an S3 folder.
  6. SuccessState ends the state machine execution.
For this approach, the message channel always carries the POJO with the start and end bytes of each page.

The Metrics.
  • The total execution time is 15 minutes on average to process 100K+ records.

Approach 3: Batch Writing.

We took the same state machine as in the linear processing, but the Lambda resource in the ReadSQStoS3 step was modified with the intention of reducing the execution time of the previous approach. We added long-polling behavior in the Lambda with a maximum of 10 messages: the Lambda waits for up to 10 messages in SQS (if 20 seconds pass and 10 messages are not visible, it takes whatever is available at that moment), gets them, and calls another Lambda asynchronously to write those 10 messages.
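
A rough sketch of that long-polling read in Java with the AWS SDK (v1). The worker function name is hypothetical, and for brevity this sketch hands each message to the worker individually, whereas our Lambda passed the batch of up to 10 messages in a single asynchronous invocation.

import com.amazonaws.services.lambda.AWSLambdaAsync;
import com.amazonaws.services.lambda.AWSLambdaAsyncClientBuilder;
import com.amazonaws.services.lambda.model.InvokeRequest;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import java.util.List;

public class BatchPageWriter {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final AWSLambdaAsync lambda = AWSLambdaAsyncClientBuilder.defaultClient();

    public int drainOnce(String queueUrl) {
        // Long polling: wait up to 20 seconds for up to 10 messages.
        List<Message> pages = sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)
                .withMaxNumberOfMessages(10)
                .withWaitTimeSeconds(20))
                .getMessages();

        for (Message page : pages) {
            // Delegate the actual S3 write to another Lambda, invoked asynchronously.
            lambda.invokeAsync(new InvokeRequest()
                    .withFunctionName("LambdaPageWriterWorker")   // hypothetical worker function
                    .withInvocationType("Event")                  // fire-and-forget
                    .withPayload(page.getBody()));

            // A production version would delete only after confirming the write succeeded.
            sqs.deleteMessage(queueUrl, page.getReceiptHandle());
        }
        return pages.size();
    }
}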

The Metrics. 
  • The total execution time is 10 minutes on average to process 100K+ records.

Approach 4: Parallel Branches.

For this implementation, we added a set of 5 parallel branches in order to read the start and end bytes of each page and send a message to SQS with the page content in parallel.

Here we faced two problems:
  1. We confirmed the Step Functions limitation that you cannot create a dynamic number of parallel branches. This means you have to define a fixed set of parallel jobs from the beginning.
  2. At the end of the parallel execution, meaning the 5 tasks depicted below, Step Functions aggregates the results of all the tasks and passes a single message containing all of them. This becomes a problem with big JSON structures, which can turn into an even bigger JSON at the end of the parallel execution. If this JSON exceeds 32,768 characters, an error is thrown and the execution fails.
The StateMachine JSON
{
  "StartAt": "FileAnalizer",
  "States": {
    "FileAnalizer": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaFileIngestion",
      "Next": "FileChecker"
    },
    "FileChecker":{
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.writeComplete",
          "BooleanEquals": false,
          "Next": "ParallelWritePage"
        }
      ],
      "Default": "QueueChecker"
    },
    "ParallelWritePage":{
      "Type": "Parallel",
      "Next": "DeleteRead",
      "Branches": [
        {
          "StartAt": "SetBatchIndex0",
          "States": {
            "SetBatchIndex0": {
              "Type": "Pass",
              "Result": 0,
              "Next": "PageCreator0"
           },
            "PageCreator0": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        },
        {
          "StartAt": "SetBatchIndex1",
          "States": {
            "SetBatchIndex1": {
              "Type": "Pass",
              "Result": 1,
              "Next": "PageCreator1"
           },
            "PageCreator1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        },
        {
          "StartAt": "SetBatchIndex2",
          "States": {
            "SetBatchIndex2": {
              "Type": "Pass",
              "Result": 2,
              "Next": "PageCreator2"
           },
            "PageCreator2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        },
        {
          "StartAt": "SetBatchIndex3",
          "States": {
            "SetBatchIndex3": {
              "Type": "Pass",
              "Result": 3,
              "Next": "PageCreator3"
           },
            "PageCreator3": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        },
        {
          "StartAt": "SetBatchIndex4",
          "States": {
            "SetBatchIndex4": {
              "Type": "Pass",
              "Result": 4,
              "Next": "PageCreator4"
           },
            "PageCreator4": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPaginator",
              "End": true
            }
          }
        }
      ]
    },
    "DeleteRead":{
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaGetLastIndex",
      "Next": "FileChecker"
    },
    "ReadSQStoS3":{
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:LambdaPageWriter",
      "Next": "QueueChecker"
    },    
    "QueueChecker": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.emptyQueue",
          "BooleanEquals": true,
          "Next": "SuccessState"
        }
      ],
      "Default": "ReadSQStoS3"
    },    
    "SuccessState": {
      "Type": "Pass",
      "End": true
    }
  }
}


The Graph


  1. FileAnalizer is a Task step in which a Lambda is called to read the .csv file stored in AWS S3, create the index of the start and end bytes of each page, and store these indexes in an external service such as a database, file, or cache service.
  2. FileChecker is a Choice step. It reads a boolean variable that indicates whether all the pages have already been stored as SQS messages.
  3. SetBatchIndexX is a Pass step. It sets a variable used by "PageCreator" to know which index is the next to be read.
  4. PageCreatorX, depending on the integer value passed by "SetBatchIndex", extracts the page's start and end bytes, uses them to read only that portion of the CSV file, and sends an SQS message with the page content. It returns the information it read so that the next step knows which pages are left.
  5. DeleteRead receives the output payload of the parallel process, determines which pages were already written to SQS, and deletes the information related to those pages from the external service (step 1).
  6. QueueChecker, ReadSQStoS3, and SuccessState work the same as in the "Batch Writing" approach.
The Metrics. 
  • The total execution time is 6 minutes on average to process 100K+ records.

Conclusion.

AWS Step Functions is a tool that allows you to create and manage orchestration flows based on small units of work. The simplicity of the language makes it perfect for quick implementations, as long as you have already identified the units of work.

Unfortunately, as this is a fairly new service in the AWS ecosystem, its functionality is still limited. Proof of this is the fact that you need to maintain a fixed number of parallel steps, and if you end up having less work than parallel steps you must add control logic to avoid unexpected errors.

Moreover, given the limits found in AWS Lambda and Step Functions, processing high workloads of information can be very difficult if you don't give good thought to how your design decomposes the processing. We highly recommend you read our Microflows blog post to understand what this means.

On the plus side, if you want to transport small portions of data or compute small processes in a serverless fashion, Step Functions is a good tool for it.
In the future, we will evaluate combining other new AWS services like AWS Glue and AWS Batch with Step Functions to achieve outstanding big data processing and enterprise integration.

Thanks for taking the time to read this post. I hope it is helpful when you decide to use Step Functions, and do not hesitate to drop a comment if you have any questions.

Tuesday, August 22, 2017

Achieve Enterprise Integration with AWS Lambdas, SWF and SQS

In recent days, we were asked to build an ETL flow using Amazon Web Services. Because we excel at enterprise integration, we had a particular design in mind to make it happen. The job was pretty simple:
  1. The trigger was a file placed in a particular S3 bucket.
  2. Take the S3 object metadata of the file as the input of the job.
  3. Read the file and package the records into pages, where each page is sent asynchronously as a message. This technique increases parallelism in the job processing, since the files contain one million records on average.
  4. Consume all pages asynchronously and upload them as micro-batches of records into a third-party system via a Restful API.
  5. Other tasks to complete the use case like recording the completion of the job in a database.
On top of these basic requirements, we had to make sure the system was robust, resilient, and as fast as possible while keeping the costs of the different services low.

We chose to use different services from Amazon Web Services for this: S3, Simple Workflow (SWF), Simple Queue Service (SQS) and Lambda.

Here is a diagram of the solution (click on the image to see it bigger).

Solution diagram


Why Simple Workflow (SWF)?

As you can see in the diagram, every task is executed by a Lambda function, so why involve Simple Workflow? The answer is simple: we wanted an environment where the sequence of task executions is orchestrated by a single entity, and where the execution context can be shared with the different tasks.

If you think about it, we wanted something similar to a Flow in a Mule app (MuleSoft Anypoint Platform).

It is important to highlight that AWS has specific limits for Lambda execution; for example, a Lambda function can only run for a maximum of 5 minutes. Due to these limits, we had to break the tasks into small but cohesive units of work while having a master orchestrator that could run longer than that. Here's where the shared context comes in useful.

Note: There's another service that plays very well in the serverless paradigm as opposed to SWF, Step Functions, but at the time we were working on this task it was still in beta, hence not suitable for production. There is a follow-up post about full serverless integration that includes Step Functions.

Challenges and recommendations


While working with SWF and Lambdas, we learned some things that helped us a lot to complete this assignment. Here I'll show you the situations and the solutions that worked for us.

Invoke Lambdas from activities, not workflow workers


One thing you should know about working with SWF is that every output of an activity returns as a Promise to the workflow worker - very similar to a Promise in JavaScript. This Promise returns the output as a serialized object that you need to deserialize if you want to use it as input for a Lambda function invoked directly from the workflow worker. This overhead can be very cumbersome if you do it frequently. In your Lambdas you're supposed to work with objects directly, not serialized forms.

Here is my first piece of advice: even though you can invoke a Lambda function from within a workflow worker, don't do it; use an Activity worker instead. This way each workflow worker implements a unit of work that calls an Activity worker, which in turn calls a Lambda function internally. Why? Because in the Activity worker you can build a proper object to pass to the Lambda as an input parameter. This technique requires you to deal with some extra plumbing in your SWF code, since you'll need one Activity per Lambda, but in the end it gives you a very flexible and robust mechanism to exchange information between SWF and Lambdas.

See this sequence diagram to understand it.

Workflow, activity and lambda sequence diagram.


Wrap your payload in a Message object


All in all, we are talking about enterprise integration, and one of its central pieces is the message. In order to share information uniformly between the workflow and the different Lambdas, it's better to standardize this practice by using a custom Message object. This Message must contain the workflow context you want to share and the payload. When the Lambda functions are called, they receive this Message object and extract from it the information required to perform the task fully, with no external dependency.
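
A minimal sketch of what such a Message wrapper could look like in Java; the context keys and the payload type are up to your use case.

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/**
 * Standard envelope shared between the workflow and the Lambdas:
 * the workflow context travels together with the payload, so each Lambda
 * receives everything it needs in a single, serializable object.
 */
public class Message<T extends Serializable> implements Serializable {

    private static final long serialVersionUID = 1L;

    // Shared workflow context: job id, source bucket, page number, etc.
    private final Map<String, String> context = new HashMap<>();

    // The data this step works on (for example, a page of CSV records).
    private T payload;

    public Map<String, String> getContext() { return context; }

    public T getPayload() { return payload; }

    public void setPayload(T payload) { this.payload = payload; }
}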

Decompose large loads of data into small pieces


As mentioned before, Lambdas are supposed to run small tasks quickly and independently; therefore they have limits you should be aware of, such as execution time, memory allocation, ephemeral disk capacity, and the number of threads, among others. These are serious constraints when working with big amounts of data and long-running processes.

In order to overcome these problems, we recommend decomposing the entire file content into small pieces to increase task parallelism and improve performance in a safe manner - actually, this was one of the main reasons to use Lambdas, since they auto-scale nicely as the parallel processing increases. For this, we divided the file content into packages of records, or pages, where each page can contain hundreds or thousands of records. Each page was placed as a message in an SQS queue. The page size must respect the 256 KB per-message limit in SQS.

Keep long running processes in Activities, not Lambdas


As you can see in the diagram above, there's a poller constantly looking for new messages in the SQS queue. This can be a long-running process if you expect tens of thousands of pages. For cases like this, having activities in your flow is very convenient, as an activity can run for up to one year; this contrasts sharply with the 5-minute execution limit of a Lambda function.

Beware of concurrency limits


Consider the scenario where you have an Activity whose purpose is to read the queue and delegate the upload of the micro-batches to an external system. Commonly, to speed up the execution you make use of threads - note I'm talking about Java but other languages have similar concepts. In this Activity, you may use a loop to create a thread per micro-batch to upload.

Lambda has a limit of 1,024 concurrent threads, so if you plan to create a lot of threads to speed up your execution, like uploading micro-batches to the external system mentioned above, first and most importantly use a thread pool to control the number of threads. We recommend not creating instances of Thread or Runnable directly; instead, submit a task for each asynchronous unit of work you want to execute. Also make sure you use the AWSLambdaAsyncClientBuilder interface to invoke Lambdas - the AWS kind - asynchronously (see the sketch below).
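
A minimal sketch of that combination in Java with the AWS SDK (v1): a bounded thread pool submits one task per micro-batch, and each task hands the work to an AWS Lambda function through the asynchronous client. The pool size and the uploader function name are assumptions.

import com.amazonaws.services.lambda.AWSLambdaAsync;
import com.amazonaws.services.lambda.AWSLambdaAsyncClientBuilder;
import com.amazonaws.services.lambda.model.InvokeRequest;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MicroBatchUploader {

    // Bounded pool: keeps concurrency well below the Lambda thread limit.
    private final ExecutorService pool = Executors.newFixedThreadPool(20);

    private final AWSLambdaAsync lambda = AWSLambdaAsyncClientBuilder.defaultClient();

    public void upload(List<String> microBatches) {
        for (String batch : microBatches) {
            // Submit a task per micro-batch instead of spawning raw Threads.
            pool.submit(() -> lambda.invokeAsync(new InvokeRequest()
                    .withFunctionName("UploadMicroBatch")   // hypothetical uploader Lambda
                    .withInvocationType("Event")            // asynchronous, fire-and-forget
                    .withPayload(batch)));
        }
        pool.shutdown(); // stop accepting new tasks; queued invocations still run
    }
}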


Conclusion


This approach was particularly successful for a situation where we were not allowed to use an integration platform like Mule. It is also a very nice solution if you just need to integrate AWS services and move lots of data among them.

AWS Simple Workflow and Lambda work pretty well together, although they have different goals. Keep in mind that an SWF application needs to be deployed on a machine as a standalone program, either in your own data center, on an EC2 instance, or on another IaaS.

This combo will help you orchestrate and share different contexts, either automated through Activities or manually by using signals. But if what you need is isolated execution and chaining is not relevant to you, then you could use Lambdas only; note that chained execution will not truly isolate them from each other, and the starting Lambda may time out before the Lambda functions triggered later in the chain finish their execution.

Moreover, every time you work with resources that have limitations like AWS Lambda's, always bear in mind the restrictions they come with and design your solution around these constraints - hopefully, as Microflows. Have a read of the Microflows post by Javier Navarro-Machuca, Chief Architect at IO Connect Services.

To increase parallelism, we highly recommend using information exchange systems such as queues, transient databases, or files. In AWS you can make use of S3, SQS, RDS, or DynamoDB (although our preference for this task is SQS).

Stay tuned, as we're working on a solution that uses Step Functions with Lambdas rather than Simple Workflow for a fully serverless integration.

Happy reading!

Resources


Enterprise Integration Patterns - http://www.enterpriseintegrationpatterns.com/
Amazon Simple Workflow - https://aws.amazon.com/swf/
Amazon Lambda - https://aws.amazon.com/lambda/ 
Amazon Simple Queue Service - https://aws.amazon.com/sqs/


Thursday, August 3, 2017

Mule Batch - Adding Resiliency to the Manual Pagination Approach

In the previous post, Benchmarking Mule Batch Approaches, written by my friend and colleague Victor Sosa, we demonstrated different approaches for processing big files (1-10 GB) in a batch fashion. The manual pagination strategy proved to be the fastest algorithm, but with one important drawback: it is not resilient. This means that after restarting the server or the application, all the processing progress is lost. Some commenters on the post highlighted that this gap needed to be addressed in order to evaluate the approach fairly against the Mule Batch components, since Mule Batch provides resiliency by default.

In this post, I show how to enhance the manual pagination approach by making it resilient. To test this approach I used a Linux virtual machine with the following hardware configuration:
  • Intel Core i5 7200U @ 2.5 GHz (2 cores)
  • 8 GB RAM
  • 100 GB SSD
Using the following software:
  • MySQL Community Server 5.7.19
  • AnyPoint Studio Enterprise Edition 6.2.5
  • Mule Runtime Enterprise Edition 3.8.4 
To process a comma-separated values (.csv) file that contains 821,000+ records with 40 columns each, the steps are as follows:
  1. Read the file.
  2. Build the pages.
  3. Store in a Database.

You can find the code in our GitHub repository https://github.com/ioconnectservices/upsertpoc.

The Approach.

We based our work on the manual pagination approach from the aforementioned article and created a Mule app that processes .csv files. This time we added a VM connector to decouple the file read and page creation from the bulk page-based database upsert. We configured the VM connector to use a persistent queue so that messages are stored on the hard disk.


Description of the processing flow:

1. The .csv file is read and the number of pages is calculated according to the number of records configured per batch; in this case, 800 records are set per page.

2. The file is put in the payload as a stream so it can be read forward to create a page in each For Each iteration.

3. Each page is sent through a persistent VM connector so the pages are stored in the DB in a different flow. Making the VM connector persistent means that pages are written to files on disk; hence the inbound VM endpoint can resume consuming the messages after an application reboot, so the records in those messages still get upserted into the database.

Metrics.

I took the metrics from the previous Mule Batch article as a baseline to compare the efficiency of this new approach. I recreated very similar flows to test in my environment and obtained the following results:

Out-of-the-box Mule batch jobs and batch commit components.
  • The total execution time is 7 minutes on average.
  • The total memory usage is 1.34 GB on average.
Custom pagination.
  • The total execution time is 6 minutes on average.
  • The total memory usage is 1.2 GB on average.
Custom Pagination with VM connector (This Approach).

At first, I obtained good results with this approach, but they were 30 seconds slower than the "Custom Pagination" one:
  • The total execution time (without stopping the server) is 6 minutes and 30 seconds on average.
  • The total memory usage is 1.2 GB on average.
After increasing the number of threads from 25 to 30 in the Async connector configuration, these are the results:
  • The total execution time (without stopping the server) is 6 minutes on average.
  • The total memory usage is 1.2 GB on average.

Conclusions.

When designing an enterprise system, many factors come into play, and we have to make sure it will keep working even after disastrous events. Adding resiliency to a solution is a must-have in every system. For us, the VM connector brings this resiliency while keeping the execution costs within the desired parameters. Be aware, though, that some performance tuning may be needed to obtain the best resiliency without compromising performance.

Friday, July 7, 2017

MCD - Integration and API Associate certification tips

In this post, I want to share my experience of how I successfully passed the MCD - Integration and API Associate certification exam. This is one of the entry-level certifications for the Mule platform.

Working for IO Connect Services as an Integration Engineer, I focus on enterprise integration in my daily tasks, and for this purpose we use the Mule integration platform by MuleSoft.
To prepare for the MCD certification, you can find plenty of documentation and tips on many sites and developer forums, such as StackOverflow and MuleSoft.U, where courses and tutorials are free.

Introduction to the exam.

This is not a complicated exam, but you must have a good software development background in order to understand the topics.
You can use Anypoint Studio Enterprise Edition free for 30 days; this is a very useful tool and it is where you are going to do your practice. Anypoint Studio is an Eclipse-based IDE with visual tools that are very intuitive. After all, Mule uses Java, so if you are familiar with this language and you have some experience using the Spring framework, you are more than ready to start.

I highly recommend that in all of the practice exercises you use the debugger tool to see how the variables change between the application components. This will help you identify why and how these changes happen, and that type of knowledge is fundamental to passing the exam.

The exam contains the following topics:
  1. Introducing API-Led Connectivity
  2. Designing APIs
  3. Building APIs
  4. Deploying and Managing APIs
  5. Accessing and Modifying Mule Messages
  6. Structuring Mule Applications
  7. Consuming Web Services
  8. Handling Errors
  9. Controlling Message Flow
  10. Writing DataWeave Transformations
  11. Connecting to Additional Resources
  12. Processing Records

Exam preparation.

MuleSoft offers two courses for training: the instructor-led Anypoint Platform Development Fundamentals course - onsite and online delivery - and the self-paced MuleSoft.U Development Fundamentals course. Both options cover all the topics of the exam; the difference is that in the first one you attend regular classes with an instructor for five days, eight hours a day, while in the second you are given the training material and you study and practice on your own. The official MuleSoft.U site says the self-paced training may take you up to eight weeks if you study the material 3 hours per week, but if you are very dedicated you can prepare in a shorter time. Both training options are great; you can decide which one to take based on your available time, the way you feel more comfortable, and, last but not least... the price. At the time of writing, the instructor-led training has a price tag of around $2,500 USD.

These are the links to get more details of the two training options:
  • Instruction-led: https://training.mulesoft.com/instructor-led-training/apdev-fundamentals
  • Self-learning: https://training.mulesoft.com/instructor-led-training/mulesoftu-fundamentals
In my case, with the support of my employer and manager, I decided to take the self-learning option, since I could study around six to eight hours every day during weekdays. I felt confident enough to take the exam in less than 2 weeks, and you know the result… I passed! In the end, it all depends on your available time and dedication.

Extra documentation.

Unfortunately, for this certification there aren't practice exams out there as in other certification programs, but the course material provided by MuleSoft is very complete and easy to follow. If you feel that you need more information, you can use these other training resources:

MuleSoft User Guide.

The official MuleSoft site provides a complete guide to its products; there you can find more detailed information about the tech specifications and code examples of all Mule modules and components.

https://docs.mulesoft.com/mule-user-guide/v/3.8/

Ask the Experts.

If you know someone who obtained the certification before, ask him or her; if not, you can always go to the MuleSoft forums looking for answers. Try searching before you post - it is very likely that someone has asked the same question before.

https://forums.mulesoft.com/index.html

Tips for the exam.

Here are some tips I recommend you take into consideration before taking the exam:

You Have Opportunities.

If you don't pass the exam, don't worry; you have another two chances to take it in the same modality. If this happens to you, check the results page to see which modules you need to improve, go back to the material, study, take notes of those parts, and bring them with you for the next attempt.

Bring Notes.

This is an online, unproctored, open-book exam. That means you can bring books and notes, and even search the internet. I highly recommend, though, that you bring notes on the modules you feel you do not completely understand; you don't want to waste time searching for answers you already know just to "be sure". If you have time at the end, go back to the questions you are not sure about and compare them with your notes.

Use the Debugger.

I mentioned it before, but I want to say it again: it is very important that you take time to use the Anypoint Studio debugger in most of the practice exercises. This way you get to see how the variables change their values and how the flow components interact with each other.

A big thing to highlight here: the debugger is only available in Anypoint Studio Enterprise Edition - it is not available in the Community Edition. Thankfully, you have 30 days to use the EE version for free, so take advantage of it. If for some reason you can't launch the application in debug mode, check your target Mule server edition.

Read all in the exam.

Be careful when answering the questions; read the full description of the question and all the possible answers, as some of them can be tricky.

Learn about Java.

You use visual components and XML documents to develop Mule applications, but everything is based on Java. Also, in the final training modules you learn how to create Java-based components for your applications. It is important that you know this programming language in advance to understand how Mule applications run. You have to use a JVM-based language like Java if you want to build custom Mule extensions.

Learn about Enterprise Integration Patterns.

Many Mule components are concrete implementations of the Enterprise Integration Patterns. This is not a must for the MCD exam, but they are very handy if you want to build robust, reliable, extensible, and fault-tolerant Mule applications. I recommend you take a look at http://www.enterpriseintegrationpatterns.com

Conclusion.

I hope this post is helpful in your preparation for the MCD - Integration and API Associate certification exam. No matter which option you choose, instructor-led or self-paced study, be consistent, study, and prepare your notes. This is not a complicated exam. Don't worry if you fail the first time; you have other opportunities (the certificate doesn't show how many attempts it took you to pass).

Thank you for taking the time to read this article; I really hope my shared experience is helpful to you. If you have a comment to complement this post, please share it with us! We would love to hear from you.

Happy studying!

Tuesday, June 27, 2017

AWS Certified Developer Associate Tips

I may not include a lot of information, but I wanted to share my experience of how I passed the AWS Certified Developer Associate certification with a score of 94/100.

I initially posted this on my personal blog, but I thought it is a good read for a more serious blog site such as the IO Connect Services blog, so I decided to re-post it. You can find the original here:
https://victorsosasw.blogspot.mx/2017/06/aws-certified-developer-associate-tips.html

A week ago, I took and passed the AWS Certified Developer - Associate exam. I found it difficult even though it is an associate-level exam; still, with due dedication, one can pass it.

I'd like to share my experience with you.

I got into AWS as part of my work at IO Connect Services with one of our customers. It's exciting, as it's my first time doing serverless and cloud computing with AWS Lambda and other AWS technologies. Because of this, and other plans on my list, I decided to prepare for the AWS CDA exam. One thing I wasn't aware of is that there's a lot you have to learn for this. In my opinion, this certification is harder than an associate level, and I'll tell you why.

First of all, the topics you have to learn, and mostly memorize, are:
  • AWS Cloud computing fundamentals.
  • Identity and Access Management (IAM).
  • Elastic Compute Cloud (EC2).
  • Elastic Block Store (EBS).
  • Simple Storage Service (S3).
  • Virtual Private Cloud (VPC).
  • Elastic Load Balancer (ELB).
  • DynamoDB.
  • Simple Workflow Service (SWF).
  • CloudFormation.
  • Simple Queue Service (SQS).
  • Simple Notification Service (SNS).
  • Elastic Beanstalk.

All these services are spread across the following four categories in the exam:
  • AWS Fundamentals.
  • Designing and Developing.
  • Deployment and Security.
  • Debugging.

The exam consists of multiple-choice questions, and several times you have to select all that apply, which increases the difficulty of picking the right combination of answers.

In my experience, from these topics you have to go into detail on IAM, VPC, EC2, DynamoDB, SQS, and S3. The exam is packed with in-depth questions about these services, so you'd better get familiar with them and make sure you can build and deploy an application without many supporting references.

Also, I haven't mentioned the SDKs yet. They are not covered in depth; as long as you can identify the supported SDKs you'll be mostly fine. This means you will not be questioned about a particular API or routine from the SDKs. Bear in mind that you will be questioned about how to interact with the REST APIs of the services, though, mostly about common aspects like authentication, token management, and HTTP response codes, among others.
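As an illustration of the level of detail worth knowing, here is a small sketch using the AWS SDK for Java (v1); the bucket and key names are hypothetical, and the point is simply that service errors surface the underlying HTTP status and error codes the exam asks about:

import com.amazonaws.AmazonServiceException;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3StatusCodeDemo {
    public static void main(String[] args) {
        // Credentials and region come from the default provider chain (env vars, profile, role).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        try {
            s3.getObjectMetadata("my-hypothetical-bucket", "some/key.txt");
        } catch (AmazonServiceException e) {
            // Typical values to recognize: 403 (access denied), 404 (no such key or bucket).
            System.out.println("HTTP status: " + e.getStatusCode());
            System.out.println("Error code:  " + e.getErrorCode());
        }
    }
}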

Like I said, it's difficult but not impossible. I used the resources listed below to prepare myself.

Udemy

I took the AWS Certified Developer Associate courses from Udemy. They are video tutorials, and the instructor really does a good job at explaining each of the topics and scenarios that are covered in the exam. The practice exams also give you a lot of chances to improve, as you can see explanations of the answers, and you can retake them to reinforce your knowledge.

https://www.udemy.com/aws-certified-developer-associate/
https://www.udemy.com/aws-certified-developer-associate-2017-practice-tests/

Even though I also looked at different resources, I think these links will give you a good understanding of what this exam entails.

Safari Books.

Also, as I have a subscription to Safari Books through IO Connect Services, I used these video tutorials too. You can find the same videos in Packt Publishing.

https://www.safaribooksonline.com/library/view/aws-certified-developer/9781788298384/
https://www.safaribooksonline.com/library/view/aws-certified-developer/9781788294942/
https://www.safaribooksonline.com/library/view/aws-certified-developer/9781788297721/
https://www.safaribooksonline.com/library/view/aws-certified-developer/9781788290722/

Test King

Here you will find questions that are a lot like the ones in the exam. It is a really useful resource. I encourage you to go through all the questions and read the comments, as some of the answers are wrong, but the people in the comments give you really good hints about which answers are the right ones.

http://www.aiotestking.com/amazon/category/exam-aws-cda-aws-certified-developer-associate/

Mobile app

This is a mobile app with a variety of exam-like questions, available on both Android and iOS. I bought it for under $10 USD.

https://play.google.com/store/apps/details?id=com.ionicframework.awsquiz543924
https://itunes.apple.com/us/app/aws-certified-developer/id1065366598?mt=8

AWS FAQ

Last but not least, make sure you read the FAQs, particularly for S3, DynamoDB, EC2, and VPC. A lot of the questions are about things that do not come up often in normal development scenarios, like limits, region support, and corner cases.

Is it an online exam?

Yes, but it's proctored. Keep in mind that this exam can only be administered by authorized proctors, and you or your employer will probably have to allocate a budget for your travel, as in my case. When you schedule your exam, you will see the authorized centers close to you.

I would love to read your comments or questions.

Happy reading!

Thursday, April 27, 2017

Benchmarking Mule Batch Approaches

Note: This blog post demonstrates that, when a Mule application is fine-tuned, processing really big volumes of data can be achieved in a matter of minutes. A follow-up blog post, written by Irving Casillas, shows you exactly how to do this while adding resiliency. See http://blog.ioconnectservices.com/2017/08/mule-batch-adding-resiliency-to-manual.html.

Bulk processing is one of the most overlooked use cases in enterprise systems, even though it is very useful when handling big loads of data. In this post, I will showcase different scenarios of bulk upserts to a database using Mule batch components and evaluate some aspects of their performance. The objective is to establish a best practice for configuring Mule batch components and jobs that process big loads of data. Our goal is to optimize execution times without compromising computer resources like memory and network connections.


The computer used for this proof of concept has a fairly commodity configuration:
  • Lenovo ThinkPad T460
  • Intel Core i7 2.8 GHz
  • 8 GB RAM
  • 100 GB SSD


The software installed is:
  • MySQL Server v8.0.
  • Mule v3.8.2.
  • Anypoint Studio v6.2.2.
  • JProfiler v9.2.1.


The evaluation consists of processing a comma-separated values (CSV) file which contains 821000+ records with 40 columns each. First, we consume the file content, then we transform the data, and finally we store it in a database table. The file is 741 MB when uncompressed. To ensure that each record has the latest information, the database queries must implement the upsert statement pattern.


Three approaches are shown here:
  1. Out-of-the-box Mule batch jobs and batch commit components.
  2. Using a custom pagination algorithm.
  3. A hybrid solution.


The first two approaches use the bulk mode of the Insert operation in the Database connector. The third approach uses the Bulk Execute operation in the Database connector.


One important distinction here: the term “upsert execution” refers to the action of inserting a new record, but, if a record with that key already exists, updating it with the new values, if any. In MySQL this is done with “INSERT ... ON DUPLICATE KEY UPDATE”, and you can find the documentation here: https://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html. The term “Insert operation” refers to the Insert configuration element in Mule; you can find its documentation here: https://docs.mulesoft.com/mule-user-guide/v/3.8/database-connector-reference#insert. In other words, we are executing MySQL upsert statements with the Insert operation of the Database connector in Mule.
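To make the distinction concrete, here is a minimal JDBC sketch of the same idea: a parameterized MySQL upsert executed as a batch. The contacts table and its columns are hypothetical and heavily simplified (the real file has 40 columns); in the Mule flows below, the Insert operation in bulk mode does the equivalent work for you:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class UpsertPageDemo {

    // Hypothetical upsert against a simplified 'contacts' table.
    private static final String UPSERT =
        "INSERT INTO contacts (id, first_name, email) VALUES (?, ?, ?) "
      + "ON DUPLICATE KEY UPDATE first_name = VALUES(first_name), email = VALUES(email)";

    // Sends one page (block) of parsed CSV rows to MySQL as a single JDBC batch.
    static void upsertPage(String jdbcUrl, String user, String password, List<String[]> page)
            throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement ps = conn.prepareStatement(UPSERT)) {
            for (String[] row : page) {
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.addBatch();
            }
            ps.executeBatch();   // the whole page goes out as one bulk execution
        }
    }
}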


You can find the code in our GitHub repository https://github.com/ioconnectservices/upsertpoc.

Approach 1: Out-of-the-box Mule batch jobs and batch commit components

Some background about the Mule batch components: you can configure the block size and the maximum number of threads of the batch job. Moreover, in the Batch Commit component, you can also configure the commit size of the batch. This gives a lot of flexibility in terms of performance and memory tuning.


This flexibility comes at a price: you must calculate the amount of memory the computer must be able to dedicate to this process alone for each round of blocks. This can easily be estimated with the following formula:


Maximum memory = record size * block size * maximum number of threads


For instance, in our test, the size of a record, which is a SQL upsert statement, is 3.1 KB. The settings for the batch component are a block size of 200 records and 25 running threads. This requires a total of about 15.13 MB of memory at a time. In this case, blocks will be executed a minimum of approximately 4105 times (remember the 821000 records?). Also, you must verify that the host has enough CPU and memory available for garbage collection too.
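As a quick sanity check of the formula with the numbers above (the exact figures depend on rounding and on how Mule buffers the records, so treat this only as an estimate):

public class BatchMemoryEstimate {
    public static void main(String[] args) {
        double recordKb   = 3.1;      // average size of one SQL upsert statement
        int blockSize     = 200;      // records per block
        int maxThreads    = 25;       // blocks that can be in flight at the same time
        long totalRecords = 821_000;

        double memoryMb = recordKb * blockSize * maxThreads / 1024;            // ~15.1 MB
        long blocks     = (long) Math.ceil(totalRecords / (double) blockSize); // ~4105 blocks

        System.out.printf("~%.2f MB held by in-flight blocks, ~%d blocks in total%n",
                memoryMb, blocks);
    }
}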


The flow

Figure 1. The batch flow.


  1. The batch job is configured to use a maximum of 25 threads and a block size of 200 records.
  2. The file content is transformed into an iterator which is then passed to the batch step.
  3. The commit size of the Batch Commit is set to 200. This matches the block size of the batch job, meaning the full block will be committed.
  4. The Database connector uses the Insert operation in bulk mode, and the query is parameterized.


This is a simple flow, as all the pagination and bulk construction is done by Mule; we just need to worry about the SQL statement and performance.

The metrics

As explained before, batch jobs are designed to run as fast as possible by processing records in multiple threads.


  1. The total time of execution is ~7 minutes.
  2. 4105 round trips are made to insert the records into the database.
  3. The maximum memory used during the execution of the batch is 1.29 GB.

Approach 2: Custom pagination

The overall idea here is to read the file, transform the content into a list, and iterate through the list to create pages of records from which we construct a series of SQL queries. The queries are then sent in bulk to a Database connector with the bulk mode flag enabled.
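The actual flow uses Mule components (For Each, an asynchronous scope, and the Database connector), but the pagination logic itself boils down to something like the following plain-Java sketch; upsertPage is a hypothetical stand-in for the bulk Insert of one page:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CustomPaginationSketch {

    static void process(List<String[]> records) {
        int pageSize = 200;                                       // same page size as approach 1
        int pages = (int) Math.ceil(records.size() / (double) pageSize);

        ExecutorService pool = Executors.newFixedThreadPool(25);  // mirrors the 25 async threads
        for (int p = 0; p < pages; p++) {
            List<String[]> page = records.subList(p * pageSize,
                    Math.min((p + 1) * pageSize, records.size()));
            pool.submit(() -> upsertPage(page));                  // bulk upsert of one page
        }
        pool.shutdown();
    }

    // Hypothetical stand-in for the Database connector's bulk Insert of one page.
    static void upsertPage(List<String[]> page) { /* build and execute the upserts */ }
}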

The flow

Figure 2. Custom pagination flow.


  1. The number of pages is determined based on the page size. In this case, the page size is set to 200 like in the first approach.
  2. The For Each scope component takes the number of pages as the collection input.
  3. The CSV File Reader consumes the file and builds the map that will be used as the payload, and then it maps the CSV fields to columns in a database record.
  4. The created queries are passed to the asynchronous Database scope, which executes the bulk statements with a maximum of 25 threads, like in approach 1.

The metrics

  1. The total time of execution was 5 minutes on average. The building of the SQL bulk is done in a single thread, but the upsert execution is done asynchronously.
  2. The total of round trips to the database was 4109.
  3. The memory consumption was at a maximum of 1.42 GB.

The extra approach 3: Hybrid

This approach was also tested; its results were not as satisfactory as the two above in regard to execution time, but it showed the lowest memory consumption. The results of the testing are presented next.


The SQL bulk query is constructed manually but the pagination is now handled by the batch job.

The flow

Figure 3. Batch manually building the SQL bulk statement.


  1. The CSV content is transformed into an iterator to be passed to the batch process step.
  2. The batch process handles the block size and pagination.
  3. In the batch process, each record is used to construct the SQL for that particular record. The SQL query is then added to a collection that is used to create one single SQL statement with all the queries appended, to be processed in bulk (see the sketch after this list).
  4. The Batch Commit component uses the same commit size as the block size of the batch job.
  5. The Database connector uses the Bulk Execute operation to insert the records into the database.
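Here is a minimal sketch of what step 3 could look like in Java, again assuming the simplified, hypothetical contacts table; note that the values are concatenated directly into the SQL text, which is exactly why this variant loses the sanitization you get with parameterized queries (see the conclusion below):

import java.util.List;

public class HybridBulkSqlSketch {

    // Builds one multi-statement SQL string for a block of records, to be handed
    // to the Database connector's Bulk Execute operation.
    static String buildBulkSql(List<String[]> block) {
        StringBuilder bulk = new StringBuilder();
        for (String[] r : block) {
            bulk.append("INSERT INTO contacts (id, first_name, email) VALUES (")
                .append(r[0]).append(", '").append(r[1]).append("', '").append(r[2]).append("') ")
                .append("ON DUPLICATE KEY UPDATE first_name = VALUES(first_name), ")
                .append("email = VALUES(email);\n");
        }
        return bulk.toString();
    }
}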

The metrics

  1. The total time of execution, when it completed successfully, is 18 minutes on average.
  2. The number of connections matches the maximum of 25 threads running in the batch job. This gives a total of close to 4105 round trips to the database.
  3. The maximum memory used during the execution of the batch is 922 MB.


Conclusion

Many times, a thoughtful design is more helpful than the out-of-the-box features a platform may offer. In this scenario, the custom pagination approach upserts the records into the database faster than the batch approach. However, here are a few things to consider as the outcome of this proof of concept:
  • The custom pagination approach is more flexible for treating data that can't be split into records so easily.
  • For scenarios where you have a source with millions of records coming from separate systems, it's generally a good practice to consume the content in a streaming fashion so you don't blow up memory or the network.
  • It’s easier to maintain the batch job flow than the custom pagination flow.
  • Using Mule's batch jobs gives you better facilities for batch result reporting, as they give you the totals of processed, succeeded, and failed records.
  • If memory management is the most important factor in your solution, then the hybrid approach is better, as it shows the best memory numbers.


As a side experiment, I also observed that the Bulk Execute operation in the Database connector is slower than the Insert operation in bulk mode. Moreover, parameterized mode allows you to take data from any source, trusted or untrusted, and still have the queries sanitized.