ACM.113 Example demonstrating how swallowing errors can come back to bite
This is a continuation of my series of posts on Automating Cybersecurity Metrics.
I wrote about error handling in my series on secure programming. I explained that it is not advisable to swallow errors, or in other words, catch them in some code and not report them in any output from the application.
This post shows that CloudFormation is doing that in at least one case and how it causes problems. I wasn’t planning to write this post but I had to spend time working around the problem when I discovered it, so here you go.
When swallowing errors to ignore one type of error affects all errors
In regards to my delete script I presumed that I had not followed my own rules because I was in a hurry and being a tad lazy. Instead of checking to see if a CloudFormation stack exists before I delete it, I thought I had simply ignored errors where it didn’t exist because it was already deleted. I initially thought that was the problem. As it turns out this wasn’t my code after all.
I was trying to delete the first stack in my list (an EC2 instance) because I want to test my delete and redeploy scripts. I wrote about how I developed that deletion script here and dealt with various dependencies.
Deleting CloudFormation Stacks
Well now I have a new dependency on the EC2 stack. I knew I had some new resources and dependency issues. That’s why I’m going back to test my scripts. Rather than try to insert all the new dependencies up front, I figured I would test and resolve any errors along the way.
The only thing is, I didn’t get any errors in the CLI output after running the script, but when I checked to see that the first stack in the list got deleted, it still existed. There was also no error message in the CloudFormation stack list, so initially I didn’t realize what was going on.
CloudFormation sometimes warns you when things can’t be deleted and leaves things in an ugly error state which I don’t like. I wrote about that recently in my post on how to fix CloudFormation:
In this case, CloudFormation did not leave the stack in an error state. I didn’t realize that an error had occurred. I clicked on the stack and realized that although the stack did not report an error or end up in an error status, the stack event list included an error message. The stack couldn’t be deleted due to an EIP association I added since the last time I tested this delete stack:
How an EIPAssociation in CloudFormation can Help Prevent Dependency Issues
I used the EIP association to avoid having to change firewall rules when I delete stacks. I will not delete the EIP but only the association for right now so I don’t have to delete the local firewall rules I created in this post:
Local Firewall Rules to Connect to an AWS EIP via SSH
I added a comment in the script indicating that still needs to be completed.
When swallowing errors comes back to haunt you
Here’s where my error is causing me grief. Initially I thought I was swallowing all errors associated with the aws CloudFormation delete-stack command to ignore errors when a stack doesn’t exist. That would have caused a similar result to what I am facing.
Rather than swallowing errors, I need to adjust the code to do what a good engineer should do. Shame on me. I need to check to see if the stack exists before I attempt to run the delete command and not be lazy.
However, upon further investigation, I am not the one swallowing the error. I had already removed my code that swallowed that error and was properly reporting the results to the screen. So I was confused. Why did I not get an error when the stack failed to delete?
When I run my delete stacks I print out the commands so I can easily re-run them. Here’s the command I run to delete my EC2 stack:
aws cloudformation delete-stack --stack-name AppDeploy-EC2-Developer --profile delete
When I run that command in a terminal, I get no indication that the deletion failed. Aha. AWS is actually swallowing the error and not reporting it back on the command line. My code has no way to know an error has occurred. It looks like everything was successful. How is my script supposed to know that the EC2 instance never got deleted?
This is not good. An AWS customer may be running a list of delete commands for stacks and things end up in a weird state because their script continues to delete things after an error has occurred and it should have stopped.
DO NOT SWALLOW ERRORS IN PRODUCTION ENVIRONMENTS.
This is what the error looks like in the AWS console. It is not even reported as an error. It is reported as “UPDATE_COMPLETE”, which is not accurate.
Perhaps AWS could instead leave a stack in a valid state but have a flag that there was an error on the last run or something like that. Because, yes, I don’t really like stacks hanging around like this that don’t really have errors. A change rolled back and I just decided to leave the stack as is.
But not reporting any error when an error occurred causes problems for downstream actions taken by a single script. The script continues when it should stop and report an error. The subsequent resources don’t delete properly because a prior dependency didn’t get deleted. Because that still exists, subsequent stacks fail to delete.
How can I resolve this problem?
Luckily, I had safety checks built into my delete script. That means that I can step through and check each stack delete properly before proceeding to the next one. I wrote about that in the above post on deleting resources.
Whenever I am testing that my script properly deletes new resources, the script gives the option to verify every single call to delete a stack. I used that option and paused the script after the first stack to make sure it worked before proceeding. That’s when I determined the resources in the first stack had not deleted correctly even though I got no error.
That is helpful but it doesn’t really solve the problem. I would like to know that the error occurred and have the script exit.
We can’t do that. Recall that neither the stack status nor the events indicate an error state. It would be pretty hokey to parse out the stack reason and try to decipher that. What other options do we have?
We can instead run a query to see if the stack still exists after calling the delete function and throw our own error based on the fact that the stack still exists when it should not.
Unfortunate that we have to do this extra work here all due to a swallowed error message. I hope this illustrates why you should not swallow errors. This whole blog post could have been avoided and it took some time to sort this out.
We could use this function that waits for the stack to get into a completed deletion state:
What I don’t like about this option is that, as you can see, it is going to check 120 times! I know in this case the stack deletion completes in about 2 seconds and then the stack sits in an UPDATE_COMPLETE status. So that seems like it’s going to waste my life away doing unnecessary checks in this particular case.
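For reference, here is roughly what the waiter approach could look like, assuming the waiter in question is `aws cloudformation wait stack-delete-complete`. The function name `wait_for_delete` is hypothetical, and the stack name and profile are passed in as arguments. Treat it as a sketch rather than the code from the delete script.

```shell
# Sketch only: wait_for_delete is a hypothetical helper name.
# Usage: wait_for_delete AppDeploy-EC2-Developer delete
wait_for_delete() {
  local stack_name="$1" profile="$2"

  # delete-stack returns as soon as the request is accepted.
  aws cloudformation delete-stack \
    --stack-name "$stack_name" --profile "$profile"

  # The waiter polls the stack and returns non-zero if it gives up
  # or sees a state it treats as a failure.
  if ! aws cloudformation wait stack-delete-complete \
      --stack-name "$stack_name" --profile "$profile"; then
    echo "ERROR: $stack_name was not deleted" >&2
    return 1
  fi

  echo "$stack_name deleted"
}
```

The upside is that a failed wait turns into a non-zero return code the calling script can act on. The downside is the one described above: how long the waiter keeps polling is up to the CLI, not the script.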
This function to check if a stack exists has the converse problem:
If I’m deleting a stack and the deletion is successful, and I then sit around waiting for the stack to exist, that is also going to waste time, likely even more minutes or hours of my life.
All we need to do is run one query to see if the stack exists — yes or no. If it still does after running our delete script we have an error. I can use cloudformation describe-stacks:
What I can do is output the error message if the stack does not exist to a variable:
If the stack exists it will return the standard response:
Now I can write a function to check if a stack exists and return false if the results contain the string “does not exist.” Otherwise I will return true.
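A first pass at that idea might look like the sketch below. The name `stack_exists_simple` is made up for illustration, and the match on “does not exist” assumes the AWS CLI keeps wording its error message that way. In shell terms, "true" is a return code of 0 and "false" is non-zero.

```shell
# Sketch only: stack_exists_simple is a hypothetical name.
# Returns 0 (true) if the stack exists, 1 (false) if it does not.
stack_exists_simple() {
  local stack_name="$1" profile="$2" output

  # 2>&1 folds the CLI error text into the captured output;
  # || true keeps a "set -e" script from stopping on the CLI error.
  output=$(aws cloudformation describe-stacks \
    --stack-name "$stack_name" --profile "$profile" 2>&1 || true)

  case "$output" in
    *"does not exist"*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example call (not executed here):
#   if stack_exists_simple AppDeploy-EC2-Developer delete; then echo "still there"; fi
```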
What I actually did after testing further is create two functions.
get_stack_status
Note that in this function, I am swallowing an error to capture it in a variable, but I am not ignoring it. I’m using it to provide an appropriate status, “NOEXIST”, that I can use in subsequent logic.
Also, I am adding || true at the end of that statement so my script doesn’t get into an error state and stop when the AWS CLI throws an error because the stack doesn’t exist.
stack_exists
In this function I use the get_stack_status to get the CloudFormation status which will be NOEXIST if the stack does not exist. If it does not I return false.
The other thing this function does is check for the DELETE_IN_PROGRESS state. The deletion action does not wait for the deletion of a stack to complete before returning control to the calling application. That means the check to see if the stack still exists will indicate true even though the stack is in the process of being deleted. When the calling function determines the stack still exists, it will exit the program. We need to make sure the deletion is complete before reporting whether or not the stack exists.
We can query the state of a stack like this:
aws cloudformation describe-stacks --stack-name AppDeploy-EC2-Developer --query "Stacks[0].StackStatus" --output text
We can run a loop as long as the stack remains in the DELETE_IN_PROGRESS state.
Then once the stack is no longer in the DELETE_IN_PROGRESS state, we can check to see if the stack still exists and return true if it does.
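Put together, the two functions described above could be sketched like this. The function names and the NOEXIST value come from the description; everything else is an assumption, including the argument order, the `POLL_SECONDS` variable (a hypothetical knob for the loop delay), and the match on the “does not exist” error text.

```shell
# Sketch only. POLL_SECONDS is a hypothetical setting, not an AWS CLI feature.

# Prints the stack status, or NOEXIST when the CLI says the stack is missing.
get_stack_status() {
  local stack_name="$1" profile="$2" status

  # The error is captured rather than ignored: 2>&1 puts it in the variable,
  # and || true stops the script from exiting when the CLI returns an error.
  status=$(aws cloudformation describe-stacks \
    --stack-name "$stack_name" --profile "$profile" \
    --query "Stacks[0].StackStatus" --output text 2>&1 || true)

  case "$status" in
    *"does not exist"*) echo "NOEXIST" ;;
    *) echo "$status" ;;
  esac
}

# Returns 0 (true) if the stack exists, 1 (false) if it does not.
# Waits while a deletion is still in progress before answering.
stack_exists() {
  local stack_name="$1" profile="$2" status

  status=$(get_stack_status "$stack_name" "$profile")
  while [ "$status" = "DELETE_IN_PROGRESS" ]; do
    sleep "${POLL_SECONDS:-5}"
    status=$(get_stack_status "$stack_name" "$profile")
  done

  [ "$status" != "NOEXIST" ]
}
```

Note that any other CLI failure, such as expired credentials, is printed as the "status" and treated as "stack exists" in this sketch, which errs on the side of stopping the script rather than carrying on.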
I am not sure if this code works in all cases. I wonder about the stack status when it contains multiple resources. Most of my stacks don’t. We’ll deal with that problem if and when it happens.
I can use that function in two ways to improve my code.
First I will skip attempting to delete a stack that does not exist. That should save time waiting on AWS status reports from CloudFormation. Secondly, I can report an error when stack deletion fails even though the stack status indicates success and solve the problem that inspired this post.
Testing the script shows it works and now the script reports whether a stack does not exist or whether it was deleted.
Now there’s one more thing we need to do. If the stack was not deleted we need to exit. If the stack still exists after attempting to delete it, there’s a problem that needs to be resolved:
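The skip, delete, verify, and exit flow could be sketched as follows. The name `delete_stack`, the messages, and the `POLL_SECONDS` variable are hypothetical, and a condensed `stack_exists` is included only so the block runs on its own. It assumes the “does not exist” wording of the CLI error.

```shell
# Sketch only: names, messages, and POLL_SECONDS are illustrative assumptions.

# Condensed existence check: waits out DELETE_IN_PROGRESS, then
# returns 0 if the stack is still there and 1 if it is gone.
stack_exists() {
  local status
  while :; do
    status=$(aws cloudformation describe-stacks \
      --stack-name "$1" --profile "$2" \
      --query "Stacks[0].StackStatus" --output text 2>&1 || true)
    [ "$status" = "DELETE_IN_PROGRESS" ] || break
    sleep "${POLL_SECONDS:-5}"
  done
  case "$status" in
    *"does not exist"*) return 1 ;;
    *) return 0 ;;
  esac
}

delete_stack() {
  local stack_name="$1" profile="$2"

  # Use 1: skip stacks that are already gone.
  if ! stack_exists "$stack_name" "$profile"; then
    echo "Stack $stack_name does not exist, skipping"
    return 0
  fi

  aws cloudformation delete-stack \
    --stack-name "$stack_name" --profile "$profile"

  # Use 2: delete-stack reported nothing, so verify the result ourselves
  # and stop the whole script if the stack is still around.
  if stack_exists "$stack_name" "$profile"; then
    echo "ERROR: stack $stack_name still exists after delete-stack" >&2
    exit 1
  fi

  echo "Stack $stack_name deleted"
}
```

Because the check is "does the stack still exist" rather than "is the status UPDATE_COMPLETE", it does not depend on which status CloudFormation leaves the stack in after a failed deletion.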
Now I can test my entire delete script and make sure it works.
I got another error where a stack was not deleted and my script properly exited:
In this case the stack status was CREATE_COMPLETE not UPDATE_COMPLETE so our method of checking for deletion catches cases that would not have been caught by checking for UPDATE_COMPLETE.
Do not swallow errors
All that extra code and time spent writing this post was due to the fact that an error occurred without reporting an error state. Seems like AWS could save us some time and make sure to report the error properly instead so our code can take a proper action as a result.
Do not swallow errors.
Especially when other people are depending on your code.
Now on to what I wanted to be doing …deploying a user-specific EC2 instance on AWS.
Follow for updates.
Teri Radichel
If you liked this story please clap and follow:
Medium: Teri Radichel or Email List: Teri Radichel
Twitter: @teriradichel or @2ndSightLab
Request services via LinkedIn: Teri Radichel or IANS Research
© 2nd Sight Lab 2022
All the posts in this series:
Automating Cybersecurity Metrics (ACM)
____________________________________________
Author:
Cybersecurity for Executives in the Age of Cloud on Amazon
Need Cloud Security Training? 2nd Sight Lab Cloud Security Training
Is your cloud secure? Hire 2nd Sight Lab for a penetration test or security assessment.
Have a Cybersecurity or Cloud Security Question? Ask Teri Radichel by scheduling a call with IANS Research.
Cybersecurity & Cloud Security Resources by Teri Radichel: Cybersecurity and Cloud security classes, articles, white papers, presentations, and podcasts
Why You Should Not Swallow Errors was originally published in Cloud Security on Medium, where people are continuing the conversation by highlighting and responding to this story.