Building peaceful and scalable infrastructure

Posted in Devops, Server Side, Startup, Technology

This is the first in a series of blog posts on building peaceful and scalable infrastructure.

Every DevOps engineer's dream is to have a stable infrastructure, so that they can sleep peacefully without having to worry about systems going offline. From my experience, if you cover the following areas in your infrastructure, you can sleep peacefully.

1. Design/Architecture

The better you know the components in your infrastructure, the better you can design it. There are multiple layers at which the components in your infrastructure interact with each other:
1. Application-level interactions: how your services talk to each other. This will need some deep digging.
2. Network-level interactions: this involves understanding the various components an application uses and how they interact with each other.
3. Third-party interactions: not all the services used in your infrastructure will be hosted on premises; some will be outside your scope. You need to identify them as well.

Once you have identified the interactions, it's time to build out diagrams. These diagrams are very important; they will be a good reference for everything you do in the upcoming steps.
1. Network diagram: this will help you work out how to place different components, which means creating different subnets and defining availability zones. Here you have to keep future scalability in mind by leaving enough address space (but not too much); see the subnet sketch after this list.
2. Application diagram: this is about how your services interact. Below is an example of a typical web-job setup:
[Diagram: SimpleWebJob, a typical web-job setup]
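
To make the "enough space, not too much" point concrete, here is a small Python sketch of subnet planning. The VPC range, prefix sizes, and service names are all made-up examples, not a recommendation:

```python
import ipaddress

# Hypothetical VPC range; a /16 leaves plenty of room to grow.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Carve /20 subnets (~4k addresses each): big enough per service,
# small enough that most of the /16 stays free for future services.
subnets = list(vpc.subnets(new_prefix=20))

services = ["web", "jobs", "db"]
plan = dict(zip(services, subnets))

for name, net in plan.items():
    print(f"{name}: {net} ({net.num_addresses} addresses)")
# web: 10.0.0.0/20 (4096 addresses), and so on; 13 subnets stay unused.
```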

2. Security

Now that you have a clear picture of how the interactions are going to happen, it's time to create boundaries and limit scope. Try to keep the components of individual services in different subnets. If you are planning to share a resource, say a database, then keep it in the subnet of the service consuming it most. Once you know which components will live in which subnet, it's time to firewall them. You can either use the built-in OS firewall or, if you are on AWS, security groups. Every service or role gets its own security group, allowing inbound traffic only from the subnets of the services that require it.

Now you might ask: why not use a security group as the source instead of a subnet? Yes, you can do that, but I have found managing through subnets to be easier, and at the same time this approach can be replicated across different cloud platforms. That covers network-level security.
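
On AWS, a subnet-as-source rule looks roughly like this with boto3. The security group ID, port, and CIDR are placeholders for your own values:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow the web subnet (placeholder CIDR) to reach the database's
# security group on the PostgreSQL port; nothing else gets in.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical DB security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,
        "ToPort": 5432,
        "IpRanges": [{"CidrIp": "10.0.0.0/20",
                      "Description": "web subnet"}],
    }],
)
```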

Service-level security: when working closely with AWS, I recommend using IAM roles; they will help you manage your resources much better. You can actually have an infrastructure where you don't need any credentials to access your database or Redis, because they simply cannot be accessed from an unauthorised machine. IAM roles work by making a resource behave as if it were a user, and limiting access on that front.
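
A minimal sketch of that idea with boto3, with a hypothetical role name and bucket: create a role that EC2 instances can assume, and scope it to exactly what the service needs.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy: let EC2 instances assume this role.
trust = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="web-role",
                AssumeRolePolicyDocument=json.dumps(trust))

# Grant only what the service needs (hypothetical bucket).
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::my-app-builds/*",
    }],
}
iam.put_role_policy(RoleName="web-role", PolicyName="web-s3-read",
                    PolicyDocument=json.dumps(policy))
```

With the role attached to an instance profile, code on the machine can call AWS with no stored keys; the SDK picks up temporary credentials automatically.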

Next is access management. This is also a key thing to take care of. When I talk about access management, it's not just about developers accessing machines, but machines accessing machines. The 7-level model I have developed will help you define how other people in your organization can access machines. Also, I believe in the concept of making production machines black boxes, i.e. something no one can access directly: everything in them should be automated or accessible over auditable systems.

3. Configuration Management

This is the part which gives you stability and insurance for your infrastructure. Having a proper CM setup will give you flexibility in the longer run. There are alternative approaches to cover this part. To implement CM, the first two areas (Design & Security) should be very clear. The following are another set of prerequisites before you jump into setting up CM:
1. You should have clear details on the software packages which will be used across the entire stack.
2. You need clear details on the versions of these software packages.
3. You should know the production system configs you want to use for individual services.

Note: Try to use the stable releases of any software package you are going to use. This will give you more confidence in your dependencies.

Once you have cleared all the dependencies, the next step is picking a tool. There are many tools, like Ansible, Chef, and Puppet, which you can choose and move ahead with. You should choose the right tool based on your requirements, but to be honest, it's not the tool which matters, it's the implementation. All of these tools will cover 90% of the use cases in the world. Next is writing scripts to configure and deploy your services. You can use a blue-green strategy to replace machines which have been CM'ed.
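
Whatever the tool, the core idea is the same: declare the desired state and converge to it idempotently. A toy Python sketch of that pattern for apt packages (package names and versions are just examples; real CM tools do this and much more):

```python
import subprocess

def ensure_package(name: str, version: str) -> None:
    """Install a pinned apt package only if it isn't already at that version."""
    current = subprocess.run(
        ["dpkg-query", "-W", "-f=${Version}", name],
        capture_output=True, text=True,
    ).stdout.strip()
    if current == version:
        return  # already converged, nothing to do
    subprocess.run(
        ["apt-get", "install", "-y", f"{name}={version}"],
        check=True,
    )

# Desired state: the pinned versions from your prerequisites list.
for pkg, ver in {"nginx": "1.18.0-0ubuntu1",
                 "redis-server": "5:5.0.7-2"}.items():
    ensure_package(pkg, ver)
```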

4. Build & Deployments

This is another crucial part of the infrastructure. Your infrastructure should be capable of making fast deployments and have support for rollbacks. To achieve this, you first need to have a proper build.

A build is nothing but a packaged version of your software application which can be rolled out on the production machines in seconds; for that to happen, your build should contain all its dependencies. There are different ways to deal with dependencies (coming soon…). The simplest one is to cache dependencies in the build itself; for example, in the Ruby world you can have the Bundler cache as part of the package. If you have frontend assets, you will understand how useful this is: you do the compilation only once, and the compiled assets become part of your build. Once the build is ready, you should tag it and store it in highly available storage.
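
A minimal sketch of that last step in Python, assuming a tarball tagged with a version and an S3 bucket named purely for illustration:

```python
import subprocess
import boto3

version = "v1.4.2"            # build tag, e.g. taken from git
artifact = f"app-{version}.tar.gz"

# Package the application directory, vendored dependencies included.
subprocess.run(["tar", "-czf", artifact, "app/"], check=True)

# Store the tagged build in highly available storage (hypothetical bucket).
boto3.client("s3").upload_file(artifact, "my-app-builds",
                               f"builds/{artifact}")
```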

After you have generated the build, it's time to ship it out. Generally, blue-green deployments are used to ensure zero downtime. Classic blue-green asks you to double the number of servers, launching the new ones with the latest build, but I think that strategy increases your deployment time. So I suggest you use the existing machines and do a blue-green at the application level, i.e. keep two directories, latest and stable. You move latest to stable at the start of a deployment and update latest. Then you start the application on latest and slowly turn off stable. This strategy comes with another benefit: split-second rollbacks, as the machines just have to switch back to stable. There are various other strategies for deployment as well.
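
Here is one way the latest/stable dance could look, sketched in Python. The paths and the templated systemd units are placeholders for whatever supervises your app:

```python
import shutil
import subprocess
from pathlib import Path

ROOT = Path("/srv/app")  # hypothetical application root

def deploy(new_build: Path) -> None:
    stable, latest = ROOT / "stable", ROOT / "latest"
    # The current latest becomes the rollback target.
    if stable.exists():
        shutil.rmtree(stable)
    if latest.exists():
        latest.rename(stable)
    # Unpack the new build into latest.
    shutil.unpack_archive(str(new_build), str(latest))
    # Start the new copy, then drain the old one (placeholder commands).
    subprocess.run(["systemctl", "start", "app@latest"], check=True)
    subprocess.run(["systemctl", "stop", "app@stable"], check=True)

def rollback() -> None:
    # Split-second rollback: just switch back to stable.
    subprocess.run(["systemctl", "start", "app@stable"], check=True)
    subprocess.run(["systemctl", "stop", "app@latest"], check=True)
```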

You can use Chef/Ansible/Capistrano/shell scripts, etc. for this. Again, the strategy matters more than the tool.

5. Logging

This is an important part, because when I say that your production box should be a black box, developers will say they need logs to debug issues happening in production. Hence, your logging infra needs to be robust and diverse. The ELK stack is one of the best stacks to use; I have set it up and used it throughout my career to date. The setup is easy, but how you log is very important. You need to write good Logstash parsers to aggregate relevant logs. One of the best setups I have seen uses rsyslog and logrotate: you read the logs from individual machines and transmit them over a queue or Redis to the parsing layer, which can be scaled horizontally. This layer then dumps the parsed logs into Elasticsearch.
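
The parsing layer can be as simple as a pool of workers popping raw lines off Redis, parsing them, and indexing into Elasticsearch. A stripped-down Python sketch, where the hostnames, queue name, index name, and the parser itself are all stand-ins:

```python
import redis
from elasticsearch import Elasticsearch

r = redis.Redis(host="log-queue.internal")       # hypothetical queue host
es = Elasticsearch(["http://es.internal:9200"])  # hypothetical ES host

def parse(raw: bytes) -> dict:
    # Stand-in for your real Logstash/grok-style parsing.
    return {"message": raw.decode("utf-8", errors="replace")}

while True:
    # Block until a raw log line arrives on the queue.
    _, raw = r.blpop("raw-logs")
    es.index(index="logs", document=parse(raw))
```

Because each worker only talks to the queue and to ES, you scale this layer horizontally by just running more copies of it.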

There can be cases where you miss logs because of one issue or another. To cover this edge case, you should have a mechanism to dump raw logs from your application. Logrotate & S3 come to the rescue: whenever logrotate rotates a file, you also dump that file to S3. Also, when using spot instances, AWS has a mechanism to let a machine know that it is going to be terminated, so you should use that as a trigger to take a dump of the logs before termination.
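
The spot-termination notice is exposed through the instance metadata service, so a small watcher can poll it and flush logs when it appears. A rough sketch (bucket and log path are placeholders; note that instances enforcing IMDSv2 additionally require a session token, which this sketch omits):

```python
import time
import requests
import boto3

# Returns 404 until AWS schedules this spot instance for termination.
NOTICE_URL = ("http://169.254.169.254/latest/meta-data/"
              "spot/instance-action")

while True:
    if requests.get(NOTICE_URL, timeout=2).status_code == 200:
        # Termination is coming: dump the current raw log to S3.
        boto3.client("s3").upload_file(
            "/var/log/app/current.log",        # hypothetical log path
            "my-raw-logs", "dumps/current.log")
        break
    time.sleep(5)
```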

Tip: When designing applications, you should try to write custom logs to a separate log file and then pass it in raw format to ES. This is a nice way to keep your logs de-cluttered. [I will write a separate blog on this]

6. Alerting & Monitoring

This is a key area of your infra where you should invest a good amount of time. Having the right set of tools is very important for a robust system. I have used CloudWatch and Sensu for monitoring to date, and both work very well. There are certain metrics you should be collecting from the machines which help you understand utilisation patterns and make decisions based upon them: RAM, CPU, network I/O, and disk usage are the four must-have metrics. Apart from that, you should have the capability to add custom check scripts which provide you data on your services, like the number of requests in the queue for Nginx, or uptime monitored through status pages.
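
Custom checks in Sensu (and Nagios-style systems generally) are just scripts that report status through their exit code: 0 for OK, 1 for warning, 2 for critical. A minimal uptime check against a status page, where the URL is a placeholder:

```python
import sys
import requests

URL = "http://web.internal/status"  # hypothetical internal status page

try:
    resp = requests.get(URL, timeout=5)
except requests.RequestException as exc:
    print(f"CRITICAL: {URL} unreachable ({exc})")
    sys.exit(2)

if resp.status_code != 200:
    print(f"WARNING: {URL} returned {resp.status_code}")
    sys.exit(1)

print(f"OK: {URL} is up")
sys.exit(0)
```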

When you have all these metrics, it's time to analyse them and set up baselines. Once you have them, any metric crossing those baselines should create an alert. Now, there are certain levels which are acceptable, so there isn't a need to wake someone up; you just raise a warning and wait for the metric to recover. So any metric crossing the baseline becomes a warning. Then there are levels which, when reached, can cause the system to crash, so any data point approaching that level should raise a critical alarm and wake someone from your team :D.
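
In CloudWatch terms, the warning/critical split is simply two alarms with different thresholds and actions. A rough boto3 sketch of the critical one; the threshold, instance ID, and SNS topic ARN are placeholders:

```python
import boto3

cw = boto3.client("cloudwatch")

# Critical: sustained CPU near the crash level pages a human via SNS.
cw.put_metric_alarm(
    AlarmName="web-cpu-critical",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,               # 5-minute datapoints
    EvaluationPeriods=2,      # two consecutive breaches, not a blip
    Threshold=90.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:pager"],  # hypothetical
)
```

A matching warning alarm would use a lower threshold and a quieter action, say a chat notification instead of the pager topic.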

Now, you need a set of tools to properly communicate critical alarms to team members, for which you can use PagerDuty, OpsGenie, or something similar. These tools not only support push notifications, they will keep calling until you acknowledge the alert :P. You can also define custom escalation policies.

Tip: If you are using CM, you can make this part of your base script, so all machines which are spawned will automatically have monitoring built in by default. You can do the same with your logging bit as well.

I have added a lot of "coming soon"s, which I will definitely write up some day. Let me know if you have queries in the comments section below. I will be happy to help you out 🙂