
Problem faced with EFS throughput

We faced a problem during the build-and-delivery phase in a Production environment that we would like to share with you all.

Once the AWS infrastructure was provisioned, the application stack deployed, and testing commenced, everything was working fine.  A week later we noticed that the application response had become very sluggish.  The sluggishness was particularly noticeable when copying files to a specific application directory that was mounted on Elastic File System (EFS).  While investigating the issue, we tried executing commands such as “ls” in the EFS-mounted application directory and “df”; the output of both commands was significantly delayed.  Other metrics such as memory usage and CPU were minimal as well.

Analysis:

  1. The issue was not limited to one particular server; it affected all three application servers behind the Load Balancer.
  2. In our first level of analysis, the “ls” command was not working as expected in one particular directory but worked fine in other directories. From this behavior we understood that there was a problem with that specific directory. This was a tip for us to explore further, and we found that the directory was a symbolic link to EFS storage.
  3. On further exploration of the EFS metrics in CloudWatch, we found that “BurstCreditBalance” had been gradually declining for the past week and had reached ZERO, at which point we concluded this was the issue (a sample metric query follows this list).
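
For reference, here is a minimal sketch of how that CloudWatch check can be scripted with boto3. The region, file system ID, and one-week window are placeholders; adjust them to your environment.

  # Sketch: pull the BurstCreditBalance metric for an EFS file system over the
  # last 7 days to see whether the credits are trending toward zero.
  # The region and file system ID are placeholders.
  import datetime
  import boto3

  cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
  now = datetime.datetime.now(datetime.timezone.utc)

  response = cloudwatch.get_metric_statistics(
      Namespace="AWS/EFS",
      MetricName="BurstCreditBalance",
      Dimensions=[{"Name": "FileSystemId", "Value": "fs-12345678"}],
      StartTime=now - datetime.timedelta(days=7),
      EndTime=now,
      Period=3600,                 # one data point per hour
      Statistics=["Minimum"],
  )

  for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
      print(point["Timestamp"], point["Minimum"])    # balance is reported in bytes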

Root Cause:

There are two throughput modes available with EFS, Bursting Throughput mode and Provisioned Throughput mode. With Bursting Throughput mode, throughput on Amazon EFS scales as your file system grows. With Provisioned Throughput mode, you can instantly provision the throughput of your file system (in MiB/s) independent of the amount of data stored.
All file systems, regardless of size, can burst to 100 MiB/s of throughput. Those over 1 TiB can burst to 100 MiB/s per TiB of data stored in the file system. For example, a 10-TiB file system can burst to 1,000 MiB/s of throughput (10 TiB x 100 MiB/s per TiB). The portion of time a file system can burst is determined by its size. The bursting model is designed so that typical file system workloads can burst virtually any time they need to. Whenever it is inactive or driving throughput below its baseline rate, the file system accumulates burst credits.

For example, a 100-GiB file system can burst (at 100 MiB/s) for 5 percent of the time if it’s inactive for the remaining 95 percent. Over a 24-hour period, the file system earns 432,000 MiBs worth of credit, which can be used to burst at 100 MiB/s for 72 minutes.

File systems larger than 1 TiB can always burst for up to 50 percent of the time if they are inactive for the remaining 50 percent. The minimum file system size used when calculating the baseline rate is 1 GiB, so all file systems have a baseline rate of at least 50 KiB/s. File systems can earn credits up to a maximum credit balance of 2.1 TiB for file systems smaller than 1 TiB, or 2.1 TiB per TiB stored for file systems larger than 1 TiB. This approach implies that file systems can accumulate enough credits to burst for up to 12 hours continuously.

The following table provides more detailed examples of bursting behavior for file systems of different sizes.

  File System   Baseline Aggregate   Burst Aggregate      Maximum Burst        % of Time File System
  Size (GiB)    Throughput (MiB/s)   Throughput (MiB/s)   Duration (Min/Day)   Can Burst (Per Day)
  10            0.5                  100                  7.2                  0.5%
  256           12.5                 100                  180                  12.5%
  512           25.0                 100                  360                  25.0%
  1024          50.0                 100                  720                  50.0%
  1536          75.0                 150                  720                  50.0%
  2048          100.0                200                  720                  50.0%
  3072          150.0                300                  720                  50.0%
  4096          200.0                400                  720                  50.0%

(Thanks to Amazon Documentation)
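
The table follows directly from the model quoted above: a baseline of 50 KiB/s per GiB stored, bursting to 100 MiB/s, or 100 MiB/s per TiB for file systems over 1 TiB. The small sketch below recomputes those columns for an arbitrary size, which is a handy sanity check when estimating how long your own workload can burst; minor rounding differences from the table are expected.

  # Sketch: recompute the bursting table above from the quoted model.
  # Assumptions: baseline of 50 KiB/s per GiB stored (minimum 1 GiB) and a
  # burst rate of 100 MiB/s, or 100 MiB/s per TiB for file systems over 1 TiB.

  def bursting_profile(size_gib):
      size_gib = max(size_gib, 1)                       # 1 GiB minimum for the baseline
      baseline_mibps = size_gib * 50 / 1024             # 50 KiB/s per GiB, expressed in MiB/s
      burst_mibps = max(100.0, size_gib / 1024 * 100)   # 100 MiB/s, or 100 MiB/s per TiB
      burst_fraction = baseline_mibps / burst_mibps     # share of the day it can burst if idle otherwise
      burst_minutes_per_day = burst_fraction * 24 * 60
      return baseline_mibps, burst_mibps, burst_minutes_per_day, burst_fraction

  for size in (10, 256, 512, 1024, 1536, 2048, 3072, 4096):
      base, burst, minutes, frac = bursting_profile(size)
      print(f"{size:>5} GiB  baseline {base:6.1f} MiB/s  burst {burst:4.0f} MiB/s  "
            f"{minutes:5.1f} min/day  ({frac:.1%})")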

So during the go-live period, the application responded well until it had fully consumed the burst credit balance. Once the credits were exhausted, the file system fell back to its baseline rate, which for the roughly 5 GiB we were storing worked out to only about 250 KiB/s (at 50 KiB/s per GiB). The cause of the sluggishness was simply that there was not enough available throughput to respond in time.

Solution:

We changed the file system to Provisioned Throughput mode to get dedicated throughput rather than throughput tied to the size of the file system. We provisioned 5 MiB/s to start minimally, with the plan to optimize further once performance testing is complete.
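
The same change can be applied through the API; a minimal boto3 sketch is shown below, with a placeholder region and file system ID and the 5 MiB/s value we chose.

  # Sketch: switch an EFS file system to Provisioned Throughput mode at 5 MiB/s.
  # The region and file system ID are placeholders.
  import boto3

  efs = boto3.client("efs", region_name="us-east-1")

  efs.update_file_system(
      FileSystemId="fs-12345678",
      ThroughputMode="provisioned",
      ProvisionedThroughputInMibps=5.0,
  )

  # Note: EFS only allows switching throughput modes (or decreasing provisioned
  # throughput) once every 24 hours, so plan changes accordingly.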

 

Learnings:

  1. Monitor and set an alarm on the “BurstCreditBalance” metric whenever an EFS file system runs in Bursting Throughput mode, most importantly in a Production environment (a sample alarm definition follows this list).
  2. Use EFS Bursting Throughput mode only for Non-Prod environments.
  3. For optimization, monitor the “PercentIOLimit” CloudWatch metric for EFS on an ongoing basis.
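
As a starting point for the first learning, here is a rough boto3 sketch of such an alarm. The region, file system ID, SNS topic ARN, threshold, and evaluation settings are placeholders to be tuned for your workload; alerting well above zero leaves time to reduce load or switch to Provisioned Throughput before users notice.

  # Sketch: raise an alarm when BurstCreditBalance drops below a chosen threshold,
  # leaving time to react before the balance hits zero.
  # The region, file system ID, SNS topic ARN, and threshold are placeholders.
  import boto3

  cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

  cloudwatch.put_metric_alarm(
      AlarmName="efs-burst-credit-balance-low",
      AlarmDescription="EFS burst credits running low; investigate throughput usage.",
      Namespace="AWS/EFS",
      MetricName="BurstCreditBalance",
      Dimensions=[{"Name": "FileSystemId", "Value": "fs-12345678"}],
      Statistic="Minimum",
      Period=300,                         # evaluate 5-minute data points
      EvaluationPeriods=3,                # alarm after 15 minutes below the threshold
      Threshold=1 * 1024 ** 4,            # 1 TiB worth of credits, in bytes
      ComparisonOperator="LessThanThreshold",
      TreatMissingData="missing",
      AlarmActions=["arn:aws:sns:us-east-1:123456789012:efs-alerts"],
  )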