Architecting and implementing large scale web based systems
I specialize in building scalable and highly available software systems based on cloud computing.
Managing an EC2 production environment. System, DB and application server administration on Windows and Linux platforms.
Over 8 years of experience in designing and developing software.
Technical architect / Senior developer
SkillPages, April 2010 - Present
Pocket Kings Ltd, October 2008 - March 2010
Web Services and Database developer
Pocket Kings Ltd, August 2006 - September 2008
Eracent, June 2004 - June 2006
Our latest revamp of the SkillPages site includes many important backend changes. Among others, we reviewed caching. In our system
we employ multiple caching layers, including an object cache implemented on top of Memcached. In our latest release
we introduced an improvement to our caching layers: cache areas, i.e. logical domains that can be mapped onto actual servers, in our case multiple clusters of ElastiCache nodes.
We decided to use Amazon's product primarily because of the integrated CloudWatch metrics that it offers out of the box.
Our main goal is to improve user experience and reduce operational cost. Understanding cache access patterns allows us to tailor the number and size
of the servers to meet the exact requirements.
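To make the idea of cache areas more concrete, here is a minimal Python sketch of how a logical area could be mapped onto a cluster of nodes. It is an illustration only, not the SkillPages implementation: the class names, the area names, the endpoints and the simple modulo hashing are all assumptions (real Memcached clients typically use consistent hashing).

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CacheArea:
    """A logical cache domain backed by one cluster of nodes (hypothetical)."""
    name: str
    nodes: List[str]  # hypothetical "host:port" endpoints of one cluster

    def node_for(self, key: str) -> str:
        # Simple modulo hashing, chosen for brevity; not consistent hashing.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]


class CacheAreaRouter:
    """Routes (area, key) pairs to the node that should hold the object."""

    def __init__(self, areas: List[CacheArea], default: str) -> None:
        self._areas: Dict[str, CacheArea] = {a.name: a for a in areas}
        if default not in self._areas:
            raise ValueError("default area must be one of the configured areas")
        self._default = default

    def route(self, area_name: str, key: str) -> Tuple[str, str]:
        # Unknown areas fall back to the default area instead of failing.
        area = self._areas.get(area_name, self._areas[self._default])
        # Prefix the key with the area name so areas never collide on keys.
        return area.name, area.node_for(f"{area.name}:{key}")


router = CacheAreaRouter(
    areas=[
        CacheArea("profiles", ["profiles-1:11211", "profiles-2:11211"]),
        CacheArea("search", ["search-1:11211"]),
    ],
    default="profiles",
)
```

Because each area maps to its own cluster, per-cluster metrics then describe the access pattern of one logical domain, which is what makes sizing each cluster separately possible.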
An interesting idea is to adjust the size of the cache dynamically based on daily traffic patterns.
Data in cache is not accessed uniformly (the distribution typically obeys a power law), and this presents an opportunity to trim the cache size
in periods of low utilization while maintaining the overall performance requirements. Research by Intel and Carnegie Mellon University (see attached)
demonstrates that this is a feasible technique in cloud computing environments like EC2.
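A rough way to see why trimming can work is to model item popularity with a Zipf (power-law) distribution and ask what hit rate an idealized cache of a given size would achieve. The sketch below is an illustration under stated assumptions, not the method from the attached research: it assumes the cache always holds exactly the most popular items, and the exponent `s` and the function names are hypothetical.

```python
from typing import List


def zipf_weights(n_items: int, s: float = 1.0) -> List[float]:
    """Normalized access probabilities, most popular item first."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def hit_rate(n_items: int, cache_size: int, s: float = 1.0) -> float:
    """Hit rate of an idealized cache that holds the top `cache_size` items."""
    size = max(0, min(cache_size, n_items))
    return sum(zipf_weights(n_items, s)[:size])


def min_cache_size(n_items: int, target: float, s: float = 1.0) -> int:
    """Smallest idealized cache (in items) reaching the target hit rate."""
    cumulative = 0.0
    for size, p in enumerate(zipf_weights(n_items, s), start=1):
        cumulative += p
        if cumulative >= target:
            return size
    # Floating-point rounding can leave the sum just under 1.0.
    return n_items
```

Under this model the hit rate grows much more slowly than the cache size, so a large share of the memory serves a small share of the requests. A real cache with LRU eviction and shifting popularity will behave less favourably, which is why measured access patterns matter.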
Deprecating legacy component
In our constant battle to improve system speed, availability and cost, we are working on deprecating a large legacy component.
It is code surgery that requires effort in many areas. Big thanks to: John Hannon, Lorcan O'Neill, George Ciubotaru, Padraic Mulligan and Mike McCarthy.
Scheduling recurring tasks
Great article series explaining why one of the most popular ways to schedule tasks is incorrect:
Automating Cassandra cluster setup
Setting up a multi-node Cassandra cluster in the cloud is a straightforward but time-consuming, error-prone and mind-numbing activity. The key to success is obviously automation. By combining tools like cloud-init, upstart and our own scripting, we can now create such a cluster without needing to log in to any of the instances.
Initially we will use it to create on-demand test beds for performance testing different data models,
Cassandra settings or OS-level configuration. Once they are battle tested, these new AMIs will be promoted
to our production environment.
The secondary goal was to create as simple a pattern as possible, one that can be easily adapted for other Linux machine types (e.g. Solr clusters). We introduced two standard hooks, first-boot and every-boot, where custom bash scripts can be added (as per diagram attached).
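The first-boot / every-boot pattern can be sketched in a few lines. The following Python runner is only an illustration of the idea, not our actual scripting: the directory layout, the `.first-boot-done` marker file and the function names are assumptions.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def run_hooks(hook_dir: Path) -> List[str]:
    """Run every *.sh script in hook_dir in name order; return the names run."""
    ran: List[str] = []
    if not hook_dir.is_dir():
        return ran
    for script in sorted(hook_dir.glob("*.sh")):
        # check=True stops the boot sequence on the first failing script.
        subprocess.run(["bash", str(script)], check=True)
        ran.append(script.name)
    return ran


def boot(base_dir: Path) -> Dict[str, List[str]]:
    """Run first-boot hooks once per machine, every-boot hooks on each call."""
    marker = base_dir / ".first-boot-done"  # hypothetical marker file
    result: Dict[str, List[str]] = {"first-boot": [], "every-boot": []}
    if not marker.exists():
        result["first-boot"] = run_hooks(base_dir / "first-boot")
        # The marker is written only after all first-boot hooks succeeded,
        # so a failed first boot is retried on the next start.
        marker.touch()
    result["every-boot"] = run_hooks(base_dir / "every-boot")
    return result
```

A service started at boot (for example by an upstart job) could call `boot(Path("/etc/boot-hooks"))`, where the path is again a placeholder. Numeric prefixes such as `10-` and `20-` in the script names give a predictable execution order.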
"When Akamai goes down...
...it takes the Internet with it". It is used by the giants of the Internet and delivers 15-20% of the entire Internet traffic. Akamai's technology is certainly interesting to take a glance at. Given the scale of its operations the company is relatively little known, but there is some information on it online. I came across one very good paper, attached to this update.
There are a couple of things that I took out of it:
- It is a common theme for successful Internet companies to apply recovery-oriented computing (Netflix and their 'chaos monkey', killing servers and network connections at random, are a particularly good example). High availability is an obvious benefit, but there is another very important aspect: it takes human staff out of the critical path. That is a significant cost reduction for a company that operates many servers.
- Like many other big companies, Akamai centralizes log collection, but it also uses relational databases as the front end. That is in my opinion a perfect solution: SQL gives a nice and flexible interface, and analytics can be supported by MapReduce techniques where needed. Akamai probably built in-house infrastructure for this, but in the era of cloud computing it could possibly be quite cheap to build on existing cloud components. I am very keen to see if we could introduce this here at SkillPages.
- The existence of the 'middle mile'. You hear about the first and last mile, but it turns out that it's the middle mile (aka peering points) where many root Internet problems lie. Lack of financial incentive and political pressures are the primary reasons for bottlenecks in this area.
- The state of the Internet: video streaming is expected to constitute 90% of the overall traffic by 2014. President Obama's inauguration was viewed in 7 million *simultaneous* streams. This is a game changer. It is no secret that Internet protocols were not designed with the current use in mind, but at this scale they fail so spectacularly that complex and expensive overlay networks have to be engineered on top of TCP/IP to meet the new demand.
- Finally, a few gems hidden in the References section (e.g. "Determining the Return on Investment of Web Application Acceleration Managed Services") have been added to the reading queue.
Performance Counter vs. CloudWatch Metric
When the AWS team announced the ability to put custom metrics into their CloudWatch monitoring system, the question asked by many people was: how could Windows Performance Counters be integrated? Amazon promptly updated the code samples in their SDK to demonstrate how straightforward this is.
However, the one essential thing they have not mentioned is how the two worlds map onto each other in a real-world distributed system. Fortunately both are decently documented and it's easy to figure it out on your own.
They both turn out to be quite similar, with Amazon's concept of dimensions making theirs slightly more general. The attached diagrams show how one can be translated into the other. The idea of a Statistics Set is not depicted, as it does not exist in the Windows model.
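As a rough illustration of one possible translation (the attached diagrams remain the reference, and this sketch may differ from them), the four parts of a Windows counter path can be mapped onto a CloudWatch namespace, metric name and dimensions. The `Windows/` namespace prefix and the dimension names `MachineName` and `InstanceName` are assumptions made for this example. The resulting dictionary follows the general shape of a CloudWatch PutMetricData request.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class PerformanceCounter:
    """The parts of a Windows counter path: \\\\machine\\category(instance)\\counter."""
    machine: str
    category: str
    counter: str
    instance: Optional[str] = None  # single-instance counters have none


def to_cloudwatch_datum(pc: PerformanceCounter, value: float,
                        unit: str = "None") -> Dict[str, Any]:
    """Build a PutMetricData-shaped payload for one counter sample."""
    # Assumed mapping: machine and instance become dimensions.
    dimensions = [{"Name": "MachineName", "Value": pc.machine}]
    if pc.instance:
        dimensions.append({"Name": "InstanceName", "Value": pc.instance})
    return {
        # Assumed mapping: counter category -> namespace, counter -> metric.
        "Namespace": f"Windows/{pc.category}",
        "MetricData": [{
            "MetricName": pc.counter,
            "Dimensions": dimensions,
            "Value": float(value),
            "Unit": unit,
        }],
    }


cpu = PerformanceCounter("WEB01", "Processor", "% Processor Time", "_Total")
payload = to_cloudwatch_datum(cpu, 37.5, unit="Percent")
```

With boto3, a payload of this shape could be passed as keyword arguments to the CloudWatch client's `put_metric_data` call. Dimensions are what make the Amazon model more general: any extra name/value pair (environment, role, cluster) can be attached, whereas the Windows model fixes the hierarchy.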
Best temperature to work on code