This is a guest blog post. Views expressed in this post are original thoughts posted by Glen Kemp, Solutions Consultant at SecureData Europe. These views are his own and in no way do they represent the views of the company he works for.
Recently a customer (the Insolvency Service) faced a not uncommon problem: a large portion of critical switching infrastructure was out of support. The story is also quite familiar; the environment itself was marked for decommissioning several years ago, so it wasn’t transitioned into support by the main network provider. The same infrastructure, rather than fading out as expected, actually became more critical as additional Internet services came on stream. The upshot was that the customer had an array of Cisco switches consisting of an early generation “3750” core network, a dozen “2900” series and a “3500XL” acting as an access/distribution layer. The servers they supported were split across eight racks in two banks; some of these were pretty ancient, plus there were also two large VMware hosts and a handful of brand new servers.
The environment had some minor issues, which led to major outages.
The Insolvency Service asked if we could propose a refresh of the out-of-support equipment and bring the whole deployment under network management.
The original network
The "Easy" Option
The path of least resistance would have been to do “fork-lift” replacements with the next in line Cisco switches; but those that know me will attest that I’m an advocate of doing things “properly” and I believe a little effort can yield a lot of rewards. After auditing the switches, we discovered that whilst the total capacity was circa 400 ports (many of which were 10/100) the actual “lit” count was around 150 devices. This gave us an opportunity to consolidate a large number of legacy devices into a handful of current generation switches. Quoting on a like for like basis would not have addressed all of the issues being faced in terms of fault tolerance. Furthermore, whilst 802.1Q VLANs were in use in the environment, switch “A” was VLAN 1; switch “B2” was VLAN 2 etc., which led to some extremely creative cabling. Further inefficiencies were also uncovered in the VMware environment; many guest hosts were bound to a single physical interface rather than using VLAN trunks. This “burnt” switch ports, left many critical hosts without any failover and created several bottlenecks.
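As a rough illustration of the consolidation arithmetic, the sketch below works out how many fixed-configuration switches are needed to cover the lit ports. The figures of 400 total and 150 lit ports come from the audit above; the 48-port switch size and the 30% spare-port headroom are assumptions made purely for the example, not figures from the project.

```python
def switches_needed(lit_ports: int, ports_per_switch: int, headroom_pct: int = 0) -> int:
    """Smallest number of switches covering the lit ports plus spare headroom.

    Integer arithmetic is used throughout so the rounding is exact.
    """
    if lit_ports < 0 or ports_per_switch <= 0 or headroom_pct < 0:
        raise ValueError("port counts and headroom must be non-negative")
    required = lit_ports * (100 + headroom_pct)   # scaled by 100
    capacity = ports_per_switch * 100             # scaled by 100
    return -(-required // capacity)               # ceiling division


if __name__ == "__main__":
    total_ports = 400        # from the audit
    lit_ports = 150          # from the audit
    ports_per_switch = 48    # assumption: 48-port access switches
    headroom_pct = 30        # assumption: 30% spare ports for growth

    print("legacy utilisation:", lit_ports / total_ports)
    print("switches, no headroom:", switches_needed(lit_ports, ports_per_switch))
    print("switches, with headroom:",
          switches_needed(lit_ports, ports_per_switch, headroom_pct))
```

Changing the assumed port density or headroom shows how sensitive the switch count is to those choices.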
Doing it Better
The design I proposed to the Insolvency Service was essentially a distributed virtual chassis ring comprising five EX4200 switches. The principle was to replace the most at-risk access layer as transparently as possible, whilst we unlocked the secrets of the “legacy” core switch with its undocumented routing table.
These five switches are linked using pairs of “VC” (virtual chassis) cables, each providing 32Gbps of bandwidth over distances of up to 5 metres. Once connected, the switches act and behave as a single logical switch with redundant routing engines and distributed processing power. This “single IP” approach provided a major advantage over maintaining the status quo: a significantly reduced device management overhead. As a managed service provider we put a tangible, operational cost and SLA against the management of hardware; otherwise this cost is completely intangible and is the responsibility of the network manager. The upshot is that it either doesn’t get done at all, or is done to the bare minimum standard. As there is no significant operational difference between five physical switches in a virtual chassis and a single “traditional” fixed configuration chassis, we only charge for a single device, regardless of the number of “line cards” involved. Essentially, this small detail meant that the Insolvency Service only needed a single management contract to cover what would have previously been twelve distinct switches. This demonstrated a significant cost saving and essentially “paid” for the upgrade.
The proposed design
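To show why a ring is attractive for fault tolerance, here is a minimal sketch that models the five switches as nodes in a ring and checks whether the chassis stays in one piece after links fail. It is a simplified graph model, not a description of how the virtual chassis software behaves; the function names are hypothetical and each inter-switch connection is treated as a single logical link.

```python
from collections import deque
from itertools import combinations


def ring_links(members: int) -> list[tuple[int, int]]:
    """Links of a ring in which member i connects to member i + 1 (wrapping)."""
    return [(i, (i + 1) % members) for i in range(members)]


def is_connected(members: int, links: list[tuple[int, int]]) -> bool:
    """Breadth-first search from member 0; True if every member is reachable."""
    neighbours = {i: set() for i in range(members)}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen = {0}
    queue = deque([0])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return len(seen) == members


def survives(members: int, links: list[tuple[int, int]], failures: int) -> bool:
    """True if the topology stays connected for every combination of failed links."""
    return all(
        is_connected(members, [link for link in links if link not in failed])
        for failed in combinations(links, failures)
    )


if __name__ == "__main__":
    links = ring_links(5)  # five switches, as in the proposed design
    print("survives any single link failure:", survives(5, links, 1))
    print("survives any double link failure:", survives(5, links, 2))
```

In this model the ring tolerates the loss of any one link, whereas a simple daisy chain of the same switches would be split by a single break.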
One of the challenges faced was the distance between the two server racks; it was more than the maximum distance possible on standard VC cables. Fortunately it is possible to re-task SFP+ ports on the optional uplink cards as virtual chassis ports (VCPs) using the “request virtual-chassis vc-port” command. The “standard” design would have been to use short reach (SR) or ultra-short reach (USR) 10GbE optics to connect the switches using fibre cables. However, these are relatively expensive and we would have needed six of them to bridge the distances. After some research I realised that Juniper support 10GbE DAC (Direct Attached Copper) cables up to 7m and these are pretty inexpensive. Essentially they are 10GbE cables terminated with SFP+ connectors so they plug directly into the uplink cards on the EX4200 switches. On paper, this saved a significant amount of cost and complexity, but I couldn’t find anyone who had previously attempted this, despite talking to my local Systems Engineer and the wider Juniper forum. After talking through the risks versus reward with the Insolvency Service, my apparently untested design was accepted. This is where having a good relationship with your customer helps. The cost savings and the potential performance were deemed to be worth a small amount of risk. Should it not have been technically feasible to implement, the worst case scenario was that we would use “cheaper” 1GbE optics to extend the chassis or just split it in two, which wasn’t significantly different from what was already in place.
Juniper Networks List Dollar Price of Optics – June 2012
The use of DAC cables obviously doesn’t preclude the use of more common optics at a later stage. With conventional “dark” fibre the switches could be physically much further apart and yet still act as a single cluster. The other benefit of this design is that it provides spare 10GbE connectivity which will be used for connecting a new iSCSI SAN to the network.
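The link choices discussed above boil down to a simple distance check, sketched below. The 5 metre and 7 metre limits are the figures quoted in this post; the function name and the returned descriptions are hypothetical, and real cable runs would need slack beyond the straight-line distance.

```python
VC_CABLE_MAX_M = 5.0   # dedicated VC cable limit quoted in this post
DAC_MAX_M = 7.0        # 10GbE DAC limit quoted in this post


def pick_vc_link(distance_m: float) -> str:
    """Suggest a virtual chassis interconnect for a given cable run length."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    if distance_m <= VC_CABLE_MAX_M:
        return "dedicated VC cable"
    if distance_m <= DAC_MAX_M:
        return "10GbE DAC on an uplink port configured as a VCP"
    return "10GbE optics over fibre on an uplink port configured as a VCP"


if __name__ == "__main__":
    for run in (2.0, 6.5, 40.0):
        print(f"{run} m -> {pick_vc_link(run)}")
```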
The first step was to replace the access layer switches. Working with the Insolvency team, our professional services engineer installed the switches “top of rack” and chained them together in the virtual chassis. Initially this was clumsy as we essentially had to emulate the “one-switch, one VLAN” approach of the legacy network, with multiple links heading back to the old core. This was necessary as there simply wasn’t enough time to move everything all at once. However, as each link back to the legacy core was identified, it was replaced with a trunk providing link redundancy and capacity.
One snag we picked up quite early on was with the trunk link on the legacy Cisco end. When trying to configure a cross-stack port-channel I got the following message: “With PAgP enabled, all ports in the Channel should belong to the same switch, Command rejected”. After some Googling I came across an article which indicated this feature (standard for as long as I’ve been messing around with switches) required a firmware upgrade. This required us to take down the legacy core ahead of the other planned work for a relatively risky upgrade just for a “basic” feature. Fortunately our escalation team “entered the Matrix” for me and found the correct firmware, and it installed without a hitch. This led to the uncomfortable realisation that the failure of a core switch would have isolated 50% of the VLANs.
Once all the major VLANs were trunked into the new Juniper EX Virtual Chassis, we were able to start the process of migrating the routing from the legacy core. This was performed on a per-VLAN basis and took some time as we had to make sure we identified which route went where; this is not the kind of thing that can be performed “live”, even with the EX’s ability to roll back configurations easily.
After the install
We are now at the point where we only have a handful of devices still connecting to the old Cisco core. Now that everything is “under one roof”, policy and routing changes are significantly simplified. We have a policy of continuous improvement as we hunt down and transition the handful of legacy systems and networks. My intention is to put QoS in place in order to better manage traffic streams and make sure that the network backup events don’t flood specific links. This will be much easier to achieve as the policy only needs to be created once and we don’t need to involve separate management tools.
The customer is also happy with the finished result:
Vince Thompson, Network Architect at the Insolvency Service:
“We have used and trusted various Juniper technologies for a number of years so when SecureData proposed we consolidate our legacy switching into a Juniper Virtual Chassis design we could see the merits. Furthermore, the numbers made sense and we could see that reducing the number of physical devices reduced our operational overhead to the point that it would pay for a significantly upgraded and more efficient infrastructure.”
This consolidation project has now been running for the best part of a year and we are now within reach of the network nirvana we have sought from the beginning. Had time allowed, it could potentially have been achieved in a few weekends of very hard graft, but change windows are relatively difficult to come by and it’s taken a while to perform the required network archaeology on the legacy kit.
Since we started the project, Juniper has launched the EX3300 series switches which are also Virtual-Chassis capable. Whilst these would have been a bit more cost effective, I don’t feel too bad as the EX3300 VC can connect up to six switches, whereas the EX4200 can stretch to ten, providing plenty of room for expansion. Furthermore, should the Insolvency Service require additional 10GbE capacity, the big-brother EX4500 can be retrofitted into the virtual chassis.
I realise that there are several ways in which this could have been deployed, and I would be interested to hear your comments on the design and any way it can be improved.