
Expert Advice: Creating Repeatable Labs and Demos Using Automation

by Super Contributor on 02-03-2016 07:39 AM - edited on 08-23-2017 11:58 AM by Administrator

Using Ansible, PyEZ and Contrail to build repeatable labs and demos

 

One of the greatest challenges for technical organisations, both within Juniper and in our customers and partners, is creating repeatable labs and demonstrations for systems engineering, professional services and technical support teams.  These labs are often required to demonstrate new technologies, to gain experience with those technologies and to test various scenarios prior to deployment.  Juniper Networks SEs have to deal with a vast array of highly diverse technologies, so providing them with a platform on which to train, gain experience and demonstrate to customers is key.  For those responsible for providing these resources, one of the biggest challenges is ensuring that each user gets the same initial platform while giving them the freedom to make whatever configuration changes they want, without requiring anyone to "clean up after themselves".

 

As an example, recently I was asked to build such a lab for the Service Control Gateway (SCG) and Contrail.  The idea was to take this somewhat complex combination of products and associated tools and systems and produce a lab with which a basic example of most operator use-cases could be demonstrated.  This involved a number of elements, each of which needed to be brought back to a known good state before each new user could start using the lab.  The idea of requesting that each user undo their own changes was simply impractical, so the only real option was to rebuild the lab for each new user.  The number of elements in the lab and the complexity of the configuration meant that this would be very time consuming and highly vulnerable to user error if performed manually, thus reducing the availability and efficiency of the lab equipment.  Automation was the only way to achieve our goal.

 

One Step At A Time

The process of creating and automating the lab was quite a long one and proceeded through a number of independent steps.

 

A Working Baseline

The first task was to manually build the elements of the lab and ensure that each element worked correctly.  The elements we used for this example are listed below.  Your lab may be more or less complex but the principles remain unchanged.

 

  • An MX480 with MS-MPCs to act as the SCG
  • An MX104 to act as the Broadband Network Gateway (BNG)
  • A QFX3500 to act as the switch for Access and DC fabric functions
  • An SRX240 to act as the egress to the Internet
  • 4 Servers, each with 1 x 10GE and 1 x 1GE + IPMI to support virtualised functions

VMware ESXi was installed on one of the servers.  Several virtualised functions were built on top of VMware as follows:

 

  • Two Windows clients from which to test both PPPoE and DHCP subscriber connectivity
  • One FreeBSD server running a PCRF emulator
  • One Linux server running a RADIUS server
  • One Linux server running Contrail Server Manager
  • One Linux server as the primary platform for all automation (Contrail vnc_api, PyEZ, ansible and jnpr.junos modules installed)

These functions are all unchanging in this lab, so we simply created snapshots of each so that they could be respawned in the event of failure.  Normally, they would not be respawned between labs.  If that becomes necessary, we will investigate using pysphere with Ansible to kill and respawn the instances.

 

The other three servers were used to construct a relatively simple Contrail cluster with one controller and two compute nodes using the Server Manager function to build the setup.

 

The configuration of the BNG was such that we could support both PPPoE subscribers and DHCP subscribers in order to provide both IP style and IFL style subscribers to the SCG.  Normally, this configuration would never change as part of the lab/demo, so a baseline configuration was created and saved.

 

The configurations of the QFX and SRX are similarly fixed.  The physical layout of the lab and its Internet access will not change, therefore a baseline configuration was created and saved.

 

The SCG, on the other hand, was one of the two significantly variable elements.  As for the other Junos devices, a baseline configuration was created and saved.  This configuration supported the initial use-cases required by the proposed lab but, in each lab that you create, it is likely that the users will wish to modify those use cases or add further use cases to certain elements.  The configuration must be returned to the known good initial state at the start of each lab booking. 

 

In our lab, the SCG has the following functions:

 

  • Traffic Detection Function - This provides the capability to perform subscriber and application aware actions (steering, load-balancing, CGNAT, HTTP Header Enrichment) upon matching traffic.  The application awareness is based on DPI.  Subscriber awareness is based on interception of RADIUS Accounting messages and per-subscriber policies either defined statically or optionally with interaction with a PCRF over the 3GPP Gx reference point.
  • Traffic Load Balancing (TLB)
  • HTTP Header Enrichment (HTTP HE)

In order to demonstrate these functions, several virtualised functions were created in the Contrail cluster.

 

  • A ten-server Web Farm was created with which we could demonstrate TLB and HTTP HE.
  • A "blue" service chain using vSRX to perform URL filtering
  • A "red" service chain using vSRX to perform URL filtering with a different set of policies

In order to create the Web Farm, an Ubuntu cloud server version 14.04.4 was spawned using Openstack Horizon and modified in order to enable remote access via the console and SSH.  An apache2 web server and PHP were installed and a simple PHP home page installed, which simply printed back the HTTP Headers and the local server's real IP address.  These were important for demonstrating the functionality of TLB and HTTP HE.  Once it was created and running, we took a snapshot of the running web server in Openstack.  This could then be copied as an image onto the automation server so that it could be subsequently re-imported onto the freshly re-installed Contrail cluster and spawned multiple times to create a whole Web Farm.

 

The blue and red vSRX images were also created in a similar fashion but using the Contrail Service-Template and Service-Instance model.  Each was configured manually to perform a specific set of actions and then a snapshot taken of each of the active instances using Openstack Horizon.  These snapshots then each formed the basis of an image that could be imported during the automation.

 

Once the entire platform was up, clients could connect to the BNG and were recognised on the SCG, policy (local static policy or dynamic policy from the PCRF) was applied, and the TLB, HTTP HE and service chain selection were all shown to be working, it was possible to start automation.

 

Automating The Manual Process 

The framework within which all the automation was performed was Ansible.  Ansible provides mechanisms to configure Junos devices, perform operational activities (for example clearing subscribers or reloading a device) on Junos devices, perform various activities on different server OSes and even, had we wanted, perform activities in VMware.  Once the automation was in place, a user could rebuild the entire lab with a single command from the automation host.  Around 1 hour and 15 minutes later, the completely fresh lab would be ready for use.

 

The series of tasks to be performed were as follows.  It was assumed that the VMware server and the associated VMs were permanently active and never changed.  Similarly, the SRX as the device connecting to the public network was not configurable by the user:

 

  • Reconfigure the QFX switch with the base configuration
  • Reconfigure the BNG with the base configuration
  • Reconfigure the SCG with the base configuration - This required three independent steps
  1. Put the TDF gateway on the SCG into maintenance mode including clearing all active subscribers
  2. Load the baseline config, in which maintenance mode was off
  3. Reload the node to ensure all counters were cleared
  • Reimage the three servers used for Contrail with Ubuntu Linux
  • Reprovision the three servers as a Contrail Cluster
  • Build a ten-server Web Farm in the Contrail Cluster
  • Build the two vSRX based service chains
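
The ordered list above maps naturally onto a sequence of playbooks run one after another.  The following is a minimal sketch only, assuming each step lives in its own playbook file; the file names are hypothetical and are not the ones used in this lab.  It stops at the first failure so that later steps never run against a half-built lab.

```python
import subprocess

# Hypothetical playbook names, one per step in the list above.
PLAYBOOKS = [
    "reset_qfx.yml",
    "reset_bng.yml",
    "reset_scg.yml",
    "rebuild_contrail.yml",
    "build_web_farm_and_chains.yml",
]


def run_playbook(playbook):
    """Run one playbook with the ansible-playbook CLI; return its exit code."""
    return subprocess.call(["ansible-playbook", playbook])


def rebuild_lab(playbooks, runner=run_playbook):
    """Run playbooks in order; return the first one that fails, or None."""
    for playbook in playbooks:
        print("Running " + playbook)
        if runner(playbook) != 0:
            return playbook
    return None
```

Calling rebuild_lab(PLAYBOOKS) returns None when every step succeeded, or the name of the playbook that failed.  The runner argument exists so that the ordering logic can be exercised without touching any equipment.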

Taking each function in turn, the first two were basically identical.  Ansible provides the tools to perform a load override of a known configuration.

 

---
- hosts: qfx3500-8
  roles:
    - Juniper.junos
  connection: local
  gather_facts: no
  tasks:
    - name: Overwrite the default configuration file to qfx3500-8
      junos_install_config:
        user=username
        passwd=password123
        port=830
        host={{ inventory_hostname }}
        file=/root/scg-automation/qfx3500-default.conf
        overwrite=yes
        logfile=/root/scg-automation/qfx-junos-config.log

The item above shows the three dashes at the top of all YAML files followed by details to install a new configuration file (file=/root/scg-automation/qfx3500-default.conf) on the host qfx3500-8 (details of this host are included in /etc/ansible/hosts).  The line "overwrite=yes" causes a load override (versus a load merge).  In this way we can guarantee that the configuration is reverted to a known good configuration irrespective of what was done by the previous user.
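
For readers who prefer to drive this step from Python, the same load override can be sketched with PyEZ directly.  This is an illustration under assumptions rather than the method used in this lab: the function names restore_baseline and restore_device are made up for the example, and the host, credentials and path are whatever you pass in.

```python
def restore_baseline(config, path):
    """Replace the candidate configuration with the file at `path` and commit.

    `config` is expected to behave like a PyEZ
    jnpr.junos.utils.config.Config object (lock, load, commit,
    rollback, unlock).
    """
    config.lock()
    try:
        # overwrite=True requests "load override" rather than "load merge".
        config.load(path=path, overwrite=True)
        config.commit()
    except Exception:
        # Discard the uncommitted candidate before re-raising the error.
        config.rollback()
        raise
    finally:
        config.unlock()


def restore_device(host, user, passwd, path):
    """Open a NETCONF session with PyEZ and restore the baseline on one device."""
    # Imported here so restore_baseline() can be exercised without PyEZ installed.
    from jnpr.junos import Device
    from jnpr.junos.utils.config import Config

    with Device(host=host, user=user, passwd=passwd) as dev:
        restore_baseline(Config(dev), path)
```

Locking the configuration database first means a user who is still logged in cannot commit a competing change halfway through the restore.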

 

A similar block is used for the BNG to revert its config to a known good state.

 

For the SCG, things are slightly more complex.  Built-in rules make it impossible to apply a configuration that changes the behaviour of the TDF while there are active subscribers.  Therefore, it's necessary to first put the gateway into "maintenance mode" and then ensure that all subscribers are cleared.  This is performed with a small python script.

 

- hosts: scg
  roles:
    - Juniper.junos
  connection: local
  gather_facts: no
  tasks:
    - name: Set the SCG gateway into service-mode maintenance
      junos_install_config:
        user=myuser
        passwd=mypwd123
        port=830
        host={{ inventory_hostname }}
        file=/root/scg-automation/scg-activate-service-mode.conf
        overwrite=no
        logfile=/root/scg-automation/scg-junos-config.log

- hosts: automation
  gather_facts: no
  tasks:
    - name: Clear subscribers from SCG prior to loading new configuration
      shell: python clear_scg_subs.py
      args:
        chdir: /root/scg-automation

- hosts: scg
  roles:
    - Juniper.junos
  connection: local
  gather_facts: no
  tasks:
    - name: Overwrite the default configuration for the SCG
      junos_install_config:
        user=myuser
        passwd=mypwd123
        port=830
        host={{ inventory_hostname }}
        file=/root/scg-automation/scg-default.conf
        overwrite=yes
        logfile=/root/scg-automation/scg-junos-config.log

- hosts: automation
  tasks:
    - name: Revert IFL subscribers from SCG having restored configuration
      shell: python clear_scg_subs.py revert
      args:
        chdir: /root/scg-automation

- hosts: scg
  roles:
    - Juniper.junos
  connection: local
  gather_facts: no
  tasks:
    - name: Reboot SCG to clear all state
      junos_shutdown:
        user=myuser
        passwd=mypwd123
        port=830
        host={{ inventory_hostname }}
        shutdown="shutdown"
        reboot="yes"

 

The first element uses a configuration excerpt applied in merge mode to activate "maintenance mode".  The next element uses a Python script built on PyEZ to clear all subscribers from the node.  Next, we override the entire config and bring it back to the known good state.  We then bring all subscribers back online by running the same script with the "revert" keyword.  Finally, to ensure that all counters are cleared and that all state is reset, we reboot the chassis.

 

An example of the file /root/scg-automation/scg-activate-service-mode.conf is below.

 

unified-edge {
    gateways {
        tdf my_TDF {
            active: service-mode maintenance;
        }
    }
}
routing-instances {
    my_AccessVR_IFL {
        access {
            address-assignment {
                address-pools {
                    my_TDF_IFL_POOL1 {
                        active: service-mode maintenance;
                    }
                }
            }
        }
    }
    my_AccessVR_IP {
        access {
            address-assignment {
                address-pools {
                    my_TDF_IP_POOL1 {
                        active: service-mode maintenance;
                    }
                }
            }
        }
    }
}

 

The python script clear_scg_subs.py is below.

 

#!/usr/bin/env python

from jnpr.junos import Device
import time
import sys

host = {'host':'192.168.1.1',
        'user':'myuser',
        'passwd':'mypwd123'}

def clear_subs(gateway, revert):
    get_subscribers = scg.rpc.get_tdf_gateway_subscribers(gateway=gateway)
    if len(get_subscribers) != 0 and not revert:
        print('Clearing IP and IFL subscribers')
        clear_subscribers = scg.rpc.clear_mobile_tdf_subscribers(gateway=gateway)
        # Poll every two seconds until the gateway reports no subscribers
        while len(get_subscribers) != 0:
            time.sleep(2)
            get_subscribers = scg.rpc.get_tdf_gateway_subscribers(gateway=gateway)
    elif revert:
        print('Reverting IFL subscribers')
        clear_subscribers = scg.rpc.clear_mobile_tdf_subscribers(gateway=gateway, revert=True)

if __name__ == '__main__':
    revert = None
    # Skip sys.argv[0] (the script name); the only supported argument is "revert"
    for arg in sys.argv[1:]:
        if arg == 'revert':
            revert = arg
        else:
            print('Unrecognised command line argument: %s' % (arg))
    scg = Device(host=host['host'], user=host['user'], passwd=host['passwd'])
    scg.open()
    clear_subs(gateway='my_TDF', revert=revert)
    scg.close()

The next task is to build a complete, clean Contrail cluster consisting of one controller and two compute nodes.  The purpose of this lab was not to demonstrate any of the advanced features of Contrail but more to demonstrate its integration with the SCG, therefore a very simple Contrail cluster is ideal.  In order to verify load balancing is working across multiple compute nodes, however, it was necessary to have a minimum of two compute nodes, so an all-in-one Contrail installation would have been inadequate.

 

This task is initiated from the Server Manager node running as a VM inside ESXi.  This is triggered from the CLI of the SM host using Ansible.

 

- hosts: svr_mgr
  gather_facts: no
  tasks:
    - name: start reimage of whole cluster
      shell: server-manager reimage --cluster_id "SCG-cluster" --no_confirm "ubuntu-14_04_1-server-amd64_iso"
      args:
        chdir: /root

    - name: sleep 15m and assume all servers have been reimaged and rebooted
      shell: sleep 15m

    - name: waiting for servers to come back
      local_action: wait_for host=contrail3 state=started

    - name: install Contrail/Juno image
      shell: server-manager provision --cluster_id "SCG-cluster" --no_confirm "contrail_install_packages_2_21_102_ubuntu_14_04juno_all_deb"
      args:
        chdir: /root

    - name: sleep 45m and assume all servers have been provisioned and rebooted
      shell: sleep 45m

    - name: waiting for provisioning to be completed
      shell: python chk_prov.py
      args:
        chdir: /root

As you can see, one hour of the total one hour and fifteen minutes taken by this automation is spent waiting for the reimage of Ubuntu and then the reprovisioning of Contrail.  This is critical to ensure that the installation is completely clean.

 

The server-manager commands on the CLI of the host svr_mgr first perform a completely fresh installation of the correct version of Ubuntu Linux, then wait 15 minutes for that to complete.  Next, they install the Contrail image as defined for the cluster SCG-cluster and wait 45 minutes for that to complete.  This is a predefined cluster in svr_mgr, with each node's function defined.  The final Python script simply calls a status check for the cluster repeatedly; if it finds any incomplete installations, it sleeps and then checks again.
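
The script chk_prov.py is not reproduced here.  As a rough sketch of the polling pattern just described, and not the actual script, the helper below calls a check function at a fixed interval and gives up after a timeout so that a stuck installation cannot block the playbook forever.  The name wait_until and its parameters are assumptions made for this example; the check function would wrap whatever status query your Server Manager provides.

```python
import time


def wait_until(check, interval=30, timeout=3600,
               sleep=time.sleep, clock=time.time):
    """Call check() every `interval` seconds until it returns True.

    Returns True as soon as check() succeeds, or False once `timeout`
    seconds have passed without success.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A script built on this could exit with a non-zero status when wait_until() returns False, which would make the calling Ansible shell task fail instead of carrying on with an incomplete cluster.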

 

Only once the Contrail cluster is up and running do we perform the final actions: building the web farm and the two service chains.  Each is built using its own Python script.

 

- hosts: automation
  gather_facts: no
  tasks:
    - name: Run python script to build Web Farm
      shell: python autobuild_web_farm.py
      args:
        chdir: /root/scg-automation

    - name: Run python script to build vSRX Blue service chain
      shell: python autobuild_vsrx_blue.py
      args:
        chdir: /root/scg-automation

    - name: Run python script to build vSRX Red service chain
      shell: python autobuild_vsrx_red.py
      args:
        chdir: /root/scg-automation

 

An example of the python script to build the web farm is provided below.

 

#!/usr/bin/env python

from keystoneclient.v2_0 import client as ksclient
from keystoneclient.auth.identity import v2
from keystoneclient import session
from neutronclient.v2_0 import client as neclient
from glanceclient import Client as glclient
from novaclient.v1_1 import client as nvclient
from novaclient.v1_1 import servers as nvsrv
from vnc_api import vnc_api
from config_obj import *
import uuid
import time
import pdb
import re

''' Create Authentication Credentials Store '''
names = {'domain':'default-domain',
         'tenant': 'WebFarm',
         'ipam': 'WebFarmIPAM',
         'sg': 'WebFarmDefaultSG',
         'vn': 'WebFarmFrontEndVN',
         'image': 'web_template_ubuntu',
         'flavor': 'm1.small',
         'filename': '/var/images/web_template.qcow2',
         'subnet': '10.128.0.0/24'}

cred = {'username':'admin',
        'password':'abc123',
        'tenant_name':'admin',
        'auth_url':'http://192.168.100.201:5000/v2.0/',
        'api_host':'192.168.100.201',
        'auth_host':'192.168.100.201'}

wcred = {'username':'web-admin',
         'password':'abc123',
         'tenant_name':names['tenant'],
         'auth_url':'http://192.168.100.201:5000/v2.0/',
         'api_host':'192.168.100.201',
         'auth_host':'192.168.100.201'}

client_creds = {'auth_username':'web-admin',
                'auth_password':'abc123',
                'auth_tenant':names['tenant'],
                'api_server':'192.168.100.201',
                'auth_server':'192.168.100.201',
                'region':'RegionOne',
                'tenant':'WebFarm'}

''' Connect to Keystone using Credentials '''
keystone = ksclient.Client(**cred)

vnc=vnc_api.VncApi(username=cred['username'],
                   password=cred['password'],
                   tenant_name=cred['tenant_name'],
                   auth_host=cred['auth_host'],
                   api_server_host=cred['api_host'])

neutron = neclient.Client(username = cred['username'],
                          password = cred['password'],
                          auth_url = cred['auth_url'],
                          tenant_name = cred['tenant_name'])

def uploadimage(filename, imagename, tenant='admin', public=True, container_fmt='bare', disk_fmt='qcow2', tenant_auth=None):
    ''' Upload an image using qcow2/bare by default '''
    glance_epurl = keystone.service_catalog.url_for(service_type='image',
                                                    endpoint_type='publicURL')
    glance = glclient('1', endpoint=glance_epurl, token=tenant_auth)
    with open(filename) as fimage:
        myimage = glance.images.create(name=imagename,
                                       data=fimage,
                                       owner=tenant,
                                       is_public=public,
                                       container_format=container_fmt,
                                       disk_format=disk_fmt)
    return myimage

def uuidstr(uuid_str):
    ''' Return the UUID string with its dashes removed '''
    return uuid_str.replace('-', '')

def main():

    ''' Obtain admin details from keystone '''
    admin = keystone.users.find(name='admin')
    admin_uid = keystone.users.find(name='admin').to_dict()['id']
    admin_role = keystone.roles.find(name='admin').to_dict()['id']
    heat_stack_user_role = keystone.roles.find(name='heat_stack_user').to_dict()['id']
    ''' Create tenant using keystone and neutron hack '''
    print('Create tenant using keystone then use neutron "nudge" to get project into Contrail DB')
    web_project_class = keystone.tenants.create(tenant_name = names['tenant'], description = 'Web Cluster')
    web_project_id = web_project_class.to_dict()['id']
    web_admin = keystone.users.create(name='web-admin',
                                      password='abc123',
                                      email='root@localhost',
                                      tenant_id=web_project_id)
    admadm = keystone.roles.add_user_role(admin, admin_role, web_project_id)
    admhsu = keystone.roles.add_user_role(admin, heat_stack_user_role, web_project_id)
    webadm = keystone.roles.add_user_role(web_admin, admin_role, web_project_id)
    webhsu = keystone.roles.add_user_role(web_admin, heat_stack_user_role, web_project_id)
    wneutron = neclient.Client(username = wcred['username'],
                               password = wcred['password'],
                               auth_url = wcred['auth_url'],
                               tenant_name = wcred['tenant_name'])

    # Creating and deleting a throwaway network is the "nudge" that makes the
    # new project appear in the Contrail database
    dummynet_body = {'network': {'name': 'dummynet', 'tenant_id': web_project_id, 'admin_state_up': True}}
    dummynet = neutron.create_network(body=dummynet_body)
    dummynetdel = neutron.delete_network(dummynet['network']['id'])
    ''' Connect to the VNC API using admin's credentials '''
    vnc=vnc_api.VncApi(username=cred['username'],
                       password=cred['password'],
                       tenant_name=cred['tenant_name'],
                       auth_host=cred['auth_host'],
                       api_server_host=cred['api_host'])
    web_project_obj = vnc.project_read(fq_name=[names['domain'],names['tenant']])

    print('Connect to VNC API using config_obj')
    client = ConfigClient(**client_creds)
    print('Create IPAM')
    ipam = ConfigIpam(client=client)
    ipam.add(name=names['ipam'],dns_type='default')
    ipam_obj=ipam.obj_read_func(fq_name=['default-domain',names['tenant'],names['ipam']])
    print('Find default SecurityGroup and delete incorrect rules and replace them with correct rules')
    security_groups = wneutron.list_security_groups()['security_groups']
    for sec_grp in security_groups:
        if sec_grp['name'] == 'default' and sec_grp['tenant_id'] == uuidstr(web_project_obj.uuid):
            wfe_def_sg = wneutron.show_security_group(sec_grp['id'])['security_group']
            wfe_def_sg_rules = wfe_def_sg['security_group_rules']
            for wfe_def_sg_rule in wfe_def_sg_rules:
                wfe_def_sg_rule_id = wfe_def_sg_rule['id']
                if wfe_def_sg_rule['direction'] == 'ingress':
                    ethertype = wfe_def_sg_rule['ethertype']
                    if ethertype == 'IPv4':
                        rem_ip_pref = '0.0.0.0/0'
                    elif ethertype == 'IPv6':
                        rem_ip_pref = '::0/0'
                    else:
                        print('Error, unrecognised ethertype')
                        # Leave rules of an unknown type untouched
                        continue
                    wneutron.delete_security_group_rule(wfe_def_sg_rule_id)
                    wfe_def_sg_rule_in_v4 = wneutron.create_security_group_rule({'security_group_rule':
                                                {'direction':'ingress',
                                                 'security_group_id':wfe_def_sg['id'],
                                                 'remote_ip_prefix':rem_ip_pref,
                                                 'remote_group_id':None,
                                                 'protocol':None,
                                                 'ethertype':ethertype}})
    print('Create VirtualNetwork')
    del client
    client = ConfigClient(**client_creds)
    network = ConfigNetwork(client=client)
    network.add(name=names['vn'],ipam=names['ipam'],subnet=names['subnet'],route_target='64512:50000')
    vnobj = network.obj_get(name=names['vn'])
    vn_id = vnobj.uuid

    ''' Obtain a token for admin within the new tenant '''
    print('Obtaining a token for admin (used by glance)')
    token = keystone.auth_token

    ''' Import a new image for the web servers '''
    print('Importing Web Server Template image')
    admin_project_id = keystone.tenants.find(name='admin').id
    foo = uploadimage(filename=names['filename'],
                      imagename=names['image'],
                      tenant=admin_project_id,
                      tenant_auth=token,
                      container_fmt='bare',
                      disk_fmt='qcow2',
                      public=True)

    ''' Create ten web servers using the image just imported '''
    print('Creating ten instances of the required image with the required flavor')
    nova = nvclient.Client(username=wcred['username'],
                           api_key=wcred['password'],
                           auth_url=wcred['auth_url'],
                           project_id=wcred['tenant_name'])

    flavor = nova.flavors.find(name=names['flavor'])
    image = nova.images.find(name=names['image'])

    for server_count in range(1,11):
        files = {}
        server_name = 'web_server_' + str(server_count)
        server_ip = '10.128.0.' + str(253 - server_count)
        nics = [{'net-id':vn_id, 'v4-fixed-ip':server_ip}]
        # Build /etc/hosts in memory: the template plus an entry for this
        # server placed straight after the localhost line
        expression = re.compile(r'127\.0\.0\.1\s+localhost')
        hostlines = []
        with open('/root/scg-automation/hostfile.template','r') as hostfiletemplate:
            for line in hostfiletemplate:
                hostlines.append(line)
                if expression.match(line):
                    hostlines.append(server_ip + ' ' + server_name + '\n')
        files['/etc/hosts'] = ''.join(hostlines)
        # Build apache2.conf in memory: the template plus a ServerName directive
        with open('/root/scg-automation/apache2conf.template', 'r') as apache2conftemplate:
            apache2conf = apache2conftemplate.read()
        files['/etc/apache2/apache2.conf'] = apache2conf + 'ServerName ' + server_ip + ':80\n'
        print('Booting server ' + server_name)
        vm = nova.servers.create(name=server_name,
                                 image=image,
                                 flavor=flavor,
                                 files=files,
                                 nics=nics)

if __name__ == '__main__':
    main()
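
The per-server naming and addressing in the loop above can also be pulled out into small functions that are easy to check before anything is booted.  The following is a sketch under assumptions: the helper names web_server_plan and render_hosts are made up for this example, and the defaults simply mirror the values used in the script (subnet 10.128.0.0/24, ten servers, addresses counting down from .252).

```python
import ipaddress


def web_server_plan(subnet="10.128.0.0/24", count=10, top_host=253):
    """Return (name, ip) pairs, counting down from just below `top_host`."""
    network = ipaddress.ip_network(subnet)
    plan = []
    for n in range(1, count + 1):
        ip = network.network_address + (top_host - n)
        if ip not in network:
            raise ValueError("address %s falls outside %s" % (ip, subnet))
        plan.append(("web_server_%d" % n, str(ip)))
    return plan


def render_hosts(template_lines, name, ip):
    """Add an '<ip> <name>' entry straight after the localhost line."""
    out = []
    for line in template_lines:
        out.append(line)
        if line.split()[:2] == ["127.0.0.1", "localhost"]:
            out.append("%s %s\n" % (ip, name))
    return "".join(out)
```

With the defaults, the first entry of web_server_plan() is web_server_1 at 10.128.0.252 and the last is web_server_10 at 10.128.0.243, matching the addresses the script assigns.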

 

Once the entire build is complete, the user can use the platform with the supplied instructions to do self-training or present a set of pre-canned demonstrations to his customer.  Equally, he can modify any element of the setup to add use-cases of his own.  Once he is done, if he wishes to retain any changes to his configurations, he can download the relevant files before the whole platform is returned to its original state.  Should he wish to rerun the tests later, he can reserve the lab again and restore his own configuration changes to a newly created fresh lab.

 

Summary

This is a long blog with quite a lot of details and code snippets.  The real message that I am trying to get across is that automation can provide a really effective way to build repeatable lab environments with minimal manual intervention required for each cycle of the lab.  There are many tools with which to automate your workflows.  The key is to pick those that best address each element of your problem and then bind them together.  I have chosen to use a variety of tools including some Python scripting, some unix shell commands and some Ansible to glue it all together.  Others may find using the REST API or HEAT templates better for building the service chains or use the full PyEZ capabilities or even SLAX commit scripts to bring the configs of the Junos devices back to a base level.  No one method is right but automation as a general principle has huge benefits to offer in this area.

Comments
02.08.16
Distinguished Expert

GREAT article Guy, thanks for sharing!

02.12.16
Super Contributor

Thanks dfex.  I hope that others will also find it useful.