
Monday, November 24, 2014

vMotion over a VXLAN overlay.

In an old post I showed how to create a simple VXLAN tunnel to stretch Layer 2 over a Layer 3 network. Now I'll show how VMware vMotion works over this same type of setup. Note: the material I researched warned that VXLAN is not positioned as a vMotion transport; latency over a DCI can be high, so vMotion may not always work. This is just a proof of concept.


From what I've read, vMotion over VXLAN on the hypervisor is not supported in ESXi 5.1 or earlier, and I'm not sure about ESXi 5.5. In this scenario, however, the ESXi version doesn't matter, because the VXLAN tunnel starts and stops at the spine layer. The hypervisor is not using VXLAN at all; it communicates with the underlay through a normal VLAN tag, and even the leaf switch communicates with the spine over a VLAN tag.

In this setup I have a NetApp SAN that serves as the NFS datastore for the VMs. The SAN network uses VLAN 10, while the VMs' data path uses VLAN 100.

On VMware I created an NFS datastore. The IP address of the NetApp SAN is 100.10.1.254.


Each host server is separated by an L3 network, as in the above topology.

Host 1 has a VMkernel IP of 100.10.1.4 and Host 2 has an IP of 100.10.1.5 for communicating with the NetApp.

 

vMotion is enabled on both dvSwitches.


I created a distributed switch for the NFS network and added a 1G NIC as an uplink. This was done on both hosts.
 
A second dvSwitch was created for the VMs' data path, using a 10G NIC for the uplink.

On the NetApp:


Vol1 was created with read/write access for the VMware hosts.

The NetApp's network interface is configured with the IP 100.10.1.254 and uses a VLAN tag of 10.


On the Juniper leaf switch:

VLANs 10 and 100 are created:

jnpr@QFX5100-48S-5# show vlans
v10 {
    vlan-id 10;
}
v100 {
    vlan-id 100;
}
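
In set format this is simply:

set vlans v10 vlan-id 10
set vlans v100 vlan-id 100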

Interfaces are created and the VLANs added:

set interfaces xe-0/0/2 description TO-VMW-145-vmnic9
set interfaces xe-0/0/2 unit 0 family ethernet-switching interface-mode trunk
set interfaces xe-0/0/2 unit 0 family ethernet-switching vlan members v100
set interfaces ge-0/0/47 description TO-VMW-145-vmnic3-for-NETAPP
set interfaces ge-0/0/47 unit 0 family ethernet-switching interface-mode trunk
set interfaces ge-0/0/47 unit 0 family ethernet-switching vlan members v10
set interfaces et-0/0/50 description TO-SPINE1
set interfaces et-0/0/50 unit 0 family ethernet-switching interface-mode trunk
set interfaces et-0/0/50 unit 0 family ethernet-switching vlan members all

The 1G link (ge-0/0/47) is used for NFS and the 10G link (xe-0/0/2) is used for data between the leaf and Host 1. The uplink to Spine1 (et-0/0/50) is a trunk with vlan members all.

On the Juniper Spine1:

Again the two VLANs are created, but this is where the VLAN-to-VXLAN tunnel mapping is configured.


jnpr@EX9200-1# show vlans
v10 {
    vlan-id 10;
    l3-interface irb.10;
    vxlan {
        vni 10;
        multicast-group 239.1.1.10;
        encapsulate-inner-vlan;
        decapsulate-accept-inner-vlan;
    }
}
v100 {
    vlan-id 100;
    l3-interface irb.0;
    vxlan {
        vni 100;
        multicast-group 239.1.1.100;
        encapsulate-inner-vlan;
        decapsulate-accept-inner-vlan;
    }
}
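
One piece not shown in the output above is the VTEP source address the spine uses to build the VXLAN tunnels. A minimal sketch, assuming lo0.0 carries the switch's loopback address:

set switch-options vtep-source-interface lo0.0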

The core-facing interface towards Core1 is L3:
set interfaces et-2/0/0 description TO-CORE1
set interfaces et-2/0/0 unit 0 family inet address 192.168.24.4/24
set interfaces et-2/0/0 unit 0 family iso

The link towards Leaf1 is a trunk carrying the two VLANs:
set interfaces et-2/2/1 description TO-QFX5100-48S-5
set interfaces et-2/2/1 unit 0 family ethernet-switching interface-mode trunk
set interfaces et-2/2/1 unit 0 family ethernet-switching vlan members v100
set interfaces et-2/2/1 unit 0 family ethernet-switching vlan members v10

The NetApp is connected only to Spine1, and traffic from Leaf1 is switched to it there.

set interfaces ge-7/0/0 description TO-NETAPP
set interfaces ge-7/0/0 unit 0 family ethernet-switching interface-mode trunk
set interfaces ge-7/0/0 unit 0 family ethernet-switching vlan members v10

IRBs are created for the two VLANs. VLAN 100's IRB is the default gateway for the VMs' data path. The IRB for VLAN 10 is there just so we can ping the NetApp and the VMware VMkernel IPs.

set interfaces irb unit 0 family inet address 100.1.1.1/24
set interfaces irb unit 10 family inet address 100.10.1.200/24

PIM and an IGP (IS-IS) are configured for underlay connectivity:

set protocols isis interface all
set protocols isis interface fxp0.0 disable
set protocols isis interface lo0.0 passive
set protocols pim rp static address 192.168.0.1
set protocols pim interface lo0.0 mode bidirectional-sparse
set protocols pim interface et-2/0/0.0 mode bidirectional-sparse
set protocols lldp interface all

On the remote Spine2 switch the configuration is almost the same. The only difference is that there is no direct NetApp connection, so NFS traffic needs to be tunneled through VXLAN so that the VMkernel on Host 2 can reach the storage.

jnpr@EX9200-2# show vlans       
v10 {
    vlan-id 10;
    l3-interface irb.10;
    vxlan {
        vni 10;
        multicast-group 239.1.1.10;
        encapsulate-inner-vlan;
        decapsulate-accept-inner-vlan;
    }
}
v100 {
    vlan-id 100;
    l3-interface irb.0;
    vxlan {
        vni 100;
        multicast-group 239.1.1.100;
        encapsulate-inner-vlan;
        decapsulate-accept-inner-vlan;
    }
}


jnpr@EX9200-2# show protocols | display set
set protocols isis reference-bandwidth 40g
set protocols isis interface et-2/0/0.0
set protocols isis interface all
set protocols isis interface fxp0.0 disable
set protocols isis interface lo0.0 passive
set protocols pim rp static address 192.168.0.1
set protocols pim interface lo0.0 mode bidirectional-sparse
set protocols pim interface et-2/0/0.0 mode bidirectional-sparse
set protocols lldp interface all

LLDP is enabled on the Juniper switches and the dvSwitches so we can monitor the underlay.

On Leaf1 we can see both dvSwitches:

jnpr@QFX5100-48S-5# run show lldp neighbors
Local Interface    Parent Interface    Chassis Id          Port info          System Name
xe-0/0/2           -                   00:05:33:48:70:b9   port1
xe-0/0/2           -                   00:50:56:b9:0b:b3   eth1               South-VM           
et-0/0/50          -                   4c:96:14:6b:bb:c0   TO-QFX5100-48S-5   EX9200-1           
ge-0/0/47          -                                       port 2 on dvSwitch dvSwitch-NFS-v10 (etherswitch) localhost.jnpr.net 
xe-0/0/2           -                                       port 2 on dvSwitch dvSwitch-v100 (etherswitch) localhost.jnpr.net

We can also see a VM running lldpd (South-VM) and the uplink to Spine1.


jnpr@QFX5100-48S-5# run show ethernet-switching table

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC)


Ethernet switching table : 5 entries, 5 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical
    name                address             flags              interface
    v10                 00:50:56:6f:2c:31   D             -   ge-0/0/47.0         
    v10                 02:a0:98:2c:06:cc   D             -   et-0/0/50.0  

The things to note here are the two MACs: 02:a0:98:2c:06:cc belongs to the NetApp and is learned from the uplink to the spine, while 00:50:56:6f:2c:31 is the VMkernel NIC of Host 1.

On Spine 1 

jnpr@EX9200-1# run show ethernet-switching table

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static, C - Control MAC
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC)


Ethernet switching table : 5 entries, 5 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical                NH        RTR
    name                address             flags              interface              Index     ID
    v10                 00:50:56:69:37:db   D             -   vtep.32769          
    v10                 00:50:56:6f:2c:31   D             -   et-2/2/1.0          
    v10                 02:a0:98:2c:06:cc   D             -   ge-7/0/0.0          
    v10                 4c:96:14:f2:b6:e0   D             -   vtep.32769          
    v10                 a8:d0:e5:f7:bf:f0   D             -   vtep.32769

We can see that MACs are being learned over the VTEP, i.e. the VXLAN tunnel.

MAC 00:50:56:69:37:db is the NIC of Host 2.

On Spine 2 we can see that the NetApp's MAC is learned through the VXLAN tunnel.

# run show ethernet-switching table

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static, C - Control MAC
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC)


Ethernet switching table : 3 entries, 3 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical                NH        RTR
    name                address             flags              interface              Index     ID
    v10                 00:50:56:69:37:db   D             -   et-2/2/1.0          
    v10                 02:a0:98:2c:06:cc   D             -   vtep.32769           <<< NETAPP
    v10                 4c:96:14:6b:bb:f0   D             -   vtep.32769 



Leaf 2

jnpr@QFX5100-48S-6> show ethernet-switching table

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC)


Ethernet switching table : 4 entries, 4 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical
    name                address             flags              interface
    v10                 00:50:56:69:37:db   D             -   ge-0/0/47.0        
    v10                 02:a0:98:2c:06:cc   D             -   et-0/0/50.0        
    v10                 4c:96:14:6b:bb:f0   D             -   et-0/0/50.0        
    v10                 a8:d0:e5:f7:bf:f0   D             -   et-0/0/50.0

Ethernet switching table : 2 entries, 2 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical
    name                address             flags              interface
    v100                00:50:56:b9:55:58   D             -   xe-0/0/2.0          
    v100                4c:96:14:6b:bb:f0   D             -   et-0/0/50.0    

Now we are ready to do vMotion.

From the Ubuntu VM you can see the MAC address of eth1, which maps to VLAN 100 on Leaf2:


jnpr@vmotion-ubuntu:~$ ifconfig

eth1      Link encap:Ethernet  HWaddr 00:50:56:b9:55:58 
          inet addr:100.1.1.30  Bcast:100.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::250:56ff:feb9:5558/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8633 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7023 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1200614 (1.2 MB)  TX bytes:905670 (905.6 KB)

On Spine2 it is a locally learned MAC.
jnpr@EX9200-2# run show ethernet-switching table vlan-id 100
Ethernet switching table : 2 entries, 2 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical                NH        RTR
    name                address             flags              interface              Index     ID
    v100                00:50:56:b9:55:58   D             -   et-2/2/1.0          



In VMware I choose the vmotion-ubuntu VM.




Then I select Migrate, choose Change Host, and select Host 1 as the destination. Select the priority of your choice and click Finish.



Migration completed!

On Leaf1 I can see the MAC move.
jnpr@QFX5100-48S-5# run show ethernet-switching table vlan-id 100

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC)


Ethernet switching table : 2 entries, 2 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical
    name                address             flags              interface
    v100                00:50:56:b9:55:58   D             -   xe-0/0/2.0           
    v100                4c:96:14:6b:bb:f0   D             -   et-0/0/50.0

On Spine 1 it is local.

jnpr@EX9200-1# run show ethernet-switching table vlan-id 100

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static, C - Control MAC
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC)


Ethernet switching table : 1 entries, 1 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical                NH        RTR
    name                address             flags              interface              Index     ID
    v100                00:50:56:b9:55:58   D             -   et-2/2/1.0          

And on Spine 2 it now sits across the VTEP.


jnpr@EX9200-2# run show ethernet-switching table vlan-id 100

MAC flags (S - static MAC, D - dynamic MAC, L - locally learned, P - Persistent static, C - Control MAC
           SE - statistics enabled, NM - non configured MAC, R - remote PE MAC)


Ethernet switching table : 2 entries, 2 learned
Routing instance : default-switch
    Vlan                MAC                 MAC         Age    Logical                NH        RTR
    name                address             flags              interface              Index     ID
    v100                00:50:56:b9:55:58   D             -   vtep.32769          
    v100                4c:96:14:6b:bb:f0   D             -   vtep.32769          

One caveat I found: vMotion sends large frames, and the IP multicast + VXLAN encapsulation adds overhead. With the default configuration I could not initiate vMotion at all. I had to raise the MTU on the spine interfaces to 9000, and then it worked.
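A minimal sketch of that change, using the Spine1 interface names from above (the same would apply on Spine2 and on any transit links that carry the encapsulated traffic):

set interfaces et-2/0/0 mtu 9000
set interfaces et-2/2/1 mtu 9000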

So the conclusion is: you don't need VXLAN support on the hypervisor to do vMotion over a Layer 2 stretch. You can run VXLAN tunnels between two Juniper EX9200 spine switches and map VLANs to those tunnels. The overlay is unaware of what is happening in the underlay.


Tuesday, March 4, 2014

Using Ethernet Virtual Private Network (EVPN) for Data Center Interconnect (DCI)

As enterprises build data centers at different locations for disaster recovery and traffic distribution, there is a need to interconnect them transparently. Stretching Layer 2 across a WAN poses some challenges:

1) Workload mobility, i.e. VM migration from one DC to another.

2) Fast convergence in a multihomed environment.

3) Load balancing across multiple active paths between data centers.

The Trombone effect when migrating VMs across a WAN.



When VM1 is moved from a hypervisor in DC1 to a hypervisor in DC2, the default gateway for VM1 still resides in DC1. When VM1 sends traffic to VM2, the traffic traverses the core before tromboning back to DC2.

EVPN solves this. EVPN is a technology similar to VPLS, except that MAC addresses are learned and exchanged through the control plane, using BGP as the transport protocol. A new BGP address family called EVPN is introduced:

bgp {
    group IBGP {
        local-address 1.1.1.1;        
        family evpn {
            signaling;  
        }
        neighbor 2.2.2.2;
    }
}

First, an understanding of how EVPN works.

In a multi-tenant environment, each tenant corresponds to an EVPN instance (EVI). Route Distinguishers are used to keep each EVI's routes distinct, and Route Targets control how the learned MAC addresses are shared between the PEs of each EVI.
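As a rough illustration (not from the original post; the instance name, interface, and values are hypothetical), an EVI with its Route Distinguisher and Route Target could look like this in Junos:

set routing-instances EVI-100 instance-type evpn
set routing-instances EVI-100 vlan-id 100
set routing-instances EVI-100 interface ge-1/0/0.100
set routing-instances EVI-100 route-distinguisher 1.1.1.1:100
set routing-instances EVI-100 vrf-target target:65000:100
set routing-instances EVI-100 protocols evpn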

For MAC learning, each PE router snoops DHCP and/or ARP (IPv4) / ND (IPv6) packets for a particular EVI. The PE can then advertise the locally learned MAC addresses to remote PEs through MP-iBGP. MAC addresses can be aggregated so that a MAC prefix is advertised rather than every single MAC address, which helps scale to thousands of MAC addresses. When a remote PE receives such a BGP update, it extracts the MAC address and builds a table with the next hop pointing to the LSP of the advertising PE. Because this is BGP, policies can be created to filter and manipulate forwarding decisions.

When a local PE router sees an ARP request for an IP address, and it already has the MAC binding for that IP address learned from across the WAN, the PE performs proxy ARP, responds to the ARP request, and makes the forwarding decision locally. This reduces BUM (Broadcast, Unknown unicast, and Multicast) flooding across WAN links.

Gateway IP and MAC address synchronization in EVPN allows a host to use the nearest gateway to route traffic. You do this by creating IRBs on both PEs with different gateway IP addresses; the IRBs' IP and MAC addresses are advertised using a BGP extended community. When VM1 migrates to DC2, it still sends packets to the MAC address associated with the gateway IP of DC1. The IRB in DC2 recognizes that the destination MAC belongs to the gateway across the WAN, so it routes the traffic locally. When VM1's ARP entry for the gateway expires, the VM ARPs again and the IRB in DC2 replies with its own MAC address.
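As a sketch of the per-DC gateway addressing described above (the addresses are hypothetical, and each irb unit would also be tied into the tenant's EVI):

On the DC1 PE:
set interfaces irb unit 100 family inet address 10.1.100.1/24

On the DC2 PE:
set interfaces irb unit 100 family inet address 10.1.100.2/24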

Another thing that happens when a VM migrates in an EVPN network is that the VM's MAC address is now advertised from DC2; the PE in DC2 updates its MAC table while the PE in DC1 withdraws its entry.

To address fast convergence in a multihomed environment, a concept called an Ethernet Segment is introduced. The set of links connecting a site to two or more PE routers is called an Ethernet Segment, and each segment has a unique identifier called an ESI. An Ethernet tag is also used to identify each broadcast domain, such as a VLAN. When an Ethernet Segment fails, the local PE withdraws the corresponding Ethernet Segment route from BGP, which triggers all remote PE routers to update their forwarding tables so that the corresponding next hop points to the backup PE.
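As an illustration (not from the original post; the interface and ESI value are hypothetical), the same ESI is configured on the multihomed interface of each PE it attaches to, for example:

set interfaces ae0 esi 00:11:22:33:44:55:66:77:88:99
set interfaces ae0 esi all-active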

EVPN also introduces split horizon. BUM traffic (Broadcast, Unknown unicast, and Multicast) is encapsulated in an MPLS packet that carries the Ethernet Segment Identifier. This allows the egress PE to make a correct forwarding decision and prevents loops, because the PEs know where the packet originated. This in turn makes it possible to forward traffic over multiple active links across the WAN and enables load balancing.

With these advantages, EVPN is a viable choice for interconnecting data centers.