Tuesday, September 9, 2014

Zen Load Balancer 3.0.3 Performance and Security Customization Part 5

Now, let's move on to NIC bonding. This is useful when one of our NICs goes dead; we obviously want to make sure that, if that happens, another NIC is standing by to take over.

Many admins have a dedicated VLAN for cluster synchronization purposes. Others just connect the two nodes with a crossover cable. Either way, if a NIC goes down, all hell breaks loose: if it is the cluster synchronization NIC, both nodes think the other node has gone down and both try to become masters, causing havoc on the network; if it is any other NIC, your frontends and backends appear to be down simply because the NIC is dead.

So in that case, we employ NIC bonding. There are actually several bonding modes (the descriptions below come from the Linux kernel bonding documentation): 
  • balance-rr or 0: Round-robin policy: Transmit packets in sequential order from the first available slave through the last.  This mode provides load balancing and fault tolerance.
  • active-backup or 1: Active-backup policy: Only one slave in the bond is active.  A different slave becomes active if, and only if, the active slave fails. The bond's MAC address is externally visible on only one port (network adapter) to avoid confusing the switch.
    In bonding version 2.6.2 or later, when a failover occurs in active-backup mode, bonding will issue one or more gratuitous ARPs on the newly active slave. One gratuitous ARP is issued for the bonding master interface and each VLAN interface configured above it, provided that the interface has at least one IP address configured.  Gratuitous ARPs issued for VLAN interfaces are tagged with the appropriate VLAN id. This mode provides fault tolerance.
  • balance-xor or 2: XOR policy: Transmit based on the selected transmit hash policy.  The default policy is a simple [(source MAC address XOR'd with destination MAC address) modulo slave count].  Alternate transmit policies may be selected via the xmit_hash_policy option. This mode provides load balancing and fault tolerance.
  • broadcast or 3: Broadcast policy: transmits everything on all slave interfaces.  This mode provides fault tolerance.
  • 802.3ad or 4: IEEE 802.3ad Dynamic link aggregation.  Creates aggregation groups that share the same speed and duplex settings.  Utilizes all slaves in the active aggregator according to the 802.3ad specification. Slave selection for outgoing traffic is done according to the transmit hash policy, which may be changed from the default simple XOR policy via the xmit_hash_policy option. Note that not all transmit policies may be 802.3ad compliant, particularly in regards to the packet mis-ordering requirements of section 43.2.4 of the 802.3ad standard.  Differing peer implementations will have varying tolerances for noncompliance. Prerequisites:
    1. Ethtool support in the base drivers for retrieving the speed and duplex of each slave.
    2. A switch that supports IEEE 802.3ad Dynamic link aggregation.
    3. Most switches will require some type of configuration to enable 802.3ad mode.
  • balance-tlb or 5: Adaptive transmit load balancing: channel bonding that does not require any special switch support.  The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave.  Incoming traffic is received by the current slave.  If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave. Prerequisites:
    1. Ethtool support in the base drivers for retrieving the speed and duplex of each slave.
  • balance-alb or 6: Adaptive load balancing: includes balance-tlb plus receive load balancing (rlb) for IPV4 traffic, and does not require any special switch support.  The receive load balancing is achieved by ARP negotiation. The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond such that different peers use different hardware addresses for the server. Receive traffic from connections created by the server is also balanced. When the local system sends an ARP Request the bonding driver copies and saves the peer's IP information from the ARP packet.  When the ARP Reply arrives from the peer, its hardware address is retrieved and the bonding driver initiates an ARP reply to this peer assigning it to one of the slaves in the bond. A problematic outcome of using ARP negotiation for balancing is that each time that an ARP request is broadcast it uses the hardware address of the bond.  Hence, peers learn the hardware address of the bond and the balancing of receive traffic collapses to the current slave.  This is handled by sending updates (ARP Replies) to all the peers with their individually assigned hardware address such that the traffic is redistributed.  Receive traffic is also redistributed when a new slave is added to the bond and when an inactive slave is re-activated.  The receive load is distributed sequentially (round robin) among the group of highest speed slaves in the bond. When a link is reconnected or a new slave joins the bond the receive traffic is redistributed among all active slaves in the bond by initiating ARP Replies with the selected mac address to each of the clients. The updelay parameter (detailed below) must be set to a value equal or greater than the switch's forwarding delay so that the ARP Replies sent to the peers will not be blocked by the switch. Prerequisites:
    1. Ethtool support in the base drivers for retrieving the speed and duplex of each slave.
    2. Base driver support for setting the hardware address of a device while it is open. This is required so that there will always be one slave in the team using the bond hardware address (the curr_active_slave) while having a unique hardware address for each slave in the bond. If the curr_active_slave fails its hardware address is swapped with the new curr_active_slave that was chosen.
In this example we will employ the active-backup method, which is the safest one to use. Most of the guides you will find online prefer link aggregation, since an aggregation group increases the overall bandwidth of the resulting interface.
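
If you do want link aggregation and your switch speaks LACP, the bond stanza would look roughly like the sketch below (this is only a sketch, assuming the same eth8/eth3 pair; the hash policy is just a suggestion, and the corresponding switch ports must also be placed in an LACP channel group):

auto bond0
iface bond0 inet static
    address 172.16.0.8
    netmask 255.255.252.0
    gateway 172.16.0.1
    slaves eth8 eth3
    bond-mode 802.3ad
    bond-miimon 100
    bond-xmit-hash-policy layer3+4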

Let's suppose we want to bond eth8 and eth3 to an interface with the IP 172.16.0.8/22, eth9 and eth4 to an interface with the IP 172.16.4.8/22, and eth0 and eth5 to an interface with the IP 172.16.8.8/23:

root@zen-lb:~# apt-get install ifenslave-2.6
root@zen-lb:~# vi /etc/network/interfaces
auto lo
iface lo inet loopback
auto bond0
iface bond0 inet static
    address 172.16.0.8
    netmask 255.255.252.0
    network 172.16.0.0
    gateway 172.16.0.1
    slaves eth8 eth3
    bond-mode active-backup
    bond-miimon 100
    bond-primary eth8
auto bond1
iface bond1 inet static
    address 172.16.4.8
    netmask 255.255.252.0
    network 172.16.4.0
    slaves eth9 eth4
    bond-mode active-backup
    bond-miimon 100
    bond-primary eth9
auto bond2
iface bond2 inet static
    address 172.16.8.8
    netmask 255.255.254.0
    network 172.16.8.0
    slaves eth0 eth5
    bond-mode active-backup
    bond-miimon 100
    bond-primary eth0

bond-primary is the NIC that will be our primary device.
bond-miimon is how often (in milliseconds) the link state will be polled.
So, in our case, eth8 and eth3 will be polled every 100 ms; as long as eth8 is up it will serve our incoming and outgoing traffic, otherwise eth3 will take charge.
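
You can also inspect, and even force, the currently active slave at runtime through the bonding driver's sysfs interface. A quick sketch (the output shown is what we would expect with eth8 as the primary):

root@zen-lb:~# cat /sys/class/net/bond0/bonding/active_slave
eth8
root@zen-lb:~# echo eth3 > /sys/class/net/bond0/bonding/active_slave

Note that with bond-primary configured, the driver will prefer the primary again as soon as it becomes available.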

root@zen-lb:~# rm /usr/local/zenloadbalancer/config/if_eth*
root@zen-lb:~# vi /usr/local/zenloadbalancer/config/if_bond0_conf
bond0::172.16.0.8:255.255.252.0:up::
root@zen-lb:~# vi /usr/local/zenloadbalancer/config/if_bond1_conf
bond1::172.16.4.8:255.255.252.0:up::
root@zen-lb:~# vi /usr/local/zenloadbalancer/config/if_bond2_conf
bond2::172.16.8.8:255.255.254.0:up::
root@zen-lb:~# vi /usr/local/zenloadbalancer/config/global.conf
.....
#System Default Gateway
$defaultgw="172.16.0.1";
#Interface Default Gateway
$defaultgwif="bond0";
.....
#Also change the ntp server
.....
$ntp="0.europe.pool.ntp.org";
.....
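
After editing the config files, bring the bonds up (or just reboot) and make sure the addresses landed where we expect; a minimal sanity check (the exact output format depends on your iproute2 version):

root@zen-lb:~# ifup bond0 && ifup bond1 && ifup bond2
root@zen-lb:~# ip addr show bond0 | grep "inet "
    inet 172.16.0.8/22 brd 172.16.3.255 scope global bond0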

You might also want to change these particular ports on your switch to portfast. That way, you won't have to wait for the forwarding delay (which is useless on these particular ports anyway) and the transition will be seamless.
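
On a Cisco switch that would be something along these lines (the port range is hypothetical; adjust for your own ports and vendor):

switch(config)# interface range GigabitEthernet0/1 - 6
switch(config-if-range)# spanning-tree portfast
switch(config-if-range)# end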

All right, let's see if it all works:

root@zen-lb:~# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth8
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth8
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:19:b9:e4:12:a3

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1f:29:57:cf:fe

root@zen-lb:~# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth9
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth9
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:19:b9:e4:12:a5

Slave Interface: eth4
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1f:29:0d:69:81

root@zen-lb:~# cat /proc/net/bonding/bond2
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1f:29:57:cf:fd

Slave Interface: eth5
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1f:29:0d:69:80

And if you disconnect, or otherwise bring down, any of the primary slave interfaces, you'll see that the active backup comes up almost instantly (provided you set those ports to portfast on your switch).
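
A quick way to simulate a failure from the load balancer itself, rather than pulling cables, is to down the primary slave and watch the bond fail over; a sketch of what you should see:

root@zen-lb:~# ip link set eth8 down
root@zen-lb:~# grep "Currently Active Slave" /proc/net/bonding/bond0
Currently Active Slave: eth3
root@zen-lb:~# ip link set eth8 up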

Thursday, September 4, 2014

Zen Load Balancer 3.0.3 Performance and Security Customization Part 4

Time to fine-tune our IP stack:

root@zen-lb:~# vi /etc/sysctl.conf
# Performance:
# Turn down swappiness; 0 means no swapping on modern kernels. Roughly speaking, the number is the percentage of free memory remaining at which swapping kicks in.
vm.swappiness = 2
# Contains, as a percentage of total system memory, the number of pages at which a process which is generating disk writes will start writing out dirty data.
# Defaults to 10 (percent of RAM). Consensus is that 10% of RAM when RAM is say half a GB (so 10% is ~50 MB) is a sane value on spinning disks, but it can be MUCH worse when RAM is larger, say 16 GB (10% is ~1.6 GB), as that's several seconds of writeback on spinning disks. A more sane value in this case is 3 (16*0.03 ~ 491 MB).
vm.dirty_ratio = 3
# Contains, as a percentage of total system memory, the number of pages at which the background kernel flusher threads will start writing out dirty data.
# Defaults to 5 (percent of RAM). It may be just fine for small memory values, but again, consider and adjust accordingly for the amount of RAM on a particular system.
vm.dirty_background_ratio = 2
# Will not change the number of System V IPC message queue resources allowed
# Will not change Kernel semaphores (kernel.sem [semmsl semmns semopm semmni], kernel.shmmax and kernel.shmmin)

# Network Performance:
# Turn off TCP prequeue processing
net.ipv4.tcp_low_latency = 1
# Reuse time-wait sockets, better than recycling
net.ipv4.tcp_tw_reuse = 1
# Fast recycling of TIME-WAIT sockets (recycling them rather than reusing them). Default value is 0, and it should not be changed without advice from technical experts, so we leave it off.
net.ipv4.tcp_tw_recycle = 0
# Maximum time-to-live of entries. Unused entries will expire after this period of time if there is no memory pressure on the pool.
net.ipv4.inet_peer_maxttl = 5
# How often to send out keepalive messages when keepalive is enabled. Default is 7200 seconds.
net.ipv4.tcp_keepalive_time = 512
# How frequent probes are retransmitted, when a probe isn't acknowledged. Default is 75 seconds
net.ipv4.tcp_keepalive_intvl = 15
# Number of keepalive probes to send until the server decides that the connection is broken.
net.ipv4.tcp_keepalive_probes = 5
# Number of outstanding syn requests allowed. This setting tells the system when to start using syncookies. When you have more TCP connection requests in your queue than this number, the system will start using syncookies. Note that syncookies can have an impact on performance.
net.ipv4.tcp_max_syn_backlog = 36000
# Size of the listen queue
net.core.somaxconn = 36000
# Maximum number of timewait sockets held by the system simultaneously.
net.ipv4.tcp_max_tw_buckets = 100000
# Increase TCP default and max receive/send buffer size
net.core.rmem_default = 16777216
net.core.rmem_max = 16777216
net.core.wmem_default = 16777216
net.core.wmem_max = 16777216
# Same for UDP
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192
# Increase the maximum amount of option memory buffers
net.core.optmem_max = 20480
# Increase Linux autotuning TCP receive/send buffer limit
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Increase the length of the packets queue waiting on an interface until the kernel is ready to process them
# The backlog of pending connections allows the server to hold connections it's not ready to accept, which lets it withstand a larger slow HTTP attack and gives legitimate users a chance to be served under high load. However, a large backlog also prolongs the attack, since it queues all connection requests regardless of whether they're legitimate. If the server supports a backlog, I recommend making it reasonably large so that your HTTP server can handle a small attack.
net.core.netdev_max_backlog = 30000
# This setting determines the time that must elapse before TCP/IP can release a closed connection and reuse its resources.
net.ipv4.tcp_fin_timeout = 30
# Turn connection accounting on
net.netfilter.nf_conntrack_acct = 1
# Maximum number of tracked connections; the toll is 300-350 bytes of unswapped RAM per connection. The hash table size should accordingly be hashsize = conntrack_max / 8, which is why we set options ip_conntrack hashsize=25000 and options nf_conntrack hashsize=25000 in modprobe.conf (see the example after this listing).
net.ipv4.netfilter.ip_conntrack_max = 200000
# Dynamically-assigned ports range; bear in mind that in theory IANA has officially designated the range 49152 - 65535 for dynamic port assignment. The default linux range for modern kernels is 32768 - 61000.
net.ipv4.ip_local_port_range = 10000 65535
# Now that we've increased the ports, we need to increase the number of file handles as well. This parameter should be at least twice as big as the number of network connections you expect to support. We should also raise the maximum number of open files a user can have in /etc/security/limits.conf (see the example after this listing).
fs.file-max = 1048576
# Increase the number of memory map areas a process may have
vm.max_map_count = 1048576
# This setting determines the number of SYN+ACK packets sent in part 2 of a 3-way-handshake before the kernel gives up on the connection. Default is 5.
net.ipv4.tcp_synack_retries = 3
# Number of times initial SYNs for a TCP connection attempt will be retransmitted. This is only the timeout for outgoing connections. Default is 5.
net.ipv4.tcp_syn_retries = 3
# How many unacknowledged retransmissions it takes before TCP reports its suspicion of a problem to the network layer (which may then, for example, re-verify the route). Default is 3.
net.ipv4.tcp_retries1 = 3
# Determines how the TCP stack should behave for memory usage; each count is in memory pages (typically 4KB).
net.ipv4.tcp_mem = 50576 64768 98152
#net.ipv4.tcp_mem = 128000 200000 262144 # Use this for 1Gb+ connections
# The TCP window scale option is an option to increase the receive window size allowed in TCP above its former maximum value of 65535 bytes. See IETF RFC 1323.
# Linux kernels from 2.6.8 have enabled TCP Window Scaling by default
net.ipv4.tcp_window_scaling = 1
# How many times to retry before killing a TCP connection closed by our side. Default is 0.
net.ipv4.tcp_orphan_retries = 0
# Security:
# Debian does not have kernel.exec-shield; check that you have NX (Execute Disable) protection active with dmesg | grep protection. For NX protection your BIOS, CPU and OS must all support it, and you must run a 32-bit PAE or a 64-bit kernel (the NX bit is bit 63 of the page table entry).
#kernel.exec-shield = 1
# Turn on protection and randomize stack, vdso page and mmap + randomize brk base address.
kernel.randomize_va_space = 2
# tcp_syncookies with appropriate tcp_synack_retries and tcp_max_syn_backlog can mitigate SYN flood attacks. Note that without SYN cookies, a much larger value for tcp_max_syn_backlog is required. Default is 1.
net.ipv4.tcp_syncookies = 1
# Protect against tcp time-wait assassination hazards
net.ipv4.tcp_rfc1337 = 1
# Timestamps can provide security by protecting against wrapping sequence numbers (at gigabit speeds) but they also allow uptime detection. Definitely enable for Gb+ speeds, up to the admin to decide what to do for slower speeds.
# 1 is the default value but it has some overhead, use 0 for slightly better performance.
net.ipv4.tcp_timestamps = 0
#net.ipv4.tcp_timestamps = 1 # Use this for 1Gb+ connections
# Source address verification, helps protect against spoofing attacks.
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1    
# Usually, we'd want to disable IP forwarding but our LB is also a router so no choice here:
net.ipv4.ip_forward = 1
# Log martian packets
# This is a router, it will receive martians all the time, better turn this off. Otherwise, we'd want to turn this on.
net.ipv4.conf.all.log_martians = 0
net.ipv4.conf.default.log_martians = 0   
# Ignore echo broadcast requests to prevent being part of smurf attacks (default)
net.ipv4.icmp_echo_ignore_broadcasts = 1
# Ignore *all* echo requests, including on localhost (default 0). Enabling it is paranoid really.
net.ipv4.icmp_echo_ignore_all = 0
# Ignore bogus icmp errors (default)
net.ipv4.icmp_ignore_bogus_error_responses = 1
# IP source routing (insecure, disable it) (default)
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.default.accept_source_route = 0 
net.ipv6.conf.all.accept_source_route = 0
net.ipv6.conf.default.accept_source_route = 0
# Send redirects: Usually, we'd want to disable it but we're a LB aka router:
net.ipv4.conf.all.send_redirects = 1
net.ipv4.conf.default.send_redirects = 1
# ICMP only accept secure routing redirects (we could deny redirects altogether actually).
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0 
net.ipv6.conf.all.accept_redirects = 0
net.ipv6.conf.default.accept_redirects = 0 
net.ipv4.conf.all.secure_redirects = 1
net.ipv4.conf.default.secure_redirects = 1
# Disable IPv6 router solicitations:
net.ipv6.conf.default.router_solicitations = 0
# Do not accept Router Preference in RA
net.ipv6.conf.default.accept_ra_rtr_pref = 0
# Do not learn Prefix Information in Router Advertisement
net.ipv6.conf.default.accept_ra_pinfo = 0
# Do not learn the default router from router advertisements
net.ipv6.conf.default.accept_ra_defrtr = 0
# Do not assign a global unicast address to an interface according to router advertisements
net.ipv6.conf.default.autoconf = 0
# Do not send neighbor solicitations
net.ipv6.conf.default.dad_transmits = 0
# Only one global unicast IPv6 address per interface
net.ipv6.conf.default.max_addresses = 1
# And after all this, we disable IPv6 awwww :(                  
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
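
To load the new values without a reboot, and to take care of the per-user open-files limit and the conntrack hash size mentioned in the comments above, something along these lines should do (the nofile numbers and the modprobe.d filename are merely suggestions, the net.netfilter keys only apply once the conntrack modules are loaded, and the final command is just a spot check that one of the new values actually stuck):

root@zen-lb:~# sysctl -p /etc/sysctl.conf
root@zen-lb:~# vi /etc/security/limits.conf
*       soft    nofile  65536
*       hard    nofile  65536
root@zen-lb:~# vi /etc/modprobe.d/conntrack.conf
options nf_conntrack hashsize=25000
root@zen-lb:~# sysctl net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_reuse = 1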


If you have memory to spare, you can replace the corresponding settings with:

.... 
net.core.rmem_max = 1677721600
net.core.rmem_default = 167772160
net.core.wmem_max = 1677721600
net.core.wmem_default = 167772160
net.core.optmem_max = 2048000
....
net.ipv4.tcp_rmem = 1024000 8738000 1677721600
net.ipv4.tcp_wmem = 1024000 8738000 1677721600
net.ipv4.tcp_mem = 1024000 8738000 1677721600
net.ipv4.udp_mem = 1024000 8738000 1677721600
....