Linux, EMC SANs, and TCP Delayed ACKs

One of  relatively well-known issues when using EMC (and some other vendors’) SANs over iSCSI is the SANs’ dislike for TCP delayed ACKs. The reasons for the dislike are best described in this VMware KB article. EMC also has several articles discussing delayed ACKs on Primus, but overall the issue is confusing. With this post, I’ll try to clear up the confusion.

(Since you can’t deep-link to EMC Powerlink pages, I’m just going to give article numbers.) Out of many articles returned by Powerlink search on “delayed ack”, we can consider emc245445 to be the starting point, since it discusses general best practices for improving Clariion iSCSI performance and provides references to the articles covering Windows and ESX hosts specifically. About Linux hosts, it has this strange “may also apply” statement with no further explanation or instructions. Looking at the Windows article (emc150702), we see very detailed instructions on tweaking TCP stack settings; the articles for ESX (emc191777 and emc273003) lead to the VMware article mentioned above, which tells us where to find the magic checkbox that disables delayed ACKs. But the best Powerlink can do about Linux is emc264171, which reads like an exam anwer of a mediocre student, who remembers some stuff from lectures, but can’t put it all together to produce a complete answer. So, what’s going on?

The issue is not trivial and was probably never researched by EMC deeply enough to produce a useful Primus article. In Linux, delayed ACKs are disabled by setting the TCP_NODELAY option on the socket. It’s done by the application that wants to use the socket and is not a simple system-wide setting like in Windows or ESX. Therefore, in Linux, there are many different places where this setting may be specified.

If the Linux server is using software iSCSI, it is the responsibility of the iSCSI initiator code to set this option. Fortunately, open-iscsi, which is used by most (all?) modern Linuxes, does the right thing: TCP_NODELAY is hardcoded in the function that handles TCP connections.

In the case of hardware iSCSI initiators (proper HBA or iSCSI offload provided by Broadcom NICs and the like), things get more complicated. Since these adapters implement their own TCP stacks, delayed ACKs need to be disabled through the driver. And, of course, every driver will have its own setting for that. For example, Broadcom’s bnx2i will take “options bnx2i en_tcp_dack=0″ in modprobe.conf. For other iSCSI implementations, you will need to consult the documentation or contact the vendor (or just Google it).

Categories: Operating Systems, Storage

VMware vCenter Server 5 Upgrade and Service Accounts

When upgrading our vCenter Server instances, I encountered an annoying quirk in the vCenter Server installer, which may or may not depend on the way the database backend is configured. At least in our case, with the transition from a local SQL Express instance to a separate SQL database server, the installer didn’t allow to choose the account to be used to run vpxd and tomcat. Obviously, running the service under the personal account of the user who happened to be installing the service is a bad idea. So here’s a quick recipe for changing the service account:

  • Create the database per http://www.vmware.com/files/pdf/techpaper/vSphere-5-Upgrade-Best-Practices-Guide.pdf, pg.20. Run SQL Management Studio under the AD account that you will later use to install vCenter Server, make sure first to give it sufficient privileges in SQL Server to create the database (you can even make it a sysadmin – it’s temporary, so you can drop the privileges later). When creating the database, leave the owner default.
  • On the system where you’ll install vCenter Server, create a DSN (see same document), make sure to specify Windows Authentication.
  • Run vCenter Server installer.
  • After the installation is finished, install vCenter Client, confirm that everything works.
  • Stop VirtualCenter Server and Web Management services.
  • Go into these services’ properties and specify a special account you want to use for this purpose on the Log On tab.
  • Go to SQL Management Studio again. Create a regular user associated with the same AD account, then execute the following query on VCDB: “ALTER AUTHORIZATION ON DATABASE::VCDB TO <useraccount>”. This will change the database’s owner to this user.
  • Start the two vCenter services.

It seems there’s also a similar issue with VUM: its service gets installed with local SYSTEM account, but your default DSN configuration will most likely use Windows Authentication and therefore require a proper AD account. So, go to the service’s properties, change its account to the same one you used for vCenter, and restart the service. This issue is sufficiently widespread to have earned itself a KB article: http://kb.vmware.com/kb/1011858.

Categories: Virtualization

Why pfSense is not production ready

November 2, 2010 8 comments

(Caveat: everything said below is applicable only to pfSense 1.2.3, since this is only version I ever used.)

pfSense is a great piece of software. Easy to install, easy to configure, very powerful, lightweight, stable. It’s no surprise that so many people use it when they need a software firewall or router. But after running it for about half a year in production, I have formed the opinion that it was a wrong decision to use it in a critical role. And here’s why.

Over this period, I had exactly three issues with pfSense. One of them, the breakage of CARP due to multicasts coming back over teamed physical adapters , is mostly VMware’s fault, and I’m not going to count it against pfSense. The other two, however, are clearly a reflection of the FOSS mindset (or rather lack of resources).

The first of the two is the default number of entries in the state table: 10,000. This number is fine for home use or a small startup’s web site, but any organization beyond infancy will have more traffic and will need to increase the table size. The change is simple and can be made on the fly, so it may not seem like a problem, but it’s easy to miss, and difficult to troubleshoot: connections just randomly timeout or take a long time to establish, while pfSense happily keeps its system logs free of any notifications. Considering that each table entry occupies just 1K of memory, it would make a lot of sense to set the default to a much larger number, or, better yet, implement dynamic table resize.

The second problem is much nastier. There’s something broken with IP fragmentation handling. In our specific case it affected EDNS responses (DNSSEC-enabled servers now return 2-3KB-long responses, which necessarily become fragmented). pfSense’s scrub feature would reassemble them for analysis, then send them down to the destination, again in fragmented form, and the second fragment would come in with broken checksum, which made the reassembly at destination or any intermediary firewall impossible. There are some hints that this may actually be a problem with em driver checksum offload, but at this point it’s irrelevant: if pfSense can’t do something as basic as IP fragment processing, regardless of the underlying drivers and hardware (in this case it was actually pfSense-distributed virtual appliance, so no compatibility issues should be expected), it doesn’t qualify as a production-ready firewall.

I expect it to be gone from our environment in about two weeks.

Categories: Networking

Fixing VM-based pfSense CARP announcement echoes when using teamed network adapters

A few people who have tried to run pfSense as a virtual appliance in an ESX(i) host have found that CARP may refuse to work. Both pfSense nodes remain in “Backup” state – none of them is willing to take the Master role and start serving the VIP. This problem can be observed only if the VIP belongs to a virtual network interface that has multiple underlying physical adapters in a teamed configuration.

The best clue to the solution can be found in pfSense logs. There, one can see that the primary pfSense node actually tries to become a Master, but every time it receives a CARP announcement with the same advertisement frequency as its own, which makes it drop to Backup state (a router must see only lower frequency advertisements to remain Master). In fact, those advertisements are its own.

So, the real question is: why does the router VM receive IP packets that it has just sent? To answer it, we need to remember that, first, CARP advertisements are multicast, and, second, the typical ESX teaming setup uses the “Route based on the originating virtual switch port ID” option. This setting means that any given vNIC will consistently use the same pNIC, unless a hardware failure occurs. When this setting is used by the host, the physical switch, which the host’s pNICs are connected to, has all its corresponding ports configured by default, with no link aggregation.

Now, what happens when a CARP advertisement is sent? It exits the host on one pNIC, travels to the switch, where, being a multicast, it’s sent to other switch ports, including the other pNICs in the same team as the originating pNIC. The multicast comes back into the host, where it’s sent to all VMs on the same vSwitch, including the originating router. Oops.

We can argue whether or not this ESX behavior is correct, but the important fact is that VMware doesn’t seem to be interested in changing this behavior (the problem existed in 3.5 and still exists in 4.0). Instead, but there has been no VMware solution until 4.0U2. If you’re on this version, you can use the new Net.ReversePathFwdCheckPromisc option (refer to 4.0U2 release notes). Or we can fix it by ourselves, in a very simple way. We need to make the switch aware of the teamed nature of the pNICs involved. This way the switch won’t send the multicast packets back to the host.

I will let you figure out the correct setting for the switch (different vendors use different names for the same thing: Cisco has EtherChannel, Nortel calls it MultiLink Trunking, etc.). As for the host side, change the load balancing algorithm from the default to “Route based on IP hash”. Just keep in mind that until you have made the changes on both ends, that connection may not work, so make sure you’re not transferring anything important to/from the VMs on the same vSwitch while you’re making the changes. (I’m assuming your management network is on a different vSwitch; otherwise you’re on your own.)

Update: Thank you to Anne Jan Elsinga for pointing out that 4.0U2 provides a new option that can be used to solve the problem. I’ve modified the post to reflect this.

Missing access points after upgrade of Cisco Wireless LAN Controller to Release 5.2

If you upgrade your Cisco Wireless LAN Controller to Release 5.2, and suddenly some or all access points disappear, go to Controller, Advanced, Master Controller Mode, check the box, and power cycle the missing access points.

Root cause: 5.2 introduces CAPWAP protocol as the replacement for LWAPP. Some access points don’t transition to the new protocol unless the Master controller tells them to. Unfortunately, Master Controller Mode gets disabled after every reboot of the controller. Since you always have to reboot to apply the new software release, the problematic access points are left without guidance until you manually check that Master Controller Mode box.

Categories: Networking Tags: ,

Solution of the Clariion plaid issue

The problem was not with plaids as such, but rather with network congestion easily triggered by plaids.

The default configuration of PowerPath is to make all available paths active. With two iSCSI ports per SP, there are four paths, each path 1Gbps wide, 4Gbps total. However, that’s true only for the array side. On the host, we only have two 1G NICs. So, every time the array starts firing on all cylinders (and using plaids built from LUNs owned by different SPs is guaranteed to make it so), it is pumping twice as much data as the host interfaces can handle. Therefore, network congestion, frame discards, and severely impacted throughput.

Solution: change the mode of two of the four paths from active to standby, choosing them so that there’s one active path on each SP to each NIC. Alternatively, add two more NICs so that the host bandwidth is equal to the array bandwidth (though this may require four NICs on all other hosts, since using multiple host NICs on the same VLAN is not recommended).

Expect a new Primus KB and some changes to EMC iSCSI documentation.

Clariion plaids on RHEL 4

We will begin with a quick mention of the problem currently under investigation with EMC support:

  • Host: Dell PowerEdge R710, PowerPath, RHEL 4.8.
  • SAN: Clariion CX4-120
  • Connectivity: iSCSI – two ports per SP, two ports on the host, two VLANs on Nortel ERS5520.

Problem: if we present two LUNs to the host, put them into a single VG, then create an LV with no striping (LUNs get concatenated on that volume), everything is good. If we take the same LUNs and put them into a striped LV per various EMC Best Practices documents (since the LUNs themselves are striped across physical disks EMC calls this layout “plaids”), read performance suffers in a very bad way, dropping to about 4MB/sec. Write performance stays perfect at over 100MB/sec.

Follow

Get every new post delivered to your Inbox.