NSCD Trickery

Don’t forget to flush the NSCD cache when user/group lookups from LDAP don’t quite look right:

nscd -i passwd
nscd -i group

SSO Heaven

Due to the humidity and heat this past weekend, my wife and I decided to hang out around the house. I was a bit bored at first but then I decided to roll up my sleeves and start to finally wrap my head around the Kerberos architecture and kerberizing services to achieve SSO nirvana; something I’ve been wanting since I first implemented Kerberos authentication a couple of years back.

Background

Back in 2011, I had grown bored with my Windows 2003 AD so I ditched it and implemented OpenLDAP+Kerberos; the former for authorization/identity management and the latter for authentication. I got it all basically working with users/groups in LDAP, user principals and passwords set in the KDC, and Apache set up to perform basic auth back to the KDC. So it all worked, but I didn’t actually kerberize any of the hosts or services on the network; SSH and Apache prompted the user for credentials. I wanted to be able to grab a Ticket Granting Ticket and go.

There were two main things I wanted kerberized: SSH and Apache.

What I learned

First things first: get the foundation working, and in this case, like a lot of things, the foundation is DNS along with NTP. The latter was all fine, but my Intranet and KDC were using two different DNS servers, with different zones configured in them. For now, I made the two DNS servers have the same zones, each thinking it is the master. I’ll probably change this at some point down the road, as this can only lead to pain.

In order for DNS to be effective, I also had to clean up my hosts files where appropriate so that every client, server, and service all used the same names.
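
As a quick sanity check (a sketch using standard tools; the hostname and address are placeholders), forward and reverse lookups should agree on every box, and getent will catch stale /etc/hosts entries:

host hostname.domain.tld           # forward lookup
host xxx.xxx.xxx.xxx               # reverse lookup should return the same FQDN
getent hosts hostname.domain.tld   # shows what NSS (including /etc/hosts) actually resolves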

KDC

With the foundation laid, I turned to understanding how Kerberos works in general. As I understand it at this point, making use of a TGT to get you authenticated requires a triangle of trust: the client establishes a trust with the KDC when it gets a TGT, each network service establishes a trust with the KDC via its server’s keytab, and the workstation and network services establish a trust because they both trust the KDC. The part I had never set up was the trust between the various services and the KDC. This is done by host and service principals that can be created on the KDC using kadmin.local and published to the services via a keytab. On the KDC, launch kadmin.local and create the host and service principals thusly:

addprinc -randkey host/hostname.domain.tld
addprinc -randkey HTTP/virtualhostname.domain.tld

Now that the KDC has created the keys it will accept for the services (in this case “HTTP” for Apache and “host” for, among other things, SSH), we have to push these out to the various servers hosting the services. This will establish the trust between the servers hosting the network services and the KDC. Again, on the KDC with kadmin.local:

ktadd -k /tmp/krb5.keytab -glob */hostname.domain.tld
ktadd -k /tmp/krb5.keytab -glob */virtualhostname.domain.tld

The above commands will create the keytab file with the keys that will establish the service/KDC trust. You can take a look at what’s in the keytab using klist -kt /tmp/krb5.keytab

Network Services

Move the keytab over to the server hosting the services and place it at /etc/krb5.keytab.
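
One way to get it there and confirm what landed (assuming root SSH access to the target server):

scp /tmp/krb5.keytab root@hostname.domain.tld:/etc/krb5.keytab
ssh root@hostname.domain.tld klist -kt /etc/krb5.keytab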

Apache

Apply permissions to the keytab such that Apache can read it:

chown root:www-data /etc/krb5.keytab && chmod 640 /etc/krb5.keytab

The “tricky” part for the Apache config was getting the Directory statement right in the site’s conf file. Things started working when I put in “KrbServiceName HTTP/virtualhostname.domain.tld” to specify the principal to use:

<Directory /path/to/web/app/>
    AuthName "Kerberos"
    AuthType Kerberos
    Krb5Keytab /etc/krb5.keytab
    KrbAuthRealm YOURREALM.TLD
    KrbMethodNegotiate on
    KrbMethodK5Passwd off
    KrbSaveCredentials off
    KrbVerifyKDC on
    KrbServiceName HTTP/virtualhostname.domain.tld
    Require valid-user
</Directory>

SSH

To get SSH working, the SSH server needs to be set with “GSSAPIAuthentication yes” in sshd_config and the client needs to be set with “GSSAPIAuthentication yes” and “GSSAPIDelegateCredentials yes” in ssh_config. Grab a TGT with “kinit username”. Connecting with username@hostname.domain.tld should now let you SSH into the server without being prompted for a password. Similarly, after an Apache restart, you should be able to pull up your webpage without auth, provided the web app is configured to allow Apache Basic authentication.
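
For reference, a minimal sketch of the pieces involved (file locations assume a stock OpenSSH layout, and the curl test assumes a build with GSS-Negotiate support):

# /etc/ssh/sshd_config on the server
GSSAPIAuthentication yes

# /etc/ssh/ssh_config (or ~/.ssh/config) on the client
GSSAPIAuthentication yes
GSSAPIDelegateCredentials yes

# grab a TGT, then test both services
kinit username
ssh username@hostname.domain.tld
curl --negotiate -u : http://virtualhostname.domain.tld/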

Resources

IBM had a nice page outlining the “minor” Kerberos error codes that I would see in the Apache debug logs and the SSH debug output

The Apache Mod Kerb sourceforge project had a nice writeup of the Apache Directory Kerberos options

Network Performance Testing

In troubleshooting our backup issue, it seems as though the issue might not be the backup agent, but rather networking related. This particular server that was having problems is virtualized and running in a vSphere cluster. Moving the VM to another host in the cluster improved the backup performance, though perhaps coincidentally. I’ve been running network bandwidth tests to try and confirm or refute this hypothesis. In the process I’m familiarizing myself with iperf and wanted to jot some notes down for myself.
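
The basic pattern (a sketch using classic iperf 2 flags; the hostname is a placeholder) is to run a server on one end and point a client at it from the other:

# on the receiving system
iperf -s

# on the sending system: 30-second test, reporting every 5 seconds
iperf -c target.domain.tld -t 30 -i 5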

The above is handy, as I was able to compare/contrast not only bandwidth between various systems, but systems on different hosts, systems on the same host, etc., to get a better picture.

My tests so far have not backed up my hypothesis, but they did show some interesting figures, some more surprising than others.

Backup MySQL, Old School Style

Our standard backup software at work decided to crap out on one of our more critical servers. I decided to write up a little mysqldump script to get database copies while we troubleshoot our backup agent. The destination is an NFS share to a separate compute node on separate storage infrastructure. I'll use rsync to create incremental sets of data for the nightly backup job to tape on that NFS server.

#!/bin/bash
BKFLDR=/backups/server_name

# get the list of databases
DATABASES=`mysql -e "show databases" | tail -n +2`

for DATABASE in ${DATABASES[@]}
do
    # create a folder with the same name as the database
    if [ ! -d $BKFLDR/$DATABASE ]
    then
        mkdir -p $BKFLDR/$DATABASE
    fi
    # run the backup while creating individual files for each table
    mysqldump --flush-logs --single-transaction $DATABASE --tab=$BKFLDR/$DATABASE
done
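
For the incremental sets on the NFS side, one common approach (a sketch with hypothetical paths, not necessarily what ended up in production) is rsync's --link-dest rotation, which hard-links files unchanged since the previous run:

#!/bin/bash
TODAY=$(date +%F)
SRC=/backups/server_name/
DST=/mnt/nfs/server_name
# copy today's dumps, hard-linking anything identical to yesterday's set
rsync -a --delete --link-dest=$DST/latest $SRC $DST/$TODAY/
# point "latest" at the set we just made
ln -sfn $DST/$TODAY $DST/latest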

Migrate DNS to Windows

While there are better ways to migrate DNS (like zone transfers), the tech who set up AD didn’t transfer the records before the server was delivering services. Instead of risking a zone transfer with AD already running with DNS, and DNS partially populated, I just dumped the current DNS records into a series of PowerShell commands.

Line 4 queries our current DNS server for each address in an IP range and strips the domain from the returned FQDN; the if that follows echoes a PowerShell command only for IPs that resolved to a hostname.

#!/bin/bash
i=1
while [ $i -lt 255 ]; do
  host=`nslookup xxx.xxx.xxx.$i | grep name.= | awk '{print $4}' | sed 's/\..*//'`
  if [ `echo $host | grep -c "[a-z]"` -eq 1 ]; then
    echo Add-DnsServerResourceRecordA -ZoneName domain-name -Name $host -IPv4Address xxx.xxx.xxx.$i -CreatePtr
  fi
  ((i++))
done

What you get in the end is something you can copy/paste into PowerShell.
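
For instance, an address in the range that resolves to a (hypothetical) host named host01 would come out as:

Add-DnsServerResourceRecordA -ZoneName domain-name -Name host01 -IPv4Address xxx.xxx.xxx.10 -CreatePtr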

Binding Linux to Active Directory

Continuing with the last few posts regarding my company's migration to Active Directory, I wanted to jot down my solution for getting our Linux systems onto AD. Our previous directory server was Apple's Open Directory, which was basically a bundle of OpenLDAP, Kerberos, a password database, and a GUI. I won't get into the rationale behind migrating off of OD; suffice it to say OD wasn't cutting it. That said, our previous binding (I'm using the term binding loosely here) was done with LDAP/Kerb. There were a couple of options for binding to AD, including winbind, but I've had issues with that in the past; it mostly worked, but it hiccuped enough, and in different ways, that it was just a pain. It also relied on more obscure pieces of the directory server to work, compared to LDAP/Kerb. So I wound up just modifying two config files, nslcd.conf and krb5.conf, to get everything working.

nslcd

We're using LDAP for user/group information and Kerberos for authentication. The tutorials and forum posts I found with solutions all differed slightly for everyone's environment, so a simple copy/paste didn't work out of the box. I wound up using a combination of "tcpdump -X host host -w /tmp/tcpdump.pcap" + wireshark and "nslcd -d" to debug. The former helped me figure out the filter settings I needed. The latter helped me get the schema<->POSIX user/group info mapping sorted out. Below is something that got us up and running. There are still a few things we'll likely tweak before we call it done. We're mostly a CentOS shop at this point, though we have a couple of Ubuntu systems. They're similar for the most part, but the two vendors use different GIDs to run the nslcd daemon and the mapping was a little different. Kerberos was identical between the two vendors.

CentOS

uid nslcd
gid ldap
uri ldap://host
base CN=Users,DC=domain,DC=tld
binddn domain\user
bindpw password
scope group sub
scope hosts sub
pagesize 1000
referrals off
filter passwd (objectCategory=user)
filter group (objectCategory=group)
map passwd uid sAMAccountName
map passwd gecos displayName
map passwd gidNumber primaryGroupID
map group uniqueMember member
ssl no
tls_cacertdir /etc/openldap/cacerts

Ubuntu

uid nslcd
gid nslcd
uri ldap://host
base CN=Users,DC=domain,DC=tld
binddn domain\user
bindpw password
scope group sub
scope hosts sub
pagesize 1000
referrals off
filter passwd (objectCategory=user)
filter group (objectCategory=group)
map passwd uid sAMAccountName
map passwd gecos displayName
map passwd gidNumber primaryGroupID
ssl no
tls_cacertdir /etc/openldap/cacerts
Don't forget to restart both nslcd and nscd as needed.
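
A quick way to confirm the LDAP side is wired up (assuming a test AD account, here called aduser) is to restart the daemons and pull the account through NSS:

service nslcd restart && service nscd restart
getent passwd aduser
id aduser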

Kerberos

[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = DOMAIN.TLD
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 DOMAIN.TLD = {
  kdc = host
  admin_server = host
 }

[domain_realm]
 domain.tld = DOMAIN.TLD
 .domain.tld = DOMAIN.TLD
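
And a quick sanity check on the Kerberos side once krb5.conf is in place (same hypothetical test account):

kinit aduser@DOMAIN.TLD
klist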

Conclusion

With all of that sorted out, a simple push of these config files via one's favorite Configuration Management System makes for a relatively painless migration of server authentication from OD to AD.

Remote Install Software with PowerShell

My company is rolling out Active Directory on Windows Server 2012. I’ve been using the project as an excuse to work on my PowerShell skills.

  1. I can’t remember how to start a remote session. I can remember Get-Command, so I’ll use that to find the command until my memory kicks in
    • Get-Command *pssession*
  2. Start the remote session, giving the name of the server to which you want a connection
    • New-PSSession -ComputerName servername
  3. Use Get-WindowsFeature to find what is installed, and the name of what you want to install
    • Get-WindowsFeature
  4. Do it (a consolidated sketch follows this list)
    • Add-WindowsFeature FeatureName
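
Putting the steps together, a rough sketch of the flow (server and feature names are placeholders, and Enter-PSSession is used here for an interactive session):

# open an interactive remote session on the new server
Enter-PSSession -ComputerName server01.domain.tld

# see what's installed and find the name of the feature you want
Get-WindowsFeature

# install it, e.g. the DNS Server role
Add-WindowsFeature DNS

# back to the local session
Exit-PSSession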

VMware Licensing

Hopefully most people will have their 4.x -> 5.1, or 5.0 -> 5.1 upgrade behind them, or they’re comfortable with the RAM limitations on 5.0. If not, check out at least the Resolution below to save yourself some headaches.

Backstory

My company runs a three node vSphere cluster. One of the divisions in the company has been running into memory limits on their system, and is in the process of growing. They had optimized their code as much as they could, but it was past time to throw some more RAM their way. As I understood our licensing agreement from when we purchased VMware, a little over a year ago now, I recalled mention of a cap on the amount of memory that we could attach to our physical nodes. So, instead of researching for the answer on VMware’s KB site, or scouring DDG for random thoughts, I decided to make use of our support contract and just ask the “experts”.

The Confusion Begins

I called up the VMware Licensing Team and told them what we were doing and what we’d need to purchase to get beyond the RAM cap. The rep asked for our build number and I dutifully read that back from the location dictated by the rep. The response from the rep delighted me to no end, though I was still a bit skeptical. She noted that we were on 5.0 Update 1, which she saw as having no RAM limitations. She even went as far as to send along a KB article describing the change from 5.0 to 5.0 U1 with respect to RAM licensing. It called out vRAM, which I had heard had changed, but that change was before we even purchased the license. I still recalled a hard limit, not a vRAM limit. I pushed her some more but she was quite adamant.

At this point, I sent along the purchase proposal to the powers that be; they approved. Before I made the purchase, though, I called up Licensing one more time for a second opinion. I again walked through the issue and what I understood the limitations to be; that there was a hard limit with our license. Again, the rep pointed me to a KB article. All looked good. I went ahead and executed one of the quotes I had, and the RAM was in my hand by week’s end, just enough time for me to throw it in a server at the end of the day before I went on a three-day weekend.

I moved all the VMs around and took one of the nodes offline, popped the RAM in it, and powered it back up. Sure enough, there was my 288G in the vSphere Client summary. I thought to myself, “excellent!”. I arranged time to allocate some more vRAM to one of the VMs, powered the VM back up, and [un]shockingly got an error about not enough vRAM capacity. It was the end of the day, so I bumped the vRAM on this VM back a bit, powered it up, and handed it off to the team. I called VMware to open a ticket.

The Hassle Ensues

I’m told by the “receptionist” who takes the call that we have “Basic” support, which means I don’t get to talk to someone right away, but instead I’ll get a call back or email within 8 business hours. That was a little annoying, but that’s the decision we made when purchasing the support contract, so I was OK with it. This wasn’t a critical emergency so I figured we’d just get it all sorted out.

I never heard back from my assigned tech. I received one email the next day saying “I’m on the phone and will reach out to you later”, or something to that effect. There were no other follow-up messages from that day. Come Monday, I replied to the email asking what I needed to do to answer any questions he may have had. No response. I called him up, but he was on the other line. I left a voicemail. No reply.

I opened another ticket for the same issue the next day, explaining the situation of not hearing back from the first tech. This new ticket got a new tech, who, to his credit, did reply with some follow-up questions, questions which had already been answered in the notes of the ticket when it was opened, but I digress.

The support tech said this was a licensing issue. I agreed and followed his advice to contact Licensing again. I did, and asked that they look at the notes from the Support tech. Their response? This is a support issue. ARG!!!!! I called in again and became a little bit more persistent. One of the shift managers recommended that the three of us have a conference call the next day. Long story slightly shorter: I finally got on the call with a tech who was able to explain the situation, and a License Team member who knew their product and licensing model.

Resolution

The issue? Well, our nodes were all able to use unlimited RAM, but vCenter, the piece that orchestrates them, wasn’t. VMware Essentials Plus is limited to 32G of vRAM per CPU. For our 6x quad-core CPUs, we had a limit of 192G of total RAM we could use. Never in my first calls was I asked what version of vCenter I was using, and I doubt anyone looked at my account to see. So frustrating. The end result was upgrading vCenter from 5.0 to 5.1. I did that and have now bumped up the RAM on a few systems. All is fine now.

Brain Dump: VMware Memory Management

Definitions

Memory Layers

Ballooning

When the hypervisor runs multiple virtual machines and the total amount of free host memory becomes low, none of the virtual machines will free guest physical memory because the guest operating system cannot detect the host’s memory shortage. Ballooning makes the guest operating system aware of the low memory status of the host.

If the hypervisor needs to reclaim virtual machine memory, it sets a proper target balloon size for the balloon driver, making it “inflate” by allocating guest physical pages within the virtual machine.

Ballooning Example

Much (if not all) of the information in this post is shamelessly ripped from VMware’s whitepaper, Understanding Memory Resource Management in VMware ESX Server, and the vSphere Resource Management Guide.

Time Sync with Linux Guests on VMware

Apparently VMware recommends, when working with Linux guests, syncing the guest’s time via traditional NTP rather than vmware-toolboxd.

NTP Recommendations

Note: VMware recommends you use NTP instead of VMware Tools periodic time synchronization. NTP is an industry standard and ensures accurate timekeeping in your guest. You may have to open the firewall (UDP 123) to allow NTP traffic.

This is a sample /etc/ntp.conf:

tinker panic 0
restrict 127.0.0.1
restrict default kod nomodify notrap
server 0.vmware.pool.ntp.org
server 1.vmware.pool.ntp.org
server 2.vmware.pool.ntp.org
driftfile /var/lib/ntp/drift

This is a sample (RedHat specific) /etc/ntp/step-tickers:

0.vmware.pool.ntp.org
1.vmware.pool.ntp.org

The configuration directive tinker panic 0 instructs NTP not to give up if it sees a large jump in time. This is important for coping with large time drifts and also resuming virtual machines from their suspended state.

Note: The directive tinker panic 0 must be at the top of the ntp.conf file.

It is also important not to use the local clock as a time source, often referred to as the Undisciplined Local Clock. NTP has a tendency to fall back to this in preference to the remote servers when there is a large amount of time drift.

An example of such a configuration is:

server 127.127.1.0
fudge 127.127.1.0 stratum 10

Comment out both lines.

After making changes to NTP configuration, the NTP daemon must be restarted. Refer to your operating system vendor’s documentation.
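
On a RedHat-style guest (matching the step-tickers example above), for example, that would be something like:

service ntpd restart
chkconfig ntpd on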

VMware Tools time synchronization configuration

When using NTP in the guest, disable VMware Tools periodic time synchronization.

To disable VMware Tools periodic time sync, perform one of these options:

Set tools.syncTime = "FALSE" in the configuration file (.vmx file) of the virtual machine.

OR

Deselect Time synchronization between the virtual machine and the host operating system in the VMware Tools toolbox GUI of the guest operating system.

OR

Run the vmware-guestd --cmd "vmx.set_option synctime 1 0" command in the guest operating system. To enable time syncing again, use the same command with "0 1" instead of "1 0".

For ESX 4.1 and later, use these parameters for Linux, Solaris, and FreeBSD:

To display the current status of the service:

vmware-toolbox-cmd timesync status

To disable periodic time synchronization:

vmware-toolbox-cmd timesync disable

These options do not disable one-time synchronizations done by VMware Tools for events such as tools startup, taking a snapshot, resuming from a snapshot, resuming from suspend, or VMotion. These events synchronize time in the guest operating system with time in the host operating system, so it is important to make sure that the host operating system’s time is correct.

To do this for VMware ACE, VMware Fusion, VMware GSX Server, VMware Player, VMware Server, and VMware Workstation, run time synchronization software such as NTP or w32time in the host. For VMware ESX run NTP in the service console. For VMware ESXi, run NTP in the VMkernel.

Note: VMware Tools one-time synchronization events should not be disabled.

Source: VMware KB: Timekeeping best practices for Linux guests