Ok, so this topic has probably been debated to death. I've got a different perspective on this issue, and I reckon it's worth putting down in words.
Over the last few weeks, I've been involved with one of our local customers who, after a lot of consideration, has decided to make the Linux jump. This was no quick decision, mind you, and was more than an "I'm tired of paying Microsoft for licenses" thing.
Why the move?
Linux made its way into the organization when I chose to use it as a desktop system while I was still consulting to the customer as a DBA and J2EE developer. (Yeah, I know, a weird combo, but I've never liked scripting languages much, so I chose Java/J2EE for my DBA tools.) I got pretty uptight when Eclipse and Windows (the company standard) decided to crash or hang on me every few hours, and I swapped back to Linux.
As a DBA, you tend to get involved with all sorts of issues, mainly because with a reasonably large or busy application, the database is normally the first thing that takes the blame for a performance dip. During one of my investigations, I found that the database had trouble sending data back to the client applications (network waits). The networking guys just laughed at me, and told me that the database shouldn't send so much data back. (!?!)
There were some variables involved: all of the servers (application servers, web servers, database servers, mail servers) were hosted at a different site, and all traffic (including web traffic) was being routed through the offsite location (a local ISP). My theory was that some people were misusing the web, as my investigation showed that HTTP traffic was extremely high.
This was really the first case where I could implement Linux with a direct business benefit. After a lot of consultation with the client (a Windows only type), I took an older PC that was standing around in the storeroom, slapped some SCSI drives into it, and installed Mandrake Linux on it. My reasoning here was that Mandrake is a pretty friendly O/S for a Windows skilled "LANnie" to pick up. I then went for the Squid proxy, and installed a web reporting tool (squint) onto the "proxy server", as it was called. This allowed us to report, per user, the amount of time spent on web sites, the amount of data downloaded, site details and so on. We could basically pinpoint exactly who was surfing, for how long, and what sites they were viewing. We had to change some of the firewall rules and client browser settings to point to the proxy (you change the firewall rules to only allow HTTP traffic from the proxy server, and then point all client browsers to the proxy).
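To give a rough idea of what such a per-user report boils down to, here is a minimal Python sketch that totals downloaded bytes per user and per site from Squid's native access.log layout. It is not how squint works internally; the sample log lines, user names and host names are all invented for illustration, and it assumes the user name is logged in the eighth field (otherwise it falls back to the client IP).

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Invented sample lines in Squid's native access.log layout:
# time elapsed client result/status bytes method URL user hierarchy/peer type
SAMPLE_LOG = """\
1100000000.123 250 192.168.0.10 TCP_MISS/200 1048576 GET http://music.example.com/stream.mp3 alice DIRECT/10.0.0.1 audio/mpeg
1100000001.456 120 192.168.0.11 TCP_MISS/200 2048 GET http://intranet.example.com/index.html bob DIRECT/10.0.0.2 text/html
1100000002.789 900 192.168.0.10 TCP_MISS/200 3145728 GET http://music.example.com/album.zip alice DIRECT/10.0.0.1 application/zip
"""


def usage_per_user(log_text):
    """Return {user: {"bytes": total, "sites": {host: bytes}}}."""
    usage = defaultdict(lambda: {"bytes": 0, "sites": defaultdict(int)})
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        client, size, url, user = fields[2], int(fields[4]), fields[6], fields[7]
        # Fall back to the client IP when no user name was logged ("-").
        who = client if user == "-" else user
        host = urlsplit(url).hostname or url
        usage[who]["bytes"] += size
        usage[who]["sites"][host] += size
    return usage


report = usage_per_user(SAMPLE_LOG)
# Print the heaviest downloaders first.
for who, stats in sorted(report.items(), key=lambda kv: -kv[1]["bytes"]):
    print(f"{who}: {stats['bytes'] / 1024 / 1024:.1f} MB")
```

Pointed at a real log file instead of the sample string, the same loop would give a crude "who downloaded how much, and from where" list, which is the heart of what we were after.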
We gathered statistics for 2 or 3 days, and our first report proved that my hunch was correct. Some guy in the admin department was using up a lot of our much needed bandwidth by downloading, well, porn. Some other guys were using the web for audio streaming, and others were downloading MP3s, games and so on. Now, in South Africa, bandwidth is expensive and slow. We only have one provider of leased or other telco lines (changing in 2006), and 3G isn't what it should be (yet).
We blocked some sites, the client issued some final warnings, and by the next day, the system was flying again. I started using our "proxy server" for more things, to see how much we could get out of a simple PC (about 128 MB RAM, 40 GB disk space, 1 GHz Pentium 3 CPU). We implemented CVS, an open source version control tool. We gave users in the Operations department a home directory to back up documents. We set up some print queues.
The CIO was pretty happy with what we managed to squeeze out of the PC. The key thing to realize here is that Linux could significantly benefit the business by doing small things very well, at a low cost. The question was, could it take over critical operations in the enterprise system?
To me, the best place for Linux today is with the most "invisible" part of the business: the data. A database should do one thing very well: store data, and provide easy and efficient access to it. It doesn't need fancy GUIs. It doesn't need wizards, graphs, reporting and other things associated with client applications. The database is a storage engine (with a few twists). Linux on the desktop hasn't been successful so far, for many reasons that I'll address in my next post. But for database servers, application servers, web servers, mail servers? If configured correctly (on any operating system) they tend to run in lights-off mode most of the time, or they should.
One of the issues in the environment was that you had to reboot the Windows servers pretty regularly, especially the database server. The database engine uses a lot of resources, and was pushing the box to the limit. I felt that Linux would make a better database server than Windows: for one, you have more flexibility in tuning Linux, and after years of working with both environments, I perceive Linux to be more stable than Windows, especially for an RDBMS.
While we were contemplating the shift, Windows did its best to help our decision. One night, I got a call at 3 o'clock in the morning from the network admin, and was told that they couldn't boot any of their servers. A virus had managed to corrupt the ntoskrnl.exe file (or something like that), and the O/S had to be recovered. (At least backups were complete...) Something went wrong during the recovery, and by the time I arrived on site, I was told that the O/S had to be trashed, and we would have to revert to backup. We lost about 4 days, due to wait time for hardware and O/S configuration. After that, the writing was on the wall - we were going Linux, wherever we could. As a matter of fact, we already had 2 Linux servers in the rack: our integration server, and a server that was responsible for client communications (it generated PDF documents and mailed them out).
Even before this happened, I presented a greater Linux strategy to the customer. Here is a high-level overview:
1. First, we move the database servers to Linux. This is the lowest risk, because the users aren't affected at all - except maybe that we expected more uptime and better scalability. We didn't anticipate too much of a performance boost - moving to Linux on the same 32 bit hardware wouldn't make too much of an outright performance change, but we were expecting a small improvement.
2. Move the web server (IIS) to Apache or Tomcat. Most web servers in the world run Apache, and it gets rid of having to pay licenses for a commodity. Another thing to mention is that the customer's enterprise application runs a J2EE webapp, and it was felt that we should standardize the corporate website on something like JSP, which could be supported by more than one person, and can run on multiple environments.
3. Move the application server to Linux. This should've been easy, but it isn't. The early application developers used the PowerBuilder DataWindow in their J2EE app, and we weren't convinced that the move would be seamless. So we left this until last.
4. Convert all remaining client-server apps to thin client, browser based apps. A browser based app would mean that the end users could use any O/S and browser they felt comfortable with. Also, it puts the business in a position to test out Desktop Linux, and do this at their own pace. Why would they want to? The most significant saving to be made out of a corporate Linux shift is at desktop level, for application users. Power users may still want to run Windows, but the guy who comes in in the morning, switches on his PC and fires up his email client and the application he requires to do his work could use ANY operating system. Mac, Linux, Windows, Solaris.
Even better, you probably don't need the "enterprise" version of Linux at desktop level, meaning that the O/S won't cost you a cent. Now, calculate this for an organization with 500 users. And remember to add up Office and any other Windows licenses, and so on.
5. Desktop Linux, where it makes sense.
More of the above. There are some good articles out on the web from various authors that point out that most Windows fans are really Office fans. Microsoft Outlook is the de facto standard for organizations, because of the integrated collaboration. But the largest portion of employees in a standard sized organization probably uses about 15% of Office. It makes sense for these users to try out OpenOffice. The tactic here was to install OpenOffice on Windows, swap the mail client to something like Thunderbird, and do proper UAT to see how that goes.
6. Mail servers. Depending on the business, and how the organization uses Outlook calendaring (if they use Outlook at all), this could be an easy or difficult shift. In this case, about 30% of the users in the organization use Outlook with calendaring. So, not practical yet. So how do we do this? In this case, it doesn't really matter. Windows and Linux can co-exist pretty easily in the environment, and I would never advocate a "rip and replace" strategy. The best strategy we can think of now is to go for a CRM (the client needs, and wants, to implement CRM) that integrates collaboration. First choices for now: SugarCRM, and possibly Compiere.
So the customer was ready for phases one, two and three. When we started strategizing the Linux shift, an interesting question came up, and it's one that comes up quite a lot now: while we're doing this move, how about investigating 64 bit architecture? Surely this will also make a massive difference? Our initial test showed that we would get a 10% - 15% performance increase from Linux on our same hardware, but that's fairly insignificant. Sooner or later, we would run into hardware limitations. The Linux shift would extend the life of the current hardware by about 8 months, and this seemed to be a short sighted strategy.
The customer asked what was needed for a "significant" performance improvement at the database level, and how we could ensure that our hardware lasts us for the next 5 years. The key thing about a database is that it is only as fast as the number of I/O requests it can process. And generally, disk writes and reads are very expensive, slow I/O operations. To offset this, you throw RAM at the problem, and increase the database cache so that it doesn't have to do as many direct disk reads and writes. There's a lot more to this, but that's the basic rule. This is especially true if you are sure that the database engine has been properly configured to use the machine resources efficiently, and that all of the queries thrown at the server are optimized.
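A back-of-the-envelope way to see why a bigger cache matters so much: the average cost of a read is a weighted mix of cheap memory hits and expensive disk misses. The sketch below uses assumed, purely illustrative latencies (0.0001 ms for a cached page, 8 ms for a disk read); they are not measurements from this client's system.

```python
def average_read_ms(hit_ratio, ram_ms=0.0001, disk_ms=8.0):
    """Average read cost for a given cache hit ratio.

    ram_ms and disk_ms are assumed, illustrative latencies,
    not figures measured on the system described in this post.
    """
    return hit_ratio * ram_ms + (1.0 - hit_ratio) * disk_ms


for hit_ratio in (0.90, 0.99):
    print(f"cache hit ratio {hit_ratio:.0%}: "
          f"{average_read_ms(hit_ratio):.3f} ms per read")
```

Under these assumptions, moving from a 90% to a 99% hit ratio cuts the average read cost by roughly a factor of ten, because the disk misses dominate the total.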
The limitation for this particular client is that they were running on a 32 bit architecture, which can only address 4 GB of memory per process. In practice, this means that your database server can only address about 3 GB of RAM, seeing that the upper part of that address space is reserved by the O/S. And moving to Linux won't solve that. There are some clever things you can do with memory (especially on Linux), but these are mainly workarounds.
A 64 bit processing architecture would solve this. There is no practical limit on addressable memory per process, and any application that relies heavily on data processing could benefit from the extra processing power.
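The arithmetic behind those limits is simple enough to check. This small sketch assumes the common 3 GB / 1 GB split between user and kernel address space on 32 bit systems; the exact split depends on the O/S and how it is configured, and real 64 bit hardware typically implements fewer than 64 address bits.

```python
GIB = 2 ** 30  # bytes in one gigabyte (binary)
EIB = 2 ** 60  # bytes in one exabyte (binary)

address_space_32 = 2 ** 32       # bytes reachable with a 32 bit pointer
kernel_reserved = 1 * GIB        # assumed 3 GB / 1 GB user/kernel split
usable_by_process = address_space_32 - kernel_reserved

address_space_64 = 2 ** 64       # theoretical ceiling for a 64 bit pointer

print(f"32 bit address space: {address_space_32 // GIB} GB")
print(f"usable by one process: {usable_by_process // GIB} GB")
print(f"64 bit address space: {address_space_64 // EIB} EB")
```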
There are other ways to ensure that your production system stays manageable. Archiving is probably the best strategy: You can keep your production system flying by ensuring that you only store "current" data in the production environment, and archive the data to a secondary, larger and slower system. With current regulations, data needs to be kept online for longer periods, and doing a tape or disk backup and putting it in a safe is not good enough. I'll do another post on archiving at a later stage. This client decided that they would throw hardware at the problem. This can (and probably will eventually) backfire: A faster, bigger system allows you to collect and store more data in a shorter space of time. Already, statistics say that data is doubling every 2 years.
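To put the "doubling every 2 years" figure next to the customer's 5 year hardware horizon, here is a quick calculation. It assumes the general statistic applies to this client's data, which nobody measured.

```python
def growth_factor(years, doubling_period_years=2.0):
    """How many times larger the data is after `years` of steady doubling."""
    return 2 ** (years / doubling_period_years)


print(f"after 5 years: {growth_factor(5):.1f}x the data")
```

If that holds, the new hardware has to cope with somewhere between five and six times today's data volume before it is retired, which is why archiving still matters even after a big upgrade.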
Back to the migration. After lots of investigation, testing, and sitting in labs with hardware vendors, providers and partners, we decided to go for the IBM OpenPower 720E, RHEL 4, and an IBM blade solution (all Intel servers). The client decided to move the complete architecture to one vendor (IBM), so the SAN, domain controllers, and everything else was moved to IBM hardware, to get answers from as few vendors as possible in the case of a failure.
The OpenPower would run all 3 production database servers (Sybase ASE). This meant we were consolidating 3 separate machines (all 3.2 GHz 4-way HP ProLiant, 6 GB RAM each) into one machine with 4 CPUs and 24 GB of RAM. (I pushed for 32, the maximum the box could take, but no luck.)
I'm not going to go into too much of the installation and configuration detail, but it was fairly painless. There were some issues that were fixed pretty easily, and after some extra Linux tuning, here are some of the most important statistics:
The customer runs a "payment-run" every week, per country. The payment-run for the largest country took more than 6 hours on the older system. New System: 45 Minutes.
Another daily process to summarize account details normally took 17 minutes. New System: 3 minutes.
An application that does B2B style transaction processing took 8 seconds per transaction previously. Currently: 1 transaction per second.
CPU usage is down from 80% to 35% on average. And remember, 3 database servers were consolidated into 1.
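For those who like the numbers as ratios, this little sketch turns the timings above into speedup factors. The payment run is entered as exactly 6 hours, so its factor is a lower bound (it took "more than" 6 hours), and the B2B figure treats "1 transaction per second" as 1 second per transaction.

```python
# (before, after) timings taken from the figures quoted above
measurements = {
    "weekly payment run (minutes)": (6 * 60, 45),
    "daily account summary (minutes)": (17, 3),
    "B2B transaction (seconds each)": (8, 1),
}

speedups = {name: before / after
            for name, (before, after) in measurements.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster")
```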
To summarize, we've got about 90% of the servers running RHEL 4 now. Overall system performance is dramatically better. Now for the desktops.....