Sean Caffery

Ruby, Rails and the things needed to make it work

Know Your Tools

I find that reading code from other developers and watching them work helps improve my approach to problems and techniques I use in my own projects. Over the last week, I have been watching the Play by Play videos on PeepCode. One thing that stood out while watching these videos was how well the developers knew the toolset that they were using. This made me rethink how well I know the tools that I use and how I could improve my knowledege to gain efficencies.

For example, in the Tenderlove play by play, Aaron shows how well he knows Ruby, Vim and his shell. He uses these tools to stop him having to remember specific library calls, or how a particular method is used. He is able to quickly look up anything that he is unsure of. Knowing that there is a way to to achieve something or the general concept is enough; a superior knowledge of the tools ensures that the answer to any question is found quickly.

I like this method of working. Being focused on the outcome of a task is much more important than remembering a whole lot of library calls. This is especially true when working in a pair. Gaps in library knowledge can be picked up the other person in the pair, but to reach maximum effiency a good knowledge of the tools being used it required.

Over the past few months, I have done the following things to improve the knowledge of the tools that I used everyday and in turn improve my productivity.

Learned how to touch type

This was a big gain in productivity. I thought that I was a reasonably good typist until I measured my skills with a typing trainer and found that I made a lot of mistakes and could only manage an effective speed of around 30 words per minute. After around 2 weeks, I was typing at almost 50 words per minute with 99% accuracy. I am not sure that this is enough but it was certainly a big improvement.

Customised my Vim setup and learned more about Vim concepts

I started out by reading about how others had started to use Vim. I then updated my .vimrc with the things that I thought would apply to my situation. I also installed some plugins relevant to Ruby and Rails development. I started to use Git to manage my Vim configuration which allowed me to have the same configuration across all development machines.

I think these changes would have resulted in approximately a 2 times improvement in productivity when using Vim.

I recommend updating your knowledge and improving your skills with the tools you use the most. The payoff is definately work the investment.

Optimising Merge Inserts With SQLServer 2008

At C3, I have to support a number of different databases, and for a number of reasons, I use PostgreSQL for development when one of our main clients uses SQL Server 2008 (we have CI builds that test all of our supported databases, of course). This means that I do not always see the problems that clients face when using our product. A client complained to my team that the time taken to stage records was unacceptable and that they would not take a new release of our software until we fixed this issue. This posed a serious problem to my team as none of us are SQL Server experts and we had learn very quickly how to make it the fastest database that we supported.

Firstly, we gained some intelligence from the client; what files were they trying to load, how many columns, how many rows they were loading, and we ended up with the following:

  • 56 columns
  • ~ 10,000 rows of a moderately complex table

Armed with this information, I set out, with another team member, to profile what were currently doing and looking at ways that we could improve.

Our system maintains audit logs for all records that are updated, deleted, or added to the system, this means that staging 10,000 rows into a table that already contains those rows will generate:

  • an update statement for any rows that currently exist, in the audit table
  • 10,000 upsert (using a MERGE statement in the case of SQL Server 2008) statements in the staging table
  • an insert statement for records that are new in the latest upload, in the audit table
  • a delete statement for records that should not exist, in the audit table
  • another set of update records, in the audit table

We were partially on the right track; using a merge statement is the correct way to perform an update / insert in SQL Server, our mistake was using so many of them. The SQL Server 2008 query planner does not cache query plans. So, every time that you execute a query, it must be parsed, compiled, and a query plan generated, therefore you want to keep the number of queries needed to perform an operation to a minimum wherever possible, especially if you need to perform a particular operation thousands of times.

Our second mistake was not having the correct indexes on the table that we were merging into. In fact, our client had no indexes on the table that we were merging into. We used the tools available with SQL Server to try and optimise our queries but it seems that our merge statement was too complex for the query optimiser, as it offered us the not very helpful hint of adding all of the columns in our target table to an index. Not surprisingly, this considerably slowed our inserts down. After reading up on merge statement best practices, we added the appropriate indexes to our join columns.

It was suggested by someone in another team that the best way to process this number of records was to use a temporary table, and have the database merge the temporary table into the target table. This is the technique we ended up using and it is super fast, but first, we had to get our new records into the temporary table.

SQL Server supports a number of insert syntaxes, some of which are non standard and allow you to exclude column names or insert multiple rows with one statement. We experimented with the T-SQL insert that allowed multiple rows to be inserted, which is expressed in the following manner:

1
2
INSERT INTO my_table_name (column_1, column_2, column_2)
    VALUES ('some', 'string', 'values'), ('some', 'other', 'string');

If the row data is inserted in the same order as the columns in the target table, you do not have to include the column names in your query, but we chose to include them for clarity.

This syntax allows you insert up to 1000 records at a time, using one insert statement. This was an interesting discovery. We had found the way to reduce the number of insert queries we had to produce and using this technique we would have 10 inserts into our temporary table, and the merge from our temporary table into our target table would be super fast. Mission accomplished!

We were wrong.

After some experimentation with the number of insert records, we found that as we increased the number of records inserted, the query slowed down considerably. We found for our situation that 10 records per INSERT statement was the magic number.

Our new procedure for staging records generated:

  • an update statement for any rows that currently exist, in the audit table
  • a temporary table that represented the target table
  • 1000 T-SQL INSERT statements, in the temporary table
  • 1 MERGE statement into the staging table
  • a drop statement for the temporary table
  • an insert statement for records that are new in the latest upload, in the audit table
  • a delete statement for records that should not exist, in the audit table
  • another set of update records, in the audit table

After implementing the changes mentioned above, and some heavy handed code optimisation, we managed to take our staging time for the given records from 16 minutes to sub 1 minute.

During our investigation, we also changed the backup strategies, how much was allocated for logging, and disabled auto- resize for all logs. Before you modify your backup strategy, you should read and fully understand the implications for your organisation in doing so. Doing this allowed us to focus on optimising our code and be confident that we were using our hardware to its full capacity. These settings were not changed in production.

Why You Should Automate Application Deployment

If you are deploying your applications by hand, you are doing it wrong.

I learned this very important lesson the hard way when the application that I was working on went from being used by one client, to 6. Each client has their own virtual server with a test and production version of the software, as well as configuration files that are different for each test and production instance. That is a total of 24 deployments. 24.

Each code change would involve the following:

  • Make code changes
  • ssh to client server
  • Get latest code from GitHub
  • cd to target directory
  • Run deployment script
  • Restart web component

Configuration changes involved a similar number of steps.

Each deployment took around 2 minutes to complete, assuming it was successful. Adding in the time for changing servers and correcting mistakes, that is around 30 minutes just to deploy a code change.

I was frustrated with the amount of time wasted executing the same steps over and over. That is what computers are good at, so I spent a few hours learning and implementing Capistrano tasks. I can now deploy to multiple servers at once and each deployment is just a matter of running

1
cap deploy:update HOSTS=<client IP addresses>

from my local machine. I don’t even have to login to the client server.

I have gone from a ~30 minute deployment to ~1 minute, saving my time and sanity in the process.

Mongrel Causing High CPU Usage on Windows

The mongrel_service gem is a great thing to have if you are deploying a Ruby on Rails application in a Windows environment. However, you should be aware that is does not handle problems encountered when trying to start the underlying mongrel server.

I was recently called to a client site to resolve an issue with our application causing high CPU usage. I checked the log directory and found that the mongrel.log was over 100MB. This is quite out the ordinary for mongrel.

Running

1
tail -n 50 mongrel.log

revealed that mongrel was encountering a database connection issue when trying to start Rails. This caused a serious problem as there were 5 services constantly trying to start and it brought the server to its knees.

The solution was simple; I rectified the database connection issue and restarted the mongrel_service instances.

If you encounter a similar issue when running mongrel_service on Windows, RAILS_ROOT/log/mongrel.log should point you to the cause of the problem in no time.

Using Ruby and Rails on Windows

Here are some quick notes that will hopefully help you get started with Ruby on Windows.

This post relates to the 7-zip Ruby version available from http://rubyinstaller.org/ and lists the additional libraries that I required to get the Rails application that I work on to run and pass all of the tests.

Libraries

I needed the following libraries in addition to those distributed in the default Ruby installation.

I have packaged these DLLs into an archive available here

Once you have unzipped the Ruby installation, you need to place these DLLs in the /bin directory.

I hope this relieves some the frustration that can occur when first installing Ruby on Windows.

Using Rails, Windows and SQL Server 2005

I have recently had to install a copy of our Rails app  on Windows. Using Ruby on Windows presents its own set of challenges and adding a gem that has a native extension makes it just that little more painful. Here is what I did to get SQL Server 2005 and Ruby to play nicely together on Windows.

The Requirements

To use SQL Server with ActiveRecord we use ODBC.

Here are the things that are required to get this working correctly.

  • A Windows install of Ruby, obviously. I used the 7-zip package available from http://rubyinstaller.org/downloads/ and  installed it to C:\ruby.
  • A set of DLLs available here I have documented the process for installing Rails on Windows.
  • The activerecord-sqlserver-adapter gem. I used version 2.3.13 because our app is Rails 2.
  • The activerecord-odbc-adapter.
  • The compiled version of the Ruby-ODBC adapter for Windows. Available at http://www.ch-werner.de/rubyodbc/
  • A compiled version of the win32-api gem. This apparently comes with most Windows installations of Ruby, but it wasn’t in mine.

The Process

After installing Ruby and ensuring that at least irb is working correctly with some basic smoke tests, put the ODBC library and the other required DLLs in the bin directory of your Ruby installation.

Set up a system DSN that provides access to your SQL Server installation.

Configure database.yml with an ODBC connection to your system DSN.

1
2
3
4
5
6
development:
  adapter: sqlserver
  mode: odbc
  dsn: your_dsn
  username: database_user
  password: database_password

That’s it. It’s time to fire up script/console and try to connect to your database.

1
2
Loading development environment (Rails 2.3.4)
ActiveRecord::Base.connection

If everything has gone well, you will get an object printed to the console that represents your database connection.

Run Cucumber Scenarios in Parallel

I have recently started to write code that relies on delayed_job for background processing. This means testing delayed_job with Cucumber. After a bit of research, it seems the most common way to test your background task is completing is to use Delayed::Job.work_off. This worked out fine, until I needed to test something that required a delayed_job task to be running when I tested another part of the system. Integrated, like, you know, how it would be in a production environment.

The other problem with using Delayed::Job.work_off in your Cucumber steps, is that it is run serially, and you don’t get to test for potential race conditions that manage to hide in your code until 6pm the day before go live.

This raised the question:

How do you test multiple cucumber steps in parallel?

The answer is: you don’t. At least that is the answer that I kept coming across.

Inspired by this post by the people at Square I decided to run our delayed in a separate process.

Here is what I ended up with:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
class CucumberDelayedJob
  class << self
    attr_accessor :pid

    def start

      self.pid = fork
      if pid
        Process.detach pid
      else
        start_worker
      end
    end

    def stop_worker
      `script/delayed_job stop -- #{Rails.env}`
    end

    def start_worker
      exec('script/delayed_job', 'start', '--', Rails.env)
    end
  end
end

Note that the Rails.env variable is being passed into script/delayed_job. This is important because delayed_job will use the environment given in that parameter to select which database to connect to when checking for new work.

The script above does have its limitations:

  • In its current form you can only run instance of delayed_job. This includes jobs running with a different Rails environment, so all other instances of delayed_job will need to be stopped before this is used.
  • It assumes that it can start and stop the delayed_job instance successfully every time.

This is just a start, it is a very basic way to start and stop a delayed_job instance cleanly.