I find that reading code from other developers and watching them
work helps improve my approach to problems and techniques I use
in my own projects.
Over the last week,
I have been watching the Play by Play videos on PeepCode. One thing
that stood out while watching these videos was how well the
developers knew the toolset that they were using. This made me
rethink how well I know the tools that I use and how I could
improve my knowledege to gain efficencies.
For example, in the Tenderlove play by play, Aaron shows how well
he knows Ruby, Vim and his shell. He uses these tools to stop him
having to remember specific library calls, or how a particular method
is used. He is able to quickly look up anything that he is unsure of.
Knowing that there is a way to to achieve something or the general
concept is enough; a superior knowledge of the tools ensures that
the answer to any question is found quickly.
I like this method of working. Being focused on the outcome of a task
is much more important than remembering a whole lot of library
calls. This is especially true when working in a pair. Gaps in
library knowledge can be picked up the other person in the pair,
but to reach maximum effiency a good knowledge of the tools being
used it required.
Over the past few months, I have done the following things to improve
the knowledge of the tools that I used everyday and in turn improve
my productivity.
Learned how to touch type
This was a big gain in productivity.
I thought that I was a reasonably good typist until I measured my
skills with a typing trainer and found that I made a lot of
mistakes and could only manage an effective speed of around 30
words per minute. After around 2 weeks, I was typing at almost
50 words per minute with 99% accuracy. I am not sure that this is
enough but it was certainly a big improvement.
Customised my Vim setup and learned more about Vim concepts
I started out by reading about how others had started to use Vim.
I then updated my .vimrc with the things that I thought would apply
to my situation. I also installed some plugins relevant to Ruby and
Rails development.
I started to use Git to manage my Vim configuration which allowed me
to have the same configuration across all development machines.
I think these changes would have resulted in approximately a 2 times
improvement in productivity when using Vim.
I recommend updating your knowledge and improving your skills with
the tools you use the most. The payoff is definately work the
investment.
At C3, I have to support a number of different databases, and for a number of reasons,
I use PostgreSQL for development when one of our main clients uses SQL Server 2008
(we have CI builds that test all of our supported databases, of course).
This means that I do not always see the problems that clients face when using our product.
A client complained to my team that the time taken to stage records was unacceptable and
that they would not take a new release of our software until we fixed this issue.
This posed a serious problem to my team as none of us are SQL Server experts and we had
learn very quickly how to make it the fastest database that we supported.
Firstly, we gained some intelligence from the client; what files were they trying to load,
how many columns, how many rows they were loading, and we ended up with the following:
56 columns
~ 10,000 rows of a moderately complex table
Armed with this information, I set out, with another team member, to profile
what were currently doing and looking at ways that we could improve.
Our system maintains audit logs for all records that are updated, deleted, or
added to the system, this means that staging 10,000 rows into a table that
already contains those rows will generate:
an update statement for any rows that currently exist, in the audit table
10,000 upsert (using a
MERGE
statement in the case of SQL Server 2008) statements in the staging table
an insert statement for records that are new in the latest upload, in the audit table
a delete statement for records that should not exist, in the audit table
another set of update records, in the audit table
We were partially on the right track; using a merge statement is the correct
way to perform an update / insert in SQL Server, our mistake was using so many
of them. The SQL Server 2008 query planner does not cache query plans. So,
every time that you execute a query, it must be parsed, compiled, and a query
plan generated, therefore you want to keep the number of queries needed to
perform an operation to a minimum wherever possible, especially if you need to
perform a particular operation thousands of times.
Our second mistake was not
having the correct indexes on the table that we were merging into. In fact,
our client had no indexes on the table that we were merging into. We used the
tools available with SQL Server to try and optimise our queries but it seems
that our merge statement was too complex for the query optimiser, as it
offered us the not very helpful hint of adding all of the columns in our
target table to an index. Not surprisingly, this considerably slowed our
inserts down. After reading up on merge
statement best practices,
we added the appropriate indexes to our join columns.
It was suggested by someone in
another team that the best way to process this number of records was to use a
temporary table, and have the database merge the temporary table into the
target table. This is the technique we ended up using and it is super fast,
but first, we had to get our new records into the temporary table.
SQL Server
supports a number of insert syntaxes, some of which are non standard and allow you to
exclude column names or insert multiple rows with one statement. We
experimented with the T-SQL insert that allowed multiple rows to be inserted,
which is expressed in the following manner:
If the row data is inserted in the same order as the columns in the target
table, you do not have to include the column names in your query, but we chose
to include them for clarity.
This syntax allows you insert up to 1000 records
at a time, using one insert statement. This was an interesting discovery. We
had found the way to reduce the number of insert queries we had to produce and
using this technique we would have 10 inserts into our temporary table, and
the merge from our temporary table into our target table would be super fast.
Mission accomplished!
We were wrong.
After some experimentation with the
number of insert records, we found that as we increased the number of records
inserted, the query slowed down considerably. We found for our situation that
10 records per INSERT statement was the magic number.
Our new procedure for
staging records generated:
an update statement for any rows that currently exist, in the audit table
a temporary table that represented the target table
1000 T-SQL INSERT statements, in the temporary table
1 MERGE statement into the staging table
a drop statement for the temporary table
an insert statement for records that are new in the latest upload, in the audit table
a delete statement for records that should not exist, in the audit table
another set of update records, in the audit table
After implementing the changes mentioned above, and some heavy handed code
optimisation, we managed to take our staging time for the given records from
16 minutes to sub 1 minute.
During our investigation, we also changed the
backup strategies, how much was allocated for logging, and disabled auto-
resize for all logs. Before you modify your backup strategy, you should
read and fully
understand the implications for your organisation in doing so. Doing this
allowed us to focus on optimising our code and be confident that we were using
our hardware to its full capacity. These settings were not changed in
production.
If you are deploying your applications by hand, you are doing it wrong.
I learned this very important lesson the hard way when the application that I was working on
went from being used by one client, to 6. Each client has their own virtual server with a
test and production version of the software, as well as configuration files that are
different for each test and production instance. That is a total of 24 deployments. 24.
Each code change would involve the following:
Make code changes
ssh to client server
Get latest code from GitHub
cd to target directory
Run deployment script
Restart web component
Configuration changes involved a similar number of steps.
Each deployment took around 2 minutes to complete, assuming it was successful. Adding in the time for changing servers and correcting mistakes, that is around 30 minutes just to deploy a code change.
I was frustrated with the amount of time wasted executing the same steps over and over.
That is what computers are good at, so I spent a few hours learning and implementing
Capistrano
tasks. I can now deploy to multiple servers at once and each deployment is just a matter of running
1
capdeploy:updateHOSTS=<clientIPaddresses>
from my local machine. I don’t even have to login to the client server.
I have gone from a ~30 minute deployment to ~1 minute, saving my time and sanity in the process.
The mongrel_service gem
is a great thing to have if you are deploying a Ruby on Rails application in a Windows environment.
However, you should be aware that is does not handle problems encountered when trying to start the
underlying mongrel server.
I was recently called to a client site to resolve an issue with our
application causing high CPU usage. I checked the log directory and
found that the mongrel.log was over 100MB. This is quite out the ordinary for mongrel.
Running
1
tail -n 50 mongrel.log
revealed that mongrel was encountering a database connection issue when trying to start Rails.
This caused a serious problem as there were 5 services constantly trying to start and it
brought the server to its knees.
The solution was simple; I rectified the database connection issue and restarted the mongrel_service instances.
If you encounter a similar issue when running mongrel_service on Windows,
RAILS_ROOT/log/mongrel.log should point you to the cause of the problem in no time.
Here are some quick notes that will hopefully help you get started with Ruby on Windows.
This post relates to the 7-zip Ruby version available from http://rubyinstaller.org/
and lists the additional libraries that I required to get the Rails application that I work on to
run and pass all of the tests.
Libraries
I needed the following libraries in addition to those distributed in the default Ruby installation.
I have recently had to install a copy of our Rails app on Windows. Using Ruby on Windows presents its own set of challenges and adding a gem that has a native extension makes it just that little more painful. Here is what I did to get SQL Server 2005 and Ruby to play nicely together on Windows.
The Requirements
To use SQL Server with ActiveRecord we use ODBC.
Here are the things that are required to get this working correctly.
A compiled version of the win32-api gem. This apparently comes with most Windows installations of Ruby, but it wasn’t in mine.
The Process
After installing Ruby and ensuring that at least irb is working correctly with some basic smoke tests,
put the ODBC library and the other required DLLs in the bin directory of your Ruby installation.
Set up a system DSN
that provides access to your SQL Server installation.
Configure database.yml with an ODBC connection to your system DSN.
I have recently started to write code that relies on delayed_job for background processing.
This means testing delayed_job with Cucumber. After a bit of research, it
seems the most common way to test your background task is completing is to
use Delayed::Job.work_off. This worked out fine, until I needed to test
something that required a delayed_job task to be running when I tested another
part of the system. Integrated, like, you know, how it would be in a production environment.
The other problem with using Delayed::Job.work_off in your Cucumber steps, is that it is run serially,
and you don’t get to test for potential race conditions
that manage to hide in your code until 6pm the day before go live.
This raised the question:
How do you test multiple cucumber steps in parallel?
The answer is: you don’t. At least that is the answer that I kept coming across.
Inspired by this post
by the people at Square
I decided to run our delayed in a separate process.
Note that the Rails.env variable is being passed into script/delayed_job.
This is important because delayed_job will use the environment given in that parameter
to select which database to connect to when checking for new work.
The script above does have its limitations:
In its current form you can only run instance of delayed_job.
This includes jobs running with a different Rails environment, so all
other instances of delayed_job will need to be stopped before this is used.
It assumes that it can start and stop the delayed_job instance successfully every time.
This is just a start, it is a very basic way to start and stop a delayed_job instance cleanly.