Saturday, August 21, 2010

Effective Aperture

Ever wondered what the numbers on an SLR camera's lens meant? My Nikon D80 has an 18-135mm 3.5-5.6 zoom lens. At the shortest focal length - 18mm - the maximum effective aperture is f/3.5; at the longest focal length - 135mm - it is f/5.6. Why the difference? And what does it mean?

A stop (note: a stop is not an f/stop) is an important measure in photography: it describes how much light reaches the device inside the camera that captures the image. Increasing the light by 1 stop is a two-fold increase in light; decreasing by 1 stop is a halving of light. Modern cameras control the amount of exposure by varying the shutter speed and/or the aperture size.

Ignoring shutter speed for now, we look at aperture size for the rest of the post. The diameter (and radius) of a circle is a function of the circle's area. Double the area (increase the aperture by 1 stop) and the diameter goes up by a factor of SQRT(2); halve the area and the diameter comes down by a factor of 1/SQRT(2). Taking the first 10 stops (starting at 0, not just because I like programming), let's see what the areas and diameters (unitless ratios of the starting values) would be.

Stop  Area  Diameter
0     1     1
1     2     1.4
2     4     2
3     8     2.8
4     16    4
5     32    5.6
6     64    8
7     128   11
8     256   16
9     512   22
10    1024  32

Does the sequence on the right look familiar? It's the standard series of f-numbers: 5.6 appears in it directly, and 3.5 falls between 2.8 and 4, a fractional stop. Both numbers are stamped on the lens. These numbers are f/stops.

Here's a definition: the f/stop (f-number) is the ratio of the lens's focal length to the diameter of the aperture, so f/5.6 means the aperture's diameter is the focal length divided by 5.6.

Starting at the longest focal length for this lens - 135mm - we discover the maximum effective aperture is f/5.6. As you know, things that are further away appear smaller, so to maintain a constant effective aperture while shortening the focal length, the lens must shrink the real aperture. At the shortest focal length - 18mm - the same effective aperture of f/5.6 has a much smaller real aperture than at the longest focal length (18/135 of the diameter, to be precise) yet it's delivering exactly the same amount of light per unit area of the sensor. At this focal length, this lens (but not all lenses) will allow us a bigger effective aperture of f/3.5.

The average zoom lens (with a range of focal lengths) will also have a range of maximum effective apertures. A 400mm lens with an f/2.8 aperture will be very wide (the aperture alone must be about 143mm across) and heavy, while a shorter lens with the same f/2.8 aperture might fit (figuratively) in your pocket.
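If you like seeing the arithmetic as code, here's a minimal C# sketch of the relationship; the focal lengths and f-numbers are just the examples from this post:

using System;

class ApertureMath
{
    // Diameter of the entrance pupil for a given focal length and f-number.
    static double ApertureDiameter(double focalLengthMm, double fNumber)
    {
        return focalLengthMm / fNumber;
    }

    static void Main()
    {
        Console.WriteLine(ApertureDiameter(18, 3.5));   // ~5.1mm: wide end of the 18-135mm
        Console.WriteLine(ApertureDiameter(135, 5.6));  // ~24.1mm: long end of the 18-135mm
        Console.WriteLine(ApertureDiameter(400, 2.8));  // ~142.9mm: why a 400mm f/2.8 is so wide
    }
}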

Monday, June 14, 2010

Idle = m + C1 + C2 + C3

There is a group of processor performance counters in Windows called "% Cx Time", which show the breakdown of processor idle time spent in each of the low-power modes. Not all processors support these low-power idle states. The higher the number, the lower the power used by the processor, and the longer the latency before it can get back up to full speed again.
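If you want to watch these counters from code, here's a small C# sketch using System.Diagnostics.PerformanceCounter. The counter names assume an English-language Windows install, and not every machine exposes the C2/C3 counters:

using System;
using System.Diagnostics;
using System.Threading;

class CStateMonitor
{
    static void Main()
    {
        PerformanceCounter idle = new PerformanceCounter("Processor", "% Idle Time", "_Total");
        PerformanceCounter c1 = new PerformanceCounter("Processor", "% C1 Time", "_Total");
        PerformanceCounter c2 = new PerformanceCounter("Processor", "% C2 Time", "_Total");
        PerformanceCounter c3 = new PerformanceCounter("Processor", "% C3 Time", "_Total");

        // The first NextValue() call only establishes a baseline; sample again after a pause.
        idle.NextValue(); c1.NextValue(); c2.NextValue(); c3.NextValue();
        Thread.Sleep(1000);

        Console.WriteLine("Idle={0:F1}% C1={1:F1}% C2={2:F1}% C3={3:F1}%",
            idle.NextValue(), c1.NextValue(), c2.NextValue(), c3.NextValue());
    }
}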

Thursday, June 10, 2010

CIL Constant Literals

Sometimes you'll want to create constant literals in CIL code. Most of the time you won't, but when you do, you'll need this. A constant can appear in several places - as a constructor argument to a custom attribute, for example - and ildasm.exe will give you something like this:
= ( DE AD BE EF )
If you want to re-write it, you know the signature (it's right next to the literal in the .il file), so you only need to replace the encoded binary literal with a readable one like this:
= { string('Hello, World!') bool(true) type(int32) type([System.Xml]System.Xml.XmlDocument) }

Monday, June 07, 2010

Code Motion

The code we write in C# undergoes many changes before it is finally executed. There's the usual stuff interviews are made of: C# source is compiled to IL, a byte code representation that can be understood by the CLR. The CLR has another compiler - the JIT - which takes this byte code and turns each method into a sequence of instructions that can be understood by the hardware inside the computer. Then the hardware fetches the instructions from memory, decodes them, and executes them (fetching and writing the program's data to memory) by driving voltages up and down across connected items inside your computer. If you thought your C# loop was executed as it's written, you're in for a treat: it's not.

The C# compiler produces different byte code depending on whether optimisation is turned on or off (off for DEBUG by default, which makes the emitted byte code look more like the source code, for easier pairing with the PDB files). The JIT compiler might take a loop and compile it down to a single inline sequence of instructions if that would make execution faster. The processor might have multiple pipelines in which it executes instruction streams in parallel (making use of the processor at times when it would otherwise be wasting cycles waiting for RAM or disk to send back data). Even after memory writes have been completed, they may be temporarily buffered close to the CPU rather than being expensively written straight out to main memory.

All of these changes are examples of code motion. To the average C# developer, most of these changes go unnoticed; on a single thread we have virtually no way of ever finding out that the instructions were actually executed out of order: the contract of C# guarantees these things for us. But in parallel processing with shared state, we start to witness evidence of the underlying chaos. Fields are updated after we expect them to be - or before we expect. To limit any negative effects of code motion, the C# designers have given us the volatile keyword. It can be applied to instance and static fields (but not to method local variables, as these cannot ordinarily be accessed by multiple threads - they live on their own thread's stack). Read/write accesses to memory locations marked as volatile are protected by a "memory barrier". In oversimplified terms, any source code before a volatile read/write is guaranteed to have completed before the access; similarly, any source code after a volatile read/write will only be conducted after the access. To make your code ridiculously safe, you could add volatile to every updatable field (i.e. anything that isn't const or readonly), but you'd be losing out on all the positive benefits of code motion.
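Here's a minimal, self-contained C# sketch of the classic use of volatile: a stop flag shared between two threads.

using System;
using System.Threading;

class VolatileFlag
{
    // Without volatile, code motion could let the worker keep re-reading a stale
    // copy of this field; volatile forces a fresh read on every loop iteration.
    private static volatile bool stop;

    static void Main()
    {
        Thread worker = new Thread(() =>
        {
            long iterations = 0;
            while (!stop)       // volatile read: sees the write made in Main
                iterations++;
            Console.WriteLine("Stopped after {0} iterations", iterations);
        });

        worker.Start();
        Thread.Sleep(500);
        stop = true;            // volatile write: visible to the worker thread
        worker.Join();
    }
}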

Thursday, May 27, 2010

Dual Arduino Part 1 - Bus

I bought my first Arduino to help answer the question, "What can open source hardware teach me about running concurrent processes on my own computer?" The second was purchased to indulge my quest for knowledge of parallelism at a similarly low level, with Symmetric Multiprocessing (SMP). The third - a Mini - was just because I'm hooked. This is a post about the first and second. I'm only telling you about the Mini because of how great I think these little things are!

Threading: it's a technique that allows two logical processes to run concurrently, using shared physical computational resources. A program is usually logically separated into code and data. Code is executed, while data is read/written during execution (although executing data is not usually "done", it's completely possible). In a single-threaded program there is one call stack (representing the current method) and a bunch of special memory locations called registers; one of the registers "points" to the next instruction that will be executed. In a dual-threaded program, there must be two call stacks (one for each method currently being executed) and two sets of these special memory locations (in one, the instruction pointer (IP) points to one part of the code; in the other it (usually) points somewhere different in the same code). However, in a single-processor system only one of these sets can be stored in the physical registers at any given time, because there is only one set of physical registers. The other must be saved somewhere nearby, waiting in the wings. Something has to stop the current thread X, save its register values, restore Y's register values and start Y - i.e. the context switch.

SMP: two (or more) physical processors share the same bus (and thus access to all other hardware). With one thread, one processor would sit idle. With two threads, both processors get to do something (and unlike our single-processor example above, both sets of register values can now be physically stored in processor registers). With just two threads, we don't have to switch contexts and both processors can keep busy the whole time. Well, not quite all the time. While processor A is performing I/O operations against device C (reading/writing values in memory), the bus is busy and processor B must wait for access to C. Immediately, a memory bottleneck. Luckily, not all I/O needs to go over the bus every time: each processor has its own little local cache.

In the case of the Arduino, I don't have any fancy switching hardware to blur the distinction between buses and networks, so my bus is a set of parallel copper wires (some wires for addresses and some for data) connecting two Arduino devices to four EEPROM memory devices. Data written to an EEPROM by one processor can be read from the same EEPROM by another processor, and vice versa.

If both the Arduino devices are masters and all the EEPROM devices are slaves, how do we arbitrate access to the bus? How indeed. Something needs to send a clock signal. Something needs to stop both processors from trying to talk at the same time.

Monday, May 17, 2010

Reset IE Start Page

Here's a PowerShell snippet that will reset the home page in Internet Explorer; useful if you work somewhere that always resets your home page to its own corporate one:
Set-ItemProperty `
-path "HKCU:\Software\Microsoft\Internet Explorer\Main" `
-name "Start Page" `
-value "http://www.google.co.uk"

Tuesday, May 04, 2010

Yet Another Post About Compression in IIS 6.0

By default, IIS won't try to compress your dynamically generated ASPX pages even if you've got dynamic compression enabled as per my earlier posts. Go ahead and include the ASPX extension like this:
cscript C:\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/Filters/Compression/Deflate/HcFileExtensions htm html txt css js
cscript C:\Inetpub\AdminScripts\adsutil.vbs SET W3SVC/Filters/Compression/Deflate/HcScriptFileExtensions asp dll exe aspx

Sunday, May 02, 2010

Parallelism

It's not a word that rolls off your tongue. Try explaining - without appearing to have recently suffered a blow to the head - to a room full of people that turning down your SQL Server's maximum degree of parallelism may solve some of their performance woes.

Parallelism (this definition is stolen from the beginning of Joe Duffy's book, Concurrent Programming on Windows) is the use of concurrency to decompose an operation into finer grained constituent parts so that the independent parts can run [separately].

Parallelism is intended to take advantage of physical resources that might otherwise sit unused. If you have a large table or index that needs to be scanned, and you have 24 available cores, SQL Server might want to parallelize your query into 24 concurrent tasks. In practice this usually leads to resource starvation: how many database servers are sitting idle all the time, waiting for your "embarrassingly parallel" query to arrive? (Do you need to put a question mark at the end of a rhetorical question?) On a SQL Server, Microsoft's official word is to keep the maximum degree of parallelism (MAXDOP) below 8 or the number of cores in a single socket, whichever is lower. Don't forget that in the example above, your server could also need up to 24x the memory to turn the sequential operation into a concurrent one.

PFX (Parallel Extensions) is a .NET library for enabling parallelism in your C# code. Although I haven't got to the bottom of it, I'm tempted to believe that its benefit - too - would depend entirely on the load profile of your server, so give it a try and make sure you test with an adequate cross section of concurrent users, to make sure that parallel selfishness isn't the cause of any new performance issues.
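As a taste of PFX, here's a small self-contained sketch using Parallel.For from .NET 4 with an explicit cap on the degree of parallelism; the loop body and the cap of 4 are purely illustrative:

using System;
using System.Threading.Tasks;

class PfxDemo
{
    static void Main()
    {
        // Cap the parallelism explicitly rather than letting the scheduler grab
        // every core; tune the number against your real load profile.
        ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        Parallel.For(0, 1000, options, i =>
        {
            // Some CPU-bound work item, independent of every other i.
            double x = Math.Sqrt(i) * Math.Sin(i);
        });

        Console.WriteLine("Done");
    }
}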

Monday, April 26, 2010

Hello, Axum!

using System;
using Microsoft.Axum;

public domain HelloWorld {
    public agent Program : channel Application {
        public Program() {
            string[] args = receive(PrimaryChannel::CommandLine);
            Console.WriteLine("Hello, World!");
            PrimaryChannel::Done <-- Signal.Value;
        }
    }
}

Saturday, April 10, 2010

Jerk: Scalability and Higher Order Derivatives

I've long thought of scalability as the first derivative in the relationship between resources and operations per unit of time. With no resources, we usually achieve no operations, no matter how long we wait. With a single resource, we might achieve Om operations over a time span of T. By adding a second resource, we might achieve On operations over the same time span. We would say that the marginal scalability between Rm (1) and Rn (2) was (On - Om) / (Rn - Rm). Looking at Operations vs. Resources on a line chart, this marginal scalability would be the slope of the line connecting ORm and ORn. Knowing your application's scalability between m and n is a starting point, but the probability of ideal linear scalability in a real world application is 1/infinity. We need to know what happens when you add a third resource. This (the second derivative) will tell us the rate at which the scalability changes as more resources are applied, and can be a much more useful figure. It may take a little more time until I can clarify what jerk (the third derivative) might meaningfully contribute to a discussion of scalability, but essentially it's the change in acceleration. :-)
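To make the first and second derivatives concrete, here's a tiny C# sketch over some purely hypothetical throughput measurements:

using System;

class ScalabilityCurve
{
    static void Main()
    {
        // Hypothetical measurements: operations per time span T at 1..4 resources.
        int[] resources = { 1, 2, 3, 4 };
        double[] operations = { 100, 190, 265, 320 };

        for (int i = 1; i < resources.Length; i++)
        {
            // First difference: marginal scalability between Rm and Rn.
            double marginal = (operations[i] - operations[i - 1]) /
                              (resources[i] - resources[i - 1]);
            Console.WriteLine("R{0} -> R{1}: marginal scalability = {2}",
                resources[i - 1], resources[i], marginal);
        }
        // Output: 90, 75, 55. The second difference (-15, then -20) is the rate at
        // which scalability itself decays as more resources are applied.
    }
}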

Wednesday, April 07, 2010

WebAdministration

If you're running Windows 7, PowerShell 2.0 and IIS 7.5 you may wonder how to get started with modifying your IIS configuration using PowerShell scripts (the iis.net web site is a bit unclear, the web installer gives an unhelpful lie phrased as an error message, and the downloadable install gives up without being much help either).
No need for extra installs beyond the usual Windows Features -> Internet Information Services install; just follow these steps:

  • "Run as Administrator" when starting PowerShell.
  • Set-ExecutionPolicy RemoteSigned (or whatever blows your hair back)
  • Import-Module WebAdministration
  • Get-PSDrive
  • cd IIS:

Saturday, April 03, 2010

F#ing Around

In my last post, I wrote a tiny standalone F# program that would print out a reversed string. It wasn't enough; I wanted to call the function from C#. Difficult? Not at all, but there are a couple of tricks that might catch you out at first.
1) Add a new C# project to the solution
2) In the C# project, add a reference to the F# project
3) One of two (or is that three?) things:
a) did you name your F# module (with a declaration at the top of the .fs file like module MyFirstGreeting)? If yes, you will reference the function from C# by its fully qualified name: MyFirstGreeting.reverse(...)
b) do you have an unnamed F# module? If yes, it assumes the name of the .fs file, most likely Program, so you'll need to invoke the function by calling Program.reverse(...)
c) if you answered yes to b) and the class in your C# project is also called Program, this is the trick you've been waiting for: use the global namespace qualifier! global::Program.reverse(...). Remember your C# Program's fully qualified class name is actually something like ConsoleApplication2.Program.
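For completeness, here's a sketch of what case c) looks like from the C# side, assuming the reverse function from the earlier post and a C# class that's also called Program:

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            // The unnamed F# module compiled from Program.fs sits in the global
            // namespace, so global:: disambiguates it from this C# class.
            string reversed = global::Program.reverse("!dlroW ,olleH");
            System.Console.WriteLine(reversed);
        }
    }
}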

Compile and run! Happy days! And marvel at the wonder of Visual Studio 2010 as you step through the code from C# to F# (the call stack window even shows which language your current stack frame is written in!). This is going to be a fun journey...

Monday, March 29, 2010

Hello F#

I just got my Release Candidate of Visual Studio 2010 up and running so I thought I'd do the obligatory recursive greeting function (not as pretty as Ruby or Python, but we're talking about F# here!)
let rec reverse (value : string) =
    if value.Length < 2 then
        value
    else
        value.Chars(value.Length - 1).ToString() + reverse(value.Substring(0, value.Length - 1))

[<EntryPoint>]
let main (args) =
    printfn "%s" (reverse("!dlroW ,olleH"))
    0

Tuesday, March 23, 2010

Installing memcached

On Ubuntu:
* download the memcached source from memcached.org (it's on Google Code at http://code.google.com/p/memcached/)
* download the libevent source from http://www.monkey.org/~provos/libevent/
* open a terminal window
* cd to where the two tar.gz files are
* tar -zxvf memcached-1.4.4.tar.gz
* tar -zxvf libevent-1.4.10-stable.tar.gz
* cd into the libevent folder
* ./configure && make
* sudo make install
* cd up and into the memcached folder
* ./configure && make
* sudo make install
* at this point we could try starting memcached, but it would probably fall over with an error about not being able to find libevent. We need to give memcached an easy way to find libevent. (running whereis libevent is good enough for us to verify where it's actually been installed: I suspect /usr/local/lib)
* sudo gedit /etc/ld.so.conf.d/libevent-i386.conf
* on its own line in the file, enter the following, then save and quit gedit:
* /usr/local/lib/
* even though you may have just installed it on an amd64 box, the file name needs to be i386
* sudo ldconfig is the final set up step
* now fire up memcached in verbose mode (use the -vv parameter)
* alternatively run it in daemon mode (as it was intended, with the -d parameter)
* once it's up and running you can muck around with it by starting a telnet session
* telnet localhost 11211
* try some of the commands in the very helpful reference http://lzone.de/articles/memcached.htm
* :-)

Monday, March 22, 2010

"Velocity" Part 1

It's just the CTP3 version, but it's nearly rough enough to put me off new technology for a while. I eventually got it set up on a single machine using SQL Server for its configuration store, which necessitated creating a login, a database named "VelocityCacheConfig", and linking the login to a user "Velocity" with the db_datareader, db_datawriter and db_owner roles.

Pre-requisites: .NET 3.5 SP1 and Windows PowerShell.

You need to run the supplied PowerShell script as admin. Fair's fair: its intended usage is to manage the cache cluster, and you wouldn't want just anybody doing that.

There was some error message about the firewall.

It created a new region for every item added to the cache.

It took a long time to Remove-Cache and you can't New-Cache with the same cache name until the old one's removed. Get-CacheHelp lists the PowerShell cmdlets for working with the cache.

Need to add references to ClientLibrary.dll and CacheBaseLibrary.dll in your Visual Studio project.

Before running the following C#, you'd have to Start-CacheCluster, and Add-Cache "Test".

using Microsoft.Data.Caching;

DataCacheServerEndpoint[] endpoints = new DataCacheServerEndpoint[]
{
    new DataCacheServerEndpoint("SCANDIUM", 22233, "DistributedCacheService")
};
DataCacheFactory factory = new DataCacheFactory(endpoints, true, false);
DataCache cache = factory.GetCache("Test");
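From there, basic cache access looks something like this - a sketch from memory of the CTP3 API (Put and Get on DataCache), so treat the exact signatures with suspicion:

// Continuing from the snippet above: store and retrieve an item in the "Test" cache.
cache.Put("greeting", "Hello, Velocity!");
string greeting = (string)cache.Get("greeting");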

Wednesday, March 10, 2010

SqlBulkCopy

SQL Server 2008 has table-valued parameters in ADO.NET, but if you're stuck with poorly performing inserts and a previous version of SQL Server, you can still get some awesome speed with the System.Data.SqlClient.SqlBulkCopy class. Just create a new one (in a using block), set the destination table name and optionally a set of column mappings, and fire away. I ran a test with 65536 rows (width: 3 x INT + 1 x FLOAT) and it completed in 750ms, compared with 36'447ms to send the equivalent table row-by-bleeding-row using multiple stored procedure invocations. Also, it's much easier to use than writing your own wrapper around bcp.exe (the only option if you're stuck with Sybase)!
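Here's a minimal sketch of SqlBulkCopy usage; the connection string, table and column names are illustrative rather than the ones from my test:

using System.Data;
using System.Data.SqlClient;

class BulkLoader
{
    static void BulkInsert(string connectionString, DataTable rows)
    {
        using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "dbo.Measurements";
            bulkCopy.BatchSize = 4096;

            // Optional: map source columns to destination columns by name.
            bulkCopy.ColumnMappings.Add("SensorId", "SensorId");
            bulkCopy.ColumnMappings.Add("Reading", "Reading");

            bulkCopy.WriteToServer(rows);
        }
    }
}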

Tuesday, March 09, 2010

Unicode

UTF-16 is a way of representing all of the UCS code points in two (or four) bytes. It can encode all of the code points from the Basic Multilingual Plane (BMP) in just two bytes, but code points from the other planes are encoded into surrogate pairs. UCS-2, as used by SQL Server for all Unicode text, was the precursor to UTF-16 and can only handle code points from the BMP. It is forward compatible with UTF-16, but any code point outside the BMP encoded in UTF-16 will appear to be two separate code points *inside* the BMP if the encoding is treated as UCS-2. The data will be preserved - it is only the semantics (i.e. the abstract "character(s)") that differ.

Note: Encountering such code points is extremely unlikely in modern business applications, as the scripts outside the BMP are mostly academic and/or historical, such as Phoenician. For those purposes, UCS-2 and UTF-16 can be considered equivalent and interchangeable.

Side note: UTF-8 is just another way of representing all of the UCS code points from all of the planes, but the bits are encoded in a different way to UTF-16.

Extra note: .NET uses UTF-16 as the in-memory encoding of the System.String type. A .NET System.Char is limited to 16 bits, and therefore a single char cannot hold a UTF-16 surrogate pair (i.e. any code point not on the BMP). The char data type is similar to UCS-2 in this respect. Getting the char at a specific index of a string will always return a single 16-bit char, essentially breaking up a surrogate pair if one exists at that index of the string.
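A quick C# illustration of that last point, using U+1D11E (the musical symbol G clef), which lives outside the BMP:

using System;

class SurrogateDemo
{
    static void Main()
    {
        // U+1D11E is encoded in UTF-16 as a surrogate pair: two System.Char values.
        string clef = "\U0001D11E";

        Console.WriteLine(clef.Length);                             // 2
        Console.WriteLine(char.IsHighSurrogate(clef[0]));           // True
        Console.WriteLine(char.IsLowSurrogate(clef[1]));            // True
        Console.WriteLine(char.ConvertToUtf32(clef, 0) == 0x1D11E); // True
    }
}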

Saturday, March 06, 2010

Alternative to Clustering

Today my desktop machine had a hardware failure. Everything froze up, I tried rebooting and all I got was a symphony of beeps from the BIOS. Nada. If this had been a production database server and it wasn't in a cluster (as you can tell I'm having a lot of clustering problems lately) I would have been up the creek without a paddle. As it happens, I have a spare box lying around with very similar spec, and I was able to take my boot disk out of box 1 and put it into box 2 with very little downtime. It got me thinking. Even if you don't have a cluster with a warm machine waiting for failover, you could still reduce the cost of the outage by keeping a cold server of similar spec waiting for just such an occasion. If the data and log files are stored in a SAN, you could theoretically just bring up the cold server and attach the databases, and do some DNS magic on the clients... Would it work? Well, isn't that what DR days are for?

System.IO.Compression and LeaveOpen

I got into the habit of stacking up blocks of using statements to avoid lots of nesting and indentation in my code, ultimately making it more readable to humans!

using (MemoryStream stream = new MemoryStream())
using (DeflateStream deflater = new DeflateStream(stream, CompressionMode.Compress))
using (StreamWriter writer = new StreamWriter(deflater))
{
    // write strings into a compressed memory stream
}


The above example shows where this doesn't work though. The intention was to compress string data into an in-memory buffer, but it wasn't working as expected. There was a bug in my code!

When you're using a DeflateStream or a GZipStream, you're (hopefully) writing more bytes into it than it's writing to its underlying stream. You may choose to Flush() it but both streams are still open and you can continue to write data into the compression stream. Until the compression stream is closed, however, the underlying stream will not contain completely valid and readable data. When you Close() it, it writes out the final bytes that make the stream valid. By default, the behaviour of the compression stream is to Close() the underlying stream when it's closed, but when that underlying stream is a MemoryStream this leaves you with a valid compressed byte stream somewhere in memory that's inaccessible!

What you need to do instead is leave the underlying stream open, using the extra constructor argument on the compression stream, like this:

using (MemoryStream stream = new MemoryStream())
{
    using (DeflateStream deflater = new DeflateStream(stream, CompressionMode.Compress, true))
    using (StreamWriter writer = new StreamWriter(deflater))
    {
        // write strings into a compressed memory stream
    }
    // access the still-open memory stream's buffer
}
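Here's the whole round trip as a self-contained sketch, showing that the buffer really does contain a valid compressed stream once the inner using block has closed the DeflateStream:

using System;
using System.IO;
using System.IO.Compression;

class RoundTrip
{
    static void Main()
    {
        byte[] compressed;
        using (MemoryStream stream = new MemoryStream())
        {
            using (DeflateStream deflater = new DeflateStream(stream, CompressionMode.Compress, true))
            using (StreamWriter writer = new StreamWriter(deflater))
            {
                writer.Write("Hello, World! Hello, World! Hello, World!");
            }
            // Safe to read now: the compression stream is closed, the MemoryStream isn't.
            compressed = stream.ToArray();
        }

        using (MemoryStream stream = new MemoryStream(compressed))
        using (DeflateStream inflater = new DeflateStream(stream, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(inflater))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}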

Friday, March 05, 2010

Powershell Script for New Guid

> function NewID { [System.Console]::WriteLine([System.Guid]::NewGuid()); }
> NewID
83ced965-68e6-454d-aa8e-2f056ae1a030

Money and Trouble

Use "Swim Lanes" to isolate the money makers and to isolate the trouble makers. The principle is intended to give higher quality service to paying customers, or more specifically the ones on whom your business is more dependent. It is also intended to isolate failures in troublesome components so that they do not affect the overall system negatively.

Monday, March 01, 2010

Failover Fail

I couldn't get my SQL Server Failover Cluster up and running! Full disclosure: I couldn't even get the Windows Server 2008 Failover Cluster running. It failed the Validation Check "Validate SCSI-3 Persistent Reservation"; it looks like iscsitarget doesn't support that particular feature. Back to square one.

Sunday, February 28, 2010

MapReduce in a Nutshell: Reduce

What we've established so far (see my earlier post) is that we can take a large set of data, partition it into smaller sets, and distribute those sets for processing to take advantage of parallelism. Each of those smaller sets ("library slices" in the example) generates its own set of intermediate results. In order to get our results, we must process all of the intermediate results, but we've already established that the data set is too large to perform the operation in one go. Here is where the Reduce step of MapReduce comes into play.

Remember that we called Map() once* for every book. We now need to call Reduce() once* for every distinct word.

public void Reduce(string key, IEnumerable<byte> values)
{
    long total = 0;
    foreach (byte value in values)
        total += value;
    Emit(key, total);
}


Take a look at the intermediate results (showing just two of the keys from each slice):
SLICE:0
WORDS:[("the", [1,1,1]),("chicken", [1])]
SLICE:1
WORDS:[("the", [1,1,1,1]),("chicken", [1])]


We're only going to call Reduce() once* per distinct word, but the key "the" exists in two of our slices. Both of those slices need to be logically joined before Reduce() can be called, with the combined intermediate values from both slices. Seeing as we're not using a framework, we need to manually set up some sort of routing that will ensure matching keys are joined. This is why Reduce() takes a key and a list of values as its input.

We used 5 Mapper objects, but there's no requirement for the same number of Reducers. Because we don't care about optimum performance yet, I've chosen to use 3. In real life, we would tune the number based on the hardware that's available, and the amount of data that needs to be pushed around to complete the process.

Here's a simple routing function that partitions words based on their initial letter, always routing matching words to the same reducer:
static int Route(string word)
{
    if (word[0] <= 'e')
        return 0; // route to reducer 0
    else if (word[0] <= 'r')
        return 1; // route to reducer 1
    else
        return 2; // route to reducer 2
}


As it turns out, the naive routing function will split our data like this:

Reducer 0    Reducer 1    Reducer 2
a            feathers     sticking
and          first        the
beginning    god          till
butt         heaven       up
cared        in           which
chicken      make         who
created      not          would
crossed      or           years
does         reported     you
down         road         your
earth
egg


The program continues to execute, this time calling Emit() once for every distinct word with its final (total) count. The behaviour of Emit() is not concrete, but remember that it's (probably) being called on different physical machines, and the final results - once emitted - need to be routed back to where the job started.

Saturday, February 27, 2010

MapReduce in a Nutshell: Map

You have a problem to solve, but the set of input data is far too large to solve it by simply looping over all the data.

What if you broke the input data into chunks and analysed only a chunk at a time, somehow aggregating the intermediate results into something meaningful at a later stage?

How do you perform this partial analysis in such a way that an aggregation is possible?

The standard Wikipedia example counts all the words in a collection of books. The only problem is: the collection of books is too large, or the intermediate word count is too large, to run on one conventional computer.

So. First of all, we have the superset of books: the library. We need to break it up. Maybe if it was 1/5th the size, we could analyze it in a single process? We will call each 1/5th a slice of the library.

We still need to define our function that will do the first step of analysis on each book, remembering it will be called once* for each and every book in the library.
void Map(Book book)
{
    foreach (string word in book.Words)
        EmitIntermediate(word, 1);
}


Now we have the function, where do we put it? On a Mapper object, of course. We will create one* mapper for each slice of the library, then we will invoke the Map() function once per book in that library slice. Then for every word in the book, we will call the mysterious EmitIntermediate() function.

For simplicity, let's just assume that EmitIntermediate delegates its call to an Emitter object. We will probably have one Emitter per library slice; each Mapper will have its own Emitter. This is good for isolating Mappers: it means we're not sharing state between mappers, so there won't be any concurrency issues even though we're dealing with multiple Mappers working in parallel.

Here is a possible implementation of an Emitter for our example - it differs from the standard example in that it performs some arithmetic, continuously updating the slice's word count:
class Emitter
{
    private Dictionary<string, long> intermediate; // constructed elsewhere
    public void Emit(string key, long value)
    {
        if (!intermediate.ContainsKey(key))
            intermediate.Add(key, 0);
        intermediate[key] += value;
    }
}


Here is another possible implementation of an Emitter - it's more like the standard example as it stores a list of values, to be summed up at a later stage. Because we know that "1" is the only value we ever emit, we can reduce the amount of storage space required by defining the values as a list of bytes.
class AnotherEmitter
{
    private Dictionary<string, List<byte>> intermediate; // constructed elsewhere
    public void Emit(string key, byte value)
    {
        if (!intermediate.ContainsKey(key))
            intermediate.Add(key, new List<byte>());
        intermediate[key].Add(value);
    }
}


There's no requirement for the Emitter's keys and values to be backed by an object in memory; this is just a simplification to explain the MapReduce concept.

Let's start this program up and run it. We take our library and divide it into slices. For each slice we create a Mapper and an Emitter. For each book in the slice we call the Map() function, which calls EmitIntermediate() for every word it finds. Once all the slices have been mapped, we inspect the Emitters. There's one Emitter per slice, and it contains all of the words for all the books in that slice.

Here's a concrete example of our entire library:
SLICE:0
BOOK:"Fight Club"
WORDS:"Sticking feathers up your butt does not make you a chicken"
BOOK:"The Bible"
WORDS:"In the beginning God created the heaven and the earth"
SLICE:1
BOOK:"Still Life With Woodpecker"
WORDS:"who cared which crossed the road first, the chicken or the egg"
BOOK:"FIASCO"
WORDS:"would not be reported till years down the road"
SLICE:N
BOOK:*
WORDS:*


When we examine our intermediate values for the Emitter on slice 0, we find "the" 3 times (all from the Bible) and "chicken" just 1 time. Unsurprisingly, not a lot of crossover between Fight Club and the Bible. The Emitter on slice 1 has slightly different results: "the" comes in 4 times (3 times from Still Life, and 1 time from FIASCO), but "chicken" is there just 1 time (thanks again to Still Life).
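If you'd like to see those numbers fall out of real code, here's a compact, self-contained C# sketch that inlines Map() and uses a plain dictionary per slice as its Emitter; the library is the same four books from the example above:

using System;
using System.Collections.Generic;

class MapDriver
{
    static void Main()
    {
        // Stand-in library: each slice is a list of books, each book a string of words.
        // Punctuation handling is skipped for brevity; it doesn't affect these counts.
        var slices = new List<List<string>>
        {
            new List<string>
            {
                "Sticking feathers up your butt does not make you a chicken",
                "In the beginning God created the heaven and the earth"
            },
            new List<string>
            {
                "who cared which crossed the road first, the chicken or the egg",
                "would not be reported till years down the road"
            }
        };

        // One intermediate dictionary (Emitter) per slice, so mappers share no state.
        var emitters = new List<Dictionary<string, List<byte>>>();

        foreach (var slice in slices)
        {
            var intermediate = new Dictionary<string, List<byte>>();
            foreach (string book in slice)                    // Map() once per book
                foreach (string word in book.ToLower().Split(' '))
                {
                    if (!intermediate.ContainsKey(word))      // EmitIntermediate(word, 1)
                        intermediate.Add(word, new List<byte>());
                    intermediate[word].Add(1);
                }
            emitters.Add(intermediate);
        }

        Console.WriteLine(emitters[0]["the"].Count);          // 3
        Console.WriteLine(emitters[1]["the"].Count);          // 4
    }
}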

What we need now is something that will aggregate our intermediate results so that we get 7x "the"s and 2x "chicken"s. In the next post, it's time for the Reduce step.

* Lies to children: if we expand our example to become more robust, we could end up retrying Map() on a book that previously failed, even retrying an entire slice with a brand new Mapper if a compute node failed.

Friday, February 26, 2010

Edgar Frank

Why are relational databases so ubiquitous? We are taught about them in Computer Science lessons at university; we see job advertisements requiring SQL skills all the time; almost every project I've touched in my professional career has used a DBMS of some sort, to various degrees of efficacy.

I will admit that to some extent my exposure is biased and self-fulfilling: I took the course, someone heard I was interested in developing the skills, I started looking to reuse and improve my skills, now every project wants to leverage existing skills and a subset of those would hire people that knew no other language but SQL.

But why? Who is designing all these applications? Are they really considering the actual requirements, or are they following the approach prescribed by DBMS vendors: use a relational database. Heck, use our DBMS.

Most database systems worth their salt offer two primary features: a reliable transactional system and a way to protect data integrity; ACID and EI/RI/DI.

Even if we don't use explicit transactions, or even consider them in our designs, the ACID properties (atomic, consistent, isolated, durable) stop simultaneous updates from rendering our entities in an inconsistent state.

Entity, referential and data integrity are enforced by the DBMS, but we need to design the structure of the database to take advantage of them. We declare entity integrity by defining primary keys (no two rows are duplicates), referential integrity by defining foreign keys (no unresolved dependencies) and data integrity by defining column types such as int and datetime.

SQL Server also offers us the ability to store massive amounts of data, indexes to seek right into the bits of data that you need, and a powerful query optimiser that will join related bits of data in the most efficient way possible. It also allows the data to be restored to arbitrary points in time in the past. With failover clustering it even attempts resilience against hardware failures. But despite its prevalence, I know there necessarily exists something (in concept at least) more efficient for every given task.

My perceived problem with DBMSs lies in their applicability to general problems - in short, their ubiquity. People don't question whether or not they need a DBMS; instead they question which vendor to purchase one from. Understanding everything a database does isn't something everybody wants to do, but more and more people are available in the market to build your company's new idea into a tangible product. A recruitment agent has a much easier job selling a labelled skill than selling some undefined/unquantifiable genius. A man (or woman) can make a living by writing SQL code for another man (or woman). It is everywhere. But why re-invent the wheel?

Well, when you don't know how large your application will have to scale up to at design time, you can take the bet that it won't scale, you can save some time and money by hiring a SQL developer off the street, and you will probably deliver something that's functionally correct if you do it right. But down the line, usually when you become a victim of your own success, it will become the bottleneck, and seemingly impossible to extract from the code that's been written around it...

Monday, February 22, 2010

iSCSI Target

I have wanted to setup my own SQL Server failover cluster for quite a while, but never really understood how: a long time ago you had to pick special hardware off a list provided by Microsoft and you needed to run a special edition of Windows Server and you needed Fibre Channel. In short, a recent university graduate wasn't going to get anywhere.

It all seems to have changed though. iSCSI has emerged as an alternative to Fibre Channel, which means that you use your gigabit network to communicate with the SAN instead of some cheeky fibre optic cables (all those years ago it still wouldn't have helped to know that you could use copper cables too). Windows Server 2008 ships with the iSCSI initiator built-in, although you can download it for free from Microsoft if you're running Windows Server 2003 or earlier.

My first mission was simple. Get it working. I chose to run iSCSI Enterprise Target under Ubuntu, with a 16GB USB flash drive formatted in NTFS. This would be my SAN. Note: you need to choose the correct block device name so that you don't end up destroying hardware or losing data. For my flash disk I used /dev/sdc1 (maybe an Ubuntu guru could tell me why /dev/sdc without the number didn't work 100%) and yours will almost certainly be different. [edit: on a fresh install of Karmic, I had to run sudo apt-get update first, otherwise it complained that the package couldn't be found]
sudo apt-get install iscsitarget
Once the installation was complete, I edited the service configuration file
sudo gedit /etc/ietd.conf
Adding in the following two lines:
Target iqn.2010-02.com.extellect:ta.tb.tc.td
Lun 0 Path=/dev/sdc1,Type=fileio

After saving the file and closing the editor, I restarted the Ubuntu machine, and powered up the Windows Server.

I clicked on Start > Administrative Tools > iSCSI Initiator, then clicked Yes to the question about the firewall, and thought about setting the service as a dependency of the SQL Server service so that the drive would be available whenever SQL Server was running.

Looking at the iSCSI Target properties window:
Discovery tab: Add portal: aa.bb.cc.dd port 3260
Targets tab: automatically populated thanks to discovery: iqn.2010-02.com.extellect:ta.tb.tc.td and the status is "Inactive"
Click the Log on button, then OK.
The status is now "Connected"

At this point take a look at the Ubuntu machine to see the connection
sudo cat /proc/net/iet/session

Back to the Windows machine, this time opening Computer Management > Storage > Disk Management
I saw my flash drive but the disk said it was offline - right click and choose online from the context popup menu. Now I saw a familiar (and happy) sight:
Disk 2 Basic 14.95 GB Online - CRUZER (F:) 14.95 GB NTFS Healthy (Primary Partition)

Opened up SQL Server Management Studio and created a new database, choosing F: as the default data and log location. Created a table, inserted and then selected "Hello, World!"

Checked on the Ubuntu machine and there was the mdf and ldf!

Sunday, February 14, 2010

Sequential vs. Random I/O

I mentioned here that sequential I/O was good and random I/O was bad. Bad is the wrong word to use. Random I/O is comparatively less efficient than sequential I/O when accessing the same amount of data. The point to take home is that while a hard disk head is seeking to a different location, it isn't reading and returning data. In the sequential I/O scenario there is very little seeking, and the drive can quickly return all the data in a single stream. In contrast, random I/O makes the disk spend much of its time seeking; if half the time is spent seeking, only half the amount of data can be returned in the same amount of time as the sequential scenario.

Here are two performance considerations for designing and administering a SQL Server database and server:

Index seeks and scans use random and sequential I/O respectively. The performance gained by index seeks comes from the reduction in the amount of data that needs to be accessed. Finding 1 row in just 4 random I/Os is much quicker than scanning through 10,000 I/Os' worth of data sequentially. There is a point at which scanning is quicker, and that's when the tables are small.

Log files are accessed sequentially, so it makes sense to put each database's log file onto its own disk for the best performance. Sharing one disk between many different log files will lose you the performance benefit.

Tuesday, February 09, 2010

Caution: Assembly Under Test

What do we test? What can we test? What should we test? Three subtly different questions with radically different answers in the general case.

We start by answering the questions that weren't asked: why do we test and why should we test?

Testing increases developers' confidence to introduce new changes without breaking existing functionality. Anecdotally, it also reduces overall project completion time, because less time is spent in the testing phase: bugs are found and dealt with sooner. It would be a tall order to define all the reasons we should test in one paragraph of one tiny blog.

There are usually ancillary reasons why we do test. Someone else forces us to, or we choose to, or we need some experience for the CV. People's motivations are endlessly creative. Heck, I can imagine many a political reason for coding unit tests.

We can test pretty much anything, but it doesn't mean we should. As we become more familiar with various testing tools and APIs we increase our range of what can be tested, but we should focus on the application of our knowledge to what we should be testing.

What should we test? Best question yet. From the point of view of the person who's paying you to develop software: we should test things that make our project more attractive than a (perhaps imaginary) competitor's project that doesn't employ unit testing. Do we deliver it faster? Do we deliver it with fewer bugs? Is everyone on the team happy? Did we get to do a little bit of CV++ along the way?

What do we test? In an ideal world, we test the objects we write. Seeing as I come from an object-oriented background, I'd say we test the state and behaviour of objects. We test state by inspecting the values of properties. We test behaviour by inspecting the return values of methods, and sometimes by ensuring that we're interacting correctly with other objects.

  • State: make assertions against expected and actual property values
  • Behaviour: make assertions against expected and actual method return values
  • Interactive behaviour: make assertions against the objects with which our objects interact - what we're testing is that we're actually interacting with them in the expected manner

Mock objects are the flavour of the week when it comes to testing interactive behaviour - we set up expectations on a mock object, interact with it, and then validate our expectations. They are there so that we can test that our object under test is correctly interacting with its surroundings.

Stub objects don't fit anywhere in this list of tests because we don't assert against them... we only use stub objects in order to allow our tests to compile and execute. We choose to use stub objects because we can control them and their dependencies. They are an important part of any isolation framework. We don't want our logging method to keep writing disk files or sending e-mails while our tests run, but we don't want it to throw NullReferenceExceptions either. We definitely don't want to code our application to continuously check whether the logger is null (this would be tedious and error prone), nor would we want to introduce special branching logic to skip logging under test conditions. We could code up our own NullLogger : ILogger that implements all of ILogger's events/properties/methods as no-ops, but this would be fragile and often a waste of time when functionality is added to or removed from the ILogger interface a month down the line. Good frameworks like Rhino.Mocks allow dynamic stubs to be built on the fly without you ever having to care about creating and maintaining your own NullLogger implementations. Saving us time, saving us money.
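Here's a small sketch of a dynamic stub, assuming a Rhino.Mocks 3.5-style API; ILogger and OrderProcessor are hypothetical stand-ins for your own types:

using NUnit.Framework;
using Rhino.Mocks;

// Hypothetical application types, defined here only to keep the sketch self-contained.
public interface ILogger
{
    void Info(string message);
}

public class OrderProcessor
{
    private readonly ILogger logger;
    public OrderProcessor(ILogger logger) { this.logger = logger; }
    public void Process() { logger.Info("processing"); }
}

[TestFixture]
public class OrderProcessorTests
{
    [Test]
    public void Process_ValidOrder_DoesNotThrow()
    {
        // A dynamic stub: every ILogger call becomes a harmless no-op, so no disk
        // files are written and no e-mails are sent while the test runs.
        ILogger logger = MockRepository.GenerateStub<ILogger>();

        new OrderProcessor(logger).Process();
    }
}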

Just because we can test state and behaviour (even interactive behaviour) doesn't mean we should. There needs to be justification. Does it make the application development experience quicker and better? Does it reduce overall costs? Does it allow you to sleep better at night (no, not because you are so exhausted from fighting fires and fixing production bugs!)? Are the tests valid? Will they help someone newly joining the team to understand what's going on? You can be as creative as you want in your reasoning, but without proper justification, you may just be wasting time writing and running tests.

Sunday, February 07, 2010

Unblock

Windows 7 is pretty security-conscious. I downloaded the latest Rhino.Mocks zip, extracted the .dll and .xml, referenced them into a new Visual Studio test project, and CTRL+R,CTRL+T'd to run the test. Fail! Test run deployment issue: 'Rhino.Mocks.dll' is not trusted.

Unbeknownst to me, there's a new option in Windows 7 (visible from the file's properties window) "this file came from another computer and might be blocked to help protect this computer." Smart. And there's an unblock button nearby. Click it.

It doesn't end there. I extracted the .xml file too, so that I'd have access to the Intellisense code documentation comments. This file needs to be unblocked too. Even if you've unblocked the .dll, the error message will persist: 'Rhino.Mocks.dll' is not trusted. Lies lies lies. Repeat step 1 to unblock the .xml.

Finally, Visual Studio will likely be working with cached assemblies, so the original .dll and .xml that were copied into the bin directory before the initial test run deployment failure are probably still there (they're still blocked). Delete them and let Visual Studio pick up the fresh unblocked files, and you'll be on your way again in no time.

Confidence & Unit Tests

Why do we waste our time with these stupid unit tests, some might ask. Well, if your system's requirements can change on a dime then I'd say (besides the initial measure-twice-cut-once rule) that the next most important thing you can do is have an automated suite of tests that can ensure your "tiny" "quick" code change doesn't have massive negative impact on the rest of the system. An automated build that runs unit tests covering most of the code base gives developers confidence to change the relevant pieces of source code and get quick turnaround times for new features and bug fixes. Yes! Quick turnaround times because the code was all already under test. I hate to say it, but "agile".

Measure Twice, Cut Once

Unit testing offers software developers a parallel insight into the world of physical construction: measure twice, cut once. Whether you're in charge of putting together a wooden tie-rack or a sky scraper, it's slightly more than just inconvenient to find that someone in the process delivered the wrong sized artifact because they were rushing and only looked over their work once. A second pair of eyes, a buddy check, even double checking on the developer's own part can reduce the occurrence of these kinds of errors. Unit tests are a great tool for software engineers in this regard. We code up the business logic once, and then we get to test it from an alternate perspective (please, for heaven's sake, don't mimic the exact flow of logic in your unit tests: even if there's a bug your tests will erroneously pass!)

Friday, February 05, 2010

ConditionalAttribute("DEBUG")

This is quite a neat concept that I only partially grasped until now. When you apply this attribute to a target (either a method or a class) it is compiled just like any other custom attribute. Look at the assembly in a tool like .NET Reflector and you'll see it there, exactly as you defined it. Bog standard.

The clever part is what happens next, whether you're referencing the target yourself (perhaps even from the same assembly in which it is defined), or someone else is referencing it by linking to your assembly as a 3rd party library.

The compiler recognizes that the target has a conditional attribute, and it checks its list of defined symbols to see if there's a match. If there's a match, it proceeds as normal, but if there's no match, the compiler leaves the method call (or the custom attribute declaration) out of the newly compiled calling assembly entirely. Simple as that. In this way, the caller (and this is the important part) will never know that the target method or attribute class exists. This can have great performance implications: code that's compiled with the DEBUG symbol defined (or whatever symbol you choose for your project) will make all the calls, which could include very verbose tracing, but code compiled without the symbol will have the method calls (and custom attributes) elided. In the case of method calls, not only is the method call itself skipped, but any logic for building up its arguments is also skipped. For example:

ConditionalTargetClass conditionalTargetClass = new ConditionalTargetClass();
conditionalTargetClass.conditionalTargetMethod(Factorial(2000));


becomes

ConditionalTargetClass conditionalTargetClass = new ConditionalTargetClass();

There are a few limitations:
When applying the attribute to a method, the method's return type must be void.
When applying it to a class, the class must derive from System.Attribute.
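For completeness, a possible definition of the target method used in the snippet above; the parameter type and body are illustrative, but note the void return type, as per the first limitation:

using System;
using System.Diagnostics;

public class ConditionalTargetClass
{
    // The method is always compiled into its own assembly; it's the call sites
    // that are only emitted when the calling code defines DEBUG.
    [Conditional("DEBUG")]
    public void conditionalTargetMethod(long value)
    {
        Console.WriteLine("Trace: {0}", value);
    }
}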

Another example:
[Conditional("DEBUG")]
public class DebugAttribute : Attribute { /* EMPTY */ }

[Debug]
public class Program { /* EMPTY */ }


becomes

[Conditional("DEBUG")]
public class DebugAttribute : Attribute { /* EMPTY */ }

public class Program { /* EMPTY */ }

Wednesday, February 03, 2010

Method Scenario ExpectedBehaviour

What do you call your unit test methods? How do you logically structure them? I've been over this before, so it's time for a little refinement. One unit test assembly for every assembly under test. One test fixture for every class under test. Many test methods for every method under test. And to make everything easy to read, every test method should follow this pattern (not my idea, but I wish it was):

Method_Scenario_ExpectedBehaviour

For example if we're testing a new Security class with an Authenticate method, we might create the following test methods inside the SecurityTestFixture class:

Authenticate_UserNotFound_ThrowsAuthenticationException
Authenticate_IncorrectPassword_ThrowsAuthenticationException


Of course no application development should be driven by negative tests - at first you'll want to test that your authentication process works, then later down the line you'll want to make sure that it's robust. So there would likely be more test methods like:

Authenticate_Success_ReturnsPrincipal

Note how easy it is for someone to read the test method name and understand what it was doing. That's on purpose. Also note how the scenarios aren't long and complex, and only one logical unit of work is tested per test method. That's on purpose too. It means it's easy to pinpoint the root cause of failure.
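As a sketch, using NUnit, one of those tests might look like this; the Security class, its dependencies and the exception type are the hypothetical subjects of the test - only the naming pattern matters here:

using NUnit.Framework;

[TestFixture]
public class SecurityTestFixture
{
    [Test]
    public void Authenticate_UserNotFound_ThrowsAuthenticationException()
    {
        // Arrange: a Security instance whose user store contains no users.
        Security security = new Security(new EmptyUserRepository());

        // Act & Assert: one logical unit of work per test method.
        Assert.Throws<AuthenticationException>(
            () => security.Authenticate("ghost", "password"));
    }
}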

Fake: Stub vs. Mock

When you're writing unit tests, you'll occasionally find yourself in the situation where you need to create some kind of fake object and pass it into your object/method under test. What you do with that object next is what determines whether it's a mock object or just a stub. If your test makes an assertion on the fake (i.e. what methods were called on it, and with what arguments) then that fake is a mock object. If the test does not make any assertions on the fake, then it's just a stub.

In short:
all mocks are fakes
all stubs are fakes
no mock is a stub
no stub is a mock

Many so called "mocking frameworks" actually allow you to dynamically create both mocks and stubs, so it would be better to call them isolation frameworks as their primary function is to let you test your own objects in isolation by mocking/stubbing out external dependencies.

Tuesday, January 26, 2010

Visual Studio Shortcuts

We all make typogarphical mistakes. Oops. Put the caret between the 'a' and 'r' and hit Ctrl+T.

This shortcut allows you to move a single letter one position to the right every time you use it. More precisely, it swaps the characters on the left and right of the caret, then advances the caret one character to the right. The advancement of the caret means you can keep hitting Ctrl+T to effectively move a single letter right through a word.

Sometimes I'll type in a class name without declaring the corresponding using directive at the top of my C# source file. Visual Studio adds a visual clue (a small red box under the final character of the type name) and if you move your hands off the keyboard and wiggle the mouse around a bit, Visual Studio will display a context menu from which to choose the namespace. If you want to keep your hands on the keyboard, hit Ctrl+Shift+F10 and then use the up and down arrows to select the appropriate namespace from the list.

Some other shortcuts can be found here: http://www.microsoft.com/downloads/details.aspx?FamilyID=e5f902a8-5bb5-4cc6-907e-472809749973&displaylang=en

Tuesday, January 12, 2010

Split Range - Partitions to Filegroups

So I asked the questions: How do I know which filegroup any given partition is mapped onto? How do I know which is the new partition when a range is split on a new boundary? And which partition keeps the data and which is removed when two partitions are merged? The first question is answered by looking at the sys.destination_data_spaces catalog view. The second is answered in this post, and the third will be the subject of a later post.

We can set up a partition function and scheme for this exercise with the following SQL:
CREATE PARTITION FUNCTION pfLeft(INT) AS RANGE LEFT FOR VALUES()
CREATE PARTITION SCHEME psLeft AS PARTITION pfLeft TO (FG1)


Our diagnostic SQL, which we run repeatedly, is:
SELECT destination_data_spaces.*
FROM sys.destination_data_spaces
INNER JOIN sys.partition_schemes ON sys.destination_data_spaces.partition_scheme_id = sys.partition_schemes.data_space_id
WHERE sys.partition_schemes.name = 'psLeft'
ORDER BY destination_id ASC


For a partition function with no boundaries (i.e. all rows fit in one partition) we see that the first (one and only) destination_id maps to the data_space_id of FG1.
partition_scheme_id destination_id data_space_id
------------------- -------------- -------------
65601 1 2

If we set the next used filegroup to FG2 and then split the partition on the boundary of 0 we see:
partition_scheme_id destination_id data_space_id
------------------- -------------- -------------
65601 1 3
65601 2 2

The interesting bit is that FG2 is now used for the first destination_id. When the existing partition was divided, the LEFT sub-partition (including value 0) was assigned to the new filegroup. Let's try it again by setting the next used filegroup to FG3 and splitting on 100:
partition_scheme_id destination_id data_space_id
------------------- -------------- -------------
65601 1 3
65601 2 4
65601 3 2

Again, it's the LEFT sub-partition (including value 100) of the partition that was being divided that is assigned to the new filegroup. If we try this one more time with a split on -100, I'd hope to see FG4 take up the LEFT sub-partition of our left-most partition.
partition_scheme_id destination_id data_space_id
------------------- -------------- -------------
65601 1 5
65601 2 3
65601 3 4
65601 4 2


Let's do everything from the start, but this time we'll use:
CREATE PARTITION FUNCTION pfRight(INT) AS RANGE RIGHT FOR VALUES()
CREATE PARTITION SCHEME psRight AS PARTITION pfRight TO (FG1)


We'll also need to change the diagnostic code to show the psRight partition scheme:
SELECT destination_data_spaces.*
FROM sys.destination_data_spaces
INNER JOIN sys.partition_schemes ON sys.destination_data_spaces.partition_scheme_id = sys.partition_schemes.data_space_id
WHERE sys.partition_schemes.name = 'psRight'
ORDER BY destination_id ASC


Then we run all the splits in order: 0 -> FG2, 100 -> FG3, -100 -> FG4
destination_id data_space_id
-------------- -------------
1 2
2 5
3 3
4 4

The latest results are consistent with what we've seen previously, but now it's the RIGHT segment (including the boundary value) of the newly divided partition that gets moved onto the new filegroup.

In summary, if your partition function is defined as RANGE LEFT, then a SPLIT RANGE (x) alteration will split an existing partition and the LEFT segment will become the new partition on the NEXT USED filegroup; the other segment stays on its original filegroup. Conversely, if your partition function is defined as RANGE RIGHT, then the RIGHT segment becomes the new partition on the NEXT USED filegroup.

Wednesday, January 06, 2010

Splitting a PDF

I discovered a novel way of splitting a large PDF file into more manageable chunks: print the range of pages you want to the "Print to File" printer in Ubuntu, making sure to choose PDF output!

Tuesday, January 05, 2010

Horizontal Partitioning 1

When you want to speed up the physical reads on a SQL Server, you have two options: faster disks, or more disks. But only once you've determined that physical reads are a (potential) bottleneck do you get to make the choice.

If using faster disks is not enough to get your system back to an acceptable level of performance, then there's still the option of adding more disks. Merely adding disks won't - in itself - speed anything up. To realize performance gains you need to restructure your data. SQL Server allows you to partition tables, and also indexes. But as usual, the devil resides in the details.

Partition functions can only take one parameter. This means that the partition in which each row resides is determined by the value of just one column in that row.

If your original table had a clustered index, you'll probably want to keep it. However, this has a big consequence: you will need to make the partition function congruent with the clustered index. SQL Server will complain if you leave the partition column out of a unique clustered index: "Partition columns for a unique index must be a subset of the index key". It gets worse if you want a composite clustered index - you should be aware that in some cases SQL Server appears to store data internally sorted first by the partition column, and then by any other columns in the composite clustered index. If your original table was sorted by Col_1,Col_2 and you choose Col_2 as your partition column, then your table may be sorted internally by Col_2,Col_1. Actually, it's not this straightforward: I need some time to figure this out; it will be the subject of a later post.