Wednesday 27 July 2011

Constructing a task occupancy time series from discrete task start and end times

When finding my way around a new distributed system I often want to see a chart of the number of tasks running on a system at the same time but only have discrete task start and end time data. Even if time series data already exists, it’s often not broken down in a meaningful way – say by task type, or final status.

So: how do I build a time series like this from a set of discrete start and end times?

I tackle this in three steps. First, I build and sort lists of start and end times. Secondly, I process these two lists to get a task event dataset: every start takes a running count of tasks up, and every end takes it down. Finally I process that event dataset to build a time series based on discrete time points.

Building sorted lists of task start and end times is pretty straightforward. I typically dump start and end time data from the database, which might look like this:

2011-07-19 04:02:45.0730000,2011-07-19 04:03:21.6000000
2011-07-19 04:02:45.0730000,2011-07-19 04:03:44.7030000
2011-07-18 22:58:52.1800000,2011-07-18 22:58:52.6670000
2011-07-18 22:58:53.3500000,2011-07-18 22:58:57.7700000
2011-07-18 22:58:52.1800000,2011-07-18 22:58:52.7200000
2011-07-18 22:58:53.3500000,2011-07-18 22:58:58.4030000
2011-07-18 22:58:52.1800000,2011-07-18 22:58:52.7100000
2011-07-18 22:58:53.3500000,2011-07-18 22:58:57.7700000
2011-07-18 22:58:52.1800000,2011-07-18 22:58:52.7630000
2011-07-18 22:58:53.3500000,2011-07-18 22:58:57.7700000

and then use the excellent FileHelpers library to load them in. I make some assumptions about the data – the earliest start is earlier than the first and point, at least.

After that, sort the lists in place in ascending time order.

var starts = new List<DateTime>();
var ends = new List<DateTime>();
foreach(var task in tasks)
{
	starts.Add(task.start);
	ends.Add(task.end);
}

starts.Sort((a, b) => 
	Convert.ToInt32(new TimeSpan(a.Ticks - b.Ticks).TotalMilliseconds));
ends.Sort((a, b) => 
	Convert.ToInt32(new TimeSpan(a.Ticks - b.Ticks).TotalMilliseconds));

Merging the two lists to create an event dataset is a little trickier. I do it by processing either the start or the end list in order at any time, swapping between them when the timestamps in one jump ahead of the timestamps in the other. I keep a running count of tasks (the occupancy) – every start takes it up, and every end takes it down.

int currentIndex = 0;
int otherIndex = 0;
int runningCount = 0;
var currentList = starts;
var otherList = ends;
var incdec = 1;

var timestamp = new List<DateTime>();
var value = new List<int>();

while (currentIndex < currentList.Count - 1)
{
	runningCount += incdec;
	timestamp.Add(currentList[currentIndex]);
	value.Add(runningCount);

	currentIndex++;
	if (currentList[currentIndex] > otherList[otherIndex])
	{
		var tempIndex = currentIndex;
		currentIndex = otherIndex;
		otherIndex = tempIndex;
		var tempList = currentList;
		currentList = otherList;
		otherList = tempList;

		incdec *= -1;
	}
}

This generates a relative time series, with offsets from the number of tasks running at the start of the dataset. To constrain this I’d need to work out what that initial number of tasks is. Generally I’d count the number of tasks that started but didn’t end before my time series starts.

Finally, create a discrete timeseries with evenly spaced points. I just generate a list of timestamp points and for each take the nearest previous recorded occupancy value. Generally I’ll also take a highwater mark, as otherwise you lose a feel of any peaky behaviour within between points.

var index = 0;
var dt = new TimeSpan(0, 5, 0);
var current = timestamp[0];

while (current < timestamp[timestamp.Count - 2])
{
	var next = current.AddTicks(dt.Ticks);

	var highwater = 0;
	while (index < timestamp.Count - 1 
		&& timestamp[index + 1] < next)
	{
		index++;
		if (highwater < value[index]) 
			highwater = value[index];
	}

	Console.WriteLine(current + "\t" + value[index] + "\t" + highwater);
	current = next;
}

You’ll notice this is quite loose; I’m not really worrying about interpolating to get values at exact time series points, and so on.

And that’s basically it. When building your series you can fold in extra information – about the task type for example – and use that to show how the system treats different task types.

No comments: