Handling fixed width text with Regular Expressions (RegEx)

 

When most developers are faced with a fixed-width text file, they reach for the String object.  While this is effective, it isn’t efficient.  Strings in .NET are immutable, so every call to Substring allocates a new string, and the offset arithmetic is easy to get wrong.  A better way is to use the classes in the System.Text.RegularExpressions namespace.

A fixed width file is one where the columns are defined by the number of spaces consumed.  For instance, here is a list of the Big 10 (11? 12?), locations, and years founded:

University of Illinois          Champaign, Illinois         1867
Indiana University              Bloomington, Indiana        1820
University of Iowa              Iowa City, Iowa             1847
University of Michigan          Ann Arbor, Michigan         1817
Michigan State University       East Lansing, Michigan      1855
University of Minnesota         Minneapolis, Minnesota      1851
Northwestern University         Evanston, Illinois          1851
Ohio State University           Columbus, Ohio              1870
Pennsylvania State University   State College, Pennsylvania 1855
Purdue University               West Lafayette, Indiana     1869
University of Wisconsin–Madison Madison, Wisconsin          1848

The university name takes up 32 characters, the location 28 characters, and the year 4 characters, for a total of 64 characters per line.  We can debate up and down the benefits of such a format, but it is what it is, and we often get these files from legacy systems.
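For comparison, those widths translate directly into Substring offsets. The following is a minimal sketch of that conventional approach, not code from the original program; the SubstringSlicer class and its Slice method are hypothetical names, and the sketch assumes every line is at least 64 characters long.

```csharp
using System;

static class SubstringSlicer
{
    // Offsets follow from the column widths:
    // school = 0..31, location = 32..59, year = 60..63.
    public static string[] Slice(string line)
    {
        if (line == null || line.Length < 64)
        {
            throw new ArgumentException("Expected a line of at least 64 characters.", nameof(line));
        }

        return new[]
        {
            line.Substring(0, 32).TrimEnd(),   // school
            line.Substring(32, 28).TrimEnd(),  // location
            line.Substring(60, 4).TrimEnd()    // year
        };
    }
}
```

Calling SubstringSlicer.Slice on one line of the file returns a three-element array. Every start offset has to be kept in sync with the widths by hand, which is the bookkeeping the regular expression avoids.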

Instead of using the String.Substring method to get the values out, we can use the Regex class in System.Text.RegularExpressions.  When you call its Match method, you get back a Match object that has a collection of the groups (shocker, that) captured where the expression meets the input.
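As a small illustration of the Match object, here is a sketch that runs a pattern against a single line already held in memory. The FixedWidthParser class and its TryParse method are names invented for this example; only Regex, Match, Groups, and Success come from the framework. It assumes the line is not null, since Regex.Match throws on a null input.

```csharp
using System;
using System.Text.RegularExpressions;

static class FixedWidthParser
{
    // 32-character school, 28-character location, 4-character year.
    private static readonly Regex LinePattern =
        new Regex(@"^(?<school>.{32})(?<location>.{28})(?<joined>.{4})$");

    // Returns false (and empty strings) when the line does not fit the layout.
    public static bool TryParse(string line, out string school, out string location, out string joined)
    {
        Match match = LinePattern.Match(line);

        school = match.Success ? match.Groups["school"].Value.TrimEnd() : string.Empty;
        location = match.Success ? match.Groups["location"].Value.TrimEnd() : string.Empty;
        joined = match.Success ? match.Groups["joined"].Value.TrimEnd() : string.Empty;

        return match.Success;
    }
}
```

The named groups are looked up by name on match.Groups, so the calling code never mentions a column offset. Checking match.Success first is what separates a real record from a line that is too short or too long.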

Here is an example program that loads the file and uses an expression (note the named-group format, (?<name>...)) to break each line up into a collection, basically an array.  Notice that there isn’t a single Substring call in the project; the pattern itself carries the whole layout.  To run the program, save the formatted text above into a file called “BigTen.txt” in the root of your C drive.

using System;
using System.IO;
using System.Text.RegularExpressions;

namespace BigTen
{
    class Program
    {
        static void Main(string[] args)
        {
            // Named groups carve each line into its three fixed-width columns.
            string pattern = @"^(?<school>.{32})(?<location>.{28})(?<joined>.{4})$";
            Regex re = new Regex(pattern);

            // The using block closes the reader even if an exception is thrown.
            using (StreamReader sr = new StreamReader(@"c:\BigTen.txt"))
            {
                while (sr.Peek() != -1)
                {
                    Match match = re.Match(sr.ReadLine());
                    if (!match.Success)
                    {
                        continue; // skip lines that do not fit the 64-character layout
                    }

                    Console.WriteLine(match.Groups["school"].Value.TrimEnd());
                    Console.WriteLine(match.Groups["location"].Value.TrimEnd());
                    Console.WriteLine(match.Groups["joined"].Value.TrimEnd() + "\n");
                }
            }

            // Keep the console window open until Enter is pressed.
            Console.ReadLine();
        }
    }
}

Of course, there are downsides to regular expressions.  They are difficult to debug, and the formatting is arcane.  For this, however, they make for an excellent solution, and the formatting of this expression is quite readable.  Only one expression is used, so it is easier than most to debug.  I think it is a good solution to the problem at hand.  Give it a try!
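On the debugging point: because the pattern demands exactly 64 characters, a line read from the file can only fail to match when its length is off, for instance because of a trailing space. A rough sketch of a helper that reports this follows; LineDiagnostics and Explain are hypothetical names, and the message format is just one possibility.

```csharp
using System;
using System.Text.RegularExpressions;

static class LineDiagnostics
{
    private const int ExpectedWidth = 32 + 28 + 4; // 64 characters per record

    private static readonly Regex LinePattern =
        new Regex(@"^(?<school>.{32})(?<location>.{28})(?<joined>.{4})$");

    // Returns an empty string when the line matches, otherwise a short explanation.
    public static string Explain(string line, int lineNumber)
    {
        if (LinePattern.IsMatch(line))
        {
            return string.Empty;
        }

        return string.Format(
            "Line {0} did not match: expected {1} characters, found {2}.",
            lineNumber, ExpectedWidth, line.Length);
    }
}
```

Writing the result of Explain to the console for each line that fails makes a stray trailing space show up right away as a 65-character line.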

Comments (1) -

Doug
8/31/2010 4:24:34 PM #

Cool, didn't know you could do that. I've always done it the inefficient way. I blogged a PowerShell version.

www.dougfinke.com/.../

Bill Sempf

Husband. Father. Pentester. Secure software composer. Brewer. Lockpicker. Ninja. Insurrectionist. Lumberjack. All words that have been used to describe me recently. I help people write more secure software.
