Posted July 9, 2008 by Ben Burkert in projects.

Regular expressions are a developer’s best friend. Seasoned programmers can wield regular expressions to extract structured information from often near-random input. And Ruby’s explicit syntax for regular expressions makes adding a little order to your data chaos a cinch. Turns out, the syntax works well for the opposite too: creating random data from simple expressions.

Randexp allows you to use a regular expression to generate a random string that matches it. Say you have a model with a serial_number property that validates against a regular expression.

/XX\d{4}-\w-\d{5}/

With our regular expression, we can generate random strings that always match it using the generate (or gen, for short) method.

/XX\d{4}-\w-\d{5}/.generate    #=> "XX3770-M-33114"

The generate and gen methods are added to the Regexp class when the randexp gem is required. They construct a Randexp object from the regex’s source, which is then ‘reduced’ into a string. The Randgen class generates the actual random values, and it can be extended to allow for more complex expressions, which is covered later.

Right now, there is support for the single character matchers: word(\w), whitespace(\s), and decimal(\d), along with literals and multiplicity operators(*, +, ?, {}). One caveat, though: most expressions raise errors when combined with the * or + operator.

/Aa{3}h*!/.gen
# => RuntimeError: Sorry, "h*" is too vague, try setting a range: "h{0,3}"
 
/Aa{3}h{3,15}!/.gen
# => "Aaaahhhhh!"
 
/(never gonna (give you up|let you down), )*/.gen
# => RuntimeError: Sorry, "(...)*" is too vague, try setting a range:
 "(...){0, 3}"
 
/(never gonna (give you up|let you down), ){3,5}/.gen
# => "never gonna give you up, never gonna let you down, never gonna give
 you up, never gonna give you up, "
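The bounded-quantifier rule in these examples can be sketched in plain Ruby. This is only an illustration of the idea, not randexp's own code: the repetitions_for and repeat helpers, and the choice to represent quantifiers as an Integer, a Range, or a symbol, are assumptions made for this sketch.

```ruby
# Hypothetical helpers (not part of randexp): decide how many times a
# pattern repeats, refusing open-ended quantifiers the way the errors
# above do.
def repetitions_for(quantifier, rng: Random.new)
  case quantifier
  when Integer then quantifier            # {4}    -> exactly 4
  when Range   then rng.rand(quantifier)  # {3,15} -> somewhere in 3..15
  when :"?"    then rng.rand(0..1)        # ?      -> zero or one
  when :"*", :"+"
    # No upper bound to pick from, so ask for an explicit range instead.
    raise %(Sorry, "#{quantifier}" is too vague, try setting a range such as {0,3})
  else
    raise ArgumentError, "unknown quantifier: #{quantifier.inspect}"
  end
end

# Repeat a literal according to the quantifier.
def repeat(text, quantifier, rng: Random.new)
  text * repetitions_for(quantifier, rng: rng)
end

puts "Aaaa" + repeat("h", 3..15) + "!"
```

Treating * and + as errors keeps the output length predictable; any cap chosen silently on the caller's behalf would be arbitrary.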

The exception to the * and + caveat is the word matcher, which is treated as a random word rather than raising an error. If a specific length or range is given for a word matcher, a word of suitable length is generated.

/\w{10}/.gen  # a word with 10 letters
# => "Chaucerism"
 
/\w{5,15}/.gen
# => "cabalistic"

This is still a bit cryptic, but the [:method_name:] syntax cleans it up: it calls the class-level method of that name on the Randgen class.

/[:word:]/.gen
# => "deutomala"
 
/[:sentence:]/.gen
# => "Antiphonically electrotellurograph chromatype proczarist plumet"
 
/[:paragraph:]/.gen
# => "Sesquioxide conationalistic paragoge dingus unsteadfast tenophyte
 goetic phytonomy hebephrenia rix uninjured biventral.  Householdry clunk
 amateur ramekin baronet chirotonsory mythical hobbist semblative
 cubonavicular outbrother templeward thaumatology velutina dharmasmriti
 kassak.  Persecutor wudu bertie deputative carburant."

Extending Randgen

You can add class level methods to the Randgen class which can be used within your regular expression using the [:xxx:] syntax.

class Randgen
  def self.serial_number(options = {})
    /XX\d{4}-\w-\d{5}/.gen
  end
end
 
/[:serial_number:]/.gen
#=> "XX3770-M-33114"
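To make the dispatch idea concrete, here is a small sketch of how a [:name:] token could be mapped onto a class-level method. It does not use randexp and is not how the gem is implemented internally; ToyRandgen and expand_named_generators are hypothetical names invented for this example.

```ruby
# Hypothetical stand-in for Randgen, so the sketch runs without the gem.
class ToyRandgen
  def self.serial_number
    letter = ("A".."Z").to_a.sample
    format("XX%04d-%s-%05d", rand(10_000), letter, rand(100_000))
  end
end

# Replace every [:name:] token with the result of calling the class-level
# method of that name. Only methods defined directly on the generator class
# are allowed, so tokens such as [:class:] are rejected.
def expand_named_generators(source, generators = ToyRandgen)
  source.gsub(/\[:(\w+):\]/) do
    name = Regexp.last_match(1).to_sym
    unless generators.singleton_methods(false).include?(name)
      raise ArgumentError, "no generator named #{name}"
    end
    generators.public_send(name).to_s
  end
end

puts expand_named_generators("serial: [:serial_number:]")
```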

Under the Hood

There are two major steps involved in generating the random string. First, the regular expression is converted into a nifty little s-expression with the Parser class that is stored in the Randexp instance.

Randexp.new("(a|b)\\w*").sexp
# => [:union,
           [:intersection,
             [:literal, "a"],
             [:literal, "b"]],
           [:quantify,
             [:random, :w],
           :*]]

This sexp is then ‘reduced’ to the random output by the Reducer class, which walks the sexp constructing the string.
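A toy version of the reduce step, written against the s-expression shape shown above, might look like the following. It is a sketch rather than randexp's Reducer class: the ToyReducer module is hypothetical, it assumes :union means concatenation and :intersection means picking one alternative (as the example sexp suggests), and it represents quantifiers as a Range instead of :* so that every repetition is bounded.

```ruby
# Hypothetical reducer (not randexp's): walk the sexp and build a string.
module ToyReducer
  DIGITS  = ("0".."9").to_a
  LETTERS = ("a".."z").to_a

  def self.reduce(sexp, rng = Random.new)
    type, *args = sexp
    case type
    when :literal
      args.first
    when :union
      # Concatenate the reduction of each child, left to right.
      args.map { |child| reduce(child, rng) }.join
    when :intersection
      # Pick exactly one of the alternatives.
      reduce(args.sample(random: rng), rng)
    when :random
      random_char(args.first, rng)
    when :quantify
      child, range = args
      Array.new(rng.rand(range)) { reduce(child, rng) }.join
    else
      raise ArgumentError, "unknown node type: #{type.inspect}"
    end
  end

  def self.random_char(kind, rng)
    case kind
    when :d then DIGITS.sample(random: rng)
    when :w then LETTERS.sample(random: rng)
    when :s then " "
    else raise ArgumentError, "unknown character class: #{kind.inspect}"
    end
  end
end

# Roughly the sexp for (a|b)\d{2,4}
sexp = [:union,
        [:intersection, [:literal, "a"], [:literal, "b"]],
        [:quantify, [:random, :d], 2..4]]
puts ToyReducer.reduce(sexp)
```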

Dictionary

Randomly generated words are not actually generated. Instead they are picked from a dictionary of words loaded from your local words file, which typically holds thousands of words. So it’s got plenty to choose from. The words are also mapped by size, allowing you to generate words of a specific length, or within a range.

/\w{2,6} \w{10,20}, inc/.gen
#=> "mold forethoughtfulness, inc"
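Mapping words by size is easy to sketch with group_by. The example below uses a tiny inline word list in place of a real words file, and random_word is a hypothetical helper for illustration, not a randexp method.

```ruby
# A tiny stand-in for the words file, so the sketch is self-contained.
WORDS = %w[mold cat forethoughtfulness cabalistic Chaucerism dingus].freeze

# Index the words by length once, e.g. { 4 => ["mold"], 3 => ["cat"], ... }
WORDS_BY_LENGTH = WORDS.group_by(&:length)

# Pick a random word of an exact length (Integer) or within a Range.
def random_word(length_or_range, rng: Random.new)
  lengths = length_or_range.is_a?(Range) ? length_or_range.to_a : [length_or_range]
  candidates = lengths.flat_map { |len| WORDS_BY_LENGTH.fetch(len, []) }
  raise ArgumentError, "no words of length #{length_or_range}" if candidates.empty?
  candidates.sample(random: rng)
end

puts "#{random_word(2..6)} #{random_word(10..20)}, inc"
```

Building the index up front means each lookup only touches the buckets for the requested lengths instead of scanning the whole list.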

Note: Right now randexp looks for your words file in /usr/share/dict/ or /usr/dict/. This works on OSX and most *nix distros, although I had to create a symlink on gentoo. Windows users are S.O.L., unless there’s a way to get a words file with cygwin.
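A lookup along these lines can be sketched as follows. The file name "words" inside those directories is an assumption, and find_words_file and load_words are hypothetical helpers rather than randexp's API.

```ruby
# Candidate locations; the "words" file name is assumed for this sketch.
WORDS_FILE_CANDIDATES = ["/usr/share/dict/words", "/usr/dict/words"].freeze

# Return the first candidate that exists as a regular file, or nil.
def find_words_file(candidates = WORDS_FILE_CANDIDATES)
  candidates.find { |path| File.file?(path) }
end

# Read one word per line, dropping blank lines.
def load_words(path)
  File.readlines(path, chomp: true).reject(&:empty?)
end

path = find_words_file
puts(path ? "#{load_words(path).size} words in #{path}" : "no words file found")
```

Because the candidate list is a parameter, a setup without either directory could pass its own path to find_words_file instead of relying on a symlink.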

Installation & Use

For the time being it’s published on github’s gemserver; it will be on rubyforge soon as well. Here’s the command to install it from github’s gemserver. Note: You must be running the latest version of rubygems for this to work.

gem sources -a http://gems.github.com/
gem install benburkert-randexp

Load randexp from irb with the following.

gem "benburkert-randexp"
require "randexp"

Raison d’être

I started writing randexp because of another gem I was working on, can_has_fixtures, yet another alternative to fixtures. CHF replaces fixtures by generating pseudo-random data for model instances. By itself, randexp probably isn’t very useful, but combined with a model generator it can be quite helpful.

User.fixture(:employee) {{
  :first_name => (first_name = Randgen.word).capitalize,
  :last_name => (last_name = Randgen.word).capitalize,
  :username => username = "#{last_name}#{first_name[0, 1]}",
  :email => /#{username}@(corp|subsidiary|partner)\.com/.gen,
  :ssn => /\d{3}-\d{2}-\d{4}/.gen,
  :addr1 => /\d{2,4} (\w+ ){1,3}(street|lane|way), \w+, \d{5}/.gen,
  :records => (1..5).of { Record.generate(:employee) }
}}
 
User.generate(:employee).ssn
  # => "735-50-9234"
User.generate(:employee).addr1
  # => "8829 yearbook way, unconvenable, 29290"

Pretty cool, especially when you start extending the Randgen class.

I’m about to start rewriting CHF due to a few nasty bugs caused by single table inheritance models on DataMapper edge. It will probably be DataMapper specific because DataMapper is now our ORM of choice. (you can follow along here, but right now it’s just vaporware).

comments:

5 Comments »

  1. Going to use this to generate random words. I have a purposely non-secure password feature that generates a string of ten random characters, but a random word of ten characters might be more memorable.

    Comment by R. Elliott Mason — July 16, 2008 @ 10:26 pm

  2. Oh snap I’m on Windows. Vista. 64bit.

    Comment by R. Elliott Mason — July 16, 2008 @ 10:32 pm

  3. Is the availability of a dict file the only block on Windows support? Why not simply include a dictionary file with the gem? There are plenty available at textfiles.com , and thousands of words would still be less than a few hundred KB.

    Comment by ab5tract — September 2, 2008 @ 2:58 pm

  4. The problem is the typical dict file on osx or *nix will drastically increase the size of the gem. We’re looking for a good solution. We may add some sake tasks or something in order to make it so you can get some words files. We’d like to have it so you can have files of verbs, nouns, adjectives, etc. We’re open to ideas if anyone has any.

    Comment by Brian Smith — September 5, 2008 @ 8:28 pm

  5. To get the latest version of the gem, make sure you use “sudo gem install randexp” and not the instructions from above.

    Comment by Morgan Roderick — September 12, 2008 @ 9:56 am
