Thursday, January 28, 2010

Random Sentence Generator

I’ve always been fascinated with language and linguistics… specifically grammar and sentence structure. I was one of those nerds in high school who actually enjoyed diagramming sentences.

Sentence Diagram

For kicks, about 15 years ago, I wrote a self-contained program in Visual FoxPro (VFP) to generate random sentences, but I lost track of it… it’s probably on an old computer gathering dust in the garage.

But a year or so ago, I Googled “random sentence generator” and came upon this website, which describes an old assignment in a Computer Science class at Stanford University.

The assignment was to generate random sentences based on a text file consisting of a grammar. This grammar is a template describing the various combinations of words to form the sentences.

For example, here is a very simple grammar file (don’t pay any attention to the T-SQL /*…*/ comment delimiters… they’re only there so that websites that syndicate this blog render the text below in a consistent manner):

/*
{<start> <pronoun> <verb>;}
{<pronoun> I;you;he;she;it;we;they;}
{<verb> ran;played;drank <liquid>;}
{<liquid> water;whiskey;Pepsi;}
*/
This grammar file consists of 4 elements, whose definitions are between curly brackets ({}). Each element is defined by a list of possible expansions, each ending with a semicolon (;). These expansions can be individual words or phrases, or they can be (or contain) other elements, which will need to be further expanded.

Beginning with the <start> element, you generate a random sentence, continuously expanding the elements until there are no more to expand, as illustrated by the following steps:

/*
Start with <start>: <start>
Expand <start>: <pronoun> <verb>
Expand <pronoun>: they <verb>
Expand <verb>: they drank <liquid>
Expand <liquid>: they drank whiskey
*/
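The steps above can be sketched as a short loop. This is only a minimal, self-contained illustration and not the procedure presented later in this post: the table variable @G and its hard-coded rows are assumptions made for the example, and it relies only on built-in functions (charindex, substring, stuff, newid).

```sql
/* Hypothetical stand-in for a loaded grammar: one row per expansion */
declare @G table (Element varchar(40), Expansion varchar(200))
insert @G (Element, Expansion)
select '<start>'  ,'<pronoun> <verb>' union all
select '<pronoun>','I'      union all select '<pronoun>','you'  union all
select '<pronoun>','he'     union all select '<pronoun>','she'  union all
select '<pronoun>','it'     union all select '<pronoun>','we'   union all
select '<pronoun>','they'   union all
select '<verb>'   ,'ran'    union all select '<verb>','played'  union all
select '<verb>'   ,'drank <liquid>' union all
select '<liquid>' ,'water'  union all select '<liquid>','whiskey' union all
select '<liquid>' ,'Pepsi'

declare @Sentence varchar(500)
       ,@Element varchar(40)
       ,@Expansion varchar(200)
       ,@p1 int
       ,@p2 int

set @Sentence='<start>'
print @Sentence
while 1=1
begin
  /* Find the leftmost <element> still in the sentence */
  set @p1=charindex('<',@Sentence)
  if @p1=0 break /* Nothing left to expand */
  set @p2=charindex('>',@Sentence,@p1)
  set @Element=substring(@Sentence,@p1,@p2-@p1+1)

  /* Pick one of its expansions at random */
  select top 1 @Expansion=Expansion
  from @G
  where Element=@Element
  order by newid()

  /* Substitute it in place and show the intermediate step */
  set @Sentence=stuff(@Sentence,@p1,len(@Element),@Expansion)
  print @Sentence
end
```

Each run prints a different chain of steps, ending in a sentence with no angle brackets left.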
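The grammar file above, though very simple, is capable of generating 35 different sentences: each of the 7 pronouns can be followed by any of 5 verb phrases (ran, played, or drank plus one of the 3 liquids).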
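
If you want to double-check that count by brute force, here is a small sketch that enumerates every sentence of the simple grammar with a cross join. The table variables @Pronoun and @Liquid are names made up for this illustration; they are not part of the generator itself.

```sql
declare @Pronoun table (Word varchar(10))
insert @Pronoun
select 'I'  union all select 'you' union all select 'he' union all select 'she'
union all select 'it' union all select 'we' union all select 'they'

declare @Liquid table (Word varchar(10))
insert @Liquid
select 'water' union all select 'whiskey' union all select 'Pepsi'

/* <verb> is ran, played, or drank followed by any <liquid> */
select p.Word+' '+v.Phrase as Sentence
from @Pronoun p
cross join (select 'ran' as Phrase
            union all select 'played'
            union all select 'drank '+Word from @Liquid) v
order by Sentence
/* 7 pronouns x 5 verb phrases = 35 rows */
```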

There are various grammar files available at this location. Many of them were written for fun by students over the years. If you wish to download any of them, just right-click and choose Save Target As… from the context menu.

For example, there’s a grammar file for generating a random Star Trek episode treatment, like so:

While ferrying a Drysilic ambassador to the nearest J.C. Penny outlet, Wesley Crusher is suddenly swallowed into a time vortex connecting to the year 325. In a flash of insight, someone decides that the best solution is by using Guinan’s hat as a weapon. And thus, the Enterprise continues undaunted in its mission.

And there’s one for generating a random James Bond film synopsis (I inserted a [sic] next to the misspellings… honestly, you’d think Stanford students would spell better!):

A demented Libyan terrorist plots to lower England’s standard of living by wrecking the global economy. Bond tracks his nemesis to a palacial [sic] estate in France and they play cards (cheating shamelessly). Within minutes, Bond meets a stunning CIA agent, whom he thinks is cute but too young. Afterwards, Bond is siezed [sic] by commandos and fed to pirannahs [sic]. Incredibly, he escapes by driving a speedboat over a waterfall. Finally, with only 007 seconds to spare, Bond causes SMERSH headquarters to self-destruct.

And there’s one for generating insults:

You cantankerous bucket of defective lizard scabs.

May a gruesome and sweaty group of South American killer bees gleefully vomit in the toxic dump you call home.

Anyway, once I discovered this website a year ago, I wrote a brand new Random Sentence Generator in VFP, using this grammar file template concept. And I had a little fun generating random insults and random Trek/Bond synopses and other things.

But I didn’t stop there. I put together a grammar file to generate sentences of my own. Over many weeks, I kept adding to it, making it very extensive, filled with dependent clauses, modal auxiliary verbs, conjunctive adverbs, and a host of other complex sentence constructions. I even amended it to handle past and future tense and plurals, which meant I had to handle irregular formations like sit/sat and man/men.

I can pretty much guarantee that you will never get the same sentence generated twice from my grammar file. I haven’t done the math (I’m not really sure how to), but it must be capable of generating literally quadrillions upon quadrillions of different sentences, from the very simple…
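
For what it’s worth, the math can be sketched as a simple recurrence. This is my own notation, not anything from the Stanford assignment, and it assumes that no element can expand back into itself (a recursive grammar can produce infinitely many sentences). Write N(e) for the number of sentences an element e can produce, where each plain word contributes a factor of 1:

```latex
N(e) \;=\; \sum_{x \,\in\, \mathrm{expansions}(e)} \;\; \prod_{e' \,\in\, \mathrm{elements}(x)} N(e')
```

An expansion containing no elements has an empty product, so it counts as 1. For the simple grammar near the top of this post, that gives N(<liquid>) = 3, N(<verb>) = 1 + 1 + 3 = 5, N(<pronoun>) = 7, and N(<start>) = 7 × 5 = 35. Strictly speaking this counts derivations, so it is an upper bound whenever two different paths through the grammar happen to produce identical text.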

Michael sobbed.

…to the moderately complex…

Last week my daughter was upset to discover that all the gophers wanted to gracefully gulp down 42 gallons of armadillo soup.

…to the very complex…

This group of supermodels will become very sumptuous, moderately rich, and positively sleazy next Tuesday, during the time that my sister-in-law’s daughter will look decidedly tough and allegedly creepy, and next Monday, even though a stormtrooper and that surprisingly murderous nurse will haul peanuts across a terribly horrendous ocean, very heavyhearted kangaroos will move out of the pale blue forest.

So, now cut to present day… I recently wrote the same Random Sentence Generator in T-SQL, and it’s available here on my SkyDrive (along with my grammar file and a few of the various others mentioned above) to anyone who wants to fool around with it (in SQL Server 2005 or later).

In order to use it, you must first create a table called, appropriately, Grammar:

use TempDB  /* Change to whatever database you wish */
go
if object_id('Grammar','U') is not null drop table Grammar
go
create table Grammar
(
   Descript nvarchar(40)  /* Same width as the @GrammarDescript parameter of the procedures */
  ,Element varchar(40)
  ,Expansion varchar(200)
)
go
create clustered index Ix_Grammar on Grammar (Descript,Element)
go
The table will house any number of grammar file definitions that you wish to load in. I’ve written a stored procedure called LoadGrammarFile to do this. The code below loads the various grammar files mentioned above (the Nonsense file is the one that I put together):

exec LoadGrammarFile 'StarTrek' , 'c:\grammar\trek.txt'
exec LoadGrammarFile 'Bond' , 'c:\grammar\bond.txt'
exec LoadGrammarFile 'Insult' , 'c:\grammar\insult.txt'
exec LoadGrammarFile 'Nonsense' , 'c:\grammar\nonsense.txt'
So, for example, the last line above loads a grammar with the description of Nonsense from the file c:\grammar\nonsense.txt.
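
Once the files are loaded, a quick query against the table is a handy sanity check. The sketch below uses only the Grammar table and columns defined above; the 'Nonsense' description and the <start> element are simply the ones used in this post, so substitute your own.

```sql
/* How many elements and expansions did each grammar load? */
select Descript
      ,count(distinct Element) as Elements
      ,count(*) as Expansions
from Grammar
group by Descript
order by Descript

/* What can the <start> element of one grammar expand to? */
select Expansion
from Grammar
where Descript=N'Nonsense' and Element='<start>'
```

A grammar that shows zero rows here was not loaded, which is worth ruling out before generating anything from it.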

Now that our table is loaded with our various grammars, we can generate sentences to our heart’s content. The stored procedure called GenerateSentence will do just that, based on the grammar we desire:

exec GenerateSentence 'Nonsense'
/*
Sentence
---------------------------------
Ronald tried to escape from Peru.
*/
The GenerateSentence procedure also accepts a few optional arguments. For example, if you know your grammar file well, you can indicate that you want the sentence generation to start with a different element from the <start> element, like so:

exec GenerateSentence 'Nonsense', @Init='<confuciusoption>'
/*
Sentence
-----------------------------------------------------------
Confucius say, "Pessimism is like cabbage, only more sexy."
*/


exec GenerateSentence 'Nonsense', @Init='<bookmovieoption>'
/*
Sentence
-------------------------------------------------------------------------------------
Larry Coolidge's wife checked out a film entitled "Cancer And The Kumquats in Tokyo".
*/


exec GenerateSentence 'Nonsense', @Init='<liquidobject>'
/*
Sentence
-----------------------------
A couple of tomato milkshakes
*/

You can also see how the sentence is generated step by step by passing @Debug=1, which will PRINT each step to the Messages tab in SSMS:

exec GenerateSentence 'Nonsense', @Init='<liquidobject>', @Debug=1
/*
<liquidobject>
<quantity> <liquidcontainer>s of <liquid>
<digitmorethan0><digit> <liquidcontainer>s of <liquid>
6<digit> <liquidcontainer>s of <liquid>
62 <liquidcontainer>s of <liquid>
62 bowls of <liquid>
62 bowls of <fruit> tea
62 bowls of cantaloupe tea

Sentence
--------------------------
62 bowls of cantaloupe tea
*/
Finally, you can also specify how many sentences to generate by passing a @Quantity value (at most 100 sentences will be generated per call):

exec GenerateSentence 'Nonsense', @Init='<liquidobject>', @Quantity=10
/*
Sentence
---------------------------------
448 glasses of Clorox
An almond milkshake
16 buckets of NyQuil
Some blueberry-scented perfume
A few vats of Tabasco sauce
3 gallons of snot
A couple of bowls of apricot soda
Some vomit
846 tanks of antelope soup
7 vials of artichoke oil
*/
The code for the LoadGrammarFile and GenerateSentence procedures is below. Note that they make use of several of the string UDFs that I discussed in my last blog entry (StrExtract, ChrTran, At, and Properize), so those need to exist in the same database.

So have some fun and download it and play with it… Make it part of your daily routine… a “thought for the day” to get you going in the morning.

If you generate any wildly funny or especially profound sentences, I’d love to hear them.

/*----------------------------------------------------------------------------------*/

if object_id('LoadGrammarFile','P') is not null drop procedure LoadGrammarFile
go
create procedure LoadGrammarFile
   @GrammarDescript nvarchar(40)
  ,@FileName nvarchar(max)
as
begin

  set nocount on

  declare @SqlCommand nvarchar(max)
         ,@GrammarDef varchar(max)
         ,@ElementDef varchar(max)
         ,@ExpansionList varchar(max)
         ,@Element varchar(40)
         ,@Expansion varchar(max)
         ,@i int
         ,@j int

  /*
  Use Dynamic SQL to load the contents of the file into a string.
  OPENROWSET(BULK ...) needs a literal file name, so it has to be
  concatenated in; any single quotes in the name are doubled so
  that they cannot terminate the string literal early
  */
  set @SqlCommand=
     N'set @GrammarDef=(select *
                        from openrowset(bulk '''
    +replace(@FileName,N'''',N'''''')+N''', single_clob) x)'
  exec sp_executesql
     @SqlCommand
    ,N'@GrammarDef varchar(max) output'
    ,@GrammarDef output

  /*
  Take care of odd stuff seen in various Stanford grammar files
  */
  set @GrammarDef=replace(@GrammarDef,'> s ','>s ')         /* Plurals */
  set @GrammarDef=replace(@GrammarDef,'> ''s ','>''s ')     /* Possessives */
  set @GrammarDef=replace(@GrammarDef,'> "''s" ','>''s ')   /* Possessives */

  /*
  Clear out any previous contents for the
  desired Grammar description from the table
  */
  delete Grammar where Descript=@GrammarDescript

  /*
  Parse through the grammar definition until done
  */
  set @i=0
  while 1=1
  begin
    set @i=@i+1

    /*
    Find the next Element Definition between {} delimiters
    and convert any CR or LF or TAB characters to spaces
    (one replacement space per character, assuming ChrTran
    maps characters positionally like VFP's CHRTRAN)
    */
    set @ElementDef=dbo.StrExtract(@GrammarDef,'{','}',@i,0)
    set @ElementDef=ltrim(rtrim(dbo.ChrTran(@ElementDef
                                           ,char(13)+char(10)+char(9)
                                           ,space(3))))

    if @ElementDef='' break /* We're done! */

    /*
    Get the name of the Element within the Definition
    and get the list of Expansions for that Element
    (in other words, the rest of the Definition contents)
    */
    set @Element=dbo.StrExtract(@ElementDef,'<','>',1,4)
    set @ExpansionList=';'+ltrim(substring(@ElementDef
                                          ,len(@Element)+1
                                          ,len(@ElementDef)))

    /*
    Parse through the List of Expansions until done
    */
    set @j=0
    while 1=1
    begin
      set @j=@j+1

      /*
      Find the next Expansion between semicolon delimiters
      and include those delimiters in the result.
      Note: We do this because a blank Expansion between semicolons
      is perfectly valid
      */
      set @Expansion=dbo.StrExtract(@ExpansionList,';',';',@j,4)
      if @Expansion='' break /* We're done! */

      /*
      Now get rid of those semicolon delimiters
      */
      set @Expansion=ltrim(rtrim(replace(@Expansion,';','')))

      /*
      There may be some Expansions where we actually do want a semicolon
      and we represent that with a double-colon (::), so change
      those to actual semicolons
      */
      set @Expansion=replace(@Expansion,'::',';')

      /*
      Now we can finally insert the Expansion into our table
      */
      insert Grammar (Descript,Element,Expansion)
      values (@GrammarDescript,@Element,@Expansion)
    end
  end
end
go

/*----------------------------------------------------------------------------------*/

if object_id('GenerateSentence','P') is not null drop procedure GenerateSentence
go
create procedure GenerateSentence
   @GrammarDescript nvarchar(40)
  ,@Quantity int = 1
  ,@Init varchar(40) = '<start>'
  ,@Debug bit = 0
as
begin

  set nocount on

  declare @Counter int
         ,@Sentence varchar(max)
         ,@Element varchar(500)
         ,@Expansion varchar(500)
         ,@Fragment varchar(500)
         ,@FragText varchar(500)
         ,@FirstChar char(1)
         ,@i int
         ,@p1 int
         ,@p2 int
         ,@p3 int

  declare @SentenceBucket table (Sentence varchar(max))

  set @Counter=0
  while @Counter<case when @Quantity>100 then 100 else @Quantity end
  begin

    set @Counter=@Counter+1

    /*
    Infinitely loop in creating sentences until a valid one comes along
    */
    while 1=1
    begin

      /*
      Initialize
      */
      set @Sentence=@Init
      if @Debug=1 print ltrim(@Sentence)

      /*
      Perform substitutions of all Elements until done
      */
      while 1=1
      begin
        set @Element=dbo.StrExtract(@Sentence,'<','>',1,4)
        if @Element='' break /* We're done! */

        /*
        Get random value for the Element
        */
        set @Expansion=null
        select top 1 @Expansion=Expansion
        from Grammar
        where Descript=@GrammarDescript and Element=@Element
        order by newid()

        /*
        Bail out if the Element has no Expansions at all (grammar not
        loaded, or a misspelled Element name); otherwise the sentence
        would become NULL and this loop would never end
        */
        if @Expansion is null
        begin
          raiserror('Element %s is not defined in grammar %s.',16,1
                   ,@Element,@GrammarDescript)
          return
        end

        /*
        And put it into the sentence
        */
        set @Sentence=stuff(@Sentence
                           ,charindex(@Element,@Sentence)
                           ,len(@Element)
                           ,@Expansion)

        if @Debug=1 print ltrim(@Sentence)
      end

      /*
      If the sentence is valid then we're done!
      Note: Sentences are valid 99.9% of the time
      */
      if charindex('***',@Sentence)=0 break

      /*
      Otherwise, loop around again
      */
      if @Debug=1 print 'Invalid sentence... Restarting...'
    end

    /*
    Temporarily surround punctuation with spaces
    */
    set @Sentence=replace(@Sentence,'"',' " ')
    set @Sentence=replace(@Sentence,',',' , ')
    set @Sentence=replace(@Sentence,';',' ; ')
    set @Sentence=replace(@Sentence,'.',' . ')
    set @Sentence=replace(@Sentence,'-',' - ')

    /*
    Handle plurals, verb tenses, and adjectives/adverbs
    Examples: [man|men]         --> man
              [man|men]s        --> men
              [sit|sat]         --> sit
              [sit|sat]ed       --> sat
              [happy|happily]   --> happy
              [happy|happily]ly --> happily
    */
    while 1=1
    begin
      set @p1=charindex('[',@Sentence)
      if @p1=0 break /* We're done! */
      set @p2=charindex(']',@Sentence,@p1)
      set @p3=charindex(' ',@Sentence+' ',@p2)
      set @Fragment=substring(@Sentence,@p1,@p3-@p1)
      set @FragText=case
                      when @p3-@p2=1
                      then dbo.StrExtract(@Fragment,'[','|',1,0)
                      else dbo.StrExtract(@Fragment,'|',']',1,0)
                    end
      set @Sentence=stuff(@Sentence
                         ,charindex(@Fragment,@Sentence)
                         ,len(@Fragment)
                         ,@FragText+' ')
      if @Debug=1 print ltrim(@Sentence)
    end

    /*
    Handle PROPERIZE(: :) directive
    */
    set @Fragment=dbo.StrExtract(@Sentence,'PROPERIZE(:',':)',1,4)
    if @Fragment<>''
    begin
      set @FragText=substring(@Fragment,12,len(@Fragment)-13)
      set @Sentence=stuff(@Sentence
                         ,charindex(@Fragment,@Sentence)
                         ,len(@Fragment)
                         ,dbo.Properize(@FragText,0))
      if @Debug=1 print ltrim(@Sentence)
    end

    /*
    Handle "a" and "an" (i.e. "a elephant" -> "an elephant")
    Note: Add a space at the beginning of the sentence in case
    there's a potential word "A" at the beginning
    */
    set @Sentence=' '+@Sentence
    set @i=0
    while 1=1
    begin
      set @i=@i+1
      set @p1=dbo.At(' a ',@Sentence,@i)
      if @p1=0 break
      set @FirstChar=left(ltrim(substring(@Sentence,@p1+2,len(@Sentence))),1)
      if @FirstChar in ('a','e','i','o','u')
        set @Sentence=stuff(@Sentence,@p1+1,2,'an ')
      if @Debug=1 print ltrim(@Sentence)
    end

    /*
    Get rid of surrounding space around double quotes
    in preparation for next step
    */
    set @Sentence=replace(@Sentence,' " ','"')

    /*
    Get rid of all leading/trailing spaces
    And capitalize the first letter of the sentence
    */
    set @Sentence=ltrim(rtrim(@Sentence))
    set @Sentence=upper(left(@Sentence,1))+substring(@Sentence,2,len(@Sentence))

    /*
    Capitalize any word that comes after the first
    of a pair of double quotes
    */
    set @i=-1
    while 1=1
    begin
      set @i=@i+2 /* Every other double quote */
      set @p1=dbo.At('"',@Sentence,@i)
      if @p1=0 break
      set @FragText=ltrim(substring(@Sentence,@p1+1,len(@Sentence)))
      set @Sentence=left(@Sentence,@p1)
                   +upper(left(@FragText,1))
                   +substring(@FragText,2,len(@FragText))
    end

    /*
    Fix all the close quotes
    Get rid of any double-spaces
    And clean up any spaces before punctuation
    Note: space(2) is used for the double-space so that the two
    spaces can't be collapsed into one when this code is rendered
    on a web page (which would turn the loop into an infinite one)
    */
    while charindex(', "',@Sentence)>0
      set @Sentence=replace(@Sentence,', "',',"')
    while charindex('. "',@Sentence)>0
      set @Sentence=replace(@Sentence,'. "','."')
    while charindex(space(2),@Sentence)>0
      set @Sentence=replace(@Sentence,space(2),' ')
    while charindex(' ,',@Sentence)>0
      set @Sentence=replace(@Sentence,' ,',',')
    while charindex(' ;',@Sentence)>0
      set @Sentence=replace(@Sentence,' ;',';')
    while charindex(' .',@Sentence)>0
      set @Sentence=replace(@Sentence,' .','.')
    while charindex(' - ',@Sentence)>0
      set @Sentence=replace(@Sentence,' - ','-')
    while charindex(' ".',@Sentence)>0
      set @Sentence=replace(@Sentence,' ".','".')

    /*
    Just in case, again get rid of leading/trailing spaces
    */
    set @Sentence=ltrim(rtrim(@Sentence))

    /*
    And save it in our table variable
    */
    insert @SentenceBucket values (@Sentence)

  end

  /*
  Send back our sentence(s)
  */
  select Sentence from @SentenceBucket

end
go
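
To see the [singular|plural] markup in isolation, here is a small stand-alone sketch of that one step. It is an illustration rather than an excerpt: the sample sentence is made up, @bar is a variable introduced just for this example, and plain charindex/substring calls stand in for the StrExtract UDF so that it runs without any other objects installed.

```sql
declare @Sentence varchar(500)
       ,@FragText varchar(500)
       ,@p1 int   /* position of [ */
       ,@bar int  /* position of | */
       ,@p2 int   /* position of ] */
       ,@p3 int   /* position of the first space after ] */

set @Sentence='two [man|men]s saw one [man|men] and [sit|sat]ed down'

while 1=1
begin
  set @p1=charindex('[',@Sentence)
  if @p1=0 break /* No more bracketed forms */
  set @bar=charindex('|',@Sentence,@p1)
  set @p2=charindex(']',@Sentence,@p1)
  set @p3=charindex(' ',@Sentence+' ',@p2)

  /* A suffix glued to the ] (s, ed, ly...) selects the second form */
  set @FragText=case
                  when @p3-@p2=1
                  then substring(@Sentence,@p1+1,@bar-@p1-1)
                  else substring(@Sentence,@bar+1,@p2-@bar-1)
                end

  /* Replace the brackets and any suffix with the chosen form */
  set @Sentence=stuff(@Sentence,@p1,@p3-@p1,@FragText)
  print @Sentence
end
/* Final line printed: two men saw one man and sat down */
```

The suffix itself is thrown away, which is what lets irregular forms like man/men and sit/sat come out right instead of as “mans” or “sited”.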

3 comments:

  1. That is really cool Brad, thanks for this. I think that you might even set up a website for web designers, who at the moment use Lorem Ipsum generator to fill their application prototypes. You have covered Bond and Star Trek websites, the sky is the limit :)

  2. @Piotr: Thanks for your comments! (And your blog looks interesting... I've added it to my list).

  3. Fantastic... great for replacing confidential information in databases when you need to develop offsite
