Computational / Poetry Thesis Blog Post / Stanza 三: Haiku
A wise man once told me that if you train an n-gram model for generation with too much data, it will hurt. Bad.
We’re talking Kurzweilian singularity ➡ grey goo ➡ ??? ➡ profit! kind of hurt.
That’s the way I’ve felt over the last month or so when thinking about my thesis: there were so many directions I could go in, so many theoretically intriguing and clever avenues to venture down.
In order to head off my progressive channeling of Don Giovanni, I got back to fundamentals, and rediscovered what it really meant to be a haiku.
俳句 – haiku
“Ignorance of other cultures is the currency of ours.”
Any genuine artifact of a culture will inevitably become completely bastardized; it is as much a law of the universe as entropy. Much the same fate has befallen the noble haiku, now distilled to a metrical exercise of 5-7-5. (Then again, who am I to judge? As if I, of all people, could bemoan the loss of 17th-century Japanese culture.) What appealed to me from the lost traditions were two missing pieces, kigo and caesura, which come together to further constrain the project and make the problem at hand better defined.
季語 – kigo (season word)
梅が香に / のっと日の出る / 山路かな
scent of plum blossoms / on the misty mountain path / a big rising sun
Traditional Japanese haiku have this remarkable ability to evoke a sense of place, as in this classic example from Bashō. In a post-modern context, though, such naturalistic references might best be supplanted by a pop-culture reference. Here’s my take:
the earth is crying / our only hope is Al Gore / it’s gettin hott in hrrrr
Yeah, that definitely speaks to me more than some plum trees and rocks in the fog. From the start, I imagined that this program would provide a web interface, where people could generate poems about a keyword. Since I suspect most people would like to see poems about people and places, getting relevant text about named entities is actually a major concern.
休止 – caesura
春雨や / 小磯の小貝 / ぬるるほど
spring rain — / small shells on a small beach / glittering
This example by Buson illustrates the caesura, or break, of traditional haiku. Whether it’s a dash, a period, or a colon (semi- or otherwise), old-school haiku have a grammatical and indexical turn in them. A good haiku uses the two resultant shards to play off of each other through stylistic and symbolic contrasts, embracing a sort of proto-Hegelian dialectic. Pragmatically, this is pretty good news. Not only does the break provide a pseudo-grammatical structure to rest upon, but it also allows different semantic relationships to be explored. For instance, given a theme word or kigo, the former half could contain synonyms and metonyms, whereas the latter might be all antonyms.
ボックスの話 – Natural Language Processing
After a solid week of hacking on the project, a working prototype of the haiku module started to emerge. Most importantly, I’m happy to report, it kinda works: the system generates valid haiku form just fine, and is cognizant of semantic relationships to the point of only needing parameter tuning. Allow me to present the most reasonable and least-embarrassing results from initial testing:
sunny cheerful day - / gloomy nimbus on Wall Street / today as stocks
Stranger still, after generating this, my program decided to move to New York City to make it big, where it received mixed reviews on its interludes to the oppression of the working-man’s clock cycles.
Introducing the Cast
Keats is composed of several different NLP modules, each optimized for different tasks, which are combined using some pretty slick Ruby glue code. Here are the components used at the moment:
CMU Pronouncing Dictionary
As I mentioned before, the CMU Pronouncing Dictionary was the technical inspiration for my thesis project in the first place. It offers both phonological and prosodic information on tens of thousands of words. I indexed the flat-file into a MySQL database, which cached calculated information like number of syllables and vowel geometry, along with a delightful IPA string representation.
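To make the syllable bookkeeping concrete, here is a minimal Python sketch (not the project's actual Ruby code) of how syllable counts can be derived from the dictionary's flat file. It relies on the fact that each vowel phone in the ARPAbet transcription ends in a stress digit. The sample entries are written in the dictionary's format for illustration only, and the names `parse_entry`, `count_syllables`, and `stress_pattern` are hypothetical.

```python
import re

# Sample entries in the CMUdict flat-file format: a headword followed by
# ARPAbet phones. Vowel phones carry a stress digit (0, 1, or 2).
# These lines are illustrative, not copied from a specific release.
SAMPLE_LINES = [
    ";;; lines starting with three semicolons are comments",
    "HAIKU  HH AY1 K UW0",
    "BLOSSOMS  B L AA1 S AH0 M Z",
    "TODAY  T AH0 D EY1",
    "TODAY(1)  T UW0 D EY1",
]

def parse_entry(line):
    """Return (word, phones) for a dictionary line, or None for comments/blanks."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phones = line.split()
    # Alternate pronunciations are marked WORD(1), WORD(2), ...; drop the marker.
    word = re.sub(r"\(\d+\)$", "", word)
    return word.lower(), phones

def count_syllables(phones):
    """One syllable per vowel phone, i.e. per phone ending in a stress digit."""
    return sum(1 for p in phones if p[-1].isdigit())

def stress_pattern(phones):
    """Concatenate the stress digits, e.g. '10' for stressed-unstressed."""
    return "".join(p[-1] for p in phones if p[-1].isdigit())

# Build a word -> syllable count table; the first pronunciation listed wins.
SYLLABLES = {}
for raw in SAMPLE_LINES:
    entry = parse_entry(raw)
    if entry:
        word, phones = entry
        SYLLABLES.setdefault(word, count_syllables(phones))
```

A table like `SYLLABLES` is the kind of derived information that could be cached alongside each word in the database.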
WordNet
Perhaps one of the best-known NLP resources out there, Princeton’s WordNet is a semantic index of staggering breadth and depth. Not only does it have hundreds of thousands of entries, but for each entry, there is an extensive relational graph for everything from synonymy to verb frame. In order to integrate it into my project, I wrote a script to transform the Prolog database into a MySQL relational database, and cross-reference it with the existing phonological information from the CMU Dictionary.
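A rough sketch of that kind of conversion follows, in Python with the standard-library `sqlite3` module standing in for MySQL. It assumes sense facts of the form `s(synset_id, w_num, 'word', ss_type, sense_number, tag_count).`, as found in WordNet's Prolog distribution; the ids and counts in the sample are made up, and the table and function names are hypothetical rather than taken from the original script.

```python
import re
import sqlite3

# Sample facts shaped like those in WordNet's Prolog file wn_s.pl.
# The synset ids and tag counts below are invented for illustration.
SAMPLE_FACTS = """
s(100000001,1,'blossom',n,1,5).
s(100000001,2,'bloom',n,1,2).
s(100000001,3,'flower',n,1,0).
s(100000002,1,'plum',n,1,3).
"""

FACT = re.compile(r"^s\((\d+),(\d+),'(.*)',([a-z]),(\d+),(\d+)\)\.$")

def load_senses(conn, text):
    """Parse s/6 facts and insert one row per word sense."""
    conn.execute(
        "CREATE TABLE senses (synset_id INTEGER, w_num INTEGER, word TEXT, ss_type TEXT)"
    )
    for line in text.splitlines():
        m = FACT.match(line.strip())
        if m:
            synset_id, w_num, word, ss_type = m.group(1, 2, 3, 4)
            word = word.replace("''", "'")  # Prolog escapes a quote by doubling it
            conn.execute(
                "INSERT INTO senses VALUES (?, ?, ?, ?)",
                (int(synset_id), int(w_num), word, ss_type),
            )

def synonyms_with_syllables(conn, word):
    """Synset-mates of `word` that also appear in the pronunciation table."""
    rows = conn.execute(
        """
        SELECT other.word, syl.syllables
        FROM senses AS mine
        JOIN senses AS other
          ON other.synset_id = mine.synset_id AND other.word != mine.word
        JOIN syllables AS syl ON syl.word = other.word
        WHERE mine.word = ?
        ORDER BY other.word
        """,
        (word,),
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")
load_senses(conn, SAMPLE_FACTS)
# Stand-in for the phonological table; 'flower' is left out on purpose.
conn.execute("CREATE TABLE syllables (word TEXT PRIMARY KEY, syllables INTEGER)")
conn.executemany(
    "INSERT INTO syllables VALUES (?, ?)",
    [("bloom", 1), ("blossom", 2), ("plum", 1)],
)
```

Here `synonyms_with_syllables(conn, "blossom")` would return only `bloom`, since `flower` has no row in the stand-in pronunciation table; the inner join is what does the cross-referencing.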
PCFG
After entertaining some ideas about language model training on a corpus of haiku, I decided the most reasonable way to approach it, at least for now, was with my trusted friend, the Probabilistic Context-Free Grammar. To get a feel for what it could do, I hand-wrote a series of transformation rules like S ➡ Adj N V or S ➡ Adj Adj N V N. From there, I would partition syllables for the line and find a candidate word from the WordNet SynSets.
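As a toy illustration of this step, the Python sketch below samples one of the two S rules quoted above and then splits a line's syllable budget across the resulting slots. The rule probabilities are invented, and `sample_rule`, `partitions`, and `plan_line` are hypothetical names, not the project's own.

```python
import random

# Toy grammar: each nonterminal maps to (probability, right-hand side) pairs.
# The two S rules come from the post; the probabilities are assumptions.
GRAMMAR = {
    "S": [(0.6, ["Adj", "N", "V"]), (0.4, ["Adj", "Adj", "N", "V", "N"])],
}

def sample_rule(symbol, rng):
    """Pick one right-hand side for `symbol` according to the rule probabilities."""
    rules = GRAMMAR[symbol]
    weights = [p for p, _ in rules]
    return rng.choices([rhs for _, rhs in rules], weights=weights, k=1)[0]

def partitions(total, parts):
    """Yield every way to split `total` syllables into `parts` positive counts."""
    if parts == 1:
        if total >= 1:
            yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in partitions(total - first, parts - 1):
            yield (first,) + rest

def plan_line(syllables, rng):
    """Expand S, then give each part-of-speech slot a syllable budget."""
    shortest = min(len(rhs) for _, rhs in GRAMMAR["S"])
    if syllables < shortest:
        raise ValueError("not enough syllables for any rule")
    while True:
        rhs = sample_rule("S", rng)
        options = list(partitions(syllables, len(rhs)))
        if options:  # retry if the sampled rule has too many slots
            return list(zip(rhs, rng.choice(options)))
```

Calling `plan_line(7, random.Random())` returns pairs such as a part of speech with a syllable count, whose counts always sum to seven; each pair then becomes a query for a candidate word.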
n-Gram Modeling
Building off a small sample of a New York Times text corpus, I started to work with trigram and larger n-gram language models to produce meaningful (or at least reasonable) sequences between target keywords. This is what generated “on Wall Street / today as stocks” in the example poem: a rule in the PCFG replaced the placeholder symbol M with this Markov model output.
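For readers who have not met n-gram generation before, here is a minimal trigram sketch in Python. The one-sentence corpus is a made-up stand-in for the New York Times sample, and the function names are hypothetical; the real system presumably also handles smoothing and syllable constraints, which this sketch leaves out.

```python
import random
from collections import defaultdict

# Tiny stand-in corpus; the post trained on New York Times text instead.
CORPUS = (
    "stocks fell on wall street today as stocks in asia "
    "fell on wall street again"
).split()

def train_trigrams(tokens):
    """Map each pair of consecutive words to the words observed after it."""
    model = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)].append(c)
    return model

def generate(model, seed, length, rng):
    """Extend a two-word seed by sampling successors until `length` words."""
    words = list(seed)
    while len(words) < length:
        followers = model.get((words[-2], words[-1]))
        if not followers:
            break  # dead end: this word pair was never continued in training
        words.append(rng.choice(followers))
    return words
```

Because repeated successors stay in the list, `rng.choice` samples them in proportion to how often they followed the pair in the corpus.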
All Together Now
Given the kigo, or theme word, I look up every word that WordNet relates to it semantically, keeping only those that also have an entry in the CMU Pronouncing Dictionary. After ranking the candidates (randomly for now, but later by phonological features), the program partitions the syllables and generates a PCFG expansion for the sequence before the split and another for the sequence after it. Using the candidate rankings, it fills in the slots as necessary until the grammar is satisfied and the poem is complete.
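The filtering and slot-filling steps might look something like the Python sketch below. Everything in it is an assumption made for illustration: `RELATED` and `SYLLABLES` are hand-written stand-ins for the WordNet and CMU lookups, and the ranking is a plain shuffle, matching the random ranking described above.

```python
import random

# Hypothetical stand-ins for the two lookups: semantically related words
# (with part of speech) for a theme, and syllable counts for known words.
RELATED = {
    "spring": [
        ("vernal", "Adj"), ("verdant", "Adj"),
        ("rain", "N"), ("bloom", "N"), ("petrichor", "N"),
        ("thaw", "V"),
    ],
}
SYLLABLES = {"vernal": 2, "verdant": 2, "rain": 1, "bloom": 1, "thaw": 1}

def candidates(theme):
    """Keep only related words that also have a known pronunciation."""
    return [(w, pos) for w, pos in RELATED.get(theme, []) if w in SYLLABLES]

def fill_slots(plan, pool, rng):
    """Fill each (part of speech, syllable count) slot from the ranked pool."""
    ranked = list(pool)
    rng.shuffle(ranked)  # random ranking, as in the current prototype
    line = []
    for pos, count in plan:
        match = next(
            (w for w, p in ranked
             if p == pos and SYLLABLES[w] == count and w not in line),
            None,
        )
        if match is None:
            return None  # the grammar cannot be satisfied with this pool
        line.append(match)
    return line
```

Swapping the shuffle for a sort on phonological features would be the natural place to plug in a smarter ranking later.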
I’m feeling pretty good about my progress so far. After some fine-tuning and figuring out better ways to rank candidate words, I should have a pretty robust engine that will produce not only passable haiku but, with slight modification, Fibonacci poems as well. As for the Limerick part of the project, I’m hoping that something will click in the next month, so I’ll have some idea of where to start with that.