In my previous work, I investigated whether adding rhetorical devices to a computer-generated piece of text could make it appear to have been written by a human. Today, I took a much more typical approach: a graph-based approach to natural language generation.
A graph can be built from a piece of text by treating each distinct word as a node and adding a directed edge from node n1 to node n2 whenever word n2 appears directly after word n1. Since the same word may follow another more than once in a piece of text, this becomes a weighted graph, with each edge weight equal to the number of times n2 appears after n1.
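A minimal sketch of that construction, assuming a dictionary of counters as the graph representation and plain whitespace splitting in place of a real tokenizer. The name `build_word_graph` is hypothetical and is not taken from the code behind this post.

```python
from collections import Counter, defaultdict


def build_word_graph(text):
    """Map each word to a Counter of the words that directly follow it."""
    # Assumption: whitespace splitting stands in for a proper tokenizer.
    words = text.split()
    graph = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        # Edge weight = number of times `following` appears right after `current`.
        graph[current][following] += 1
    return graph


graph = build_word_graph("the cat sat on the mat and the cat slept")
print(graph["the"])  # Counter({'cat': 2, 'mat': 1})
```

Here "cat" follows "the" twice and "mat" follows it once, so the edge from "the" to "cat" carries weight 2.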
One can generate new text by selecting a node to start with (perhaps randomly) and then traversing the graph. Each new word can be produced either by following the edge with maximum weight or by choosing an adjacent node at random. In my experiments, choosing the new word at random 1% of the time produced output that is both interestingly varied and distinctly human.
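The traversal could look roughly like the sketch below. It assumes the graph is a dictionary mapping each word to a dictionary of successor weights, and that the random step picks uniformly among the neighbours (the weights are ignored in that case). The names `next_word`, `generate` and `explore` are hypothetical.

```python
import random


def next_word(graph, word, explore=0.01, rng=random):
    """Pick a successor of `word`, or None if it has no outgoing edges."""
    successors = graph.get(word)
    if not successors:
        return None
    if rng.random() < explore:
        # Assumption: the random step is uniform over the adjacent nodes.
        return rng.choice(list(successors))
    # Otherwise follow the edge with the largest weight.
    return max(successors, key=successors.get)


def generate(graph, start, max_words=30, explore=0.01, rng=random):
    """Walk the graph from `start`, collecting at most `max_words` words."""
    words = [start]
    while len(words) < max_words:
        following = next_word(graph, words[-1], explore, rng)
        if following is None:
            break  # dead end: the current word has no successors
        words.append(following)
    return " ".join(words)


toy_graph = {
    "the": {"cat": 2, "mat": 1},
    "cat": {"sat": 1},
    "sat": {"on": 1},
    "on": {"the": 1},
}
print(generate(toy_graph, "the", max_words=6, explore=0.0))  # the cat sat on the cat
```

With `explore=0.0` the walk is fully deterministic and starts cycling as soon as it revisits a word, which is one reason a small amount of randomness helps.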
To ensure that each node has adjacent nodes, one needs a large enough corpus of text. Making use of the excellent NLTK corpora, I built the word graph from the works of Jane Austen, the authors of the King James Bible, William Blake, Sara Cone Bryant, Thornton W. Burgess, Lewis Carroll, G.K. Chesterton, Maria Edgeworth, Herman Melville, John Milton, William Shakespeare, and Walt Whitman. For fun, I also included the texts of all previous US presidents’ inaugural addresses.
I then limited the output text to 140 characters to see what it would be like if all those authors decided to work together to tweet.
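One way to apply such a limit is sketched below, assuming the cut happens on word boundaries so that no word is split in half. The post does not say how the limit was actually enforced, and `fit_to_limit` is a hypothetical helper.

```python
def fit_to_limit(words, limit=140):
    """Join words with spaces, stopping before the text would exceed `limit` characters."""
    kept = []
    length = 0
    for word in words:
        # Every word except the first needs a separating space.
        extra = len(word) + (1 if kept else 0)
        if length + extra > limit:
            break
        kept.append(word)
        length += extra
    return " ".join(kept)


print(fit_to_limit(["to", "be", "or", "not", "to", "be"], limit=11))  # to be or
```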
Here are some example tweets:
mathematician Gentlemen he was a very much as the same time to be the earth and all that he said unto the world and a good and in a few minutes
cruised on the other side of my dear I am sure I have been the first and said the Lord GOD Behold I was the sea and that the whole of it was not
Unscrew the man of our God and of this day of thy God hath not be in his hand of him and with the day and it is the most of that I shall be so
The next question is where one could obtain a modern corpus that would make it possible to replicate the speech of today. Twitter, with its amazingly privacy-free API, could offer this. I may never have to speak again. I’ll just generate a new sentence every time it’s needed. 🙂 Since conversations also exist on Twitter, it may be possible to use Twitter to understand and then generate specific text given a context. Who knows?
Anyway, here’s some graph-building code:
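    def build_graph(self, text):
        last_word = ""
        for w in self.tokenizer.tokenize(text):
            # Create a node the first time a word is seen.
            if w not in self.graph:
                self.graph[w] = Node(w)
            # Link the previous word to the current one (skipped for the first token).
            if last_word != "":
                self.graph[last_word].addEdgeToNode(str(self.graph[w].node_id))
            last_word = w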