Fun_People Archive
1 Dec
"Semantic Forests" and U. S. patent number 5,937,422

Content-Type: text/plain
Mime-Version: 1.0 (NeXT Mail 3.3 v118.2)
From: Peter Langston <psl>
Date: Wed,  1 Dec 99 15:11:01 -0800
To: Fun_People
Precedence: bulk
Subject: "Semantic Forests" and U. S. patent number 5,937,422

X-Lib-of-Cong-ISSN: 1098-7649  -=[ Fun_People ]=-
X-http://www.langston.com/psl-bin/Fun_People.cgi
From: owner-cwd@vorlon.mit.edu

	Jacking in from the "Sticks and Stones" Port:

	By Suelette Dreyfus
	Special Correspondent
	CyberWire Dispatch

"Semantic Forests" doesn't mean much to the average person. But if you say
it in concert with the words "automatic voice telephone interception" and
"U.S. National Security Agency" to a computational linguist, you might just
witness the physical manifestations of the word "fear."

Words are funny things, often so imprecise. Two people can have a telephone
conversation about sex, without ever mentioning the word. And when the
artist formerly known as Prince sang a song about "cream," he wasn't talking
about a dairy product.

All this linguistic imprecision has largely protected our voice
conversations from the prying ears of governments. Until now.

Or, more particularly, it protected us until 15 April, 1997 - the date the
NSA lodged a secret patent application at the US Patent Office. Of course,
the content of the NSA patent was not made public for two years, since the
Patent Office keeps patent applications secret until they are approved,
which in this case was August 10, 1999.

What is so worrying about patent number 5,937,422?  The NSA is believed to
be the largest and by far the best-funded spy agency in the world, a
Microsoft of Spookdom. This document provides the first hard evidence that
the NSA appears to be well on its way to creating eavesdropping software
capable of listening to millions of international telephone calls a day.
Automatically.

Patents are sometimes simply ambit claims, legal handcuffs on what often
amounts to little more than theory. Not in this case. This is real. The U.S.
Department of Defense has developed the NSA's patent ideas into a real
software program, called "Semantic Forests," which it has been lab testing
for at least two years.

Two important reports to the European Parliament, in 1998 and 1999, and
Nicky Hager's 1996 book "Secret Power" reveal that the NSA intercepts
international faxes and emails. At the time, this revelation upset a great
number of people, no doubt including the European companies which lost
competitive tenders to American corporations not long after the NSA found
its post-Cold War "new economy" calling: economic espionage.

Voice telephone calls, however, well, that is another story. Not even the
world's most technically advanced spy agency has the ability to do massive
telephone interception and automatically massage the content looking for
particular words, and presumably topics. Or so said a comprehensive recent
report to the European Parliament.

In April 1999, a report commissioned by the Parliament's Office of
Scientific and Technological Options Assessment (STOA), concluded that
"effective voice 'wordspotting' systems do not exist" and "are not in use".

The tricky bit there is "do not exist". Maybe these systems haven't been
deployed en masse, but it is looking increasingly like they do actually
exist, probably in some form which may be closer to the more powerful topic
spotting.

Do The Math
============

There are two new pieces of evidence to support this, and added together,
they raise some fairly explosive questions about exactly what the NSA is
doing with the millions of international phone calls it intercepts every
day in its electronic eavesdropping web commonly known as Echelon.

First. The NSA's shiny new patent describes a method of "automatically
generating a topic description for text and sorting text by topic." Sound
like a sophisticated web search engine? That's because it is.

This is a search engine designed to trawl through "machine transcribed
speech," in the words of the patent application. Think computers
automatically typing up words falling from human lips. Now think of a
powerful search engine trawling through those words.

Now sweat...
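
For the technically curious, here is a rough sketch of what "generating a
topic description for text and sorting text by topic" could look like in
miniature. It is not the patented method, which this article does not spell
out. It swaps in plain TF-IDF term weighting as a stand-in, and every name in
it (STOPWORDS, topic_descriptions, rank_by_topic, the sample transcripts) is
invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that",
             "on", "for", "at", "are", "was", "me", "about", "out"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]


def tfidf_weights(transcripts):
    """Return one {term: weight} dict per transcript (smoothed TF-IDF)."""
    docs = [Counter(tokenize(t)) for t in transcripts]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())
    weighted = []
    for counts in docs:
        weighted.append({
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        })
    return weighted


def topic_descriptions(transcripts, top_n=3):
    """Describe each transcript by its top_n highest-weighted terms."""
    result = []
    for weights in tfidf_weights(transcripts):
        # Sort by descending weight, then alphabetically for stable output.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        result.append(ranked[:top_n])
    return result


def rank_by_topic(transcripts, topic_words):
    """Return transcript indices, best match for topic_words first."""
    wanted = set(tokenize(" ".join(topic_words)))
    scores = [sum(w for term, w in weights.items() if term in wanted)
              for weights in tfidf_weights(transcripts)]
    return sorted(range(len(transcripts)), key=lambda i: (-scores[i], i))


transcripts = [
    "the shipment of wheat arrives at the port on friday wheat prices rising",
    "call me about the football match the football tickets are sold out",
    "wheat harvest was poor and the port is closed",
]
print(topic_descriptions(transcripts))
print(rank_by_topic(transcripts, ["wheat", "port"]))
```

Feed it machine-transcribed calls instead of three toy sentences and the
shape of the problem is the same: describe each transcript by a few heavy
words, then pull the ones that match a topic to the top of the pile.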

Maybe the spy agency only wants to transcribe the BBC Radio World News, but
I don't think so. The patent contains a few more linguistic clues about the
NSA's intent - little golden Easter eggs buried in the legal long grass.
The "Background to the Invention" section of every patent application is
the place where the intellectual property lawyers desperately try to wave
away everyone else's right to claim anything even remotely touching on the
patent.

In this section, the NSA attorneys observed there has been "growing
interest" in automatically identifying topics in "unconstrained speech."

Only a lawyer could make talking sound so painful. "Unconstrained speech"
means human conversation. Maybe it's been "unconstrained" by the likelihood
of being automatically transcribed for real time topic searching.

Here's the part where the imprecision of words - particularly spoken words
- comes in. Machine transcribed conversations are raw, and very hard to
analyze automatically with software. Many experts thought the NSA couldn't
go driftnet fishing in the content of everyone's international phone calls
because the technology to transcribe and analyze those calls was too young.

However, if the NSA didn't have the technology to do automatic transcription
of speech, why would it have patented a sifting method which, by its very
own words, is aimed at transcripts of human speech?

As Australian cryptographer Julian Assange, who discovered the DoD and
patent papers while investigating NSA capabilities, observed: "Why make tires
if you don't have a car? Maybe we haven't seen the car yet, but we can infer
that it exists by all the tires and roads."

One of the top American cryptographers, Bruce Schneier, also believes the
NSA already has machine transcription capability. "One of the Holy Grails
of the NSA is the ability to automatically search through voice traffic,"
Schneier said.  "They would have expended considerable effort on this
capability, and this research indicates at least some of it has been
fruitful."

Second, two Department of Defense academic papers show the U.S. developed
a real software program, called "Semantic Forests," to implement the
patented method.

Published as part of the Text REtrieval Conference (TREC) in 1997 and 1998,
the Semantic Forests papers show the program has one main purpose:
"performing retrieval on the output of automatic speech-to-text (speech
recognition) systems."  In other words, the U.S. built this software
*specifically* to sift through computer-transcribed human speech.

If that doesn't send a chill down your spine, read on.

The DoD's second prime purpose for Semantic Forests was to "explore rapid
prototyping" of this information retrieval system. That statement was
written in 1997.

There's also an unambiguous link between Semantic Forests and the NSA
patent: it's human, and its name is Patrick Schone.

Schone appears on the NSA patent documents as an inventor and on the Semantic
Forests papers as an author, and he works at Ft. Meade, NSA's headquarters.

Specifically, he works in the DoD's "Speech Research Branch" which just
happens to be located at, you guessed it, Ft. Meade.


Very Clever Fish
================

The NSA and the DoD refused to comment on the patent and Semantic Forests,
respectively. Not surprising, really, but no matter, since the Semantic Forests
papers speak for themselves. The papers reveal a software program which,
while somewhat raw a year ago, was advancing quickly in its ability to fish
relevant data out of various document pools, including those based on
speech.

For example, in one set of tests, the scientists increased the average
precision rate for finding relevant documents per query from 19% to 27% in
just one year, from 1997 to 1998. Tests in 1998 on another set of documents,
in the "Spoken Document Retrieval" pool, were turning up similar stats, around
20-23 per cent. The team also discovered that a little hand-fiddling in the
software reaped large rewards.

According to the 1998 TREC paper: "When we supplemented the topic lists for
all the queries (by hand) to contain additional words from the relevant
documents, our average precision at the number of relevant documents went
from 28% to 50%."
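
"Average precision at the number of relevant documents" reads like the
measure often called R-precision: for a query with R relevant documents,
count how many of the top R results are relevant, divide by R, then average
over all queries. The sketch below assumes that reading. The function names
and the document IDs are made up for illustration and do not come from the
TREC papers.

```python
def precision_at_r(ranked_ids, relevant_ids):
    """Fraction of the top-R ranked documents that are relevant,
    where R is the number of relevant documents for the query."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant)
    return hits / r


def mean_precision_at_r(runs):
    """Average precision_at_r over (ranked_ids, relevant_ids) pairs,
    one pair per query."""
    if not runs:
        return 0.0
    return sum(precision_at_r(ranked, rel) for ranked, rel in runs) / len(runs)


# One hypothetical query: 4 relevant documents, 3 of them in the top 4.
ranked = ["d1", "d7", "d3", "d9", "d5"]
relevant = {"d1", "d3", "d5", "d9"}
print(precision_at_r(ranked, relevant))
```

On this measure, adding good words to a query helps only if it pushes
relevant documents into the top R slots, which is what the hand-supplemented
topic lists appear to have done.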

The truth is that Schone and his colleagues have created a truly clever
invention. They have done some impressive research. What a shame all this
creativity and laborious testing is going to be used for such dark,
Orwellian purposes.

Let's work on the mental image of that dark landscape. The NSA sucks down
phone calls, emails - all sorts of communications - to its satellite bases.
Its computers sift through the data looking for information which might
interest the U.S. or, if the Americans happen to be feeling generous that
day, their allies.

Now, whenever NSA agents want to find out about you, they pull up a slew of
details about you on their database. And not just the run-of-the-mill
gumshoe detective stuff like your social security number and address, but the
telephone number of every person you call regularly, and everything you have
said when making those calls to 1-900-Lick-Me from your hotel room on those
stopovers in Cleveland.

And here's the real scary stuff:

The NSA likely already has a file on many of us. It's not a traditional
manila file with your name typed neatly on the front. It's the ability to
reference you, or anyone who matches your patterns of behavior and contacts,
in the NSA's databases. Now, or in the near future, this file may not just
include who you are, but what you *say*.

British Member of the European Parliament Glyn Ford is one of the few
politicians around who is truly concerned with the individual's right to
privacy. A driving force behind the European Parliament's STOA panel's two
year investigation into electronic communications, Ford is worried that the
NSA possesses technologies that are "potentially very dangerous" to privacy
and yet faces no controls over its activities.

The Australian aboriginal activist and lawyer Noel Pearson once said that
the British gave three great things to the world: tea, cricket and
common law. If unchecked, the NSA and its sister spy agencies in the UKUSA
agreement may use this technology to lead an assault on the most important
of those gifts and the common law tenet "innocent until proven guilty" may
be the first casualty.

How ironic: one Blair wrote '1984' as fiction, and another is helping to
make it fact.

= = = = = = = = = = = = = = = =

An Australian-American writer, Suelette Dreyfus was educated in the UK
and US, studied at Oxford University and Columbia University in New York,
where she won the prestigious Teichmann Prize for excellence and originality
in writing.