[Image: Boston Public Library]
There are many in the "court of public opinion" who really do not have an informed opinion on electronic texts. That is fine; it is a niche. There are many scholars in universities, professionals working with texts, explicating texts and publishing books on interpretation of our textual heritage who also do not have a grounded understanding of electronic texts. There is a generation gap, perhaps a dawdling before, no, during the paradigm shift, perhaps a dawning awareness that a really important train has left the station and one was not on it. There was too little intellectual capital available to acquire a ticket. These people need to be informed; they will also bring apologists for their desperate dawdling into the discussion.
Essays have been written to suggest that wide access to electronic text through indexing will be ruinous to scholarship and to scholars. I refer to Nunberg's "Google's Book Search: A Disaster for Scholars," Chronicle of Higher Education, August 2009. I remember being shocked by the grotesque headline, which will have influenced public opinion for all those who did not read the article, or read it but did not understand what they were reading. I can only answer: "When a scholar has the book on the laptop, there is no need for metadata." The goal is to have the book - metadata is for catalogs and pedants parading in the "Chronicle."
The "Disaster for Scholars" proposition can seem plausible only to those who have not really examined the vast power and success of electronic methodologies. It seems plausible to some because it happens to be a "corporation" doing the indexing; and not universities or some even less competent agency. I suspect Nunberg has not made the shift from the OPAC to the INDEX. Sever issues have to be covered, so let us start slowly. Card catalogs and Union Catalogs were the great work of the late 19th and the first 80 years of the 20th c. We could rename the time the age of bibliography if we did not already have more telling names for the time. Doubtless in its execution, the 700+ volumes of the pre-1956 imprints is a spectacular achievement. When it was finished in 1981, it had already been replaced by OPACS and WorldCat. We can say without exaggeration that WorldCat represents 150 years of work by uncounted thousands of professionals.
Look up a name and you will find what that person has written and in which library the book / article / video etc. can be found - very meticulous metadata on our written heritage. For all its splendor, WorldCat gives only references - it does not presume to open books for you. Even when there is a link to a public e-text, the attitude of WorldCat is: here it is, you start reading.
Catalogs of metadata only give the illusion of order. A list of books, being a "mere list," cannot assign any relative importance of one book over another. We can sort alphabetically, we can sort by date. We can sort by any field in the MARC record.
That capability has evolved since the card boxes of the 19th c. in Europe, and before that from the acquisition lists of the 15th c.
I feel certain that the Google people do not want to play in the catalog game. When you are working on statistical measures of the significance of text snippets across billions of pages of text, cataloging 11 million items seems small potatoes.
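To make the contrast concrete, here is a minimal sketch of the kind of statistical weighting at issue - a toy TF-IDF score over four invented "pages." The corpus, the query and the scoring formula are my own illustrative assumptions, not anything Google has disclosed; the point is only that ranking snippets is a statistical exercise, not a cataloging one.

```python
import math
from collections import Counter

# Toy corpus: each "page" is one snippet of text. Purely illustrative -
# it sketches the kind of statistic (TF-IDF) that lets one page stand out
# from billions, not Google's actual ranking.
pages = [
    "the ottomans campaigned against the habsburgs on the danube",
    "statistical tables of grain shipments on the danube",
    "the habsburg court at vienna negotiated with the ottomans",
    "a primer on alphabetical catalog rules for librarians",
]

def tf_idf(term, page, corpus):
    """Term frequency in one page, weighted by how rare the term is overall."""
    words = page.split()
    tf = Counter(words)[term] / len(words)
    doc_freq = sum(1 for p in corpus if term in p.split())
    idf = math.log(len(corpus) / (1 + doc_freq)) + 1.0
    return tf * idf

query = ["ottomans", "habsburgs", "danube"]
scores = {i: sum(tf_idf(t, p, pages) for t in query) for i, p in enumerate(pages)}
for i, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"page {i}  score={score:.3f}  {pages[i][:48]}")
```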
OPACs and WorldCat privilege a world view where the measure of all things is the alphanumerically ordered list.
Google can crawl WorldCat, I should think. But Google really wants to crawl the texts, not the citations. If you want citations, do this: google WorldCat and search for your citation. Or google HOLLIS. If you want books opened to your search terms, use Google Books.
Structural problems emerge. Google cannot take 150 years to scan all texts. From 2004 to 2011, 12 million texts were scanned. In the "12 million index" any three search terms will "triangulate" text snippets. I not only get the book and the title page, which should tell me everything I need to know to gauge the standing of the book in the tradition, I also get the book opened to a "relevant" page with subsequent bookmarks marked on the scroll bar. So what just happened here? Normally I would have gotten a PhD in Near East History, I would have written a dissertation on Ottoman History, I would have consulted bibliographies and would have picked a series of books to consult researching some point. I would have taken one of the books, consulted the index, and I might have read at the book for a day or two taking notes. Eventually I might have picked one or two paragraphs for quotation in some descriptive analysis.
Google presents me, based simply on my query, with the book that I might have picked based on my experience in the field. It is possible that my query was a random action that did not result in a random response.
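The "triangulation" itself is not mysterious. A minimal sketch, under assumptions of my own (the books, page numbers and word lists are invented): an inverted index maps every word to the pages it occurs on, and the pages that survive the intersection of all three query terms are the ones opened as snippets. An index of 12 million books does the same thing, at scale and with far better ranking.

```python
from collections import defaultdict

# Invented example pages, keyed by (book title, page number).
pages = {
    ("Ottoman Warfare 1500-1700", 112): "the janissaries crossed the danube under the grand vizier",
    ("Ottoman Warfare 1500-1700", 340): "the sultan negotiated tribute from the frontier provinces",
    ("A History of the Levant", 87): "the grand vizier marched the janissaries toward the danube fortresses",
    ("A History of the Levant", 201): "trade on the danube declined after the long campaign",
}

# Build the inverted index: word -> set of pages it occurs on.
index = defaultdict(set)
for page_id, text in pages.items():
    for word in text.split():
        index[word].add(page_id)

def triangulate(*terms):
    """Return the pages on which every query term occurs."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

hits = triangulate("janissaries", "vizier", "danube")
for book, page in sorted(hits):
    print(f"{book}, p. {page}: ...{pages[(book, page)]}...")
```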
Let us say that I am a student of French History, duly degreed and habilitated, and I have bumped up against the Sultan and politics in the Levant periodically. Let us say I am working on the wars of Louis XIV and Marlborough and I need some quick information on Eugene of Savoy and his commutes between the south-eastern front and the western front. I know enough about the western front to realize that real deep engagement there would take some time. You say, take the time or be damned as a charlatan. Point taken. But what about the Khans of Crimea? What about Ottoman expeditions into Persia? What about Max of Bavaria? Clearly historians have to specialize, and forays into adjacent fields are dangerous. Yet they are also necessary. The question is not what is tolerated by the silverbacks of whatever gender who rule the areas of specialization; the question is: can an index of 12 million texts provide entry into fields for non-experts?
I cannot become expert in all traditions that tangentially touch my area of expertise. Even in areas where I know what I am doing there are surprises and serendipity and improvisation. I can, however, try queries and get some bits of relevant literature. Uncharitable souls have called this "sideways entry." [Nunberg] One has popped into the sacred chamber of some discipline without presenting credentials at the door and undergoing a two-month to two-year initiation. Is that bad? After all, I have a sacred chamber of my own where I am licensed to take tomes from the shelf. What if Google slips me some tomes under the table?
There are other problems, since the triangulation of snippets is not as precise as one might wish. Above I have sketched out a hypothetical based on an ideal universe in which perfect e-texts of printed books, some quite old and tightly bound and badly discolored, are created easily. This is not the case. In the Google Books project, time was important.
Workers needed to be trained to scan, database people needed to be set to the task of getting some handle on the vast amounts of data. Mistakes were made on all levels. Individual pages were not scanned or were scanned badly. Bad things were done on the metadata level, and bad things were done by the OCR algorithms, which often moved out of the 99.909 percent accuracy range.
From the Google perspective all this is regrettable and to be fixed, but not all that important. Although I am comfortable with OPACs, I tend to agree; in order to attain the critical mass where the index can begin to pay off, scanning has to be done on a large scale. Had academics been running the project since 2004, we might have a test database of a couple of hundred thousand books, and the metadata might still be a mess. Remember, Google indexes the websites of the world and is used to mucking about in some dingy alleys. For all the messy information spread all over the world, Google finds the hits.
The saving grace of Google Books is that it will give you a book open to a page of potential interest with more bookmarks a click away. Having the book on the screen you can go to HOLLIS and read the MARC record, if that floats your boat. Or you can try to find a review in JSTOR, or you can just read the thing, download it and print it if it is in the public domain - or go to ILL and have them get a copy by mail.
Nunberg is upset because of shoddy dating. Get over it - what matters is not a list of references (you know where to get that) - what matters is having the book on the screen. Some familiarity with standard OCR errors is desirable in the early phase: "start plus 7." You will find the same problems with JSTOR, and they are at "start plus 30." It is quite possible to regale humanists with JSTOR errors like the 10,000 instances of "6t6" (for "été") and references to Heinrich B6ll (Böll).
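That familiarity can even be scripted. A minimal sketch, assuming a hand-made table of common confusions - the table and the normalize function are my own illustration, not an actual JSTOR or Google Books correction pass:

```python
# Familiar OCR confusions of the "6t6" / "B6ll" class, mapped back by hand.
# Illustrative only; a real correction pass would be dictionary- and
# language-model-driven rather than a fixed lookup table.
OCR_FIXES = {
    "6t6": "été",    # French "été" with the accented e's read as sixes
    "B6ll": "Böll",  # Heinrich Böll with the umlaut read as a six
}

def normalize(text, fixes=OCR_FIXES):
    """Apply the known substitutions, longest patterns first."""
    for bad in sorted(fixes, key=len, reverse=True):
        text = text.replace(bad, fixes[bad])
    return text

print(normalize("Heinrich B6ll wrote about the 6t6 of 1947."))
# -> Heinrich Böll wrote about the été of 1947.
```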
There is some confusion with category labeling, granted, yet Google is making an index based on "content." If you must have lists of books based on subject categories affixed by catalogers, OPACs can give you that. But if you want to look at the book, go to the index.
The GBS metadata complaint is a silly quibble cribbed from similar articles that have been on blogs since 2006, news only in that the "Chronicle" has given it a catchy headline.
The fact is that universities have always lagged behind in technology. It was IBM that presented electronic methodologies for texts to universities in the 60's, 70's and 80's and it was tough slogging as early adopters were marginalized; that is certainly true in the humanities, even today. In the sciences, DEC and HP did fantastic creative work to present universities with computers and electronic devices. The contemporary mantra of the "necessary greed of corporations" is used to denigrate the contribution of the private sector in research. Without the private sector universities would still have disputations on the efficacy of dunking heretics in water.
Without pretending to go into a history of everything, I would just remind those of short memory, or those who have not had a chance to accumulate memories, that it was the Department of Defense, the DOD, that laid the groundwork for the web; it was the people who gave us the Bomb, Korea, Viet Nam, Grenada, Panama, "plausible deniability" as a concept [...] et al. who also gave us e-mail and ftp. The military-academic-industrial complex and its modern manifestations have contributed significantly to all manner of intellectual work, and not all of that work is to be tarred with the same brush. Some of this work is overtly beneficial. And more important, without the private sector, research would atrophy dramatically. Google supports on-campus research - results are showing up in the search window in links to maps, timelines and vocabulary plotting by centuries. I personally find it interesting that in-house research outpaces academic work, which huffs and puffs to catch up.
There is a ubiquitous skepticism today as the waves of public opinion wash over us every day, and polarization and spin have obscured all recognition of the "public good." Any honest work with texts today has to factor in electronic representations. Google's work in the last 7 years has been so monumental that it has taken some by surprise, along the lines of: "My goodness, I am completely behind the 8-ball." The arbiters of opinion, to mask a lack of actual experience with the topic, feel they have to balance pros and cons. Take a sober look. Do not rush things. Yet a library of 12 million books, delivered all over the world to anyone who can bring up a search window, does not have pros and cons. Taking a measured pace by settling on three-line snippets is not deliberate and sober; it is silly in the sense of operating without a clue.
The whole notion of "snippets" comes from a time long ago, before our exponentially growing mountain of texts became somewhat manageable through electronic representation. People who do not understand this should get out of the way and not insinuate their lack of understanding into the discussion.
At least larger snippets would let the library of 12 million books become more useful in the short run and have its functionality not be held completely hostage to legal wrangling. Perish the thought that the perspective of researchers adept at using electronic methodologies should insinuate itself into legal wrangling. Nor should innovative thinking throw the good ship "Chronicle of Higher Education," purveyor of opinion extraordinaire, off its even keel. That would be amateurish, pointless and self-indulgent, legally speaking, and suspect public policy, and also bad high-brow journalism. We have all the time in the world because our lives are good thanks to billable hours and a captive audience.
Judge Chin could have told us in March 2010 what we found out in March 2011, and the deal could have been renegotiated by now and the library open for business. Given that we live in a world where time is money, one should consider that an extra year of retainers and billable hours for the lawyers while the judge deliberated could probably have paid off all the authors in their weight in gold and scanned all books in Portuguese and Danish. After all, this is not about Google, as Judge Chin and the opponents seem to want us to think - it is about the books, getting books onto our screens. Prof. Grimmelmann has hit the buckler of truth not far off the hub when he says quite unlegalistically: "...it's scandalous to think that copyright owners might have a right to hold up the creation of good indices. They should be beating down Google's doors demanding to be indexed, and the sooner the better [interview with William McGeveran]." Fortunately the legal wrangling has not held up the indexing, thanks to the courageous actions of Google and the libraries - no need to scoff - Google and libraries around the world have built an index of 12 million books, not with timidity.
Courage to Act While Others Doze.
It is indeed a courageous act on the part of Google and the libraries to look past an antiquated notion of copying and to take a chance on its vision of an index of all words ever. Has anyone asked: "Why did the libraries, five at first, now in the hundreds, agree to partner with Google?" Why would they engage in what are - in the minds of some - clearly illegal activities? Libraries understand books. Libraries know their value, historically as well as commercially. No one can fool a librarian into granting that some history of something written in the 40's that sold 300 copies is precious intellectual property in the financial sense. Granted, it must be cited and looked at, and that as efficiently as possible. Let those with valuable intellectual property come forward and receive their rewards as the market dictates. Let the rest be easily accessible in a text base open to all.
"Copying" as the court and jealous rights holders see the concept revolves around unlicensed production processes competing with legitimate production creating "things" that can be sold. In that universe, only "things" in high demand become victim of "copying." For example, back when the Louis' were ruling France it made economic sense to typeset the works of Voltaire multiple times and to sell the pirated copies to meet the demand. In that world it was important to separate out the real printings from unauthorized printings. The reasons were not just economic but also concerned the integrity of the text - no need to go into detail on this. Laws were developed in the 18th c. to make sure there was a legal basis for intellectual property. This did not mean, of course, that piracy did not thrive.
Today we think of piracy less with books and more with music CDs, film DVDs, designer labels, software and trademark iron-on graphics for T-shirts. There is too little money in books; perhaps that will dawn on the print-on-demand bottom fishers and the assemble-print-sell industry living off the selfless work of wiki authors.
The original thoughts of the founding fathers to limit copyright so ideas in written form could eventually enter the public domain really make no sense in a world where it is important to protect markets in the billions from anyone who can draw a caricature of a duck. To have books dragged in the wake of cartoon characters speaks volumes about the station of books in our society.
Those who fight for commercial rights have been trained in areas of knowledge not to be found in dusty stacks. The knowledge of those wing-tipped paladins is "all-electronic," so specific sentences can be mobilized in a nanosecond to smite an opponent. All-electronic meets all-rhetorical to chart the future.
The unreflected and unquestioned analogy of Mickey and Donald to the copyright of books should be an embarrassment to the legal profession for not separating different areas. The reason it still is not, and was not when recorded music and cartoon characters were lobbied through Congress, is that, at the time, books were irrelevant. Procedural excuses for a massive cock-up by Congress, egged on by the courts, are of no interest. There were only a few of us actually scanning books in the 90's, and the emphasis of scanning was concentrated on putting the artistic canon - poetry, drama, epic as well as discursive text from the Greeks to the 19th c. - online. No legal issues with pre-1923 publications.
One could argue that authors opened the door to scanning when they did not challenge the entries of their books in the card catalog. As soon as a book is listed in a catalog, or several catalogs and bibliographies, there is a tacit agreement that the content of the book will be copied in many forms:
1. The photographic memory - a small percentage of humans have the capacity to remember and reproduce every word they ever read.
2. There are many people with a well-developed memory capacity who will be able to remember specific formulations in detail.
3. The excerpters have copied large swaths of text - throughout history - before chemical, mechanical or electronic copying. There has been a lively trade in copying books by hand.
Again, generally, it did not make much sense to copy something one could buy, but for important out-of-print books, a copy is quite valuable.
The world of libraries, the world of ranges of books and shelves upon shelves, is a profoundly non-commercial world. It is also a world that is undergoing profound change. That change has been ongoing for the last 40 years, driven by electronic technologies.
When I was an undergraduate in the 1960's (BA German, 1968, UNC) there was a set of large green folios - 750 volumes, 50 yards end-to-end - the catalog of pre-1956 books, 11 million entries. At the time it was the resource of last resort. In the hierarchy of bibliographies in various fields, the "Union Catalog" provided a sort of umbrella or backstop. It was a list of every book ever published. Yet the act of printing a catalog brings well-known problems. At some point yearly updates need to be consolidated. And most important, the world does not stop. The catalog to list all books before 1956 was designed to replace previous Library of Congress catalogs. It started printing five volumes a month in 1968 and was finished in 1981. The OCLC electronic catalog, ironically, was started in 1971 and was fully functional in 1981.
The 60's and 70's were the last decades when bibliographic research was still a major area of technical studies for "all" humanists. The consolidation of library catalogs into OCLC led to one-stop shopping. Once the name of the author has been determined, a simple electronic query, 2 seconds of typing, would locate the books. The citation can be copied electronically, reducing one of the chief sources of error in humanistic research - bad hand copies from printed catalogs.
It is instructive to bring the printed BIP, the "Books in Print" catalog, into the discussion. Every bookstore had at least one copy of the more than foot-thick volume. It was published each year with monthly updates. At bookstores, short lines would form as customers waited to consult the catalog. Then would come the search through the update pamphlets. Time moved more slowly back then. Books would take years to publish and a month here or there did not matter much. It was acceptable to wait till the book showed up on the shelf labeled "recent acquisitions" next to the circulation desk. I am afraid that the laws which govern publishing are still stuck securely in that previous generation, a short 30 years ago.
In the 70's and early 80's, the printed catalogs became a relic of a time gone by. Now one would simply log on and execute queries. Software would add citations to bibliographic software on the local PC and print nicely formatted lists on the local printer; and then, off to the stacks. The same became true for new books as bookstores hooked into the network. The Guide to Periodical Literature met a similar fate. Thus, phase I of digitization was lists of books and articles - alphabetical at first - to be found in the stacks.
Index on a Diet of Snippets
The index is hamstrung, since there can be no significant feedback of mouse clicks from snippets, not to mention the waste of everyone's time in even clicking on the snippet view.
There are no real competing interests that might experience unfairness as I view the landscape; the library of 12 million books is standing in the cloud, indexed, ready to go. Millions of dollars stand ready to pay authors what amounts to a pittance, to be sure, for copyrights that are actually worth much less than a pittance. Should some out-of-print diamond in the rough assert itself in the market of electronic library books, more than a pittance in fees could accrue. However, the books would have been worth less than a pittance minus the cost of cataloging, storing, lending, re-shelving and re-binding forever, had it not been for the investment of Google and the world's great libraries in starting to scan the books. Libraries have been maintaining the books on their shelves and lending to researchers at not inconsiderable cost as a favor to authors, with little thanks forthcoming. Where would the rights of the authors be, had generations of researchers since 1923 not read and quoted their books? Some authors have at least garnered a little fame, if not fortune. They should have become copyright lawyers and not written books.
The arrangement between Google and the libraries, which could not have been achieved without the motivated and intelligent cooperation of the libraries, does not represent a power grab by Google, but a sensible arrangement between a corporation not yet five years old at inception and libraries who need technological options to deal with serious space problems and general problems of logistics. Libraries generally understand that they have no future without a genuine embrace of electronic technologies. The digitization of out-of-print and out-of-copyright books is not just a no-brainer, it is the only option for libraries in the face of the continuing flood of in-print printed matter pouring in at the loading dock; this is a fact, as every librarian in charge of a significant institution will attest.
Judge Chin seems to have missed this point somewhere between his deference to Rule 23 and 480 legal briefs spinning in every direction save the one that makes sense for the logistics of library books. Neither at the time nor now is there anyone even remotely able to match Google's investment, to apply Google's energy to drive it through, or to accomplish the sophisticated indexing and interface-building work. Seven years later, there are any number of Johnny-come-latelies who would like to propose money-raising schemes aimed at foundations or the government. Wise up - the work has been going on for seven years and is continuing into the next round.
The only potential competing "interest" is the unrealistic dream of a Harvard librarian, an expert on the book, ok, the chief expert on the book, and the chief librarian of Harvard, if you will, who has maneuvered himself into a rhetorical box where he now stands, paradoxically, in the way of the greatest library ever seen. As a historian, that must give him great joy. Prof. Darnton is pitting his unrealistic dream against many thousands of hours of work already done all over the world and ongoing as we speak. More illegitimate competing interests come from parties who either abandoned scanning or never had any intention to do anything in this area except make sure any effort fails.
Actually it does surprise me that a judge, educated at Princeton, surrounded by his electronic sources of the law, can delay the greatest electronic text project for the rest of us without feeling the need to hurry along a bit, or at least to apologize for dawdling. His work may go down as the greatest instance of procrastination in history since "À la recherche du temps perdu." Indications are he has pulled a rather emaciated rabbit out of the hat at 3 seconds till the clock struck 12.
I fully expect the legal logic used by Judge Chin to arrive at the denial to whiz past me at the speed of light, not pausing to illuminate; although, thanks to Prof. Grimmelmann, I get the sense of having grasped the essentials, which really only fueled my outrage at the procedures. I admire and appreciate Prof. Grimmelmann's calm and sober analysis, spiked with pride that some of the arguments of his students, arising out of his seminars, have been cited in the ruling. I cannot really say for certain if Prof. Grimmelmann's obvious disappointment with the ruling is due to the outcome or if it is limited to the obvious lack of depth in treating the relevant points of law in a thin 48 half-pages. His calm professionalism is especially reassuring when compared to the slightly hysterical semi-legalistic spin of S. Vaidhyanathan and others. [See: Peter Batke, Google Book Search and its Critics, Lulu.com, 2010, 305 pp. and http:humancomp.orgxxx]
Yet it seems that quiet confidence and thorough legal research are no match for expressions of "concern" when they come from "A few of the author objectors, who would like to see Google razed to the ground and Mountain View sowed with salt, ... [Grimmelmann, Laboratorium, March 2011]." The same goes for the international objectors who have hired Philadelphia lawyers to boilerplate them a potpourri of copyright legalese, and who are motivated by very narrow national [read: unashamedly nationalistic] interests. They may well be forced by Google's innovations to consolidate their differences into a generally anti-Google, anti-American agenda. That would represent some progress for Europe. It would have been illuminating for Judge Chin to notice which nations were present and which nations were absent.
The notion that it is no longer about Europe, the Europeana, the BNF, the US or Japan, but about a comprehensive, all-inclusive index of all languages of all nations has eluded the European legal eagles squatting on their non-revenue-producing rights.
I do live in confusion when faced with legal logic, but confusion is normal for that arena when there is another logic, real-world logic, which is clear as a bell. Prof. Grimmelmann sounds that bell periodically, especially when he emphasizes the importance of indexing for knowledge retrieval. Judge Chin's legal blinders would not allow him to peek into the world to appreciate the problems of delivering books to readers or the opportunities offered by indexing technologies, beyond some academic citations from early in the litigation and a smattering of more recent law review work. This is, after all, a legal proceeding.
One source of my confusion is the importance Judge Chin seems to attach to the 500 expressions of concern by "interested parties" - the majority, quoth the judge, came down against the settlement - that have crossed his desk. I should think that these expressions of "concern" are neither a complete, an authoritative, nor a statistically significant sample of the situation on the ground. There has been quite a bit of spin in the press against the settlement. There has been a coordinated mobilization in universities against the settlement - strange and counterintuitive, since they - as text researchers - would be the chief beneficiaries of a comprehensive index of 12 million books. There has been a mobilization of authors and publishers against the settlement, pointing to a schism within the group: the main group of authors and publishers having litigated and settled and re-settled over years of expensive and demanding legal work since 2006, and vociferous dissenters who have submitted five-page briefs or sent impassioned letters in January 2010.
It is not surprising that a class as diverse as "authors" should have all sorts of extreme cases and emphatic opponents to whatever is proposed by the professional association. One can expect authors to be verbal. These are not Wal-Mart clerks, diverse as that class may be.
Of course there are also the king's-ransom-an-hour copyright professionals who are always ready to repurpose arguments from their last successful campaign. Where is the logic in following the "concern" of opponents who have so much outrage and so little skin in the game?
In addition, many of the objections quoted by Judge Chin do not make legal arguments, but are quite ordinary rhetorical whines about trodden-on rights, the inconvenience of opting out, or the fear of rights to be trodden on in the future.
Does Judge Chin or the USDC expect those of us who own at least one or two ISBN numbers and like the thought of indexing 12 million books to float a letter across his desk? Should every chief librarian save Robert Darnton have sent such a letter? Is it too late to write tonight? I suspect many would have written - certainly more than 500 - had we known that would be a deciding factor. Is he running some sort of all-star-team fan voting, as when the people of Baltimore were mobilized for years to make sure Cal Ripken got to play shortstop?
Now I feel foolish for having trusted that legal principles such as balancing an "overwhelming universal benefit" against an "active protection of authors from material harm" would triumph over arguments holding fast to antiquated notions of protecting the illusion of value of artifacts that lost their value long ago. The out-of-print books were affixed with complicated numbers and put into the stacks of libraries where only the initiated experts could find them, because they were only of "academic" interest, of interest to students and researchers who could not be expected to pay for assimilating and citing sources. There is no consideration of "benefit to all" as a legal principle with heft - only a few lines in a peripheral piece of background. Let us consider the absurd notion that the library of 12 million books would actually be done away with, to be replaced by a new effort by a cooperative of Darnton and Vaidhyanathan which would start scanning in 2012 with ten years to come up to 12 million - and good luck with that. How many authors could have received benefit in the years till 2022 and did not, nor ever will, because they died?
There is only one outfit in town that can produce 12 million indexed books. The ladies and gentlemen of the US Judiciary should get their act together and find some way to let the doors be opened. The actual parties in the settlement, with plenty of input from dissenters and with plenty of changes to address concerns, have invested millions in good-faith efforts to make this happen, and are denied by three half-pages on the impossibility of releasing claims of no actual value because of Rule 23(b). Ladies and gentlemen - the law can be bent - the law can be changed - now is the time to start. The citizens of Paris who stormed the Bastille to free the prisoners did not petition the Director General to release the prisoners on a per-case basis. That event is considered the beginning of democracy in France. Allons enfants!
My position will be hopeful: that the consequences of the ruling will not be as dire as the opponents might want, or as dire as the supporters and the queued-up users might fear. I do fear that clever legal tactics, cagey judges toeing the legal minimum line, and the built-in lethargy of the proceedings will delay things for another decade.
Overarching this essay will be outrage at the poverty of legal procedure and reasoning in the face of technological innovation and in the face of a technological project that will be the next wonder of the world. It seems that in the legal mind of the sitting judge the "concern" about Google outweighs the quite compelling interest of people all over the world to become users of the greatest library the planet has ever known. Let us try to see where that "concern" comes from, what the legal weight of "concern" can be, and what the "concern" is costing us.
