Wednesday, February 20, 2013

The Aaron Swartz Notes


Zentralfriedhof Vienna
The other day I was at the Starbucks close to North Hall, and I was sharing a table with some budding lawyer types; we were all working on our laptops. The conversation of the law students meandered to Aaron Swartz. I tried not to listen, but I did get some smatterings. It seems that one of the law students had a friend who had a sister whose roommate used to hang with the crowd at the Safra Center. She dated a friend of Aaron's who would talk about Aaron's work from time to time. Somehow snippets passed up the chain to one of my table mates. In general, from what I could glean from the conversation, there was no great sympathy for web pirates, WikiLeaks, music ripping or otherwise. I pricked up my ears when one said: "Barry said that Lisa heard that Aaron was trying to fix JSTOR." It seems the theory was that he had designed a system for delivering metadata and content that would be far more effective than what JSTOR is doing now. He had tested his code and wanted to try a full run with all the data. So Lisa had heard from Gwendolyn, if that was her real name.

The death of Aaron Swartz was a great shock. I had not heard of Aaron Swartz until his most public death. I personally have not followed the whole "pirate" movement in computing. As I am approaching 70 and feeling comfortable in my environment, I am willing to let young people do the heavy lifting to get the world to conform to their vision. I did that myself in the 60's and 70's; now I am concerned with the areas where I have specific expertise. I no longer expect the world to change to suit my view. I can only contribute what I know and hope it will not be covered with dust completely.

This does not mean that I will not engage in polemic. It has been my discovery that explaining the world as I understand it is merely an elaborate self-deception. I go with Peirce, who held a hundred years ago, loosely paraphrased, that there is no perspective from which to understand the whole, the unfathomable aggregate. The certainty with which I make judgments is not based on understanding of even a significant part of the aggregate, but merely based on the training I received in obsolete methodologies and the bits and pieces of new insight I was able to add along the way.

When I learned of the circumstances of Aaron Swartz's death, I could only shake my head at the damage control of the various agencies involved. The DOJ maintained the hard line: a crime is a crime, the law is the law, and so it is. The institution JSTOR, the plaintiff, the victim of the crime, immediately pulled in the claws and expressed genuine concern.

Yet, the fact remains, a young life had ended. The cause is obscure. The nature of the crime is obscure. Had Aaron hacked into BankBoston and withdrawn a couple of thousand dollars, the nature of the crime would be plain, off to court and off to jail. Had Aaron hacked into Sony and downloaded 100 gig of music, the crime would also be clear although there would be many who would challenge the legitimacy of the recent law. Yet, why would someone download 1,500 academic journals? Perhaps they could be posted on a file sharing program as the prosecutors maintain. But that does not seem realistic. Those sites can be shut down and would have a short life at best. In addition, the JSTOR users are mostly working under institutional license agreements and are not likely to get their articles from p2p sites. MIT is not going to cancel its contract with JSTOR, although inquiring minds would like to know just how much MIT spends on JSTOR.

So the question remains: why would Aaron commit a crime that a prosecutor can run up to 20 counts of this and that, totaling 30 years?

So I did the unpardonable, unpardonable in my world: I intruded into the conversation. "So what did Aaron want to fix on JSTOR?" I asked. It seems that Lisa had taken some notes and shown them to Barry, who knew one of the budding lawyers. The main fix was a "big-data" approach to JSTOR. The first problem was that all the worthwhile data is in graphics: PDF files. The text under the graphics is not up to big-data standards, loose as they are. Some of the high-profile English language journals are acceptable, but the non-English is bad beyond belief. The friend Barry had said he had read from Lisa's notes: "Google '6t6 JSTOR' and you will see." The lawyer said he didn't really get what Barry meant; JSTOR was not on his radar except as an example of bad plea bargaining. Incidentally, I wish more lawyers used JSTOR more.

So I intruded further. I can show you what Barry meant. I turned my computer around and Googled "6t6 jstor." Up came 30,000 pages of JSTOR references to something called "6t6." Brows wrinkled. Here is "Biichner," only 3,000. Try "B6ll" and you get 6,000 references to the German author Heinrich Böll. I smiled as seraphically as possible through my thick white beard. It's kind of a joke, well, I said, a private joke between me, myself and I. To get a foreign term in your JSTOR query, you must know how an OCR program (Google this!) that is set to English only distorts French and German diacritics. "6t6" in French refers to été, with two e-acute accents, and can mean the noun summer as well as the past participle of to be. You know the French.
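For those who want the workaround spelled out, here is a minimal sketch in Python. It assumes only the three substitutions visible in the examples above (é and ö come out as "6"; ü comes out as "ii", which rests on my reading "Biichner" as a garbled "Büchner"). The names OCR_MUTATIONS, ocr_variant and query_terms are made up for illustration; this is not how JSTOR or any OCR package works internally, just a way to produce the mutant spelling to search for.

```python
import unicodedata

# Substitutions taken from the three examples in the text; any other
# diacritic would need its own observed mutation before being added here.
OCR_MUTATIONS = {"é": "6", "ö": "6", "ü": "ii"}


def ocr_variant(term):
    """Return the term as an English-only OCR pass might have garbled it."""
    # Compose characters first so "e" + combining accent is treated as "é".
    composed = unicodedata.normalize("NFC", term)
    return "".join(OCR_MUTATIONS.get(ch, ch) for ch in composed)


def query_terms(term):
    """Return the original term, plus the garbled variant if it differs."""
    variant = ocr_variant(term)
    return [term] if variant == term else [term, variant]


if __name__ == "__main__":
    for word in ("été", "Büchner", "Böll", "Marx"):
        print(word, "->", query_terms(word))
```

Searching for both spellings returned by query_terms is the whole trick: the correct form catches whatever was keyed properly, and the mutant form catches the pages the OCR mangled.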

I can see I am overplaying my hand with the lawyers, so I ask: "What was the big-data angle Aaron had in mind?" Well, they were not sure, but Barry had seen a diagram of a double umbrella and lots of squiggles. One set of lines was going from a footnote to a text in some other article and one line was going from a text paragraph to all the articles that quoted that text. He said that Barry said he heard that Aaron was planning to link in text bases of primary materials. So you could ask for all the articles in JSTOR that quoted the first line of Aristotle's Metaphysics, that thing about all humans wanting knowledge. Or any line anywhere in Aristotle. So you could see what of Aristotle was actually quoted through time. Or you could find all the Hegelians who quoted Marx in the first 3 pages of their article. So basically, he thought it was reprehensible that JSTOR was walling off the data and delivering only graphics of 10 to 20 pages or sometimes a bit more, in addition to forcing users to do weird queries to get even just that, and not to mention, according to Barry, gouging people without an institutional login.
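To make the idea concrete, here is a rough sketch in Python of what "all the articles that quote a given line, through time" could look like. It is my own guess at the simplest possible version, not Aaron's design, which none of us saw. It assumes clean electronic text for both the primary source and the articles, and it only finds verbatim quotations; the function names and the data layout (a line id mapped to its text, an article id mapped to a year and a body) are hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not hide a quote."""
    return " ".join(text.lower().split())


def build_quote_index(primary_lines, articles):
    """Map each primary-text line id to the articles quoting it, oldest first.

    primary_lines: {line_id: line_text}
    articles:      {article_id: (year, body_text)}
    """
    index = defaultdict(list)
    for article_id, (year, body) in articles.items():
        body_norm = normalize(body)
        for line_id, line in primary_lines.items():
            # Verbatim match only; paraphrase and translation are not caught.
            if normalize(line) in body_norm:
                index[line_id].append((year, article_id))
    for hits in index.values():
        hits.sort()  # chronological order: (year, article_id)
    return dict(index)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    primary = {"metaphysics-1": "All men by nature desire to know"}
    articles = {
        "article-a": (1987, "He opens with: all men by nature desire to know."),
        "article-b": (1921, "As the Philosopher has it, All men by nature\ndesire to know."),
        "article-c": (1950, "Nothing of Aristotle is quoted here."),
    }
    print(build_quote_index(primary, articles))
```

A real version would need fuzzy matching, translations and an index that does not compare every line against every article, but even this toy shows why the text under the graphics has to be right before any of it can work.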

I just said: "Wow!" and started rubbing my eyes.

The law students unplugged and ran off to class, and I had grist for the mill. Did the last five minutes really happen?

I had found out about the diacritics problem a couple of years ago. I had put out some queries on the web. I was sure that the problem was on my side. All the people who responded to my queries either agreed the problem was on my side or did not understand the question. Surely, the problem could not be on the JSTOR side. So I systematically queried JSTOR for mutant diacritics and could get consistent hits on all sorts of computers. So really, no problem, on my level of use, no dysfunction. I can work around the errors and not lose any hits.

But there is a problem. It goes to the root of the stewardship of electronic data in our time. Data is walled off under the pretense of managing delivery and ensuring financial sustainability. The people behind the wall have a vested interest in the status quo and in the cash flow; their time of innovation, in the case of JSTOR, dates to the early 90's.

In its effort to deflect blame from itself in the tragic death of Aaron, JSTOR clearly laid the blame on the justice system or anywhere else for that matter:

"We have had inquiries about JSTOR’s view of this sad event given the charges against Aaron and the trial scheduled for April. The case is one that we ourselves had regretted being drawn into from the outset, since JSTOR’s mission is to foster widespread access to the world’s body of scholarly knowledge. At the same time, as one of the largest archives of scholarly literature in the world, we must be careful stewards of the information entrusted to us by the owners and creators of that content. To that end, Aaron returned the data he had in his possession and JSTOR settled any civil claims we might have had against him in June 2011." [JSTOR About]

The statement is balanced and conciliatory, and I hope I am not in violation for posting the four sentences. Should that be the case, I will concentrate on only 19 words that disturb me. What incenses me is the blatant myopia: JSTOR thinks it is doing a good job. Selling itself to ITHAKA is no doubt a good job, well done. Creating one of the most obscure interfaces for non-institutional scholars is no doubt also a good job. To quote from the text above: "we must be careful stewards of the information entrusted to us by the owners and creators of that content."

I realize that at my age an even disposition is important for survival. The "owners and creators" of some 150 years of scholarship have entrusted JSTOR to be careful stewards. Yet Ira Fuchs did not know enough about OCR in 1993 to make sure his staff used software that would recognize a French acute accent. Nobody has noticed for the last 20 years. And the fix would be so easy. This point is not about ideologies or about level of free access, or pre-1923 materials sold for cash, it is about the competence to claim the title steward. Careful steward, maybe, vigilant steward, perhaps, competent steward, not really, steward with vision, not in the least.

It was the "careful stewardship" that caused the alarm bells to ring at JSTOR, belatedly, but ring they did. The "careful steward" was required to detect breaches of security, even by a legitimate user, even of the public domain materials they had hoarded. The withdrawal of their suit and the withdrawal of the Massachusetts DA are to the credit of JSTOR, although from a PR perspective it was a no-brainer.

To claim "careful stewardship" of 30,000 6t6's in their database is hardly justified. It is laughable. I will grant the institution the title of careful stewardship of its licenses and its cash flow. Improving the product is not important for cash cows in the non-profit sector. Yet there is hope that this sad event will bring about needed change. When I was at Princeton, there were a few of us working for small salaries but with an open agenda to innovate. I fear the "non-profit" front ends for academic computing of the elite universities may not have disruptive innovation in their playbook. This will become obvious even to the MBAs ensconced in the management of universities or to their masters. They do have the full force of the law riding shotgun all the way to the bank. Oh how I would have loved to have seen Aaron's actual plans for the JSTOR data. Alas, his secrets are safe with Barry, and Lisa's notes are long lost in some wastepaper basket in Cambridge.

I could imagine he would have re-OCR'd all the pages, using the latest software that can even deal with German Gothic script. I would have loved to have seen his query screens and his data flow. But I would really like to have seen his links from footnotes to text and the typology of linked nodes that could have emerged from that. I guess I could design an interface that traces text quotes through time, from the Greeks to the 200-year knowledge explosion from the early 19th century to the early 21st century.

Every generation gets some of the old stuff and some of the new stuff. My generation, or my cohort, model year 1947, got computers. I was aware of computers in my first year of college, 1964, but I did not start learning to program until 1975, while I was working on my dissertation. Some said I was just avoiding the writing. Frankly, I did not care; I was excited. I was one of the first at my school to submit a dissertation printed on a computer, a room full of computer. I was the first to take over our seminar table to burst the continuous forms for the six copies of my 300-page dissertation coming off the high-speed line printer.

But enough deep background and lingering in the past. How did I get from stacks of form feed to reading about Aaron Swartz more than 30 years later? My career, as it was, bridged the gap between old, venerable humanistic scholarship which revolved around dusty library stacks and reams of handwritten notes on one side and shiny new networked computers and reams of computer printed notes on the other. It started with my MA thesis done on an IBM mag-card typewriter (feel free to Google unfamiliar concepts) and found me in an office at Princeton in the 1990's working on French and Hebrew manuscripts preparing network delivery for parchment and papyrus written up to a thousand years ago.

Things were moving fast; it was hard to keep up with all the developments in hardware and software, as well as understanding the new conceptual landscape of databases, query languages, and programming options, to name just the biggies that preoccupied me and took all the brain power I could divert from my middle-aged hormonally driven body. My fall from understanding, on the hands-on level, pretty much what was going on in computing and lecturing on a wide range of topics, to knowing only about a small stake bounded by Access and SQL on one side and Perl and PHP on the other was quick, abrupt and precipitous. My work mutated from polymorphism that adapted to whatever technological situation presented, to refashioning print-based scholarship and primary sources for electronic devices. My training in pre-computer humanities and my apprenticeship in computer technology starting in the 1970s prepared me for this new role. There was no lack of work and there were more and more of us doing this kind of work.

The national research library, of which Firestone at Princeton is a representative example, is a magnificent intellectual edifice, flaws and all, built by generations since Carnegie and others poured millions into library structures nationwide. As we saw our task at the end of the 20th century: every printed page in Firestone would eventually have to be in electronic form; time stamp: 1994, almost 20 years ago.

JSTOR was one of the developments launched while I was starting work at Princeton. I was not involved, having become a specialist in a small but deep specialization; I wish they had asked me about OCR. But I was cheering loudly from the sidelines. I did have a problem with the conception. Before coming to Princeton I had been involved, peripherally, with the effort to digitize the journals of the JHU Press. My interest had always been, ever since I had learned programming, in indexing and searching electronic texts. That is where I saw the chance for humanity and for humanists to get an edge on the tyranny of texts, over our inadequate short-term memory and the unreliable long-term memory.

I was fairly horrified that the JHU people saw the computer screen as a convenient place for typesetting by other means. Of course they were merely following a large and influential group of humanist techies who were interested in typesetting for the screen and only in typesetting for the screen (whatever else they may have professed). This track is still active today and subsumes everything under markup in its various forms. [NOTE: It reigns supreme to the point that The Folger Library is publishing an XML version of its print edition. The applause is great for reasons not entirely obvious to me.] Of course, my contention, accompanied with extreme hand waving, was that scholars wanted to load secondary materials into their personal indexing engines. That was greeted with complete and baffled non-understanding.

Alas, JSTOR had the same approach. They would deliver graphic images of articles, i.e. the individual pages from journals; their search technology, however, would be based on optical character recognition, i.e. the electronic text of the article being displayed graphically. In the 90's of the previous century, I had no real problem with that. I was not a heavy user of JSTOR and I was as glad as anyone to get around the photocopier. I assumed work would continue as work continued on my projects.

Twenty years later we are in the age of big data. MIT is one of the hot spots in big data. For example, at a lab at MIT, they have wired a small apartment with video and sound, every room, every hallway, every bathroom, in order to track the language development of a baby. Google: Deb Roy. The child is five now. That is 1,825 days and up to 90,000 hours of video. The goal was to capture the genesis of individual words produced by the child and to track the influence of the adults on the child in a normal family setting. The analysis and the tools created to analyze that data are way beyond spectacular. The example of the path toward the child saying "water" for the first time made me a complete believer. This is the Skinner box of our day; everybody is in the box and the cameras are running.

Ira Fuchs, the chief scientist of JSTOR in the 1990's, was not really conceiving big data back then. He was concerned with library logistics: can we get the journal articles to the office of a researcher electronically? The answer is yes; the tool is PDF. It is a thoroughly 20th-century answer. The chief goal was to become self-sustaining with some six-figure salaries for managers along the way.

In 2005, Google started to scan library books and to deliver the books and the electronic texts as well (to some degree, at least). If you read the preceding blogs you will see the state of that project. The problem is, to reprise, that the copyright legislation of the last 20 years was not designed to deal with academic scholarship. The District Attorney is not allowed by law to make subtle distinctions. Yet no one is willing to step up to carve out exemptions for 100 years of articles on Shakespeare written by authors long dead. Perhaps Google will create the big data model for humanities through the ages and lubricate it through Congress and box it through the courts.

Yet, JSTOR should be the laboratory. Re-OCR the pages, start mapping the links based on the generations of citation practices, map in the public domain texts already available and let us start tracing the paths of ideas from the past into the present. Alas, the squatter's privileges of institutions like JSTOR make it more likely we shall know more about the language acquisition of a young boy than about the transmission of the ideas of our artists, philosophers and scholars through the ages. That latter task will be done in graphic images in bundles not greater than thirty until further notice.