Like most people who decide to make a tool, I have a problem. I lose things. And I worry about losing things.
For years my solution to this was piles. This pile is taxes. That one is a task (renew my passport). The ones on my desk, in front of me, are meant to be paid attention to now (though they seldom are).
I’ve tried to apply the same solution to my (seemingly continuous) work and play on computers, and it’s kind of worked, but I can already see the stress on the system. I don’t have a good “this is in front of me” pile in the first place, the connections between the piles are more than I can deal with spatially, and as the documents age their relations become less clear to me. Worse, particularly at work, massive amounts of the data I have are just inherited–I don’t really know what they are; they were organized by someone(s) else in ways that I don’t understand, often with lots of duplication, often with seemingly-related things nowhere near each other and with unclear names. Crazily, at work I doubled down: after my most recent promotion I have a lot of responsibility for a lab, and I got curious as to what was inside. There are 3.5″ floppy disks, ZIP disks, Jaz drives, old laptops, old desktops, HUGE amounts of printed information, and so much of it we–I–don’t know what it is in the first place, or whether it’s useful.
At home, the scanned-document problem is there too, and it’s where I first started poking around with how to handle all this. I have some vague ideas of how you can do things like keyword extraction on text (finding, for example, words or concepts that are more common in some document than in your general corpus), but I immediately saw a problem: OCR, particularly on scanned documents, just isn’t as good as I expect it to be. Without supervision I don’t particularly trust it, and beyond a simple spelling and grammar check on the OCR result, I didn’t see myself having time to curate everything.
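That “more common in this document than in your general corpus” idea has a textbook name, TF-IDF, and it’s small enough to sketch. A minimal example, assuming a recent scikit-learn and some made-up stand-in documents:

```python
# Corpus-relative keyword extraction via TF-IDF: terms that are frequent
# in one document but rare across the rest of the corpus score highest.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "renew passport at the post office before travel",
    "tax forms and receipts for the tax year",
    "lab inventory: floppy disks, zip disks, old laptops",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

# For each document, show its most distinctive terms.
for doc_index, doc in enumerate(corpus):
    scores = tfidf[doc_index].toarray().ravel()
    top = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)
    print(doc, "->", [term for term, score in top[:3] if score > 0])
```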
There I chanced on the idea: what if I ran a bunch of OCR engines on the same source? What if I could not only automatically run many OCR engines on it now, but later notice that I’d grabbed an OCR update and run that too? And automatically re-analyze the content–not just with spell- and grammar-check, but with whatever text analyzers I can come up with, learn about, or find? Could I diff the results from those? Could I meta-analyze this thing’s pseudo-understanding of some document as it changes over time?
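I never got as far as code back then, but the kernel of the idea is small: run every engine you have over the same source, keep each result, and diff them. A rough sketch, with pytesseract and easyocr standing in for “many OCR engines” and a hypothetical scan.png:

```python
# Run two OCR engines over the same scan and diff the results.
# Assumes pytesseract (plus the tesseract binary) and easyocr are
# installed; "scan.png" is a hypothetical input file.
import difflib

import easyocr
import pytesseract
from PIL import Image

def ocr_results(path):
    """Return {engine_name: extracted_text} for each engine we can run."""
    results = {"tesseract": pytesseract.image_to_string(Image.open(path))}
    reader = easyocr.Reader(["en"])  # downloads/loads models on first use
    results["easyocr"] = "\n".join(reader.readtext(path, detail=0))
    return results

results = ocr_results("scan.png")
diff = difflib.unified_diff(
    results["tesseract"].splitlines(),
    results["easyocr"].splitlines(),
    fromfile="tesseract",
    tofile="easyocr",
    lineterm="",
)
print("\n".join(diff))
```

Re-running something like this whenever an engine updates, and archiving each run’s output, is what would make the “as it changes over time” meta-analysis possible.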
Could I do the same for images? Video? Code? Build logs? (At work, some products I’ve worked on output something like 12 MB of build log. It’s insane.)
The architecture of the thing seemed clear:
- A database, started with a file index. First passes give you metadata–file types (I still wonder whether some of the more exotic ones are easy to handle; JPGs and ZIPs can be concatenated, for example, because of their header rules), creation and modification times, size, maybe an overall hash. Location/URL may well be important too, so keep that. (There’s a minimal sketch of this pass after the list.)
- Then tools to run on these–string extraction, OCR, color histograms. Grabbing “understandable” abstractions from the raw data.
- Then tools to run on the results those produce–spelling check, keyword extraction, named entity recognition, heuristics, innumerable kinds of distance/difference metrics.
- All of it run automatically on new files, or as new tools are added. Maybe even prioritizing files that are less well analyzed and that a new tool might apply to.
- And then visualization–a field I love but little understand. You’ve generated a lot of data about your data; now what do you do with it?
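None of this existed yet, but the first bullet is simple enough to sketch with nothing beyond the Python standard library (the table layout and the root path here are my own assumptions, not anything from my notes):

```python
# First pass of the file index: walk a tree and record per-file metadata
# (type, size, timestamps, hash) in SQLite. Schema is a made-up example.
import hashlib
import mimetypes
import os
import sqlite3

db = sqlite3.connect("file_index.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS files (
           path     TEXT PRIMARY KEY,
           mime     TEXT,
           size     INTEGER,
           created  REAL,
           modified REAL,
           sha256   TEXT
       )"""
)

def index_tree(root):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()  # chunk this for big files
            db.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?, ?)",
                (
                    path,
                    mimetypes.guess_type(path)[0],  # extension-based; real typing needs magic bytes
                    info.st_size,
                    info.st_ctime,
                    info.st_mtime,
                    digest,
                ),
            )
    db.commit()

index_tree(os.path.expanduser("~/scans"))  # hypothetical root directory
```

The later passes (OCR, string extraction, the analyzers-on-analyzers) would then key off the paths and hashes recorded here, which is also what would let the system notice when a new tool hasn’t yet touched an old file.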
For years it wasn’t clear to me whether it should be an application or something like a virtual appliance. I was also certain that things like this had to exist: Google search must do some of it, the Internet Archive must do some of it, the NSA almost certainly does something like it on some unimaginable scale. Computer forensics software must do some of it. I just didn’t know what it was called. So I thought, and occasionally Googled, and thought some more. My notes show it pretty clearly:
- In February 2015 I make a ton of Natural Language Processing notes, with headings on Cluster Labeling, Document Clustering, Document Classification, Terminology Extraction, Concept Mining, N-grams, Automatic Content Extraction, Computer-assisted Reviewing, and Controlled Natural Language.
- In October 2015 I apparently hear about semantic wikis, connect this with some wiki work I’d been playing around with at the time, and start hoping that a Semantic Wiki or some other type of Semantic Content Management System can become a Knowledge Management System. I actually distinctly remember being interested at the time in the applications for Systems Engineering: having your content store do things like automatically verify that Part A and Part B both share interface C, except at scale. I got super excited when I learned that Semantic Wikis were in use at JSC for the EVA Wiki.
I also started to see that this is big business–I took a note about HP IDOL, with the comment “this is kind of interesting too, though it’s easy to make an awesome-sounding brochure and hard to make awesome-working software.”
- Around the same time I start looking for “Software Archaeology,” inspired by the idea of the “programmer-archaeologist” in Vernor Vinge’s novels A Fire Upon the Deep and A Deepness in the Sky. I don’t get anywhere with it, though, because Google thinks I want software tools written for archaeologists to use, not tools to study and discover the history of some code base. I do make a note about MarkDeep, though.
- Much later–probably December 2016, but Google Docs doesn’t have functionality equivalent to Git Blame so I can’t easily be sure (without doing some API stuff I don’t want to do right now)–I check out what makes the Internet Archive tick, and find its many GitHub repos. This satisfies me, but I don’t make many conceptual notes other than that Tesseract appears to be the current standard for free, open-source OCR.
- After that, the phrase “OCR for sound” pops into my head while I’m showering, and I stumble across Kaldi, speaker diarisation, and a Python thing called Bob that can recognize voices, though I think that’s mostly for authentication? Not really sure.
- Not long after that, I think up the forensics link, stumble through to the other side of that, and find out about Tika, Solr, XLTSearch, http://docfetcher.sourceforge.net/en/index.html, and https://github.com/mirkosertic/FXDesktopSearch, which is where I start wondering if “Desktop Search” is what I’ve been after all this time (though my nearby note is “…am I really thinking about a CMS? Hmm. Maybe not.”).
These give me a general impression: a lot of this sort of thing is being done in Java. A lot is being done as full-up VM appliances or other holistic solutions instead of as an application, though there are applications (and configurable ones!) out there.
After installing Python on my desktop computer (which I really use almost entirely as a media and gaming PC, so this is a departure, but it’s nice to have the big 27″ monitor covered with windows), I start to fiddle with a design of my own, then immediately decide not to. So I reach back out to Google, start looking for customizable things that already exist (beyond the ones I found above), and find Open Semantic Search.
That website pushes a lot of my keyword buttons, so I bite. I’m going to see how it handles some of the data I have at home, and whether I can make some of the customizations I’ve been thinking of. More to come on that.