What is the optimal way to express and store facts in plain text that are standalone, modular, self-documented, interoperable, reusable, recombinable, unambiguous, easy to write, easy to read and easy to data mine?
A book. :-) With an index and bibliography.
- Al Pasternak
But what if you want to view all the facts in all the books (and other documents) in the world as a single database that can be integrated, data mined and inferenced with perfect precision and no ambiguity or confusion? Current books and book indexes are not machine-readable by current natural language understanding algorithms, and probably won't be for several decades.
- Sean McBride
How many facts can a typical human being hold in his or her head at once or process per second? What if a device could access all the facts in the world simultaneously and connect all the dots for any situation instantly?
- Sean McBride
I really must get to the Baroque Cycle soon.
- Sean McBride
I'd always go with natural language. Even if all facts (or statements claimed to be facts, like most "facts" in your mind are) could be stored "unambiguously", the facts themselves would contradict each other (there's no authoritative source for "the truth"). If a system can't handle natural language, it will be useless to answer any questions beyond the absolute basics.
- Meryn Stol
P.S. I don't think I answered your question. I don't think it can be answered. You could write a whole essay about your requirements alone.
- Meryn Stol
Meryn - regarding authority on facts: we trust facts asserted by highly reputable publishers more than those asserted by any random source, right? Devising authority systems for facts that are expressed with consistent semantic markup rules shouldn't be too difficult.
- Sean McBride
Regarding natural language: the problems in developing natural language understanding systems that work reliably are enormous. The issue will be solved eventually, but probably not for at least a few decades, in my opinion. Adding just a bit of simple semantic markup to texts -- and especially to reference and factual information -- gets over this hurdle in one swift leap.
- Sean McBride
Meryn - great link on Controlled Natural Language. You post many useful links and pointers on this subject, along with Deepak and Roger Chen.
- Sean McBride
Why do we want all the facts in the world to be machine-readable, interoperable, modular, self-documented, inferencable, etc.? So that we can apply automated methods of knowledge discovery on them to uncover hidden patterns, connections, networks, factors, influences, trends and causal relations.
- Sean McBride
Sean, I think that can be done without specially prepared data. There's lots to infer from plain English, certainly when combined with semantic analysis (e.g. recognizing person's names, places) and one's own - or aggregate - delicious tags.
- Meryn Stol
Meryn - do you have any pointers to working programs or demos?
- Sean McBride
This would be a good test of the abilities of a natural language processor: scan, say, William Shakespeare's Hamlet and graph all the social relations among all the characters in the play as a set of well-formed sentences or semantic assertions. I don't think this can be done with current technology.
- Sean McBride
Sean, I said "I think that can be done", that's not the same as saying that it's being done. :) I think you can do a lot if you combine the power of different tools/algorithms out there... I think it's not so much a matter of shortcomings in technology, but rather that no one is applying current technology to such specific problems. Of course this requires some "stitching" (dare I say "scripting"?).
- Meryn Stol
noticed I could use a feature: I like the comments/discussion.
- LeaNder
"William Shakespeare's Hamlet and graph all the social relations among all the characters in the play as a set of well-formed sentences or semantic assertions." I tried to do all kinds of things with Shakespeare's plays. You want an "expanded" players list, so to speak?
- LeaNder
The social network analysis of Hamlet. I would like an AI program that would generate a list of well-formed semantic triples describing all the relations of each of the characters in any literary work with all the other characters, for instance: x [father] y, x [father of] y, x [lover] y, x [murderer of] y, x [murdered by] y, etc. I doubt that kind of thing will be doable for another few decades. One would have to manually encode the triples for the time being. (One could obviously also apply the same method to describing the relations among real persons in any given historical situation.)
- Sean McBride
{.William Shakespeare^Hamlet [triples] Claudius [lover] Gertrude | Claudius [nephew] Hamlet | Claudius [wife] Gertrude | Gertrude [first husband] Hamlet | Gertrude [husband] Claudius | Gertrude [husband] Hamlet | Gertrude [lover] Claudius | Gertrude [second husband] Claudius | Gertrude [son] Hamlet | Hamlet [father] Hamlet | Hamlet [lover] Ophelia | Hamlet [mother] Gertrude | Hamlet [personality trait] impulsive | Hamlet [personality trait] indecisive | Hamlet [personality trait] melancholy | Hamlet [personality trait] rash | Hamlet [student at] University of Wittenberg | Hamlet [uncle] Claudius | Laertes [father] Polonius | Laertes [sister] Ophelia | Ophelia [brother] Laertes | Ophelia [father] Polonius | Ophelia [lover] Hamlet | Polonius [daughter] Ophelia | Polonius [son] Laertes}
- Sean McBride
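A minimal Python sketch of how a block like the one above might be machine-read. It assumes the format is exactly "{.creator^work [triples] subject [property] value | ...}" as shown in the example; the function names parse_triples_block and relations_of are hypothetical and not part of any existing tool.

```python
import re

# Assumed shape of one assertion: "subject [property] value".
TRIPLE_RE = re.compile(
    r"^\s*(?P<subject>[^\[\]]+?)\s*\[(?P<prop>[^\]]+)\]\s*(?P<value>.+?)\s*$"
)

def parse_triples_block(text):
    """Parse '{.creator^work [triples] s [p] v | s [p] v ...}' into a dict."""
    body = text.strip()
    if not (body.startswith("{.") and body.endswith("}")):
        raise ValueError("not a braced proposition")
    body = body[2:-1]  # drop the opening '{.' and the closing '}'
    head, marker, rest = body.partition("[triples]")
    if not marker:
        raise ValueError("missing [triples] marker")
    creator, _, work = head.strip().partition("^")
    triples = []
    for chunk in rest.split("|"):  # assertions are separated by '|'
        match = TRIPLE_RE.match(chunk)
        if match is None:
            raise ValueError(f"malformed triple: {chunk!r}")
        triples.append((match["subject"], match["prop"], match["value"]))
    return {"creator": creator.strip(), "work": work.strip(), "triples": triples}

def relations_of(parsed, subject):
    """Return every (property, value) pair asserted about one subject."""
    return [(p, v) for s, p, v in parsed["triples"] if s == subject]

sample = (
    "{.William Shakespeare^Hamlet [triples] Hamlet [mother] Gertrude"
    " | Hamlet [uncle] Claudius | Ophelia [brother] Laertes}"
)
parsed = parse_triples_block(sample)
print(parsed["work"])
print(relations_of(parsed, "Hamlet"))
```

The sketch only reads hand-written triples; it does nothing toward extracting them from the play's text, which is the hard part discussed above.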
Dod: in my opinion, average people are never going to use RDFa to express facts about the world in an informal and spontaneous way in daily communications. We need a much simpler interface and set of conventions that might map to RDFa. For instance, Friendfeed already knows who I am if I am a Friendfeed user. If I want to say I like, say, Neal Stephenson, {.Neal Stephenson+} embedded right here in a comment should do. At the beginning of a Friendfeed comment or message, one could drop the braces.
- Sean McBride
.Neal Stephenson^Cryptonomicon+ // I just said I like Neal Stephenson's Cryptonomicon, and this is a comment after the statement following two slashes. Format: *creator^*work
- Sean McBride
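A rough Python sketch of reading the "like" shorthand above, assuming only what the two examples show: an opening period, an optional creator^work split, a trailing "+" meaning "I like this", optional braces, and a comment after two slashes. The name parse_like and the returned dictionary keys are hypothetical.

```python
def parse_like(statement):
    """Parse '.creator^work+ // comment' (braces optional) into a dict."""
    text, _, comment = statement.partition("//")  # comment follows two slashes
    text = text.strip()
    if text.startswith("{") and text.endswith("}"):
        text = text[1:-1].strip()  # braces may be dropped at the start of a message
    if not text.startswith("."):
        raise ValueError("statement must start with a period")
    text = text[1:]
    liked = text.endswith("+")  # trailing '+' is read as "I like this"
    if liked:
        text = text[:-1]
    creator, caret, work = text.partition("^")
    return {
        "creator": creator.strip(),
        "work": work.strip() if caret else None,
        "liked": liked,
        "comment": comment.strip() or None,
    }

with_work = parse_like(".Neal Stephenson^Cryptonomicon+ // my favorite so far")
creator_only = parse_like("{.Neal Stephenson+}")
print(with_work)
print(creator_only)
```

Who is doing the liking is not in the statement itself; as noted above, that would come from the service already knowing which user posted it.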
Sean: At the risk of appearing ludditey, I must say that I am missing the point of this semantic thing. It seems to be a degradation of our language, even if it isn't meant to be.
- eggsy
Nice list, Sean. But I think I was more obsessed with trying to visualize the "larger picture": meta-subjects, movement of characters, yes, relations, but also their functions regarding the whole. How does he manage to keep the audience's attention from the first minute? How does he heighten terror by relaxing people with jokes, so that the terror that follows is even stronger? Basically contrasts, movements, patterns = dramaturgy. After a while you start to notice what a complex universe a play is.
- LeaNder
Eggsy - the point of the Semantic Web is to make it possible for computers to perform complex and powerful operations on the conceptual content of human works. The implications of the technology could be quite staggering for human civilization. I recall a period when most people didn't get HTML either -- before HTML and the Web revolutionized the world. All of these markup languages have their roots in Charles Goldfarb's vision of SGML (Standard Generalized Markup Language). Do you know any of the history?
- Sean McBride
Leander - of course; there are much more complex dimensions of a play than mapping out the social relations among characters. I chose the most simple and elementary level of approaching Hamlet as a means to get a handle on the subject. We have to crawl before we can walk and run. Social relations (especially family relations) are easy to pick off. Mental and emotional states are much more difficult to represent in a formal ontology.
- Sean McBride
Leander - I am guessing, by the way, that one might need 10,000 or more semantic assertions about Hamlet to represent the conceptual content of the play adequately.
- Sean McBride
Sean: No, I admit I am ignorant in this regard. But if I hear you right, the Semantic Web is in scope entirely outside the realm of spoken language. But I wonder if it is possible for that separation to endure.
- eggsy
Leander - what makes art work: whatever it is, it's bigger than the sum of the conceptual parts, I think. But it would be interesting to give computers the ability to reason about and draw inferences from the conceptual content of works of art, to make connections between any given work of art and the totality of the culture in which it is embedded.
- Sean McBride
The idea here, eggsy, is to add to natural language the minimum amount of formal syntax required to enable machine-reading and machine-inferencing of texts. Then some very powerful capabilities come into view for understanding what is going on in the world.
- Sean McBride
{.play; William Shakespeare; Hamlet [character] Gertrude [son] Hamlet} An assertion about the semantic content of Shakespeare's Hamlet that is standalone, modular, machine-readable, reusable, recombinable, etc. The format: {.play; *author; *title [*property] *value [*property] *value}.
- Sean McBride
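A small Python sketch of reading the {.play; *author; *title [*property] *value ...} format above. It assumes the head tag and its fields are separated by semicolons and that each bracketed property is followed by its value; parse_proposition is a hypothetical name, and the sketch keeps the property/value pairs in order without deciding how they nest.

```python
import re

# One "[property] value" pair; the value runs until the next bracket.
PAIR_RE = re.compile(r"\[([^\]]+)\]\s*([^\[\]]+)")

def parse_proposition(text):
    """Split '{.format; field; field [prop] value [prop] value}' into parts."""
    body = text.strip()
    if not (body.startswith("{.") and body.endswith("}")):
        raise ValueError("not a braced proposition")
    body = body[2:-1]
    head, _, _ = body.partition("[")  # everything before the first property
    fields = [field.strip() for field in head.split(";")]
    pairs = [(prop.strip(), value.strip()) for prop, value in PAIR_RE.findall(body)]
    return {"format": fields[0], "fields": fields[1:], "pairs": pairs}

result = parse_proposition(
    "{.play; William Shakespeare; Hamlet [character] Gertrude [son] Hamlet}"
)
print(result)
```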
"...bigger than the sum ..." Obviously. For me Shakespeare is quite a good example anyway. I simply have to admit that it would be highly fascinating to have the most diverse knowledge (concordances, variants, quartos, folio, prompt books, sources, theater historical details ...) in one meta text. Admittedly your lists started to make a lot of sense from the point where they were in triples; they somehow stopped being lists and started being knowledge.
- LeaNder
Leander - you get the vision. Merge all the reference texts about a single topic (for instance, Shakespeare or a single Shakespeare play) into a single knowledgebase that is easily machine-inferencable in sophisticated and powerful ways. The bigger vision: merge all the information in the world into such a knowledgebase. Data mine everything all the time. Give the system the power to learn and rapidly evolve on its own.
- Sean McBride
@sean: we can use more standard conventions for shorthand writing of objects. If you add a freebase urn name space, you can write fb:neal_stephenson. Still, when I want a more complex item, I'd write RDFa that gets rendered to "Neal Stephenson's Cryptonomicon" yet it's about fb:cryptonomicon and has the foaf:maker fb:Neal_Stephenson. As for the fact that it's harder to write: all we need is a WYSIWYG editor letting me mark something and pick from menus what it's about. When HTML started, we only had text editors.
- ĎÚβĨŐÚŚ Dod
@eggsy: I agree that semantic languages are indeed "a degradation of our language", that's the nice thing about RDFa. The natural language part of it could be, e.g. "My admiration to Einstürzende Neubauten knows no bounds" or even a really poetic equivalent in impossible-to-translate Russian, and yet - there's an RDFa "undercurrent" saying "the paragraph's subject is interested in fb:einsturzende_neubauten" (which is an understatement, but is still true and machine-readable).
- ĎÚβĨŐÚŚ Dod
Dod - why can't we make this much easier for non-programmers and let a smart processor handle the hard work of generating verbose RDFa code? If I can say that I like Neal Stephenson's Cryptonomicon with {.Neal Stephenson^Cryptonomicon+} why require me to say more? There was a time when SGML purists thought HTML was too primitive and simple. And then a time when HTML purists thought Wiki Markup was too simple. I say, the simpler the better. Simplicity is the holy grail. More and more work for less and less effort. Junior high school students should be able to master semantic markup in under an hour.
- Sean McBride
Your proposed notation disregards existing standards and efforts. There are many choices and combinations (the ones I've adopted are a matter of my own opinions and intuition), but why invent your own? I'm not sure I understand your syntax and I don't have documentation to read or existing apps to play with. I also have problems with semantics, e.g. why mormon is a "class" and not a simple OPV fact, either foaf:belief or foaf:member (if Mormon means "The Church of Jesus Christ of Latter-day Saints" - a different freebase URL)
- ĎÚβĨŐÚŚ Dod
Dod - human progress is often about disregarding existing standards and efforts and rebuilding systems from the ground up, with perhaps some references to older systems. Folks in computer science especially tend to become attached to particular cultures and ways of doing things and resist the formation of new cultures and standards. I think there should be a much simpler and easier way for ordinary people and non-specialists to do semantic markup than RDF. I am not at all impressed by RDF standards and conventions, and am convinced that they will never be adopted as an everyday markup system by most people. But you may well be correct that I haven't clearly articulated my ideas about a simpler standard. :)
- Sean McBride
Dod - in my suggested markup scheme, RDF triples are just one microformat among several dozen (and perhaps several hundred down the line). The scheme is highly flexible and provides for many variant ways to express the same factual propositions. Regarding classes (which I call categories): probably 90% of the most useful human knowledge can be expressed as simple category/instance pairs. For instance, {.nation /i Sweden} states that Sweden is an instance of the category nation, where the format is {.*category /i *instance}. In NML one could also say {.category; nation; Sweden} where "category;" is a formal format and head tag: {.category; *category; *instance}. The {.*category /i *instance} format, you will notice, is also an OPV triple: {.*category=*object /i=*property *instance=*value}. But users don't need to keep these esoteric matters in mind.
- Sean McBride
Inside an NML editor you could dispense with the braces and opening period, which are used to mark the boundaries of NML propositions in any plain text. <nml>nation /i Sweden</nml> also works, but is more verbose.
- Sean McBride
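A hedged Python sketch of how a processor might pull category/instance pairs out of ordinary text, covering the three spellings mentioned above: {.*category /i *instance}, {.category; *category; *instance}, and the <nml>...</nml> wrapper. The function names extract_propositions and category_instance are hypothetical, and the sketch assumes propositions never contain nested braces.

```python
import re

# Either a braced proposition "{. ... }" or a tagged one "<nml> ... </nml>".
PROPOSITION_RE = re.compile(r"\{\.([^{}]*)\}|<nml>(.*?)</nml>", re.DOTALL)

def extract_propositions(text):
    """Return the bodies of all NML propositions found in plain text, in order."""
    bodies = []
    for match in PROPOSITION_RE.finditer(text):
        braced, tagged = match.group(1), match.group(2)
        bodies.append((braced if braced is not None else tagged).strip())
    return bodies

def category_instance(prop):
    """Read a proposition body as a (category, instance) pair, or None."""
    if " /i " in prop:  # short form: category /i instance
        category, _, instance = prop.partition(" /i ")
        return category.strip(), instance.strip()
    parts = [part.strip() for part in prop.split(";")]
    if len(parts) == 3 and parts[0] == "category":  # head-tag form
        return parts[1], parts[2]
    return None

text = (
    "Some notes: {.nation /i Sweden} and <nml>nation /i Norway</nml>, "
    "plus {.category; nation; Denmark} in the longer form."
)
props = extract_propositions(text)
pairs = [category_instance(p) for p in props]
print(props)
print(pairs)
```

All three spellings reduce to the same pair here, which is the sense in which the scheme offers variant ways to express one proposition.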