Away Mission: SemTech 2010 - Summary
Almost 1200 people attended this year's Semantic Technology Conference (SemTech 2010) in San Francisco. That was a big jump from last year, probably due in equal parts to the move to San Francisco and to the growing use of SemTech all over the web. Remember, this is a recession year.
Where do we find RDF and SemTech? The US government and maybe European governments. Semantic annotations are being added to many public databases. Also, the latest HTML5 draft has RDFa support in it. And who uses Semantic Technology now on the Web? The BBC, Newsweek, the NY Times and even companies like Best Buy have everyday applications running on SemTech. In short, Semantic Technology is everywhere. (Visit the earlier Away Mission that describes the basics of the Semantic Web here.)
The event bills itself as the world's largest conference on semantic technologies. It is also focused on the commercial aspects of developing semantic technologies and incorporating these into social, business and government processes.
One new aspect of the conference, which showed the pervasiveness of Semantic Technology, was the large number of vertical industry tracks. There were tracks for Open Government, Enterprise Data Management, Health-care and Life Sciences, SOA Governance, Publishing, Advertising and Marketing, and Semantic Web Technologies.
The other interesting trend was the focus on automated metadata extraction and the automation of ontology and taxonomy building. This was reflected on the expo floor and in the number of presentations discussing this new level of automation.
While there were fewer vendors - many of last year's vendors had been snapped up as divisions of larger companies - the offerings were richer and more mature. This was echoed in the conference by the frequent discussions of monetization and business models. I'll address that in a few paragraphs.
SemTech 2010 ran Monday through Friday, with intro sessions Monday evening and tutorials all day Tuesday. The Friday tracks changed a bit from primarily vendor-sponsored user sessions of previous years. For one example, there was a hands-on tutorial on Drupal 7 and RDFa open to any conference attendee. The slides from that session are on SlideShare at http://www.slideshare.net/scorlosquet/how-to-build-linked-data-sites-with-drupal-7-and-rdfa
Drupal is a widely-used open source platform for publishing and managing content. Version 7 is almost released and it will use RDFa for storing site and content metadata. Drupal 7 will have RDFa enabled by default for metadata on all nodes and comments. The key point here is the increasing commonality of RDFa functionality in new websites and site building tools. That emerging ubiquity will build the Semantic Web of meaning that Tim Berners-Lee spoke about almost a decade ago.
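To make the idea of RDFa metadata concrete, here is a minimal Python sketch of how a program might read `property` attributes out of a page. The sample markup, the date in it, and the names `RDFaPropertyParser` and `extract_properties` are made up for illustration; this is not Drupal 7's actual output, and a real RDFa processor handles far more (prefixes, nesting, `about`, `typeof`, and so on).

```python
from html.parser import HTMLParser


class RDFaPropertyParser(HTMLParser):
    """Collect (property, value) pairs from RDFa `property` attributes.

    Simplification: the value is the `content` attribute if present,
    otherwise the text up to the next end tag (nested markup inside a
    property element is not handled).
    """

    def __init__(self):
        super().__init__()
        self.pairs = []     # finished (property, value) pairs
        self._open = None   # property whose element text is being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None:
            return
        if "content" in attrs:
            # An explicit machine-readable value overrides the visible text.
            self.pairs.append((prop, attrs["content"]))
        else:
            self._open = prop
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is not None:
            self.pairs.append((self._open, "".join(self._text).strip()))
            self._open = None


# Hypothetical page fragment, written by hand for this example.
SAMPLE = """
<div xmlns:dc="http://purl.org/dc/terms/" about="/node/1">
  <h2 property="dc:title">Away Mission</h2>
  <span property="dc:date" content="2010-08-01">August 2010</span>
</div>
"""


def extract_properties(html):
    """Return the list of (property, value) pairs found in `html`."""
    parser = RDFaPropertyParser()
    parser.feed(html)
    return parser.pairs


if __name__ == "__main__":
    for prop, value in extract_properties(SAMPLE):
        print(prop, "=", value)
```

The point of the sketch is only that the metadata travels inside the ordinary HTML, so any tool that can parse the page can also pick up the meaning attached to it.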
Just in the last few months, a new working group was formed for RDFa at the W3C with the goal of defining additional structure in HTML files for additional metadata. The group is working on linking RDFa into HTML5 and XHTML5 as well as the Open Document format used by OpenOffice. As an international example, the UK's e-government effort plans to use RDFa throughout its web pages. This allows the data to be read within a context and used with that context preserved.
The Open Graph used within the Facebook web site is a simplified form of RDFa, and the adoption of Open Graph by other social web sites allows RDF-based sharing of semantic information. For more info on Open Graph, check out this presentation on SlideShare, which is similar to one made at SemTech 2010:
I should mention here that several of the technical presentations from SemTech 2010 and 2009 are on SlideShare and can be found by searching for the string "semtech".
Purchasing SemTech Companies
A few weeks before this year's SemTech, Apple purchased SIRI, one of the bright stars from SemTech 2009. SIRI's intelligent semantic agent for desktops and mobile devices, which mixed location-based services, mashups, and voice recognition with an inference engine, was a hit with the SemTech crowd, and I had been looking forward to a progress report.
From the Semantic Universe review of the SIRI sale: "SIRI was in the vanguard of leveraging the AI-based apps and applications that relate to the semantic web developed by SRI" and "one of the core and most innovative technologies coming out of SRI - which led the 'CALO: Cognitive Assistant That Learns and Organizes' DARPA project. Maybe what's most important about SRI and CALO's Active Ontologies innovation... was to create passive inferencing that... lets one reach conclusions and drive to actions based on those conclusions." More Semantic Universe reviews of the SIRI sale can be found here: http://www.semanticweb.com/on/apple_buys_siri_once_again_the_back_story_is_about_semantic_web_159896.asp
Catch the SIRI video embedded there - or go directly to http://www.youtube.com/watch?v=MpjpVAB06O4.
Of course, just after this year's SemTech event, Google bought Metaweb, the force behind the important Freebase global semantic data store. On the company blog, Google wrote: "Working together, we want to improve searching and make the web richer and more meaningful for everyone."
They also wrote: "Google and Metaweb plan to maintain Freebase as a free and open database for the world. Better yet, we plan to contribute to and further develop Freebase and would be delighted if other web companies use and contribute to the data. We believe that by improving Freebase, it will be a tremendous resource to make the web richer for everyone. And to the extent the web becomes a better place, this is good for webmasters and good for users."
One immediate benefit of the acquisition: Metaweb has increased the frequency of downloadable Freebase database dumps from quarterly to weekly.
You can visit the YouTube intro to Metaweb: http://www.youtube.com/watch?v=TJfrNo3Z-DU&feature=youtu.be
And, for more detail, view the "Introducing Freebase Acre 1.0" video: http://www.youtube.com/watch?v=dF-yMfRCkJc
SemTech Code Camp
The Freebase presentation was one of the best at the free Semantic Code Camp held one evening during SemTech 2010. Delivered by Jamie Taylor, the Minister of Information at Metaweb Technologies, it described the project, which has over 12 million topics that can be queried by SPARQL and other semantic protocols. It gets the information from Wikipedia, the government, SFMOMA, MusicBrainz, and other sites providing semantic data. It acts as a Rosetta Stone between identifiers in different systems such as Netflix, Hulu, Fandango, etc. with 99% accuracy. All this runs on a Creative Commons license.
You can find that Freebase presentation here: http://www.slideshare.net/jamietaylor/sem-tech-2010-code-camp
Other Code Camp presenters included Andraz Tori from Zemanta, which can read in most text from many domains and give back standard tags and categories, Wen Ruan of Textwise, which uses faceted search to find and use subsets of a data collection, and Tom Tague of OpenCalais, a project of Thomson Reuters, which can categorize and tag the people, places, companies, facts, and events in content.
The Code Camp was a beginner's track on how to use Semantic Technology and an open Unconference track. A nice aspect was that anyone registering through the Silicon Valley Semantic Technology Meetup could attend for free and also get an expo pass to SemTech and share in the preceding reception and free margaritas.
Google Rich Snippets
Google announced "Rich Snippets" just over a year ago in May 2009. Today, if webmasters add structured data markup to their web pages, Google can use that data to show better search result descriptions. This is not fully implemented yet, but extra snippets are being captured and are showing up in Google search results. One example of this is the RDFa used in the display of Google video selection results or recipes that display calories and prep time on the first line.
Kavi Goel and Pravir Gupta of Google described the simple markup vocabulary used by Google for enhanced indexing. Currently, this can be in a microformat or RDFa. They explained that they will add to the vocabulary as more domains are added to the Rich Snippets ontology. They will also be expanding the vocabulary to include over 40 different languages, and will use the Facebook Social Graph to link friends and locations.
They have just released a testing tool for webmasters and content developers. One reason for this: most web sites were not doing the snippet markup correctly. The tool will show which elements are visible and what properties are listed. If nothing shows at all, then the site is not using snippets correctly.
Among the more common mistakes are the use of hidden text and in-line markup, both of which Google search ignores. In addition, the person doing the markup needs to be very clear in terminology. They cited ambiguities like differences in ratings versus reviews or votes versus counts. They expect these to be cleared up with practice. They also mentioned that global use of snippets was growing at 4x the US rate from 10/09 to 06/10.
Web 3.0 Web Sites
The session on Building Web 3.0 Web Sites discussed best practices in implementing SEO, RDFa and semantic content for Web 3.0 sites.
The presenters - John Walker, Mark Birbeck, and Mark Windholtz - discussed how to transform current content into semantically-rich web pages. They emphasized the need to improve and maintain data quality, both for the main content and for the tags and metadata that will enhance access and combination. In addition they recommended the following:
- use emerging standards like HTML 5, RDFa (and 'Google Rich Snippets'), WURFL, etc.
- establish robust user personas
- try to improve metadata throughout the site
- use existing ontologies or establish one or more as needed
- use web namespaces and URIs
- implement triple stores and use a REST web architecture
If working within a company, try to break down application silos and free the data. They suggested making data accessible by having triple stores run 'on top' of existing databases. It's much less expensive to do this and needs less effort to start. Finally, it may be easier for IT departments to accept semantic extensions when the underlying data resides in traditional databases maintained by the usual staff.
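As a rough illustration of running a triple view 'on top' of an existing database, the Python sketch below reads rows from an ordinary relational table and exposes each cell as a subject-predicate-object triple. Everything in it is assumed for the example: the `product` table, the `http://example.org/` namespace, and the function names `rows_to_triples` and `demo`. The presenters did not show this code, and production tools for this kind of mapping are considerably more capable.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace for the generated URIs


def rows_to_triples(conn, table, key_column):
    """Yield (subject, predicate, object) for every non-key cell of a table.

    `table` and `key_column` are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        record = dict(zip(columns, row))
        # The primary key identifies the thing being described.
        subject = f"{BASE}{table}/{record[key_column]}"
        for column, value in record.items():
            if column == key_column or value is None:
                continue
            # Each remaining column becomes a predicate in the table's namespace.
            yield (subject, f"{BASE}{table}#{column}", value)


def demo():
    """Build a tiny in-memory database and return its triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT, price REAL)")
    conn.execute("INSERT INTO product VALUES ('A1', 'Widget', 9.5)")
    triples = list(rows_to_triples(conn, "product", "sku"))
    conn.close()
    return triples


if __name__ == "__main__":
    for triple in demo():
        print(triple)
```

The data never leaves the traditional database, and the usual staff keep maintaining it there; the triples are just another way of reading it.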
Automatically Distilling Semantic Content
Several presentations focused on the emerging techniques for dynamically extracting meaning and metadata and mapping text information into taxonomies and ontologies. These include combinations of statistical and rules-based metadata extraction and auto-tagging and auto-classification to link related content.
Some of this is being facilitated by running semantic projects that are building standard vocabularies and standard ontologies. Examples of this are TextWise and Freebase, as well as efforts now taking place in industry verticals.
One presentation with a huge title was The Use of Semantic Analytics to Achieve Data Transparency Using Government Standards for Information Sharing, by Bill Stewart of Jeskell. Stewart noted that almost half the work week of a knowledge worker is spent finding needed information -- 19 hours a week! Globally, we are collecting 15 petabytes every day and 80% of new data is unstructured. For the demo in his session, he showed the Languageware Workbench finding work roles from text information on the Katrina recovery effort and distilling out an entity list of working teams. The rule set had been built over a two-day period parsing the text record.
In Assistive and Automatic Tools for Tagging and Categorizing Anything, Seth Maislin did a survey of automation techniques. Although the goal is to save costs by getting human intervention out of the loop, Maislin pointed out that domain expert intervention may be needed to review auto-indices and auto-built ontologies. Maislin suggested doing this toward the beginning of a project to build trustworthy associations. Building the index, with or without the aid of a taxonomy, is the place for human involvement. In MedLine, humans assist categorization by accepting or declining initial results; experts then review and correct them, aided by active social tagging by users. He also recommended using open source taxonomies to start and extend your own.
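The mix of rules, statistics, and human review described in these sessions can be sketched in a few lines of Python. This is a toy under stated assumptions, not any vendor's method: the rule set, the stopword list, the frequency threshold, and the function name `suggest_tags` are all invented for the example. Rule matches are accepted outright, while frequent terms are only proposed and left for a domain expert to accept or decline.

```python
import re
from collections import Counter

# Hypothetical rule set: regular expression -> tag.
RULES = {
    r"\brdfa?\b": "semantic-web",
    r"\bontolog(y|ies)\b": "ontology",
}

# A deliberately tiny stopword list, just enough for the example.
STOPWORDS = {"the", "and", "of", "a", "to", "in", "is", "for", "on", "with"}

SAMPLE_TEXT = (
    "The ontology maps taxonomy terms. Each taxonomy feeds RDF metadata, "
    "and metadata drives search."
)


def suggest_tags(text, min_count=2):
    """Return (accepted, needs_review) tag lists for a piece of text."""
    lowered = text.lower()
    # Rules-based pass: a matching pattern yields a tag we accept outright.
    accepted = [tag for pattern, tag in RULES.items() if re.search(pattern, lowered)]
    # Statistical pass: words longer than three letters that repeat often
    # enough become candidate tags for a human to accept or decline.
    words = [
        word
        for word in re.findall(r"[a-z]+", lowered)
        if word not in STOPWORDS and len(word) > 3
    ]
    counts = Counter(words)
    needs_review = sorted(
        word for word, n in counts.items() if n >= min_count and word not in accepted
    )
    return accepted, needs_review


if __name__ == "__main__":
    accepted, needs_review = suggest_tags(SAMPLE_TEXT)
    print("accepted:", accepted)
    print("needs review:", needs_review)
```

Keeping the two lists separate reflects Maislin's advice: automation does the bulk of the work, but the uncertain suggestions go to a person early, before they become part of the index.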
Nepomuk in KDE SC 4
Another example of the pervasiveness of SemTech is the Nepomuk project which provides a standardized, conceptual framework for Semantic Desktops. Part of the new tech in KDE SC 4, Nepomuk shares data from several desktop applications using RDF metadata. There are C/C++ and Java implementations.
Core development results will be publicly available during the project via a web-based collaboration platform. The most prominent implementation is available at http://nepomuk.kde.org.
Contributors to the project were in various expo booths, including the folks from DERI.
Here's part of the Goal statement from the Nepomuk project website:
"NEPOMUK intends to realize and deploy a comprehensive solution - methods,
data structures, and a set of tools - for extending the personal computer
into a collaborative environment, which improves the state of art in online
collaboration and personal data management...."
Although initially designed to fulfill requirements for the NEPOMUK project, these ontologies are useful for the semantic web community in general. They basically extend the search process with a local desktop RDF store and link data from various applications that use these KDE ontologies. The ontologies are open source and are used by Tracker in GNOME.
The NEPOMUK ontologies are available from the following Web page:
In the last 2 years, the event organizer, Wilshire Conferences, has organized SemTech and the related Enterprise Data World event with a 'lunch on your own' policy except for the tutorial day. That, and the conference tote bag, makes it a bit less expensive. All attendees got an early conference CD, but the conference web site provides password access to updated presentations. By the way, that single lunch on the tutorial day was great! The Hilton kitchen staff produced tasty and eye-pleasing food. And the move from San Jose to the SF Hilton resulted in a better facility overall - an excellent AV team kept all the rooms functional and all but one room had a net of extension cords to power attendee laptops.
I have to say I found SemTech sessions more satisfying and less academic this year and more about real tools and products. I'm happy that they plan to return to the SF Hilton in 2011, and I hope to attend.
Howard Dyckoff is a long term IT professional with primary experience at
Fortune 100 and 200 firms. Before his IT career, he worked for Aviation
Week and Space Technology magazine and before that used to edit SkyCom, a
newsletter for astronomers and rocketeers. He hails from the Republic of
Brooklyn [and Polytechnic Institute] and now, after several trips to
Himalayan mountain tops, resides in the SF Bay Area with a large book
collection and several pet rocks.
Howard maintains the Technology-Events blog at
blogspot.com from which he contributes the Events listing for Linux
Gazette. Visit the blog to preview some of the next month's NewsBytes