Thursday, September 13, 2007

Search + single point of entry + availability = successful knowledge management!

In my very first post on this blog I mentioned how important search capabilities are if you want a successful knowledge management system. Forget about categories, sorting and agreeing on one format for all knowledge; I predict it will not work for you if you go down that road! Instead, think multiple systems and formats for storage, and focus on a single point of entry for availability and search! Does this sound like something you know? Google! It is not without reason that Google "won" over the old directory-style search sites that relied on manual categories!
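To make the "single point of entry" idea a bit more concrete, here is a minimal sketch in Python. It is not taken from any of the tools discussed in this post; the names make_source and search_all and the two toy sources ("wiki" and "files") are made up for illustration. Each storage system keeps its own format, and one function asks all of them and labels every hit with where it came from.

```python
from typing import Callable, Dict, List, Tuple

# A hypothetical "source" is any function: query -> list of (text, location).
Source = Callable[[str], List[Tuple[str, str]]]


def make_source(docs: Dict[str, str]) -> Source:
    """Build a toy source from a {location: text} mapping.

    This stands in for a real index (wiki, file share, and so on).
    """
    def search(query: str) -> List[Tuple[str, str]]:
        q = query.lower()
        return [(text, loc) for loc, text in docs.items() if q in text.lower()]
    return search


def search_all(query: str, sources: Dict[str, Source]) -> List[Tuple[str, str, str]]:
    """Single point of entry: query every source and tag hits with their origin."""
    hits: List[Tuple[str, str, str]] = []
    for name, source in sources.items():
        try:
            for text, loc in source(query):
                hits.append((name, text, loc))
        except Exception:
            # One unavailable system should not take the whole search down.
            continue
    return hits


# Two made-up systems with different kinds of locations.
sources = {
    "wiki": make_source({"wiki/backup": "Backup procedure for the file server"}),
    "files": make_source({
        "//share/docs/net.pdf": "Network diagram",
        "//share/docs/bk.doc": "Backup schedule",
    }),
}

for origin, text, location in search_all("backup", sources):
    print(origin, "|", text, "|", location)
```

The point of the sketch is only the shape: the user types one query in one place, and nobody had to agree on a common storage format first.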

I have two agendas for this search tool/search system/search engine investigation: on the one hand I am looking for something useful for an enterprise, and on the other hand I want to check out the open source possibilities so I can have something to play with in various home/friend projects! The main differences are money and how many systems/data sources the search can crawl, index and interface to. No matter which agenda you have, you should be able to get inspired by this list of requirements:
  • Index ViewVC websites, which can be protected by shared login credentials.
  • Crawl text and pdf documents on websites.
  • Must scale well for many documents!
  • Must be gentle on the crawled servers, tunable, and handle errors gracefully.
And this nice-to-have feature list:
  • Administration of who (users, public, IP-based) has access to search information from different sources.
  • Index/search multimedia formats, pictures and video, similar to Blinkx and Google Images.
  • Handle searches with foreign charsets, e.g. Danish æøå.
  • Crawl docs on FTP sites, e.g. with anonymous login.
  • Crawl new and old Microsoft Office documents, such as Word, Excel and PowerPoint.
  • Crawl Windows shares.
  • Crawl WebDAV.
  • Interface to and crawl Microsoft SharePoint sites.
  • Interface to and crawl Lotus Notes databases, at least through web enabled databases.
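The "gentle, tunable and handles errors gracefully" requirement can be sketched in a few lines of Python. This is my own illustration of what I mean by it, not how any of the tools mentioned in this post work; the function names polite_crawl and fetch_url and the default values are assumptions. The delay between requests and the number of retries are the tunable knobs, and a failing URL is recorded as None instead of stopping the whole crawl.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Default fetcher: plain standard-library HTTP GET."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def polite_crawl(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_url,
    delay: float = 1.0,      # seconds to wait between requests (tunable)
    retries: int = 2,        # extra attempts per URL after the first failure
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Optional[bytes]]:
    """Fetch each URL in turn, pausing between requests and surviving errors."""
    results: Dict[str, Optional[bytes]] = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # be gentle: never hammer the server
        results[url] = None  # stays None if every attempt fails
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except OSError:
                # URLError and socket timeouts are OSError subclasses.
                if attempt < retries:
                    sleep(delay * (attempt + 1))  # back off a little more each time
    return results
```

Calling polite_crawl(["http://example.com/a.txt", "http://example.com/b.pdf"], delay=2.0) would fetch the two (placeholder) URLs two seconds apart. A real crawler would also need to respect robots.txt and follow links, which this sketch leaves out.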
I started by looking at Google Custom Search Engines (Google CSE) and Google Custom Search Business Edition (CSBE). These services are not free for the requirements I have, so I have decided not to spend more time on them. A snippet from the Google CSBE website:
Custom Search Business Edition is great for public websites that have a lot of web-based content that needs to be easily searchable.

Google Desktop Search does not fit what I am looking for either, so moving on.

Google has some other products which look very interesting: Google Search Appliance (GSA) and Google OneBox. OneBox can supposedly interface to many systems (CRM, ERP, etc.), and you can develop your own module for it. Take a look at the different GSA products, or use the feature matrix for the different versions of GSA. GSA or OneBox is definitely very interesting, especially for the large enterprise that might want to save resources and spend some money to get what is probably the best search tool in the world! But I don't have any of those Google tools available to me right now and probably never will, at least not for private or community usage!

So I kept searching ;-) and I quickly became fond of the incredible details and amount of information available at Search Tools for Web Sites and Intranets (http://www.searchtools.com/).

I found several open source search tools which seem to fit a fair number of my requirements and nice-to-have features above, so I would like to give the following a try: OpenWebSpider, ASPSeek, mnoGoSearch, DataParkSearch and Swish-E.

It was not crystal clear to me which could in fact index Microsoft Office files, but at least Swish-E and ht://Dig seemed capable.

I have to admit that I tend to stay away from Java and PostgreSQL based systems, as I have had little or no experience running those for a while!

Some of the open source search tools are available in the FreeBSD ports collection (of which I am a huge fan), so those will be the ones I test: DataParkSearch, Swish-E and mnoGoSearch!

Other URLs I visited during this initial search tool investigation:
http://www.searchenginewatch.com

What is a search engine? http://www.techweb.com/encyclopedia/defineterm.jhtml;jsessionid=DB3VMBYCAINF4QSNDLRCKHSCJUNN2JVN?term=search+engine

Read about Wikia, see:
http://www.informationweek.com/blog/main/archives/2007/08/will_google_be_1.html
