Copenhagens Setech: knowledge management

Wednesday, October 17, 2007

Trying OmniFind Yahoo Search

When looking for Windows search utilities, I stumbled upon OmniFind, which seemed too good to be true:
Install it in 3 clicks, configure it in minutes.
Free, searches up to 500,000 documents.
Search both the enterprise and the Internet from a single interface.
Incorporates open source Apache Lucene technology to deliver the best of community innovation with IBM's enterprise features.
But OmniFind was exactly like that! Downloading, installing, configuring, and test-indexing a website and a filesystem location were all done in 15 minutes!

The server OS requirements are not my favorites, but they make sense for the enterprise and are to be expected from IBM, whose favorites are of course Red Hat and SUSE. Too bad for me: my favorite Linux is Debian, and of course I always vouch for FreeBSD.

32-bit Red Hat Enterprise Linux Version 4, Update 3
32-bit SUSE Linux Enterprise 10
32-bit Windows XP SP2
32-bit Windows 2003 Server SP1

Some notes from the testing so far:

Indexing filesystems with .doc and .xls files works like a charm, and the search results can be browsed "as html" and "cached". Very useful!

OmniFind installs as its own web service on a port of your choice. I changed the search page appearance with the company logo and disabled all the Yahoo links. All very simple from the OmniFind admin control panel!

To match a string that is only part of a word, you should add a wildcard. For example, search for "regression*" to make sure you also find occurrences of "regressions".
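
To illustrate why the trailing wildcard matters, here is a minimal Python sketch of whole-term versus prefix matching. It is not OmniFind or Lucene code: the function `matches` and the sample tokens are made up for illustration, and it assumes the engine only matches whole terms unless the query ends in `*`.

```python
def matches(query: str, token: str) -> bool:
    """Return True if token satisfies query; a trailing '*' means prefix match."""
    query, token = query.lower(), token.lower()
    if query.endswith("*"):
        return token.startswith(query[:-1])
    return token == query


# Made-up index terms, only for illustration.
tokens = ["regression", "regressions", "progress"]

exact = [t for t in tokens if matches("regression", t)]
prefix = [t for t in tokens if matches("regression*", t)]

print(exact)
print(prefix)
```

Under that assumption the plain query misses the plural form, while the wildcard query finds both.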

Reindexing seems to be something you have to wrap into your own scripts and schedule yourself, e.g. with at jobs.

You can use scripts to start or stop a crawler.
Crawler management scripts allow you to schedule and execute start and stop crawler actions, or start and stop a crawler from the command line.
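
A wrapper for scheduled reindexing could look roughly like the Python sketch below. I have not written down the names of the crawler management scripts, so the commands here are stand-ins that only print a message; `run_steps` and `demo_steps` are hypothetical names of mine, and you would replace the stand-ins with the real script calls from your installation.

```python
import subprocess
import sys
from datetime import datetime


def run_steps(steps: list[list[str]]) -> bool:
    """Run each command in order; stop and return False at the first failure."""
    for command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        status = "OK" if result.returncode == 0 else "FAILED"
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {status} {' '.join(command)}")
        if result.returncode != 0:
            return False
    return True


# Stand-in commands so the sketch runs anywhere. They are NOT OmniFind
# commands; substitute the crawler stop/start scripts you actually use.
demo_steps = [
    [sys.executable, "-c", "print('stop crawler (placeholder)')"],
    [sys.executable, "-c", "print('start crawler (placeholder)')"],
]

succeeded = run_steps(demo_steps)
```

A script like this can then be scheduled with an at job or cron, and the timestamped lines give you a simple log of each run.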

Cleaning documents that should not have been crawled out of the index is not so friendly. It seems you have to delete the entire source, e.g. a website, and then crawl it again. That can be tiresome for a big website.

The language pack should be installed before you start crawling your big sources, as you will have to crawl them all over again once the language pack has been installed.

Crawling protected websites was possible: I have tested an https:// site protected by basic authentication, and it worked fine. Crawling sites with form-based authentication, such as a company portal document handling system, should also be possible:

HTML form-based authentication
Form name (optional)
Example: loginPage
Form action
Example: http://www.example.org/authentication/login.do
HTTP method: POST or GET
Example: POST
Form parameters (optional)
Example: userid and myuserID
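
To make those settings concrete, here is a small Python sketch of the HTTP request such a form login boils down to. It only builds the request and does not send anything. I am reading "userid and myuserID" as a parameter name and its value, which is my assumption, and `build_login_request` is a hypothetical helper name, not part of OmniFind.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(action_url: str, params: dict[str, str],
                        method: str = "POST") -> Request:
    """Build (but do not send) the request a login form would submit."""
    encoded = urlencode(params)
    if method.upper() == "POST":
        # POST: form parameters travel in the request body.
        return Request(
            action_url,
            data=encoded.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    # GET: form parameters are appended to the URL as a query string.
    separator = "&" if "?" in action_url else "?"
    return Request(f"{action_url}{separator}{encoded}", method="GET")


login = build_login_request(
    "http://www.example.org/authentication/login.do",
    {"userid": "myuserID"},  # assumed reading of the example parameters
)
print(login.get_method(), login.full_url, login.data)
```

A crawler would send this request first, keep the session cookie from the response, and present that cookie on every later page fetch.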

So far, I am very pleased with OmniFind, and I recommend everyone give it a try. OmniFind might be the single point of entry for knowledge search that your organization needs to bring knowledge from many sources to life and into use!

Tuesday, October 16, 2007

Intranet and file system search tools on Windows

Recently I have looked into challenges and requirements for search tools for knowledge management. In my testing, I have been focusing on tools that could run off a Unix box, indexing several sources of information. Testing of those tools is still ongoing.

Now I have another use for search tools, this time running off a Windows server. The requirements, e.g. which sources to index, are the same as for the Unix tools still being tested.

Using the very good searchtools.com website, I found some interesting tools:

Mnogosearch Windows
Zoom search engine
Apache Solr
OmniFind

So far I have set up mnoGoSearch for Windows (MSSQL) with SQL Express 2005, but I still have to set up search integration into IIS. I have stalled this test, mainly because of the price! It is so expensive that I could almost get a GSA Mini instead. For testing, the trial version, which indexes 1 kB of data from each file, is okay, but it is just too expensive to put more work into. On top of that, the Windows version seems to be falling behind in releases and does not seem to be maintained very much.

I have not tested Apache Lucene Solr yet. It may be hard for me to test, as it is Java based and I don't have a ready-to-run test environment for it. From what I have read, Solr should be able to index an intranet, and hopefully shared drives too, but I have to look at it!

OmniFind, like Solr, is based on Lucene, but seems like a better package for me to test. It is free, can index file systems, and sounds too good to be true:

Install it in 3 clicks, configure it in minutes.
Searches up to 500,000 documents.
Search both the enterprise and the Internet from a single interface.
Incorporates open source Apache Lucene technology to deliver the best of community innovation with IBM's enterprise features.

I have installed the Zoom search engine on my laptop, indexed a directory with some .doc, .txt, .cmd etc. files, and put the resulting search page on an IIS web server! Simple and working! In the free version Zoom will only index static files, and a maximum of 50 documents. This is annoying; I would rather have the full version for e.g. 30 days! Notes so far:

Cheap, $99 for pro, $299 for enterprise use.
Very easy setup
Search does not match documents that only have the search word in their filename!
Can reindexing be automated?

Search tools, challenges and non-trivial requirements

I have listed some key challenges for my current usage of search tools:

Create a point of entry for search.
Link to a relevant search query from a portal (e.g. an operations status website).
Some knowledge should only be available to some people. This seems to be the biggest hurdle!

Limiting knowledge/search only to some people could be solved in at least 2 ways:

Set up different indexer/crawler configurations, each searchable from a different search prompt. The problem could be multiple crawls of the same information (load, storage, resources).
Index/crawl everything once, and let the search box/website/frontend control who can see what. This would be preferred.
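
The second approach can be sketched in a few lines of Python: one shared index, with the frontend filtering the hits per user. Everything here is hypothetical and only for illustration, including the `Hit` class, the source names, the group names and the `SOURCE_ACL` table. It assumes each indexed document remembers which source it was crawled from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    url: str
    source: str  # the crawled source this document came from


# Hypothetical access table: source name -> groups allowed to see its hits.
SOURCE_ACL = {
    "public-website": {"everyone"},
    "operations-wiki": {"operations"},
    "hr-share": {"hr"},
}


def visible_hits(hits: list[Hit], user_groups: set[str]) -> list[Hit]:
    """Keep only hits whose source is readable by one of the user's groups."""
    groups = set(user_groups) | {"everyone"}
    # Sources missing from the table are hidden by default.
    return [h for h in hits if SOURCE_ACL.get(h.source, set()) & groups]


all_hits = [
    Hit("http://www.example.org/news.html", "public-website"),
    Hit("http://wiki.example.org/status", "operations-wiki"),
    Hit("file://share/hr/salaries.xls", "hr-share"),
]

for hit in visible_hits(all_hits, {"operations"}):
    print(hit.url)
```

The index is crawled and stored only once, and the filter runs at query time. The price is that the frontend has to know who the user is, so it needs some kind of login.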

Non-trivial requirements which are not always available:

Parse OpenOffice Writer and Calc files (.odt and .ods), which are basically zip files with XML (unzip and parse e.g. content.xml).
Crawl/index file systems (shares/hard drives), setting a base URL for how the search results will become browsable.
Reindexing must be automated, e.g. scheduled or cron'd.
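
The "unzip and parse content.xml" idea from the first requirement can be sketched with the Python standard library alone. This is a minimal illustration, not a full OpenDocument parser: `odf_text` is a hypothetical helper name, and the tiny archive built at the end is a stand-in for a real .odt file so that the example runs without one.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def odf_text(source) -> str:
    """Return the plain text found in content.xml of an .odt/.ods file.

    source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    # itertext() walks all text nodes in document order; markup is dropped
    # and runs of whitespace are collapsed to single spaces.
    return " ".join(" ".join(root.itertext()).split())


# Build a tiny stand-in archive in memory (a real .odt holds more files).
sample = io.BytesIO()
with zipfile.ZipFile(sample, "w") as archive:
    archive.writestr(
        "content.xml",
        '<doc xmlns:text="urn:example:text">'
        "<text:p>Hello</text:p><text:p>search  world</text:p></doc>",
    )
sample.seek(0)

extracted = odf_text(sample)
print(extracted)
```

Joining the text pieces with spaces is good enough for feeding an indexer, but it can split a word that is formatted in two pieces inside the document.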

Thursday, September 13, 2007

Search + single point of entry + availability = successful knowledge management!

In my very first post on this blog I mentioned the importance of search systems/capabilities when you want to have a successful knowledge management system. Forget about categories or sorting and agreeing on one format for all knowledge; I predict it will not work for you if you go down that road! Instead think multiple systems and formats for storage, and focus on a single point of entry for availability and search! Does this sound like something you know? Google! It is not without reason that Google "won" when compared to the old indexing search sites!

I have two agendas for this search tool/search system/search engine investigation: I am looking for something useful for an enterprise, and on the other hand I want to check out the open source possibilities so I can have something to play with in various home/friend projects! The main differences are money and how many systems/data sources the search can crawl/index and interface to. No matter which agenda you have, you should be able to get inspired by this list of requirements:

Index ViewVC websites, which can be protected by shared login credentials.
Crawl text and pdf documents on websites.
Must scale well for many documents!
Must be gentle/tunable and handle errors gracefully.

And this nice-to-have feature list:

Administration of who (users, public, IP-based) has access to search information from different sources.
Index/search multimedia formats, pictures and video, similar to Blinkx and Google images.
Handle searches with foreign charsets, e.g. Danish æøå.
Crawl docs on FTP sites, eg. anonymous login.
Crawl new and old Microsoft Office documents, such as Word, Excel and Powerpoint.
Crawl Windows shares.
Crawl WebDav.
Interface to and crawl Microsoft Sharepoint sites.
Interface to and crawl Lotus Notes databases, at least through web enabled databases.

I started by looking at Creating Google Custom Search Engines (Google CSE) and Google Custom Search Business Edition (CSBE). These are not free services for the requirements I have, so I have decided not to spend more time with these. Snips from the Google CSBE website:
Custom Search Business Edition is great for public websites that have a lot of web-based content that needs to be easily searchable.

Google desktop search also does not fit what I am looking for, so moving on.

Google has some other products which look very interesting: Google Search Appliance (GSA) and Google OneBox. OneBox can supposedly interface to many systems (CRM, ERP, etc.) and you can get your own module developed. Take a look at the different GSA products, or use the feature matrix for the different versions of GSA. GSA or OneBox is definitely very interesting, especially for the large enterprise that might want to save resources and spend some money to get what is probably the best search tool in the world! But I don't have any of those Google tools available to me right now and probably never will, at least not for private or community usage!

So I kept searching ;-) and I quickly became fond of the incredible details and amount of information available at Search Tools for Web Sites and Intranets (http://www.searchtools.com/).

I found several open source search tools which seem to fit a fair number of my requirements and nice-to-have features above, so I would like to give the following a try: OpenWebSpider, ASPSeek, mnoGoSearch, DataParkSearch and Swish-E.

It was not crystal clear to me which could in fact index Microsoft Office files, but at least Swish-E and ht://Dig seemed capable.

I have to say that I tend to stay away from Java and PostgreSQL based systems for now, as I have little or no experience in running those!

Some of the open source search tools are available in the FreeBSD ports collection (of which I am a huge fan) so those will be the ones I test: DataParkSearch, Swish-E and mnoGoSearch!

Other urls I visited during this initial search tool investigation:
http://www.searchenginewatch.com

What is a search engine? http://www.techweb.com/encyclopedia/defineterm.jhtml;jsessionid=DB3VMBYCAINF4QSNDLRCKHSCJUNN2JVN?term=search+engine

Read about Wikia, see:
http://www.informationweek.com/blog/main/archives/2007/08/will_google_be_1.html

Wednesday, September 12, 2007

Starting a blog, handling knowledge management

Welcome to my blog, thanks for visiting!

I have started this blog to improve my knowledge management system! In short, this blog will contain all the information I feel like saving! For more details of the entire system, see below.

The need for an improvement to my knowledge management came up this month, when I got a new job! At my new job I can no longer commit to or check out from my personal Subversion or CVS repositories. And I don't have access to Firefox, so I am also missing my bookmark sync-and-sort plugin!

Things you won't find here are truly personal information or confidential notes, which will have to stay on my PC or in a special Subversion repository for that purpose.

So to summarise my knowledge management system as of today, it consists of the following:

Ideas/readme/snip commits used to be saved in Subversion, but probably this will go into this blog from now on!
Personal scripts will still go into personal Subversion repositories, as it is easier to deploy to servers. Snippets from those will go to the blog when appropriate.
Howtos/working notes probably will stay in the appropriate CVS/Subversion repositories for a while. This is not optimal for sharing with more than a few people, so snippets will be in this blog!
Pictures go to appropriate flickr accounts: personal or family, available to anyone, family or friends.
Videos unfortunately cannot be put into flickr. A place like flickr, with video power like youtube, would be nice! Any ideas?
E-mail will probably move more and more into gmail, as that will hopefully be available anywhere I ever need it.
The few websites I help to webmaster are saved in a Subversion repository.
Instant messaging logs are not central or searchable, this would be nice to see.
I don't contribute to any particular wiki anywhere, nor do I have one of my own.
I don't contribute to a particular forum, nor do I have one of my own.
I have not yet started using VoIP or mobile technology beyond low-tech personal use.
Daily top URLs to visit (bookmark management) will stay in Sync-and-sort for now, but should not grow into a mess like they did recently. Instead I will post on this blog, including my thoughts on a particular URL. I have a few ideas for better bookmark management so I don't have to use Sync-and-sort.
Book reviews and notes will move from Subversion to this blog.

Search capabilities within all systems are of great importance, and if I were to share knowledge, I would say that good search is the most important requirement of a knowledge system! Otherwise you risk that the system never gets used.

For my own setup above, a generic search across all systems is not available to me; I have to search each part of the knowledge system in whatever way I can. This is one of the reasons I prefer any format that is text based, because then at least I can grep for a word. I would really like to have a single-point-of-entry search engine which can crawl any of the above! Limiting access to see and perform searches within certain data would be paramount! I am not aware of a product that can do this. For the enterprise at work we will take a look at the Google Search Appliance, but for my personal usage I hope to find something similar in some open source project.
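
As a stopgap until such a search engine turns up, the "grep for a word" approach over text-based notes can be sketched in Python. This is only an illustration: `grep_tree` is a made-up helper name, the sample files are created in a temporary directory so the example runs anywhere, and files that are not valid UTF-8 text are simply skipped.

```python
import tempfile
from pathlib import Path


def grep_tree(root, word: str):
    """Yield (relative_path, line_number, line) for each line containing word."""
    needle = word.lower()
    base = Path(root)
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                yield path.relative_to(base).as_posix(), number, line.strip()


# Demo on throwaway files standing in for checked-out notes.
with tempfile.TemporaryDirectory() as notes:
    Path(notes, "howto.txt").write_text("Install Subversion\nRun the crawler\n", encoding="utf-8")
    Path(notes, "ideas").mkdir()
    Path(notes, "ideas", "search.txt").write_text("Try a crawler for FTP\n", encoding="utf-8")
    found = list(grep_tree(notes, "Crawler"))

for name, number, line in found:
    print(f"{name}:{number}: {line}")
```

It searches only one directory tree at a time and knows nothing about access rights, which is exactly the gap a single point of entry would have to fill.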

The IBM Quickr approach is appealing to me, at least from a corporate knowledge sharing point of view. It seems perfect for Notes environments, but unfortunately I have not had a chance to try it out yet! I wish there was an open source alternative with similar functionality I could play with. A Google search got me to Sun portal server, but it 1) might not be what I want and 2) has some pretty hard technical requirements for me to get started, so I will probably never find out about the first issue.

I don't know how other technical people cope with the difficulties of handling job and personal knowledge management systems. Undoubtedly it must raise problems with people losing their notes if they change job or position, and it goes without saying that you cannot mirror work knowledge management systems for your personal usage! As work and personal life keep merging, this issue will keep popping up.
