Imagine you have digitized a large number of documents and hold all this information in one corpus. Searching for specific information inside this container is like searching for a needle in a haystack. Yet one department of the pharma company Novartis had a strong need for a sophisticated, fast and reliable search.
Searching for specific information via a browser sounds quite easy. But what if you require state-of-the-art technology, user authorization, a full-text search for common metadata, highlighting of the search terms in the result pages, a quick visualization of the table of contents for single documents, and all of this combined with very good performance?
When Canoo entered the stage, Novartis already had a corpus of digitized documents and a Solr index (provided by third-party institutions).
The first task was to develop an entry point for users by means of a web user interface. This UI contains a form in which users enter their keywords. The form triggers a search through the Solr index. Solr returns the matching books, each identified by its unique PDF file name. The file name is then used to fetch the PDF file and to display it in the browser.
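The search flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the core URL and the field names `file_name` and `content` are assumptions, while `q`, `fl`, `hl`, `hl.fl`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url: str, keywords: str, rows: int = 10) -> str:
    """Build a Solr select URL for a keyword search with highlighting."""
    params = {
        "q": keywords,            # the keywords entered in the search form
        "fl": "id,file_name",     # hypothetical field holding the PDF file name
        "hl": "true",             # ask Solr to return highlighted snippets
        "hl.fl": "content",       # hypothetical full-text field to highlight
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def extract_file_names(solr_response: dict) -> list[str]:
    """Collect the PDF file names from a parsed Solr JSON response."""
    docs = solr_response.get("response", {}).get("docs", [])
    return [doc["file_name"] for doc in docs if "file_name" in doc]


if __name__ == "__main__":
    # Hypothetical core URL; the real deployment details are not known.
    print(build_solr_query_url("http://localhost:8983/solr/books", "aspirin dosage"))
```

Each file name returned by `extract_file_names` would then be used to request the corresponding PDF for display.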
Using the standard PDF viewer integrated into common web browsers was not an option, as one customer requirement was to deactivate certain features such as printing or saving for copyright reasons.
Thus, we had to look for an alternative to display the search results. Finally, the open source library PDF.js was chosen for the following reasons:
When the application was first released, end users reported severe performance issues with huge books (around 10% of the whole corpus): to work correctly, the client-side PDF viewer has to wait until the whole PDF has been downloaded to the target computer.
The following three options were discussed:
After analysing the different options, we decided to go for page splitting: each PDF was split into single-page files. The performance improvement was tremendous. But now only one page is displayed. What about the other pages of a specific PDF file?
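To make the idea concrete, here is a small sketch of how per-page files could be named so that the viewer can request any page of a book directly. The naming scheme (`<book>_p<page>.pdf`) and both helper functions are hypothetical; the article does not say how the split files were actually stored.

```python
from pathlib import PurePosixPath


def page_file_name(pdf_name: str, page: int, width: int = 4) -> str:
    """Return the (assumed) file name of a single page of a split PDF."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    path = PurePosixPath(pdf_name)
    # Zero-padding keeps the page files in order when sorted by name.
    return f"{path.stem}_p{page:0{width}d}{path.suffix}"


def split_plan(pdf_name: str, page_count: int) -> list[str]:
    """List the file names a splitter would have to produce for one book."""
    return [page_file_name(pdf_name, n) for n in range(1, page_count + 1)]


if __name__ == "__main__":
    print(page_file_name("book123.pdf", 7))  # book123_p0007.pdf
    print(split_plan("book123.pdf", 3))
```

With a predictable scheme like this, the client only needs the book's file name and a page number to fetch exactly one small file instead of the whole document.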
The page navigation in PDF.js depends on the number of pages in the PDF file. With page splitting, only a single page is loaded and displayed, so the page navigation integrated in the PDF.js library broke.
To overcome the page navigation problem, we were forced to completely rewrite the PDF.js viewer component.
This custom-tailored, final version of the viewer comes with zooming and page navigation across the whole PDF document. For keyword search and highlighting within a single page, it uses the web browser's built-in search function.
Today, the users at Novartis are happy with this application. One of the next steps will be to define how new documents can be uploaded and how these documents will be indexed so that they are ready to be searched.
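The core of such a rewrite is that the viewer can no longer ask the loaded PDF for its page count, because every loaded file has exactly one page. A rough sketch of the navigation logic, written in Python for brevity rather than in the viewer's own JavaScript, might look like this. The class, its fields and the file layout are illustrative assumptions; in particular, the total page count is assumed to be supplied from outside the single-page file, for example by the server.

```python
from dataclasses import dataclass


@dataclass
class PageNavigator:
    """Tracks the current page of a book that is stored as single-page PDFs."""

    book: str          # hypothetical book identifier
    page_count: int    # total pages, supplied externally (assumption)
    current: int = 1

    def go_to(self, page: int) -> int:
        # Clamp the requested page to the valid range of the whole book.
        self.current = max(1, min(page, self.page_count))
        return self.current

    def next(self) -> int:
        return self.go_to(self.current + 1)

    def previous(self) -> int:
        return self.go_to(self.current - 1)

    def current_file(self) -> str:
        # Hypothetical layout: one single-page PDF per page number.
        return f"{self.book}/{self.current}.pdf"


if __name__ == "__main__":
    nav = PageNavigator(book="book123", page_count=3)
    nav.next()
    print(nav.current_file())  # book123/2.pdf
```

Every navigation action then translates into loading one more single-page file, which keeps the download per step small regardless of the size of the book.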