
Imagine you have digitized a large number of documents and hold all of this information in one corpus. Finding a specific piece of information in this container is like looking for a needle in a haystack. One department of the pharma company Novartis needed exactly that: a sophisticated, fast and reliable search.

 

Searching for specific information via a browser sounds quite easy. But what if you require state-of-the-art technology, user authorization, a full-text search for common metadata, highlighting of the search terms in the result pages, and a quick visualization of the table of contents for single documents – all combined, of course, with very good performance?

 

Accessing the index via web browser

When Canoo entered the stage, Novartis already had a corpus of digitized documents and a Solr index (both provided by third parties).

The first task was to develop an entry point for users in the form of a web user interface. This UI contains a form in which any user can enter keywords. Submitting the form triggers a search through the Solr index. Solr returns the matching books, each identified by its unique PDF file name. The file name is then used to fetch the PDF file and display it in the browser.
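
The search flow described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the field names `content` and `file_name` are assumptions, since the article does not describe the index schema. Only the standard Solr `/select` parameters (`q`, `fl`, `hl`, `hl.fl`, `rows`, `wt`) are used.

```typescript
// Hypothetical field names; the real index schema is not described in the article.
const CONTENT_FIELD = "content";
const FILE_NAME_FIELD = "file_name";

/** Builds the URL for a request to Solr's standard /select handler. */
function buildSolrSearchUrl(baseUrl: string, keywords: string, rows: number = 10): string {
  const params: Array<[string, string]> = [
    ["q", `${CONTENT_FIELD}:(${keywords})`], // full-text query on the content field
    ["fl", FILE_NAME_FIELD],                 // return only the PDF file name
    ["hl", "true"],                          // ask Solr for highlighted snippets
    ["hl.fl", CONTENT_FIELD],                // field to take the snippets from
    ["rows", String(rows)],
    ["wt", "json"],
  ];
  const query = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return `${baseUrl}/select?${query}`;
}

/** The part of a Solr JSON response that this sketch reads. */
interface SolrSelectResponse {
  response: { numFound: number; docs: Array<Record<string, unknown>> };
}

/** Extracts the PDF file names from a parsed Solr JSON response. */
function extractPdfFileNames(result: SolrSelectResponse): string[] {
  return result.response.docs
    .map((doc) => doc[FILE_NAME_FIELD])
    .filter((name): name is string => typeof name === "string");
}
```

A call such as `buildSolrSearchUrl("http://localhost:8983/solr/books", "aspirin dosage")` (the core name `books` is made up) yields a URL that the server can request on behalf of the user. Note that the sketch passes the keywords into the query unescaped; a real form handler would have to escape Solr's special query characters first.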

 

Standard vs. customized PDF viewer

Using the standard PDF viewer built into common web browsers was not an option, because one customer requirement was to deactivate certain features such as printing and saving for copyright reasons.

Thus, we had to look for an alternative way to display the search results. In the end, we chose the open-source library PDF.js for the following reasons:

  • the JavaScript library can be easily integrated into any web project
  • as the library is open source, it can be adjusted to individual needs

The easiest way to deactivate the undesired features was to customize the library’s CSS/JavaScript code. Over time, several additional options were added to the viewer, making it truly custom-tailored.

 

Optimising for better performance

When the application was first released, end users reported severe performance issues with very large books (around 10% of the whole corpus): to work correctly, the PDF viewer on the client side has to wait until the whole PDF has been downloaded to the user's computer.

The following three options were discussed:

  1. Minimizing the file size of each PDF document:
    The file sizes were reduced by lowering the resolution. The results were not satisfactory.
  2. Customizing the PDF.js library:
    The basic idea was to modify PDF.js so that it loads only those bytes that are needed for displaying and highlighting. This would have required several months of additional work to analyse and adapt the whole library.
  3. Page splitting:
    Splitting each PDF into single-page files is a comparatively simple and quick way to reduce the amount of data the viewer has to load.

After analysing the different options, we decided to go for page splitting. Each PDF was split into one file per page. The performance improvement was tremendous. But now only one page is displayed at a time – what about the other pages of a specific PDF file?
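
Once a book is split, the viewer needs a way to find the file for a given page. The article does not say how the page files were named, so the following is only a sketch under an assumed naming scheme (`<book>_page_<n>.pdf`); the function names are hypothetical as well.

```typescript
/**
 * Derives the file name of one split page from the book's original file name.
 * Assumed scheme: "handbook.pdf" and page 3 give "handbook_page_3.pdf".
 */
function pageFileName(bookFileName: string, pageNumber: number): string {
  // Page numbers are 1-based; reject fractions, zero, negatives and NaN.
  if (pageNumber < 1 || Math.floor(pageNumber) !== pageNumber) {
    throw new RangeError(`page number must be a positive integer, got ${pageNumber}`);
  }
  const base = bookFileName.replace(/\.pdf$/i, "");
  return `${base}_page_${pageNumber}.pdf`;
}

/** Lists the page files of a book whose page count is known. */
function pageFileNames(bookFileName: string, pageCount: number): string[] {
  const names: string[] = [];
  for (let page = 1; page <= pageCount; page++) {
    names.push(pageFileName(bookFileName, page));
  }
  return names;
}
```

With such a scheme the client only ever downloads the one small file for the page being shown, which is where the performance gain of page splitting comes from.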

 

Rewriting PDF.js viewer component

The page navigation within PDF.js depends on the number of pages in the loaded PDF file. With page splitting, only a single page is loaded and displayed, so the page navigation built into the PDF.js library no longer worked.

To overcome the page navigation problem, we were forced to completely rewrite the PDF.js viewer component.

This final, custom-tailored version of the viewer offers zooming and page navigation across the whole PDF document. For keyword search and highlighting within a single page, it relies on the web browser's built-in search function.
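
The core of such a navigation can be sketched in a few lines. Because every split file contains exactly one page, the total page count of the original book has to come from somewhere else; this sketch simply assumes it is known (for example, stored alongside the index entry). All names are hypothetical and do not reflect the actual viewer code.

```typescript
/** Navigation state for a book that is served as single-page files. */
interface PageView {
  pageNumber: number;   // 1-based page currently shown
  pageCount: number;    // total pages of the original, unsplit PDF
  hasPrevious: boolean; // whether a "previous" button should be enabled
  hasNext: boolean;     // whether a "next" button should be enabled
}

/** Clamps a requested page into the valid range and derives the button states. */
function goToPage(requestedPage: number, pageCount: number): PageView {
  if (pageCount < 1) {
    throw new RangeError("a document needs at least one page");
  }
  const pageNumber = Math.min(Math.max(Math.round(requestedPage), 1), pageCount);
  return {
    pageNumber,
    pageCount,
    hasPrevious: pageNumber > 1,
    hasNext: pageNumber < pageCount,
  };
}

/** Moves one page forward, staying on the last page at the end of the book. */
function nextPage(view: PageView): PageView {
  return goToPage(view.pageNumber + 1, view.pageCount);
}

/** Moves one page back, staying on the first page at the start of the book. */
function previousPage(view: PageView): PageView {
  return goToPage(view.pageNumber - 1, view.pageCount);
}
```

After each navigation step, the viewer would fetch and render only the file for `pageNumber`, so paging through a large book never requires downloading the whole document.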

 

Today and tomorrow

Today, the users at Novartis are already happy with this application. One of the next steps will be to define how new documents can be uploaded and how these documents will be indexed so that they are ready to be searched.

 

Technologies used

  • Solr (for indexing)
  • Node.js (server programming)
  • AWS Cloud (Amazon Web Services, Database & Server)
  • Gulp.js (build tool)
  • Bootstrap (UI)
  • PDFBox (PDF handling)

 
