When I was working at Agilytic, I took part in a proof of concept (POC) to search, scrape and process public PDF documents. Given my Elasticsearch experience, I also took the initiative to make all this data available for full-text search. It was the first time I used Elasticsearch for a purpose other than cyber security. I invite you to read about the outcome of that work.
Searching in large text fields
Searching in large texts is very far from typical cyber security needs. Logs are generally quite short compared to what a PDF document can contain. There are plenty of parameters we can tweak according to the size of the fields. Fortunately, you’ll discover that even the default configuration already gives decent performance in the average case.
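To give an idea of the kind of parameters involved, here is a minimal sketch of an index definition for one very large text field. It only builds the request body as a Python dictionary, so it runs without a cluster. The field name `content` and the helper `build_index_body` are hypothetical and chosen for illustration; `index_options` and `index.highlight.max_analyzed_offset` are standard Elasticsearch options, but the values shown are assumptions, not the configuration used in the POC.

```python
import json


def build_index_body(text_field: str = "content") -> dict:
    """Build a create-index request body for one large text field.

    `text_field` is a hypothetical field name used for illustration.
    """
    return {
        "settings": {
            # Maximum number of characters analyzed for a highlight request.
            # The value here is an assumption; tune it to your documents.
            "index.highlight.max_analyzed_offset": 1000000,
        },
        "mappings": {
            "properties": {
                text_field: {
                    "type": "text",
                    # Store character offsets in the postings list so that
                    # highlighting does not have to re-analyze the full text.
                    "index_options": "offsets",
                }
            }
        },
    }


index_body = build_index_body()
print(json.dumps(index_body, indent=2))
```

The resulting dictionary can be sent as the body of an index creation request with whichever client you use.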
You have discovered that Elasticsearch is pretty efficient when it comes to searching in very large texts. This part is about highlighting with Kibana. You’ll see that Kibana, at the time of writing, isn’t made to search in large texts and causes a lot of trouble. Kibana is actually a good fit for time series data and is mostly used for cyber security and monitoring purposes, where fields are relatively short.
Elasticsearch appears to be very efficient at searching different kinds of data. It can definitely handle them all! However, the other technologies that come with it (e.g. Kibana) show some very strong limitations on large texts compared to time series data such as cyber security logs. I would recommend that anyone build their own front-end application to search in large text fields; Kibana will definitely not fit the need here.
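As a rough illustration of what such a front end could send to Elasticsearch, here is a minimal sketch of a search request that returns only short highlighted fragments instead of the whole document text. The field name `content`, the helper `build_search_body` and the default values are hypothetical; the `highlight`, `number_of_fragments`, `fragment_size` and `_source` options are standard parts of the search API.

```python
import json


def build_search_body(
    query_text: str,
    text_field: str = "content",
    fragments: int = 3,
    fragment_size: int = 150,
) -> dict:
    """Build a search request body that highlights matches in a large field.

    `text_field` is a hypothetical field name used for illustration.
    """
    return {
        # Do not ship the full PDF text back with every hit.
        "_source": {"excludes": [text_field]},
        "query": {"match": {text_field: {"query": query_text}}},
        "highlight": {
            "fields": {
                text_field: {
                    "type": "unified",
                    # Return a few short snippets around the matches.
                    "number_of_fragments": fragments,
                    "fragment_size": fragment_size,
                }
            }
        },
    }


search_body = build_search_body("annual report")
print(json.dumps(search_body, indent=2))
```

Excluding the large field from `_source` keeps responses small, and the front end only has to render the fragments found under each hit's `highlight` key.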