HTML Scraping in Haskell
Aloha! It’s that time of the year again where I try my luck at some Haskell programming, as that is one of the “skills” I want to improve. As a little project, I wanted to write a small HTML scraper for my university’s canteen, so that I can quickly extract its menu and show it together with the menus of other canteens close by.
I chose this project because it provides a nice playground: the requirements are small and simple, there is no external pressure to get anything done, and I expected the result to be not too complicated. The ideal environment for experimentation.
Why is HTML scraping hard?
When you want to do HTML scraping, you usually don’t want to resort to simple tools like regular expressions. In my mind, that is for three reasons:
First, HTML sites are usually designed to be displayed to human users, and not to be mechanically scraped. As such, finding and extracting the correct pieces can be hard, and you ideally want to use tools that are semantically meaningful and easy to use in HTML—such as CSS selectors, or XPath queries.
Second, HTML is a complex beast. The whole HTML specification runs to 1441 pages; the part that deals just with parsing spans 97 pages. In order to properly parse HTML, you need to know the details of the individual elements. For example, you need to know that certain elements (such as <br> or <img>) do not allow an end tag, or that elements like <li> implicitly close the previous instance. As an illustration, the following two snippets are equivalent:
<p>Paragraph 1
<p>Paragraph 2
<ul>
<li>Item 1
<li>Item 2
</ul>
<!-- is valid and equivalent to -->
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
Third, websites do not always use valid HTML, for a variety of reasons. This can range from “minor” things like a stray & (which would need to be escaped as &amp;) to more complex problems with the tag structure.
Trying out new tools
In Python, the standard tool for HTML scraping is BeautifulSoup. It deals with HTML parsing and provides helpers to navigate the DOM and extract information. In Haskell, I was new to this area, so I had to search for library recommendations. I came across a lot of very old posts and outdated libraries, but then I found Scalpel as one modern recommendation. It provides a high-level interface and builds on TagSoup, which explicitly supports HTML5 as well as unstructured and malformed HTML. Armed with those, I was ready to go.
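As a quick taste of the TagSoup layer underneath: parseTags only tokenizes the input into a flat list of tags, and building a hierarchy on top of that is a separate step. Even the implicitly closed snippet from above goes through without complaint:

module Main where

import Text.HTML.TagSoup (Tag, parseTags)

main :: IO ()
main =
  -- TagSoup happily tokenizes markup without end tags; note that no
  -- TagClose "p" appears anywhere in the output.
  mapM_ print (parseTags "<p>Paragraph 1<p>Paragraph 2" :: [Tag String])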
For reference, I am trying to scrape the website of our university’s canteen (Mensa am Adenauerring). I have attached a copy of the HTML as it was at the time of my experiment.
I wrote the first scraper:
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Text.HTML.Scalpel
import Text.Pretty.Simple

-- One canteen line with its name and the meals it offers.
data MensaLine = MensaLine String [String] deriving (Show)

main :: IO ()
main = do
  content <- readFile "mensa.html"
  let scraped = scrapeStringLike content scrapeMensa
  pPrint scraped

scrapeMensa :: Scraper String [MensaLine]
scrapeMensa = do
  -- Restrict everything to today's menu, ...
  chroot ("div" @: ["id" @= "canteen_day_1"]) $ do
    -- ... then collect one MensaLine per table row.
    chroots ("tr" @: [hasClass "mensatype_rows"]) $ do
      name <- text ("td" @: [hasClass "mensatype"])
      meals <- texts ("td" @: [hasClass "menu-title"])
      return $ MensaLine name meals
And lo and behold, we get:
Just
    [ MensaLine "Spätausgabe und Abendessen"
        [ "Spätausgabe 14:00 bis 14:30 an der Linie 2 Info zum Speisenangebot direkt an der Ausgabe" ]
    , MensaLine "Cafeteria11-14 Uhr"
        [ "Hähnchenschnitzel mit Brötchen[1,3,Ge,We]"
        , "Spinatstrudel[Ei,ML,We]"
        ]
    , MensaLine "[pizza]werkPasta" []
    ]
… wait, we only got 3 of the 13 lines? Something must be off.
Looking for mistakes
I was a bit perplexed at what I saw because, as in other Haskell stories of mine, I wasn’t sure whether the problem was me or the library I was using. Clearly, the website displayed fine, and I was able to scrape it using BeautifulSoup. So where was the error?
I played around with the library a bit more. I thought that maybe my use of chroot was wrong, so I tried other ways to retrieve some information:
scrapeMensa :: Scraper String [String]
scrapeMensa = do
  chroot ("div" @: ["id" @= "canteen_day_1"]) $ do
    texts ("td" @: [hasClass "mensatype"])
Just
    [ "Linie 1Gut & Günstig"
    , "Linie 2Vegane Linie"
    , "Linie 3"
    , "Linie 4"
    , "Linie 5"
    , "Schnitzel-/Burgerbar"
    , "Linie 6"
    , "Spätausgabe und Abendessen"
    , "[kœri]werk11-14 Uhr"
    , "Cafeteria11-14 Uhr"
    , "[pizza]werkPizza11-14 Uhr"
    , "[pizza]werkPasta"
    , "[pizza]werkSalate / Vorspeisen"
    ]
That works, so let’s play more and see if we can inspect the HTML of the elements:
scrapeMensa :: Scraper String [String]
scrapeMensa = do
  chroot ("div" @: ["id" @= "canteen_day_1"]) $ do
    chroots ("tr" @: [hasClass "mensatype_rows"]) $ do
      html anySelector
Just
    [ "<tr class="mensatype_rows">"
    , "<tr class="mensatype_rows">"
    , "<tr class="mensatype_rows">"
    , "<tr class="mensatype_rows">"
    , "<tr class="mensatype_rows">"
    , "<tr class="mensatype_rows">"
    , "<tr class="mensatype_rows">"
    , "<tr class="mensatype_rows"><td class="mensatype" style="white-space: normal !important;"><div>Spätausgabe und Abendessen</div></td><td class="mensadata"><table class="meal-detail-table"><tr class="mt-0"><td class="mtd-icon"><div><br></div></td><td class="first menu-title" id="menu-title-1123949381918231691" onclick="rateMeal('menu-title-1123949381918231691');"><span class="bg"><b>Spätausgabe 14:00 bis 14:30 an der Linie 2</b> <span>Info zum Speisenangebot direkt an der Ausgabe</span></span></td><td style="text-align: right;vertical-align:bottom;"><span class="bgp price_1">3,20 €</span><span class="bgp price_2">4,60 €</span><span class="bgp price_3">4,20 €</span><span class="bgp price_4">3,55 €</span><div style="clear: both;"></div> </td></tr> </table></td></tr>"
    , "<tr class="mensatype_rows">"
    , "<tr class="mensatype_rows"><td class="mensatype" style="white-space: normal !important;"><div>Cafeteria<br>11-14 Uhr</div></td><td class="mensadata"><table class="meal-detail-table"><tr class="mt-0"><td class="mtd-icon"><div><br></div></td><td class="first menu-title" id="menu-title-8124945221313286120" onclick="rateMeal('menu-title-8124945221313286120');"><span class="bg"><b>Hähnchenschnitzel mit Brötchen</b></span><sup>[1,3,Ge,We]</sup></td><td style="text-align: right;vertical-align:bottom;"><span class="bgp price_1">3,20 €</span><span class="bgp price_2">3,20 €</span><span class="bgp price_3">3,20 €</span><span class="bgp price_4">3,20 €</span><div style="clear: both;"></div> </td></tr> <tr class="mt-7"><td class="mtd-icon"><div><img src="/layout/icons/vegetarisches-gericht.svg" class="mealicon_2" title="vegetarisches Gericht"><br></div></td><td class="first menu-title" id="menu-title-5432811224311860652" onclick="rateMeal('menu-title-5432811224311860652');"><span class="bg"><b>Spinatstrudel</b></span><sup>[Ei,ML,We]</sup></td><td style="text-align: right;vertical-align:bottom;"><span class="bgp price_1">1,90 €</span><span class="bgp price_2">1,90 €</span><span class="bgp price_3">1,90 €</span><span class="bgp price_4">1,90 €</span><div style="clear: both;"></div> </td></tr> </table></td></tr>"
    , "<tr class="mensatype_rows">"
    , "<tr class="mensatype_rows"><td class="mensatype" style="white-space: normal !important;"><div>[pizza]werk<br>Pasta</div></td><td class="mensadata"><table class="meal-detail-table"><tr><td class="mtd-icon"><div><br></div></td><td colspan="1"><div style="display:block;background:white;">-</div></td></tr></table></td></tr>"
    , "<tr class="mensatype_rows">"
    ]
Most of the rows come back as just their opening tag, with all of their content missing. Something is clearly wrong here, and it cannot just be my misuse of the library.
Digging deeper into Scalpel’s source
This is the part that I was scared of, because I am not an experienced Haskell programmer, and digging into other people’s code is daunting. Yet, here we are: Scalpel’s source, at commit 134db02. Please be aware that I will describe what’s going on to the best of my understanding, which might not be 100% accurate.
Internally, Scalpel operates on a TagSpec, which is defined in Select.hs and combines three aspects:
- A TagVector, which is the underlying list of tags, as parsed by TagSoup.
- A TagForest, which is the hierarchical representation of the tags.
- A SelectContext, which contains auxiliary information for chroot, such as the index of the current subelement.
The main work happens in tagsToVector, which takes the output of the TagSoup parse (the list of Tags), and produces the TagVector. The TagVector is an “augmented” version of the tag list where each tag is annotated with its closing index—information that Scalpel then uses to build the hierarchy.
The algorithm that Scalpel uses is described in the comment above the function, but the gist is that Scalpel keeps a stack of open tags per tag name. Whenever a closing tag is found, it is associated with the last open tag of the same name. At the end, all remaining tags for which no closing tag was found are added without a closing index.
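To make that concrete, here is a minimal, self-contained toy model of that matching strategy (my own reconstruction from the description above, not Scalpel’s actual code):

module Main where

import qualified Data.Map.Strict as Map

-- A tag is either an opening or a closing tag; this stands in for
-- TagSoup's richer Tag type.
data Tag = Open String | Close String deriving (Show)

-- For every opening tag (identified by its index in the input), find
-- the index of its closing tag, if any: keep one stack of open
-- indices per tag name; a closing tag pairs up with the most recent
-- open tag of that name.
matchTags :: [Tag] -> Map.Map Int (Maybe Int)
matchTags tags = finish (go (zip [0 ..] tags) Map.empty Map.empty)
  where
    go [] stacks result = (stacks, result)
    go ((i, Open name) : rest) stacks result =
      go rest (Map.insertWith (++) name [i] stacks) result
    go ((i, Close name) : rest) stacks result =
      case Map.findWithDefault [] name stacks of
        (j : js) -> go rest (Map.insert name js stacks)
                            (Map.insert j (Just i) result)
        []       -> go rest stacks result -- stray closing tag, ignore
    -- Anything still on a stack never saw a closing tag.
    finish (stacks, result) =
      foldr (\i acc -> Map.insert i Nothing acc) result
            (concat (Map.elems stacks))

main :: IO ()
main = print (matchTags [Open "ul", Open "li", Open "li", Close "ul"])

Running this on the tag sequence of the <ul> example prints fromList [(0,Just 3),(1,Nothing),(2,Nothing)]: the <ul> pair is matched, but both <li> tags end up without a closing index, because no literal </li> ever appears in the input.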
We can see that this might cause issues if elements are implicitly closed, for example if a </table> closes a <tr>, or if a <li> closes the previous <li>. In those cases, Scalpel’s algorithm misses the “ending indices” and records the tags without their correct span.
Exactly this happens in our case: by looking closer at the document structure and at what tagsToVector produces from it, we see that the source of our broken scrape is bad ending indices for our table elements. The document is missing some </tr> end tags, which confuses Scalpel’s tree-building algorithm. Browsers do not get confused, as the parsing specification states that </table> simply closes all open inner elements (§ 13.2.6.4.9), which produces a correct tree.
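This can be boiled down to a tiny reproduction using the same Scalpel API as above. Given my understanding of the algorithm, neither <tr> ever receives a closing index here, so I would expect their row contents to go missing:

{-# LANGUAGE OverloadedStrings #-}

module Main where

import Text.HTML.Scalpel

-- A table whose rows are implicitly closed: no </tr> anywhere,
-- just like in the canteen markup.
doc :: String
doc = "<table><tr class=\"row\"><td>Row 1</td><tr class=\"row\"><td>Row 2</td></table>"

main :: IO ()
main = print (scrapeStringLike doc (texts ("tr" @: [hasClass "row"])))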
So we have a combination of two problems at hand: Scalpel doesn’t adhere to the WHATWG HTML parsing standard, and the website returns wonky (but not necessarily invalid) markup. And one of those is easier to fix than the other.
Getting to a good scrape
It turns out that TagSoup has its own implementation of a tree builder in Tree.hs, which seems to produce a working tree for our website, but isn’t without problems either. It also does not seem to follow the specification, but deviates in a way that makes the output less wrong in our case.
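To see the difference, here is a quick sketch that feeds the <ul> snippet from earlier through TagSoup’s tree builder. If my reading of Tree.hs is right, the unclosed <li> tags come back as flat leaf nodes inside the <ul> branch, so the nesting is wrong, but unlike with Scalpel their content is preserved:

module Main where

import Text.HTML.TagSoup.Tree (TagTree, parseTree)

main :: IO ()
main =
  -- Parse markup whose <li> elements are only implicitly closed and
  -- print the resulting tree nodes one per line.
  mapM_ print (parseTree "<ul><li>Item 1<li>Item 2</ul>" :: [TagTree String])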
Slightly frustrated with the situation, I turned back to Google and found zenacy-html, which specifically states that it implements the parsing machinery from the HTML standard.
I tested Zenacy, got a good DOM out of my HTML, and then went on to experiment with a small querying interface on top of it—inspired by Scalpel, and called MiniScalp:
module Main (main) where

import Control.Monad
import Data.Maybe
import Data.Text
import MiniScalp.Predicates
import MiniScalp.Query
import MiniScalp.Sources
import MiniScalp.Types

data MensaLine = MensaLine Text [Text] deriving (Show)

mensaScraper :: Scraper [MensaLine]
mensaScraper = chroot ("id" @= "canteen_day_1") $ do
  chroots (tag "tr" @& hasClass "mensatype_rows") $ do
    name <- chroot ("td" @: [hasClass "mensatype"]) text'
    meals <- chroots ("td" @: [hasClass "menu-title"]) text'
    return $ MensaLine name meals

main :: IO ()
main = do
  scraped <- fromJust <$> scrapeFile "mensa.html" mensaScraper
  forM_ scraped $ \(MensaLine name meals) -> do
    putStrLn $ unpack name
    forM_ meals $ \meal -> putStrLn ("  " ++ unpack meal)
    putStrLn ""
It works, my little monster is alive!
Conclusion
I don’t want to flame Scalpel or TagSoup, as they are made by programmers in their free time and offered openly and free for everybody. Those developers are way more experienced than I am, and the libraries work fine for many people. It just happens that my use case triggered a bug that caused wrong results.
Correct HTML parsing according to the HTML standard is tedious to get right. I wonder how far you would get with a slightly adjusted Scalpel algorithm, e.g. simply by implementing the implicit closing feature for the tags that require it. Maybe this is an experiment for future-me, though I cannot yet judge how much work this would actually require.
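To sketch what I mean, such an adjustment could be driven by a small table of implicit-closing rules distilled from the HTML standard (a hypothetical sketch, not code from Scalpel):

module ImplicitClosing where

-- Hypothetical sketch: which still-open elements an end tag would
-- implicitly close before being matched itself (incomplete).
endTagCloses :: String -> [String]
endTagCloses "table" = ["thead", "tbody", "tfoot", "tr", "td", "th"]
endTagCloses "ul"    = ["li"]
endTagCloses "ol"    = ["li"]
endTagCloses "tr"    = ["td", "th"]
endTagCloses _       = []

-- Likewise for start tags: a new <li> first closes a previous <li>,
-- a new <tr> closes a dangling <td>, and so on.
startTagCloses :: String -> [String]
startTagCloses "li" = ["li"]
startTagCloses "p"  = ["p"]
startTagCloses "tr" = ["tr", "td", "th"]
startTagCloses "td" = ["td", "th"]
startTagCloses "th" = ["td", "th"]
startTagCloses _    = []

Scalpel’s matching loop could then consult such a table whenever it encounters a tag, popping the listed names off their stacks before doing the normal matching.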
Like in my last Haskell story, it’s hard for me to quickly assess whether I am running into problems that arise from my lack of understanding, or into actual bugs in the libraries I use. I guess this is something that comes with more experience in the ecosystem and the language.
Finally, I have to say that it was a pretty joyful experience, altogether. While stumbling across bugs is never fun at the start, trying to figure out what was happening and finding a working solution in the end was very rewarding. To give this a positive spin, I would say that you learn more this way than by copy-pasting a few code blocks together that simply work and do what you want.
Addendum, one week later
I’ve managed to get a quick, hacky version of Scalpel that closes inner tags when encountering a </table>, so I believe it wouldn’t be too hard to implement properly. However, I also found issue 77, which describes the problem of implicitly closed tags appearing wrong, but where the author of the library also indicates that such “magic”, special-cased behaviour is not desired in Scalpel, at least not in the default parsing function.
The bottom line here probably is that I could have saved myself some time and trouble if I had made the connection between #77 and my faulty results earlier, but hindsight is 20/20.