
# TagSoup [![Hackage version](https://img.shields.io/hackage/v/tagsoup.svg?label=Hackage)](https://hackage.haskell.org/package/tagsoup) [![Stackage version](https://www.stackage.org/package/tagsoup/badge/nightly?label=Stackage)](https://www.stackage.org/package/tagsoup) [![Linux build status](https://img.shields.io/travis/ndmitchell/tagsoup/master.svg?label=Linux%20build)](https://travis-ci.org/ndmitchell/tagsoup) [![Windows build status](https://img.shields.io/appveyor/ci/ndmitchell/tagsoup/master.svg?label=Windows%20build)](https://ci.appveyor.com/project/ndmitchell/tagsoup)

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information. This document gives two particular examples of scraping information from the web, while a few more may be found in the [Sample](https://github.com/ndmitchell/tagsoup/blob/master/test/TagSoup/Sample.hs) file from the source repository. The examples we give are:

* Obtaining the last modified date of the Haskell wiki
* Obtaining a list of Simon Peyton Jones' latest papers
* A brief overview of some other examples

The initial version of this library has been used for various commercial projects involving screen scraping. The examples include general hints on screen scraping, learnt from bitter experience. It should be noted that if you depend on data which someone else may change at any given time, you may be in for a shock!

This library was written without knowledge of the Java version of [TagSoup](https://github.com/jukka/tagsoup). They have made a very different design decision: to ensure default attributes are present and to properly nest parsed tags. We do not do this - tags are merely a list devoid of nesting information.

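To make the flat-list point concrete, here is a small illustration using a hand-rolled stand-in for the library's `Tag` type (the real `Text.HTML.TagSoup` type has further constructors and is parameterised over the string type; this simplified version is for illustration only):

```haskell
module Main where

-- A simplified stand-in for TagSoup's Tag type, for illustration only;
-- the real type also has constructors for comments, warnings and positions.
data Tag = TagOpen String [(String, String)]
         | TagText String
         | TagClose String
         deriving (Eq, Show)

-- What a parse of "<b>Bold</b> text" would look like: a flat list of
-- open/text/close events, with no tree structure imposed.
soup :: [Tag]
soup = [TagOpen "b" [], TagText "Bold", TagClose "b", TagText " text"]

main :: IO ()
main = mapM_ print soup
```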

#### Acknowledgements

Thanks to Mike Dodds for persuading me to write this up as a library. Thanks to many people for debugging and code contributions, including: Gleb Alexeev, Ketil Malde, Conrad Parker, Henning Thielemann, Dino Morelli, Emily Mitchell, Gwern Branwen.


## Potential Bugs

There are two things that may go wrong with these examples:

* _The websites being scraped may change._ There is nothing I can do about this, but if you suspect this is the case let me know, and I'll update the examples and tutorials. I have already done so several times; it's only a few minutes' work.
* _The `openURL` method may not work._ This happens quite regularly: depending on your server, proxies, and the direction of the wind, it may not work. The solution is to use `wget` to download the page locally, then use `readFile` instead. Hopefully a decent Haskell HTTP library will emerge, and that can be used instead.


## Last modified date of Haskell wiki

Our goal is to develop a program that displays the date on which the wiki at
[`wiki.haskell.org`](http://wiki.haskell.org/Haskell) was last modified. This
example covers the basics of designing a web-scraping application.

### Finding the Page

We first need to find where the information is displayed, and in what format.
Taking a look at the [front web page](http://wiki.haskell.org/Haskell), when
not logged in, we see:

```html
<ul id="f-list">
  <li id="lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
  <li id="copyright">Recent content is available under <a href="/HaskellWiki:Copyrights" title="HaskellWiki:Copyrights">a simple permissive license</a>.</li>
  <li id="privacy"><a href="/HaskellWiki:Privacy_policy" title="HaskellWiki:Privacy policy">Privacy policy</a></li>
  <li id="about"><a href="/HaskellWiki:About" title="HaskellWiki:About">About HaskellWiki</a></li>
  <li id="disclaimer"><a href="/HaskellWiki:General_disclaimer" title="HaskellWiki:General disclaimer">Disclaimers</a></li>
</ul>
```

So we see that the last modified date is available. This leads us to rule 1:

**Rule 1:** Scrape from what the page returns, not what a browser renders, or what view-source gives.

Some web servers will serve different content depending on the user agent, some browsers will have scripting modify their displayed HTML, and some pages will display differently depending on your cookies. Before you can start to figure out how to scrape, first decide what the input to your program will be. There are two ways to get the page as it will appear to your program.

#### Using the HTTP package

We can write a simple HTTP downloader using the [HTTP package](http://hackage.haskell.org/package/HTTP):

```haskell
module Main where

import Network.HTTP

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

main :: IO ()
main = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    writeFile "temp.htm" src
```

Now open `temp.htm`, find the fragment of HTML containing the last modified date, and examine it.

### Finding the Information

Now we examine both the fragment that contains our snippet of information, and
the wider page. What does the fragment have that nothing else has? What
algorithm would we use to obtain that particular element? How can we still
return the element as the content changes? What if the design changes? But
wait, before going any further:

**Rule 2:** Do not be robust to design changes; do not even consider the possibility when writing the code.

If the user changes their website, they will do so in unpredictable ways. They may move the page, they may put the information somewhere else, they may remove the information entirely. If you want something robust, talk to the site owner, or buy the data from someone. If you try to anticipate design changes, you will complicate your design, and it still won't work. It is better to write an extraction method quickly, and happily rewrite it when things change.

So now let's consider the fragment from above. It is useful to find a tag
which is unique just above your snippet - something with a nice `id` or `class`
attribute - something which is unlikely to occur multiple times. In the above
example, an `id` with value `lastmod` seems perfect.

```haskell
module Main where

import Data.Char
import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

haskellLastModifiedDateTime :: IO ()
haskellLastModifiedDateTime = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    let lastModifiedDateTime = fromFooter $ parseTags src
    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
    where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=lastmod>")

main :: IO ()
main = haskellLastModifiedDateTime
```

Now we start writing the code! The first thing to do is open the required URL, then we parse the code into a list of `Tag`s with `parseTags`. The `fromFooter` function does the interesting work, and can be read right to left:

* First we throw away everything (`dropWhile`) until we get to an `li` tag
  containing `id=lastmod`. The `(~==)` and `(~/=)` operators are different from
  standard equality and inequality since they allow additional attributes to be
  present. We write `"<li id=lastmod>"` as syntactic sugar for `TagOpen "li"
  [("id","lastmod")]`. If we just wanted any open tag with the given `id`
  attribute we could have written `(~== TagOpen "" [("id","lastmod")])` and this
  would have matched. Any empty strings in the second element of the match are
  considered as wildcards.
* Next we take two elements: the `<li>` tag and the text node immediately
  following.
* We call the `innerText` function to get all the text values from inside,
  which will just be the text node following the `lastmod`.
* We split the string into a series of words and drop the first six, i.e. the
  words `This`, `page`, `was`, `last`, `modified` and `on`.
* We reassemble the remaining words into the resulting string `9 September
  2013, at 22:38.`

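As a quick sanity check, the word-processing tail of this pipeline can be run on the literal footer text by itself, without any networking or parsing (the `footer` string below is copied from the HTML fragment above):

```haskell
module Main where

-- The footer text as it appears in the HTML fragment above.
footer :: String
footer = " This page was last modified on 9 September 2013, at 22:38."

-- The last steps of fromFooter: split into words, drop the first six,
-- and reassemble the remainder into a single string.
lastModified :: String
lastModified = unwords $ drop 6 $ words footer

main :: IO ()
main = putStrLn lastModified  -- prints: 9 September 2013, at 22:38.
```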
This code may seem slightly messy, and indeed it is - often that is the nature of extracting information from a tag soup.

**Rule 3:** TagSoup is for extracting information where structure has been lost; use more structured information if it is available.


## Simon's Papers

Our next very important task is to extract a list of all Simon Peyton Jones' recent research papers from his [home page](http://research.microsoft.com/en-us/people/simonpj/). The largest change from the previous example is that now we desire a list of papers, rather than just a single result.

As before, we first start by writing a simple program that downloads the appropriate page, and we look for common patterns. This time we want to look for all patterns which occur every time a paper is mentioned, but nowhere else. The other difference from last time is that previously we grabbed an automatically generated piece of information - this time the information is entered in a more freeform way by a human.

First we spot that the page helpfully has named anchors: there is a current work anchor, and after that is one for Haskell. We can extract all the information between them with a simple `take`/`drop` pair:

```haskell
takeWhile (~/= "<a name=haskell>") $
drop 5 $ dropWhile (~/= "<a name=current>") tags
```

This code drops until you get to the "current" section, then takes until you get to the "haskell" section, ensuring we only look at the important bit of the page. Next we want to find all hyperlinks within this section:

```haskell
map f $ sections (~== "<A>") $ ...
```

Remember that the function to select all tags with name "A" could have been written as `(~== TagOpen "A" [])`, or alternatively `isTagOpenName "A"`. Afterwards we map each item with an `f` function. This function needs to take the tags starting just after the link, and find the text inside the link.

```haskell
f = dequote . unwords . words . fromTagText . head . filter isTagText
```

Here the complexity of interfacing to human-written markup comes through. Some of the links are in italic, some are not - the `filter` drops all tags that are not text, so `head` finds the first pure text node. The `unwords . words` composition deletes multiple spaces, replaces tabs and newlines with spaces, and trims the front and back - a neat trick when dealing with text which has spacing in the source code but not when displayed. The final thing to take account of is that some papers are given with quotes around the name and some are not - `dequote` removes the quotes if they exist.

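The whitespace-normalisation and `dequote` steps are pure, so they can be tried out in isolation (the sample strings and the `tidy` name below are made up for illustration):

```haskell
module Main where

-- Collapse runs of whitespace (spaces, tabs, newlines) into single
-- spaces, and trim the front and back.
tidy :: String -> String
tidy = unwords . words

-- Strip one pair of surrounding double quotes, if present.
dequote :: String -> String
dequote ('"':xs) | not (null xs), last xs == '"' = init xs
dequote x = x

main :: IO ()
main = do
    putStrLn $ tidy "  A   Paper\n\tTitle  "  -- prints: A Paper Title
    putStrLn $ dequote "\"Quoted Title\""     -- prints: Quoted Title
    putStrLn $ dequote "Unquoted Title"       -- prints: Unquoted Title
```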
For completeness, we now present the entire example:

```haskell
module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

spjPapers :: IO ()
spjPapers = do
        tags <- parseTags <$> openURL "http://research.microsoft.com/en-us/people/simonpj/"
        let links = map f $ sections (~== "<A>") $
                    takeWhile (~/= "<a name=haskell>") $
                    drop 5 $ dropWhile (~/= "<a name=current>") tags
        putStr $ unlines links
    where
        f :: [Tag String] -> String
        f = dequote . unwords . words . fromTagText . head . filter isTagText

        dequote ('\"':xs) | not (null xs), last xs == '\"' = init xs
        dequote x = x

main :: IO ()
main = spjPapers
```

## Other Examples

Several more examples are given in the [Sample.hs](https://github.com/ndmitchell/tagsoup/blob/master/test/TagSoup/Sample.hs) file, including obtaining the (short) list of papers from my site, getting the current time, and a basic XML validator. All use very much the same style as presented here - writing screen scrapers follows a standard pattern. We present the code from two of them for enjoyment only.

### My Papers

```haskell
module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

ndmPapers :: IO ()
ndmPapers = do
        tags <- parseTags <$> openURL "http://community.haskell.org/~ndm/downloads/"
        let papers = map f $ sections (~== "<li class=paper>") tags
        putStr $ unlines papers
    where
        f :: [Tag String] -> String
        f xs = fromTagText (xs !! 2)

main :: IO ()
main = ndmPapers
```

### UK Time

```haskell
module Main where

import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

currentTime :: IO ()
currentTime = do
    tags <- parseTags <$> openURL "http://www.timeanddate.com/worldclock/uk/london"
    let time = fromTagText (dropWhile (~/= "<span id=ct>") tags !! 1)
    putStrLn time

main :: IO ()
main = currentTime
```

## Further Examples

In [Sample.hs](https://github.com/ndmitchell/tagsoup/blob/master/test/TagSoup/Sample.hs)
the following additional examples are listed:

- Google Tech News
- Package list from Hackage
- Print names of story contributors on sequence.complete.org
- Parse rows of a table

## Related Projects

* [TagSoup for Java](https://github.com/jukka/tagsoup) - an independently written malformed HTML parser for Java.
* [HXT: Haskell XML Toolbox](http://www.fh-wedel.de/~si/HXmlToolbox/) - a more comprehensive XML parser, giving the option of using TagSoup as a lexer.
* [Other Related Work](http://www.fh-wedel.de/~si/HXmlToolbox/#rel) - as described on the HXT pages.
* [Using TagSoup with Parsec](http://therning.org/magnus/posts/2008-08-08-367-tagsoup-meet-parsec.html) - a nice combination of Haskell libraries.
* [tagsoup-parsec](https://hackage.haskell.org/package/tagsoup-parsec) - a library for easily using TagSoup as a token type in Parsec.
* [tagsoup-megaparsec](https://hackage.haskell.org/package/tagsoup-megaparsec) - a library for easily using TagSoup as a token type in Megaparsec.
* [WraXML](https://hackage.haskell.org/packages/archive/wraxml/latest/doc/html/Text-XML-WraXML-Tree-TagSoup.html) - construct a lazy tree from TagSoup lexemes.
263