Extracting data from a Web-Page
- Hi All
I would like to programmatically extract data from a web-page. (The
intention is to, say, populate a DB.)
Can anyone advise me on what technologies can be used to do this
reliably, and how exactly?
- S Stephen wrote:
> I would like to programmatically extract data from a web-page. (The
> intention is to, say, populate a DB.)
The big issue is the format of the data on the page, and how easy it is
to get the pages you want. For instance, if the URL contains all the
parameters of the CGI script that you need, and you can decipher the
pattern, you can usually write (or use a program to write) a shell
script to download all the pages using wget. man wget will give you
After that, I usually write a Perl script to parse the downloaded HTML
files and output tab-delimited text. Perl is great for this kind of
task because of its rich support for regular expressions.
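
To give a rough idea of that parsing step, here is a minimal sketch in
Python rather than Perl, using the same regular-expression approach. It
assumes each record is a <tr> row whose fields are <td> cells; that
layout is only an assumption, and html_to_tsv and clean_cell are
hypothetical names. Regexes like these tend to break on nested tables
or unusual markup, so treat it as a starting point.

```python
import html
import re

# Assumed layout: each record is one <tr> whose fields are <td> cells.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<td\b[^>]*>(.*?)</td>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def clean_cell(fragment):
    """Strip nested tags, decode entities, and collapse whitespace."""
    text = html.unescape(TAG_RE.sub("", fragment))
    return " ".join(text.split())


def html_to_tsv(page):
    """Return one tab-delimited line per table row that has <td> cells."""
    lines = []
    for row in ROW_RE.findall(page):
        cells = [clean_cell(cell) for cell in CELL_RE.findall(row)]
        if cells:  # rows made only of <th> header cells are skipped
            lines.append("\t".join(cells))
    return lines
```

Calling html_to_tsv() on the text of each downloaded file and writing
the returned lines to one output file gives the tab-delimited text.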
After you have all the data in tab-delimited text, it's just a matter of
importing it into your DB of choice.
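
As one possible version of the import step, here is a small Python
sketch that loads tab-delimited lines into SQLite. SQLite, the table
name "items", and the two columns (name, price) are all assumptions
made for illustration; most databases also ship their own bulk loader
for delimited text, which may be the better choice.

```python
import csv
import sqlite3


def import_tsv(conn, lines):
    """Insert two-column tab-delimited lines into a SQLite table.

    `lines` is any iterable of strings, for example an open file.
    The table and column names are hypothetical placeholders.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price REAL)")
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Keep only rows with exactly two fields; malformed lines are dropped.
    rows = [row for row in reader if len(row) == 2]
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

With a file on disk, this would be called with an open file handle and
a connection from sqlite3.connect(); the return value is the number of
rows inserted.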