arkiver edited this page May 3, 2022 · 3 revisions

Wget with Lua hooks

You can run Wget with your Lua script with the --lua-script option:

wget-lua --lua-script YOURSCRIPT.lua URL

If you want to add URLs to the download queue with the get_urls hook, you must also enable --recursive or --page-requisites.

wget-lua --lua-script YOURSCRIPT.lua --recursive URL
wget-lua --lua-script YOURSCRIPT.lua --page-requisites URL

Implementing callback functions

Your Lua script will get a wget.callbacks table. Implement your callback functions as fields of this object. Wget will call these functions during the download process. (You do not have to implement every function.)

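You can define these 8 functions:

  • wget.callbacks.init
  • wget.callbacks.lookup_host
  • wget.callbacks.write_to_warc
  • wget.callbacks.download_child_p
  • wget.callbacks.httploop_result
  • wget.callbacks.get_urls
  • wget.callbacks.finish
  • wget.callbacks.before_exit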

The names of these callback functions correspond with the C functions where they are called.

Useful tip: debugging with table.show

Your script might need debugging. The table_show.lua library is very helpful if you want to inspect the parameter values or your own internal tables. The example script lua-example/print_parameters.lua uses the table.show function to print the parameters.
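
A minimal sketch of this kind of debugging follows. It assumes table_show.lua sits in the current working directory and defines table.show(value, name); the dump helper and its plain tostring fallback are hypothetical names added for illustration, not part of wget-lua.

```lua
-- Stand-in so the sketch also runs outside wget-lua; inside wget-lua the
-- global `wget` table already exists and this line leaves it alone.
wget = wget or { callbacks = {} }

-- Assumption: table_show.lua is in the current working directory.
-- pcall keeps the script usable when the library is missing.
local loaded = pcall(dofile, "table_show.lua")

-- Hypothetical helper: pretty-print with table.show when it is available,
-- otherwise fall back to a one-line tostring().
local function dump(value, name)
  if loaded and table.show then
    return table.show(value, name)
  end
  return name .. " = " .. tostring(value)
end

wget.callbacks.get_urls = function(file, url, is_css, iri)
  -- Print every parameter to stderr, then add no URLs to the queue.
  local params = { file = file, url = url, is_css = is_css, iri = iri }
  io.stderr:write(dump(params, "get_urls"), "\n")
  return {}
end
```

The same dump call can be dropped into any other callback to inspect its parameters.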

Called during Wget initialization.

wget.callbacks.init = function()

You can initialize counters and other Lua variables in this function, but it is often easier to place the initialization code at the top of the Lua script.
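
Both styles can be sketched side by side. The variable names below are hypothetical; the only wget-lua name used is wget.callbacks.init.

```lua
-- Stand-in so the sketch also runs outside wget-lua; inside wget-lua the
-- global `wget` table already exists and this line leaves it alone.
wget = wget or { callbacks = {} }

-- Style 1: top-level initialization, executed once when the script is loaded.
local url_count = 0        -- hypothetical counter used by other callbacks
local init_called = false  -- hypothetical flag

-- Style 2: initialization inside the init hook.
wget.callbacks.init = function()
  init_called = true
  url_count = 0
end
```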

Called during DNS hostname lookups.

wget.callbacks.lookup_host = function(host)

Return value

Return a string containing the resolved IP address, a new hostname string, or nil to use the original Wget behavior.

Parameters

  • host is the hostname to be resolved
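
As a sketch of the three possible return values, the following hook consults a small override table. The hostnames and the address (taken from the 192.0.2.0/24 documentation range) are made up for illustration.

```lua
-- Stand-in so the sketch also runs outside wget-lua; inside wget-lua the
-- global `wget` table already exists and this line leaves it alone.
wget = wget or { callbacks = {} }

-- Hypothetical overrides: hostname -> IP address or replacement hostname.
local overrides = {
  ["example.com"] = "192.0.2.10",            -- resolve to a fixed IP address
  ["old.example.com"] = "www.example.com",   -- look up another hostname instead
}

wget.callbacks.lookup_host = function(host)
  -- Hosts without an entry yield nil, which keeps Wget's normal lookup.
  return overrides[host]
end
```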

Called before writing WARC response records for individual HTTP/S requests. Determines whether to skip writing the record.

wget.callbacks.write_to_warc = function(url, http_stat)

Return value

Return true to write the record, false to not write it.

Parameters

  • url is a url structure, as is also used in httploop_result.
  • http_stat is an http_stat structure, as is also used in httploop_result.
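
A minimal sketch, assuming a crawl policy (hypothetical, not prescribed by wget-lua) in which rate-limit and server-error responses are retried later and should not be recorded:

```lua
-- Stand-in so the sketch also runs outside wget-lua; inside wget-lua the
-- global `wget` table already exists and this line leaves it alone.
wget = wget or { callbacks = {} }

wget.callbacks.write_to_warc = function(url, http_stat)
  -- Treat a missing status code as 0 so the comparisons below stay valid.
  local status = http_stat["statcode"] or 0
  -- Hypothetical policy: do not record 429 and 5xx responses.
  if status == 429 or (status >= 500 and status <= 599) then
    return false
  end
  return true
end
```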

Called at the end of Wget's accept/reject process. Define this function to add custom accept/reject rules.

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict, reason)

Return value

Return true to download, false to skip the current URL.

Parameters

Most of the parameters to this function are tables with many fields. A selection:

urlpos is the URL from the Wget queue that is being considered for download:

  • urlpos["url"]["url"] is the actual URL.
  • urlpos["link_expect_html"] is 1 for HTML links (<a href="...">) and 0 for other links.
  • urlpos["link_expect_css"] is 1 for CSS links (<link rel="stylesheet">) and 0 for other links.
  • urlpos["link_inline_p"] is 1 for inline links, the page requisites (images, CSS etc.), 0 for other links.

parent is the parent URL that pointed to this URL:

  • parent["url"] is the actual URL.

depth is the depth of the current URL: the number of hops from the initial URL.

start_url_parsed is the URL where Wget started (the URL from the command line or URL-list input file):

  • start_url_parsed["url"] is the actual URL.

iri gives Wget's URI encoding settings for this URL.

verdict is Wget's decision for this URL:

  • verdict == true if Wget wants to download this URL.
  • verdict == false if one or more accept/reject rules rejected this URL.

reason is the reason for Wget's rejection:

  • reason == nil if Wget accepted this URL.
  • reason == "ALREADY_ON_BLACKLIST": Wget has already downloaded this URL.
  • reason == "NON_HTTP_SCHEME": this is not an HTTP URL.
  • reason == "NOT_A_RELATIVE_LINK": rejected by --relative.
  • reason == "DOMAIN_NOT_ACCEPTED": rejected by --domains or --exclude-domains.
  • reason == "IN_PARENT_DIRECTORY": rejected by --no-parent.
  • reason == "DIRECTORY_EXCLUDED": rejected by --include-directories or --exclude-directories.
  • reason == "REGEX_EXCLUDED": rejected by --accept-regex or --reject-regex.
  • reason == "PATTERN_EXCLUDED": rejected by --accept or --reject.
  • reason == "DIFFERENT_HOST": rejected by (the absence of) --span-hosts.
  • reason == "ROBOTS_TXT_FORBIDDEN": rejected by a robots.txt file.

Example parameters

download_child_p = {
  ["urlpos"] = {
    ["url"] = {
      ["url"] = "http://www.gnu.org/graphics/bullet.gif";
      ["scheme"] = "SCHEME_HTTP";
      ["host"] = "www.gnu.org";
      ["port"] = 80;
      ["path"] = "graphics/bullet.gif";
      ["dir"] = "graphics";
      ["file"] = "bullet.gif";
    };
    ["link_expect_html"] = 0;
    ["link_expect_css"] = 0;
    ["link_base_p"] = 0;
    ["link_complete_p"] = 0;
    ["link_css_p"] = 1;
    ["link_inline_p"] = 1;
    ["link_refresh_p"] = 0;
    ["link_relative_p"] = 0;
    ["ignore_when_downloading"] = 0;
  };
  ["parent"] = {
    ["url"] = "http://www.gnu.org/layout.css";
    ["scheme"] = "SCHEME_HTTP";
    ["host"] = "www.gnu.org";
    ["port"] = 80;
    ["path"] = "layout.css";
    ["dir"] = "";
    ["file"] = "layout.css";
  };
  ["depth"] = 1;
  ["start_url_parsed"] = {
    ["url"] = "http://www.gnu.org/software/wget/";
    ["scheme"] = "SCHEME_HTTP";
    ["host"] = "www.gnu.org";
    ["port"] = 80;
    ["path"] = "software/wget/";
    ["dir"] = "software/wget";
    ["file"] = "";
  };
  ["iri"] = {
    ["uri_encoding"] = "utf-8";
    ["utf8_encode"] = false;
  };
  ["verdict"] = true;
  ["reason"] = "ALREADY_ON_BLACKLIST";
};
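
A sketch of a custom rule set built on the parameters above. Both rules (the "/logout" filter and the page-requisite exception) are hypothetical examples; everything else falls back to Wget's own verdict.

```lua
-- Stand-in so the sketch also runs outside wget-lua; inside wget-lua the
-- global `wget` table already exists and this line leaves it alone.
wget = wget or { callbacks = {} }

wget.callbacks.download_child_p = function(urlpos, parent, depth, start_url_parsed, iri, verdict, reason)
  local url = urlpos["url"]["url"]

  -- Hypothetical rule 1: never follow logout links (plain substring match).
  if string.find(url, "/logout", 1, true) then
    return false
  end

  -- Hypothetical rule 2: fetch page requisites even when they live on
  -- another host and --span-hosts is not set.
  if reason == "DIFFERENT_HOST" and urlpos["link_inline_p"] == 1 then
    return true
  end

  -- Otherwise keep Wget's decision.
  return verdict
end
```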

This function is called immediately after Wget finishes an HTTP request, before it handles any errors.

wget.callbacks.httploop_result = function(url, err, http_stat)

Return value

Return one of the following wget.actions:

  • wget.actions.NOTHING: follow the normal Wget procedure for this result.
  • wget.actions.CONTINUE: retry this URL.
  • wget.actions.EXIT: finish this URL (ignore any error).
  • wget.actions.ABORT: Wget will abort() and exit immediately.

Parameters

The url and http_stat parameters are tables with many fields. A selection:

url is the URL for this request:

  • url["url"] is the actual URL.

err is Wget's status code for the response. It is one of these strings:

  • NOCONERROR, HOSTERR, CONSOCKERR, CONERROR, CONSSLERR, CONIMPOSSIBLE, NEWLOCATION, NOTENOUGHMEM, CONPORTERR, CONCLOSED, FTPOK, FTPLOGINC, FTPLOGREFUSED, FTPPORTERR, FTPSYSERR, FTPNSFOD, FTPRETROK, FTPUNKNOWNTYPE, FTPRERR, FTPREXC, FTPSRVERR, FTPRETRINT, FTPRESTFAIL, URLERROR, FOPENERR, FOPEN_EXCL_ERR, FWRITEERR, HOK, HLEXC, HEOF, HERR, RETROK, RECLEVELEXC, FTPACCDENIED, WRONGCODE, FTPINVPASV, FTPNOPASV, CONTNOTSUPPORTED, RETRUNNEEDED, RETRFINISHED, READERR, TRYLIMEXC, URLBADPATTERN, FILEBADFILE, RANGEERR, RETRBADPATTERN, RETNOTSUP, ROBOTSOK, NOROBOTS, PROXERR, AUTHFAILED, QUOTEXC, WRITEFAILED, SSLINITFAILED, VERIFCERTERR, UNLINKERR, NEWLOCATION_KEEP_POST, CLOSEFAILED, WARC_ERR, WARC_TMP_FOPENERR, WARC_TMP_FWRITEERR

http_stat contains many useful properties of the response, among others:

  • http_stat["statcode"]: the HTTP status code

Example parameters

httploop_result = {
  ["url"] = {
    ["path"] = "software/wget/";
    ["dir"] = "software/wget";
    ["host"] = "www.gnu.org";
    ["port"] = 80;
    ["file"] = "";
    ["scheme"] = "SCHEME_HTTP";
    ["url"] = "http://www.gnu.org/software/wget/";
  };
  ["err"] = "RETRFINISHED";
  ["http_stat"] = {
    ["restval"] = 0;
    ["dltime"] = 0;
    ["local_file"] = "tmp/www.gnu.org/software/wget/index.html";
    ["orig_file_size"] = 15194;
    ["existence_checked"] = true;
    ["res"] = 0;
    ["rd_size"] = 0;
    ["orig_file_name"] = "tmp/www.gnu.org/software/wget/index.html";
    ["statcode"] = 200;
    ["message"] = "OK";
    ["contlen"] = -1;
    ["len"] = 0;
    ["error"] = "OK";
    ["timestamp_checked"] = false;
  };
};
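
The following sketch retries server errors a limited number of times and otherwise leaves the result to Wget. MAX_TRIES and the tries table are hypothetical; the placeholder strings in the stand-in wget.actions table only exist so the sketch runs outside wget-lua and are not the real values.

```lua
-- Stand-in so the sketch also runs outside wget-lua; inside wget-lua the
-- global `wget` table (including wget.actions) already exists.
wget = wget or {
  callbacks = {},
  actions = { NOTHING = "NOTHING", CONTINUE = "CONTINUE", EXIT = "EXIT", ABORT = "ABORT" },
}

local MAX_TRIES = 3  -- hypothetical limit per URL
local tries = {}     -- URL -> number of 5xx responses seen so far

wget.callbacks.httploop_result = function(url, err, http_stat)
  local key = url["url"]
  local status = http_stat["statcode"] or 0

  if status >= 500 and status <= 599 then
    tries[key] = (tries[key] or 0) + 1
    if tries[key] < MAX_TRIES then
      return wget.actions.CONTINUE  -- retry this URL
    end
    tries[key] = nil
    return wget.actions.EXIT        -- give up on this URL
  end

  -- Any other result: reset the counter and follow the normal procedure.
  tries[key] = nil
  return wget.actions.NOTHING
end
```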

Called during the URL extraction for a downloaded file.

wget.callbacks.get_urls = function(file, url, is_css, iri)

Return value

Return a table of URLs that should be added to the download queue. The table is a list with one item per URL, with the following fields:

  • "url": the absolute URL to enqueue (mandatory).
  • "link_expect_html": 1 if the result should be parsed as an HTML file.
  • "link_expect_css": 1 if the result should be parsed as a CSS file.
  • "post_data": a parameter string of application/x-www-form-urlencoded data to be posted in a POST request.
  • "body_data": the request body. Unlike "post_data", this does not set the method.
  • "method": the HTTP method.
  • "headers": a table specifying custom headers to insert, mapping from header names to header values.

Example:

local urls = {}
-- a normal web page
table.insert(urls, { url="http://example.com/", link_expect_html=1 })
-- a CSS stylesheet
table.insert(urls, { url="http://example.com/style.css", link_expect_css=1 })
-- an image (do not extract links)
table.insert(urls, { url="http://example.com/image.png" })
-- sending a POST request
table.insert(urls, { url="http://example.com/login", post_data="username=test&password=test" })

Parameters

file is the local filename of the downloaded file. You can read the contents of this file to implement your own URL extractor.

url is the URL for this request.

is_css is true if this is parsed as a CSS file, false otherwise.

iri gives Wget's URI encoding settings for this URL.

Example parameters

get_urls = {
  ["file"] = "tmp/www.gnu.org/software/wget/index.html";
  ["url"] = "http://www.gnu.org/software/wget/";
  ["is_css"] = false;
  ["iri"] = {
    ["content_encoding"] = "utf-8";
    ["uri_encoding"] = "ANSI_X3.4-1968";
    ["utf8_encode"] = false;
  };
};
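
A sketch of a custom extractor that reads the downloaded file, as suggested for the file parameter. The data-src attribute it searches for is a hypothetical example of links that a standard HTML link extractor might not pick up; remember that added URLs are only queued when --recursive or --page-requisites is enabled.

```lua
-- Stand-in so the sketch also runs outside wget-lua; inside wget-lua the
-- global `wget` table already exists and this line leaves it alone.
wget = wget or { callbacks = {} }

wget.callbacks.get_urls = function(file, url, is_css, iri)
  local urls = {}

  -- Read the whole downloaded file; return an empty list if it is unreadable.
  local handle = io.open(file, "rb")
  if not handle then
    return urls
  end
  local body = handle:read("*a")
  handle:close()

  -- Hypothetical extractor: absolute http(s) URLs in data-src="..." attributes.
  local seen = {}
  for link in string.gmatch(body, 'data%-src="(https?://[^"]+)"') do
    if not seen[link] then
      seen[link] = true
      table.insert(urls, { url = link })  -- no link_expect_* flags: do not parse
    end
  end

  return urls
end
```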

This function is called when Wget has finished downloading, just after it prints the "FINISHED" summary.

wget.callbacks.finish = function(start_time, end_time, wall_time, numurls, total_downloaded_bytes, total_download_time)

Parameters

start_time indicates when downloading began (clock time in seconds).

end_time indicates when downloading finished (clock time in seconds).

wall_time is the total time in seconds (end_time - start_time).

numurls is the number of URLs downloaded.

total_downloaded_bytes is the number of bytes downloaded (as a floating-point number).

total_download_time is the download time in seconds.

Example parameters

finish = {
  ["start_time"] = 2.51e-07;
  ["end_time"] = 10.670458281;
  ["wall_time"] = 10.67045803;
  ["numurls"] = 2;
  ["total_downloaded_bytes"] = 7682;
  ["total_download_time"] = 0.000822633;
};
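
As an illustration, this sketch prints a one-line summary to stderr. The message format and the derived bytes-per-second figure are arbitrary choices, not something wget-lua prescribes.

```lua
-- Stand-in so the sketch also runs outside wget-lua; inside wget-lua the
-- global `wget` table already exists and this line leaves it alone.
wget = wget or { callbacks = {} }

wget.callbacks.finish = function(start_time, end_time, wall_time, numurls, total_downloaded_bytes, total_download_time)
  -- Avoid dividing by zero when nothing was downloaded.
  local rate = 0
  if total_download_time > 0 then
    rate = total_downloaded_bytes / total_download_time
  end
  io.stderr:write(string.format(
    "%.0f URLs, %.0f bytes, %.2f s wall time, %.0f bytes/s while downloading\n",
    numurls, total_downloaded_bytes, wall_time, rate))
end
```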

This function is called before Wget exits. Implement this function to change the exit status.

wget.callbacks.before_exit = function(exit_status, exit_status_string)

Return value

This function should return an integer exit code. Return exit_status or use a custom number. For convenience, wget.exits provides the following constants:

  • wget.exits.SUCCESS
  • wget.exits.IO_FAIL
  • wget.exits.NETWORK_FAIL
  • wget.exits.SSL_AUTH_FAIL
  • wget.exits.SERVER_AUTH_FAIL
  • wget.exits.PROTOCOL_ERROR
  • wget.exits.SERVER_ERROR
  • wget.exits.UNKNOWN

Parameters

exit_status is the exit status that Wget will return.

exit_status_string is a text version of the exit status. It is one of:

  • SUCCESS, IO_FAIL, NETWORK_FAIL, SSL_AUTH_FAIL, SERVER_AUTH_FAIL, PROTOCOL_ERROR, SERVER_ERROR, UNKNOWN

Example parameters

before_exit = {
  ["exit_status"] = 8;
  ["exit_status_string"] = "SERVER_ERROR";
};
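
A minimal sketch of changing the exit status, assuming a crawl in which server errors on individual URLs are expected and should not make the run count as failed (a hypothetical policy). The numbers in the stand-in wget.exits table are placeholders for running the sketch outside wget-lua; SERVER_ERROR = 8 matches the example parameters above.

```lua
-- Stand-in so the sketch also runs outside wget-lua; inside wget-lua the
-- global `wget` table (including wget.exits) already exists.
wget = wget or {
  callbacks = {},
  exits = { SUCCESS = 0, SERVER_ERROR = 8 },
}

wget.callbacks.before_exit = function(exit_status, exit_status_string)
  -- Hypothetical policy: report success even if some URLs returned errors.
  if exit_status_string == "SERVER_ERROR" then
    return wget.exits.SUCCESS
  end
  -- Every other status is passed through unchanged.
  return exit_status
end
```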