# NAME

xdccbLister - converts offline XDCC listings into XDCC Browser (XCB) format.

# SYNOPSIS

  xdccbLister --entity-hint Group^| --filter-include FavouriteSeries
http://group.web.site/

# Note:
# In this case the name of the group (Group|WeRule) contains a "|"
# and symbols like "|" may need to be escaped in your OS's shell. On Windows
# command line it is done using the prefix "^".

with configuration file xdccbLister.conf:

network=Rizon
server=irc.rizon.net
channel=#group

# or

xdccbLister --type text --split-output --spaces
another-listing.txt yet-another-listing.txt

with configuration file xdccbLister.conf:

network=EFnet
server=irc.efnet.org
port=6665
channel=#mp3
spaces=1
dns-lookup=0

# or

xdccbLister -N EFnet -S irc.efnet.fi -P 6668 -C #mp3 --filter-exclude
"Avril Lavigne" "c:\listings\her horrible poppy mp3.html"

# Notes:
# A) No xdccbLister.conf file required since all required options specified,
# B) Also, note I used the quote marks around both the filter's mask and the
# path to the html file because they both contained spaces.

# or

#
# Layout on disk (xdccbLister in PATH or c:\listings directory):
#
# c:\listings - \config - \Rizon.conf
#                       - \Rizon-myFavouriteChannel.conf
#                       - \EFnet.conf
#                       - \hqIRC.conf
#                       - \Evilnet.conf
#             - \processed - .\Rizon-Bot1.xcb
#                          - .\Rizon-Bot2.xcb
#                          - .\EFnet-Friend1.xcb
#                          - .\EFnet-Friend2.xcb
#                          - .\EFnet-Friend3.xcb
#                          - .\hqIRC-Stranger1.xcb
#                          - .\Evilnet-IRC-AnotherStranger1.xcb
#             - \saved - .\savedFromXdccBrowser.xcb
#                      - Network1 - .\Channel1-Bot1.xcb
#                                 - .\Channel1-Bot2.xcb
#                                 - .\Channel1-Bot3.xcb
#                      - Network2 - .\Channel2-Bot1.xcb
#                                 - .\Channel3-Bot1.xcb
#                                 - .\channel4-Bot1.xcb
#
# then the listings for #myFavouriteChannel@Rizon change on their web site...

2) cd c:\listings
3) xdccbLister -c config\Rizon-myFavouriteChannel.conf -o simpsons-latest.xcb
--filter-include "Simpsons" --filter-episode 112-
4) mirc->xdcc browser->import c:\listings\processed\simpsons-latest.xcb

with configuration file config\Rizon-myFavouriteChannel.conf:

network=Rizon
server=irc.rizon.net
port=6666
channel=#myFavouriteChannel
spaces=1
input=http://myFavouriteChannelWebSite/xdcc/listings.html
output-dir=processed

# Notes:
# A) Directories under \saved created from the single large file
# .\savedFromXdccBrowser.xcb using --split-output, "--split-source network",
# B) All options have shortened versions and can be placed in configuration
# files (perhaps one per network/channel/series/music group/whatever). So,
# for example, --filter-include and --filter-episode can be written as -fi
# and -fr, respectively.

# or

# Advanced unix example: running an XDCC server using iroffer.
#
# 1) Edit /home/dude/bin/iroffer/iroffer-dudesGroup.conf:
# ...
# filedir /home/dude/archive/xdcc/dudesGroup
# xdcclistfile /home/dude/public_html/xdcc/listings.txt
# ...
#
# 2) Run command: chmod +x /home/dude/bin/xdccbLister.pl
#
# 3) Edit user scheduled cron tasks: crontab -e:
# ...
# 10 * * * * /home/dude/bin/xdccbLister.pl
#   -c /home/dude/etc/xdccbLister/dudesGroup.conf
# 10 * * * * cp /home/dude/archive/xdcc/dudesGroup/listings.xcb
#   /home/dude/public_html/xdcc/listings.xcb
# ...

with configuration file /home/dude/etc/xdccbLister/dudesGroup.conf:

network=EFnet
server=irc.efnet.org
port=6667
channel=#dudesChannel
spaces=0
input=/home/dude/public_html/xdcc/listings.txt
output=/home/dude/archive/xdcc/dudesGroup/listings.xcb

# Notes:
# A) What the above does, if it is not clear :), is to schedule one task
# at 10 minutes past every hour to process the iroffer text listing into an
# XCB listings.xcb file which becomes one of the bot's packs! The other task
# copies the XCB listing to the public web directory,
# B) Yes, you can clearly combine the two crontab tasks above into a single
# shell script and schedule that instead. It was just easier to illustrate
# this way. :)

# You are only limited by your imagination!

# DESCRIPTION

xdccbLister is no longer a slightly complex script for a simple objective! It is now pretty much a Swiss Army knife for turning offline XDCC listings into files, which can be directly imported into the XDCC Browser or sliced and diced into the most convenient form before importing.

The original intent was to make it as easy as possible to use the great XDCC Browser (an mIRC client script) with XDCC listings which are not advertised via IRC, but are available from a web site or a downloaded file. The latest version of xdccbLister does this and more.

Ok, intro over. How do we use it? The most convenient way is to just create a small config file (typically xdccbLister.conf) and put the required options in there with their values:

  network=<some irc network>
  server=<some irc server>
  port=<irc server port - only required if not port 6667>
  channel=<some irc channel>

Then, call the script with a url or file path like so:

  xdccbLister http://hostname/path/to/html/page.html
  xdccbLister --type text c:\path\to\my\downloaded\listing.txt

This will generate a file with the default name (xdccbLister.xcb) in the default XCB format in the current directory. This file should be easy to import into an XDCC client script like XDCC Browser.

Ok, how about any gotchas or caveats?

• Some care is required on the naming of bots/users (called entities in the script). If a page or file does not contain a trigger (/msg Entity or /ctcp Entity) before its XDCC entries, then there is no way of guessing the name from just looking at the page or file. To get around this problem, just supply a hint for what a name may look like in the page or file using --entity-hint.

If the name is not in the page or file at all, then you must explicitly specify it using --force-entity.

• If you firmly believe a description in a listing may contain spaces and require the full description, you will need to use the --spaces option. The problem with using this option is that it can be too aggressive in assuming some text is part of the description (since it has few ways of knowing when a description ends, see BUGS). Only recommended on well-formed HTML or text XDCC listings with summary statistics at the end! Though that said, descriptions are not critical and it can be easy to filter out the occasional bogus entry with an odd description.

And that is pretty much all there is to it for basic usage! There are many other options one can use to tweak what the script does in the creation of the output, e.g. splitting the output into multiple files based on entity name, or across directories based on network name, or filtering the list based on network, entity, size, queue (a.k.a. gets), episode numbers (within the description!) etc., but these can be discovered later.

# INPUT OPTIONS

## General Input

-c, --config <path>
Sets the path to the configuration file. The configuration file contains simple key-value entries of the form
  key=value

followed by a newline character. White space at the start, end or around the "=" delimiter is ignored. Similarly, comment lines beginning with "#" are ignored. This option cannot be specified in a configuration file.

Default: .\xdccbLister.conf
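
The configuration format described above can be sketched in a few lines of Python. This is only an illustration of the stated rules (key=value, surrounding white space ignored, lines beginning with "#" treated as comments), not the script's own parser; the function name `parse_config` and the choice to skip lines without "=" are assumptions. Pairs are returned as a list rather than a dictionary because some options, such as input, may be given more than once.

```python
def parse_config(text):
    """Parse key=value lines; return (key, value) pairs in file order."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines beginning with "#".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # assumption: lines without "=" are silently skipped
        # White space around the "=" delimiter is ignored.
        entries.append((key.strip(), value.strip()))
    return entries


if __name__ == "__main__":
    sample = "network = Rizon\n# a comment\nchannel=#group\n"
    print(parse_config(sample))
```

Note that only a "#" at the start of a line marks a comment, so a value such as channel=#group is kept intact.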

-i, --input <source of listing>
Sets the url or file for the listing. This enables scaling via configuration files (e.g. one per regular channel) rather than having to specify the source on the command line every time. When used in a configuration file, it also allows more inputs than would fit on the command line.

Multiple --input options allowed.

-id, --input-dir <input directory>
Sets an input directory against which inputs specified using --input or on the command line are checked first - very useful in configuration files.

-nc, --noconfig
Ignores any configuration file. Intended to be used on the command line in order to ignore the default configuration file, so as to avoid moving or renaming it. This option cannot be specified in a configuration file.

-t, --type <input file format>
Sets the format of the input page or file. Available formats: html, text, xcb.

Default: html.

-ua, --user-agent <User-Agent string>
Sets the HTTP User-Agent string for the robot which connects to web sites.

Default: Mozilla/5.0.

## Unstructured input formats

The following options only apply to the input formats which have no well-defined structure (i.e. --type html or --type text). They have no effect when specified with other input formats.

### Required

-C, --channel <irc channel>
Sets the channel name on the server where the entities in the XDCC listing may be found. Can be any string.

-N, --network <name>
Sets the network name for the entities in the XDCC listing. Can be any string.

-S, --server <hostname or IP>
Sets the hostname or IP for a server where the entities in the XDCC listing may be found. The hostname will be resolved to an IP unless --nodns-lookup is used.

### Optional

-dl, --dns-lookup or --nodns-lookup
Toggles whether to perform a DNS lookup to resolve the server hostname. Since a DNS lookup, in performance terms, is more expensive than not performing one it can be useful to use this toggle when the hostname is already an IP.

Default: True (i.e. a DNS lookup is carried out).

-eh, --entity-hint <wildcard pattern>
Sets a wildcard pattern that will be taken as a hint for the names of the entities associated with the XDCC listings. The usual wildcards are available: "*" and "?" for matching zero or more characters, and a single character, respectively. Please note that when used from the Windows command line certain characters (|,^) must be escaped with "^", e.g. "^|" and "^^".

Without this hint, there would be no way to pick up possible entity names which are not part of a trigger in the page or file (e.g. /msg Entity or /ctcp Entity). If there is no information at all in the page or file, see --force-entity.

-el, --entity-length <size of nick>
Sets the maximum length for an entity name. Entity names cannot be arbitrarily long, and the limit varies per server/network. The standard maximum is, in fact, 9 characters. However, the default for this script is a lot more generous. Names longer than the maximum are currently truncated.

Default: 30 characters (Rizon network appears to have the longest nick).

-E, --force-entity <name>
Sets a fallback entity name for when the script is unable to determine the name of an entity via triggers or the hint (and yet there are XDCC entries/packs in the page/file). For example, you could use it where a page links to bot listings but the listing itself does not contain the bot's name.

-ie, --input-encoding <encoding name>
Sets the character encoding of the input page or file. Note that this applies to all input files or pages.

Default: iso-8859-1 (also known as latin1).

-U, --nousage
Disables the script's behaviour of outputting "Usage=" lines when a recognised trigger is available. These lines correspond to the trigger used to request the entries/packs/files from an entity.

Recognised triggers (not case sensitive):

  /msg entity xdcc send
  /msg entity xdcc get
  /ctcp entity xdcc send
  /ctcp entity xdcc get

Note that the word "entity" is replaced with the name of the entity in the output file.

This option should rarely be needed.

Default: 0/False (i.e. "Usage=" lines are output if recognised triggers present).
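
As an illustration of how the four recognised triggers listed above might be detected, here is a minimal Python sketch. It is not taken from the script: the regular expression, the function name `find_trigger` and the shape of the returned value are all assumptions, and the exact layout of a "Usage=" line in the XCB file is not shown here.

```python
import re

# Matches "/msg <entity> xdcc send|get" and "/ctcp <entity> xdcc send|get",
# ignoring case, as in the list of recognised triggers.
TRIGGER_RE = re.compile(
    r"/(msg|ctcp)\s+(\S+)\s+xdcc\s+(send|get)\b", re.IGNORECASE
)


def find_trigger(text):
    """Return (entity, trigger) for the first recognised trigger, or None."""
    match = TRIGGER_RE.search(text)
    if match is None:
        return None
    command = match.group(1).lower()
    entity = match.group(2)
    verb = match.group(3).lower()
    # The word "entity" in the trigger template is replaced by the real name.
    return entity, f"/{command} {entity} xdcc {verb}"


if __name__ == "__main__":
    print(find_trigger("Type /MSG Group|Bot XDCC SEND #1 to download"))
```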

-P, --port <port number>
Sets the port for the server where the entities in the XDCC listing may be found. This is a single number and not a range.

Default: 6667

-s, --spaces or --nospaces
Toggles whether or not to allow spaces in the descriptions for all the entries/packs of the XDCC listings. Enabling this option can introduce problems. Use with care (see BUGS). It has no effect on input listings which are not HTML or ordinary text.

Default: 0/False (i.e. description does not contain spaces).

-u, --usage <string>
Sets the format of the "Usage=" lines that are to be output. These lines correspond to the trigger used to request the entries/packs/files from an entity. In the <string>, the occurrence of the word "entity" is replaced with the name of the entity in the output file.

This option explicitly overrides the default behaviour of only outputting "Usage=" lines if recognised triggers are present and forces the lines to be output with the set format.

An example trigger:

  --usage "/ctcp entity xdcc retrieve"

This option should rarely be needed. Please contact the xdccbLister author if this option is often used with a particular trigger format not already in the catalogue of default triggers (see --nousage).

# OUTPUT

## General Output

-o, --output <path>
Sets the path to the output filename. Note, however, when --split-output is used this becomes the path to the directory where the files are output.

Default: .\xdccbLister.xcb

-oc, --output-compatible <external program version>
Sets the output format compatibility with a specific version of another program. Available versions: "xdccb-v3.34" or "xdccb-v3.40".

Default: xdccb-v3.40

-od, --output-dir <output directory>
Sets the output directory against which output filenames are first evaluated. Useful for configuration files.

-oo, --output-force-dir
Forces the creation of the output directory, if it does not already exist.

Default: 0/False

-F, --output-format <output file format>
Sets the format of the output file. Available file format: "xcb".

Default: xcb.

## Filters

These options apply filters to the complete list once loaded from the source. They are invaluable in slicing and dicing the list into chunks (which are themselves lists) according to your requirements. For example,

• from a broad and large list saved from XDCC Browser, you could eliminate all networks except one,

• you could exclude any listings from known slow entities,

• you could exclude items which do not interest you,

• you could include only items which interest you,

• you could even select a subset of a series based on an episode number in the item description!

• you could eliminate items which are too large, too small or have no size,

• you could include only items on entities with small gets,

• you could select specific pack numbers and ranges.

Wildcard patterns: the usual wildcards are available: "*" and "?" for matching zero or more characters, and a single character, respectively. Please note that when used from the Windows command line certain characters (|,^) must be escaped with "^", e.g. "^|" and "^^".
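
A minimal Python sketch of the wildcard semantics just described, for illustration only. The helper names are hypothetical. Two behaviours are assumptions rather than documented facts: the pattern is allowed to match anywhere in the text (as SYNOPSIS examples such as --filter-include "Simpsons" suggest), and matching is case sensitive.

```python
import re


def wildcard_to_regex(pattern):
    """Translate "*" and "?" wildcards into a compiled regular expression."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")  # zero or more characters
        elif char == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(char))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, text):
    """True if the wildcard pattern matches somewhere in the text."""
    # Assumption: unanchored match; use fullmatch() for whole-string matching.
    return wildcard_to_regex(pattern).search(text) is not None


if __name__ == "__main__":
    print(matches("Group|*", "Group|WeRule"))
    print(matches("S??E01", "The.Simpsons.S05E01.avi"))
```

Characters such as "|" are escaped before the expression is compiled, so they are matched literally rather than as regular expression operators.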

Range filter: <number | number1-number2 | number- | -number>. The ranges are separated with commas (note no spaces). In addition, with some range filters an additional item, such as "n/a", may be allowed within the list. This enables you to allow entries which may simply not have a number. Examples: 1,2,3-6 or 8,20- or -30,n/a or -5,8,11,13-15,24-,n/a.
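
The range syntax above can be sketched in Python as follows. This is an illustrative reading of the syntax, not the script's code: the function names are hypothetical, range bounds are assumed to be inclusive, and a missing number is represented by `None`.

```python
def parse_ranges(spec):
    """Parse e.g. "-5,8,13-15,24-,n/a" into ([(low, high), ...], allow_na)."""
    ranges = []
    allow_na = False
    for item in spec.split(","):
        if item == "n/a":
            allow_na = True
        elif "-" in item:
            low, _, high = item.partition("-")
            # An empty side means the range is open at that end.
            ranges.append((int(low) if low else None, int(high) if high else None))
        else:
            ranges.append((int(item), int(item)))  # single number
    return ranges, allow_na


def in_ranges(value, ranges, allow_na=False):
    """True if value falls in any range; None (no number) passes only with n/a."""
    if value is None:
        return allow_na
    return any(
        (low is None or value >= low) and (high is None or value <= high)
        for low, high in ranges
    )


if __name__ == "__main__":
    ranges, allow_na = parse_ranges("-5,8,11,13-15,24-,n/a")
    print(ranges, allow_na)
    print(in_ranges(14, ranges, allow_na))
```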

Combining filters: when multiple filters are used at the same time, the filters apply with a specific priority, basically, broadest first (or the sledgehammer to jeweller's chisel): --filter-network, --filter-entity, --filter-exclude, --filter-include, --filter-episode, --filter-size, --filter-gets, --filter-pack.

That is, first whole networks are eliminated, then entities, then items based on their description, size and gets, and finally individual pack numbers.

-fe, --filter-entity <wildcard pattern>
Sets an inclusive filter on the names of entities. So, if an entity name does not match the wildcard pattern, the entity is removed. Multiple filters allowed.

-fr, --filter-episode <range filter plus "n/a">
Sets an inclusive range filter on the first number encountered within the description of entries. This is almost always the episode number as part of a series. Don't worry, it ignores CRC-32 numbers! So, if an entry has an episode number which does not fall within the ranges, the entry is removed.

-fx, --filter-exclude <wildcard pattern>
Sets an exclusive filter on the descriptions of entries/packs. So, if a description does match the wildcard pattern, the entry is removed. Multiple filters allowed.

-fg, --filter-gets <range filter plus "n/a">
Sets an inclusive range filter on the queue sizes (gets) of entries. So, if an entry has a queue which does not fall within the ranges, the entry is removed. Since it is assumed that no one would be interested in filtering *exactly* on a specific queue size, a single number is equivalent to -number, i.e. allow all queues less than number. If the exactness is required, use number-number (redundant range).

-fi, --filter-include <wildcard pattern>
Sets an inclusive filter on the descriptions of entries/packs. So, if a description does not match the wildcard pattern, the entry is removed.

-fn, --filter-network <wildcard pattern>
Sets an inclusive filter on the name of networks. So, if a network name does not match the wildcard pattern, the network is removed.

-fp, --filter-pack <range filter>
Sets an inclusive range filter on the pack number for entries. Most of the time this is only appropriate for single entity lists, or if you have bogus/unwanted entries, perhaps due to using the --spaces option. So, if an entry has a pack number which does not fall within the ranges, the entry is removed.

-fs, --filter-size <range filter with size unit plus "n/a">
Sets an inclusive range filter on the size for entries. A size unit is required at the end of each number, otherwise the size will be interpreted in bytes (which is unlikely to be useful when multiple size units are present in a list). Examples: -250M or -10K,100M- or 7K-500M or -2G. The period decimal and comma thousand separators are supported.
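
To make the size handling concrete, here is a hedged Python sketch that converts a size such as 250M or 1,024.5K into bytes. It is not the script's implementation: the function name is hypothetical, and the use of binary multiples (K = 1024 bytes) is an assumption, since the documentation does not say whether binary or decimal multiples are used. As described above, a number without a unit is interpreted as bytes.

```python
import re

# Assumption: binary multiples; the documentation does not specify this.
UNITS = {"": 1, "B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

# A number with optional comma thousand separators and period decimal,
# followed by an optional unit letter and an optional trailing "B".
SIZE_RE = re.compile(r"^\s*([\d,]*\.?\d+)\s*([BKMG]?)B?\s*$", re.IGNORECASE)


def size_to_bytes(text):
    """Return the size in bytes, or None if the text is not a size."""
    match = SIZE_RE.match(text)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    return int(number * UNITS[match.group(2).upper()])


if __name__ == "__main__":
    for sample in ("250M", "7K", "1,024.5K", "500", "n/a"):
        print(sample, size_to_bytes(sample))
```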

## Sorting

-se, --sort-entity "alphabet" or "page"
Sets the sort order of the entities in the output file. The choices are "alphabet" referring to ascending alphabetical order, and "page" for the order in which entities were discovered on the input page or file.

Default: alphabet.

-sr, --sort-entry "numeric" or "page" or "pack" or "gets" or "size" or "episode" or "description"
Sets the sort order of the entries/packs for a given entity in the output file. The choices are "numeric" referring to ascending numerical order, "page" for the order in which entries were discovered on the input page or file, "pack" which is an alias for "numeric", "gets" for ascending numerical order by gets field, "size" for ascending numerical order by size, "episode" for ascending numerical order by the description-extracted episode number, and "description" for standard string-wise order.

Default: page.

-sn, --sort-network "alphabet" or "page"
Sets the sort order of the networks in the output file. The choices are "alphabet" referring to ascending alphabetical order, and "page" for the order in which networks were discovered on the input page or file.

Default: alphabet.

## Splitting

Splitting output can be handy in order to start taking control of parts of lists. This is because these smaller lists can be further processed by xdccbLister (with a performance improvement) and then they can be loaded into an XDCC client script more quickly and with minimal or no further changes.

-pd, --split-dir or --nosplit-dir
Toggles whether to split the output across multiple directories: one directory per network. The directories and files are always named after the URL-escaped (UTF-8) network and entity names (since some characters allowed in entity names are not allowed in path names). To direct where the files are placed, use --output. This option only takes effect when --split-source network is used (i.e. you can only split to directory on network).

Default: 0/False (i.e. output is not split across directories).

-po, --split-output or --nosplit-output
Toggles whether to split the output across multiple files: one file per entity. The files are always named after the URL-escaped (UTF-8) network and entity names (since some characters allowed in entity names are not allowed in filenames). To direct where the files are placed, use --output.

Default: 0/False (i.e. output is not split).
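
The URL-escaped naming can be illustrated with Python's standard library. This is a sketch under assumptions: the helper name is hypothetical, and the "network-entity.xcb" layout is inferred from the file names shown in the SYNOPSIS (e.g. Rizon-Bot1.xcb) rather than from the script itself.

```python
from urllib.parse import quote


def split_filename(network, entity, extension="xcb"):
    """Build a file name from URL-escaped network and entity names."""
    # quote() percent-encodes using UTF-8 by default; safe="" makes sure
    # that "/" is escaped too, so a name can never contain a path separator.
    return f"{quote(network, safe='')}-{quote(entity, safe='')}.{extension}"


if __name__ == "__main__":
    print(split_filename("Rizon", "Group|Bot1"))
```

Escaping is what allows an entity name such as Group|Bot1 to be stored on file systems that reject "|" in file names.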

-ps, --split-source "entity" or "network"
Sets the field on which the output is split into sub-lists.

Default: entity.

# OTHER

These options cannot be used in a configuration file (the same is true of --config and --noconfig). They are intended for use only on the command line.

-d, --debug
Use to find out everything the script is doing and can afford to tell you! Outputs a lot more information than --verbose.

-h, -?, --help
Don't Panic! Returns a quick summary of all the options available.

-v, --verbose
Use to find out a little about what the script is doing.

-V, --version
Returns the version of the script being used.

# BUGS

Descriptions with spaces
This is less of a bug than it is to do with the fact that a web page is not a good transport medium for a naturally line-delimited XDCC listing. However, as an area where the script's handling could be improved, I will list it as a bug.

Basically, there are no guarantees about any sort of non-space white space on a web page other than that explicitly allowed by HTML markup. This includes newline characters. Since the script ignores all HTML markup (they are never part of an ordinary text XDCC listing) this means it ignores newline characters at the end of an XDCC entry/pack's description. Therefore, there is no way of knowing for sure when a description ends if it contains spaces.

The script currently guesses when a description ends by making assumptions about its content and what may appear after it. The description is typically a unix/windows filename and thus would never contain a "*" character. Also, the statistics at the end of many XDCC listings begin with "**". Thus, this is a natural end of description (if available). The description would also naturally end with the start of the next entry/pack (which always begins with "#") or if the user specified an entity hint. If all else fails, the description is limited to 255 characters.
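
The guessing strategy can be sketched roughly as follows in Python. This is one possible reading of the paragraph above, not the script's actual code: the function name is hypothetical, the next entry is assumed to look like "#" followed by digits, and the entity hint is assumed to end the description at the point where it matches.

```python
import re

MAX_DESCRIPTION = 255  # fallback limit mentioned above


def cut_description(text, hint_regex=None):
    """Guess where a description with spaces ends and return it."""
    end = min(len(text), MAX_DESCRIPTION)

    # A "*" never appears in a filename; "**" starts the summary statistics.
    star = text.find("*")
    if star != -1:
        end = min(end, star)

    # Assumption: the next entry/pack starts with "#" followed by digits.
    next_entry = re.search(r"#\d+", text)
    if next_entry:
        end = min(end, next_entry.start())

    # Assumption: text matching the entity hint ends the description.
    if hint_regex is not None:
        hint = re.search(hint_regex, text)
        if hint:
            end = min(end, hint.start())

    return text[:end].strip()


if __name__ == "__main__":
    print(cut_description("My Show 01.avi #2 5x [10M] Other.avi"))
    print(cut_description("Last File.avi ** 3 packs **"))
```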

Note, that after all this, since it is just a description, it should not be the end of the world if a description is too short, contains odd characters, or one ends up with a strange entry. In the latter case, for example, the entry should be easy to filter out (either with the xdccbLister or once imported into the XDCC Browser).

# AUTHOR

Tasuki Yamamoto (sqzme at users dot sourceforge dot net).

# ACKNOWLEDGEMENTS

Extreme thanks to Yochai Timmer who not only authored the best XDCC client script known to man, but also helped considerably in confirming all there is to know about the latest XCB formats!