FS#213 : Rewrite parser

Attached to Project: Ted
Opened by Jofo (JoFo) - Sunday, 26 October 2008, 16:52 GMT+2

Task Type	Change Request
Category	Backend / Core → Parser
Status	Assigned
Assigned To	Roel (roel) Joshua (josh)
Operating System	All
Severity	High
Priority	High
Reported Version	Development
Due in Version	0.98
Due Date	Undecided
Percent Complete
Votes	0
Private	No

Details

Currently the parser is such a mess that it's hard to add new functionality. Please investigate a proper way to set up a new parser and implement it!

This task depends upon
FS#294 - Use Seeder/Leecher Ratio to filter torrents

This task blocks these from closing
FS#214 - Make a log per show

Comments (16)

Comment by Joshua (josh) - Wednesday, 14 October 2009, 02:11 GMT+2

See comment in: FS#283 - Make plug-in support for ted

Comment by Roel (roel) - Monday, 19 October 2009, 22:02 GMT+2

The way I see it, we have RSS feeds that hold Torrents that contain Episode(s). And we have a Show with Episodes. The cool thing would be if we could map those two together in the parser to find torrents that contain the episode(s) we are looking for in a show.

Also: the parser could be far more object-oriented. We should create a torrent class that stores and retrieves information (like seeders, size, publish date and so forth) from itself when the parser needs it.

Comment by Joshua (josh) - Tuesday, 01 December 2009, 03:41 GMT+2

If we could keep a list of torrents against a show, it would help with making the log per show. Users could then manually download a torrent, or exclude a torrent from the parser next time around.

Comment by Roel (roel) - Thursday, 28 January 2010, 10:23 GMT+2

I would like to make a start with a TedTorrent class: a data structure that keeps the number of seeders/leechers, the URL of the torrent, the file size, etc. In the parser, this structure is filled and used to pick the best torrent. As Joshua suggests, instances of this class could be stored in a Show (for example: all the torrents that were found during the last parse round) to allow logging and manual picking of a torrent.
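
A minimal sketch of what such a data structure could look like. Only the name TedTorrent comes from this comment; the fields, the getters and the `ShowTorrents` holder (a stand-in for storing the last parse round in a Show) are assumptions made for illustration, not existing ted code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: field and method names are illustrative, not ted's API.
final class TedTorrent {
    private final String url;
    private final int seeders;
    private final int leechers;
    private final long sizeInBytes;

    TedTorrent(String url, int seeders, int leechers, long sizeInBytes) {
        this.url = url;
        this.seeders = seeders;
        this.leechers = leechers;
        this.sizeInBytes = sizeInBytes;
    }

    String getUrl() { return url; }
    int getSeeders() { return seeders; }
    int getLeechers() { return leechers; }
    long getSizeInBytes() { return sizeInBytes; }
}

// Stand-in for a show that remembers what the last parse round found.
final class ShowTorrents {
    private final List<TedTorrent> lastParseRound = new ArrayList<>();

    // Replaces the stored torrents with those found in the newest parse round.
    void storeParseRound(List<TedTorrent> found) {
        lastParseRound.clear();
        lastParseRound.addAll(found);
    }

    // Read-only view, e.g. for a log or for manual picking in the UI.
    List<TedTorrent> getLastParseRound() {
        return Collections.unmodifiableList(lastParseRound);
    }
}
```

Keeping the torrent immutable and handing out a read-only list means the log and any picker dialog can read the last round without being able to change what the parser found.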

Josh: have you started implementing anything on this subject?

Comment by Joshua (josh) - Thursday, 28 January 2010, 22:05 GMT+2

Roel: I have done some playing around with this but no real implementation yet.

Comment by Roel (roel) - Monday, 08 February 2010, 21:16 GMT+2

We really should get a better torrentsniffer. Ours just leads to too many errors. I was looking around a bit and found some Java torrent libraries. Development on them has long been abandoned, but there are some we could use as inspiration. Some of the developers even offered to help get their code implemented in ted.

I will open a separate bug for that.

Comment by Joshua (josh) - Tuesday, 09 February 2010, 01:58 GMT+2

I remember looking at Snark a while back and it looked good; maybe we could base it off that.

Or we could even look at one of the torrent clients like Vuze and use, or base a version on, what they have.

Comment by Joshua (josh) - Tuesday, 23 February 2010, 23:09 GMT+2

OK, I have started doing some work on this now, and I want to see what you think of the ideas, as they are very different from the existing setup.

1. Defined a new Feed interface, which contains a list of FeedResult (each holds all the result info: seeders, leechers, torrent URL).
2. Defined a new feed getter interface, and implemented a daily feed getter and a series feed getter that use RSS to create a Feed.

3. Defined a new Parser interface. It has a list of "validators" (for validating results) and "listeners" (for communicating events/changes), one "selector" (for selecting the best torrent) and one "feed getter" (for getting the feed).

4. Defined implementations of "selectors" based on the current parser (MOSTSEEDERS, BESTRATIO, BESTDAILY).
5. Defined implementations of "validators" (minimum seeders, file size, compressed files, single episode only).

6. NOT FINISHED: update TedSerie to be a listener, so that it can be updated when the parser changes status.
7. NOT FINISHED: update TedSerie to contain a "parse log" holding all the info about the parse (e.g. "torrent BLAH rejected, not enough seeders, <LINK TO TORRENT>"). This will allow us to set up downloading from the log, or ignoring results in the next parse.

8. Created a Parser factory, which takes a TedSerie and builds a parser: it populates the "validators" for the series type (daily or series), adds the "selector" based on series type/config, and adds the feed getter based on the series type.

This means that the parser is made up entirely of components, so we can add, remove or change them easily. It also means we are not stuck with RSS as the feed source (I know Isohunt has a JSON API).

NOTE: my parser loops through the "validators", each of which loops through the feed results; the current parser loops through the results, each of which loops through the validations. Doing it my way is faster/technically better (i.e. less startup/creation time), but it means that the current progress bar setup will not work. I could change the "validators" to work the same way as the current setup if my way is unacceptable or you prefer it that way, but I was thinking we could have the progress bar say "initializing", "reading", "parsing", "selecting best", "finished".

example flow:

TedSerie ("the blah show")
Factory creates Parser with daily feed implementation, size validator, minimum seeders validator, best daily torrent selector.
Parser reads daily feed
Parser has 4 results (result 1, result 2, result 3, result 4)
size validator runs (removes result 1, too small)
minimum seeders validator runs (removes result 2, not enough seeders)
best daily torrent selector runs (rejects result 3 because it aired BEFORE the last downloaded episode, downloads result 4 because it is best)
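
The component setup and the example flow can be sketched roughly as follows. This is an illustration under assumptions, not the code being written for this task: the type names, the thresholds (100 MB, 20 seeders) and the sample numbers are invented, and the selector is simplified to "most seeders" instead of the air-date logic of the best daily selector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// All type names here are hypothetical and only mirror the components listed above.
final class FeedResult {
    final String title;
    final int seeders;
    final int sizeMb;

    FeedResult(String title, int seeders, int sizeMb) {
        this.title = title;
        this.seeders = seeders;
        this.sizeMb = sizeMb;
    }
}

interface Validator {
    // Returns null when the result passes, otherwise a reject reason.
    String check(FeedResult result);
}

interface Selector {
    // Returns the best remaining result, or null when nothing survived.
    FeedResult select(List<FeedResult> candidates);
}

final class ComponentParser {
    private final List<Validator> validators;
    private final Selector selector;
    final List<String> parseLog = new ArrayList<>();

    ComponentParser(List<Validator> validators, Selector selector) {
        this.validators = validators;
        this.selector = selector;
    }

    FeedResult parse(List<FeedResult> feed) {
        List<FeedResult> remaining = new ArrayList<>(feed);
        // Outer loop over validators, inner loop over results.
        for (Validator validator : validators) {
            List<FeedResult> kept = new ArrayList<>();
            for (FeedResult result : remaining) {
                String reason = validator.check(result);
                if (reason == null) {
                    kept.add(result);
                } else {
                    parseLog.add(result.title + " rejected: " + reason);
                }
            }
            remaining = kept;
        }
        return selector.select(remaining);
    }
}

final class ParserFlowDemo {
    // Selector stand-in: most seeders wins.
    static FeedResult mostSeeders(List<FeedResult> candidates) {
        FeedResult best = null;
        for (FeedResult r : candidates) {
            if (best == null || r.seeders > best.seeders) best = r;
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up thresholds, chosen only so results 1 and 2 get rejected.
        Validator size = r -> r.sizeMb < 100 ? "too small" : null;
        Validator seeders = r -> r.seeders < 20 ? "not enough seeders" : null;
        ComponentParser parser =
            new ComponentParser(Arrays.asList(size, seeders), ParserFlowDemo::mostSeeders);
        FeedResult picked = parser.parse(Arrays.asList(
            new FeedResult("result 1", 80, 10),
            new FeedResult("result 2", 5, 350),
            new FeedResult("result 3", 30, 350),
            new FeedResult("result 4", 60, 350)));
        System.out.println("picked: " + picked.title);
        parser.parseLog.forEach(System.out::println);
    }
}
```

Because each validator sees only the results that survived the previous one, a result is logged with the first reason that rejects it; the parse log is where a per-show log or a manual picker could read from.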

Comment by Roel (roel) - Wednesday, 24 February 2010, 17:19 GMT+2

Hi Joshua,

Sounds good and resonates with the ideas I had in my head. I have a few suggestions/questions:

1 - Why the different feed types (daily/serie)? I don't see why the feeds would be different for a specific kind of show. I would like to get rid of as many of the "daily"/"season-episode" differences we have now as possible. Your validators are a good starting point for this. Why introduce two kinds of feeds?

2a - I miss the validator for validating the correct episode from a torrent title.

2b - I have some ideas for future features, like manually picking a torrent from a failed search when no torrents have been found. Not from the textual log, but from a list of the torrents that were found and their reject reasons, shown in a popup window. I would propose to introduce a torrent wrapper class that can remember all this and store the last search result somewhere (inside the TedSerie?). I will try to post a UI mockup of my ideas soon.

3 - That torrent wrapper class could also hold lots of methods that are now performed in the parser like: getting the seeders, translating the info url to a torrent url, compute the size of its contents, etc. And be a place to remember all this. It could also implement a way of retrieving the season/episode number or daily date from the torrent title so that the validator for the correct episode can use that.
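
Retrieving the season/episode number from a torrent title could look roughly like this. It is a sketch under assumptions: the class name `EpisodeNumber` is hypothetical, and only the two title styles that appear in this thread ("S02E02" and "02x02") are handled; daily dates are left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the two patterns cover only the "S02E02" and "02x02" styles.
final class EpisodeNumber {
    private static final Pattern SXXEYY =
        Pattern.compile("(?i)\\bS(\\d{1,2})E(\\d{1,2})\\b");
    private static final Pattern NXN =
        Pattern.compile("(?i)\\b(\\d{1,2})x(\\d{1,2})\\b");

    final int season;
    final int episode;

    private EpisodeNumber(int season, int episode) {
        this.season = season;
        this.episode = episode;
    }

    // Returns null when the title carries no recognisable season/episode marker.
    static EpisodeNumber fromTitle(String title) {
        for (Pattern pattern : new Pattern[] {SXXEYY, NXN}) {
            Matcher m = pattern.matcher(title);
            if (m.find()) {
                return new EpisodeNumber(
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
            }
        }
        return null;
    }
}
```

A validator for the correct episode could then compare the returned season and episode with the ones the show is waiting for, and reject titles for which `fromTitle` returns null.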

4 - The progress bar and status of the serie are not really compatible with your way of parsing, and I don't know if that really matters. But the advantage of the status bar right now is that you can see on which torrent ted is stuck.

Comment by Joshua (josh) - Wednesday, 24 February 2010, 22:26 GMT+2

1. Yeah, I came to the same conclusion about daily/series feeds and now just have one type. However, I'm looking at source-specific feeds, i.e. an Isohunt feed, a Vertor feed. This removes a lot of the "if" statements in the parser and makes the results a lot more accurate. It also makes the code a lot more manageable.

2a. I only listed a couple of validators. So far I have written 11 validators: DailyEpisode, SeriesEpisode, DailyLatest, HDEpisode, MinimumSeeders, PublicTracker, BestRatio, MostSeeders, SingleEpisode, Size and UncompressedFile. I still need to write one for double episodes. Any others you can think of?

2b. It sounds like your idea is a lot like my idea of having a list of rejected torrents inside the TedSerie. I'd like to see what your UI mockup would look like.

3. The size, seeders and all the translating will be handled by the FeedResult or Feed. This allows us to extend the feeds to possibly include different file types down the track, i.e. NBZ files.

4. If the parser works, it shouldn't get stuck: if an error happens with a result, that result should be rejected. I will play around and maybe have the progress bar loop through with the validators as well.
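
One way to get the "an error rejects the result instead of stalling the parse" behaviour is to wrap every check. This is only a sketch: `ResultCheck` and `SafeCheck` are hypothetical names, and the wording of the reject reason is made up.

```java
// Hypothetical names; the point is only that a failing check becomes a reject reason.
interface ResultCheck<T> {
    // Returns null when the result passes, otherwise a reject reason.
    String reasonOrNull(T result) throws Exception;
}

final class SafeCheck {
    private SafeCheck() {}

    // Runs the check and converts any exception into a reject reason,
    // so one broken result cannot stall the whole parse.
    static <T> String run(ResultCheck<T> check, T result) {
        try {
            return check.reasonOrNull(result);
        } catch (Exception e) {
            return "error while checking: " + e.getMessage();
        }
    }
}
```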

Comment by Roel (roel) - Thursday, 25 February 2010, 23:42 GMT+2

1) Can you elaborate on the source feeds a bit? I'm not sure what your ideas are with that?

2a) Seems like you got them all. UncompressedFile might not be the best name; I would call it 'BlockedExtensions' or something.

2b) See the attached file. My idea is that this dialog could be opened from the main dialog when a search has failed (or maybe even when it succeeded), replacing the call to check the log. Please give some feedback; I haven't discussed this with anyone yet.

3) OK, so FeedResult actually represents a Torrent? What happens if we add support for NBZ?

show details 2 pick torrent_2... (128.9 KiB)

Comment by Joshua (josh) - Friday, 26 February 2010, 01:47 GMT+2

1.) Source feeds: as an example, if you look at the RSS feed for btjunkie, the seeders are listed in the title element as "[67/148]", but in the RSS feed for Vertor they are listed in the description. So instead of having one feed with lots of "if" statements, write one specifically for btjunkie and one for Vertor.
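
A source-specific reader for the title style quoted above might be sketched like this. The class name `TitlePeerCounts` is hypothetical, and reading "[67/148]" as seeders first and leechers second is an assumption. The Vertor description format is not given in this thread, so no reader is sketched for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical per-source helper for feeds that put "[67/148]" in the item title.
// Assumption: the first number is seeders and the second is leechers.
final class TitlePeerCounts {
    private static final Pattern COUNTS =
        Pattern.compile("\\[(\\d{1,9})/(\\d{1,9})\\]");

    final int seeders;
    final int leechers;

    private TitlePeerCounts(int seeders, int leechers) {
        this.seeders = seeders;
        this.leechers = leechers;
    }

    // Returns null when the title has no "[n/m]" marker.
    static TitlePeerCounts fromTitle(String title) {
        Matcher m = COUNTS.matcher(title);
        if (!m.find()) {
            return null;
        }
        return new TitlePeerCounts(
            Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
    }
}
```

Each source would get its own small reader like this one behind a common interface, so the parser itself never needs to know where a given site hides its counts.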

2b.) Your idea is much more detailed than mine, but easily done. I will work on including that information in the feed log.

3.) At the moment the FeedResult represents a result, not specifically a torrent. It could be extended to include anything. I haven't looked at adding NBZ, but the result is generic enough that we could do it.

Comment by Jofo (JoFo) - Friday, 26 February 2010, 18:34 GMT+2

About the feeds: do you want to hard-code their format into ted? If so, whenever a site adjusts its format we would have to update and release ted as well. That would be very inconvenient.

Comment by Jofo (JoFo) - Friday, 26 February 2010, 23:37 GMT+2

About the suggested log information screen: should we discuss it here: FS#214?

In the end the look and feel of the screen should be independent of the implementation of the parser.

Comment by Joshua (josh) - Saturday, 27 February 2010, 00:03 GMT+2

I'm going to put this here because it's more a parser question than a log one:

There is only one issue with your mockup: once a result is rejected, it isn't validated any further. If I have a torrent with 10 seeders (below minimum) and a 100 MB file (below minimum), and the parser is set up to validate seeders before file size, then the log will only have something like "rejected: not enough seeders".
We could validate everything on every torrent, but that would be a little slower.

We could 1.) remove obviously wrong episodes (wrong season, wrong date), then 2.) process ALL validations for ALL remaining results, noting ALL rejection reasons.

e.g. 3 results:
RESULT 1: the blah show Season 1.torrent, 10 seeders, 20 leechers, 1000mb
RESULT 2: the blah show S02E02.torrent, 10 seeders, 20 leechers, 300mb
RESULT 3: the blah show 02x02.torrent, 50 seeders, 100 leechers, 350mb

Loop through: remove RESULT 1, as it's a season, not an episode; validate RESULT 2 and reject it because it doesn't have enough seeders and the file is too small; RESULT 3 passes all validations.
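
The two-step idea can be sketched as below, using the three results from the example. The type names are hypothetical, and the thresholds (20 seeders, 320 MB) and the "title contains Season" test are made-up values chosen only to reproduce the outcome described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types; the thresholds used in main are made-up example values.
final class Candidate {
    final String title;
    final int seeders;
    final int sizeMb;

    Candidate(String title, int seeders, int sizeMb) {
        this.title = title;
        this.seeders = seeders;
        this.sizeMb = sizeMb;
    }
}

interface Check {
    // Returns null when the candidate passes, otherwise a reject reason.
    String reasonOrNull(Candidate candidate);
}

final class TwoPhaseValidation {
    // Phase 1 drops obviously wrong results; phase 2 runs every check on the
    // rest and keeps all reject reasons (an empty list means "passed").
    // Results are keyed by title here, which assumes titles are unique.
    static Map<String, List<String>> run(
            List<Candidate> results, Check prefilter, List<Check> checks) {
        Map<String, List<String>> reasonsByTitle = new LinkedHashMap<>();
        for (Candidate candidate : results) {
            if (prefilter.reasonOrNull(candidate) != null) {
                continue; // removed up front, not validated any further
            }
            List<String> reasons = new ArrayList<>();
            for (Check check : checks) {
                String reason = check.reasonOrNull(candidate);
                if (reason != null) {
                    reasons.add(reason);
                }
            }
            reasonsByTitle.put(candidate.title, reasons);
        }
        return reasonsByTitle;
    }

    public static void main(String[] args) {
        Check episodeOnly = c -> c.title.contains("Season") ? "season pack" : null;
        Check minSeeders = c -> c.seeders < 20 ? "not enough seeders" : null;
        Check minSize = c -> c.sizeMb < 320 ? "file too small" : null;
        Map<String, List<String>> report = run(
            Arrays.asList(
                new Candidate("the blah show Season 1.torrent", 10, 1000),
                new Candidate("the blah show S02E02.torrent", 10, 300),
                new Candidate("the blah show 02x02.torrent", 50, 350)),
            episodeOnly,
            Arrays.asList(minSeeders, minSize));
        report.forEach((title, reasons) -> System.out.println(title + " -> " + reasons));
    }
}
```

The extra cost compared with stopping at the first failure is one pass of every check over each result that survives the prefilter.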

Comment by Jofo (JoFo) - Saturday, 27 February 2010, 00:30 GMT+2

That was my main point also. I think we shouldn't continue parsing if we know that one of the filters failed. That would just be unnecessary overhead in my opinion.
