- Simplified Scrapy add-ons
- Python 3 support
- IPython IDE for Scrapy
- Scrapy benchmarking suite
- Support for spiders in other languages
- Scrapy integration tests
- New HTTP/1.1 download handler
- New Scrapy signal dispatching
- Asyncio support proof of concept
Scrapinghub and GSoC 2015
At Scrapinghub, we love open source and we know the community can build amazing things.
If you haven’t heard about it already, Google Summer of Code is a global program that offers students stipends to write code for open source projects. Scrapinghub is applying to GSoC for the second time, having participated in GSoC 2014. Julia Medina, our student last year, did amazing work on Scrapy’s API and settings. And this year, she’s mentoring!
If you're interested in participating in GSoC 2015 as a student, take a look at the curated list of ideas below. Check the corresponding “Information for Students” section and get in touch with the mentors. Don’t be afraid, we’re nice people :)
We would be thrilled to see any of the ideas below happen, but these are just our ideas. You are free to come up with a new subject, preferably around information retrieval :)
Let’s make it a great Google Summer of Code!
Scrapy Ideas for GSoC 2015
Scrapy and Google Summer of Code
Scrapy is a very popular web crawling and scraping framework for Python (15th among GitHub's top trending Python projects), used to write spiders for crawling and extracting data from websites. Scrapy has a healthy and active community, and it's applying for Google Summer of Code in 2015.
Information for Students
If you're interested in participating in GSoC 2015 as a student, you should join the scrapy-users mailing list and post your questions and ideas there. You can also join the #scrapy IRC channel on Freenode to chat with other Scrapy users and developers. All Scrapy development happens in the Scrapy repo on GitHub.
Simplified Scrapy add-ons
|Brief explanation||Scrapy currently supports many hooks and mechanisms for extending its functionality, but no single entry point for enabling and configuring them. Enabling an extension often requires modifying many settings, often in a coordinated way, which is complex and error prone. This project is meant to provide a unified and simplified way to hook up functionality without dealing with middlewares, pipelines or individual components.|
|Expected Results||Adding or removing extensions should be just a matter of adding or removing lines in a scrapy.cfg file. The implementation must be backward compatible with enabling extensions the "old way" (i.e., modifying settings directly).|
|Required skills||Python, general understanding of Scrapy extensions desirable but not required|
|Mentor(s)||Pablo Hoffman, Julia Medina|
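As an illustration of the idea only, here is a minimal sketch of how a list of add-ons in a config file might be expanded into the coordinated settings they require. The `[addons]` section name, the `ADDON_REGISTRY` mapping, the `settings_from_addons` helper and the `myproject.*` paths are all hypothetical. `EXTENSIONS` and `DOWNLOADER_MIDDLEWARES` are existing Scrapy settings, but this is not Scrapy's actual add-on mechanism.

```python
import configparser

# Hypothetical registry: each add-on name maps to every setting it needs,
# so the user never has to coordinate these entries by hand.
ADDON_REGISTRY = {
    "cache_stats": {
        "EXTENSIONS": {"myproject.ext.CacheStats": 500},
        "DOWNLOADER_MIDDLEWARES": {"myproject.mw.CacheStatsMiddleware": 901},
    },
}


def settings_from_addons(cfg_text, base_settings=None):
    """Merge the settings of every add-on listed under [addons] into a copy
    of base_settings (the "old way" settings keep working untouched)."""
    parser = configparser.ConfigParser(allow_no_value=True)
    parser.read_string(cfg_text)
    settings = {key: dict(value) for key, value in (base_settings or {}).items()}
    if parser.has_section("addons"):
        for name in parser.options("addons"):
            for key, components in ADDON_REGISTRY[name].items():
                settings.setdefault(key, {}).update(components)
    return settings


# Hypothetical config: enabling the add-on is a single line.
CFG = """
[addons]
cache_stats
"""

settings = settings_from_addons(CFG, {"EXTENSIONS": {"myproject.ext.Old": 100}})
print(settings)
```

Removing the `cache_stats` line from the config would drop both the extension and the middleware in one step, which is the kind of coordination the project aims to automate.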
Python 3 support
|Brief explanation||Add Python 3.3 support to Scrapy, keeping 2.7 compatibility. The main challenge with this task is that Twisted (a library that Scrapy is built upon) does not yet support Python 3, and Twisted is quite large. However, Scrapy only uses a (very small) subset of Twisted. Students working on this should be prepared to port (or drop) certain parts of Twisted that do not yet support Python 3.|
|Expected Results||The Scrapy test suite should pass most tests and basic spiders should work under Python 3.3, at least on Linux (ideally also on Mac/Windows).|
|Required skills||Python 2 & 3, some Testing and Twisted background|
|Mentor(s)||Mikhail Korobov, Julia Medina, Daniel Graña|
IPython IDE for Scrapy
|Brief explanation||Develop a better IPython + Scrapy integration that would display the HTML page inline in the console, provide some interactive widgets and run Python code against the results. Here is an old scrapy-ipython proof of concept demo. See also: Splash custom IPython/Jupyter kernel.|
|Expected Results||It should become possible to develop Scrapy spiders interactively and visually inside IPython notebooks.|
|Mentor(s)||Mikhail Korobov, Shane Evans|
Scrapy benchmarking suite
|Brief explanation||Develop a more comprehensive benchmarking suite. Profile and address CPU bottlenecks found. Address both known memory inefficiencies (which will be provided) and new ones uncovered.|
|Expected Results||Reusable benchmarks, measurable performance improvements.|
|Required skills||Python, Profiling, Algorithms and Data Structures|
|Mentor(s)||Mikhail Korobov, Daniel Graña, Shane Evans|
Support for spiders in other languages
|Brief explanation||A project that allows users to define a Scrapy spider by creating a standalone script or executable.|
|Expected Results||Demo spiders in a programming language other than Python, documented API and tests.|
|Required skills||Python and another programming language|
|Mentor(s)||Shane Evans, Pablo Hoffman|
Scrapy has a lot of useful functionality not available in frameworks for other programming languages. The goal of this project is to allow developers to write spiders simply and easily in any programming language, while permitting Scrapy to manage concurrency, scheduling, item exporting, caching, etc. This project takes inspiration from Hadoop Streaming, a utility that allows Hadoop MapReduce jobs to be written in any language.
This task will involve writing a Scrapy spider that forks a process and communicates with it using a protocol that needs to be defined and documented. It should also allow for crashed processes to be restarted without stopping the crawl.
- Library support in Python and another language. This should make writing spiders similar to how it is currently done in Scrapy.
- Recycle spiders periodically (e.g. to control memory usage)
- Use multiple cores by forking multiple processes and load balancing between them.
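The process-based approach described above could look roughly like the following sketch. It assumes a line-delimited JSON protocol over stdin/stdout, which is only one possible choice since the real protocol still needs to be defined. The `ExternalSpider` class, the message fields and `CHILD_SOURCE` are hypothetical, and a small Python script stands in for a spider written in another language.

```python
import json
import subprocess
import sys

# Stand-in for a spider in another language: it reads one JSON message per
# line on stdin and writes one JSON item per line on stdout.
CHILD_SOURCE = r"""
import json, sys
for line in sys.stdin:
    msg = json.loads(line)
    if msg["type"] == "response":
        item = {"type": "item", "url": msg["url"], "length": len(msg["body"])}
        sys.stdout.write(json.dumps(item) + "\n")
        sys.stdout.flush()
"""


class ExternalSpider:
    """Hypothetical wrapper that forks a child process and talks to it."""

    def __init__(self, argv):
        self.argv = argv
        self.proc = None
        self.start()

    def start(self):
        # Release the pipes of a previous (dead) child before relaunching.
        if self.proc is not None:
            self.proc.stdin.close()
            self.proc.stdout.close()
        self.proc = subprocess.Popen(
            self.argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
            text=True, bufsize=1,
        )

    def send_response(self, url, body):
        # Restart a crashed child instead of stopping the crawl.
        if self.proc.poll() is not None:
            self.start()
        message = {"type": "response", "url": url, "body": body}
        self.proc.stdin.write(json.dumps(message) + "\n")
        self.proc.stdin.flush()
        return json.loads(self.proc.stdout.readline())

    def close(self):
        self.proc.stdin.close()   # EOF lets the child exit its read loop
        self.proc.wait()
        self.proc.stdout.close()


spider = ExternalSpider([sys.executable, "-c", CHILD_SOURCE])
item = spider.send_response("http://example.com/", "<html></html>")

spider.proc.kill()                # simulate a crashed spider process
spider.proc.wait()
item_after_restart = spider.send_response("http://example.com/2", "ok")
spider.close()
print(item, item_after_restart)
```

A real implementation would sit inside Scrapy's non-blocking machinery rather than block on `readline()`. The sketch only shows the message exchange and the restart-on-crash behaviour.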
Scrapy integration tests
|Brief explanation||Add integration tests for different networking scenarios.|
|Expected Results||Be able to test both vertical and horizontal crawling against websites on the same and on different IPs, respecting throttling and handling timeouts, retries and DNS failures. It must be simple to define new scenarios with predefined components (websites, proxies, routers, injected error rates).|
|Required skills||Python, Networking and Virtualization|
New HTTP/1.1 download handler
|Brief explanation||Replace the current HTTP/1.1 download handler with an in-house solution that is easily customizable to crawling needs. The current HTTP/1.1 download handler depends on code shipped with Twisted that is not easily extensible by us. We ship Twisted code under scrapy.xlib.tx to support running Scrapy on older Twisted versions for distributions that don't ship up-to-date Twisted packages, but this is an ongoing cat-and-mouse game. The HTTP download handler is an essential component of a crawling framework, and having no control over its release cycle leaves us with code that is hard to support. The idea of this task is to depart from the current Twisted code and look for a design that can cover current and future needs, taking into account that the goal is to deal with websites that don't follow standards to the letter.|
|Expected Results||An HTTP parser that degrades gracefully when parsing invalid responses, filtering out the offending headers and cookies as browsers do. It must be able to avoid downloading responses bigger than a size limit, and it can be configured to throttle the bandwidth used per download. If there is enough time, it can lay out the interface for response streaming and support features such as HTTP pipelining.|
|Required skills||Python, Twisted and HTTP protocol|
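To make the expected behaviour concrete, here is a minimal sketch of lenient header parsing with a size-limit check. The function names and the rejection rules are assumptions chosen for illustration. This is not the proposed handler and it uses no Twisted or Scrapy code.

```python
def parse_headers_leniently(raw):
    """Split a raw header block into (headers, rejected).

    Malformed lines are collected in `rejected` instead of aborting the
    whole response, roughly the way a browser skips offending headers.
    """
    headers, rejected = [], []
    for line in raw.split(b"\n"):
        line = line.rstrip(b"\r")          # tolerate bare LF line endings
        if not line:
            break                          # a blank line ends the header block
        name, sep, value = line.partition(b":")
        name = name.strip()
        # Reject lines with no colon, no name, or control/space/non-ASCII
        # bytes in the header name.
        if not sep or not name or any(c <= 32 or c >= 127 for c in name):
            rejected.append(line)
            continue
        headers.append((name.decode("ascii").lower(),
                        value.strip().decode("latin-1")))
    return headers, rejected


def exceeds_size_limit(headers, max_size):
    """Return True if a well-formed Content-Length is above max_size."""
    for name, value in headers:
        if name == "content-length" and value.isdigit():
            return int(value) > max_size
    return False


RAW = (b"Content-Type: text/html\r\n"
       b"this line is garbage\r\n"
       b"Bad Name: x\n"
       b"Content-Length: 42\r\n"
       b"\r\n"
       b"body")

headers, rejected = parse_headers_leniently(RAW)
print(headers)
print(rejected)
print(exceeds_size_limit(headers, max_size=10))
```

Checking `Content-Length` before reading the body is only a first line of defence. A server can omit or misreport it, so a real handler would also need to count bytes while downloading.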
New Scrapy signal dispatching
|Brief explanation||Profile and look for alternatives to the backend of our signal dispatcher, which is based on the PyDispatcher library. Django moved away from PyDispatcher many years ago, which simplified its API and improved its performance. We are looking to do the same with Scrapy. A major challenge of this task is to make the transition as seamless as possible, providing good documentation and guidelines, along with as much backwards compatibility as possible.|
|Expected Results||The new signal dispatching implemented, documented and tested, with backwards compatibility support.|
|Mentor(s)||Daniel Graña, Pablo Hoffman, Julia Medina|
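For orientation, the following is a minimal sketch of a dispatcher where each signal is an object that keeps its own receiver list, loosely in the style of Django's `Signal` objects. It is an assumption about one possible direction, not Scrapy's or Django's actual implementation, and the `item_scraped` instance and `log_item` receiver are illustrative names.

```python
class Signal:
    """A signal object that owns its list of receivers."""

    def __init__(self, name):
        self.name = name
        self._receivers = []

    def connect(self, receiver):
        # Ignore duplicate registrations of the same callable.
        if receiver not in self._receivers:
            self._receivers.append(receiver)

    def disconnect(self, receiver):
        if receiver in self._receivers:
            self._receivers.remove(receiver)

    def send(self, sender=None, **kwargs):
        # Call every receiver and return (receiver, response) pairs.
        # Iterating over a copy lets receivers disconnect themselves.
        return [(receiver, receiver(signal=self, sender=sender, **kwargs))
                for receiver in list(self._receivers)]


item_scraped = Signal("item_scraped")
seen = []


def log_item(signal, sender, item, **kwargs):
    seen.append(item)
    return item["title"]


item_scraped.connect(log_item)
responses = item_scraped.send(sender="demo-spider", item={"title": "Hello"})

item_scraped.disconnect(log_item)
responses_after_disconnect = item_scraped.send(sender="demo-spider", item={})
print(responses, responses_after_disconnect)
```

Keeping receivers on the signal object avoids a global registry lookup on every send, which is one place where profiling the existing backend could be compared against an alternative.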
Asyncio support proof of concept
|Brief explanation||The asyncio library provides infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives. We are looking to see how it fits into Scrapy architecture.|
|Expected Results||A simple proof of concept of an asyncio based Scrapy.|
|Mentor(s)||Juan Riaza, Steven Almeroth|
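As a rough illustration of the coroutine style that asyncio offers, the sketch below runs several simulated downloads concurrently on a single thread with a concurrency cap. It uses the modern `async`/`await` syntax and `asyncio.run`, which are newer than the asyncio available in 2015. `fetch` and `crawl` are hypothetical names, and `asyncio.sleep` stands in for real network I/O.

```python
import asyncio


async def fetch(url, delay):
    # Simulated download: a real proof of concept would perform network I/O here.
    await asyncio.sleep(delay)
    return {"url": url, "status": 200}


async def crawl(urls, concurrency=2):
    # The semaphore caps how many downloads are in flight at once.
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch(url, delay=0.01)

    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))


urls = ["http://example.com/%d" % n for n in range(5)]
results = asyncio.run(crawl(urls))
print(results)
```

The semaphore plays a role comparable to a concurrent-requests limit. How Scrapy's scheduler, middlewares and Twisted-based downloader would map onto this model is exactly what the proof of concept would need to explore.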
Portia Ideas for GSoC 2015
Information for Students
If you're interested in participating in GSoC 2015 as a student, you should join the portia-scraper mailing list and post your questions and ideas there. All Portia development happens in the Portia repo on GitHub.
Browser Addon for Portia
|Brief explanation||With Portia a user needs to browse through a website within the Portia webapp and scrape from there. This project hopes to allow users to define new spiders and templates as they normally browse the web without having to specifically open a website within Portia. Using a browser addon a user would be able to launch Portia toolboxes at the click of a button and start scraping straight away.|
|Mentor(s)||Joaquin Sargiotto, Ruairi Fahy|
Portia Spider Generation
|Brief explanation||One problem with traditionally scraping websites using XPath and CSS selectors is that when a website changes its layout your spiders may no longer work. This project aims to use crawl datasets to try to build new Portia spiders from website content and extracted data, repair spiders if the website layout has changed and then merge the templates used by the spiders into a small manageable number.|
|Mentor(s)||Ruairi Fahy, Shane Evans|
Splash Ideas for GSoC 2015
Information for Students
Splash doesn't yet have a mailing list, so if you're interested in discussing any of these ideas, drop us a line via email at firstname.lastname@example.org, or open an issue on GitHub. You can also check the documentation at https://splash.readthedocs.org/en/latest/.
All Splash development happens in the Splash repo on GitHub.
|Brief explanation||Splash is written in Python 2.x, uses PyQt4 / Qt 4, and runs on Ubuntu 12.04 when installed using Docker. We should port Splash to Python 3.x, Qt 5 and a more recent OS, and maybe use asyncio/aiohttp instead of Twisted.|
|Expected Results||All tests should pass under Python 3.4 with Qt 5.4. Splash should use a more recent Ubuntu or Debian and run on Python 3.x + Qt 5.x by default. It is fine to drop Python 2.x support.|
|Required skills||Python 2 and Python 3, PyQt|
|Mentor(s)||Mikhail Korobov, Pablo Hoffman, Denis Shpektorov|
Web Scraping Helpers
|Brief explanation||Currently there is no easy way to click a link, fill in and submit a form, or extract data from a webpage using Splash Scripts (see http://splash.readthedocs.org/en/master/scripting-tutorial.html). We should develop a helper library to make these (and related) tasks easy.|
|Expected Results||A set of useful functions available by default. We should provide web scraping helpers similar to the ones provided by Scrapy, Selenium, PhantomJS/CasperJS, etc.|
|Mentor(s)||Mikhail Korobov, Pablo Hoffman, Denis Shpektorov|