serpextract Package

serpextract.serpextract Module

Utilities for extracting keyword information from search engine referrers.

serpextract.serpextract.get_parser(referring_url)

Utility function to find a parser for a referring URL if it is a SERP.

Parameters: referring_url (str or urlparse.ParseResult) – Suspected SERP URL.
Returns: SearchEngineParser object if one exists for URL, None otherwise.

serpextract.serpextract.is_serp(referring_url, parser=None, use_naive_method=False)

Utility function to determine if a referring URL is a SERP.

Parameters:
  • referring_url (str or urlparse.ParseResult) – Suspected SERP URL.
  • parser (SearchEngineParser instance or None) – A search engine parser.
  • use_naive_method (True or False) – Whether or not to use a naive method of search engine detection in the event that a parser does not exist for the given referring_url. See extract() for more information.
Returns:

True if SERP, False otherwise.
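
The naive fallback selected by use_naive_method can be pictured roughly as follows. This is a minimal sketch built on the standard library only, not serpextract's own code: the pattern, the parameter names, and the helper name looks_like_serp are assumptions chosen for illustration, and urllib.parse is used as the Python 3 home of urlparse.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed stand-ins for the module's _naive_re_pattern and _naive_params.
NAIVE_PATTERN = re.compile(r"\.?search\.")
NAIVE_PARAMS = ("q", "query", "k", "keyword", "term")


def looks_like_serp(referring_url):
    """Return True if the netloc looks like a search host and carries a keyword param."""
    # Accept either a URL string or an already parsed result.
    parts = urlparse(referring_url) if isinstance(referring_url, str) else referring_url
    if not NAIVE_PATTERN.search(parts.netloc):
        return False
    query = parse_qs(parts.query)
    # parse_qs drops blank values, so a present key implies a non-empty keyword.
    return any(query.get(name) for name in NAIVE_PARAMS)
```

A host such as search.example.com with a q parameter would pass this check, while www.example.com would not, whatever its query string contains.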

serpextract.serpextract.extract(serp_url, parser=None, lower_case=True, trimmed=True, collapse_whitespace=True, use_naive_method=False)

Parse a SERP URL and return information regarding the engine name, keyword and SearchEngineParser.

Parameters:
  • serp_url (str or urlparse.ParseResult) – Suspected SERP URL to extract a keyword from.
  • parser (SearchEngineParser) – Optionally pass in a parser if already determined via call to get_parser.
  • lower_case (True or False) – Lower case the keyword.
  • trimmed (True or False) – Trim keyword leading and trailing whitespace.
  • collapse_whitespace (True or False) – Collapse each run of two or more whitespace (\s) characters into a single space ' '.
  • use_naive_method (True or False) – In the event that a parser doesn’t exist for the given serp_url, attempt to find an instance of _naive_re_pattern in the netloc of the serp_url. If found, try to extract a keyword using _naive_params.
Returns:

An ExtractResult instance if serp_url is valid, None otherwise.
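
The three keyword clean-up flags described above (lower_case, trimmed, collapse_whitespace) can be illustrated in isolation. The sketch below is an assumption about how such flags might be applied, including the order of the steps; normalize_keyword is a hypothetical helper and not part of serpextract.

```python
import re


def normalize_keyword(keyword, lower_case=True, trimmed=True, collapse_whitespace=True):
    """Apply the three optional clean-up steps to an extracted keyword."""
    if lower_case:
        keyword = keyword.lower()
    if trimmed:
        # Remove leading and trailing whitespace only.
        keyword = keyword.strip()
    if collapse_whitespace:
        # Replace each run of two or more whitespace characters with one space.
        keyword = re.sub(r"\s{2,}", " ", keyword)
    return keyword
```

With all flags left at their defaults, an input such as "  Red   SHOES  " would come out as "red shoes"; switching a flag off leaves that aspect of the keyword untouched.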

serpextract.serpextract.get_all_query_params()

Return all the possible query string params for all search engines.

Returns: a list of all the unique query string parameters that are used across the search engine definitions.

serpextract.serpextract.get_all_query_params_by_domain()

Return all the possible query string params for all search engines, grouped by domain.

Returns: the query string parameters that are used across the search engine definitions, keyed by search engine domain.
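
To make the difference between the two query-parameter helpers concrete, here is a small sketch over made-up engine definitions. The ENGINES table and both function names are hypothetical, and the per-domain return shape is an assumption based on the function name rather than on the library's source.

```python
# Hypothetical engine definitions: domain -> query parameters that carry the keyword.
ENGINES = {
    "google.com": ["q"],
    "bing.com": ["q"],
    "search.yahoo.com": ["p", "q"],
}


def all_query_params(engines):
    """Return a sorted list of the unique parameters across all engines."""
    return sorted({param for params in engines.values() for param in params})


def query_params_by_domain(engines):
    """Return a dict mapping each domain to its own list of parameters."""
    return {domain: list(params) for domain, params in engines.items()}
```

The flat variant is convenient for a quick "does this URL carry any known keyword parameter" check, while the per-domain variant keeps enough context to tell which parameter applies to which engine.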
serpextract.serpextract.add_custom_parser(match_rule, parser)

Add a custom search engine parser to the cached _engines list.

Parameters:
  • match_rule (unicode) – A match rule which is used by get_parser() to look up a parser for a given domain/path.
  • parser (SearchEngineParser) – A custom parser.
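
The relationship between a match rule and parser lookup can be sketched with a plain dictionary. Everything here is an illustrative assumption: the registry, the rule format (netloc, optionally followed by a path), and the names register_parser and find_parser are hypothetical and do not reflect how serpextract stores its _engines list.

```python
from urllib.parse import urlparse

# Hypothetical registry: match rule (netloc, optionally followed by a path) -> parser.
_registry = {}


def register_parser(match_rule, parser):
    """Store a parser under a match rule such as 'search.example.com/find'."""
    _registry[match_rule] = parser


def find_parser(referring_url):
    """Try the more specific netloc + path rule first, then the netloc alone."""
    parts = urlparse(referring_url) if isinstance(referring_url, str) else referring_url
    for rule in (parts.netloc + parts.path, parts.netloc):
        if rule in _registry:
            return _registry[rule]
    return None
```

Checking the longer rule first lets a path-specific parser take precedence over a host-wide one registered for the same domain.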
class serpextract.serpextract.SearchEngineParser(engine_name, keyword_extractor, link_macro, charsets, hidden_keyword_paths=None)

Bases: object

Handles parsing logic for a single line in Piwik’s list of search engines.

Piwik’s list for reference:

https://raw.github.com/piwik/piwik/master/core/DataFiles/SearchEngines.php

This class is not normally used directly, since it assumes you already know exactly which search engine should parse a URL. The main interface for users of this module is the extract() function.

charsets
engine_name
get_serp_url(base_url, keyword)

Get a URL to a SERP for a given keyword.

Parameters:
  • base_url (str) – String of format '<scheme>://<netloc>'.
  • keyword (str) – Search engine keyword.
Returns:

A URL that links directly to a SERP for the given keyword.
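
Building a SERP link from a base URL and a keyword can be sketched as below. The '{k}' placeholder convention for the link macro and the helper name build_serp_url are assumptions for illustration; the sketch does not reproduce the class's actual get_serp_url.

```python
from urllib.parse import quote_plus


def build_serp_url(base_url, link_macro, keyword):
    """Join base_url and a link macro, substituting the URL-encoded keyword for '{k}'."""
    path = link_macro.replace("{k}", quote_plus(keyword))
    # Normalise the slash between '<scheme>://<netloc>' and the macro.
    return base_url.rstrip("/") + "/" + path.lstrip("/")
```

For instance, a base URL of 'https://www.example.com' with the macro 'search?q={k}' and the keyword 'red shoes' would produce a link ending in 'search?q=red+shoes'.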

hidden_keyword_paths
keyword_extractor
parse(url_parts)

Parse a SERP URL to extract the search keyword.

Parameters: url_parts (A urlparse.ParseResult with all elements as unicode) – The parsed SERP URL.
Returns: An ExtractResult instance.
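
The core of keyword extraction from an already parsed URL can be sketched with the standard library. This is an illustration under stated assumptions, not the class's parse() implementation: extract_keyword and its param_names argument are hypothetical, and charset handling and hidden keyword paths are left out entirely.

```python
from urllib.parse import urlparse, parse_qs


def extract_keyword(url_parts, param_names):
    """Return the first non-empty value among param_names, or None if none is set."""
    query = parse_qs(url_parts.query)
    for name in param_names:
        values = query.get(name)
        if values:
            # parse_qs already percent-decodes and turns '+' into spaces.
            return values[0]
    return None
```

The order of param_names matters: the first parameter that is present wins, which mirrors the idea of an engine listing its keyword parameters by priority.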
class serpextract.serpextract.ExtractResult(engine_name, keyword, parser)