serpextract Package
serpextract.serpextract Package
serpextract.serpextract Module
Utilities for extracting keyword information from search engine referrers.
-
serpextract.serpextract.get_parser(referring_url)
Utility function to find a parser for a referring URL if it is a SERP.
Parameters:
- referring_url (str or urlparse.ParseResult) – Suspected SERP URL.
Returns: SearchEngineParser object if one exists for the URL, None otherwise.
-
serpextract.serpextract.is_serp(referring_url, parser=None, use_naive_method=False)
Utility function to determine if a referring URL is a SERP.
Parameters:
- referring_url (str or urlparse.ParseResult) – Suspected SERP URL.
- parser (SearchEngineParser instance or None) – A search engine parser.
- use_naive_method (True or False) – Whether or not to use a naive method of search engine detection in the event that a parser does not exist for the given referring_url. See extract() for more information.
Returns: True if SERP, False otherwise.
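The naive fallback behind use_naive_method is described under extract(): look for a pattern in the netloc, then try a fixed set of query parameters. The following is a minimal standard-library sketch of that idea, not the library's implementation. NAIVE_PATTERN, NAIVE_PARAMS, naive_extract and naive_is_serp are hypothetical names with assumed values, and treating "a keyword was found" as the SERP test is also an assumption. It uses Python 3's urllib.parse, the successor of the urlparse module referenced in this page.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical stand-ins for the library's private _naive_re_pattern and
# _naive_params; the real values may differ.
NAIVE_PATTERN = re.compile(r"\.?search\.")
NAIVE_PARAMS = ("q", "query", "k", "keyword", "term")


def naive_extract(url):
    """Return a keyword if the URL naively looks like a SERP, else None."""
    parts = urlparse(url)
    # Only consider hosts whose netloc contains the naive pattern.
    if not NAIVE_PATTERN.search(parts.netloc):
        return None
    query = parse_qs(parts.query)
    # Try each candidate parameter in order and use the first non-empty one.
    for param in NAIVE_PARAMS:
        values = query.get(param)
        if values:
            return values[0]
    return None


def naive_is_serp(url):
    """Assumption: a URL counts as a SERP when a keyword can be extracted."""
    return naive_extract(url) is not None
```

A host such as search.example.com with a q parameter would pass this check, while www.example.com would not, regardless of its query string.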
-
serpextract.serpextract.extract(serp_url, parser=None, lower_case=True, trimmed=True, collapse_whitespace=True, use_naive_method=False)
Parse a SERP URL and return information regarding the engine name, keyword and SearchEngineParser.
Parameters:
- serp_url (str or urlparse.ParseResult) – Suspected SERP URL to extract a keyword from.
- parser (SearchEngineParser) – Optionally pass in a parser if already determined via a call to get_parser().
- lower_case (True or False) – Lowercase the keyword.
- trimmed (True or False) – Trim leading and trailing whitespace from the keyword.
- collapse_whitespace (True or False) – Collapse 2 or more \s characters into one space ' '.
- use_naive_method (True or False) – In the event that a parser doesn’t exist for the given serp_url, attempt to find an instance of _naive_re_pattern in the netloc of the serp_url. If found, try to extract a keyword using _naive_params.
Returns: an ExtractResult instance if serp_url is valid, None otherwise.
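The three keyword clean-up flags of extract() can be illustrated with a small standalone function. This is a sketch of the documented behaviour only: normalize_keyword is a hypothetical helper, and the order in which the steps are applied is an assumption.

```python
import re


def normalize_keyword(keyword, lower_case=True, trimmed=True,
                      collapse_whitespace=True):
    """Apply the clean-up steps described for extract() to a raw keyword."""
    if lower_case:
        keyword = keyword.lower()
    if trimmed:
        # Remove leading and trailing whitespace.
        keyword = keyword.strip()
    if collapse_whitespace:
        # Replace any run of 2 or more whitespace characters with one space.
        keyword = re.sub(r"\s{2,}", " ", keyword)
    return keyword
```

With all flags left at their defaults, a raw keyword such as "  Hello   World " comes back as "hello world".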
-
serpextract.serpextract.get_all_query_params()
Return all the possible query string params for all search engines.
Returns: a list of all the unique query string parameters that are used across the search engine definitions.
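Collecting the unique parameters across engine definitions amounts to a set union. The sketch below shows the idea on made-up data: ENGINES, its dictionary layout and unique_query_params are hypothetical, and sorting the result is an assumption made only to keep the output deterministic.

```python
def unique_query_params(engine_definitions):
    """Collect each query parameter once across all engine definitions."""
    params = set()
    for definition in engine_definitions:
        params.update(definition["params"])
    return sorted(params)


# Hypothetical definitions for illustration only; the real engine list
# is loaded from Piwik's search engine data.
ENGINES = [
    {"name": "Engine A", "params": ["q", "query"]},
    {"name": "Engine B", "params": ["q", "p"]},
]
```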
-
serpextract.serpextract.get_all_query_params_by_domain()
Return all the possible query string params for all search engines.
Returns: a list of all the unique query string parameters that are used across the search engine definitions.
-
serpextract.serpextract.add_custom_parser(match_rule, parser)
Add a custom search engine parser to the cached _engines list.
Parameters:
- match_rule (unicode) – A match rule which is used by get_parser() to look up a parser for a given domain/path.
- parser (SearchEngineParser) – A custom parser.
-
class serpextract.serpextract.SearchEngineParser(engine_name, keyword_extractor, link_macro, charsets, hidden_keyword_paths=None)
Bases: object
Handles parsing logic for a single line in Piwik’s list of search engines.
Piwik’s list for reference:
https://raw.github.com/piwik/piwik/master/core/DataFiles/SearchEngines.php
This class is not used directly since it already assumes you know the exact search engine you want to use to parse a URL. The main interface for users of this module is the extract() function.
-
charsets
-
engine_name
-
get_serp_url(base_url, keyword)
Get a URL to a SERP for a given keyword.
Parameters:
- base_url (str) – String of format '<scheme>://<netloc>'.
- keyword (str) – Search engine keyword.
Returns: a URL that links directly to a SERP for the given keyword.
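One plausible way to build such a URL is to substitute the URL-encoded keyword into the parser's link_macro and append the result to base_url. The sketch below assumes the macro uses a {k} placeholder, as entries in Piwik's list do (for example 'search?q={k}'). build_serp_url is a hypothetical helper, not the method itself.

```python
from urllib.parse import quote_plus


def build_serp_url(base_url, link_macro, keyword):
    """Join base_url and link_macro, filling in the encoded keyword."""
    # Assumption: '{k}' marks where the keyword goes in the macro.
    path = link_macro.replace("{k}", quote_plus(keyword))
    return "{}/{}".format(base_url.rstrip("/"), path)
```

For instance, a base URL of 'https://www.example.com' with the macro 'search?q={k}' and the keyword 'red shoes' would give 'https://www.example.com/search?q=red+shoes'.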
-
keyword_extractor
-
link_macro
-
parse(url_parts)
Parse a SERP URL to extract the search keyword.
Parameters:
- url_parts (A urlparse.ParseResult with all elements as unicode) – The SERP URL.
Returns: An ExtractResult instance.
-