Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the apache-spark programming model to Python.

0
votes
0 answers
4 views

Reset hadoop aws keys to upload to another s3 bucket under different username

Sorry for the horrible question title, but here is my scenario: I have a PySpark Databricks notebook in which I am loading other notebooks. One of these notebooks sets some Redshift configuration for ...
1
vote
2 answers
17 views

Python Round Function Issues with pyspark

I am relatively new to Spark and I've run into an issue when I try to use Python's builtin round() function after importing PySpark functions. It seems to have to do with how I import the pyspark ...
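A minimal sketch of the likely cause and fix: "from pyspark.sql.functions import *" replaces Python's builtin round() with pyspark.sql.functions.round(), which only accepts Column arguments. A namespaced import avoids the clash (the df and column names below are illustrative):

    import builtins
    from pyspark.sql import functions as F   # namespaced import; nothing gets shadowed

    builtins.round(2.5678, 2)    # 2.57 -- the Python builtin, always reachable this way
    # F.round(df["price"], 2)    # Spark's round(), for Column expressions only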
0
votes
0 answers
12 views

Load/import a CSV file into MongoDB using PySpark

I want to know how to load/import a CSV file into MongoDB using PySpark. I have a CSV file named cal.csv on the desktop. Can somebody share a code snippet?
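A sketch of one common approach, using the MongoDB Spark connector; the package coordinates, URI, and database/collection names are assumptions, not taken from the question:

    # Assumes the connector is on the classpath, e.g. started with:
    #   pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("csv-to-mongo")
             .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/mydb.cal")  # placeholder URI
             .getOrCreate())

    df = spark.read.csv("/path/to/cal.csv", header=True, inferSchema=True)
    df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()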
0
votes
0 answers
5 views

AWS Data Pipeline to run EMR jobs stored in Git

I want to use AWS Data Pipeline to schedule EMR jobs. I am stuck at a step where, every time a new pipeline is activated, I have to copy code from Git to the server, pip install some modules, and run the ...
0
votes
3 answers
28 views

Python Spark - How to remove duplicate elements in a set regardless of ordering?

By using .filter(func), I got the output below. My output: [((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3)), ((2, 1), (4, 2), (6, 3))] The output I need is only the 3 coordinates. My desired ...
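Since the three tuples are exact duplicates, RDD.distinct() is the usual answer; a minimal sketch, assuming an existing SparkContext sc, with data mirroring the question's output:

    rdd = sc.parallelize([((2, 1), (4, 2), (6, 3)),
                          ((2, 1), (4, 2), (6, 3)),
                          ((2, 1), (4, 2), (6, 3))])
    rdd.distinct().collect()    # [((2, 1), (4, 2), (6, 3))]
    # If duplicates may also differ only in element ordering, normalize first:
    # rdd.map(lambda t: tuple(sorted(t))).distinct()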
0
votes
2 answers
24 views

Pyspark dataframe: Count elements in array or list

Let us assume dataframe df as: df.show() Output:

    +------+----------------+
    |letter| list_of_numbers|
    +------+----------------+
    |     A|    [3, 1, 2, 3]|
    |     B|    [1, 2, 1, 1]|
    +------+-----------...
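A minimal sketch of the usual answer, pyspark.sql.functions.size(), which returns the number of elements in an array column (column names follow the question's example):

    from pyspark.sql import functions as F

    df.withColumn("num_elements", F.size("list_of_numbers")).show()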
0
votes
1 answer
18 views

PySpark - How can I concatenate a string prefix of 0's to another string column based on a condition

I have a DataFrame where it looks like below:

    |string_code|prefix_string_code|
    |1234       |001234            |
    |123        |000123            |
    |56789      |056789            |

Basically what I want ...
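A sketch of one way to get this output: every example in the question pads string_code to six characters with leading zeros, which is what lpad() does; the condition mentioned in the title would wrap this in when()/otherwise():

    from pyspark.sql import functions as F

    df = df.withColumn("prefix_string_code", F.lpad("string_code", 6, "0"))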
0
votes
0 answers
20 views

Is it possible to run different Spark jobs simultaneously on a worker node? [duplicate]

Please explain whether my Spark job flow below is possible or not.

    +-------------+ some transformations +----------+
    | master node |-------------------+--------->| worker 1 |-> export ...
0
votes
0 answers
18 views

How to compute a discounted future cumulative sum with Spark/PySpark window functions or SQL

Can I compute a discounted future cumulative sum using Spark SQL? Below is an example that computes the undiscounted cumulative future sum using window functions, and I hard-coded in what I mean by the ...
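One sketch of a standard trick, not taken from the question: scale each value by gamma**t, take an ordinary future-window sum, then divide by gamma**t again. The column names t and value and the discount factor are assumptions, and the rescaling can lose precision for large t:

    from pyspark.sql import Window, functions as F

    gamma = 0.9   # assumed discount factor
    w = Window.orderBy("t").rowsBetween(Window.currentRow, Window.unboundedFollowing)

    df = (df.withColumn("scaled", F.col("value") * F.pow(F.lit(gamma), F.col("t")))
            .withColumn("disc_future_sum",
                        F.sum("scaled").over(w) / F.pow(F.lit(gamma), F.col("t"))))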
0
votes
1 answer
7 views

pyspark Window function range is going backwards?

I am using a Window function in PySpark to compute the future cumulative sum, but the range is working backwards from what I expect. If I specify all future rows, what I am getting is a cumulative ...
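A sketch of the frame that usually gives a forward-looking sum: the first argument of rowsBetween() is the start of the frame and the second is the end, so a future window runs from the current row to unboundedFollowing (the ordering and value column names here are placeholders):

    from pyspark.sql import Window, functions as F

    w = (Window.orderBy("time")
               .rowsBetween(Window.currentRow, Window.unboundedFollowing))
    df = df.withColumn("future_cumsum", F.sum("value").over(w))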
0
votes
1 answer
36 views

Multiply two pyspark dataframes

I have a PySpark DataFrame, df1, that looks like:

    CustomerID  CustomerValue  CustomerValue2
    15          10             2
    16          10             3
    18          3              3

I have a second ...
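A sketch under the assumption that "multiply" means row-wise products after joining the two frames on CustomerID; df2's value column ("Weight") is a placeholder, since the excerpt is truncated:

    from pyspark.sql import functions as F

    joined = df1.join(df2, on="CustomerID")
    result = joined.select(
        "CustomerID",
        (F.col("CustomerValue") * F.col("Weight")).alias("WeightedValue"),
        (F.col("CustomerValue2") * F.col("Weight")).alias("WeightedValue2"))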
0
votes
1 answer
16 views

TypeError: type Column doesn't define __round__ method

My data looks like this:

    +-------+-------+------+----------+
    |book_id|user_id|rating|prediction|
    +-------+-------+------+----------+
    |    148|    588|     4|  3.953999|
    |    148|  28767|     3|...
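A minimal sketch of the usual fix: Python's builtin round() calls __round__, which Column does not define; Spark ships its own round() for Column expressions:

    from pyspark.sql import functions as F

    df = df.withColumn("prediction", F.round("prediction", 2))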
1
vote
2 answers
42 views

Build spark schema from json schema

I am trying to build a Spark schema that I want to explicitly supply while creating the DataFrame. I can generate the JSON schema using the below: from pyspark.sql.types import StructType # Save schema from ...
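A sketch of the round trip this is usually after: schema.json() serializes a schema to a JSON string, and StructType.fromJson() rebuilds it so it can be supplied explicitly (the data path is a placeholder):

    import json
    from pyspark.sql.types import StructType

    schema_json = df.schema.json()                          # save the schema as JSON
    schema = StructType.fromJson(json.loads(schema_json))   # rebuild it later
    df2 = spark.read.schema(schema).json("/path/to/data")   # supply it explicitly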
1
vote
0 answers
28 views

Cosine Similarity for two pyspark dataframes

I have a PySpark DataFrame, df1, that looks like:

    CustomerID  CustomerValue  CustomerValue2
    12          .17            .08

I have a second PySpark DataFrame, df2:

    CustomerID  CustomerValue ...
-1
votes
0 answers
7 views

Different results when I group by one column and by two columns in PySpark

I want to do this operation: I have a PySpark DataFrame with sold item, month, the client who bought the item, and shop. I want to group by month and shop and count the number of sold items, total money ...
0
votes
0 answers
14 views

PySpark upgrade 2.0.2 to 2.3 not launching

I've got a PySpark cluster set up on Ubuntu 16.xxx and I'm trying to upgrade my version of PySpark from 2.0.2 to 2.3. I had originally installed PySpark directly: wget http://d3kbcqa49mib13....
0
votes
0 answers
10 views

Py4JJavaError while executing .describe() using Pyspark

When I use .describe() in PySpark it results in a Py4JJavaError, whereas I am able to execute other commands successfully. Please see the screenshot below. Display of data using ...
0
votes
2 answers
31 views

Getting the table name from a Spark Dataframe

If I have a DataFrame created as follows: df = spark.table("tblName"), is there any way that I can get tblName back from df?
1
vote
0 answers
10 views

In PySpark using Python - row-wise, check whether one particular column contains any of a list of keywords. If yes, copy the matched keyword(s) to another column

I have a StringType column in a DataFrame. For each row of that particular column of the DataFrame, I want to check if it contains the keywords mentioned in another column of the DataFrame. If yes, ...
0
votes
0 answers
14 views

How does predicted probability column ordering in Pyspark work?

Relatively new user. In PySpark, how are predicted columns ordered after modeling? For instance, in a three-class situation, let's assume your target labels are 0, 1, 2 or "low", "moderate", "large". You ...
0
votes
2 answers
22 views

Update mysql rows using pyspark

Using PySpark I am updating a MySQL table; the schema has a unique key constraint on 3 fields. My Spark job will be running 3 times a day; since one of the columns that is part of the unique key is 'date', I am ...
0
votes
0 answers
8 views

Create hive table in pyspark hive context

I have 3 tables in the abc Hive database in Avro format. I want to create another database (def) and create those 3 tables via HiveContext in PySpark through DataFrames. More info: in the abc database, 3 ...
0
votes
0 answers
48 views

How to create correct output in when-otherwise? [duplicate]

I have the function get_alerts that returns two String fields. For simplicity let's consider the fixed output of this function: return "xxx", "yyy" (this is to avoid posting the code of get_alerts). ...
0
votes
0 answers
10 views

IllegalArgumentException: 'Field "label" does not exist' in PySpark

I have been developing a function for linear regression in PySpark and validating the accuracy using cross-validation, but it throws the error IllegalArgumentException: 'Field "label" does not ...
0
votes
0 answers
12 views

How to schedule PySpark jobs and workflows

I have PySpark jobs and also some Python scripts that preprocess data sets before the PySpark jobs run. Just wondering, are there any good tools for PySpark job workflows? Does Control-M and/or ...
0
votes
1 answer
10 views

How to create a table in HBase using PySpark?

I want to create a new HBase table, if it does not already exist in the namespace, from PySpark code for storing data. Can someone help me with this task?
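One possible sketch, though it sits outside Spark itself: happybase talks to the HBase Thrift server, so a driver-side check-and-create can run before the Spark write (host, table, and column-family names are placeholders):

    import happybase

    conn = happybase.Connection("hbase-thrift-host")      # placeholder host
    if b"mytable" not in conn.tables():
        conn.create_table("mytable", {"cf": dict()})      # one column family, "cf"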
0
votes
0 answers
10 views

How to read parameters from a JSON file and build a ParamGrid in PySpark?

I have a dictionary where the parameters are in string format:

    hyperparameters = {
        "random_seed": 0,
        "num_trees": [
            30,
            70,
            5 ...
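A sketch of one mapping, assuming the list-valued entries are grid points and scalars are fixed; the estimator and parameter names are illustrative, not from the question. If the values really arrive as strings, cast them with int() first:

    from pyspark.ml.regression import RandomForestRegressor
    from pyspark.ml.tuning import ParamGridBuilder

    hyperparameters = {"random_seed": 0, "num_trees": [30, 70, 5]}

    rf = RandomForestRegressor(seed=hyperparameters["random_seed"])
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, hyperparameters["num_trees"])
            .build())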
0
votes
2 answers
43 views

Window function with dynamic lag

I am looking at the window slide function for a Spark DataFrame in Spark SQL. I have a dataframe with columns id, month and volume.

    id  month   volume  new_col
    1   201601  100     0
    1   ...
0
votes
0 answers
13 views

Using org.elasticsearch.hadoop with searchguard

I am trying to connect to ES from Spark. It worked fine until Search Guard (version 6.4.1-23.1) was installed. As per the documentation https://www.elastic.co/guide/en/elasticsearch/hadoop/current/...
-1
votes
1 answer
22 views

Spark reduce function is causing “error type mismatch”

I am new to Scala/Spark and loaded my dataset into an RDD. Here is my sample data set:

    scala> flightdata.collect
    res39: Array[(String, Int)] = Array((DFW,11956), (DTW,588), (SEA,607), (JFK,1595), (...
1
vote
0 answers
19 views

Pyspark CountVectorizerModel - change inputCol name

A little background: as mentioned in the title, I would like to write a class in Python that will wrap the TF-IDF implementation of PySpark. This class is going to have: a constructor that accepts a ...
2
votes
1 answer
44 views

How to explode inner arrays in a struct inside a struct in PySpark?

I am new to Spark. I have tried exploding an array inside of a struct. The JSON layout is a bit complex, as below:

    {
      "id": 1,
      "firstfield": "abc",
      "secondfield": "zxc",
      "firststruct": {
        "...
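A sketch of the usual pattern: reach into the struct with dot notation and explode the inner array. The inner field name ("innerarray") is an assumption, since the question's JSON is truncated:

    from pyspark.sql import functions as F

    df = spark.read.json("/path/to/file.json")
    df.select("id", "firstfield", "secondfield",
              F.explode("firststruct.innerarray").alias("element")).show()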
1
vote
0 answers
23 views

Pyspark distributed matrix sum non-null values

I'm attempting to convert a pandas "dot matrix nansum" function to pyspark. The goal is to convert this table into a matrix of non-null column sums:

        dan  ste  bob
    t1  na   2    na
    t2  2    na   1
    t3  ...
0
votes
0 answers
30 views

Does Spark SQL support Hive's select-all query with excluded columns using a regex specification?

I am trying to achieve this functionality using Spark SQL via a PySpark wrapper. I have run into this error: pyspark.sql.utils.AnalysisException: u"cannot resolve '```(qtr)?+.+```' given ...
0
votes
0 answers
16 views

PySpark - Add # of Days in Column to Date Column [duplicate]

I have two columns, one containing previous maintenance dates and one containing the interval of days until the maintenance is next due. Example:

    +-----------+--------------+
    |maint_date ...
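A sketch of the common workaround for this era of Spark: the Python date_add() helper only takes a literal day count, but the SQL form accepts a column, so expr() bridges the gap (the interval column name "days_due" is a placeholder):

    from pyspark.sql import functions as F

    df = df.withColumn("next_due", F.expr("date_add(maint_date, days_due)"))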
0
votes
1 answer
38 views

Convert a list of vectors to DataFrame in PySpark

Firstly, I have loaded the data by:

    import urllib.request
    f = urllib.request.urlretrieve("https://www.dropbox.com/s/qz62t2oyllkl32s/kddcup.data_10_percent.gz?dl=1", "kddcup.data_10_percent.gz")
    ...
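A minimal sketch of the conversion itself, assuming "list of vectors" means a Python list of pyspark.ml vectors; wrapping each vector in a one-element tuple gives createDataFrame a row shape it understands (the values are illustrative):

    from pyspark.ml.linalg import Vectors

    vectors = [Vectors.dense([0.0, 1.1, 0.1]),
               Vectors.dense([2.0, 1.0, -1.0])]
    df = spark.createDataFrame([(v,) for v in vectors], ["features"])
    df.show(truncate=False)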
-1
votes
0 answers
21 views

How to convert DataFrame columns to a dict (MapType) in PySpark 1.6.3

I am new to PySpark (1.6.3); I have the below requirement but am not sure how to achieve it. A DataFrame named invoice, columns:

    | invoiceid | duedate      | dueamount |
    |-----------|--------------|------...
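A sketch for that version: functions.create_map only arrived in Spark 2.0, so on 1.6.3 a UDF returning MapType is the usual workaround (the key/value types below are assumptions):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import MapType, StringType, DoubleType

    to_map = udf(lambda duedate, dueamount: {duedate: dueamount},
                 MapType(StringType(), DoubleType()))
    df = df.withColumn("due_map", to_map(df["duedate"], df["dueamount"]))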
1
vote
1 answer
36 views

Pyspark UDF column on Dataframe

I'm trying to create a new column on a dataframe based on the values of some columns. It's returning null in all cases. Anyone know what's going wrong with this simple example? df = pd.DataFrame([[0,...
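A sketch of the most common cause of all-null UDF output: udf() defaults to returning StringType, so a function that actually returns numbers needs its return type declared (names below are illustrative):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    add_one = udf(lambda x: x + 1, IntegerType())   # declare the return type
    df = df.withColumn("new_col", add_one(df["some_col"]))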
0
votes
1 answer
14 views

How to save a spark dataframe in tableau format?

Trying to save a Spark DataFrame (Python) in .tde format. Will including these 4 JARs in the jars folder of Spark work: jna.jar, tableauextract.jar, tableaucommon.jar, tableauserver.jar? If so, how to ...
1
vote
0 answers
28 views

Reshaping an RDD from an array of arrays to unique columns in PySpark

I want to use PySpark to restructure my data so that I can use it for MLlib models. Currently, for each user I have an array of arrays in one column, and I want to convert it to unique columns with the ...
0
votes
0 answers
26 views

I have a large HQL query that I am calling using PySpark SQL, but I am getting an error like 'Bad connect ack with firstBadLink'

I know this may have been asked before, but I am asking because I am not sure whether the issue is the same or not. The thing is that I am using Spark SQL and I am first creating a table like: ...
1
vote
2 answers
38 views

Pyspark dataframe.limit is slow

I am trying to work with a large dataset, but just play around with a small part of it. Each operation takes a long time, and I want to look at the head or limit of the dataframe. So, for example, I ...
1
vote
1 answer
32 views

PySpark UDF with multiple arguments returns null

I have a PySpark Dataframe with two columns (A, B, whose type is double) whose values are either 0.0 or 1.0. I am trying to add a new column, which is the sum of those two. I followed examples in ...
1
vote
1 answer
45 views

Reference group in PySpark multinomial regression

Does anyone know what the default reference group is in a PySpark multinomial logistic regression? For instance, we have multiclass outcomes/targets of A, B, C, and D. How does Spark choose the ...
0
votes
0 answers
24 views

Pyspark sql functions not matching with mllib Statistics

I am trying to build a correlation matrix. However, when I test the results, they are not matching.

    >>> from pyspark.sql.functions import corr
    >>> df.agg(corr("A", "B")).show()
    +-...
0
votes
0 answers
14 views

How to distribute a large blob to every host in a Spark cluster

I know that broadcast variables have a limit of 2 GB and it is not recommended to broadcast huge amounts of data. What is the best way to share the same huge machine learning vector with every machine in the ...
-5
votes
0 answers
31 views

Spark RDD - How to extract multiple rows from a single column in a bulk operation via a complex regex

I have a current Java application that needs to be converted to Spark as part of a new Spark data pipeline. The Java code uses a very complex regular expression to break emails into separate records. So an ...
0
votes
0 answers
42 views

How to parallelize a function with PySpark

How can I parallelize a function that runs over different filters of a DataFrame using PySpark? For example, on this DataFrame I would like to save the second position for each country. That is, rows: ...
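Rather than launching one filtered pass per country, a single window pass usually covers this; a sketch assuming a "country" column and some ordering column (called "position" here):

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("country").orderBy("position")
    second = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 2)
                .drop("rn"))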
0
votes
1 answer
30 views

Running PySpark in Spyder (Anaconda) on Windows

I am using Windows 10 and I am familiar with testing my Python code in Spyder. However, when I try to run the "import pyspark" command, Spyder shows "No module named 'pyspark'". PySpark is ...
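A sketch of the usual workaround: findspark locates an existing Spark installation and puts it on sys.path, after which the import succeeds inside Spyder (the explicit path is a placeholder; it can also be picked up from SPARK_HOME):

    import findspark
    findspark.init()    # or findspark.init("C:/path/to/spark-2.x.x-bin-hadoop2.7")

    import pyspark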
0
votes
0 answers
26 views

Applying spark window function on subset of a dataframe

I'm trying to force Spark to apply a window function only on a specified subset of a DataFrame, while the actual window has access to rows outside this subset. Let me go through an example: I have ...